PYTHON:
It contains the headers for the method/functions you will write. Do not modify the function names or parameters in the template file. The functions written for this lab must not use global variables. In some cases certain built-in functions are restricted in your solutions so carefully read each question. Solutions that don’t follow these guidelines will not earn full credit, even if they produce the correct results in all cases.
Write class HeaderParser that is a subclass of the HTMLParser class. It will find and collect the contents of all the headings in an HTML file fed to it. The parser works by identifying when a header tag has been encountered and setting a boolean variable in the class to indicate that. When the data handler for the class is called and the boolean in the class indicates that a header is currently open, the data inside the header is added to a list. Finally, when a closing header tag is encountered the boolean variable is unset. To implement this parser you will need to override the following methods of the HTMLParser class:
__init__: The constructor calls the parent class constructor, set the boolean variable in the object appropriately, and sets the list of headings to the empty list.
handle_starttag: If the tag that resulted in this method being called is a header, the header indicator should be set.
handle_endtag: If the tag that resulted in this method being called is a header, the header indicator should be unset.
handle_data: If the parser is currently inside a header, then the data should be added to the list of headers contents. Make sure that you strip any leading or trailing spaces or newlines off the contents of the header before adding it to the list.
getHeadings: The function returns the list of headings gathered by the parser.
You can find a template for the class and a test function testHParser() in the csc242lab7.py file. The following shows what that test function would display on several sample web pages. Note that your solution must work on any page, not just the ones provided here. Think carefully about what it means to collect headings in a general context:
Output:
csc242lab7.py
from html.parser import HTMLParser
from urllib.request import urlopen
class HeaderParser(HTMLParser):
”’subclass of imported HTMLParser class. Finds and collects contents of
headings in an HTML file that is fed to it. Identifies when a heading tag
has been encountered and sets a boolean variable to indicate it. When data
handler is called and the boolean in the calss indicates the header is currently
open, the data inside the header is added to a list. Finally, when a closing
header tag is encountered, the boolean variable is unset. Override the
following methods”’
def __init__(self):
”’The constructor calls the parent class constructor, sets the boolean
variable in the object appropriately, and sets the empty list”’
pass
def handle_starttag(self, tag, attrs):
”’If the tag that resulted in this method being called is a header,
the header indicator should be set”’
pass
def handle_endtag(self, tag):
”’If the tag that resulted in this method being called is a header,
the header indicator should be unset”’
pass
def handle_data(self, data):
”’If the parser is currently inside a header, then the data should be
added to the list. Strips any leading or trailing white space.”’
pass
def getHeadings(self):
”’The function returns the list of headings gathered by the parser”’
pass
def testHParser(url):
”’opens the url, reads it, and converts it to a string saved in a variable.
Creats an object of HeaderParser class and feeds it the string. Returns
the the getHeadings method”’
pass
Expert Answer
from HTMLParser import HTMLParser
#to import library for parsingurl
import urllib2
class HeadingParser(HTMLParser):
#class for parsing
def __init__(self):
#initialise
HTMLParser.__init__(self)
#boolean variable
self.h_I=False
#contents
self.c_l=[]
#for start tag handling
def handle_starttag(self, tag,attrs):
#check if the list is set to true
if tag==”h1″ or tag==”h2″ or tag==”h3″ or tag==”h4″ or tag==”h5″ or tag==”h6″:
self.h_I=True
#to handle data
def handle_data(self,data):
#check if the content heading is set to true
if self.h_I==True:
self.c_l.append(data)
#for handling the tag at end
def handle_endtag(self,tag):
#check if it is set to false
if tag==”h1″ or tag==”h2″ or tag==”h3″ or tag==”h4″ or tag==”h5″ or tag==”h6″:
self.h_I=False
#to return
def getheadings(self):
return self.c_l
#to test the function
def testHParser(url):
h_p=HeadingParser()
r_q=urllib2.Request(url)
ans=urllib2.urlopen(r_q)
hml=ans.read();
h_p.feed(hml)
print h_p.getheadings()
h_p.close()