htmllib

The classes in this module are derived from the class SGMLParser defined in module sgmllib.
The following is a summary of the interface defined by
sgmllib.SGMLParser:
Data is fed to a parser instance through the feed()
method, which takes a string argument. This can be called with as
little or as much text at a time as desired;
p.feed(a); p.feed(b) has the same effect as p.feed(a+b).
When the data contains complete
HTML elements, these are processed immediately; incomplete elements
are saved in a buffer. To force processing of all unprocessed data,
call the close() method.
Example: to parse the entire contents of a file, do
parser.feed(open(file).read()); parser.close().
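
The chunking behavior described above can be sketched as follows. Note the assumptions: sgmllib is not available in current Python versions, so this sketch substitutes the standard library's html.parser.HTMLParser, whose feed() and close() methods buffer incomplete elements in a comparable way. The EventLogger class and the parse_chunks() helper are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser  # stand-in: sgmllib is absent from Python 3


class EventLogger(HTMLParser):
    """Hypothetical helper: records the tags and the text that were parsed."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        self.text.append(data)


def parse_chunks(chunks):
    """Feed each chunk in turn, then close() to flush any buffered data."""
    parser = EventLogger()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.tags, "".join(parser.text)


# The split falls in the middle of the closing </h1> tag, so the first
# feed() call has to keep the incomplete "</h" in its buffer.
a = "<h1>Title</h"
b = "1><p>Done"

in_two_calls = parse_chunks([a, b])
in_one_call = parse_chunks([a + b])
```

Both calls should yield the same tags and the same joined text, which is the property stated for p.feed(a); p.feed(b) versus p.feed(a+b).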
To define semantics for HTML tags, derive a class and define methods
called start_tag(), end_tag(), or do_tag(), where tag stands for
the tag name. The parser will call these at appropriate moments:
start_tag or do_tag is called when an opening tag of the form
<tag ...> is encountered; end_tag is called when a closing tag of
the form </tag> is encountered. If an opening tag requires a
corresponding closing tag, like <H1> ... </H1>, the class should
define the start_tag method; if a tag requires no closing tag, like
<P>, the class should define the do_tag method.
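
The naming convention can be illustrated with a small dispatcher. This is a minimal sketch, not the sgmllib implementation: MiniParser and Outline are hypothetical classes, the regular expression handles only simple tags, and the handlers take no arguments (a simplification made for this example).

```python
import re

# Matches simple opening and closing tags; a deliberate simplification.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


class MiniParser:
    """Hypothetical toy parser that dispatches tags to methods by name."""

    def feed(self, text):
        pos = 0
        for match in TAG_RE.finditer(text):
            if match.start() > pos:
                self.handle_data(text[pos:match.start()])
            closing, name = match.group(1), match.group(2).lower()
            if closing:
                handler = getattr(self, "end_" + name, None)
            else:
                # Prefer start_<tag>; fall back to do_<tag>.
                handler = (getattr(self, "start_" + name, None)
                           or getattr(self, "do_" + name, None))
            if handler is not None:
                handler()
            pos = match.end()
        if pos < len(text):
            self.handle_data(text[pos:])

    def handle_data(self, data):
        pass  # subclasses may override


class Outline(MiniParser):
    """<H1> needs a closing tag (start_/end_); <P> does not (do_)."""

    def __init__(self):
        self.events = []

    def start_h1(self):
        self.events.append("open h1")

    def end_h1(self):
        self.events.append("close h1")

    def do_p(self):
        self.events.append("paragraph break")

    def handle_data(self, data):
        self.events.append("text: " + data)


outline = Outline()
outline.feed("<H1>Intro</H1><P>Some text")
# outline.events is now:
# ['open h1', 'text: Intro', 'close h1', 'paragraph break', 'text: Some text']
```

Looking handlers up by name keeps the parser itself free of tag-specific code; supporting a new tag only requires adding a method to the derived class.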
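The HTMLParser class is the most basic parser class. It defines one
additional entity name over those defined by the SGMLParser base
class, &bullet;. It also defines handlers for the following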
tags: <LISTING>...</LISTING>, <XMP>...</XMP>, and
<PLAINTEXT> (the latter is terminated only by end of file).
The CollectingParser class, derived from HTMLParser, collects various useful
bits of information from the HTML text. To this end it defines
additional handlers for the following tags: <A>...</A>,
<HEAD>...</HEAD>, <BODY>...</BODY>,
<TITLE>...</TITLE>, <NEXTID>, and <ISINDEX>.
The FormattingParser class, derived from CollectingParser, interprets a wide
selection of HTML tags so it can produce formatted output from the
parsed data. It is initialized with two objects, a formatter
which should define a number of methods to format text into
paragraphs, and a stylesheet which defines a number of static
parameters for the formatting process. Formatters and style sheets
are documented later in this section.
The AnchoringParser class, derived from FormattingParser, extends the handling
of the <A>...</A> tag pair to call the formatter's
bgn_anchor() and end_anchor() methods. This allows the
formatter to display the anchor in a different font or color, etc.
Instances of CollectingParser (and thus also instances of
FormattingParser and AnchoringParser) have the following
instance variables:
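anchornames: a list of the values of the NAME attributes of the <A>
tags encountered.
anchors: a list of the values of the HREF attributes of the <A> tags
encountered.
anchortypes: a list of the values of the TYPE attributes of the <A>
tags encountered.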
inanchor: outside an <A>...</A> tag pair, this is zero. Inside such a
pair, it is a unique integer, which is positive if the anchor has a
HREF attribute, negative if it hasn't. Its absolute value is
one more than the index of the anchor in the anchors,
anchornames and anchortypes lists.
A flag that is true if an <ISINDEX> tag has been encountered.
The attribute list of the <NEXTID> tag encountered, or
an empty list if none.
The text inside the <TITLE>...</TITLE> tag pair, or
'' if no title has been encountered yet.
The anchors, anchornames and anchortypes lists
are ``parallel arrays'': items in these lists with the same index
pertain to the same anchor. Missing attributes default to the empty
string. Anchors with neither a HREF nor a NAME
attribute are not entered in these lists at all.
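
The bookkeeping for the parallel lists and for inanchor can be sketched as follows. This is an illustration under assumptions, not the htmllib source: AnchorCollector is a hypothetical class built on html.parser.HTMLParser (used because sgmllib is unavailable in current Python), the inanchor_log list exists only to make the values observable, and leaving inanchor at zero for an anchor with neither HREF nor NAME is a choice made for this sketch.

```python
from html.parser import HTMLParser  # stand-in: sgmllib is absent from Python 3


class AnchorCollector(HTMLParser):
    """Hypothetical sketch of the anchor bookkeeping described above."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # HREF values
        self.anchornames = []  # NAME values
        self.anchortypes = []  # TYPE values
        self.inanchor = 0      # zero outside an <A>...</A> pair
        self.inanchor_log = []  # illustration only: values seen on entry

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""  # missing attributes default to ''
        name = attrs.get("name") or ""
        if not href and not name:
            return  # such anchors are not entered in the lists
        self.anchors.append(href)
        self.anchornames.append(name)
        self.anchortypes.append(attrs.get("type") or "")
        number = len(self.anchors)  # one more than the list index
        self.inanchor = number if href else -number
        self.inanchor_log.append(self.inanchor)

    def handle_endtag(self, tag):
        if tag == "a":
            self.inanchor = 0


collector = AnchorCollector()
collector.feed('<A HREF="a.html">one</A> <A NAME="top">two</A> '
               '<A>three</A> '
               '<A HREF="b.html" TYPE="text/html" NAME="b">four</A>')
collector.close()
# collector.anchors      -> ['a.html', '', 'b.html']
# collector.anchornames  -> ['', 'top', 'b']
# collector.anchortypes  -> ['', '', 'text/html']
# collector.inanchor_log -> [1, -2, 3]
```

Items at the same index in the three lists describe the same anchor, and the absolute value of each logged inanchor value is that index plus one.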
The module also defines a number of style sheet classes. These should
never be instantiated; their class variables are the only behavior
required. Note that style sheets are specifically designed for a
particular formatter implementation. The currently defined style
sheets are:
A style sheet for use with the stdwin module; it is an alias
for either X11Stylesheet or MacStylesheet.
gl and fm).
setfont() method.
<H1>...</H1>
tag pairs etc.).
<DD> tags.
<UL> tags.
<PRE>...</PRE> and similar tag pairs).
The FormattingParser class assumes that formatters have a
certain interface. This interface requires the following methods:
flush().
addword. It should be set to false after a non-empty word has
been added.
'c' (center), 'l' (left
justified), 'r' (right justified) or 'lr' (left and
right justified).
inanchor attribute.
inanchor attribute.
fmt, which in turn uses the module Para. These modules are
not intended as standard library modules; they are available as an
example of how to write a formatter.