Andrew Shearer: HTMLFilter

HTMLFilter

Parse and modify HTML

HTMLFilter is a module for Python programs. It parses an HTML 4 document, allowing subclasses to pass through or modify text and tags as the event stream goes by, and write out a copy that will be an otherwise exact replica of the original, including whitespace and comments. Minor errors in the markup will pass right through without causing indigestion, and ASP, PHP, JSP, and other server-side code will generally survive the round trip. (The only exception can be if it’s embedded inside an HTML tag you’re actually modifying, not just passing through.)

Uses range from something as simple as adding a <meta> tag to an existing web page to something as complex as merging two HTML pages (as in ShearerSite, which intelligently merges content pages into template pages).

You can also use it to generate HTML from scratch, with HTMLFilter taking care of the attribute encoding for tags.
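To make "attribute encoding" concrete, here is a minimal sketch of the idea in current Python 3, using only the standard library's `html.escape`. It does not use HTMLFilter itself; `build_tag` and its parameters are hypothetical names chosen for illustration.

```python
from html import escape


def build_tag(name, attrs=None, boolean_attrs=()):
    """Return an HTML start tag with attribute values escaped.

    Hypothetical helper for illustration; not part of HTMLFilter.
    """
    parts = [name]
    for key, value in (attrs or {}).items():
        # quote=True also escapes double quotes, so the value is safe
        # inside a double-quoted attribute.
        parts.append('%s="%s"' % (key, escape(value, quote=True)))
    # Boolean attributes are written in minimized form, e.g. "selected".
    parts.extend(boolean_attrs)
    return "<" + " ".join(parts) + ">"


option_html = build_tag("option", {"value": '"This & that"'}, ["selected"])
print(option_html)
```

The point is that the caller supplies the plain value and never has to remember to escape `&` or `"` by hand.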

Other ways HTMLFilter has been used: as the engine for an HTTP-proxy-like CGI that fixed markup errors and updated links in another vendor’s web application, and in WebCarbon to modify a Web form to contain the user’s entered values.

Tags are parsed lazily, for efficiency in the common case where the program is only interested in passing through a tag, not reading or modifying attributes.
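As an illustration of the lazy-parsing idea (not HTMLFilter's actual implementation), the sketch below keeps the raw tag text and parses attributes only the first time they are requested. It is written for current Python 3; `LazyTag`, `ATTR_RE`, and `parse_count` are hypothetical names, and the attribute pattern is deliberately simplified.

```python
import re
from html import unescape

# Simplified pattern: name="value", name='value', name=value, or a bare name.
ATTR_RE = re.compile(
    r'''([^\s=/>]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?'''
)


class LazyTag:
    """Hypothetical tag object that defers attribute parsing."""

    def __init__(self, raw):
        self.raw = raw          # original text, e.g. '<a href="x">'
        self._attrs = None      # filled in on first access only
        self.parse_count = 0    # for demonstration: how often we parsed

    def get_html(self):
        # Pass-through case: no parsing needed, so the output is exact.
        return self.raw

    @property
    def attrs(self):
        if self._attrs is None:
            self._attrs = self._parse_attrs()
        return self._attrs

    def _parse_attrs(self):
        self.parse_count += 1
        name_match = re.match(r"<\s*[^\s/>]+", self.raw)
        rest = self.raw[name_match.end():-1]  # drop tag name and final '>'
        attrs = {}
        for m in ATTR_RE.finditer(rest):
            key = m.group(1).lower()
            value = next((g for g in m.groups()[1:] if g is not None), None)
            # A bare attribute such as "selected" gets its own name as value.
            attrs[key] = key if value is None else unescape(value)
        return attrs


tag = LazyTag('<a href="x.html?a=1&amp;b=2" class=nav>')
passed_through = tag.get_html()   # no parsing has happened yet
href = tag.attrs["href"]          # parsing happens here, once
```

A filter that passes most tags through untouched pays the parsing cost only for the few tags it actually inspects.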

Documentation

HTMLFilter is intended to be subclassed, and subclasses can output an exact replica of the original or modify specific elements or attributes.

Normally, a user would instantiate such a subclass, then call feedString(originalHTML), then call close().

The subclass would override the handleXXX methods to perform the filtering, and override collectHTML() if it wanted to store the generated data. Subclasses that only wanted to read the file and not output a modified version wouldn’t need to override collectHTML().

The handleXXX methods are overridden by subclassing the main HTMLFilter class, rather than by implementing some kind of HTMLHandler interface, so that new handleXXX methods can be added to this base class with default implementations that provide backwards compatibility. (To do: could split off an HTMLHandler class if it were used as a base class.)

Data flow through HTMLFilter methods:

    feedString(originalHTML)
        -> multiple calls to handle[Text|Tag|Script|Comment|...](tag...)
           (subclasses will override to observe or modify the HTML code)
            -> collectHTML(html)
               (subclasses can store the pieces of the final HTML code)
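The flow above can be sketched with a small stand-in class. This is not HTMLFilter: `MiniFilter` is a hypothetical, heavily simplified base class written for current Python 3, and only the method names `feedString`, `close`, `handleText`, `handleTag`, `handleComment`, and `collectHTML` are modeled on the documentation. The handler arguments here (plain strings) are an assumption; the real handlers may receive different arguments.

```python
import re


class MiniFilter:
    """Hypothetical stand-in for HTMLFilter, for illustration only."""

    # Split into comments, tags, and the text between them.
    _TOKEN_RE = re.compile(r"(<!--.*?-->|<[^>]*>)", re.DOTALL)

    def feedString(self, html):
        for piece in self._TOKEN_RE.split(html):
            if not piece:
                continue
            if piece.startswith("<!--"):
                self.handleComment(piece)
            elif piece.startswith("<"):
                self.handleTag(piece)
            else:
                self.handleText(piece)

    def close(self):
        pass

    # Default handlers pass everything through unchanged.
    def handleText(self, text):
        self.collectHTML(text)

    def handleTag(self, tag):
        self.collectHTML(tag)

    def handleComment(self, comment):
        self.collectHTML(comment)

    def collectHTML(self, html):
        pass  # base class discards output; subclasses may store it


class Collector(MiniFilter):
    """Overrides only collectHTML, so the output is an exact copy."""

    def __init__(self):
        self.pieces = []

    def collectHTML(self, html):
        self.pieces.append(html)

    def result(self):
        return "".join(self.pieces)


class UppercaseText(Collector):
    """Overrides handleText to modify text; tags and comments pass through."""

    def handleText(self, text):
        self.collectHTML(text.upper())


source = "<p>Hello <!-- note --> <b>world</b></p>\n"

copier = Collector()
copier.feedString(source)
copier.close()
exact_copy = copier.result()

shouter = UppercaseText()
shouter.feedString(source)
shouter.close()
shouted = shouter.result()
```

Because every default handler forwards to `collectHTML()`, a subclass that overrides nothing else reproduces its input, and a subclass that overrides one handler changes only that kind of event.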

HTMLFilter has partial support for server-side scripting tags (ASP, PHP, JSP): they work anywhere an HTML tag would work, but HTML tags with embedded code may not be parseable. For instance, if a tag contains ASP code inside an attribute value, subclasses can only reliably pass the whole tag through unmodified, not read or modify its attributes.

It does not support SGML short tag forms, which aren’t normally used or parsed in HTML anyway; the HTML RFC warns about this.

If a subclass doesn’t override a handleXXX method, the default implementation passes the data to collectHTML() so that the original HTML code is preserved. New handleXXX methods added in the future will therefore be backwards compatible with older subclasses, so that file filters never lose text.

HTMLFilter has been successfully tested with versions of Python ranging from 1.5.2 to 2.3. It’s Unicode-savvy; the source encoding can be set, and HTML decoding respects Unicode entities.
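For readers on current Python 3, the same two steps (decode bytes using the source encoding, then resolve character references to Unicode) can be sketched with the standard library alone. This does not use HTMLFilter; `decode_html_text` and its parameters are hypothetical names for illustration.

```python
from html import unescape


def decode_html_text(raw_bytes, source_encoding="utf-8"):
    """Decode bytes with the given encoding, then resolve HTML entities.

    Hypothetical helper for illustration; not part of HTMLFilter.
    """
    text = raw_bytes.decode(source_encoding)
    # unescape() handles named and numeric references, e.g. &amp; and &#8221;
    return unescape(text)


decoded = decode_html_text(b"caf\xe9 &#8221;quoted&#8221; &amp; more", "latin-1")
print(decoded)
```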

It is distributed under a Python license.

Example (test script for HTMLTag objects)

    >>> import HTMLFilter
    >>> tag = HTMLFilter.HTMLFilter.HTMLTag('option')
    >>> tag.getHTML()
    '<option>'
    >>> tag['value'] = '"This & that"'
    >>> tag.getHTML()
    '<option value="&quot;This &amp; that&quot;">'
    >>> tag['value']
    '"This & that"'
    >>> tag.setBooleanAttribute('selected', 1)
    >>> tag.getHTML()
    '<option value="&quot;This &amp; that&quot;" selected>'
    >>> tag['selected']
    'selected'
    >>> tag.setBooleanAttribute('Selected', 0)
    >>> tag.getHTML()
    '<option value="&quot;This &amp; that&quot;">'
    >>> print tag['selected']
    None
    >>> tag.setBooleanAttribute('selected', 1)
    >>> tag.getHTML()
    '<option value="&quot;This &amp; that&quot;" selected>'
    >>> del tag['VaLUE']
    >>> tag.getHTML()
    '<option selected>'
    >>> HTMLFilter.HTMLDecode('&#8221;') == unichr(8221)
    True
    >>> HTMLFilter.HTMLDecode('one&#8221;two&#8221;')
    u'one\u201dtwo\u201d'

Download

[View/Download HTMLFilter] Version 1.1; 36K