=============== html5lib Parser =============== `html5lib`_ is a Python package that implements the HTML5 parsing algorithm which is heavily influenced by current browsers and based on the `WHATWG HTML5 specification`_. .. _html5lib: http://code.google.com/p/html5lib/ .. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ .. _WHATWG HTML5 specification: http://www.whatwg.org/specs/web-apps/current-work/ lxml can benefit from the parsing capabilities of `html5lib` through the ``lxml.html.html5parser`` module. It provides a similar interface to the ``lxml.html`` module by providing ``fromstring()``, ``parse()``, ``document_fromstring()``, ``fragment_fromstring()`` and ``fragments_fromstring()`` that work like the regular html parsing functions. Differences to regular HTML parsing =================================== There are a few differences in the returned tree to the regular HTML parsing functions from ``lxml.html``. html5lib normalizes some elements and element structures to a common format. For example even if a tables does not have a `tbody` html5lib will inject one automatically: .. sourcecode:: pycon >>> from lxml.html import tostring, html5parser >>> tostring(html5parser.fromstring("
foo"))
'
``. If ``create_parent`` is true the
default parent tag (div) is used.
If a bytestring is passed and ``guess_charset`` is true the chardet
library (if installed) will guess the charset if ambiguities exist.
``fragments_fromstring(string, no_leading_text=False, parser=None)``:
Returns a list of the elements found in the fragment. The first item in
the list may be a string. If ``no_leading_text`` is true, then it will
be an error if there is leading text, and it will always be a list of
only elements.
If a bytestring is passed and ``guess_charset`` is true the chardet
library (if installed) will guess the charset if ambiguities exist.
``fromstring(string)``:
Returns ``document_fromstring`` or ``fragment_fromstring``, based
on whether the string looks like a full document, or just a
fragment.
Additionally all parsing functions accept an ``parser`` keyword argument
that can be set to a custom parser instance. To create custom parsers
you can subclass the ``HTMLParser`` and ``XHTMLParser`` from the same
module. Note that these are the parser classes provided by html5lib.
|