====================
BeautifulSoup Parser
====================

BeautifulSoup_ is a Python package that parses broken HTML, just as
lxml does based on the parser of libxml2.  BeautifulSoup uses a
different parsing approach.  It is not a real HTML parser but uses
regular expressions to dive through tag soup.  It is therefore more
forgiving in some cases and less good in others.  It is not uncommon
that lxml/libxml2 parses and fixes broken HTML better, but
BeautifulSoup has superior `support for encoding detection`_.  Which
parser works better very much depends on the input.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode%2C%20Dammit
.. _ElementSoup: http://effbot.org/zone/element-soup.htm

To prevent users from having to choose their parser library in
advance, lxml can interface to the parsing capabilities of
BeautifulSoup through the ``lxml.html.soupparser`` module.  It
provides three main functions: ``fromstring()`` and ``parse()`` to
parse a string or file using BeautifulSoup into an ``lxml.html``
document, and ``convert_tree()`` to convert an existing BeautifulSoup
tree into a list of top-level Elements.

.. contents::
..
   1  Parsing with the soupparser
   2  Entity handling
   3  Using soupparser as a fallback
   4  Using only the encoding detection


Parsing with the soupparser
===========================

The functions ``fromstring()`` and ``parse()`` behave as known from
ElementTree.  The first returns a root Element, the second an
ElementTree.

There is also a legacy module called ``lxml.html.ElementSoup``, which
mimics the interface provided by ElementTree's own ElementSoup_
module.  Note that the ``soupparser`` module was added in lxml 2.0.3.
Previous versions of lxml 2.0.x only have the ``ElementSoup`` module.

Here is a document full of tag soup, similar to, but not quite like,
HTML:

.. sourcecode:: pycon

    >>> tag_soup = '<meta><head><title>Hello</head><body onload=crash()>Hi all<p>'
All you need to do is pass it to the ``fromstring()`` function:

.. sourcecode:: pycon

    >>> from lxml.html.soupparser import fromstring
    >>> root = fromstring(tag_soup)

To see what we have here, you can serialise it:

.. sourcecode:: pycon

    >>> from lxml.etree import tostring
    >>> print tostring(root, pretty_print=True),
    <html>
      <meta/>
      <head>
        <title>Hello</title>
      </head>
      <body onload="crash()">Hi all<p/></body>
    </html>
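
The third function, ``convert_tree()``, is for the case where you
already have a BeautifulSoup tree in hand and want lxml Elements out
of it.  Here is a minimal sketch, assuming the original
``BeautifulSoup`` package (version 3) is installed and importable
under that name; exactly how the soup is split into top-level nodes
is up to BeautifulSoup:

.. sourcecode:: pycon

    >>> from BeautifulSoup import BeautifulSoup
    >>> from lxml.html.soupparser import convert_tree

    >>> soup = BeautifulSoup(tag_soup)
    >>> roots = convert_tree(soup)    # a list of top-level Elements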

Entity handling
===============

By default, the BeautifulSoup parser also replaces the entities it
finds by their character equivalent.

.. sourcecode:: pycon

    >>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;</body>'
    >>> body = fromstring(tag_soup).find('.//body')
    >>> body.text
    u'\xa9\u20ac-\xf5\u01bd'

If you want them back on the way out, you can just serialise with the
default encoding, which is 'US-ASCII'.

.. sourcecode:: pycon

    >>> tostring(body)
    '<body>&#169;&#8364;-&#245;&#445;</body>'

    >>> tostring(body, method="html")
    '<body>&#169;&#8364;-&#245;&#445;</body>'

Any other encoding will output the respective byte sequences.

.. sourcecode:: pycon

    >>> tostring(body, encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd</body>'

    >>> tostring(body, method="html", encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd</body>'

    >>> tostring(body, encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd</body>'

    >>> tostring(body, method="html", encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd</body>'


Using soupparser as a fallback
==============================

The downside of using this parser is that it is `much slower`_ than
the HTML parser of lxml.  So if performance matters, you might want
to consider using ``soupparser`` only as a fallback for certain
cases.

.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

One common problem of lxml's parser is that it might not get the
encoding right in cases where the document contains a ``<meta>`` tag
at the wrong place.  In this case, you can exploit the fact that lxml
serialises much faster than most other HTML libraries for Python.
Just serialise the document to unicode, and if that gives you an
exception, re-parse it with BeautifulSoup to see if that works
better.

.. sourcecode:: pycon

    >>> tag_soup = '''\
    ... <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    ... <head><title>Hello W\xc3\xb6rld!</title></head>
    ... <body>Hi all</body>
    ... '''
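
Put together, the fallback could look like the following minimal
sketch.  It assumes, as described above, that the failed
serialisation surfaces as a ``UnicodeDecodeError``; depending on the
input, you may want to catch a broader set of exceptions:

.. sourcecode:: pycon

    >>> import lxml.html
    >>> import lxml.html.soupparser

    >>> root = lxml.html.fromstring(tag_soup)
    >>> try:
    ...     # serialising to unicode fails if the encoding was wrong
    ...     ignore = tostring(root, encoding=unicode)
    ... except UnicodeDecodeError:
    ...     # let BeautifulSoup's encoding detection have a go instead
    ...     root = lxml.html.soupparser.fromstring(tag_soup)

This keeps the fast lxml parser on the common path and only pays the
cost of BeautifulSoup for the documents that actually need it.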