====================================
Implementing XML languages with lxml
====================================

Dr. Stefan Behnel
-----------------

.. class:: center

  http://codespeak.net/lxml/

  lxml-dev@codespeak.net

  .. image:: tagpython.png

.. footer:: Dr. Stefan Behnel, EuroPython 2008, Vilnius/Lietuva

.. include::

What is an »XML language«?
==========================

* a language in XML notation

* aka »XML dialect«

  * except that it's not a dialect

* Examples:

  * XML Schema

  * Atom/RSS

  * (X)HTML

  * Open Document Format

  * SOAP

  * ... add your own one here

Popular mistakes to avoid (1)
=============================

"That's easy, I can use regular expressions!"

.. class:: incremental center

  No, you can't.

Popular mistakes to avoid (2)
=============================

"This is tree data, I'll take the DOM!"

Popular mistakes to avoid (2)
=============================

"This is tree data, I'll take the DOM!"

* DOM is ubiquitous, but it's as complicated as Java

* uglify your application with tons of DOM code to

  * walk over non-element nodes to find the data you need

  * convert text content to other data types

  * modify the XML tree in memory

=> write verbose, redundant, hard-to-maintain code

Popular mistakes to avoid (3)
=============================

"SAX is *so* fast and consumes *no* memory!"

Popular mistakes to avoid (3)
=============================

"SAX is *so* fast and consumes *no* memory!"

* but *writing* SAX code is *not* fast!

* write error-prone, state-keeping SAX code to

  * figure out where you are

  * find the sections you need

  * convert text content to other data types

  * copy the XML data into custom data classes

  * ... and don't forget the way back into XML!

=> write confusing state-machine code

=> debugging into existence

Working with XML
================

**Getting XML work done**

(instead of getting time wasted)

How can you work with XML?
==========================

* Preparation:

  * Implement usable data classes as an abstraction layer

  * Implement a mapping from XML to the data classes

  * Implement a mapping from the data classes to XML

* Workflow:

  * parse XML data

  * map XML data to data classes

  * work with data classes

  * map data classes to XML

  * serialise XML

.. class:: incremental

  * Approach:

    * get rid of XML and do everything in your own code

What if you could simplify this?
================================

* Preparation:

  * Extend usable XML API classes into an abstraction layer

* Workflow:

  * parse XML data into XML API classes

  * work with XML API classes

  * serialise XML

.. class:: incremental

  * Approach:

    * cover only the quirks of XML and make it work *for* you

What if you could simplify this ...
===================================

* ... without sacrificing usability or flexibility?

* ... using a high-speed, full-featured, pythonic XML toolkit?

* ... with the power of XPath, XSLT and XML validation?

.. class:: incremental center

  \... then »lxml« is your friend!

Overview
========

* What is lxml?

  * what & who

* How do you use it?

  * Lesson 0: quick API overview

    * ElementTree concepts and lxml features

  * Lesson 1: parse XML

    * how to get XML data into memory

  * Lesson 2: generate XML

    * how to write an XML generator for a language

  * Lesson 3: working with XML trees made easy

    * how to write an XML API for a language

What is lxml?
=============

* a fast, full-featured toolkit for XML and HTML handling

  * http://codespeak.net/lxml/

  * lxml-dev@codespeak.net

* based on and inspired by

  * the C libraries libxml2 and libxslt (by Daniel Veillard)

  * the ElementTree API (by Fredrik Lundh)

  * the Cython compiler (by Robert Bradshaw, Greg Ewing & me)

  * the Python language (by Guido & [*paste Misc/ACKS here*])

  * user feedback, ideas and patches (by you!)

    * keep doing that, we love you all!

* maintained and (in major parts) written by myself

  * initial design and implementation by Martijn Faassen

  * extensive HTML API and tools by Ian Bicking

What do you get for your money?
===============================

* many tools in one:

  * Generic, ElementTree compatible XML API: **lxml.etree**

    * but faster for many tasks and much more feature-rich

  * Special tool set for HTML handling: **lxml.html**

  * Special API for pythonic data binding: **lxml.objectify**

  * General purpose path languages: XPath and CSS selectors

  * Validation: DTD, XML Schema, RelaxNG, Schematron

  * XSLT, XInclude, C14N, ...

  * Fast tree iteration, event-driven parsing, ...

* it's free, but it's worth every €-Cent!

* what users say:

  * »no qualification, I would recommend lxml for just about any HTML task«

  * »THE tool [...] for newbies and experienced developers«

  * »you can do pretty much anything with an intuitive API«

  * »lxml takes all the pain out of XML«

Lesson 0: a quick overview
==========================

why **»lxml takes all the pain out of XML«**

(a quick overview of lxml features and ElementTree concepts)

..
  >>> from lxml import etree, cssselect, html
  >>> some_xml_data = "<root><speech class='dialog'><p>So be it!</p></speech><p>stuff</p></root>"
  >>> some_html_data = "<p>Just a quick note<br>next line</p>"
  >>> xml_tree = etree.XML(some_xml_data)
  >>> html_tree = html.fragment_fromstring(some_html_data)

Namespaces in ElementTree
=========================

* uses Clark notation:

  * wrap namespace URI in ``{...}``

  * append the tag name

  .. sourcecode:: pycon

    >>> tag = "{http://www.w3.org/the/namespace}tagname"
    >>> element = etree.Element(tag)

* no prefixes!

* a single, self-contained tag identifier

Text content in ElementTree
===========================

* uses ``.text`` and ``.tail`` attributes:

  .. sourcecode:: pycon

    >>> div = html.fragment_fromstring(
    ...     "<div><p>a paragraph<br>split in two</p> parts</div>")
    >>> p = div[0]
    >>> br = p[0]

    >>> p.text
    'a paragraph'
    >>> br.text
    >>> br.tail
    'split in two'
    >>> p.tail
    ' parts'

* no text nodes!

  * simplifies tree traversal a lot

  * simplifies many XML algorithms

Attributes in ElementTree
=========================

* uses ``.get()`` and ``.set()`` methods:

  .. sourcecode:: pycon

    >>> root = etree.fromstring(
    ...     '<root a="the value" b="of an" c="attribute"/>')

    >>> root.get('a')
    'the value'

    >>> root.set('a', "THE value")
    >>> root.get('a')
    'THE value'

* or the ``.attrib`` dictionary property:

  .. sourcecode:: pycon

    >>> d = root.attrib

    >>> list(sorted(d.keys()))
    ['a', 'b', 'c']
    >>> list(sorted(d.values()))
    ['THE value', 'attribute', 'of an']

Tree iteration in lxml.etree (1)
================================

..
  >>> import collections

.. sourcecode:: pycon

  >>> root = etree.fromstring(
  ...   "<root><a><b/><b/></a><c><d/><e><f/></e><g/></c></root>")

  >>> print([child.tag for child in root])   # children
  ['a', 'c']

  >>> print([el.tag for el in root.iter()])  # self and descendants
  ['root', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']

  >>> print([el.tag for el in root.iterdescendants()])
  ['a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']

  >>> def iter_breadth_first(root):
  ...     bfs_queue = collections.deque([root])
  ...     while bfs_queue:
  ...         el = bfs_queue.popleft()  # pop next element
  ...         bfs_queue.extend(el)      # append its children
  ...         yield el

  >>> print([el.tag for el in iter_breadth_first(root)])
  ['root', 'a', 'c', 'b', 'b', 'd', 'e', 'g', 'f']

Tree iteration in lxml.etree (2)
================================

.. sourcecode:: pycon

  >>> root = etree.fromstring(
  ...   "<root><a><b/><b/></a><c><d/><e><f/></e><g/></c></root>")

  >>> tree_walker = etree.iterwalk(root, events=('start', 'end'))

  >>> for (event, element) in tree_walker:
  ...     print("%s (%s)" % (element.tag, event))
  root (start)
  a (start)
  b (start)
  b (end)
  b (start)
  b (end)
  a (end)
  c (start)
  d (start)
  d (end)
  e (start)
  f (start)
  f (end)
  e (end)
  g (start)
  g (end)
  c (end)
  root (end)

Path languages in lxml
======================

.. sourcecode:: xml

  <root>
    <speech class='dialog'><p>So be it!</p></speech>
    <p>stuff</p>
  </root>

* search it with XPath

  .. sourcecode:: pycon

    >>> find_paragraphs = etree.XPath("//p")
    >>> paragraphs = find_paragraphs(xml_tree)

    >>> print([ p.text for p in paragraphs ])
    ['So be it!', 'stuff']

* search it with CSS selectors

  .. sourcecode:: pycon

    >>> find_dialogs = cssselect.CSSSelector("speech.dialog p")
    >>> paragraphs = find_dialogs(xml_tree)

    >>> print([ p.text for p in paragraphs ])
    ['So be it!']

Summary of lesson 0
===================

* lxml comes with various tools

  * that aim to hide the quirks of XML

  * that simplify finding and handling data

  * that make XML a pythonic tool by itself

Lesson 1: parsing XML/HTML
==========================

**The input side**

(a quick overview)

Parsing XML and HTML from ...
=============================

* strings: ``fromstring(xml_data)``

  * byte strings, but also unicode strings

* filenames: ``parse(filename)``

* HTTP/FTP URLs: ``parse(url)``

* file objects: ``parse(f)``

  * ``f = open(filename, 'rb')`` !

* file-like objects: ``parse(f)``

  * only need a ``f.read(size)`` method

* data chunks: ``parser.feed(xml_chunk)``

  * ``result = parser.close()``

.. class:: small right

  (parsing from strings and filenames/URLs frees the GIL)

Example: parsing from a string
==============================

* using the ``fromstring()`` function:

  .. sourcecode:: pycon

    >>> root_element = etree.fromstring(some_xml_data)

* using the ``fromstring()`` function with a specific parser:

  .. sourcecode:: pycon

    >>> parser = etree.HTMLParser(remove_comments=True)
    >>> root_element = etree.fromstring(some_html_data, parser)

* or the ``XML()`` and ``HTML()`` aliases for literals in code:

  .. sourcecode:: pycon

    >>> root_element = etree.XML("<root><child/></root>")
    >>> root_element = etree.HTML("<p>some<br>paragraph</p>")

Parsing XML into ...
====================

* a tree in memory

  * ``parse()`` and ``fromstring()`` functions

* a tree in memory, but step-by-step with a generator

  * ``iterparse()`` generates ``(start/end, element)`` events

  * tree can be cleaned up to save space

* SAX-like callbacks without building a tree

  * ``parse()`` and ``fromstring()`` functions

  * pass a ``target`` object into the parser

Summary of lesson 1
===================

* parsing XML/HTML in lxml is mostly straightforward

  * simple functions that do the job

* advanced use cases are pretty simple

  * event-driven parsing using ``iterparse()``

  * special parser configuration with keyword arguments

    * configuration is generally local to a parser

* BTW: parsing is *very* fast, as is serialising

  * don't hesitate to do parse-serialise-parse cycles

Lesson 2: generating XML
========================

**The output side**

(and how to make it safe and simple)

The example language: Atom
==========================

The Atom XML format

* Namespace: http://www.w3.org/2005/Atom

* IETF standard (RFC 4287) derived from RSS and friends

* Atom feeds describe news entries and annotated links

  * a ``feed`` contains one or more ``entry`` elements

  * an ``entry`` contains ``author``, ``link``, ``summary`` and/or ``content``

Example: generate XML (1)
=========================

The ElementMaker (or *E-factory*)

.. sourcecode:: pycon

  >>> from lxml.builder import ElementMaker
  >>> A = ElementMaker(namespace="http://www.w3.org/2005/Atom",
  ...                  nsmap={None : "http://www.w3.org/2005/Atom"})

.. class:: incremental

  .. sourcecode:: pycon

    >>> atom = A.feed(
    ...   A.author( A.name("Stefan Behnel") ),
    ...   A.entry(
    ...     A.title("News from lxml"),
    ...     A.link(href="http://codespeak.net/lxml/"),
    ...     A.summary("See what's <b>fun</b> about lxml...",
    ...               type="html"),
    ...   )
    ... )

  .. sourcecode:: pycon

    >>> from lxml.etree import tostring
    >>> print( tostring(atom, pretty_print=True) )

Example: generate XML (2)
=========================

.. sourcecode:: pycon

  >>> atom = A.feed(
  ...   A.author( A.name("Stefan Behnel") ),
  ...   A.entry(
  ...     A.title("News from lxml"),
  ...     A.link(href="http://codespeak.net/lxml/"),
  ...     A.summary("See what's <b>fun</b> about lxml...",
  ...               type="html"),
  ...   )
  ... )

.. sourcecode:: xml

  <feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <name>Stefan Behnel</name>
    </author>
    <entry>
      <title>News from lxml</title>
      <link href="http://codespeak.net/lxml/"/>
      <summary type="html">See what's &lt;b&gt;fun&lt;/b&gt;
                           about lxml...</summary>
    </entry>
  </feed>

Be careful what you type!
=========================

.. sourcecode:: pycon

  >>> atom = A.feed(
  ...   A.author( A.name("Stefan Behnel") ),
  ...   A.entry(
  ...     A.titel("News from lxml"),
  ...     A.link(href="http://codespeak.net/lxml/"),
  ...     A.summary("See what's <b>fun</b> about lxml...",
  ...               type="html"),
  ...   )
  ... )

.. sourcecode:: xml

  <feed xmlns="http://www.w3.org/2005/Atom">
    <author>
      <name>Stefan Behnel</name>
    </author>
    <entry>
      <titel>News from lxml</titel>
      <link href="http://codespeak.net/lxml/"/>
      <summary type="html">See what's &lt;b&gt;fun&lt;/b&gt;
                           about lxml...</summary>
    </entry>
  </feed>

Want more 'type safety'?
========================

Write an XML generator *module* instead:

.. sourcecode:: python

  # atomgen.py

  from lxml import etree
  from lxml.builder import ElementMaker

  ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"

  A = ElementMaker(namespace=ATOM_NAMESPACE,
                   nsmap={None : ATOM_NAMESPACE})

  feed = A.feed
  entry = A.entry
  title = A.title
  # ... and so on and so forth ...

  # plus a little validation function: isvalid()
  isvalid = etree.RelaxNG(file="atom.rng")

The Atom generator module
=========================

..
  >>> import sys
  >>> sys.path.insert(0, "ep2008")

.. sourcecode:: pycon

  >>> import atomgen as A

  >>> atom = A.feed(
  ...   A.author( A.name("Stefan Behnel") ),
  ...   A.entry(
  ...     A.link(href="http://codespeak.net/lxml/"),
  ...     A.title("News from lxml"),
  ...     A.summary("See what's <b>fun</b> about lxml...",
  ...               type="html"),
  ...   )
  ... )

  >>> A.isvalid(atom) # ok, forgot the IDs => invalid XML ...
  False

  >>> title = A.titel("News from lxml")
  Traceback (most recent call last):
  ...
  AttributeError: 'module' object has no attribute 'titel'

Mixing languages (1)
====================

Atom can embed *serialised* HTML

.. sourcecode:: pycon

  >>> import lxml.html.builder as h

  >>> html_fragment = h.DIV(
  ...   "this is some\n",
  ...   h.A("HTML", href="http://w3.org/MarkUp/"),
  ...   "\ncontent")

.. class:: incremental

  .. sourcecode:: pycon

    >>> serialised_html = etree.tostring(html_fragment, method="html")

    >>> summary = A.summary(serialised_html, type="html")

  .. sourcecode:: pycon

    >>> print(etree.tostring(summary))
    <summary xmlns="http://www.w3.org/2005/Atom" type="html">&lt;div&gt;this is some
    &lt;a href="http://w3.org/MarkUp/"&gt;HTML&lt;/a&gt;
    content&lt;/div&gt;</summary>

Mixing languages (2)
====================

Atom can also embed non-escaped XHTML

.. sourcecode:: pycon

  >>> from copy import deepcopy
  >>> xhtml_fragment = deepcopy(html_fragment)

  >>> from lxml.html import html_to_xhtml
  >>> html_to_xhtml(xhtml_fragment)

  >>> summary = A.summary(xhtml_fragment, type="xhtml")

.. class:: incremental

  .. sourcecode:: pycon

    >>> print(etree.tostring(summary, pretty_print=True))
    <summary xmlns="http://www.w3.org/2005/Atom" type="xhtml">
      <html:div xmlns:html="http://www.w3.org/1999/xhtml">this is some
      <html:a href="http://w3.org/MarkUp/">HTML</html:a>
      content</html:div>
    </summary>

Summary of lesson 2
===================

* generating XML is easy

  * use the ElementMaker

* wrap it in a module that provides

  * the target namespace

  * an ElementMaker name for each language element

  * a validator

  * maybe additional helper functions

* mixing languages is easy

  * define a generator module for each

\... this is all you need for the *output* side of XML languages

Lesson 3: Designing XML APIs
============================

**The Element API**

(and how to make it the way *you* want)

Trees in C and in Python
========================

* Trees have two representations:

  * a plain, complete, low-level C tree provided by libxml2

  * a set of Python Element proxies, each representing one element

* Proxies are created on-the-fly:

  * lxml creates an Element object for a C node on request

  * proxies are garbage collected when going out of scope

  * XML trees are garbage collected when deleting the last proxy

.. class:: center

  .. image:: ep2008/proxies.png

Mapping Python classes to nodes
===============================

* Proxies can be assigned to XML nodes *by user code*

  * lxml tells you about a node, you return a class

Example: a simple Element class (1)
===================================

* define a subclass of ElementBase

  .. sourcecode:: pycon

    >>> class HonkElement(etree.ElementBase):
    ...     @property
    ...     def honking(self):
    ...         return self.get('honking') == 'true'

* let it replace the default Element class

  .. sourcecode:: pycon

    >>> lookup = etree.ElementDefaultClassLookup(
    ...                             element=HonkElement)

    >>> parser = etree.XMLParser()
    >>> parser.set_element_class_lookup(lookup)

Example: a simple Element class (2)
===================================

* use the new Element class

  .. sourcecode:: pycon

    >>> root = etree.XML('<root><honk honking="true"/></root>',
    ...                  parser)

    >>> root.honking
    False
    >>> root[0].honking
    True

Mapping Python classes to nodes
===============================

* The Element class lookup

  * lxml tells you about a node, you return a class

  * no restrictions on lookup algorithm

  * each parser can use a different class lookup scheme

  * lookup schemes can be chained through fallbacks

* Classes can be selected based on

  * the node type (element, comment or processing instruction)

    * ``ElementDefaultClassLookup()``

  * the namespaced node name

    * ``CustomElementClassLookup()`` + a fallback

    * ``ElementNamespaceClassLookup()`` + a fallback

  * the value of an attribute (e.g. ``id`` or ``class``)

    * ``AttributeBasedElementClassLookup()`` + a fallback

  * read-only inspection of the tree

    * ``PythonElementClassLookup()`` + a fallback

Designing an Atom API
=====================

* a feed is a container for entries

  .. sourcecode:: python

    # atom.py

    from lxml import etree

    ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
    _ATOM_NS = "{%s}" % ATOM_NAMESPACE

    class FeedElement(etree.ElementBase):
        @property
        def entries(self):
            return self.findall(_ATOM_NS + "entry")

* it also has a couple of meta-data children, e.g. ``title``

  .. sourcecode:: python

    class FeedElement(etree.ElementBase):
        # ...
        @property
        def title(self):
            "return the title or None"
            # children live in the Atom namespace, so qualify the tag
            return self.find(_ATOM_NS + "title")

Consider lxml.objectify
=======================

* ready-to-use, generic Python object API for XML

.. sourcecode:: pycon

  >>> from lxml import objectify

  >>> feed = objectify.parse("atom-example.xml").getroot()
  >>> print(feed.title)
  Example Feed

  >>> print([entry.title for entry in feed.entry])
  ['Atom-Powered Robots Run Amok']

  >>> print(feed.entry[0].title)
  Atom-Powered Robots Run Amok

Still room for more convenience
===============================

.. sourcecode:: python

  from itertools import chain

  class FeedElement(objectify.ObjectifiedElement):

      def addIDs(self):
          "initialise the IDs of feed and entries"
          for element in chain([self], self.entry):
              if element.find(_ATOM_NS + "id") is None:
                  # attach the new ID to the element that lacks one
                  id_element = etree.SubElement(element, _ATOM_NS + "id")
                  id_element.text = make_guid()

Incremental API design
======================

* choose an XML API to start with

  * lxml.etree is general purpose

  * lxml.objectify is nice for document-style XML

* fix Elements that really need some API sugar

  * dict-mappings to children with specific content/attributes

  * properties for specially typed attributes or child values

  * simplified access to varying content types of an element

  * shortcuts for unnecessarily deep subtrees

* ignore what works well enough with the Element API

  * lists of homogeneous children -> Element iteration

  * string attributes -> .get()/.set()

* let the API grow at your fingertips

  * play with it and test use cases

  * avoid "I want because I can" feature explosion!

Setting up the Element mapping
==============================

Atom has a namespace => leave the mapping to lxml

.. sourcecode:: python

  # ...
  _atom_lookup = etree.ElementNamespaceClassLookup(
                    objectify.ObjectifyElementClassLookup())

  # map the classes to tag names
  ns = _atom_lookup.get_namespace(ATOM_NAMESPACE)
  ns["feed"]  = FeedElement
  ns["entry"] = EntryElement
  # ...
  # and so on
  # or use ns.update(vars()) with appropriate class names

  # create a parser that does some whitespace cleanup
  atom_parser = etree.XMLParser(remove_blank_text=True)

  # make it use our Atom classes
  atom_parser.set_element_class_lookup(_atom_lookup)

  # and help users in using our parser setup
  def parse(input):
      return etree.parse(input, atom_parser)

Using your new Atom API
=======================

.. sourcecode:: pycon

  >>> import atom
  >>> feed = atom.parse("ep2008/atom-example.xml").getroot()

  >>> print(len(feed.entry))
  1

  >>> print([entry.title for entry in feed.entry])
  ['Atom-Powered Robots Run Amok']

  >>> link_tag = "{%s}link" % atom.ATOM_NAMESPACE
  >>> print([link.get("href") for link in feed.iter(link_tag)])
  ['http://example.org/', 'http://example.org/2003/12/13/atom03']

Summary of lesson 3
===================

To implement an XML API ...

1) start off with lxml's Element API

   * or take a look at the object API of lxml.objectify

2) specialise it into a set of custom Element classes

3) map them to XML tags using one of the lookup schemes

4) improve the API incrementally while using it

   * discover inconveniences and beautify them

   * avoid putting work into things that work

Conclusion
==========

lxml ...

* provides a convenient set of tools for XML and HTML

  * parsing

  * generating

  * working with in-memory trees

* follows Python idioms wherever possible

  * highly extensible through wrapping and subclassing

  * callable objects for XPath, CSS selectors, XSLT, schemas

  * iteration for tree traversal (even while parsing)

  * list-/dict-like APIs, properties, keyword arguments, ...

* makes extension and specialisation easy

  * write a special XML generator module in trivial code

  * write your own XML API incrementally on-the-fly
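
As a recap, the steps from lesson 3 can be sketched end to end in one small script. This is only a minimal illustration, not the talk's ``atom.py``: it uses plain ``etree.ElementBase`` classes instead of lxml.objectify, and the names ``title_text`` and ``SAMPLE`` as well as the inline feed data are made up for the example.

```python
from lxml import etree

ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"
_ATOM_NS = "{%s}" % ATOM_NAMESPACE


class EntryElement(etree.ElementBase):
    @property
    def title_text(self):
        "text of the entry's title child, or None (hypothetical helper)"
        return self.findtext(_ATOM_NS + "title")


class FeedElement(etree.ElementBase):
    @property
    def entries(self):
        "all entry children, in document order"
        return self.findall(_ATOM_NS + "entry")


# step 3: map the classes to tag names in the Atom namespace
lookup = etree.ElementNamespaceClassLookup()
ns = lookup.get_namespace(ATOM_NAMESPACE)
ns["feed"] = FeedElement
ns["entry"] = EntryElement

# the class lookup is local to this parser
parser = etree.XMLParser(remove_blank_text=True)
parser.set_element_class_lookup(lookup)

# made-up sample data, only for this sketch
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry><title>Atom-Powered Robots Run Amok</title></entry>
</feed>"""

feed = etree.fromstring(SAMPLE, parser)
titles = [entry.title_text for entry in feed.entries]
print(titles)
```

Elements outside the Atom namespace, and Atom tags without a registered class, simply fall back to the default Element class, so the API can grow one class at a time.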