Bruno Rocha at 22:50 of 08/21/2013

Working with XML is a pain in the ass, but unfortunately the format is far from falling into disuse. As long as we still have to deal with APIs that provide data only in XML, the solution is to rely on the various Python packages that do this kind of work.

I've worked with many of them, including the famous BeautifulSoup, ElementTree and xml.dom.minidom. Among those, the one I like most is xml.dom.minidom, because it is simple and ships with the standard library. However, when you need to sweep through XML structures in search of specific data, one of the best approaches is XPath, and here ElementTree has good support for a useful subset of it, while BeautifulSoup offers comparable search methods of its own.
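To make the XPath approach concrete, here is a minimal sketch using ElementTree from the standard library. The XML payload and its element names are made up for illustration, and keep in mind that ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sports-results payload, made up for this illustration.
XML = """
<results sport="football">
  <match id="1"><home>Lions</home><away>Tigers</away><score>2-1</score></match>
  <match id="2"><home>Bears</home><away>Wolves</away><score>0-0</score></match>
</results>
"""

root = ET.fromstring(XML)

# Every <score> element, at any depth below the root.
scores = [el.text for el in root.findall(".//score")]

# The <home> team of the match whose id attribute is "2".
home_of_match_2 = root.find("./match[@id='2']/home").text

print(scores)
print(home_of_match_2)
```

It works, but every piece of data you want costs you one more path expression, which is exactly the amount of XPath I was trying to avoid.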

A few weeks ago I started a Django project to query a sporting results API and write the returned data into models in the database, to feed an app for real-time blogging.

When I began to analyze the structure of the XML and the amount of XPath I would have to write, I thought, "I would like to work only with JSON, because it is the data format of the modern web". So I went to Google and typed "convert xml 2 dict", and to my surprise I found this great tool!

That's why Python Rocks!

This week I noticed that it has been updated to allow unparsing dicts back to XML, which makes it my preferred module for dealing with XML.

xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> import xmltodict
>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>>
>>> doc['mydocument']['@has']
'an attribute'
>>> doc['mydocument']['and']['many']
['elements', 'more elements']
>>> doc['mydocument']['plus']['@a']
'complex'
>>> doc['mydocument']['plus']['#text']
'element as well'
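Since the whole point for me was to treat XML as if it were JSON, here is a small sketch of that workflow. The match payload is hypothetical, invented for this example. It also shows one thing worth watching: a repeated element becomes a list, but a lone element comes back as a plain string unless you ask for a list with the force_list argument.

```python
import json

import xmltodict

# Hypothetical payload, made up for this illustration.
XML = """
<match id="42">
  <team>Home FC</team>
  <team>Away United</team>
  <score>2-1</score>
</match>
"""

doc = xmltodict.parse(XML)

# The parsed document is made of plain mappings, lists and strings,
# so it serializes to JSON directly.
as_json = json.dumps(doc, indent=2)
print(as_json)

# A single <team> is returned as a string, not as a one-item list...
single = xmltodict.parse("<match><team>Home FC</team></match>")

# ...unless the element name is listed in force_list.
forced = xmltodict.parse(
    "<match><team>Home FC</team></match>", force_list=("team",)
)

print(repr(single["match"]["team"]))
print(repr(forced["match"]["team"]))
```

If your code loops over a child element that may appear once or many times, forcing the list keeps the loop from iterating over the characters of a string.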

It's very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

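>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True  # keep parsing; a falsy return value stops the parser
...
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)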
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
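If you don't have a Discogs dump at hand, the streaming mode can be tried on a tiny in-memory document. This is only a sketch: the XML below is made up, and io.BytesIO stands in for the gzipped file.

```python
import io

import xmltodict

# Tiny stand-in for a large dump; the content is made up for illustration.
XML = b"""<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>"""

names = []


def handle_artist(path, artist):
    # path is the list of (element name, attributes) pairs leading to the item.
    # artist is the dict built for one element at depth 2.
    names.append(artist["name"])
    # Returning True tells the parser to continue with the next item.
    return True


# Any object with a read() method can be passed instead of a string.
xmltodict.parse(io.BytesIO(XML), item_depth=2, item_callback=handle_artist)

print(names)
```

Each item is handed to the callback as soon as its closing tag is seen, so the full document never has to sit in memory at once.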

It can also be used from the command line to pipe objects to a script like this:

myscript.py

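import sys, marshal
while True:
    try:  # each record is a marshalled (path, item) tuple
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:  # end of the piped stream
        break
    print(article['title'])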

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...
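The trick behind this pipeline is that marshal can write several objects back to back into one stream and read them out one at a time. Here is a minimal sketch of that round trip using only the standard library. The (path, item) record shape follows the script above, but the records themselves are made up, and an in-memory buffer stands in for the pipe or the cached file.

```python
import io
import marshal

# Made-up records in the (path, item) shape consumed by myscript.py.
records = [
    ([("mediawiki", None), ("page", None)], {"title": "Anarchism"}),
    ([("mediawiki", None), ("page", None)], {"title": "Autism"}),
]

# Writer side: dump one record after another into a binary stream.
stream = io.BytesIO()
for record in records:
    marshal.dump(record, stream)

# Reader side: load records until the stream is exhausted.
stream.seek(0)
titles = []
while True:
    try:
        _, article = marshal.load(stream)
    except EOFError:
        break
    titles.append(article["title"])

print(titles)
```

Keep in mind that the marshal format is tied to the Python version that wrote it, so a cache like this is best read back with the same interpreter.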

You can also convert in the other direction, using the unparse() function:

>>> mydict = {
...     'page': {
...         'title': 'King Crimson',
...         'ns': 0,
...         'revision': {
...             'id': 547909091,
...         }
...     }
... }
>>> print(xmltodict.unparse(mydict))
<?xml version="1.0" encoding="utf-8"?>
<page><title>King Crimson</title><ns>0</ns><revision><id>547909091</id></revision></page>
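For output meant for human eyes, unparse() also takes a pretty argument. The sketch below uses it and then parses the result again, to show one thing to be aware of in a round trip: XML has no number type, so integers come back as strings.

```python
import xmltodict

mydict = {
    "page": {
        "title": "King Crimson",
        "ns": 0,
        "revision": {"id": 547909091},
    }
}

# pretty=True puts each element on its own line, indented.
xml_text = xmltodict.unparse(mydict, pretty=True)
print(xml_text)

# Parsing the generated XML gives the same structure back,
# except that the integer values are now strings.
roundtrip = xmltodict.parse(xml_text)
print(repr(roundtrip["page"]["ns"]))
```

If you need the numbers back, convert them yourself after parsing.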

Ok, how do I get it?

You just need to

$ pip install xmltodict

There is an official Fedora package for xmltodict. If you are on Fedora or RHEL, you can do:

$ sudo yum install python-xmltodict

Donate

If you love xmltodict, consider supporting the author on Gittip.

Note: I am not involved in the project; I am just spreading the word because it is one of the most useful libraries I've ever encountered in the Python ecosystem.
