xmltodict: makes working with XML feel like you are working with JSON

Working with XML is a pain! Unfortunately, the format is far from falling into disuse. While we still have to deal with APIs that provide data only in XML, the solution is to count on the help of the various Python packages that do this kind of work.

I've worked with many of them, including the famous BeautifulSoup, ElementTree and xml.dom.minidom. Among those, the one I like most is minidom, because it is simple and available in the standard library. However, when you need to sweep XML structures in search of specific data, one of the best ways is XPath, and for that both ElementTree and BeautifulSoup have good support.

A few weeks ago I started a Django project to query a sports results API and write the returned data into models in the database, to feed an app for real-time blogging.

When I began to analyze the structure of the XML and the amount of XPath I would have to write, I thought, "I'd rather work only with JSON, the data format of the modern web." So I went to Google, typed "convert xml 2 dict", and to my surprise found this great tool!

That's why Python Rocks!

This week I noticed that it was updated to allow unparsing dicts back to XML, turning it into my preferred module for dealing with XML.


xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> doc = xmltodict.parse("""
... <mydocument has="an attribute">
...   <and>
...     <many>elements</many>
...     <many>more elements</many>
...   </and>
...   <plus a="complex">
...     element as well
...   </plus>
... </mydocument>
... """)
>>> doc['mydocument']['@has']
u'an attribute'
>>> doc['mydocument']['and']['many']
[u'elements', u'more elements']
>>> doc['mydocument']['plus']['@a']
u'complex'
>>> doc['mydocument']['plus']['#text']
u'element as well'

It's very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> def handle_artist(_, artist):
...     print artist['name']
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
King Crimson
Chris Potter

It can also be used from the command line to pipe objects to a script like this:


import sys, marshal
while True:
    _, article = marshal.load(sys.stdin)
    print article['title']


$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
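The cached stream can also be consumed directly from Python. Below is a minimal sketch; the (path, dict) tuple framing mirrors what the xmltodict.py pipe emits, and the example writes a tiny cache first just to be self-contained:

```python
import gzip
import marshal
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'dicts.gz')

# write a tiny cache of (path, dict) pairs, mimicking the xmltodict.py output
with gzip.open(path, 'wb') as f:
    for title in ('A Perfect Circle', 'King Crimson'):
        marshal.dump((None, {'title': title}), f)

# read the objects back one at a time; marshal raises EOFError at end of stream
titles = []
with gzip.open(path, 'rb') as f:
    while True:
        try:
            _, article = marshal.load(f)
        except EOFError:
            break
        titles.append(article['title'])

print(titles)  # ['A Perfect Circle', 'King Crimson']
```

Keep in mind that marshal is Python-version-specific: fine for a local cache like this, but not an interchange format.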

You can also convert in the other direction, using the unparse() function:


>>> mydict = {
...     'page': {
...         'title': 'King Crimson',
...         'ns': 0,
...         'revision': {
...             'id': 547909091,
...         }
...     }
... }
>>> print unparse(mydict)
<?xml version="1.0" encoding="utf-8"?>
<page><ns>0</ns><revision><id>547909091</id></revision><title>King Crimson</title></page>

Ok, how do I get it?

You just need to

$ pip install xmltodict

There is an official Fedora package for xmltodict. If you are on Fedora or RHEL, you can do:

$ sudo yum install python-xmltodict


If you love xmltodict, consider supporting the author on Gittip.

Note: I am not involved in the project; I am just spreading the word because it is one of the most useful libs I've ever encountered in the Python ecosystem.

KISS: Use the built in sum() instead of reduce to aggregate over a list comprehension

This post is the beginning of a KISS tag, a place where I will collect all the "over-complications" I find in code I work on, or even comment on my own mistakes.


Today I was working on a Django reports app and I saw this code:

result = reduce(lambda x, y: x + y,
        [i.thing.price for i in ModelObject.objects.filter(created_at__gte=date)])

At first look, especially because of the use of reduce, I thought it was a complicated problem to solve.

Two seconds later, I realized:

Why use reduce for a sum when Python already has the built-in sum() function?

result = sum([i.thing.price for i in ModelObject.objects.filter(created_at__gte=date)])

Well, Python gives us powerful builtins, so use them!

The other problem here is memory usage: the solution above first fetches the list of objects from filter(), then iterates over them one by one, doing a field lookup to get each price and building a whole new list of values just to sum them.
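Part of that waste is easy to avoid even without the ORM: sum() accepts a generator expression, which feeds it one value at a time instead of materializing an intermediate list. A pure-Python illustration (prices here is just a stand-in for the queryset field lookup):

```python
prices = [10, 25, 7]  # stand-in for the i.thing.price values

# list comprehension: builds the whole list in memory, then sums it
total_list = sum([p for p in prices])

# generator expression: feeds sum() one value at a time, no intermediate list
total_gen = sum(p for p in prices)

print(total_list, total_gen)  # 42 42
```

For Django specifically, though, the rows are still being pulled out of the database; the real fix is the database-side aggregation shown next.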

It can kill your server!

In this case things can be done in a better way: we are talking about Django, and even though I am a Django ORM hater, I know it has some cool features, like this one:

Django ORM aggregations

from django.db.models import Sum

queryset = ModelObject.objects.filter(created_at__gte=date)
aggregation = queryset.aggregate(price=Sum('thing__price'))
result = aggregation.get('price') or 0  # aggregate() yields None for an empty queryset

In the code above, the Sum aggregation is translated into an SQL statement and the sum is performed on the database side. Much better!

I really do not like the Django ORM syntax, and I dislike the way it binds objects; maybe because I am used to the wonderful DAL, I prefer to refer to data as data, i.e. as rows rather than objects. But when I am working with Django, I think it is best to use its powerful tools!

Keep It Simple Stupid!

Django ListField and SeparatedValuesField

Revisiting this, here is a ListField type you can use. It makes a few assumptions, such as that you are not storing complex types in your list. For this reason I used ast.literal_eval() to enforce that only simple, built-in types can be stored as members of a ListField:

from django.db import models
import ast

class ListField(models.TextField):
    __metaclass__ = models.SubfieldBase
    description = "Stores a python list"

    def __init__(self, *args, **kwargs):
        super(ListField, self).__init__(*args, **kwargs)

    def to_python(self, value):
        if not value:
            value = []

        if isinstance(value, list):
            return value

        return ast.literal_eval(value)

    def get_prep_value(self, value):
        if value is None:
            return value

        return unicode(value)

    def value_to_string(self, obj):
        value = self._get_val_from_obj(obj)
        return self.get_db_prep_value(value)

class Dummy(models.Model):
    mylist = ListField()

Taking it for a spin:

>>> from foo.models import Dummy, ListField
>>> d = Dummy()
>>> d.mylist
>>> d.mylist = [3,4,5,6,7,8]
>>> d.mylist
[3, 4, 5, 6, 7, 8]
>>> f = ListField()
>>> f.get_prep_value(d.mylist)
u'[3, 4, 5, 6, 7, 8]'

There you have it: the list is stored in the database as a unicode string, and when pulled back out it is run through ast.literal_eval().
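The reason for ast.literal_eval() instead of eval() is safety: it only accepts Python literals (strings, numbers, tuples, lists, dicts, booleans, None) and raises an error for anything else, so a malicious value stored in the database cannot execute code. A quick demonstration:

```python
import ast

# round-trips a plain list literal, exactly as ListField stores it
restored = ast.literal_eval(u'[3, 4, 5, 6, 7, 8]')
print(restored)  # [3, 4, 5, 6, 7, 8]

# but refuses anything that is not a literal
try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError:
    print('rejected')  # non-literal expressions raise ValueError
```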

Previously I suggested this solution from this blog post about Custom Fields in Django:

An alternative to the CommaSeparatedIntegerField, it allows you to store any separated values. You can also optionally specify a token parameter.

from django.db import models

class SeparatedValuesField(models.TextField):
    __metaclass__ = models.SubfieldBase

    def __init__(self, *args, **kwargs):
        self.token = kwargs.pop('token', ',')
        super(SeparatedValuesField, self).__init__(*args, **kwargs)

    def to_python(self, value):
        if not value: return
        if isinstance(value, list):
            return value
        return value.split(self.token)

    def get_db_prep_value(self, value):
        if not value: return
        assert(isinstance(value, list) or isinstance(value, tuple))
        return self.token.join([unicode(s) for s in value])

    def value_to_string(self, obj):
        value = self._get_val_from_obj(obj)
        return self.get_db_prep_value(value)
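Stripped of the Django machinery, this field is just a join on the way into the database and a split on the way out. A minimal sketch of that round trip (to_db and from_db are illustrative names; token is the separator, ',' by default):

```python
token = ','

def to_db(values):
    # list -> single delimited string, as get_db_prep_value does
    return token.join(str(s) for s in values)

def from_db(text):
    # delimited string -> list of strings, as to_python does
    return text.split(token)

stored = to_db([1, 2, 3])
print(stored)           # '1,2,3'
print(from_db(stored))  # ['1', '2', '3']
```

Note that values come back as strings; this is the trade-off against the ListField above, whose literal_eval() preserves the original types.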

Add a counter on Django admin home page

Recently I tried many ways to add a simple record counter to the Django admin home page: the model name in the app list showing the current record count, like "Used Coupons (42)".
I tried django-admin-tools, and overriding _meta in admin.py. The problem with admin-tools is that it installs a lot of additional stuff I did not want, and the problem with the other approaches was that I needed the counter to be dynamic: overriding _meta seemed to be the right way, but it is bound only once, and the count is not updated until the app restarts.

My friend Fernando Macedo did it the right way!

Specialize the string type to add the desired dynamic behavior:

from django.db import models

class VerboseName(str):
    def __init__(self, func):
        self.func = func

    def decode(self, encoding, errors):
        # the admin calls decode() on the verbose name,
        # so the count is recomputed on every rendering
        return self.func().decode(encoding, errors)

class UsedCoupons(models.Model):
    name = models.CharField(max_length=10)

    class Meta:
        verbose_name_plural = VerboseName(lambda: u"Used Coupons (%d)" % UsedCoupons.objects.count())

And this teaches us a lesson: try to solve your problems in pure Python before reaching for tricks or third-party solutions. (Wow, it is a dynamic language!)
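The Python 2 code above hooks decode() because that happens to be what the admin calls on verbose_name_plural. The underlying trick, specializing str so a method recomputes its value from a callable, can be sketched in a version-neutral way by overriding __str__ instead (the counter dict here is just an illustrative stand-in for objects.count()):

```python
class VerboseName(str):
    """A str subclass whose display value is recomputed on demand."""

    def __new__(cls, func):
        # str is immutable, so the callable is attached in __new__
        obj = super(VerboseName, cls).__new__(cls, '')
        obj.func = func
        return obj

    def __str__(self):
        # called every time the value is rendered
        return self.func()

count = {'n': 0}
name = VerboseName(lambda: 'Used Coupons (%d)' % count['n'])

count['n'] = 3
print(str(name))  # Used Coupons (3)
count['n'] = 7
print(str(name))  # Used Coupons (7)
```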

Programmatically check if Django South has migrations to run

Here is how to check programmatically whether South has pending migrations to run.

from south import migration
from south.models import MigrationHistory

def count_pending_migrations():
    apps = list(migration.all_migrations())

    applied = MigrationHistory.objects.filter(
        app_name__in=[app.app_label() for app in apps])
    applied = ['%s.%s' % (mi.app_name, mi.migration) for mi in applied]

    num_new_migrations = 0
    for app in apps:
        # don't shadow the imported migration module
        for app_migration in app:
            if '%s.%s' % (app_migration.app_label(), app_migration.name()) not in applied:
                num_new_migrations += 1

    return num_new_migrations

Wrapped in a function, it can be used to monitor the South state in the admin.

Based on south.management.commands.migrate and some copy/paste from Stack Overflow.

Using Python to get all the external links from a webpage

Based on Mark Pilgrim's Dive Into Python book.

Define the URL lister:

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # called by SGMLParser.__init__; chain up before adding our state
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # collect the href attribute of every <a> tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

Now the function which receives a URL, reads it and collects all the href attributes:

def get_urls_from(url):
    import urllib
    usock = urllib.urlopen(url)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    # keep only the external-looking links
    url_list = [item for item in parser.urls
                if item.startswith(('http', 'ftp', 'www'))]
    return url_list

Ok, now you can call it and pretty-print the result:

from pprint import pprint
pprint(get_urls_from(url))

and you get the list of external links found on the page.
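A final note: sgmllib was removed in Python 3, but the same idea works with the standard library's html.parser. A sketch under that assumption (the sample HTML fed to the parser is made up for illustration):

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

parser = URLLister()
parser.feed('<a href="http://example.com">one</a> <a href="/relative">two</a>')
print(parser.urls)  # ['http://example.com', '/relative']
```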