Technomancy

Entries tagged “data”

How to import current Wikipedia dumps

written by rory, on Oct 21, 2009 10:31:00 PM.

Wikipedia provides database dumps and the main way to process this 5GB compressed XML file is with a C programme called xml2sql, which converts that file into a few raw text files, representing the text of wikipedia articles. However the XML schema changed and the current xml2sql programme doesn't work. If you run it using a recent dump (eg from October 2009), you'll get this error:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | ./xml2sql
unexpected element <redirect>
./xml2sql: parsing aborted at line 33 pos 16.
The problem is the "<redirect />" element in the XML file. xml2sql doesn't know what to do with it and so stops. Each article has a "<redirect>" tag, and it doesn't change for any of the articles. I've managed to run xml2sql by stripping out this tag. You can do it like this:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | grep -v '    <redirect />' | ./xml2sql