Technomancy

How to import current Wikipedia dumps

written by rory, on Oct 21, 2009 10:31:00 PM.

Wikipedia provides database dumps, and the main way to process the 5GB compressed XML file is a C programme called xml2sql, which converts it into a few raw text files representing the text of Wikipedia articles. However, the XML schema has changed and the current xml2sql programme no longer works. If you run it on a recent dump (e.g. from October 2009), you'll get this error:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | ./xml2sql
unexpected element <redirect>
./xml2sql: parsing aborted at line 33 pos 16.
The problem is the "<redirect />" element in the XML file. xml2sql doesn't know what to do with it, so it stops. The tag is identical wherever it appears in the dump, so I've managed to run xml2sql by stripping it out. You can do it like this:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | grep -v '    <redirect />' | ./xml2sql
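The grep above only matches when the tag is indented by exactly four spaces. As a rough sketch, assuming the tag always sits on a line of its own, the filter can be wrapped in a small shell function that tolerates any indentation. The name strip_redirect is made up for this example and is not part of xml2sql, and the pattern would not match a tag that carries attributes.

```shell
# strip_redirect is a hypothetical helper name, not part of xml2sql.
# It drops every line that contains nothing but an empty <redirect />
# element, however far the line is indented. Lines that merely mention
# the tag inside other content are left alone.
strip_redirect() {
    grep -v -E '^[[:space:]]*<redirect[[:space:]]*/>[[:space:]]*$'
}

# Assumed usage, mirroring the pipeline above:
#   bzcat enwiki-latest-pages-articles.xml.bz2 | strip_redirect | ./xml2sql
```

Anchoring the pattern to the whole line is a deliberate choice here: it avoids deleting article text that happens to contain the string "<redirect />".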

Comments

  • This is a very helpful article. It's really clearly written and I am going to give it a go tonight when I finish work, strangely looking forward to it!

    Thanks very much!

    Comment by danny, on Oct 26, 2009 10:26:20 AM

  • Thanks a lot for the information! I had this problem a few times and did not know how to fix it. I have to try it with the tag you wrote above; I am sure it will work! Thanks again!

    Comment by Samantha Kostenlose Online Spiele, on Nov 10, 2009 10:30:17 AM
