How to import current Wikipedia dumps
Wikipedia provides database dumps. The main way to process the 5GB compressed XML articles dump is with a C programme called xml2sql, which converts the file into a few raw text files representing the text of Wikipedia articles.
However, the XML schema has changed and the current xml2sql programme no longer works. If you run it on a recent dump (e.g. from October 2009), you'll get this error:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | ./xml2sql
unexpected element <redirect>
./xml2sql: parsing aborted at line 33 pos 16.
The problem is the "<redirect />" element in the XML file. xml2sql doesn't know what to do with it, so it stops. The tag looks the same wherever it appears, so it can be filtered out with a fixed pattern. I've managed to run xml2sql by stripping out this tag. You can do it like this:
$ bzcat enwiki-latest-pages-articles.xml.bz2 | grep -v ' <redirect />' | ./xml2sql
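If the indentation in your dump differs from the pattern above, a slightly more tolerant filter may help. The following is only a sketch: the function name strip_redirect and the sample page are made up for illustration, and it assumes, as the grep command above does, that the tag sits on a line of its own. It lets you try the filter on a few lines before committing to the full dump.

```shell
# Drop lines that contain nothing but a self-closing <redirect /> element,
# regardless of indentation. Reads stdin, writes stdout.
# (strip_redirect is a hypothetical helper name, not part of xml2sql.)
strip_redirect() {
    grep -v -E '^[[:space:]]*<redirect[[:space:]]*/>[[:space:]]*$'
}

# Tiny made-up sample page, just to check the filter's behaviour.
sample='<page>
    <title>Example</title>
    <redirect />
    <text>#REDIRECT [[Other]]</text>
</page>'

# Print the sample with the <redirect /> line removed.
printf '%s\n' "$sample" | strip_redirect
```

Once the output looks right, the same function can stand in for the grep step in the pipeline:

$ bzcat enwiki-latest-pages-articles.xml.bz2 | strip_redirect | ./xml2sql

Anchoring the pattern to the whole line means that article text which merely mentions "<redirect />" inside a longer line is left alone.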