this post would have been written ages ago! But it doesn’t and so here we are.
TL;DR version: there’s a new toy available to play with. It uses the bz2 multistream enwikipedia XML dump of current pages and articles and displays the text of the article of your choice. There’s no fancy rendering, and raw wikitext infoboxes still look like &^%$#@; fixing that is left as an exercise for the reader. You will need Linux, bash, Python, the bz2 multistream files from here or from a mirror site, and the code from here. Ah, you’ll also need about 10GB of space; the dumps are large these days!
Want more information? Check out the README.txt and HACKING.txt.
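For the curious, here is a rough sketch of why multistream files make this kind of lookup cheap. It is not the code from the toy itself, just an illustration using Python's standard bz2 module: a multistream file is a series of independent bz2 streams laid end to end, each starting on a byte boundary, so if you know the byte offset of a stream you can seek there and decompress only that piece. The helper name read_stream and the tiny in-memory "dump" are made up for the example, and it assumes you already have the offset from an index of some kind.

```python
import bz2
import io


def read_stream(f, offset, chunk_size=65536):
    """Decompress the single bz2 stream starting at byte `offset` of `f`.

    Hypothetical helper for illustration; assumes `offset` really is the
    first byte of a bz2 stream (e.g. taken from an index file).
    """
    f.seek(offset)
    decomp = bz2.BZ2Decompressor()
    pieces = []
    # Stop as soon as this stream ends; bytes belonging to the next
    # stream end up in decomp.unused_data and are simply ignored.
    while not decomp.eof:
        data = f.read(chunk_size)
        if not data:
            break  # truncated input
        pieces.append(decomp.decompress(data))
    return b"".join(pieces)


# Toy stand-in for a multistream dump: two separately compressed
# chunks concatenated, with the starting offset of each recorded.
parts = [
    b"<page><title>Alpha</title></page>\n",
    b"<page><title>Beta</title></page>\n",
]
blob = b""
offsets = []
for part in parts:
    offsets.append(len(blob))
    blob += bz2.compress(part)

# Jump straight to the second stream without touching the first.
print(read_stream(io.BytesIO(blob), offsets[1]).decode())
```

With a real dump you would open the multistream file in binary mode and pass in an offset looked up by page title, rather than building the blob by hand.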
A special mention here goes to the author of the Offline Wikipedia project, which used bzip2recover to exploit the block-oriented nature of the compression algorithm, breaking the XML file into tens of thousands of small files and building an index of page titles against these files. If bz2 used byte-aligned blocks, he might have written my code long ago!