If only bz2 used byte-aligned blocks…

this post would have been written ages ago!  But it doesn’t and so here we are.

TL;DR version: new toy available to play with, uses the bz2 multistream enwikipedia XML dump of current pages and articles, displays text of the article of your choice.  No fancy rendering, and raw wikitext infoboxes still look like &^%$#@, that’s left as an exercise to the reader.  You will need linux, bash, python, the bz2 multistream files from here or from a mirror site, and the code from here. Ah, you’ll also need about 10GB worth of space, the dumps are large these days!

Want more information? Check out the README.txt and HACKING.txt.

A special mention here goes to the author of the Offline Wikipedia project, which used bzip2recover to exploit the block-oriented nature of the compression algorithm, breaking the XML file into tens of thousands of small files and building an index of page titles against these files. If bz2 used byte-aligned blocks, he might have written my code long ago!

About these ads

0 Responses to “If only bz2 used byte-aligned blocks…”



  1. Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s





Follow

Get every new post delivered to your Inbox.

%d bloggers like this: