Archive for the 'open content' Category

Greek Wikipedian sued for adding sourced, documented information

No joke, sadly. Tomorrow is the preliminary injunction hearing where the judge will decide whether the user should be ordered to temporarily remove the content the suing politician doesn’t like, pending trial… only of course to have the edit reverted by someone else, I’m sure. Read all about it here: We are all Diu!

Unfortunately, as goofy as the lawsuit seems, the threat of censorship is real. What happens tomorrow could influence the future of Wikipedia in a big way. Stay tuned! And pass on the word, let people know.

If only bz2 used byte-aligned blocks…

this post would have been written ages ago! But it doesn’t, and so here we are.

TL;DR version: new toy available to play with; it uses the bz2 multistream enwikipedia XML dump of current pages and articles and displays the text of the article of your choice. No fancy rendering, and raw wikitext infoboxes still look like &^%$#@; that’s left as an exercise for the reader. You will need linux, bash, python, the bz2 multistream files from here or from a mirror site, and the code from here. Ah, you’ll also need about 10 GB of space, the dumps are large these days!

Want more information? Check out the README.txt and HACKING.txt.

A special mention here goes to the author of the Offline Wikipedia project, which used bzip2recover to exploit the block-oriented nature of the compression algorithm, breaking the XML file into tens of thousands of small files and building an index of page titles against these files. If bz2 used byte-aligned blocks, he might have written my code long ago!
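
For the curious, the core trick is simple enough to sketch. The multistream dump ships with an index of offset:pageid:title lines, where each offset points at an independent bz2 stream holding a run of pages, so you can seek straight to the stream containing your article and decompress just that piece. Below is a minimal sketch of the idea in python; the file names are placeholders, the “XML parsing” is naive string matching, and the real code with all its options lives behind the links above.

#!/usr/bin/env python3
# Minimal sketch: pull one article out of the bz2 multistream dump.
# File names are placeholders; see README.txt/HACKING.txt for the real tool.

import bz2
import sys

DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"         # placeholder
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"  # placeholder

def find_stream(title):
    """Scan the index (offset:pageid:title per line) for the given title.
    Returns the byte offsets bracketing the bz2 stream that holds the page;
    the end offset is None when the page lives in the final stream."""
    start = None
    with bz2.open(INDEX, mode="rt", encoding="utf-8") as index:
        for line in index:
            offset, _pageid, page_title = line.rstrip("\n").split(":", 2)
            offset = int(offset)
            if start is None:
                if page_title == title:
                    start = offset
            elif offset != start:
                return start, offset  # first offset past ours ends the stream
    return start, None

def show_article(title):
    start, end = find_stream(title)
    if start is None:
        raise SystemExit("title not found in the index")
    with open(DUMP, "rb") as dump:
        dump.seek(start)
        blob = dump.read(end - start) if end is not None else dump.read()
    # Each stream is an independent bz2 member holding a batch of <page>
    # elements with no enclosing root, so just match the title by hand.
    xml = bz2.BZ2Decompressor().decompress(blob).decode("utf-8")
    pos = xml.find("<title>%s</title>" % title)
    if pos == -1:
        raise SystemExit("page not found in the stream (index mismatch?)")
    text_start = xml.index(">", xml.index("<text", pos)) + 1
    print(xml[text_start:xml.index("</text>", text_start)])  # raw wikitext

if __name__ == "__main__":
    show_article(sys.argv[1])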

Return of the revenge of the disk space

We’ve been generating bundles of the media in use on the various Wikimedia projects, so that readers or editors of these projects can download the media with just a few clicks.  This approach is great for the downloader but takes more space than we would like, since files hosted on Commons and in use on multiple projects are stored multiple times.  We were hoping to be clever about this by pulling out the files stored in multiple places and bundling those up separately for download. The first step was to generate a list of the projects with the largest number of files also in use on some other project.  The results were discouraging.

Below is a list of the projects with the most media in use, in descending order by count (shown in parentheses), each followed by the number of media files it has in common with several other large projects.

enwiki(2237560): 519734|dewiki 480943|frwiki 304120|plwiki 352064|ruwiki 393602|eswiki 354075|itwiki
dewiki(1426181):  519734|enwiki 318361|frwiki 246937|itwiki 236249|ruwiki 223472|plwiki 222664|eswiki
frwiki(1046546):  480943|enwiki 318361|dewiki 255563|eswiki 250681|itwiki 219759|ruwiki 201138|plwiki
ruwiki(649728):   352064|enwiki 236249|dewiki 219759|frwiki 187919|eswiki 185788|itwiki 173388|plwiki

Eliminating all of the duplication between just the first few top projects would entail the creation of multiple separate files for download,  making things significantly less convenient for the downloader without the space gains to justify it.

For just the top five projects by media usage, the number of media files common to all of them is only 66979, a pittance.  But even if we took the roughly 500 thousand files in use on both dewiki and enwiki and put them in a separate bundle, with one bundle for the rest of enwiki and another for the rest of dewiki, that’s still not much of a gain compared to the nearly 6 million unique media files in use overall.
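
For what it’s worth, the figures above boil down to plain set intersections over per-project lists of media file names. A rough sketch of that computation in python (the per-project list files are invented for the example; they aren’t the actual inputs we used):

from itertools import combinations

projects = ["enwiki", "dewiki", "frwiki", "ruwiki", "eswiki", "itwiki", "plwiki"]

def media_in_use(project):
    # One media file name per line; the "<wiki>-media.txt" naming is made up.
    with open("%s-media.txt" % project, encoding="utf-8") as listing:
        return set(line.strip() for line in listing if line.strip())

in_use = {p: media_in_use(p) for p in projects}

# Pairwise overlaps, as in the listing above.
for a, b in combinations(projects, 2):
    print("%s/%s: %d files in common" % (a, b, len(in_use[a] & in_use[b])))

# Files common to the five projects with the most media in use.
top_five = sorted(projects, key=lambda p: len(in_use[p]), reverse=True)[:5]
print("common to the top five:",
      len(set.intersection(*(in_use[p] for p in top_five))))

# Total unique media files in use across all of these projects.
print("unique overall:", len(set.union(*in_use.values())))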

So for now we’ll just keep the media bundles per project like they are.  If anyone has any bright space-saving ideas, please chime in with a comment.

(Disk) space: the final frontier

Where are those awesome little cubes of holographic data that we used to see on Star Trek, which held seemingly endless amounts of data? While we wait for someone to get on that problem, I get to sort out mirrors and backups of media in a world where servers with large RAID arrays cost a hefty chunk of change. Just a few days ago our host that serves scaled media files was down to less than 90 GB of free space.

Lost in time and lost in space on the scaled media server

In theory scaled media can be regenerated from the originals at any time, but in practice we don’t have a media scaler cluster big enough to scale all media at once. This means that we need to be a bit selective in how we “garbage collect”. Typically we generate a list of media not in use on the Wikimedia projects and delete the scaled versions of those files. The situation was so bad, however, that the delete script (which sleeps between every batch of deletes) put enough pressure on the scaled media server that it became slow to respond, causing the scalers to slow down and affecting service for the entire site for a few minutes.
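
The cleanup itself is nothing fancy: take the list of scaled files whose originals aren’t in use anywhere and remove them in small batches, sleeping in between so the media server gets a breather. Something in this spirit, where the list file name, batch size and sleep interval are placeholders rather than the production values:

import os
import time

UNUSED_LIST = "unused-scaled-media.txt"  # one path per line; placeholder name
BATCH_SIZE = 1000                        # placeholder
SLEEP_SECS = 5                           # placeholder

with open(UNUSED_LIST, encoding="utf-8") as listing:
    paths = [line.strip() for line in listing if line.strip()]

for i in range(0, len(paths), BATCH_SIZE):
    for path in paths[i:i + BATCH_SIZE]:
        try:
            os.remove(path)
        except OSError:
            pass  # already gone or regenerated elsewhere; skip it
    time.sleep(SLEEP_SECS)  # give the scaled media server room to breathe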

The solution to this turned out to be to remove the scaled media server from the equation completely and rely entirely on the new distributed media backend, OpenStack’s Swift. Whew! But we are really only putting off a discussion that needs to happen soonish: how do we keep our pile of scaled media from expanding at crazy rates?

Consider that we will generate thumbnails of any requested size, on demand if they don’t exist already, and these files are never deleted until the next time we run some sort of cleanup script. With Swift it’s going to be easy to forget that we have limited disk storage and that scaled media are really just long-lived temporary files. Should we limit thumbnail generation to specific sizes only (which could be pregenerated rather than produced on the fly)? Should we generate anything requested but toss non-standard sizes every day? Should we toss less frequently used thumbs (and how would we know which ones those are) on a daily or weekly basis?

Media mirrors, om nom nom

People who are close followers of the XML data dumps mailing list will already know that we have a mirror hosting all current Wikimedia images at your.org, and that we host bundles of media per project, generated every month, as well.  (See the mirrors list at Meta for the links.)

Right now the mirror and the downloadable media bundles are hosted off-site; in fact the bundles are generated off-site!  But that’s due to change soon.  We have a couple of nice beefy servers that are going to be installed this week just for that work.

Because the media bundles contain all media for a given project, and many files are re-used across multiple projects, there is a lot of duplication and a lot of wasted space.  Got a couple ideas in mind for cleaning that up.

The other exciting thing happening in media-land is the move to distributed storage (Swift) for originals of all media we host.  Once that happens we’ll need to be able to keep our mirror in sync with the Swift copy.  I’m hoping for testing to be entertaining instead of really really annoying. 🙂  We shall see…

Shoveling dumps out to the (archive.org) cloud

The Internet Archive is an amazing service, it truly is, even with all of its different interfaces for getting and uploading data (straight html for some things, json for others, and an S3-compatible-ish REST API for the rest). When we look back on this period of our digital history, the Archive will surely be recognized as one of the great repositories of knowledge, a project that changed the course of the Internet forever. Naturally we want the Wikimedia XML dumps to be a part of this repository.

Quite a while back I hacked together a python script wrapped around a pile of curl invocations to upload XML dumps, with automatic generation of metadata based on the dump. But when it came time to make the script actually *usable*, eh, too much glue and paper clips. Writing a class to upload a large dump file in several pieces via the S3 multipart upload protocol turned out to be too ugly with the existing script.
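
For anyone unfamiliar with it, the multipart flow itself is straightforward in outline: start an upload, push the file up in numbered parts, then ask the server to stitch the parts together. Here’s a minimal sketch of that flow against archive.org’s S3-compatible endpoint using boto3, purely to illustrate the protocol; it is not what YAS3Lib does under the hood, and the item, key and credential values are made up.

# Sketch of the S3 multipart upload flow, for illustration only; YAS3Lib
# implements this itself along with the archive.org-specific extras.
# Item ("bucket"), key, credentials and part size are placeholders.

import boto3

ENDPOINT = "https://s3.us.archive.org"   # archive.org's S3-compatible API
ITEM = "example-wiki-dump"               # the archive.org item, i.e. the "bucket"
KEY = "example-pages-articles.xml.bz2"   # remote name; also the local file here
PART_SIZE = 100 * 1024 * 1024            # 100 MB chunks, arbitrary choice

s3 = boto3.client("s3", endpoint_url=ENDPOINT,
                  aws_access_key_id="YOUR_IA_ACCESS_KEY",
                  aws_secret_access_key="YOUR_IA_SECRET_KEY")

upload = s3.create_multipart_upload(Bucket=ITEM, Key=KEY)
parts = []
with open(KEY, "rb") as dumpfile:
    number = 1
    while True:
        chunk = dumpfile.read(PART_SIZE)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=ITEM, Key=KEY, PartNumber=number,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})
        number += 1

s3.complete_multipart_upload(Bucket=ITEM, Key=KEY, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})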

Problems like that have an obvious solution: toss the hacked together crap and write something that sucks less. And thus was born YAS3Lib, yet another S3 library, with a pile of archive.org-specific extensions for handling the json/html/login cookie stuff that lives outside of S3 but is essential for working with archive.org materials.

The first commit of this library went up this past Friday, so we are talking very, very beta, a library that will likely eat your data and then burp loudly when it’s done. Over the next couple of weeks it should get a lot cleaner. For adventurous folks who want to look at it right away, you can browse the code via gitweb. Stay tuned!

Visual Novels as educational tools

I had never heard of a visual novel before I went hunting for a virtual reality client in which I could build a basic adventure game for language learning.  Sidetracks happen, and so now there is the absolutely unapproved, unofficial and surely unfit for public consumption Wikipedia Quiz, written solely to get me acquainted with Ren’Py, a visual novel engine written in Python. It’s pretty slick actually, building bundles for Windoze, MacOS *and* Linux. Oh, bear in mind, I have no graphics abilities whatsoever, so I had to steal all the visual content. Thank goodness for open content, right? Anyways, go check out the engine and maybe build yourself a few quizzes. Let me know how it goes!