Archive Page 2

Media mirrors, om nom nom

People who are close followers of the xml data dumps mailing list will already know that we have a mirror hosting all current Wikimedia images, and that we host bundles of media per project, generated every month as well.  (See the mirrors list at Meta for the links.)

Right now the mirror and the downloadable media bundles are hosted off-site; in fact the bundles are generated off-site!  But that’s due to change soon.  We have a couple of nice beefy servers that are going to be installed this week just for that work.

Because the media bundles contain all media for a given project, and many files are re-used across multiple projects, there is a lot of duplication and a lot of wasted space.  Got a couple ideas in mind for cleaning that up.
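One obvious approach — and this is just a sketch of the general idea, not a description of anything we actually run — is content-hash deduplication: hash every file, store each unique blob once, and let duplicate paths become references to the stored copy. The function name and return shape here are made up for illustration:

```python
import hashlib


def dedupe(paths):
    """Map each file to its content hash so duplicates can be stored
    once and referenced everywhere else. Returns (by_hash, refs):
    by_hash maps digest -> first path seen with that content,
    refs maps every path -> the digest of its content."""
    by_hash = {}
    refs = {}
    for path in paths:
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            # Read in 1 MB chunks so huge media files don't blow up memory.
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        digest = h.hexdigest()
        by_hash.setdefault(digest, path)
        refs[path] = digest
    return by_hash, refs
```

With something like this, a bundle only needs to ship `len(by_hash)` blobs plus a small manifest, instead of one physical copy per project that reuses a file.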

The other exciting thing happening in media-land is the move to distributed storage (Swift) for originals of all media we host.  Once that happens we’ll need to be able to keep our mirror in sync with the Swift copy.  I’m hoping for testing to be entertaining instead of really really annoying. 🙂  We shall see…
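Conceptually the sync step is just set arithmetic: compare a listing of what the mirror holds against what Swift reports, fetch what's new or changed, delete what's gone. A toy sketch of that planning step — the function and its inputs are invented for illustration; the real listings would come from Swift's container API on one side and the mirror's filesystem on the other:

```python
def plan_sync(mirror_listing, swift_listing):
    """Given two dicts mapping object name -> checksum, return
    (to_fetch, to_delete): objects missing from or stale on the
    mirror, and objects the mirror holds that upstream no longer has."""
    to_fetch = {name for name, csum in swift_listing.items()
                if mirror_listing.get(name) != csum}
    to_delete = set(mirror_listing) - set(swift_listing)
    return to_fetch, to_delete
```

The annoying parts, of course, are everything this sketch leaves out: listing millions of objects without timing out, and deciding what to trust when checksums disagree.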


Shoveling dumps out to the cloud

The Internet Archive is an amazing service, it truly is, even with all of its different interfaces for getting and uploading data (plain HTML for some things, JSON for others, and an S3-compatible-ish REST API for the rest). When we look back on this period of our digital history, the Archive will surely be recognized as one of the great repositories of knowledge, a project that changed the course of the Internet forever. Naturally we want the Wikimedia XML dumps to be a part of this repository.

Quite a while back I hacked together a Python script wrapped around a pile of curl invocations to upload XML dumps, with automatic generation of metadata based on the dump. But when it came time to make the script actually *usable*, eh, too much glue and paper clips. Writing a class to upload a large dump file in several pieces via the S3 multi-part upload protocol turned out to be far too ugly with the existing script.
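For the curious: the multi-part protocol boils down to three steps — initiate an upload to get an upload id, PUT each chunk with a part number (numbering starts at 1, and every part except the last must be at least 5 MB), then send a final "complete" request listing the parts. The chunk-splitting half of that can be sketched locally; the HTTP calls are omitted here:

```python
import hashlib


def split_into_parts(path, part_size=5 * 1024 * 1024):
    """Yield (part_number, data, md5_hex) tuples for a multi-part
    upload. S3 part numbers start at 1; the MD5 of each part is what
    ends up compared against the ETag the server returns."""
    with open(path, 'rb') as f:
        number = 1
        while True:
            data = f.read(part_size)
            if not data:
                break
            yield number, data, hashlib.md5(data).hexdigest()
            number += 1
```

Each yielded tuple corresponds to one PUT; the complete request then just replays the (part_number, ETag) pairs back to the server.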

Problems like that have an obvious solution: toss the hacked-together crap and write something that sucks less. And thus was born YAS3Lib, yet another S3 library, with a pile of extensions for handling the JSON/HTML/login cookie stuff that lives outside of S3 but is essential for working with Archive materials.
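To give a flavor of the plumbing such a library has to handle: every S3-style request carries an HMAC signature over a canonical description of the request. A minimal sketch of the classic (v2) signing scheme, stdlib only — the function name is mine, and the canonicalized amz- headers are omitted for brevity:

```python
import base64
import hashlib
import hmac


def sign_request(secret_key, verb, resource, date,
                 content_md5='', content_type=''):
    """Build the S3 v2 string-to-sign (verb, MD5, content type, date,
    resource, newline-separated) and return its base64-encoded
    HMAC-SHA1 signature."""
    string_to_sign = '\n'.join(
        [verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode('ascii')
```

Get one byte of the string-to-sign wrong and the server just says "SignatureDoesNotMatch", which is roughly as fun to debug as it sounds.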

The first commit of this library went up this past Friday, so we are talking very, very beta: a library that will likely eat your data and then burp loudly when it's done. Over the next couple of weeks it should get a lot cleaner. For adventurous folks who want to look at it right away, you can browse the code via gitweb. Stay tuned!

Visual Novels as educational tools

I had never heard of a visual novel before I went hunting for a virtual reality client in which I could build a basic adventure game for language learning.  Sidetracks happen, and so now there is the absolutely unapproved, unofficial and surely unfit for public consumption Wikipedia Quiz, written solely to get me acquainted with Ren’Py, a visual novel engine written in Python. It’s pretty slick actually, building bundles for Windoze, MacOS *and* Linux. Oh, bear in mind, I have no graphics abilities whatsoever, so I had to steal all the visual content. Thank goodness for open content, right? Anyways, go check out the engine and maybe build yourself a few quizzes. Let me know how it goes!

It’s May 2012 and hell must be freezing over

There’s no other logical explanation why I would try for the 5th or 6th time to start a regular blog.  Well, there is one:  I need a place to scribble down random musings about free software, open content, free access to information, and collaborative content creation and learning.  So far all my scribbling has been in IRC, which isn’t a very durable medium.  We’ll see how this pans out.  No promises though…