Shoveling dumps out to the (archive.org) cloud

The Internet Archive is an amazing service, it truly is, even with all of its different interfaces for getting and uploading data (straight HTML for some things, JSON for others, and an S3-compatible-ish REST API for still others). When we look back on this period of our digital history, the Archive will surely be recognized as one of the great repositories of knowledge, a project that changed forever the course of the Internet. Naturally we want Wikimedia XML dumps to be a part of this repository.

Quite a while back I hacked together a Python script wrapped around a pile of curl invocations to upload XML dumps, with metadata generated automatically from the dump. But when it came time to make the script actually *usable*, eh, too much glue and too many paper clips. Writing a class to upload a large dump file in several pieces via the S3 multi-part upload protocol turned out to be far too ugly with the existing script.
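For anyone who hasn't met multi-part upload before, here is a rough sketch of the bookkeeping side of it: cut the file into numbered parts and checksum each one. This is generic S3 behavior as I understand it (part numbers start at 1, every part except the last has to be at least 5 MiB), not code from the upload script or from any library; the names `iter_parts` and `plan_upload` are made up for illustration, and nothing here talks to the network.

```python
import hashlib
import io

# S3's documented minimum size for every part except the last one.
MIN_PART_SIZE = 5 * 1024 * 1024


def iter_parts(fileobj, part_size=MIN_PART_SIZE):
    """Yield (part_number, chunk) pairs; S3 part numbers start at 1.

    Hypothetical helper for illustration only.
    """
    part_number = 1
    while True:
        chunk = fileobj.read(part_size)
        if not chunk:
            break
        yield part_number, chunk
        part_number += 1


def plan_upload(fileobj, part_size=MIN_PART_SIZE):
    """Return [(part_number, size, md5_hex), ...] without sending anything.

    A real uploader would send each chunk to the server and remember the
    ETag that comes back, then list all parts in the final "complete" call.
    """
    return [
        (number, len(chunk), hashlib.md5(chunk).hexdigest())
        for number, chunk in iter_parts(fileobj, part_size)
    ]


if __name__ == "__main__":
    # Tiny in-memory "file" and a tiny part size, just to show the shape.
    for part in plan_upload(io.BytesIO(b"a" * 10), part_size=4):
        print(part)
```

The tiny part size in the demo is only there to keep the example readable; a real upload would have to respect the 5 MiB floor, and the ugly part is everything around this loop (initiating the upload, retrying failed parts, and completing or aborting at the end).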

Problems like that have an obvious solution: toss the hacked-together crap and write something that sucks less. And thus was born YAS3Lib, yet another S3 library, with a pile of archive.org-specific extensions for handling the JSON/HTML/login cookie stuff that lives outside of S3 but is essential for working with archive.org materials.

The first commit of this library went up this past Friday, so we are talking very very beta, a library that will likely eat your data and then burp loudly when it’s done. Over the next couple of weeks it should get a lot cleaner. For adventurous folks who want to look at it right away, you can browse the code via gitweb. Stay tuned!
