Archive for the 'open content' Category

Digging into Structured Data for Media on Commons (Part 2)

Now that we know all about what MediaInfo content looks like and how to request it from the api, let’s see how to add MediaInfo content to an image. If you don’t remember all that, have a look at the previous blog post for a refresher.

Adding captions

Captions are called ‘labels’ in Wikibase lingo, and so we’ll want to put together a string that represents the json text defining one or more labels. Each label can have a string defined for one or more languages. This then gets passed to the MediaWiki api to do the heavy lifting.

Here’s an example of the ‘data’ parameter to the api before url encoding:

data={"labels":{"en":{"language":"en","value":"Category:Mak Grgić"},
"sl":{"language":"sl","value":"Mak Grgic"}}}

You’ll need to pass a csrf token which you can get after logging in to MediaWiki, and the standard parameters to wbeditentity, namely:

action=wbeditentity
id=<Mxxx, the MediaInfo id associated with your image>
summary=<comment summarizing the edit>
token=<csrf token you got from logging in>
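If you don’t already have a csrf token handy, you can fetch one after logging in via the meta=tokens query module. A minimal sketch with requests (the session is assumed to already carry your login cookies; the function and variable names are mine):

import requests

session = requests.Session()  # assumed to already hold your login cookies

def get_csrf_token(api_url, user_agent):
    """Ask the api for a csrf token; assumes the session is already logged in."""
    response = session.get(api_url,
                           params={'action': 'query', 'meta': 'tokens',
                                   'type': 'csrf', 'format': 'json'},
                           headers={'User-Agent': user_agent})
    return response.json()['query']['tokens']['csrftoken']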

Since I spend most of my coding life in Python land, I love the requests module. Here’s the relevant code for that:

        # parameters for wbeditentity; note that the caption is spliced into the
        # json by hand here, so it must not contain unescaped double quotes
        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': '{"labels":{"en":{"language":"en","value":"' + caption + '"}}}',
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        # the csrf token and cookies come from an earlier login step
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

where variables like minfo_id, comment and so on should be self-explanatory.

You’ll get json back and if the request fails within MediaWiki, there will be an entry named ‘error’ in the response with some string describing the error.

You can have a look at the add_captions() method in https://github.com/apergos/misc-wmf-crap/blob/master/glyph-image-generator/generate_glyph_pngs.py for any missing details.

Since that’s all pretty straightforward, let’s move on to…

Adding Depicts Statements

A ‘depicts’ statement is a Wikibase statement (or ‘claim’) that the image associated with the specified MediaInfo id depicts a certain subject. We specify this by using the Wikidata property id associated with ‘depicts’. For www.wikidata.org that is https://www.wikidata.org/wiki/Property:P180, and for the test version of Wikidata I work with at https://wikidata.beta.wmflabs.org it is https://wikidata.beta.wmflabs.org/wiki/Property:P245962, so you’ll need to tailor your script to the Wikibase source you’re using.

When we set a depicts statement via the api, existing statements are not touched, so it’s good to check that we don’t already have a depicts statement that refers to our subject. We can retrieve the existing MediaInfo content (see the previous blog post for instructions) and check that there is no such depicts statement in the content before continuing.
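Here’s a minimal sketch of that check, reusing wbgetentities from the previous post (the helper name, the parameter names, and the default property id P180 are my own; adjust the property id for your Wikibase source):

import requests

def already_depicts(api_url, minfo_id, item_id, user_agent, depicts_prop='P180'):
    """Return True if the MediaInfo entity already has a depicts claim for item_id."""
    response = requests.get(api_url,
                            params={'action': 'wbgetentities', 'ids': minfo_id,
                                    'format': 'json'},
                            headers={'User-Agent': user_agent})
    entity = response.json()['entities'][minfo_id]
    # entities with no statements return [] rather than {}
    statements = entity.get('statements') or {}
    return any(
        claim['mainsnak'].get('datavalue', {}).get('value', {}).get('id') == item_id
        for claim in statements.get(depicts_prop, []))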

When we add a depicts or other statement, existing labels aren’t disturbed, so you can batch caption some images and then go on to batch add depicts statements without any worries.

The MediaInfo depicts statement, like any other Wikibase claim, has a ‘mainsnak’ and a ‘snaktype’ (see previous blog post for more info). Crucially, the value for the depicts property must be an existing item in the Wikidata repository used by your image repo; it cannot be a text string but must be an item id (Qnnn).

Here is an example of the ‘data’ parameter to the api before url encoding:

data={"labels":{"en":{"language":"en","value":"Template:Zoom"},
"fa":{"language":"fa","value":"الگو:Zoom"}},
"sitelinks":{"commonswiki":{"site":"commonswiki",
"title":"Template:Zoom"},"fawiki":{"site":"fawiki",
"title":"الگو:Zoom"}}}

For the requests module, you’ll have something like this:

        depicts = ('{"claims":[{"mainsnak":{"snaktype":"value","property":"' +
                   self.args['depicts'] +
                   '","datavalue":{"value":{"entity-type":"item","id":"' +
                   depicts_id + '"},' +
                   '"type":"wikibase-entityid"}},"type":"statement","rank":"normal"}]}')
        comment = 'add depicts statement'
        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': depicts,
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

Note that when we retrieve MediaInfo content from the api, these entries are called ‘statements’ in the output, but when we submit them, they are called ‘claims’. Other than that, make sure that you have the right property id for ‘depicts’ and you should be good to go.

There are some details like the ‘rank:normal’ bit that you can learn about here: https://commons.wikimedia.org/wiki/Commons:Depicts#Prominence (TL;DR: if you use ‘rank:normal’ for now you won’t hurt anything.)

Again, the variables ought to be pretty self-explanatory. For more details you can look at the add_depicts method in the generate_glyph_pngs.py script.

You’ll get an ‘error’ item in the json response if there’s a problem.

More about the sample script

The script is a quick tool I used to populate our testbed with a bunch of structured data; it’s not meant for production use! It doesn’t check for nor respect maxlag (the database replication lag status), it doesn’t handle dropped connections, it does no retries, it doesn’t use clientlogin for non-bot scripts, etc. But it does illustrate how to retrieve MediaInfo content, add captions, add items to Wikidata, and set depicts statements.
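If you want to be kinder to production wikis than this script is, passing maxlag with each write and backing off when the api complains is a good start. A minimal sketch of what that might look like (the function name and retry policy are my own):

import time
import requests

def post_with_maxlag(api_url, params, cookies, headers, retries=5, maxlag=5):
    """POST to the api with maxlag set, sleeping and retrying on lag errors."""
    params = dict(params, maxlag=maxlag)
    for _ in range(retries):
        response = requests.post(api_url, data=params,
                                 cookies=cookies, headers=headers)
        result = response.json()
        if result.get('error', {}).get('code') != 'maxlag':
            return result
        # when lagged, the api sets a Retry-After header suggesting how long to wait
        time.sleep(int(response.headers.get('Retry-After', 5)))
    return result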

Much more is possible; this is just the beginning. Join the #wikimedia-commons-sd IRC channel on freenode.net for more!

Digging into Structured Data for Media on Commons

Commons: Not a tragedy

Commons. What Commons?

If you are a contributor to Wikipedia or one of the other Wikimedia projects, you probably already know. If you aren’t, even as a frequent Wikipedia reader (and aren’t we all?), you may not know that almost all of the photographs and other images in the articles you read are hosted in a media repository, commons.wikimedia.org.

These images are generally free to reuse, modify, and share, for any purpose including commercial use. You can also create an account there and upload your own photographs, as long as they have educational value.

Image from Commons

Searching for cats on Commons

Structured Data is GREAT

But let’s suppose you want to sort through the media for some reason; maybe you want to find all of the photographs of edible wild mushrooms in your region, with the name of each species. Or maybe you want cute cat pictures. You could try the search tab and hope for the best, but all of the information about an image is in a blob of unstructured text, formatted in any number of random ways. You could look at the categories and see if you’re lucky enough that one category covers your needs, but odds are that for anything except the simplest of queries, you’ll come up empty-handed.

What about getting that information in a language other than English? Good luck with that; although the project is shared among speakers of many different languages, the predominant language of contribution and of category names is still English.

Suppose you want to monitor newly uploaded images for any of the above, via a script. How can you do that? With difficulty.

Until now.

Structured data allows contributors to add a caption, or information about what is depicted, to each uploaded media file, in any language or in multiple languages. While the file description is still an unstructured string of text per language, descriptions in multiple languages can be specified and descriptions can be extracted for a media file independently of anything else. The same holds true of captions, and more data is likely to be added in the future.

Why am I writing about this? I produce dumps, so what do I care?

Introducing Mediainfo entities

Ah ha! Someday we will be producing dumps of this data. You’ll have files that contain all Mediainfo entities, much as there are now files containing the various sorts of Wikidata entities available for download every week. And since we’re going to be dumping it, we need to understand the data: what is its format? How can we retrieve it via the MediaWiki api? How can we set it?

Mediainfo entities are similar to, but different from, other Wikibase entities. They are similar in that they have a special format (json) and cannot be edited directly via the wiki ‘edit source’ or ‘edit’ tabs; both also rely on the Wikibase extension. But Mediainfo entities are not stored in a set of separate tables, as Wikibase entities are. Instead, Mediainfo entities are stored in a secondary slot of the revision of the File page.

Wait, wut? What are slots? And what the devil is a ‘secondary slot’?

A side trip to Multi-Content Revisions

Time for a crash course in ‘Multi-Content Revisions’, also known as MCR. Until last year, MediaWiki had the following data model, as tweaked for use at Wikimedia:

  • Each article, template, user page, discussion page, and so on, is represented by a record in the ‘page’ table.
  • Each edit to a page is represented by a new record in the revision table.
  • Once a revision is added to the table, it never changes. It can be hidden from public view but never really removed.
  • A revision contains a pointer to a record in the text table.
  • The text table record contains a pointer to a record in a blobs table on one of our external storage servers.
  • The blob record contains the (usually gzipped) content of the revision as wikitext or json or css or whatever it happens to be.

So: one revision, one piece of content. And for articles (and File pages), that means one blob of wikitext with whatever formatting various editors have decided to give it.

But suppose…. just suppose that we could attach pieces of data to that page, also editable by contributors, and that could be displayed in a nice table or some other good data display format. A caption for an image, who or what’s in the image, maybe the creation date, maybe the name of the photographer, maybe EXIF data right from the image. Wouldn’t that be nice? Imagine if all of that data was available via the MediaWiki api, or easily searchable. Wouldn’t that be just grand?

That’s what Multi-Content Revisions are all about. That ‘extra data’ has to live somewhere and still be attached to a revision. So: slots. The ‘main slot’ contains wikitext or json or css or whatever the page normally has, that a contributor can edit the usual way. ‘Secondary’ slots, as many as we define, can have other data, like captions or descriptions or whatever else we decide is useful.

Now each edit to a page is represented by a new record in the revision table, but a revision contains a pointer to one or more entries in the slots table.

Each slot table entry contains a field indicating which slot it’s for (main? some other one?) and a pointer to an entry in the content table.

Each content table entry contains a pointer to a record in the text table, but at some point it will likely point directly to a blob on one of our external storage servers.

Back to Mediainfo entities

A Mediainfo entity, then, is a kind of Wikibase entity, structured data, in json format, that is stored in the ‘mediainfo’ slot for revisions of File pages on Commons.

Structured data tab for a file

This is easier to wrap one’s head around with an example.

If you look at https://commons.wikimedia.org/wiki/File:Marionina_welchi_(YPM_IZ_072302).jpeg you can see below the image that there is a ‘Structured Data’ tab. ‘Items portrayed in this file’ has a name listed.

If you look at the edit history for the file, you can see an entry with the following comment:

Created claim: depicts (d:P180): (d:Q5137114)

This means that someone entered the depiction information by clicking ‘Edit’ next to the ‘Items portrayed in this file’ message, and added the information. You can do the same, in any language, or you can modify or remove existing depiction statements.

JSON for “Depicts” data

Let’s see what the raw content behind a ‘depicts statement’ is.

We can get the content of a Mediainfo entity by providing the Mediainfo id to the MediaWiki api and doing a wbgetentities action. But first we need to get that id. How do we do that?

Here’s the trick: the Mediainfo id for a File page is ‘M’ plus the page id! So first we retrieve the page id via the api:

https://commons.wikimedia.org/w/api.php?action=query&prop=info&titles=File:Marionina_welchi_(YPM_IZ_072302).jpeg&format=json

There’s the page id: 82858744. So now we have the Mediainfo id M82858744, which we can pass to wbgetentities:

https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M82858744&format=json

And here’s the output, prettified for human readers like you and me:

{
  "entities": {
    "M82858744": {
      "pageid": 82858744,
      "ns": 6,
      "title": "File:Marionina welchi (YPM IZ 072302).jpeg",
      "lastrevid": 371114515,
      "modified": "2019-10-19T08:11:24Z",
      "type": "mediainfo",
      "id": "M82858744",
      "labels": {},
      "descriptions": {},
      "statements": {
        "P180": [
          {
            "mainsnak": {
              "snaktype": "value",
              "property": "P180",
              "hash": "bfa568e1a915cc36538364c66cbfeea50913feea",
              "datavalue": {
                "value": {
                  "entity-type": "item",
                  "numeric-id": 5137114,
                  "id": "Q5137114"
                },
                "type": "wikibase-entityid"
              }
            },
            "type": "statement",
            "id": "M82858744$21beba99-41e7-a211-66ac-1cacc78b806d",
            "rank": "preferred"
          }
        ]
      }
    }
  },
  "success": 1
}

You can see the depicts statement under ‘statements’, where ‘P180’ is the property ‘Depicts’, as seen on Wikidata. ‘Q5137114’ is Marionina welchi, as seen also on Wikidata. The P-items and Q-items are available from Wikidata by virtue of Mediainfo’s use of “Federated Wikibase”; see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/master/docs/federation.wiki for more on that.

Note!! Since the page id of the File page is effectively embedded in the Mediainfo text blob as stored in the database, any change to the page id can break things. See https://phabricator.wikimedia.org/T232087 for an example!
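Putting the two calls together, here’s a minimal sketch that goes from a File page title to its Mediainfo entity (the user agent string and function name are placeholders; use your own):

import requests

API_URL = 'https://commons.wikimedia.org/w/api.php'
HEADERS = {'User-Agent': 'mediainfo-example/0.1 (you@example.org)'}  # use your own

def get_mediainfo(title):
    """Look up the page id for the title, then fetch the Mediainfo entity Mxxx."""
    response = requests.get(API_URL,
                            params={'action': 'query', 'prop': 'info',
                                    'titles': title, 'format': 'json'},
                            headers=HEADERS)
    pages = response.json()['query']['pages']
    minfo_id = 'M' + str(next(iter(pages.values()))['pageid'])

    response = requests.get(API_URL,
                            params={'action': 'wbgetentities', 'ids': minfo_id,
                                    'format': 'json'},
                            headers=HEADERS)
    return response.json()['entities'][minfo_id]

entity = get_mediainfo('File:Marionina welchi (YPM IZ 072302).jpeg')
print(entity['statements'])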

Raw content of “Depicts” statement

What is the raw content of the slot for that revision in the database, you ask? I’ve got a little script for that. (Link: https://github.com/apergos/misc-wmf-crap/blob/master/get_revision.py) Here’s what I get when I run the script:

{"type":"mediainfo","id":"M82858744","labels":[],"descriptions":[],"statements":{"P180":[{"mainsnak":{"snaktype":"value","property":"P180","hash":"bfa568e1a915cc36538364c66cbfeea50913feea","datavalue":{"value":{"entity-type":"item","numeric-id":5137114,"id":"Q5137114"},"type":"wikibase-entityid"}},"type":"statement","id":"M82858744$21beba99-41e7-a211-66ac-1cacc78b806d","rank":"preferred"}]}}

That’s right, all that mainsnak and snaktype and other stuff is right there in the raw slot content.
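You can also pull the raw slot content over the api yourself, by asking the revisions module for just the mediainfo slot. A minimal sketch (rvslots is the standard MediaWiki parameter for this, but double-check the api help for your wiki; the constants and function name are placeholders):

import requests

API_URL = 'https://commons.wikimedia.org/w/api.php'
HEADERS = {'User-Agent': 'mediainfo-example/0.1 (you@example.org)'}  # use your own

def get_raw_slot(title, slot='mediainfo'):
    """Fetch the raw content of one slot of the latest revision of a page."""
    response = requests.get(API_URL,
                            params={'action': 'query', 'prop': 'revisions',
                                    'titles': title, 'rvprop': 'content',
                                    'rvslots': slot, 'format': 'json',
                                    'formatversion': 2},
                            headers=HEADERS)
    page = response.json()['query']['pages'][0]
    return page['revisions'][0]['slots'][slot]['content']

print(get_raw_slot('File:Marionina welchi (YPM IZ 072302).jpeg'))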

How statements work

A statement (in Wikidata parlance, a “claim”) is an assertion that a subject has a certain property (Pnnn) with a certain value. If the value turns out to be a person or a place or something else with a Wikidata entry (Qnnn), then that id can be used in place of the text value.

Here’s an example:

Boris Johnson holds the position (P39 “position held”) of Prime Minister of the UK (Q14211). But this hasn’t always been true and it won’t be true forever. This statement therefore needs qualifiers, such as: P580 (“start time”) with value 24 July 2019.

A “snak” (chosen as the next largest data item after “bit” and “byte”) is a claim with a property and value but no qualifiers.

In our case, an image can depict (P180) some person or thing (Q-id for the person/thing if there is one, or the name otherwise). Or it can have been created by (P170) some person. Or it can have been created (P571) on a certain date. The point is that statements (claims) of any sort can be added to a Mediainfo entity. However, some care should be taken before adding statements involving properties other than P180 and P170, preferably after discussion and agreement with the community. See https://commons.wikimedia.org/wiki/Commons_talk:Structured_data/Modeling for some of the discussion around use of properties for Commons media files.

For a note on the terminology “statements” vs. “claims”, see https://phabricator.wikimedia.org/T149410.

For more about snaks and snaktypes, see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/master/docs/json.wiki#261.

Looking at Captions

Let’s find a File which has a caption. I’ve already found one so we’ll just use it as an example:

https://commons.wikimedia.org/wiki/File:Stra%C3%9Fenbahn_Haltestelle_Freizeit-_und_Erholungszentrum-3.jpg

Underneath the image, in the ‘File Information’ tab, you can see the entry ‘Captions’, and there are some! If you select ‘See 1 more language’ you can see that there are two captions, one in English and one in German.

Let’s get the Mediainfo id for that file:

https://commons.wikimedia.org/w/api.php?action=query&prop=info&titles=File:Straßenbahn_Haltestelle_Freizeit-_und_Erholungszentrum-3.jpg&format=json

Great, it’s 83198284. Let’s plug M83198284 into our wbgetentities query:

https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M83198284&format=json

{
  "entities": {
    "M83198284": {
      "pageid": 83198284,
      "ns": 6,
      "title": "File:Straßenbahn Haltestelle Freizeit- und Erholungszentrum-3.jpg",
      "lastrevid": 371115009,
      "modified": "2019-10-19T08:16:04Z",
      "type": "mediainfo",
      "id": "M83198284",
      "labels": {
        "en": {
          "language": "en",
          "value": "Tram stop in Berlin, Germany"
        },
        "de": {
          "language": "de",
          "value": "Straßenbahn Haltestelle in Berlin"
        }
      },
      "descriptions": {},
      "statements": []
    }
  },
  "success": 1
}

Captions are called ‘labels’ and are returned by language with a simple text value for each caption. What’s the raw data, you ask?

{"type":"mediainfo","id":"M83198284","labels":{"en":{"language":"en","value":"Tram stop in Berlin, Germany"},"de":{"language":"de","value":"Stra\u00dfenbahn Haltestelle in Berlin"}},"descriptions":[],"statements":[]}

Pretty much as we expect, the entity type and id are stored along with the labels, with one entry per language.

Looking at Descriptions

Just kidding!  More seriously, there’s a placeholder for descriptions but that’s due to be removed (see https://phabricator.wikimedia.org/T213502) since captions suffice.

Looking at anything else

I’ve got a script. It’s crap because all code is crap, and all code not written for production is especially crap, and code written for personal use just to get the job done is extra especially crap. Nonetheless, I use this python script to get the Mediainfo entity for a given File on Commons: https://github.com/apergos/mw-scripts-crapola/blob/master/get_mediainfo.py and it gets the job done.

The Big Payoff: Search

Let’s find some images on Commons using this data. Go to the main page and search for any media with a “depicts” (P180) statement, by entering “haswbstatement:P180” in the search bar. You can check the Structured Data tab below the image of any of the files in the result, and see what’s depicted.

But that’s not all! You can specify what you want depicted: haswbstatement:P180=Q146 will find any media file marked as depicting a… https://www.wikidata.org/wiki/Q146 (a house cat), just by entering that into the search bar.

And that’s still not all! You can specify that you want only those pictures with captions in English and that depict Q146’s by entering hasdescription:en haswbstatement:P180=Q146. Note!! Captions used to be specified by the keyword “hascaption” but this has been changed, though you may see it referenced in older documentation or blog posts.

But there’s more! You can specify that you want pictures that depict something, created by someone else, with captions in some languages but not others, and CirrusSearch will serve that right up to you. Try it by searching for haswbstatement:P170=Q34788025 haswbstatement:P180=Q158942 hasdescription:en -hasdescription:fr and check the results.

But… you guessed it, that’s still not all. You can search for all media files that have a specified text in the caption, in addition to any other search criteria! Try it by searching for incaption:dog hasdescription:fr and check any file in the results.
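All of these searches work just as well through the api, so a script can use them too. A minimal sketch using the list=search module (the constants and function name are placeholders; the query string is one of the examples above):

import requests

API_URL = 'https://commons.wikimedia.org/w/api.php'
HEADERS = {'User-Agent': 'mediainfo-example/0.1 (you@example.org)'}  # use your own

def search_files(query, limit=10):
    """Run a CirrusSearch query restricted to the File namespace (6)."""
    response = requests.get(API_URL,
                            params={'action': 'query', 'list': 'search',
                                    'srsearch': query, 'srnamespace': 6,
                                    'srlimit': limit, 'format': 'json'},
                            headers=HEADERS)
    return [hit['title'] for hit in response.json()['query']['search']]

# media files that depict a house cat (Q146) and have an English caption
print(search_files('haswbstatement:P180=Q146 hasdescription:en'))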

Bonus: Editing!

I’ve got a script. It just updates captions because the format of those is the easiest, but if you look at it and the api help docs you can figure out the rest.

Script: https://github.com/apergos/misc-wmf-crap/blob/master/glyph-image-generator/set_mediainfo.py

MediaWiki api help: https://commons.wikimedia.org/w/api.php?action=help&modules=wbeditentity

Further Reading

There’s lots more to explore, so here are some links to get you started.


Greek Wikipedian sued for adding sourced documented information

No joke, sadly. Tomorrow is the preliminary injunction hearing where the judge will decide whether the user should be ordered to temporarily remove the content the suing politician doesn’t like, pending trial… only of course to have the edit reverted by someone else, I’m sure. Read all about it here: We are all Diu!

Unfortunately, as goofy as the lawsuit seems, the threat of censorship is real. What happens tomorrow could influence the future of Wikipedia in a big way. Stay tuned! And pass on the word, let people know.

If only bz2 used byte-aligned blocks…

this post would have been written ages ago!  But it doesn’t and so here we are.

TL;DR version: new toy available to play with, uses the bz2 multistream enwikipedia XML dump of current pages and articles, displays text of the article of your choice. No fancy rendering, and raw wikitext infoboxes still look like &^%$#@; prettifying those is left as an exercise for the reader. You will need linux, bash, python, the bz2 multistream files from here or from a mirror site, and the code from here. Ah, you’ll also need about 10GB worth of space; the dumps are large these days!
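For the curious, the core trick is small enough to sketch here: the multistream dump is a series of concatenated bz2 streams of roughly a hundred pages each, and the index file maps byte offset, page id and title, so you can seek straight to the one stream you need and decompress only that. A rough sketch, assuming the index file has already been decompressed (file names are the usual dump names; adjust for your setup):

import bz2

INDEX = 'enwiki-latest-pages-articles-multistream-index.txt'
DUMP = 'enwiki-latest-pages-articles-multistream.xml.bz2'

def find_offsets(title):
    """Return (start, end) byte offsets of the bz2 stream holding the title."""
    start = None
    with open(INDEX, encoding='utf-8') as index:
        for line in index:
            # each index line is offset:pageid:title
            offset, _pageid, pagetitle = line.rstrip('\n').split(':', 2)
            offset = int(offset)
            if start is None:
                if pagetitle == title:
                    start = offset
            elif offset > start:
                return start, offset
    return start, None

def get_page_xml(title):
    """Decompress the one stream containing the title and cut out its <page>."""
    start, end = find_offsets(title)
    if start is None:
        return None
    with open(DUMP, 'rb') as dump:
        dump.seek(start)
        blob = dump.read(end - start if end else -1)
    xml = bz2.BZ2Decompressor().decompress(blob).decode('utf-8')
    where = xml.find('<title>' + title + '</title>')
    if where == -1:
        return None
    return xml[xml.rfind('<page>', 0, where):xml.find('</page>', where) + len('</page>')]

print(get_page_xml('Cat'))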

Want more information? Check out the README.txt and HACKING.txt.

A special mention here goes to the author of the Offline Wikipedia project, which used bzip2recover to exploit the block-oriented nature of the compression algorithm, breaking the XML file into tens of thousands of small files and building an index of page titles against these files. If bz2 used byte-aligned blocks, he might have written my code long ago!

Return of the revenge of the disk space

We’ve been generating bundles of media in use on the various Wikimedia projects, so that readers or editors of these projects can download the media with just a few clicks. This approach is great for the downloader but takes more space than we would like, since files hosted on Commons in use on multiple projects will be stored multiple times. We were hoping to be clever about this by pulling out the files stored in multiple places and bundling those up separately for download. The first step was to generate a list of projects that have the largest number of files used by some other project. The results were discouraging.

Below is a list of the projects with the most media in use (the count is in parentheses), each followed by the number of media files it has in common with other projects, in descending order.

enwiki(2237560): 519734|dewiki 480943|frwiki 304120|plwiki 352064|ruwiki 393602|eswiki 354075|itwiki
dewiki(1426181):  519734|enwiki 318361|frwiki 246937|itwiki 236249|ruwiki 223472|plwiki 222664|eswiki
frwiki(1046546):  480943|enwiki 318361|dewiki 255563|eswiki 250681|itwiki 219759|ruwiki 201138|plwiki
ruwiki(649728):   352064|enwiki 236249|dewiki 219759|frwiki 187919|eswiki 185788|itwiki 173388|plwiki

Eliminating all of the duplication between just the first few top projects would entail the creation of multiple separate files for download,  making things significantly less convenient for the downloader without the space gains to justify it.

For the top five projects by media usage, the number of media files common to them all is only 66979, a pittance. But even if we took the 500 thousand files in use on dewiki and on enwiki and put them in a separate bundle, with a separate bundle for the rest of enwiki and a separate one for the rest of dewiki, that’s still not much of a gain compared to the nearly 6 million unique media files total in use.

So for now we’ll just keep the media bundles per project like they are.  If anyone has any bright space-saving ideas, please chime in with a comment.

(Disk) space: the final frontier

Where are those awesome little cubes of holographic data that we used to see on Star Trek, which contained seemingly endless amounts of data? While we wait for someone to get on that problem, I get to sort out mirrors and backups of media in a world where servers with large RAID arrays cost a hefty chunk of change. Just a few days ago our host that serves scaled media files was down to less than 90 GB of free space.

Lost in time and lost in space on the scaled media server

In theory scaled media can be regenerated from the originals at any time, but in practice we don’t have a media scaler cluster big enough to scale all media at once. This means that we need to be a bit selective in how we “garbage collect”. Typically we generate a list of media not in use on the Wikimedia projects and delete the scaled versions of those files. The situation was so bad, however, that the delete script, which sleeps in between every batch of deletes, put enough pressure on the scaled media server that it became slow to respond, causing the scalers to slow down and thus affecting service for the entire site for a few minutes.

The solution to this turned out to be to remove the scaled media server completely from the equation, and rely entirely on the new distributed media backend, OpenStack’s Swift. Whew! But we are really only putting off a discussion that needs to happen soonish: how do we keep our pile of scaled media from expanding at crazy rates?

Consider that we will generate thumbnails of any size requested, on demand if they don’t already exist, and these files are never deleted until the next time we run some sort of cleanup script. With Swift it’s going to be easy to forget that we have limited disk storage and that scaled media are really just long-lived temporary files. Should we limit thumb generation to specific sizes only (which could be pregenerated rather than produced on the fly)? Should we generate anything requested but toss non-standard sizes every day? Should we toss less frequently used thumbs (and how would we know which ones those are) on a daily or weekly basis?

Media mirrors, om nom nom

People who are close followers of the xml data dumps mailing list will already know that we have a mirror hosting all current wikimedia images at your.org, and that we host bundles of media per project generated every month as well.  (See the mirrors list at Meta for the links.)

Right now the mirror and the downloadable media bundles are hosted off-site; in fact the bundles are generated off-site!  But that’s due to change soon.  We have a couple of nice beefy servers that are going to be installed this week just for that work.

Because the media bundles contain all media for a given project, and many files are re-used across multiple projects, there is a lot of duplication and a lot of wasted space.  Got a couple ideas in mind for cleaning that up.

The other exciting thing happening in media-land is the move to distributed storage (Swift) for originals of all media we host.  Once that happens we’ll need to be able to keep our mirror in sync with the Swift copy.  I’m hoping for testing to be entertaining instead of really really annoying. 🙂  We shall see…

Shoveling dumps out to the (archive.org) cloud

Internet Archive is an amazing service, it truly is, even with all of its different interfaces for getting and uploading data (straight html for some things, json for others, and an S3-compatible-ish REST API for other things). When we look back on this period of our digital history, the Archive will surely be recognized as one of the great repositories of knowledge, a project that changed forever the course of the Internet. Naturally we want Wikimedia XML dumps to be a part of this repository.

Quite a while back I hacked together a python script wrapped around a pile of curl invocations to upload XML dumps with automatic generation of metadata based on the dump. But when it came time to make the script actually *usable*, eh, too much glue and paper clips. Writing a class to upload a large dump file in several pieces via the S3 multi-part upload protocol turned out to be too too ugly with the existing script.

Problems like that have an obvious solution: toss the hacked together crap and write something that sucks less. And thus was born YAS3Lib, yet another S3 library, with a pile of archive.org-specific extensions for handling the json/html/login cookie stuff that lives outside of S3 but is essential for working with archive.org materials.

The first commit of this library went up this past Friday, so we are talking very very beta, a library that will likely eat your data and then burp loudly when it’s done. Over the next couple of weeks it should get a lot cleaner. For adventurous folks who want to look at it right away, you can browse the code via gitweb. Stay tuned!

Visual Novels as educational tools

I had never heard of a visual novel before I went hunting for a virtual reality client in which I could build a basic adventure game for language learning.  Sidetracks happen, and so now there is the absolutely unapproved, unofficial and surely unfit for public consumption Wikipedia Quiz, written solely to get me acquainted with Ren’Py, a visual novel engine written in Python. It’s pretty slick actually, building bundles for Windoze, MacOS *and* Linux. Oh, bear in mind, I have no graphics abilities whatsoever, so I had to steal all the visual content. Thank goodness for open content, right? Anyways, go check out the engine and maybe build yourself a few quizzes. Let me know how it goes!