Digging into Structured Data for Media on Commons (Part 2)

Now that we know all about what MediaInfo content looks like and how to request it from the api, let’s see how to add MediaInfo content to an image. If you don’t remember all that, have a look at the previous blog post for a refresher.

Adding captions

Captions are called ‘labels’ in Wikibase lingo, and so we’ll want to put together a string that represents the json text defining one or more labels. Each label can have a string defined for one or more languages. This then gets passed to the MediaWiki api to do the heavy lifting.

Here’s an example of the ‘data’ parameter to the api before url encoding:

data={"labels":{"en":{"language":"en","value":"Category:Mak Grgić"},
"sl":{"language":"sl","value":"Mak Grgic"}}}

You’ll need to pass a csrf token, which you can get after logging in to MediaWiki (a quick sketch of fetching one follows the parameter list), and the standard parameters to wbeditentity, namely:

action=wbeditentity
id=<Mxxx, the MediaInfo id associated with your image>
summary=<comment summarizing the edit>
token=<csrf token you got from logging in>
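
For instance, fetching a csrf token with the requests module (more on requests below) might look like this minimal sketch, where api_url, cookies (from your login) and agent are stand-ins for your own setup:

        import requests

        # a minimal sketch: ask the api for a csrf token, reusing the
        # cookies from an earlier login
        params = {'action': 'query', 'meta': 'tokens', 'type': 'csrf',
                  'format': 'json'}
        response = requests.get(api_url, params=params, cookies=cookies,
                                headers={'User-Agent': agent})
        token = response.json()['query']['tokens']['csrftoken']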

Since I spend most of my coding life in Python land, I love the requests module. Here’s the relevant code for adding a caption:

        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': '{"labels":{"en":{"language":"en","value":"' + caption + '"}}}',
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

where variables like minfo_id, comment and so on should be self-explanatory.
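
One caveat: pasting the json together out of string fragments like that will break if the caption contains a double quote or a backslash. The standard library’s json.dumps sidesteps the escaping problem; here’s a sketch of building the same ‘data’ blob that way:

        import json

        # same 'data' blob as above, but with escaping handled for us
        labels = {'labels': {'en': {'language': 'en', 'value': caption}}}
        params['data'] = json.dumps(labels)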

You’ll get json back, and if the request fails within MediaWiki, there will be an entry named ‘error’ in the response with a string describing the error.
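
A minimal check might look like:

        result = response.json()
        if 'error' in result:
            # MediaWiki errors carry a short 'code' and a longer 'info' text
            raise RuntimeError('caption edit failed: %s' % result['error'])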

You can have a look at the add_captions() method in https://github.com/apergos/misc-wmf-crap/blob/master/glyph-image-generator/generate_glyph_pngs.py for any missing details.

Since that’s all pretty straightforward, let’s move on to…

Adding Depicts Statements

A ‘depicts’ statement is a Wikibase statement (or ‘claim’) that the image associated with the specified MediaInfo id depicts a certain subject. We specify this by using the Wikidata property id associated with ‘depicts’. For www.wikidata.org that is https://www.wikidata.org/wiki/Property:P180, and for the test version of Wikidata I work with at https://wikidata.beta.wmflabs.org it is https://wikidata.beta.wmflabs.org/wiki/Property:P245962, so you’ll need to tailor your script to the Wikibase repository you’re using.

When we set a depicts statement via the api, existing statements are not touched, so it’s good to check that we don’t already have a depicts statement that refers to our subject. We can retrieve the existing MediaInfo content (see the previous blog post for instructions) and check that there is no such depicts statement in the content before continuing.
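
Here’s a rough sketch of that check, where minfo_id, depicts_prop (the property id, e.g. ‘P180’) and depicts_id (the subject’s item id) are assumed to be in hand already:

        # fetch the current MediaInfo entity and look for an existing
        # depicts claim pointing at our subject
        params = {'action': 'wbgetentities', 'ids': minfo_id,
                  'format': 'json'}
        response = requests.get(api_url, params=params,
                                headers={'User-Agent': agent})
        entity = response.json()['entities'][minfo_id]
        # an entity with no statements may serialize them as an empty list
        statements = entity.get('statements') or {}
        have_it = any(claim['mainsnak'].get('datavalue', {})
                      .get('value', {}).get('id') == depicts_id
                      for claim in statements.get(depicts_prop, []))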

When we add a depicts or other statement, existing labels aren’t disturbed, so you can batch caption some images and then go on to batch add depicts statements without any worries.

The MediaInfo depicts statement, like any other Wikibase claim, has a ‘mainsnak’ with a ‘snaktype’ (see the previous blog post for more info). Crucially, the value for the depicts property must be an existing item in the Wikidata repository used by your image repo; it cannot be a text string but must be an item id (Qnnn).
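
If all you have is a name for your subject, the wbsearchentities module will turn up candidate item ids; a quick sketch, with ‘guitar’ standing in for your search term and wikidata_api_url for the repo’s api endpoint:

        # search the Wikidata repo for items matching a subject name
        params = {'action': 'wbsearchentities', 'search': 'guitar',
                  'language': 'en', 'type': 'item', 'format': 'json'}
        response = requests.get(wikidata_api_url, params=params,
                                headers={'User-Agent': agent})
        for match in response.json()['search']:
            print(match['id'], match.get('label'), match.get('description'))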

Here is an example of the ‘data’ parameter to the api before url encoding, using P180 for the depicts property and Qnnn standing in for the item id of the depicted subject:

data={"claims":[{"mainsnak":{"snaktype":"value","property":"P180",
"datavalue":{"value":{"entity-type":"item","id":"Qnnn"},
"type":"wikibase-entityid"}},"type":"statement","rank":"normal"}]}

For the requests module, you’ll have something like this:

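        # build the 'claims' json by hand: a single depicts claim whose
        # value is the item id (Qnnn) of the depicted subject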
        depicts = ('{"claims":[{"mainsnak":{"snaktype":"value","property":"' +
                   self.args['depicts'] +
                   '","datavalue":{"value":{"entity-type":"item","id":"' +
                   depicts_id + '"},' +
                   '"type":"wikibase-entityid"}},"type":"statement","rank":"normal"}]}')
        comment = 'add depicts statement'
        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': depicts,
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

Note that while these entries are called ‘statements’ when we retrieve MediaInfo content from the api, they are called ‘claims’ when we submit them. Other than that, make sure that you have the right property id for ‘depicts’ and you should be good to go.

There are some details like the ‘rank:normal’ bit that you can learn about here: https://commons.wikimedia.org/wiki/Commons:Depicts#Prominence (TL;DR: if you use ‘rank:normal’ for now you won’t hurt anything.)

Again, the variables ought to be pretty self-explanatory. For more details you can look at the add_depicts method in the generate_glyph_pngs.py script.

You’ll get an ‘error’ item in the json response if there’s a problem.

More about the sample script

The script is a quick tool I used to populate our testbed with a bunch of structured data; it’s not meant for production use! It doesn’t check for nor respect maxlag (the database replication lag status), it doesn’t handle dropped connections, it does no retries, it doesn’t use clientlogin for non-bot scripts, etc. But it does illustrate how to retrieve MediaInfo content, add captions, add items to Wikidata, and set depicts statements.
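
For instance, a production script would pass the maxlag parameter on every edit and back off whenever the servers report too much lag; a rough sketch of that, reusing the edit params and credentials from earlier:

        import time

        # tell MediaWiki to refuse the edit if replication lag exceeds
        # 5 seconds, then wait and retry when that happens
        params['maxlag'] = 5
        while True:
            response = requests.post(api_url, data=params, cookies=cookies,
                                     headers={'User-Agent': agent})
            result = response.json()
            if result.get('error', {}).get('code') == 'maxlag':
                time.sleep(5)
                continue
            break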

Much more is possible; this is just the beginning. Join the #wikimedia-commons-sd IRC channel on freenode.net for more!
