One month with the reMarkable tablet

A month has passed. Time for an update on my experience with the reMarkable tablet. Here’s what’s happened since my last blog post on its use, plus a few comments about workflow, feature use, and improvements that I hope the company will consider.

I replaced the first nib. For several days my writing experience was strange, the tablet seemingly more fidgety and sensitive. I finally had another look at the new nib and found it had not been inserted all the way. Problem solved.

I rsync the tablet every few days to my desktop. Eventually that will be a daily process, with weekly new full backups and daily incremental updates. Cleanup work on the forked script for conversion of notebooks to pdfs is ongoing.
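I haven’t published the backup script, but the scheme is easy to sketch in Python: build an rsync command that snapshots /home/root into a dated directory, using --link-dest so a daily run hardlinks unchanged files against the previous snapshot. The host name and paths below are placeholders for my own setup, not anything official:

```python
import datetime

def rsync_backup_command(host, dest_root, link_dest=None):
    """Build an rsync invocation that copies the tablet's /home/root
    into a dated snapshot directory.  When link_dest points at the
    previous snapshot, unchanged files are hardlinked rather than
    copied, so daily runs are effectively incremental while each
    snapshot still looks like a full backup on disk."""
    today = datetime.date.today().isoformat()
    cmd = ['rsync', '-a', '--delete']
    if link_dest:
        cmd.append('--link-dest=' + link_dest)
    cmd.append('root@{}:/home/root/'.format(host))
    cmd.append('{}/{}/'.format(dest_root, today))
    return cmd
```

Run it weekly without link_dest for a fresh full copy, and daily with link_dest pointing at the newest snapshot.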

reMarkable tablet with custom sleep screen

I leave WiFi off except during rsyncs; I haven’t yet attempted to convert any pdfs to text. I’d be using an out-of-band means to do that in any case, since I don’t want to save data on the company’s servers.

The problem with pages in the thumbnail view being cropped on top has still not been resolved. I make allowances by writing a title lower down in the page. It’s not great. (Where’s that open issue tracker, folks?)

I’m getting into the routine of things with my daily todo lists. At some point I’ll probably have notebook blanks with dates already filled in, in big bold text that appears clearly on the thumbnails, and with standard todo items above the date, to be duplicated for each half of every month.

I’ve not yet played with using layers to display titles in large text and free up the virtual lined paper for notes.

I have a Thingiverse 3D-printed stylus clip (see https://www.thingiverse.com/thing:2950560) which fits well and prevents the pen from rolling around on the smooth table surface. No folio yet, as I’ve not needed to travel with the tablet and have not seen a solution I like. I’d prefer a book-style cover that is both slim and does not require the tablet to be glued into the cover. Anybody have suggestions? DIY acceptable!

I take scratch notes on various topics while working through items in my todo lists. These eventually get cleaned up and exported into separate notebooks or added to existing ones. Moving out multiple pages is a bit tedious; that’s another feature that would be nice to have.

At some point it will be ok to move non-agenda notebooks off to the laptop; these should be converted to pdfs. That’s more incentive to get the conversion script working properly. I need to investigate the various pdf conversion tools and see what produces clean output as well as pdfs of a reasonable size.

If I could add the pen switcher menu to the pen/pencil/marker in the small menu display, as well as the thumb view on top, I could stay in small or no menu view almost all of the time, instead of switching to large menu view to get these items and then closing it again. Consider it a step closer to the ‘no clutter’ intended design of the device.

A couple of times, while holding the tablet with a notebook page displayed and the full menu open, I have touched something in passing that wiped the page. A couple of undos restored it, but it would be nice to know what cleared the page in the first place.

Sometimes the previous page and next page buttons at the bottom of the tablet don’t respond on first try and I need to press them again. Perhaps this is a sensitivity issue. It’s not a huge drawback, more of a small annoyance, but something that should be ironed out for future shipments of the device.

At this point, I don’t have any desire for other apps on the device except for rsync and some way to manage custom templates, splash screens and so on. Sure, everyone can roll their own but that’s a waste of hours of fiddling around by a lot of users. These items should be easy enough for the company to fold into the firmware.

I’m a little bit anxious about what will happen when the battery starts to fail to hold a charge in a couple of years, because it’s not intended to be replaceable. I’m not one of those folks that gets a new phone every couple of years; devices should be built to last and to be repaired and used as long as possible, rather than regularly added to the landfill. Maybe by then, a third party battery will be available and we’ll be able to manage replacements ourselves, even if completely unauthorized.

I have not once used the tablet to read a pdf or epub; that happens on the laptop. I don’t have commute time that gets filled by reading, and coffee shops are for socializing. We’ll see if that’s changed by the 6 month mark.

I still haven’t cracked open the paper notebooks even once. Maybe when I go somewhere without power for longer than a few days 🙂

Digging into Structured Data for Media on Commons (Part 2)

Now that we know all about what MediaInfo content looks like and how to request it from the api, let’s see how to add MediaInfo content to an image. If you don’t remember all that, have a look at the previous blog post for a refresher.

Adding captions

Captions are called ‘labels’ in Wikibase lingo, and so we’ll want to put together a string that represents the json text defining one or more labels. Each label can have a string defined for one or more languages. This then gets passed to the MediaWiki api to do the heavy lifting.

Here’s an example of the ‘data’ parameter to the api before url encoding:

data={"labels":{"en":{"language":"en","value":"Category:Mak Grgić"},
"sl":{"language":"sl","value":"Mak Grgic"}}}

You’ll need to pass a csrf token which you can get after logging in to MediaWiki, and the standard parameters to wbeditentity, namely:

action=wbeditentity
id=<Mxxx, the MediaInfo id associated with your image>
summary=<comment summarizing the edit>
token=<csrf token you got from logging in>

Since I spend most of my coding life in Python land, I love the requests module. Here’s the relevant code for that:

        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': '{"labels":{"en":{"language":"en","value":"' + caption + '"}}}',
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

where variables like minfo_id, comment and so on should be self-explanatory.

You’ll get json back and if the request fails within MediaWiki, there will be an entry named ‘error’ in the response with some string describing the error.
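Note that the HTTP status will be 200 even when the edit fails inside MediaWiki, so it’s the body you want to inspect. A minimal check might look like this (the function name is just for illustration, it’s not from the script):

```python
def edit_succeeded(resp_json):
    """Return (True, None) if a wbeditentity response reports success,
    or (False, description) if it contains an 'error' entry."""
    if 'error' in resp_json:
        err = resp_json['error']
        return False, err.get('info', err.get('code', 'unknown error'))
    return True, None
```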

You can have a look at the add_captions() method in https://github.com/apergos/misc-wmf-crap/blob/master/glyph-image-generator/generate_glyph_pngs.py for any missing details.

Since that’s all pretty straightforward, let’s move on to…

Adding Depicts Statements

A ‘depicts’ statement is a Wikibase statement (or ‘claim’) that the image associated with the specified MediaInfo id depicts a certain subject. We specify this by using the Wikidata property id associated with ‘depicts’. For www.wikidata.org that is https://www.wikidata.org/wiki/Property:P180, and for the test version of Wikidata I work with at https://wikidata.beta.wmflabs.org it is https://wikidata.beta.wmflabs.org/wiki/Property:P245962, so you’ll need to tailor your script to the Wikibase source you’re using.

When we set a depicts statement via the api, existing statements are not touched, so it’s good to check that we don’t already have a depicts statement that refers to our subject. We can retrieve the existing MediaInfo content (see the previous blog post for instructions) and check that there is no such depicts statement in the content before continuing.
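A sketch of that check, assuming the entity json as returned by wbgetentities (the function name is mine, not from the script):

```python
def already_depicts(entity, depicts_prop, item_id):
    """Scan an entity's statements for a depicts claim whose value is
    the given item id (Qnnn).  Tolerates an empty statements list and
    snaks with no datavalue (e.g. snaktype 'somevalue')."""
    statements = entity.get('statements') or {}
    if isinstance(statements, list):   # no statements serialize as []
        return False
    for statement in statements.get(depicts_prop, []):
        datavalue = statement.get('mainsnak', {}).get('datavalue', {})
        if datavalue.get('value', {}).get('id') == item_id:
            return True
    return False
```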

When we add a depicts or other statement, existing labels aren’t disturbed, so you can batch caption some images and then go on to batch add depicts statements without any worries.

The MediaInfo depicts statement, like any other Wikibase claim, has a ‘mainsnak’ and a ‘snaktype’ (see previous blog post for more info). Crucially, the value for the depicts property must be an existing item in the Wikidata repository used by your image repo; it cannot be a text string but must be an item id (Qnnn).

Here is an example of the ‘data’ parameter to the api before url encoding:

data={"claims":[{"mainsnak":{"snaktype":"value","property":"P180",
"datavalue":{"value":{"entity-type":"item","id":"Q5137114"},
"type":"wikibase-entityid"}},"type":"statement","rank":"normal"}]}

For the requests module, you’ll have something like this:

        depicts = ('{"claims":[{"mainsnak":{"snaktype":"value","property":"' +
                   self.args['depicts'] +
                   '","datavalue":{"value":{"entity-type":"item","id":"' +
                   depicts_id + '"},' +
                   '"type":"wikibase-entityid"}},"type":"statement","rank":"normal"}]}')
        comment = 'add depicts statement'
        params = {'action': 'wbeditentity',
                  'format': 'json',
                  'id': minfo_id,
                  'data': depicts,
                  'summary': comment,
                  'token': self.args['creds']['commons']['token']}
        response = requests.post(self.args['wiki_api_url'], data=params,
                                 cookies=self.args['creds']['commons']['cookies'],
                                 headers={'User-Agent': self.args['agent']})

Note that while these entries are called ‘statements’ in the output when we retrieve MediaInfo content from the api, they are called ‘claims’ when we submit them. Other than that, make sure that you have the right property id for ‘depicts’ and you should be good to go.

There are some details, like the ‘"rank":"normal"’ bit, that you can learn about here: https://commons.wikimedia.org/wiki/Commons:Depicts#Prominence (TL;DR: if you use ‘"rank":"normal"’ for now, you won’t hurt anything.)

Again, the variables ought to be pretty self-explanatory. For more details you can look at the add_depicts method in the generate_glyph_pngs.py script.
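One caveat about the string concatenation used above: a caption or item value containing a double quote will produce invalid json. If I rework the script, I’ll build the ‘data’ parameter with json.dumps instead; a sketch:

```python
import json

def build_caption_data(lang, caption):
    """Build the 'data' parameter for a wbeditentity caption edit.
    json.dumps handles quoting, so captions containing quotes or
    backslashes won't produce invalid json."""
    return json.dumps(
        {'labels': {lang: {'language': lang, 'value': caption}}})

def build_depicts_data(depicts_prop, item_id):
    """Build the 'data' parameter for a wbeditentity depicts edit."""
    claim = {'mainsnak': {'snaktype': 'value',
                          'property': depicts_prop,
                          'datavalue': {'value': {'entity-type': 'item',
                                                  'id': item_id},
                                        'type': 'wikibase-entityid'}},
             'type': 'statement',
             'rank': 'normal'}
    return json.dumps({'claims': [claim]})
```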

You’ll get an ‘error’ item in the json response, if there’s a problem.

More about the sample script

The script is a quick tool I used to populate our testbed with a bunch of structured data; it’s not meant for production use! It doesn’t check for nor respect maxlag (the database replication lag status), it doesn’t handle dropped connections, it does no retries, it doesn’t use clientlogin for non-bot scripts, etc. But it does illustrate how to retrieve MediaInfo content, add captions, add items to Wikidata, and set depicts statements.

Much more is possible; this is just the beginning. Join the #wikimedia-commons-sd IRC channel on freenode.net for more!

One week with the reMarkable tablet

Spoiler: I like it a lot because it fits my workflow.

Why get an e-paper tablet?

I used to keep a small spiral notebook next to my laptop. In this I wrote notes on everything: todo lists for each day, notes from team meetings, notes while debugging some issue, notes from phone calls from friends or family with things to follow up; notes about travel or short story ideas or anything else. It all went into this notebook, in the order of arrival, with no indexing system other than “this probably was a couple weeks ago, let me search about 10 pages back”.

I have about 15 or 20 of these notebooks filled right now, with no dates and no way to retrieve the information once it’s been added and the notebook put away.

What finally pushed me over the edge was the pile of carefully cut pages I kept from the last notebook on unit testing, which I needed to refer to for current work. Inconvenient; and if I dropped the pile, they would be in random order (because of course not only do I not date the pages in these paper notebooks, I don’t number them either, because that would be far too much work). And then I saw a reference to a tablet for writing things down, and reading ebooks. Only for that. Not that I cared about reading ebooks; I have a laptop for that. But, hey, maybe that added feature would come in handy. And being able to scribble notes on some paper I was reading, that seemed useful, if not my primary use case.

Internet investigations, review scourings and repo perusals followed, and finally one day, in a fit of fed-upness, I placed the order.

First impressions

It was quick to arrive, save for passing through customs in my country. I had expected a long delay after seeing a notice about a backlog on their web site. But here was the box, in my hands, several days early. Nice!

The packaging of some products annoys me; it comes off as ostentatious, or wasteful, or bulky for no good reason. The reMarkable’s packaging is low-key, with a tab and ribbon-style opener that reminds one of old hardbound books and slower more thoughtful times. I like it; I kept the box.

There was neither a CD nor a full-fledged user’s guide inside; it took a little time for me to find one online for the current version of the firmware (1.8.1.1). I decided to charge the tablet fully before playing, and set it aside. It felt light in the hand, the cable fit snugly, and the pen looked and felt like any stylus pen, a tool for the work and nothing more. I did notice approvingly the hidden spare nib under the cap.

First use

Of course I didn’t really read the user’s guide right away, are you kidding? Instead I checked for the basics: how to create a folder, how to create a notebook, how to create a page. Then I created a few folders to organize my work (never again all my notes in order in one giant blob!) and started writing.

The writing experience really is superb, not at all the slick uncontrollable gliding all over the surface that so many tablets provide.

For whatever reason, I found that my grip on the stylus caused discomfort and hand cramping after a time. That’s no longer true after a week; my hand must have naturally adjusted to a proper position.

I found that the pencil without tilt and a medium tip, on the narrow-lined template, works very well for my style of note-taking. My style is basically “fit in as much as you can on one page, writing all the way up to the margins, and with bullets or indentation for lists, while keeping it legible.”

That last part about keeping it legible is a lie; often in my paper notebooks I would refer back to some scribbled note later only to discover that I had no idea what it said. I write just a bit slower on the tablet, whether due to the stylus-screen interaction or because I’m kinder to expensive electronic devices than paper. In either case, my notes remain legible when I go back to review them, and I find I cheerfully erase if the letters turn into a blob from hurried scribbling.

I did not make a cloud account, since I don’t really want to be dependent on a third party cloud to store and retrieve my backups.

Discoveries

In page overview mode, the top part of the page is not visible, and my handwriting is much too small and uniform to be a guide to which content is where. I needed headers, a few inches down in the page. The brush, with a medium tip, was perfect for this.

I discovered the partial and full toolbar hide buttons. Now I almost exclusively work with all toolbar buttons hidden. I’d like the page overview and move buttons to be visible in the partial toolbar, but other than that I have no complaints.

I discovered that if you are at the last page of a notebook, pressing the right button on the bottom of the tablet to take you to the (non-existent) next page creates a new page for you, saving the trouble of unhiding the toolbar, pressing the new page button, and hiding it again.

I discovered that the last page you viewed is the page you see during light sleep mode. Most of the time I make sure that’s my todo list for the day.

Hardware functionality

The battery drains pretty quickly the way I use the tablet. I’ve turned off WiFi, since I don’t need it on for regular use, and that helps.

After a week, the stylus nib has “mushroomed”, i.e. the tip has squashed flat and the edges hang over the sides like a mushroom cap. It doesn’t (yet) seem to affect ease and feel of writing on the screen, so I’ll keep using the nib as is for now.

During the first several days I found I needed to press harder on the stylus than I would with pencil on paper. That too seems to have improved over time; whether wearing down the nib or unconscious adjustment of the hand position made the difference, writing now takes a relatively light touch, without the slipping and sliding on smooth glass so common on most tablets.

The previous and next buttons don’t always work; sometimes I have to press very firmly and deliberately in the center of the button after the first slapdash push fails to elicit a response. Dunno if that’s hardware or software but I can shrug it off for now.

Workflow

I have a todo list for each day; I have two notebooks per month of these, one for the first half and one for the second half. (The reason I don’t keep one per month is that I’ve read that at around 30 pages, page turning gets slow. This could be true only for pdfs; I haven’t tested it yet.)

If I need notes from some todo item that I know are temporary, I’ll create a page after the day’s todo list and work there. When I’m done with the work, I’ll either summarize it and put it in a separate file in the right folder, or more often, delete the scratch page(s).

For entertainment I follow political developments, and these notes go in pages with separate headings in the QuickSheets file. At the end of a day or a few days, when speculation about a pending event has resolved, I delete a bunch of stuff and write a summary of the event, deleting any now-unneeded pages.

Notes for work that I know I’ll need to reference again get put in a notebook in a subfolder somewhere under my work folder. We’ll see how effective this is after 6 months or a year; a week is not nearly enough time to see how the retrieval system will hold up.

My paper notebook is sitting on the shelf and I’ve not touched it at all save to copy over those unit testing notes into the tablet.

Customization

I used to run a bleeding edge kernel on my laptop, with a custom build. I used to maintain my own xterminal key mappings for my editor. I used to customize anything and everything. Years go by and one gets bored of constant tweaking, so I had planned not to mess with anything that couldn’t be done in the tablet configuration settings via the UI.

Heh. The best-laid plans, etc.

I now have a custom full sleep screen, a custom power off screen, rsync on my tablet (built via the official toolchain from a clone of the rsync repo at samba), a script to rsync all of /home/root to the laptop, and I’m working on extending it to be able to upload pdfs to the tablet from my laptop.

Oh and of course I don’t use the annoying generated password to ssh into the tablet; I have a public key over there, which makes the rsync script nicer.

I have been looking at available templates shared by tablet users and thinking about what might be handy to have. No new templates uploaded yet, however.

I plan a script that will copy back in all custom files after any software update, runnable from the home directory. A one line command after each firmware update is pretty painless, and I’m happy to live with that for the huge gain in functionality with rsync on the device, and the smile that seeing one of my photos as the sleep image brings.

Proposed improvements

Copy-paste of a selected area from one page to another would be a huge win. I often want this when cleaning up temporary notes and distilling out of them the few pieces of information I want to save permanently.

Move and Page Overview icons on the partial menu view, so I don’t have to open the full menu to get to them.

A stylus with an eraser on the back end. I might get the MobiScribe stylus (I hear it takes the reMarkable nibs) just for this reason. I like to be able to stop and erase that e that I just wrote that looks like an ink blot, without having to open up the partial menu to get to the eraser.

Cheaper or longer-lasting nibs, without sacrificing one iota of the writing experience.

A much cheaper price for the tablet when it goes truly mass market. That would be a game-changer.

A public site to submit and follow bug reports, with user comments permitted. Right now there’s the reddit group where some issues are discussed (and sometimes solved!) but it would be nice to have an official site.

A published spec for the .rm files. Folks have reverse engineered them, but a published spec means a commitment to updating that spec when and if the format of these files changes. This would be encouraging to the third party app developer community, a group of people that not only add functionality to the device, but help to publicize it as well. That’s free marketing, always a good thing!

Warning

This tablet is like paper, in that if someone has access, they can read whatever you wrote. Oh sure, I put a pin code on there, but it feels like a very flimsy chain on one’s front door that someone really determined can just shoulder their way through.

So, no sensitive data on the tablet. No notes about vendor contracts, no notes about personal matters that I wouldn’t mind being leaked to the world, etc. This is fine; for those limited instances there’s always editing files on the laptop.

Final verdict

This tablet doesn’t play video and I don’t want it to. Ditto for web surfing, reading email, having a calculator app, and so on. All those things are activities for my laptop. The tablet is for taking and reading notes. It does that very well, so far. It is pricey as a device, but if your workflow is like mine, and you are not too tight on funds, it’s worth it.

Disclaimer

I don’t have any affiliation with any company that produces any tablets or phones or any of that. I’ve never gotten a review copy of any such thing. No one paid me, gave me chocolate or did anything else for me so that I would write a positive (or negative!) review.

If you get the device after reading this review, and it doesn’t meet your expectations, I’m sorry. BUT I am also not responsible in any way. Happy writing!

Digging into Structured Data for Media on Commons

Commons: Not a tragedy

Commons. What Commons?

If you are a contributor to Wikipedia or one of the other Wikimedia projects, you probably already know. If you aren’t, even as a frequent Wikipedia reader (and aren’t we all?), you may not know that almost all of the photographs and other images in the articles you read are hosted in a media repository, commons.wikimedia.org.

These images are generally free to reuse, modify, and share, for any purpose including commercial use. You can also create an account there and upload your own photographs, as long as they have educational value.

Image from Commons

Searching for cats on Commons

Structured Data is GREAT

But let’s suppose you want to sort through the media for some reason; maybe you want to find all of the photographs of edible wild mushrooms in your region, with the name of each species. Or maybe you want cute cat pictures. You could try the search tab and hope for the best, but all of the information about an image is in a blob of unstructured text, formatted in any random way. You could look at the categories and see if you’re lucky enough that one category covers your needs, but odds are that for anything except the simplest of queries, you’ll come up empty-handed.

What about getting that information in a language other than English? Good luck with that; although the project is shared among speakers of many different languages, the predominant language of contribution and of category names is still English.

Suppose you want to monitor newly uploaded images for any of the above, via a script. How can you do that? With difficulty.

Until now.

Structured data allows contributors to add a caption or information about what is depicted, to each uploaded media file, in any language, or in multiple languages. While the file description is still an unstructured string of text per language, descriptions in multiple languages can be specified and descriptions can be extracted for a media file independently of anything else. The same holds true of captions, and more data is likely to be added in the future.

Why am I writing about this? I produce dumps, so what do I care?

Introducing Mediainfo entities

Ah ha! Someday we will be producing dumps of this data. You’ll have files that contain all Mediainfo entities, much as there are now files containing the various sorts of Wikidata entities for download every week. And since we’re going to be dumping it, we need to understand the data: what is its format? How can we retrieve it via the MediaWiki api? How can we set it?

Mediainfo entities are similar to, but different from, Wikibase entities. They are similar in that they have a special format (json) and cannot be edited directly via the wiki ‘edit source’ or ‘edit’ tabs. They both rely on the Wikibase extension as well. But Mediainfo entities are not stored in a set of separate tables, as Wikibase entities are. Instead, Mediainfo entities are stored in a secondary slot of the revision of the File page.

Wait, wut? What are slots? And what the devil is a ‘secondary slot’?

A side trip to Multi-Content Revisions

Time for a crash course in ‘Multi-Content Revisions’, also known as MCR. Until last year, MediaWiki had the following data model, as tweaked for use at Wikimedia:

  • Each article, template, user page, discussion page, and so on, is represented by a record in the ‘page’ table.
  • Each edit to a page is represented by a new record in the revision table.
  • Once a revision is added to the table, it never changes. It can be hidden from public view but never really removed.
  • A revision contains a pointer to a record in the text table.
  • The text table record contains a pointer to a record in a blobs table on one of our external storage servers.
  • The blob record contains the (usually gzipped) content of the revision as wikitext or json or css or whatever it happens to be.

So: one revision, one piece of content. And for articles (and File pages), that means one blob of wikitext with whatever formatting various editors have decided to give it.

But suppose…. just suppose that we could attach pieces of data to that page, also editable by contributors, and that could be displayed in a nice table or some other good data display format. A caption for an image, who or what’s in the image, maybe the creation date, maybe the name of the photographer, maybe EXIF data right from the image. Wouldn’t that be nice? Imagine if all of that data was available via the MediaWiki api, or easily searchable. Wouldn’t that be just grand?

That’s what Multi-Content Revisions are all about. That ‘extra data’ has to live somewhere and still be attached to a revision. So: slots. The ‘main slot’ contains wikitext or json or css or whatever the page normally has, that a contributor can edit the usual way. ‘Secondary’ slots, as many as we define, can have other data, like captions or descriptions or whatever else we decide is useful.

Now each edit to a page is represented by a new record in the revision table, but a revision contains a pointer to one or more entries in the slots table.

Each slot table entry contains a field indicating which slot it’s for (main? some other one?) and a pointer to an entry in the content table.

Each content table entry contains a pointer to a record in the text table, but at some point it will likely point directly to a blob on one of our external storage servers.

Back to Mediainfo entities

A Mediainfo entity, then, is a kind of Wikibase entity, structured data, in json format, that is stored in the ‘mediainfo’ slot for revisions of File pages on Commons.

Structured data tab for a file

This is easier to wrap one’s head around with an example.

If you look at https://commons.wikimedia.org/wiki/File:Marionina_welchi_(YPM_IZ_072302).jpeg you can see below the image that there is a ‘Structured Data’ tab. ‘Items portrayed in this file’ has a name listed.

If you look at the edit history for the file, you can see an entry with the following comment:

Created claim: depicts (d:P180): (d:Q5137114)

This means that someone entered the depiction information by clicking ‘Edit’ next to the ‘Items portrayed in this file’ message, and added the information. You can do the same, in any language, or you can modify or remove existing depiction statements.

JSON for “Depicts” data

Let’s see what the raw content behind a ‘depicts statement’ is.

We can get the content of a Mediainfo entity by providing the Mediainfo id to the MediaWiki api and doing a wbgetentities action. But first we need to get that id. How do we do that?

Here’s the trick: the Mediainfo id for a File page is ‘M’ plus the page id! So first we retrieve the page id via the api:

https://commons.wikimedia.org/w/api.php?action=query&prop=info&titles=File:Marionina_welchi_(YPM_IZ_072302).jpeg&format=json

There’s the page id: 82858744. So now we have the Mediainfo id M82858744, which we can pass to wbgetentities:

https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M82858744&format=json

And here’s the output, prettified for human readers like you and me:

{
  "entities": {
    "M82858744": {
      "pageid": 82858744,
      "ns": 6,
      "title": "File:Marionina welchi (YPM IZ 072302).jpeg",
      "lastrevid": 371114515,
      "modified": "2019-10-19T08:11:24Z",
      "type": "mediainfo",
      "id": "M82858744",
      "labels": {},
      "descriptions": {},
      "statements": {
        "P180": [
          {
            "mainsnak": {
              "snaktype": "value",
              "property": "P180",
              "hash": "bfa568e1a915cc36538364c66cbfeea50913feea",
              "datavalue": {
                "value": {
                  "entity-type": "item",
                  "numeric-id": 5137114,
                  "id": "Q5137114"
                },
                "type": "wikibase-entityid"
              }
            },
            "type": "statement",
            "id": "M82858744$21beba99-41e7-a211-66ac-1cacc78b806d",
            "rank": "preferred"
          }
        ]
      }
    }
  },
  "success": 1
}

You can see the depicts statement under ‘statements’, where ‘P180’ is the property ‘Depicts’, as seen on Wikidata. ‘Q5137114’ is Marionina welchi, as seen also on Wikidata. The P-items and Q-items are available from Wikidata by virtue of Mediainfo’s use of “Federated Wikibase”; see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/master/docs/federation.wiki for more on that.
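Strung together, those two api calls make a short helper. Here’s a sketch in the same style as the snippets from the previous post, with no error handling and no maxlag support:

```python
COMMONS_API = 'https://commons.wikimedia.org/w/api.php'

def mediainfo_id_from_query(resp_json):
    """Derive the Mediainfo id (Mxxx) from an action=query&prop=info
    response: 'M' plus the page id of the (single) page returned."""
    pages = resp_json['query']['pages']
    page = next(iter(pages.values()))
    return 'M' + str(page['pageid'])

def get_mediainfo_id(title, agent):
    """Look up a File page's id via the api and return its Mediainfo id.
    A sketch, not production code."""
    import requests  # third-party; imported here so the parser above
                     # stays usable without requests installed
    resp = requests.get(COMMONS_API,
                        params={'action': 'query', 'prop': 'info',
                                'titles': title, 'format': 'json'},
                        headers={'User-Agent': agent})
    return mediainfo_id_from_query(resp.json())
```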

Note!! Since the page id of the File page is effectively embedded in the Mediainfo text blob as stored in the database, any change to the page id can break things. See https://phabricator.wikimedia.org/T232087 for an example!

Raw content of “Depicts” statement

What is the raw content of the slot for that revision in the database, you ask? I’ve got a little script for that. (Link: https://github.com/apergos/misc-wmf-crap/blob/master/get_revision.py) Here’s what I get when I run the script:

{"type":"mediainfo","id":"M82858744","labels":[],"descriptions":[],"statements":{"P180":[{"mainsnak":{"snaktype":"value","property":"P180","hash":"bfa568e1a915cc36538364c66cbfeea50913feea","datavalue":{"value":{"entity-type":"item","numeric-id":5137114,"id":"Q5137114"},"type":"wikibase-entityid"}},"type":"statement","id":"M82858744$21beba99-41e7-a211-66ac-1cacc78b806d","rank":"preferred"}]}}

That’s right, all that mainsnak and snaktype and other stuff is right there in the raw slot content.

How statements work

A statement (in Wikidata parlance, a “claim”) is an assertion about a subject: that it has a certain property (Pnnn) with a certain value. If the value turns out to be a person or a place or something else with a Wikidata entry (Qnnn), then that id can be used in place of the text value.

Here’s an example:

Boris Johnson holds the position (P39 “position held”) of Prime Minister of the UK (Q14211). But this hasn’t always been true and it won’t be true forever. This statement therefore needs qualifiers, such as: P580 (“start time”) with value 24 July 2019.

A “snak” (chosen as the next largest data item after “bit” and “byte”) is a claim with a property and value but no qualifiers.

In our case, an image can depict (P180) some person or thing (Q-id for the person/thing if there is one, or the name otherwise). Or it can have been created by (P170) some person. Or it can have been created (P571) on a certain date. The point is that statements (claims) of any sort can be added to a Mediainfo entity. However, some care should be taken before adding statements involving properties other than P180 and P170, preferably after discussion and agreement with the community. See https://commons.wikimedia.org/wiki/Commons_talk:Structured_data/Modeling for some of the discussion around use of properties for Commons media files.
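Mirroring the raw content shown earlier, here’s a hedged sketch of how one might build such a claim as a Python dict; the hash and statement id seen above are assigned server-side, so they’re omitted:

```python
def make_depicts_statement(qid):
    """Build a minimal P180 ("depicts") claim in the shape seen in the raw
    slot content above. Server-side fields (hash, statement id) are omitted."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": "P180",
            "datavalue": {
                "value": {
                    "entity-type": "item",
                    "numeric-id": int(qid.lstrip("Q")),
                    "id": qid,
                },
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "rank": "normal",  # the example above happened to be "preferred"
    }

claim = make_depicts_statement("Q5137114")
```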

For a note on the terminology “statements” vs. “claims”, see https://phabricator.wikimedia.org/T149410.

For more about snaks and snaktypes, see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/master/docs/json.wiki#261.

Looking at Captions

Let’s find a File which has a caption. I’ve already found one so we’ll just use it as an example:

https://commons.wikimedia.org/wiki/File:Stra%C3%9Fenbahn_Haltestelle_Freizeit-_und_Erholungszentrum-3.jpg

Underneath the image, in the ‘File Information’ tab, you can see the entry ‘Captions’, and there are some! If you select ‘See 1 more language’ you can see that there are two captions, one in English and one in German.

Let’s get the Mediainfo id for that file:

https://commons.wikimedia.org/w/api.php?action=query&prop=info&titles=File:Straßenbahn_Haltestelle_Freizeit-_und_Erholungszentrum-3.jpg&format=json

Great, it’s 83198284. Let’s plug M83198284 into our wbgetentities query:

https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M83198284&format=json

{
  "entities": {
    "M83198284": {
      "pageid": 83198284,
      "ns": 6,
      "title": "File:Straßenbahn Haltestelle Freizeit- und Erholungszentrum-3.jpg",
      "lastrevid": 371115009,
      "modified": "2019-10-19T08:16:04Z",
      "type": "mediainfo",
      "id": "M83198284",
      "labels": {
        "en": {
          "language": "en",
          "value": "Tram stop in Berlin, Germany"
        },
        "de": {
          "language": "de",
          "value": "Straßenbahn Haltestelle in Berlin"
        }
      },
      "descriptions": {},
      "statements": []
    }
  },
  "success": 1
}

Captions are called ‘labels’ and are returned by language with a simple text value for each caption. What’s the raw data, you ask?

{"type":"mediainfo","id":"M83198284","labels":{"en":{"language":"en","value":"Tram stop in Berlin, Germany"},"de":{"language":"de","value":"Stra\u00dfenbahn Haltestelle in Berlin"}},"descriptions":[],"statements":[]}

Pretty much as we expect, the entity type and id are stored along with the labels, with one entry per language.
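As a sketch, here’s how one might assemble the wbgetentities URL and pluck the captions out of the response in Python. The sample response below is trimmed from the output above; actually fetching it over the network is left out:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def wbgetentities_url(mid):
    """URL for fetching a Mediainfo entity, as in the query above."""
    return API + "?" + urlencode(
        {"action": "wbgetentities", "ids": mid, "format": "json"})

def captions(response):
    """Extract {language: caption} from a wbgetentities response."""
    entity = next(iter(response["entities"].values()))
    labels = entity.get("labels") or {}  # empty labels come back as [] in raw content
    return {lang: lab["value"] for lang, lab in labels.items()}

# Sample response, trimmed from the output shown above
sample = {"entities": {"M83198284": {"id": "M83198284", "labels": {
    "en": {"language": "en", "value": "Tram stop in Berlin, Germany"},
    "de": {"language": "de", "value": "Straßenbahn Haltestelle in Berlin"}}}}}

print(captions(sample))
```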

Looking at Descriptions

Just kidding! More seriously, there’s a placeholder for descriptions, but that’s due to be removed (see https://phabricator.wikimedia.org/T213502) since captions suffice.

Looking at anything else

I’ve got a script. It’s crap because all code is crap, and all code not written for production is especially crap, and code written for personal use just to get the job done is extra especially crap. Nonetheless, I use this python script to get the Mediainfo entity for a given File on Commons: https://github.com/apergos/mw-scripts-crapola/blob/master/get_mediainfo.py and it gets the job done.

The Big Payoff: Search

Let’s find some images on Commons using this data. Go to the main page and enter “haswbstatement:P180” in the search bar to find any media with a “depicts” (P180) statement. You can check the Structured Data tab below the image of any of the files in the results, and see what’s depicted.

But that’s not all! You can specify what you want depicted: entering haswbstatement:P180=Q146 into the search bar will find any media file marked as depicting a… https://www.wikidata.org/wiki/Q146.

And that’s still not all! You can specify that you want only those pictures that have captions in English and that depict Q146, by entering hasdescription:en haswbstatement:P180=Q146. Note!! Captions used to be specified by the keyword “hascaption” but this has been changed, though you may see the old keyword referenced in older documentation or blog posts.

But there’s more! You can specify that you want pictures that depict something, created by someone else, with captions in some languages but not others, and CirrusSearch will serve that right up to you. Try it by searching for haswbstatement:P170=Q34788025 haswbstatement:P180=Q158942 hasdescription:en -hasdescription:fr and check the results.

But… you guessed it, that’s still not all. You can search for all media files that have a specified text in the caption, in addition to any other search criteria! Try it by searching for incaption:dog hasdescription:fr and check any file in the results.
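If you’d rather drive these searches from a script, the same queries can go through the API’s list=search module. A sketch; restricting to namespace 6 (File pages, as seen in the entity output earlier) is my assumption, not something from the search examples above:

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"

def commons_search(query, limit=10):
    """Build a search API URL for a CirrusSearch query against Commons."""
    return API + "?" + urlencode({
        "action": "query", "list": "search", "srsearch": query,
        "srnamespace": 6,  # File namespace
        "srlimit": limit, "format": "json"})

# Files depicting Q146 with an English caption but no French one
url = commons_search("haswbstatement:P180=Q146 hasdescription:en -hasdescription:fr")
print(url)
```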

Bonus: Editing!

I’ve got a script. It just updates captions because the format of those is the easiest, but if you look at it and the api help docs you can figure out the rest.

Script: https://github.com/apergos/misc-wmf-crap/blob/master/glyph-image-generator/set_mediainfo.py

MediaWiki api help: https://commons.wikimedia.org/w/api.php?action=help&modules=wbeditentity
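For a feel of what such an edit looks like, here’s a minimal sketch of the wbeditentity POST parameters for setting one caption. Getting an authenticated session and a real CSRF token is left out; “+\\” below is just MediaWiki’s anonymous placeholder token, not something you’d use for a real edit:

```python
import json

def caption_edit_payload(mid, language, value, token):
    """POST parameters for setting one caption (label) via wbeditentity.
    Obtaining a CSRF token from an authenticated session is not shown here."""
    data = {"labels": {language: {"language": language, "value": value}}}
    return {
        "action": "wbeditentity",
        "id": mid,
        "data": json.dumps(data),
        "token": token,
        "format": "json",
    }

payload = caption_edit_payload(
    "M83198284", "en", "Tram stop in Berlin, Germany", "+\\")
```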

Further Reading

There’s lots more to explore, so here are some links to get you started.

Credits

Xml/sql dumps and MediaWiki-Vagrant, two great tastes that taste great together?

Problems, problems, problems

Recently a colleague asked me to look over a patchset in gerrit that would add some new functionality to a set of weekly dumps, and, as is my wont, I asked if he’d tested the patch. The answer was, “Well, no, just parts of it”. When I dug into the issue a little deeper, it turned out that the reason the script hadn’t been tested was that there was no easy way to do so!

Enter MediaWiki-Vagrant. [1] This lets you set up a virtual machine on your laptop with the latest and greatest version of MediaWiki. By the simple application of puppet roles, you can add multiple wikis and your own skeletal copy of Wikidata for testing. This seemed like the perfect place to add a dumps role.

Adam Wight started working on such a role in April of 2017. [2] We’re all busy people so it took a while, but finally a few weeks ago the role was merged. It lets the user do a basic test of the xml/sql dumps against whatever’s in the master branch of MediaWiki. But it doesn’t allow my colleague to test his script changes. That, it turns out, is complicated.

So, without further ado, here is what I did in order to get tests up and running in a setup that permits xml/sql dumps to run, as well as tests of ‘miscellaneous’ dump scripts such as category dumps or the ‘Wikidata weeklies’.

MediaWiki-Vagrant on Fedora

Fedora is my distro of choice, so there was special prep work for the installation of MediaWiki-Vagrant.

1. I needed libvirt and lxc; I got these by

dnf install vagrant vagrant-libvirt vagrant-lxc vagrant-lxc-doc \
   lxc-libs lxc lxc-templates lxc-extra nfs-utils redir

2. added myself to the /etc/sudoers file:

meeee ALL=(ALL) ALL

3. edited /etc/lxc/default.conf:

#lxc.network.link = lxcbr0
lxc.network.link = virbr0

4. fixed up the firewall:

firewall-cmd --permanent --zone public --add-port=20048/udp
firewall-cmd --permanent --zone public --add-port=111/udp
firewall-cmd --permanent --zone public --add-port=2049/udp
firewall-cmd --permanent --zone public --add-service mountd
firewall-cmd --permanent --zone public --add-service rpc-bind
firewall-cmd --permanent --zone public --add-service nfs
firewall-cmd --reload

and checking that nfs was indeed in the list by:

firewall-cmd --list-all

5. set up udp for nfs v3, which vagrant uses by default but which is turned off by default in Fedora; this was done by editing /etc/sysconfig/nfs
and changing this line

RPCNFSDARGS=""

to

RPCNFSDARGS="--udp"

then restarting the service:

service nfs-server restart

Installing MediaWiki-Vagrant

This was slightly different from the instructions [3], since I’m using the lxc provider.

git clone --recursive https://gerrit.wikimedia.org/r/mediawiki/vagrant
cd vagrant
vagrant config --required (I just left the name blank at the prompt)
vagrant up --provider lxc --provision

Provisioning the Wikidata role

The Wikidata role needs some special handling. [4] But it needs even more specialness than the docs say. There’s an issue with the Wikibase extension’s composer setup that we need to work around. [5] Here’s all the steps involved.

vagrant git-update
vagrant ssh

These steps are all done from within the VM:

sudo apt-get update
sudo apt-get upgrade
composer selfupdate --update-keys (and enter the keys from https://composer.github.io/pubkeys.html)
composer config --global process-timeout 9600

Get off the vm, and then:

vagrant roles enable wikidata
vagrant provision

This last step fails badly: composer can’t find a certain class and everything breaks. Edit mediawiki/composer.local.json and add the line

"extensions/Wikibase/composer.json"

to the merge-plugin include stanza at the end of the file. Now you can rerun composer and the failed steps:

vagrant ssh
cd /vagrant/mediawiki
rm composer.lock
composer update --lock
sudo apachectl restart
sudo -u www-data sh -c 'cd /vagrant/mediawiki; /usr/local/bin/foreachwiki \
    update.php --quick --doshared'

Import some data!

At this point the installation was working but there was only the Main Page in Wikidatawiki. I needed to get some data in there.

I grabbed the first 170 or so pages from one of the wikidata dumps, put them in an xml file, added a closing </mediawiki> tag on the end, and put that in srv/wikidata_pages.xml.

Next I needed to enable entity imports, which is done by creating the file /vagrant/settings.d/wikis/wikidatawiki/settings.d/puppet-managed/10-Wikidata-entities.php with the contents:

<?php 
  $wgWBRepoSettings['allowEntityImport'] = true;

Next came the import:

cd /vagrant
cat /vagrant/srv/wikidata_pages.xml | sudo -u www-data mwscript importDump.php
    --wiki=wikidatawiki --uploads --debug --report 10

This took a lot longer than expected (30 minutes for about 170 pages) but did eventually complete without errors. Then some rebuilds:

sudo -u www-data mwscript rebuildrecentchanges.php --wiki=wikidatawiki
sudo -u www-data mwscript initSiteStats.php --wiki=wikidatawiki

Provisioning the dumps role

At last I could cherry-pick my gerrit change [6]. But because by default I’m using nfs on linux for the mount of /vagrant inside the VM, I needed to add some tweaks that let puppet create some directories in /vagrant/srv owned by the dumps user.

In /vagrant, I created the file Vagrantfile-extra.rb with the following contents:

mwv = MediaWikiVagrant::Environment.new(File.expand_path('..', __FILE__))
settings = mwv.load_settings

Vagrant.configure('2') do |config|
  if settings[:nfs_shares]
    root_share_options = { id: 'vagrant-root' }
    root_share_options[:type] = :nfs
    root_share_options[:mount_options] = ['noatime', 'rsize=32767', 'wsize=32767', 'async']
    root_share_options[:mount_options] << 'fsc' if settings[:nfs_cache]
    root_share_options[:mount_options] << 'vers=3' if settings[:nfs_force_v3]
    root_share_options[:linux__nfs_options] = ['no_root_squash', 'no_subtree_check', 'rw', 'async']
    config.nfs.map_uid = Process.uid
    config.nfs.map_gid = Process.gid
    config.vm.synced_folder '.', '/vagrant', root_share_options
  end
end

Then I needed to restart the VM so that the freshly nfs-mounted share would permit chown and chmod from within it:

vagrant halt
vagrant up --provider lxc --provision

After that, I was able to enable the dumps role:

vagrant roles enable dumps
vagrant provision

Wikidata dump scripts setup

Next I had to get all the scripts needed for testing, by doing the following:

  • copy into /usr/local/bin: dumpwikidatajson.sh, dumpwikidatardf.sh, wikidatadumps-shared.sh [7]
  • copy into /usr/local/etc: dump_functions.sh dcatconfig.json [7]
  • copy a fresh clone of operations-dumps-dcat into /usr/local/share [8]

And finally, I had to fix up a bunch of values in the dump scripts that are meant for large production wikis.
In dumpwikidatardf.sh:

shards=2
dumpNameToMinSize=(["all"]=`expr 2350 / $shards` ["truthy"]=`expr 1400 / $shards`)

in dumpwikidatajson.sh:

shards=1
if [ $fileSize -lt `expr 20 / $shards` ]; then

in wikidatadumps-shared.sh:

pagesPerBatch=10

and as root, clean up some cruft that has the wrong permissions:

rm -rf /var/cache/mediawiki/*

Running dumps!

For xml/sql dumps:

su -  dumpsgen
cd /vagrant/srv/dumps/xmldumps-backup
python worker.py --configfile /vagrant/srv/dumps/confs/wikidump.conf.dumps [name_of_wiki_here]

Some wikis available for ‘name_of_wiki_here’ are: enwiki, wikidatawiki, ruwiki, zhwiki, among others.

For wikidata json and rdf dumps:

su - dumpsgen
mkdir /vagrant/srv/dumps/output/otherdumps/wikidata/
/usr/local/bin/dumpwikidatajson.sh
/usr/local/bin/dumpwikidatardf.sh all ttl
/usr/local/bin/dumpwikidatardf.sh truthy nt

See how easy that was? 😛 😛

But seriously, folks, we are working on making testing all dumps easy, or at least easier. This brings us one step closer.

Next steps

It’s a nuisance to edit the scripts and change the number of shards and so on; these are being turned into configurable values. A special configuration file will be added to the dumps role that all ‘miscellaneous dumps’ can use for these sorts of values.

It’s annoying to have to copy in the scripts from the puppet repo before testing. We’re considering creating a separate repository operations/dumps/otherdumps which would contain all of these scripts; then a simple ‘git clone’ right from the dumps role itself would add the scripts to the VM.

There are multiple symlinks of the directory containing the php wrapper MWScript.php to different locations, because several scripts expect the layout of the mediawiki and related repos to be the way it’s set up in production. The location should be configurable in all scripts so that it can be passed in on the command line for testing, and the extra symlinks removed from the dumps role.

The composer workaround will eventually be unnecessary once Wikibase has been fixed up to work with composer the way many MediaWiki extensions do. That’s on folks’ radar already.

The xml file of pages to import into wikidata could be provided in the dumps role and entity imports configured, though the import itself might still be left for the user because it takes so long.

Once the above fixes are in, we’ll probably be starting to move to kubernetes and docker for all testing. 😀

Thanks!

Thanks to: Adam Wight for all the work on the initial dumps role, Stas Malyshev for the composer solution and for being a guinea pig, and the creators and maintainers of MediaWiki-Vagrant for making this all possible.

Footnotes

Docker and Salt Redux

Recently I was digging into salt innards again; that meant it was time to dust off the old docker salt-cluster script and shoehorn a few more features in there.

Salt up close and personal.

There are some couples that you just know ought to get themselves to a relationship counselor asap. Docker and SSHD fall smack dab into that category. [1]  When I was trying to get my base images for the various Ubuntu distros set up, I ran into issues with selinux, auditd and changed default config options for root, among others. The quickest way to deal with all these annoyances is to turn off selinux on the docker host and comment the heck out of a bunch of things in various pam configs and the sshd config.

The great thing about Docker though is that once you have your docker build files tested and have created your base images from those, starting up containers is relatively quick. If you need a configuration of several containers from different images with different things installed you can script that up and then with one command you bring up your test or development environment in almost no time.
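As a sketch of what such scripting might look like (the image names below are hypothetical placeholders for your own prebuilt base images, not anything from my repo):

```python
import subprocess

def docker_run_cmd(name, image):
    """Assemble a 'docker run' command for one container of the test cluster."""
    return ["docker", "run", "-d", "--name", name, "--hostname", name, image]

def bring_up(cmds, dry_run=True):
    """Print the commands; with dry_run=False (on a real docker host),
    actually start the containers."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

# One master plus two minions on different Ubuntu bases, up in one go
cluster = [docker_run_cmd("master", "salt-master:trusty"),
           docker_run_cmd("minion1", "salt-minion:precise"),
           docker_run_cmd("minion2", "salt-minion:lucid")]
bring_up(cluster)
```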

Using this setup, I was able to test multiple combinations of salt and master versions on Ubuntu distros, bringing them up in a minute and then throwing them away when done, with no more concern than for tossing a bunch of temp files. I was also able to model our production cluster (running lucid, precise and trusty) with the two versions of salt in play, upgrade it, and poke at salt behavior after the upgrade.

A good dev-ops is a lazy dev-ops, or maybe it’s the other way around. Anyways, I can be as lazy as the best of ’em, and so when it came to setting up and testing the stock redis returner on these various salt and ubuntu versions, that needed to be scriptified too; changing salt configs on the fly is a drag to repeat manually. Expect, ssh, cp and docker ps are your best friends for something like this. [2]

In the course of getting the redis stuff to work, I ran across some annoying salt behavior, so before you run into it too, I’ll explain it here and maybe save you some aggravation.

The procedure for setting up the redis returner included the following bit:

– update the salt master config with the redis returner details
– restart the master
– copy the update script to the minions via salt

This failed more often than not, on trusty with 2014.1.10. After these steps, the master would be seen to be running, the minions were running, a test.ping on all the minions came back showing them all responsive, and yet… no script copy.

The first and most obvious thing is that the salt master restart returns right away, but the master is not yet ready to work. It has to read keys, spawn worker threads, each of those has to load a pile of modules, etc.  On my 8-core desktop, for 25 workers this could take up to 10 seconds.

Salt works on a pub/sub model [3], using ZMQ as the underlying transport mechanism. There’s no ack from the client; if the client gets the message, it runs the job if it’s one of the targets, and returns the results. If the client happens to be disconnected, it won’t get the message. Now salt minions do reconnect if their connection goes away, but this takes time.

Salt (via ZMQ) also encrypts all messages. Upon restart, the master generates a new AES key, but the minions don’t learn about this until they receive their first message, typically with some job to run. They will try to use the key they had lying around from a minute ago to decrypt it, fail, and then be forced to re-authenticate. But this retry takes time. And while the job will eventually be run and the results sent back to the master, your waiting script may have long since given up and gone away.

With the default salt config, the minion reconnect can take up to 5 seconds. And the minion re-auth retry can take up to 60 seconds. Why so long? Because in a production setting, if you restart the master and thousands of minions all try to connect at once, the thundering herd will kill you. So the 5 seconds is an upper limit, and each minion will wait a random amount of time up to that upper limit before reconnect. Likewise the 60 seconds is an upper limit for re-authentication. [4]

This means that after a master restart, it’s best to wait at least 15 seconds before running any job, 10 for master setup and 5 for the salt minion reconnect. This ensures that the salt minion will actually receive the job. (And after a minion restart, it’s best to wait at least 5 seconds before giving it any work to do, for the same reason.)

Then be sure to run your salt command with a nice long timeout, longer than 60 seconds. This ensures that the re-auth and the job run will get done, and the results returned to the master, before your salt command times out and gives up.

Now the truly annoying bit is that, in the name of perfect forward secrecy, an admittedly worthy goal, the salt master will regenerate its key after 24 hours of use, with the default config. And that means that if you happen to run a job within a few seconds of that regen, whenever it happens, you will hit this issue. Maybe it will be a puppet run that sets a grain, or some other automated task that trips the bug. Solution? Make sure all your scripts check for job returns allowing for the possibility that the minion had to re-auth.
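A hedged sketch of what such a check might look like, driving the salt CLI from Python with a timeout past the 60-second re-auth window and a retry when nothing comes back:

```python
import subprocess
import time

def salt_cmd(target, func, timeout=90):
    """salt CLI invocation with a timeout longer than the 60s re-auth window."""
    return ["salt", "--timeout", str(timeout), target, func]

def run_with_reauth_retry(target, func, retries=2, wait=15):
    """Run a salt job; if the minions were mid-reconnect or mid-re-auth and
    nothing came back, wait and try again rather than giving up."""
    for _ in range(retries):
        result = subprocess.run(salt_cmd(target, func),
                                capture_output=True, text=True)
        if result.stdout.strip():
            return result.stdout
        time.sleep(wait)  # ~10s master setup + ~5s minion reconnect
    return None
```

Usage would be something like run_with_reauth_retry("*", "test.ping") on a host with salt installed.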

Tune in next time for more docker-salt-zmq fun!

[1] Docker ssh issues on github
[2] Redis returner config automation
[3] ZMQ pub/sub docs
[4] Minion re-auth config and Running Salt at scale

Ditching gnome 3 for kde 4

I finally made the switch. I’ve been a long-time fan of gnome, critical of kde memory bloat, and not fond of the lack of integration that has haunted kde and its apps for years. But I finally made the switch.

I have an Nvidia graphics card in this three-and-a-half-year-old laptop, on which I run the Nvidia proprietary drivers. Let’s not kid ourselves; in many cases the open source drivers aren’t up to snuff, and this card and laptop combination is one of those cases. I’m talking about regular use for watching videos, doing my development work and so on, not games, not exotic uses of blender or what have you, nothing out of the ordinary.

Gnome shell has been a memory hog since its inception, with leaks that force the shell to die a horrible death or hang in odd ways after a few days of uptime. Maybe this is caused by interaction with the Nvidia drivers, and maybe not, but it’s a drag.

Nonetheless, it was a drag I was willing to put up with, in the name of ‘use the current technologies, they’ll stabilize eventually’. No, no they won’t. With the latest upgrade to Fedora 20, I noticed a bizarre mouse pointer bug which goes something like this:

Type… typetypetype… woops mouse pointer is gone. Huh, where is it? Try alt-shift-tab to see the window switcher. Ah *whew*, I can at least switch to another window, and now the pointer is back.

Only, that alt-shift-tab trick didn’t always work the first time, and sometimes it didn’t work at all. I was forced often enough to hard power off the laptop (no alternate consoles to switch into, and the system was in hard lockup doing something disk-intensive, who knows what… maybe swapping to death).

After the last round of package updates I started seeing lockups multiple times a day. The bug reporter, on the few times gnome shell would actually segfault, refused to report the bug because it was a dupe and what was I doing using those proprietary drivers anyways.

Usability has a bunch of factors, but the most basic is the ability to use the system without lockups. So… kde 4.11. Five days later I have had no mouse pointer issues, no lockups, no OOM, no swapping. I miss my global emacs key bindings, I couldn’t get gnome terminal to work right because of the random shrinking-terminal bug, and the world clock isn’t exactly the way I’d like it, but I’ll live with that. Goodbye gnome 3; if you see gnome 4 around some day, my door will be open.