Problems, problems, problems
Recently a colleague asked me to look over a patchset in gerrit that would add some new functionality to a set of weekly dumps, and, as is my wont, I asked if he’d tested the patch. The answer was, “Well, no, just parts of it”. When I dug into the issue a little deeper, it turned out that the reason the script hadn’t been tested is that there was no easy way to do so!
Enter MediaWiki-Vagrant. [1] This lets you set up a virtual machine on your laptop with the latest and greatest version of MediaWiki. By the simple application of puppet roles, you can add multiple wikis and your own skeletal copy of Wikidata for testing. This seemed like the perfect place to add a dumps role.
Adam Wight started working on such a role in April of 2017. [2] We’re all busy people so it took a while, but finally a few weeks ago the role was merged. It lets the user do a basic test of the xml/sql dumps against whatever’s in the master branch of MediaWiki. But it doesn’t allow my colleague to test his script changes. That, it turns out, is complicated.
So, without further ado, here is what I did in order to get tests up and running in a setup that permits xml/sql dumps to run, as well as tests of ‘miscellaneous’ dump scripts such as category dumps or the ‘Wikidata weeklies’.
MediaWiki-Vagrant on Fedora
Fedora is my distro of choice, so there was special prep work for the installation of MediaWiki-Vagrant.
1. I needed libvirt and lxc support for Vagrant; I installed the packages with:

```
dnf install vagrant vagrant-libvirt vagrant-lxc vagrant-lxc-doc lxc-libs lxc lxc-templates lxc-extra nfs-utils redir
```
2. added myself to /etc/sudoers:

```
meeee ALL=(ALL) ALL
```
3. edited /etc/lxc/default.conf to use the libvirt bridge:

```
#lxc.network.link = lxcbr0
lxc.network.link = virbr0
```
4. fixed up the firewall:

```
firewall-cmd --permanent --zone public --add-port=20048/udp
firewall-cmd --permanent --zone public --add-port=111/udp
firewall-cmd --permanent --zone public --add-port=2049/udp
firewall-cmd --permanent --zone public --add-service mountd
firewall-cmd --permanent --zone public --add-service rpc-bind
firewall-cmd --permanent --zone public --add-service nfs
firewall-cmd --reload
```

and checked that nfs was indeed in the list:

```
firewall-cmd --list-all
```
5. set up UDP for NFS v3 (the Vagrant default, but turned off by default in Fedora) by editing /etc/sysconfig/nfs and changing this line

```
RPCNFSDARGS=""
```

to

```
RPCNFSDARGS="--udp"
```

then restarting the service:

```
service nfs-server restart
```
Installing MediaWiki-Vagrant
This was slightly different than the instructions [3], since I’m using the lxc provider.
```
git clone --recursive https://gerrit.wikimedia.org/r/mediawiki/vagrant
cd vagrant
vagrant config --required   # I just left the name blank at the prompt
vagrant up --provider lxc --provision
```
Provisioning the Wikidata role
The Wikidata role needs some special handling. [4] But it needs even more specialness than the docs say. There’s an issue with the Wikibase extension’s composer setup that we need to work around. [5] Here’s all the steps involved.
```
vagrant git-update
vagrant ssh
```
These steps are all done from within the VM:
```
sudo apt-get update
sudo apt-get upgrade
composer selfupdate --update-keys   # enter the keys from https://composer.github.io/pubkeys.html
composer config --global process-timeout 9600
```
Get off the VM, and then:
```
vagrant roles enable wikidata
vagrant provision
```
This last step fails badly: composer can't find a certain class, and everything breaks. Edit mediawiki/composer.local.json and add the line

```
"extensions/Wikibase/composer.json"
```

to the merge-plugin include stanza at the end of the file. Now you can rerun composer and the failed steps:
```
vagrant ssh
cd /vagrant/mediawiki
rm composer.lock
composer update --lock
sudo apachectl restart
sudo -u www-data sh -c 'cd /vagrant/mediawiki; /usr/local/bin/foreachwiki update.php --quick --doshared'
```
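For reference, a minimal composer.local.json with that entry in place might look like the sketch below. Your file may already contain other entries in the include list; only the Wikibase line is the addition for this workaround.

```json
{
    "extra": {
        "merge-plugin": {
            "include": [
                "extensions/Wikibase/composer.json"
            ]
        }
    }
}
```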
Import some data!
At this point the installation was working but there was only the Main Page in Wikidatawiki. I needed to get some data in there.
I grabbed the first 170 or so pages from one of the wikidata dumps, put them in an xml file, added a closing </mediawiki> tag on the end, and put that in srv/wikidata_pages.xml.
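Grabbing those first pages by hand is fiddly, so here's a rough helper along the lines of what I did. The function name, the awk approach, and the example filenames are mine for illustration, not part of any dumps tooling, and it assumes a decompressed dump with each page tag on its own line.

```shell
#!/bin/bash
# extract_pages: keep the dump header (everything before the first <page>)
# plus the first N <page>...</page> blocks, then re-close the root element
# so the fragment is valid input for importDump.php. Sketch only; assumes
# <page> and </page> tags each sit on their own line.
extract_pages() {
    local n="$1" dump="$2" out="$3"
    awk -v n="$n" '
        /<page>/   { pages++ }
        pages <= n { print }          # header lines and pages 1..n
        /<\/page>/ && pages == n { exit }
    ' "$dump" > "$out"
    printf '</mediawiki>\n' >> "$out"
}

# Example (filenames illustrative):
# extract_pages 170 wikidatawiki-pages-articles.xml /vagrant/srv/wikidata_pages.xml
```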
Next I needed to enable entity imports, which is done by creating the file /vagrant/settings.d/wikis/wikidatawiki/settings.d/puppet-managed/10-Wikidata-entities.php with the contents:

```
<?php
$wgWBRepoSettings['allowEntityImport'] = true;
```
Next came the import:
```
cd /vagrant
cat /vagrant/srv/wikidata_pages.xml | sudo -u www-data mwscript importDump.php --wiki=wikidatawiki --uploads --debug --report 10
```
This took a lot longer than expected (30 minutes for about 170 pages) but did eventually complete without errors. Then some rebuilds:
```
sudo -u www-data mwscript rebuildrecentchanges.php --wiki=wikidatawiki
sudo -u www-data mwscript initSiteStats.php --wiki=wikidatawiki
```
Provisioning the dumps role
At last I could cherry-pick my gerrit change. [6] But because by default I'm using NFS on Linux for the mount of /vagrant inside the VM, I needed to add some tweaks so that puppet can create some directories in /vagrant/srv owned by the dumps user.
In /vagrant, I created the file Vagrantfile-extra.rb with the following contents:

```
mwv = MediaWikiVagrant::Environment.new(File.expand_path('..', __FILE__))
settings = mwv.load_settings

Vagrant.configure('2') do |config|
  if settings[:nfs_shares]
    root_share_options = { id: 'vagrant-root' }
    root_share_options[:type] = :nfs
    root_share_options[:mount_options] = ['noatime', 'rsize=32767', 'wsize=32767', 'async']
    root_share_options[:mount_options] << 'fsc' if settings[:nfs_cache]
    root_share_options[:mount_options] << 'vers=3' if settings[:nfs_force_v3]
    root_share_options[:linux__nfs_options] = ['no_root_squash', 'no_subtree_check', 'rw', 'async']
    config.nfs.map_uid = Process.uid
    config.nfs.map_gid = Process.gid
    config.vm.synced_folder '.', '/vagrant', root_share_options
  end
end
```
Then I needed to restart the VM so that the freshly nfs-mounted share would permit chown and chmod from within it:
```
vagrant halt
vagrant up --provider lxc --provision
```
After that, I was able to enable the dumps role:
```
vagrant roles enable dumps
vagrant provision
```
Wikidata dump scripts setup
Next I had to get all the scripts needed for testing, by doing the following:

- copy dumpwikidatajson.sh, dumpwikidatardf.sh, and wikidatadumps-shared.sh into /usr/local/bin [7]
- copy dump_functions.sh and dcatconfig.json into /usr/local/etc [7]
- copy a fresh clone of operations-dumps-dcat into /usr/local/share [8]
And finally, I had to fix up a bunch of values in the dump scripts that are meant for large production wikis.
In dumpwikidatardf.sh:

```
shards=2
dumpNameToMinSize=(["all"]=`expr 2350 / $shards` ["truthy"]=`expr 1400 / $shards`)
```

in dumpwikidatajson.sh:

```
shards=1
if [ $fileSize -lt `expr 20 / $shards` ]; then
```

in wikidatadumps-shared.sh:

```
pagesPerBatch=10
```
and as root, clean up some cruft that has the wrong permissions:

```
rm -rf /var/cache/mediawiki/*
```
Running dumps!
For xml/sql dumps:
```
su - dumpsgen
cd /vagrant/srv/dumps/xmldumps-backup
python worker.py --configfile /vagrant/srv/dumps/confs/wikidump.conf.dumps [name_of_wiki_here]
```
Some wikis available for ‘name_of_wiki_here’ are: enwiki, wikidatawiki, ruwiki, zhwiki, among others.
For wikidata json and rdf dumps:
```
su - dumpsgen
mkdir /vagrant/srv/dumps/output/otherdumps/wikidata/
/usr/local/bin/dumpwikidatajson.sh
/usr/local/bin/dumpwikidatardf.sh all ttl
/usr/local/bin/dumpwikidatardf.sh truthy nt
```
See how easy that was? 😛
But seriously, folks, we are working on making testing all dumps easy, or at least easier. This brings us one step closer.
Next steps
It’s a nuisance to edit the scripts and change the number of shards and so on; these are being turned into configurable values. A special configuration file will be added to the dumps role that all ‘miscellaneous dumps’ can use for these sorts of values.
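Purely as an illustration of the direction (none of these file, section, or key names exist yet; they are invented here), such a shared miscellaneous-dumps config might look something like:

```
# hypothetical /vagrant/srv/dumps/confs/miscdumps.conf -- invented example
[wikidata]
shards=1
pagesPerBatch=10
```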
It’s annoying to have to copy in the scripts from the puppet repo before testing. We’re considering creating a separate repository operations/dumps/otherdumps which would contain all of these scripts; then a simple ‘git clone’ right from the dumps role itself would add the scripts to the VM.
There are multiple symlinks of the directory containing the php wrapper MWScript.php to different locations, because several scripts expect the layout of the mediawiki and related repos to be the way it's set up in production. The location should be configurable in all scripts so that it can be passed in on the command line for testing, and the extra symlinks removed from the dumps role.
The composer workaround will eventually be unnecessary once Wikibase has been fixed up to work with composer the way many MediaWiki extensions do. That’s on folks’ radar already.
The xml file of pages to import into wikidata could be provided in the dumps role and entity imports configured, though the import itself might still be left for the user because it takes so long.
Once the above fixes are in, we’ll probably be starting to move to kubernetes and docker for all testing. 😀
Thanks!
Thanks to: Adam Wight for all the work on the initial dumps role, Stas Malyshev for the composer solution and for being a guinea pig, and the creators and maintainers of MediaWiki-Vagrant for making this all possible.