Where are those awesome little cubes of holographic data that we used to see on Star Trek, which contained seemingly endless amounts of data? While we wait for someone to get on that problem, I get to sort out mirrors and backups of media in a world where servers with large RAIDs cost a hefty chunk of change. Just a few days ago the host that serves our scaled media files was down to less than 90 GB of free space.
In theory scaled media can be regenerated from the originals at any time, but in practice we don’t have a media scaler cluster big enough to scale all media at once. This means that we need to be a bit selective in how we “garbage collect”. Typically we generate a list of media not in use on the Wikimedia projects and delete the scaled versions of those files. The situation was so bad, however, that the delete script, which sleeps between every batch of deletes, put enough pressure on the scaled media server that it became slow to respond, causing the scalers to slow down and thus affecting service for the entire site for a few minutes.
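The actual cleanup script isn’t shown here, but the throttling pattern it uses is simple enough to sketch. This is a minimal illustration, not our real code: the function name, batch size, and pause length are all hypothetical, and the delete callable is injectable so you can dry-run it.

```python
import time


def delete_in_batches(paths, delete, batch_size=100, pause=5.0, sleep=time.sleep):
    """Delete `paths` in small batches, sleeping between batches so the
    media server isn't hammered with a solid stream of deletes.

    `delete` is a callable applied to each path (e.g. os.remove);
    passing it in makes the throttling logic easy to test.
    """
    deleted = 0
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            delete(path)
            deleted += 1
        if start + batch_size < len(paths):
            sleep(pause)  # throttle: give the server room to breathe
    return deleted
```

As the post describes, even a script like this can cause trouble when the server is already under load; the sleep bounds the *rate* of deletes, not the cost of any individual one.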
The solution to this turned out to be to remove the scaled media server completely from the equation, and rely entirely on the new distributed media backend, OpenStack’s Swift. Whew! But we are really only putting off a discussion that needs to happen soonish: how do we keep our pile of scaled media from expanding at crazy rates?
Consider that we will generate a thumbnail at any size requested, on demand if it doesn’t exist already, and these files are never deleted until the next time we run some sort of cleanup script. With Swift it’s going to be easy to forget that we have limited disk storage and that scaled media are really just long-lived temporary files. Should we limit thumb generation to specific sizes only (which could be pregenerated rather than produced on the fly)? Should we generate anything requested but toss non-standard sizes every day? Should we toss less frequently used thumbs (and how would we know which ones those are) on a daily or weekly basis?
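To make the “toss non-standard sizes” option concrete, here is a rough sketch of what a daily sweep might look for. The width set is purely illustrative, not Wikimedia’s actual list, and the name pattern assumes MediaWiki-style thumbnail names such as `640px-Foo.jpg`:

```python
import re

# Illustrative "standard" widths; the real whitelist would be a policy decision.
STANDARD_WIDTHS = {120, 320, 640, 800, 1024}

# MediaWiki-style thumb names encode the width as a prefix, e.g. "640px-Foo.jpg".
THUMB_RE = re.compile(r"^(\d+)px-")


def nonstandard_thumbs(names):
    """Yield thumbnail file names whose width is not in the standard set,
    i.e. candidates for a daily cleanup sweep."""
    for name in names:
        m = THUMB_RE.match(name)
        if m and int(m.group(1)) not in STANDARD_WIDTHS:
            yield name
```

The “less frequently used” variant would need access data instead of just names, which is exactly the hard part the question raises: Swift doesn’t hand you a read-frequency log for free.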