We’ve been generating bundles of media in use on the various Wikimedia projects, so that readers or editors of these projects can download the media with just a few clicks. This approach is great for the downloader but takes more space than we would like, since files hosted on Commons and used on multiple projects are stored multiple times. We were hoping to be clever about this by pulling out the files stored in multiple places and bundling those up separately for download. The first step was to generate a list of the projects that share the largest number of files with some other project. The results were discouraging.
Below is a list of the projects with the most media in use. Each line gives the project, the number of media files it uses in parentheses, and then the number of files it has in common with each of the other projects shown, in descending order.
enwiki(2237560): 519734|dewiki 480943|frwiki 393602|eswiki 354075|itwiki 352064|ruwiki 304120|plwiki
dewiki(1426181): 519734|enwiki 318361|frwiki 246937|itwiki 236249|ruwiki 223472|plwiki 222664|eswiki
frwiki(1046546): 480943|enwiki 318361|dewiki 255563|eswiki 250681|itwiki 219759|ruwiki 201138|plwiki
ruwiki(649728): 352064|enwiki 236249|dewiki 219759|frwiki 187919|eswiki 185788|itwiki 173388|plwiki
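A listing like the one above can be produced by intersecting the sets of media file names used by each pair of projects. The following is only a minimal sketch of that idea, not the script we actually ran: the function names `pairwise_overlaps` and `format_line` and the toy file names are made up for illustration, and real input would be the per-project media lists rather than a hand-written dictionary.

```python
from itertools import combinations


def pairwise_overlaps(media_by_project):
    """Map each project to a list of (shared_count, other_project),
    sorted so the largest overlap comes first."""
    overlaps = {name: [] for name in media_by_project}
    for a, b in combinations(media_by_project, 2):
        # Files used by both projects
        shared = len(media_by_project[a] & media_by_project[b])
        overlaps[a].append((shared, b))
        overlaps[b].append((shared, a))
    for name in overlaps:
        overlaps[name].sort(reverse=True)
    return overlaps


def format_line(name, media_by_project, overlaps):
    """Render one line in the 'project(total): count|other ...' layout."""
    pairs = " ".join(f"{count}|{other}" for count, other in overlaps[name])
    return f"{name}({len(media_by_project[name])}): {pairs}"


# Toy data (hypothetical file names, not real usage figures)
toy = {
    "enwiki": {"A.jpg", "B.jpg", "C.jpg", "D.jpg"},
    "dewiki": {"A.jpg", "B.jpg", "C.jpg"},
    "frwiki": {"A.jpg", "E.jpg"},
}
overlaps = pairwise_overlaps(toy)
print(format_line("enwiki", toy, overlaps))  # enwiki(4): 3|dewiki 1|frwiki
```

Holding every project's file list in memory as a set is fine for a sketch; with millions of file names per project, something like sorted lists merged on disk might be more practical.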
Eliminating all of the duplication between just the first few top projects would entail the creation of multiple separate files for download, making things significantly less convenient for the downloader without the space gains to justify it.
For just the top five projects by media usage, only 66979 media files are common to all of them, a pittance. But even if we took the roughly 500 thousand files in use on both dewiki and enwiki and put them in a separate bundle, with a separate bundle for the rest of enwiki and a separate one for the rest of dewiki, that’s still not much of a gain compared to the nearly 6 million unique media files total in use.
So for now we’ll just keep the media bundles per project as they are. If anyone has any bright space-saving ideas, please chime in with a comment.
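To put rough numbers on that, here is a back-of-the-envelope calculation. It is only a sketch under stated assumptions: it counts files rather than bytes, it uses 6,000,000 as a round stand-in for "nearly 6 million" unique files, and the helper `percent_of_total` is made up for illustration.

```python
# Figures taken from the text above
SHARED_EN_DE = 519_734      # files in use on both enwiki and dewiki
SHARED_TOP_FIVE = 66_979    # files common to all of the top five projects

# Assumption: round stand-in for "nearly 6 million" unique media files
TOTAL_UNIQUE = 6_000_000


def percent_of_total(count, total=TOTAL_UNIQUE):
    """Share of the unique media files that `count` represents, in percent."""
    return 100 * count / total


# Splitting out the enwiki/dewiki overlap saves one stored copy of each shared file.
print(f"enwiki/dewiki overlap: {percent_of_total(SHARED_EN_DE):.1f}%")    # 8.7%
print(f"common to top five:    {percent_of_total(SHARED_TOP_FIVE):.1f}%")  # 1.1%
```

Since the bundles as a whole store more file copies than there are unique files, the saving as a fraction of everything stored would be smaller still.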