Last modified: 2014-06-27 18:02:42 UTC
On toolserver the dumps were stored as "enwiki-latest-pages-articles.xml" for example. This allowed users to hardcode the path without worrying about the date. It would be nice if there were symlinks so that we can just hardcode a path without trying to figure out which date is the latest and then using that one.
Changing this to Infrastructure, as it is something Labs-wide that I have no access to (it was previously assigned to the Bots project).
*** Bug 56093 has been marked as a duplicate of this bug. ***
Can we please get some progress on this? It shouldn't be rocket science. (A basic 20-line Python script could probably achieve the goal, if we cannot find a nicer way.)
I'm thinking this is easiest done around the dumps process itself. Ariel, thoughts?
I've looked at that recently ("it should be so easy!"), but the dumps process at present isn't as straightforward as you would think. I assumed it would be enough to replace "rsync DIR" with "rsync DIR && ln -s ...", but the reality (cf. http://git.wikimedia.org/tree/operations%2Fpuppet.git; "download::gluster" is the class that feeds /public/datasets/public) is much more complicated: a list of files (!) to sync is produced on the remote, which is rsync'ed to local and then fed to several (!) rsync workers (partitioned by file count and transfer size) that do the actual copying. This process runs continuously, so there is no obvious point to hook into. The complexity is probably due to the requirement to sync more than 4 TBytes of data :-). I *think* it would be possible to have a cron job that just sets the symlinks without upsetting rsync too much, but I'm definitely not sure about that :-).
I'm adding Platonides to the CC list and asking for their input; they run the /dumps project on the Toolserver.
The dumps script does this at dump-creation time, leaving other symlinks in the directory untouched. Someone would have to write a short script that goes through and updates the links after each rsync completes.