Last modified: 2014-06-27 18:02:42 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T47646, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 45646 - Create -latest alias for dumps
Status: NEW
Product: Wikimedia Labs
Classification: Unclassified
Component: Infrastructure
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Ariel T. Glenn
Duplicates: 56093
Depends on:
Blocks:
Reported: 2013-03-02 21:01 UTC by Kunal Mehta (Legoktm)
Modified: 2014-06-27 18:02 UTC
CC List: 8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Kunal Mehta (Legoktm) 2013-03-02 21:01:00 UTC
On Toolserver, the dumps were stored as "enwiki-latest-pages-articles.xml", for example. This allowed users to hardcode the path without worrying about the date.

It would be nice if there were symlinks so that we can just hardcode a path without trying to figure out which date is the latest and then using that one.
Comment 1 Peter Bena 2013-05-30 08:44:47 UTC
Changing to Infrastructure, as this is something Labs-wide that I have no access to (it was previously assigned to the Bots project).
Comment 2 Tim Landscheidt 2014-02-04 22:17:31 UTC
*** Bug 56093 has been marked as a duplicate of this bug. ***
Comment 3 Betacommand 2014-02-07 01:13:46 UTC
Can we please get some progress on this? It shouldn't be rocket science. (A basic 20-line Python script could probably achieve the goal, if we cannot find a nicer way.)
Comment 4 Marc A. Pelletier 2014-02-14 00:19:39 UTC
I'm thinking this is easiest done around the dumps process itself.  Ariel, thoughts?
Comment 5 Tim Landscheidt 2014-02-14 05:04:10 UTC
I've looked at that recently ("it should be so easy!"), but the dumps process at present isn't as straightforward as you would think.

I assumed it would be enough to replace "rsync DIR" with "rsync DIR && ln -s ...", but the reality (cf. http://git.wikimedia.org/tree/operations%2Fpuppet.git; "download::gluster" is the class that feeds /public/datasets/public) is much more complicated: A list of files (!) to sync is produced on the remote, that list is rsync'ed to local and then fed to several (!) rsync workers (limited by count of files and size to transfer) that do the actual copying.  This process is run continuously, so there is no obvious point to hook into.

This complexity is probably due to the requirement to sync more than 4 TBytes of data :-).  I *think* it would be possible to have a cron job that just sets the symlinks without upsetting rsync too much, but I'm definitely not sure about that :-).
Comment 6 Betacommand 2014-05-18 13:33:36 UTC
I'm adding Platonides to the CC list and asking for their input; they run the /dumps project on the Toolserver.
Comment 7 Ariel T. Glenn 2014-06-27 18:02:42 UTC
The dumps script does this at time of dump creation, leaving other symlinks in the directory untouched.  Someone would have to write a short script that goes through and updates links after each rsync completes.
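
A minimal sketch of what such a link-updating script might look like, in Python as suggested in Comment 3. It is not the dumps script's own logic. The directory layout (<wiki>/<YYYYMMDD>/<wiki>-<YYYYMMDD>-<name>), the "latest" subdirectory, and all function names below are assumptions made for illustration; only the "enwiki-latest-pages-articles.xml" naming comes from the report.

```python
import os
import re

# Assumption: each wiki directory holds one subdirectory per dump, named YYYYMMDD.
DATE_DIR = re.compile(r"^\d{8}$")


def newest_dump_dir(wiki_dir):
    """Return the name of the newest YYYYMMDD subdirectory, or None if there is none."""
    dates = [
        name for name in os.listdir(wiki_dir)
        if DATE_DIR.match(name) and os.path.isdir(os.path.join(wiki_dir, name))
    ]
    # YYYYMMDD names sort chronologically as plain strings.
    return max(dates) if dates else None


def replace_symlink(target, link_path):
    """Point link_path at target by creating a temporary link and renaming it."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Renaming over the old link avoids a window in which the alias is missing.
    os.replace(tmp, link_path)


def update_latest_links(wiki_dir):
    """Create <wiki>-latest-* symlinks under wiki_dir/latest for the newest dump."""
    wiki = os.path.basename(os.path.normpath(wiki_dir))
    date = newest_dump_dir(wiki_dir)
    if date is None:
        return []
    latest_dir = os.path.join(wiki_dir, "latest")
    os.makedirs(latest_dir, exist_ok=True)
    prefix = "%s-%s-" % (wiki, date)
    created = []
    for name in sorted(os.listdir(os.path.join(wiki_dir, date))):
        if not name.startswith(prefix):
            continue  # skip files that do not follow the assumed naming scheme
        alias = "%s-latest-%s" % (wiki, name[len(prefix):])
        # Relative target, so the links still resolve if the tree is mounted elsewhere.
        replace_symlink(os.path.join("..", date, name),
                        os.path.join(latest_dir, alias))
        created.append(alias)
    return created
```

Such a script could be run from cron for each wiki directory under /public/datasets/public, along the lines Comment 5 suggests. As written it does not check whether the newest dated directory has finished syncing, so the aliases may briefly point at incomplete files while rsync is still copying.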
