The run of the hourly script for 2014-07-28 05:00 failed with:

    tar: /var/lib/wikimetrics/public/69987: file changed as we read it
    Error: Either failed to get lock on /data/project/wikimetrics/backup/wikimetrics1/hourly, or tar-ing failed.

I checked the locks, and they had been properly cleaned up, so it seems the only issue was that the file was written to while we were tarring it up. Since we expect more writing over time, should we guard against this happening again?
It happened again for the 2014-07-28 14:00 run:

    tar: /var/lib/wikimetrics/public/69989: file changed as we read it
    tar: /var/lib/wikimetrics/public/69987: file changed as we read it
    Error: Either failed to get lock on /data/project/wikimetrics/backup/wikimetrics1/hourly, or tar-ing failed.

While the bug is of course valid as is, I'll stop reporting further instances for now, as wikimetrics1 is having more severe issues (bug 68743).
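As a possible guard: GNU tar distinguishes "file changed as we read it" (exit status 1) from hard failures (exit status 2), so the tar step could treat status 1 as a soft failure and retry a bounded number of times. A minimal bash sketch of that idea follows; the paths and archive name are illustrative, and this is not the actual hourly script:

    # Illustrative sketch only, not the real hourly script.
    SOURCE=/var/lib/wikimetrics/public
    ARCHIVE=/data/project/wikimetrics/backup/wikimetrics1/hourly/files.tar.gz  # hypothetical name

    status=1
    for attempt in 1 2 3; do
        tar czf "$ARCHIVE" -C "$(dirname "$SOURCE")" "$(basename "$SOURCE")"
        status=$?
        [ "$status" -eq 0 ] && break           # archive written cleanly
        [ "$status" -eq 1 ] || exit "$status"  # anything but "files changed" is a hard failure
        sleep 10                               # files changed mid-read; wait and retry
    done
    [ "$status" -eq 0 ] || exit "$status"      # still dirty after all retries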
Collaboratively tasked on Etherpad: http://etherpad.wikimedia.org/p/analytics-68731
Change 153388 had a related patch set uploaded by QChris: Reschedule backups to not interfere with queue runs so easily https://gerrit.wikimedia.org/r/153388
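The rescheduling idea amounts to shifting the backup away from the times when queue runs kick off. A hypothetical crontab illustration; the minute offset and script path here are made up, and the real schedule lives in the change above:

    # Illustrative only; see the Gerrit change for the actual schedule.
    # Running at minute 40 keeps the backup clear of jobs that start at :00.
    40 * * * * wikimetrics /usr/local/bin/wikimetrics-hourly-backup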
Change 153395 had a related patch set uploaded by QChris: Force redis dump before backing up https://gerrit.wikimedia.org/r/153395
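For context, forcing a Redis dump before the tar step can be done by triggering BGSAVE and waiting for LASTSAVE to advance, so the dump.rdb that tar picks up is complete. A bash sketch of that general pattern, not necessarily how the patch above does it:

    # Illustrative sketch only; see the Gerrit change for the actual approach.
    # Trigger a background dump and block until Redis reports it has finished.
    last_save=$(redis-cli LASTSAVE)
    redis-cli BGSAVE
    while [ "$(redis-cli LASTSAVE)" -eq "$last_save" ]; do
        sleep 1   # LASTSAVE advances once the background dump completes
    done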
Change 153568 had a related patch set uploaded by QChris: Make hourly backup keep around known-good full backups in case of issues https://gerrit.wikimedia.org/r/153568
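Keeping known-good full backups around can be sketched as: write the new archive under a temporary name, and only rotate it into place once tar has succeeded, so a failed run never clobbers the last good backup. A hypothetical bash sketch; file names are illustrative, and the actual logic is in the change above:

    # Illustrative only: never overwrite the last known-good archive until
    # the new one has been written successfully.
    BACKUP_DIR=/data/project/wikimetrics/backup/wikimetrics1/hourly
    NEW="$BACKUP_DIR/hourly.tar.gz.tmp"

    if tar czf "$NEW" -C /var/lib/wikimetrics public; then
        # Keep the previous archive as a known-good fallback, then promote
        # the fresh one.
        [ -f "$BACKUP_DIR/hourly.tar.gz" ] && \
            mv "$BACKUP_DIR/hourly.tar.gz" "$BACKUP_DIR/hourly.tar.gz.known-good"
        mv "$NEW" "$BACKUP_DIR/hourly.tar.gz"
    else
        rm -f "$NEW"   # discard the partial archive; the old backup survives
        exit 1
    fi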
Change 153388 merged by Ottomata: Reschedule backups to not interfere with queue runs so easily https://gerrit.wikimedia.org/r/153388
Change 153568 merged by Ottomata: Make hourly backup keep around known-good full backups in case of issues https://gerrit.wikimedia.org/r/153568
Change 153395 merged by Ottomata: Force redis dump before backing up https://gerrit.wikimedia.org/r/153395
Tested thoroughly on dev, but this of course needs baking time in prod. I wish we had a "READY_TO_DEPLOY" status; that is how bugs should be left at the end of a sprint.