Last modified: 2011-02-02 19:47:19 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T29113, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 27113 - be able to restart history dump after breakage, from where it was interrupted


Summary:	be able to restart history dump after breakage, from where it was interrupted

Status:	NEW

Product:	Datasets
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal enhancement (vote)
Target Milestone:	---
Assigned To:	Ariel T. Glenn

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	27110

Reported:	2011-02-02 19:47 UTC by Ariel T. Glenn
Modified:	2011-02-02 19:47 UTC (History)
CC List:	1 user (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Ariel T. Glenn 2011-02-02 19:47:19 UTC

Dumping the page-meta-history file is the phase that takes the longest. When some external factor causes the dumps to fail (a code push that breaks them, network/db/power/space/other issues), they currently must be restarted from the beginning. Even when they complete in 2 weeks instead of 6 weeks, the odds of something going wrong in that time are quite high. Being able to restart from the point of interruption would mean being able to produce them on a reasonable schedule.

Code available: find the last page ID in the file from an interrupted run (works only for bz2 files), by seeking to the end and walking through the compressed blocks.
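
For illustration only, the idea of seeking to the end and walking back through the compressed blocks might look roughly like the sketch below. This is not the existing code mentioned above. It assumes that block starts can be found by searching for the bzip2 block magic at every bit alignment, that a block can be decoded on its own when prefixed with a "BZh9" stream header, and that a page header looks like <page>, <title>, optional <ns>, <id>. The names find_last_page_id, find_block_starts and tail_bytes are hypothetical.

```python
import bz2
import os
import re

BLOCK_MAGIC = bytes.fromhex("314159265359")  # starts every bzip2 block
# Assumed layout of a page header in the dump XML.
PAGE_ID_RE = re.compile(
    rb"<page>\s*<title>[^<]*</title>\s*(?:<ns>-?\d+</ns>\s*)?<id>(\d+)</id>"
)


def bit_shifted_views(tail):
    """Yield (shift, view), where view is tail re-read starting shift bits in."""
    yield 0, tail
    if len(tail) < 2:
        return
    n = int.from_bytes(tail, "big")
    keep = len(tail) - 1
    mask = (1 << (8 * keep)) - 1
    for shift in range(1, 8):
        # Drop the first `shift` bits and the trailing partial byte.
        yield shift, ((n >> (8 - shift)) & mask).to_bytes(keep, "big")


def find_block_starts(tail):
    """Return byte-aligned copies of the data from each block start, last block first."""
    found = []
    for shift, view in bit_shifted_views(tail):
        pos = view.find(BLOCK_MAGIC)
        while pos != -1:
            found.append((8 * pos + shift, view, pos))
            pos = view.find(BLOCK_MAGIC, pos + 1)
    found.sort(key=lambda item: item[0], reverse=True)
    return [view[pos:] for _, view, pos in found]


def find_last_page_id(path, tail_bytes=4 * 1024 * 1024):
    """Return the last page ID visible in the complete blocks near the end, or None."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    for block_data in find_block_starts(tail):
        try:
            # A truncated final block simply yields no output here.
            text = bz2.BZ2Decompressor().decompress(b"BZh9" + block_data)
        except OSError:
            continue  # chance match of the magic inside compressed data
        ids = PAGE_ID_RE.findall(text)
        if ids:
            return int(ids[-1])
    return None
```

Each step back re-decodes everything from that block to the end of the tail, which keeps the sketch short at the cost of some repeated work. If no page header shows up within tail_bytes (a very long page history, say), the function returns None and the caller would need a larger tail.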

Code needed: stream this file to a filter which writes out the MediaWiki header, writes every page up to but excluding the page with the last pageID, and then writes the MediaWiki footer; this output can be piped to bzip2 to produce an intact bzip2 file.
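
A minimal sketch of such a filter, assuming the decompressed dump arrives line by line with <page>, </page> and </mediawiki> each on their own line, and that the first <id> after <page> is the page ID. The function name truncate_before_page and its parameters are hypothetical.

```python
import io
import re

ID_LINE_RE = re.compile(r"^\s*<id>(\d+)</id>\s*$")


def truncate_before_page(lines, stop_page_id, out):
    """Copy the header and every page before stop_page_id, then write the footer."""
    pending = []     # lines of the current page seen before its page ID
    in_page = False  # currently between <page> and </page>
    id_seen = False  # page ID of the current page has been checked
    for line in lines:
        stripped = line.strip()
        if not in_page:
            if stripped == "<page>":
                in_page, id_seen, pending = True, False, [line]
            elif stripped == "</mediawiki>":
                break
            else:
                out.write(line)  # header lines (<mediawiki>, <siteinfo>, ...)
            continue
        if not id_seen:
            pending.append(line)
            match = ID_LINE_RE.match(line)
            if match:
                if int(match.group(1)) == stop_page_id:
                    break  # drop this (possibly truncated) page and stop
                id_seen = True
                out.writelines(pending)
                pending = []
            continue
        out.write(line)  # rest of the page; revision <id> lines are not checked
        if stripped == "</page>":
            in_page = False
    out.write("</mediawiki>\n")
```

Wrapped in a small script that passes sys.stdin and sys.stdout, this could sit between bzcat of the interrupted file and bzip2 in a shell pipeline. Only the few lines before each page ID are buffered, so long page histories still stream through.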

We can then run from that pageID to the end, take the two bzip2 files, recombine them and be done.
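
The report does not say how the two files would be recombined. One possible approach, sketched here under the assumption that both parts are complete dumps with their own header and footer, is to stream both through a recompression, dropping the footer of the first part and the header of the second. The function name recombine and the path arguments are hypothetical.

```python
import bz2


def recombine(first_path, second_path, out_path):
    """Merge two complete bzip2-compressed dump parts into one valid dump."""
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        # First part: keep the header and pages, drop the closing footer.
        with bz2.open(first_path, "rt", encoding="utf-8") as first:
            for line in first:
                if line.strip() == "</mediawiki>":
                    break
                out.write(line)
        # Second part: skip its header, keep pages and the footer.
        with bz2.open(second_path, "rt", encoding="utf-8") as second:
            in_body = False
            for line in second:
                if not in_body and line.strip() == "<page>":
                    in_body = True
                if in_body:
                    out.write(line)
```

This recompresses everything, which is slow for a full history dump; it trades speed for an output that is a single well-formed XML document in a single bzip2 stream.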

Why can't we just find the truncated bzip2 block, toss it, and start from there? Because bzip2 writes a cumulative CRC at the end of the file, which means rereading all the text as soon as we want to add blocks at the end.
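
To make the CRC dependency concrete, here is a small sketch based on our reading of the bzip2 reference implementation: each block stores a CRC of its uncompressed data, and the value at the end of the stream is built by folding every block's CRC, in order, into a running value. The function names are hypothetical, and this only illustrates the arithmetic; it does not write bzip2 files.

```python
def bzip2_block_crc(data):
    """CRC-32 as used per block by bzip2 (MSB-first, polynomial 0x04C11DB7)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF


def fold_stream_crc(block_crcs):
    """Combine block CRCs into the stream CRC: rotate left by one bit, then XOR."""
    combined = 0
    for block_crc in block_crcs:
        combined = ((combined << 1) | (combined >> 31)) & 0xFFFFFFFF
        combined ^= block_crc
    return combined
```

Because the fold is order-dependent and covers every block, the trailer of an extended file cannot be written without a value for each block that came before the new ones.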
