Last modified: 2014-09-01 01:18:06 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T70349, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68349 - populateBacklinkNamespace script causing massive slave lag on beta
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta) (Other open bugs)
Version: unspecified
Hardware: All All
Importance: High critical
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 68373 (view as bug list)
Depends on:
Blocks:
Reported: 2014-07-21 22:41 UTC by Greg Grossmeier
Modified: 2014-09-01 01:18 UTC (History)
17 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
log of update.php for the beta cluster simplewiki (65.39 KB, text/plain)
2014-07-22 14:59 UTC, Antoine "hashar" Musso (WMF)
Details

Description Greg Grossmeier 2014-07-21 22:41:52 UTC
See https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/2741/console

Ones completed so far (timestamp is "Elapsed time"):

00:00:17.131 deployment-bastion-eqiad,enwikinews completed with result SUCCESS
00:00:17.131 deployment-bastion-eqiad,enwikiquote completed with result SUCCESS
01:02:40.299 deployment-bastion-eqiad,eswiki completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,enwikibooks completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,ee_prototypewiki completed with result SUCCESS
01:02:40.305 deployment-bastion-eqiad,testwiki completed with result SUCCESS
01:02:40.306 deployment-bastion-eqiad,eowiki completed with result SUCCESS

In other words, just doing eswiki took almost an hour.

We can't have the Beta Cluster throwing database locked errors for the entire day.
Comment 1 Bawolff (Brian Wolff) 2014-07-21 22:46:21 UTC
To clarify: is this every time or just a specific update?

If the schema is already up to date, it should finish within seconds (especially if the --quick option is present to skip the 5 second delay)
Comment 2 John F. Lewis 2014-07-21 22:49:01 UTC
The previous job was aborted by hashar, and prior to that it failed on enwiki. So, for reference, the past 2 runs failed.
Comment 3 Greg Grossmeier 2014-07-21 22:50:46 UTC
To be explicit: this causes browser tests to fail because the database is in read-only mode (ie: no edits can be made).
Comment 4 Greg Grossmeier 2014-07-21 22:52:04 UTC
(In reply to Bawolff (Brian Wolff) from comment #1)
> To clarify is this every time or just a specific update?
> 
> If the schema is already up to date, it should finish within seconds
> (especially if the --quick option is present to skip the 5 second delay)

It normally completes fine/quickly, eg:
https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/2730/console was 32 seconds.
Comment 5 Bawolff (Brian Wolff) 2014-07-21 22:55:21 UTC
For reference, according to https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=eswiki/lastBuild/console

the step taking a lot of time is:
00:00:17.120 Updating *_from_namespace fields in links tables.


Which is to be expected when you're updating that huge a table. (update from b8c038f6784ef0820)

------

Also there's a 3 second jump at

00:00:13.669 ...afl_namespace in table abuse_filter_log already modified by patch /mnt/srv/scap-stage-dir/php-master/extensions/AbuseFilter/db_patches/patch-afl-namespace_int.sql.
00:00:16.808 ...user_daily_contribs table already exists.

which is longer than I would expect (but not an issue)
Comment 6 Bawolff (Brian Wolff) 2014-07-21 23:00:47 UTC
> 
> We can't have the Beta Cluster throwing database locked errors for the entire
> day.

At first glance, I don't see any reason why this update should lock the database.
Comment 7 Bawolff (Brian Wolff) 2014-07-21 23:27:17 UTC
(In reply to Greg Grossmeier from comment #3)
> To be explicit: this causes browser tests to fail because the database is in
> read-only mode (ie: no edits can be made).

Just to write down what was said in irc.

update.php caused massive slave lag (91 minutes currently - http://en.wikipedia.beta.wmflabs.org/w/index.php?maxlag=-1 ), triggering MediaWiki to automatically put the database into read-only mode. The wfWaitForSlaves() call in the update script is ineffective here because it only waits for at most 10 seconds.
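
For anyone following along, here is a rough sketch of the batching pattern described above, pieced together from this thread (this is not the actual populateBacklinkNamespace.php code; $maxPageId is a placeholder and only pagelinks is shown):

$dbw = wfGetDB( DB_MASTER );
$batchSize = 200; // the batch size this bug identifies as too large
for ( $blockStart = 1; $blockStart <= $maxPageId; $blockStart += $batchSize ) {
	$blockEnd = $blockStart + $batchSize - 1;
	// one SELECT per batch for the pages in range...
	$res = $dbw->select( 'page', array( 'page_id', 'page_namespace' ),
		array( "page_id BETWEEN $blockStart AND $blockEnd" ), __METHOD__ );
	foreach ( $res as $row ) {
		// ...then one UPDATE per page per links table (templatelinks and
		// imagelinks follow the same pattern), i.e. ~600 writes per batch
		$dbw->update( 'pagelinks',
			array( 'pl_from_namespace' => $row->page_namespace ),
			array( 'pl_from' => $row->page_id ), __METHOD__ );
	}
	wfWaitForSlaves(); // bounded wait (roughly 10 seconds), so lag keeps growing
}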
Comment 8 James Forrester 2014-07-22 00:30:35 UTC
Assuming it's running them in order (which seems likely), as of this comment it's taken 106 minutes to do 27k of BL-Wikidata's ~30k pages, so roughly an hour to do 15k pages; pages to come:

ruwiki		?k
metawiki	1k
simplewiki	232k
zhwiki		?k
hewiki		3k
enwikiversity	0k
enwiktionary	2k
commonswiki	28k
ukwiki		?k
en_rtlwiki	?k
sqwiki		?k
fawiki		?k
enwikisource	0k
kowiki		?k
dewiki		2k
jawiki		?k
labswiki	?k
arwiki		?k
cawiki		?k
hiwiki		?k
aawiki		?k
loginwiki	0k
enwiki		26k

=> ~300k pages to go (assuming the wikis I couldn't reach due to service timeouts – "?k" – are roughly 1k pages each), which will take a further 21 hours to complete.

At this point I'd suggest that we drop the valueless simplewiki clone (232k pages on a test wiki is insane) and call it the best of a bad job.
Comment 9 Bawolff (Brian Wolff) 2014-07-22 00:41:12 UTC
So right now the update script does a lot of little queries (batch size of 200; each batch involves one select query for all the page_ids in range, plus update queries for each page id, so in total each batch has 1 select and ~600 update queries).

Perhaps it would be more efficient to do something like
UPDATE pagelinks, page SET pagelinks.pl_from_namespace = page.page_namespace
WHERE pagelinks.pl_from = page.page_id AND pagelinks.pl_from BETWEEN $blockStart AND $blockEnd;

to get rid of the overhead of so many small queries? I don't really know.

I guess at the very least the batch size should be much smaller. I also wonder if something is perhaps wrong with deployment-db2; it has no entry in the wmflabs ganglia.
Comment 10 Gerrit Notification Bot 2014-07-22 00:45:53 UTC
Change 148296 had a related patch set uploaded by Brian Wolff:
Reduce batch size of populateBacklinkNamespace from 200 to 20

https://gerrit.wikimedia.org/r/148296
Comment 11 Bawolff (Brian Wolff) 2014-07-22 00:53:36 UTC
(In reply to Gerrit Notification Bot from comment #10)
> Change 148296 had a related patch set uploaded by Brian Wolff:
> Reduce batch size of populateBacklinkNamespace from 200 to 20
> 
> https://gerrit.wikimedia.org/r/148296

This is only somewhat related to the bug (as in, it would have helped if it had been there from the get-go, but probably won't help much now unless we restart the update script).

------

If it's really important that beta "work" for people, one possibility as a temporary hack would be to add something like the following to beta's config file (after db-labs.php is loaded):

if ( !$wgCommandLineMode ) {
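 // web requests only; maintenance scripts still see the lagged slave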
 unset( $wgLBFactoryConf['sectionLoads']['DEFAULT']['deployment-db2'] );
}

This would make the web interface ignore the lagged slave (the update script will still wait in 10-second intervals for it). Things would be editable again; load on the master db for labs would increase by quite a bit (but it's beta, what's the worst-case scenario here?). [You should probably run this idea by someone else before actually doing it.]
Comment 12 Antoine "hashar" Musso (WMF) 2014-07-22 14:56:56 UTC
*** Bug 68373 has been marked as a duplicate of this bug. ***
Comment 13 Antoine "hashar" Musso (WMF) 2014-07-22 14:59:51 UTC
Created attachment 15999 [details]
log of update.php for the beta cluster simplewiki

Aaron might be interested.

On beta simple wiki (which has roughly 250k pages), the console run is https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/label=deployment-bastion-eqiad,wikidb=simplewiki/2742/console (attached to bug report)
Comment 14 Antoine "hashar" Musso (WMF) 2014-07-22 15:00:16 UTC
Aaron Schulz might be interested in this bug report.
Comment 15 Bawolff (Brian Wolff) 2014-07-22 19:55:04 UTC
FWIW, the update.php job finished successfully. lag on deployment-db2 seems to be holding at about 3 hours and 20 minutes for now. Things will probably be back to normal in several hours.
Comment 16 Bawolff (Brian Wolff) 2014-07-23 00:24:10 UTC
Slave lag is back down to 0. Guess this is fixed.
Comment 17 Greg Grossmeier 2014-07-23 00:56:20 UTC
I want to leave this open until we've figured out if we can prevent this from happening again.
Comment 18 Bawolff (Brian Wolff) 2014-07-23 01:40:40 UTC
(In reply to Greg Grossmeier from comment #17)
> I want to leave this open until we've figured out if we can prevent this
> from happening again.

Well, the update is done. The update only gets run once, so it won't happen again on beta wiki unless someone manually runs populateBacklinkNamespace.php --force, or deletes the relevant entry in the updatelog table.

The deeper issue, of course, is that the population script had too big a batch size. If you want to see whether that is fixed, I guess it might make sense to remove the line from updatelog on beta after merging the change from comment 10 and see if the update still explodes.
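
For reference, a minimal sketch of what "remove the line from updatelog" could look like (the key name below is a guess, not verified; check the table's actual contents before deleting anything):

$dbw = wfGetDB( DB_MASTER );
// 'populate pl_from_namespace' is an assumed ul_key value for this update
$dbw->delete( 'updatelog', array( 'ul_key' => 'populate pl_from_namespace' ), __METHOD__ );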
Comment 19 Antoine "hashar" Musso (WMF) 2014-07-23 06:41:52 UTC
I am also wondering how we are going to handle that update in production.  Might end up taking a long time as well.
Comment 20 Bawolff (Brian Wolff) 2014-07-23 07:01:58 UTC
(In reply to Antoine "hashar" Musso from comment #19)
> I am also wondering how we are going to handle that update in production. 
> Might end up taking a long time as well.

Towards the end, beta was updating about 200 rows every 30 seconds. enwiki's page_ids go up to 43371588, which gives ((30/200)*43371588)/(60*60*24) = 75.2

So 75 days to update enwiki (assuming similar performance, which is questionable. enwiki has much more powerful db so can probably do the update faster. OTOH, it should probably have a much smaller batchsize, which could potentially slow down the update. So who knows). Anyways, taking that very rough guess at face value, if the update takes 2.5 months, I don't see any problem. There's no deadline for when the update has to finish by.
Comment 21 Tim Landscheidt 2014-07-23 13:21:35 UTC
(In reply to Bawolff (Brian Wolff) from comment #20)
> (In reply to Antoine "hashar" Musso from comment #19)
> > I am also wondering how we are going to handle that update in production. 
> > Might end up taking a long time as well.

> Towards the end, beta was updating about 200 rows every 30 seconds. enwiki's
> page_ids go up to 43371588, which gives ((30/200)*43371588)/(60*60*24) = 75.2

> So 75 days to update enwiki (assuming similar performance, which is
> questionable. enwiki has much more powerful db so can probably do the update
> faster. OTOH, it should probably have a much smaller batchsize, which could
> potentially slow down the update. So who knows). Anyways, taking that very
> rough guess at face value, if the update takes 2.5 months, I don't see any
> problem. There's no deadline for when the update has to finish by.

I'm no DBA, but running three UPDATEs for every page row doesn't sound like the brightest idea.  I'm pretty sure MariaDB has much nicer performance if you speak to it in SQL like you proposed in comment #9.
Comment 22 Greg Grossmeier 2014-07-23 15:29:56 UTC
(In reply to Antoine "hashar" Musso from comment #19)
> I am also wondering how we are going to handle that update in production. 
> Might end up taking a long time as well.

It already happened in production, which is the only reason it was merged to begin with.

Remember, folks: if your code goes to production and you want to make a database change, file a Schema Change bug and have our DBA (Sean) take care of it BEFORE you merge. Aaron did that right.
Comment 23 Antoine "hashar" Musso (WMF) 2014-07-23 15:32:45 UTC
Excellent!  So there is nothing to talk about anymore =)   Beta is happy, slave lag is back to 0 seconds.

Topic closed.
Comment 24 Kevin Israel (PleaseStand) 2014-07-31 03:12:16 UTC
(In reply to Greg Grossmeier from comment #22)
> It already happened in production. Which is the only reason why it was
> merged to begin with.

I'm not sure, but in production, this update may still be in progress. The only entry in <https://wikitech.wikimedia.org/wiki/Server_Admin_Log> I see regarding this is under July 30: "21:04 AaronSchulz: Started populateBacklinkNamespace.php on wikidata and commons".

The schema change done before the change was merged was to add some new columns and set them to a default value of 0. The update referred to in this report ("populateBacklinkNamespace script") would happen afterward, setting the correct values for those columns. That's why a $wgUseLinkNamespaceDBFields setting was added. It is not currently enabled in production.
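
For illustration, enabling it later would presumably be a one-line configuration change (a sketch, assuming it is a simple boolean flag):

$wgUseLinkNamespaceDBFields = true; // not enabled in production as of this comment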
Comment 25 Gerrit Notification Bot 2014-08-15 16:57:31 UTC
Change 148296 abandoned by Brian Wolff:
Reduce batch size of populateBacklinkNamespace from 200 to 20

Reason:
Gerrit change #151027 addresses the same issue but probably more robustly

https://gerrit.wikimedia.org/r/148296
Comment 26 physikerwelt 2014-09-01 01:18:06 UTC
Is there an option to skip this update in update.php?
I tried mwscript update.php --quick --nopurge --skip-compat-checks to run the updates that follow the step
"Updating *_from_namespace fields in links tables."
But nothing helped.
