Last modified: 2013-09-04 11:51:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T32257, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 30257 - DatabaseMysql.php should have debug output for getLag()
DatabaseMysql.php should have debug output for getLag()
Status: NEW
Product: MediaWiki
Classification: Unclassified
Database (Other open bugs)
1.17.x
All All
: Low normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-08-06 01:18 UTC by nmaul
Modified: 2013-09-04 11:51 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description nmaul 2011-08-06 01:18:47 UTC
This function does a 'SHOW PROCESSLIST' on the current connection. It then looks for all lines for the user 'system user', and throws away a selection of them that don't match the intended criteria. Once all extraneous rows are eliminated, it returns the time on the first remaining row (probably only one).

The problem with this is that the current connection may or may not have the PROCESS permission on the database server. This can lead to no matching rows, and ultimately the function returns 'false'. That in itself is the proper behavior (it should return *something* indicating a failure, and false seems as good as anything). However, it logs no debugging information as to what happened.

This percolates up to LoadMonitor.php (and ultimately LoadBalancer.php), which detects whether a slave is lagged or not. 'False' values will result in the wiki going into read-only mode and never recovering, with zero information as to why this is. Even wgDebugLogFile gives no information as to why Mediawiki thinks the slaves are all lagged.

The proper fix IMO is to add logging in 2 places, reword logs in a 3rd place, and update the overall setup documentation:

DatabaseMysql.php: should call wfDebug if no suitable 'system user' row was found, possibly hinting about the PROCESS permission flag.

LoadMonitor.php: should indicate (wfDebug) that this db server is unsuitable because it failed outright (as opposed to actually being lagged). This should probably be a loop through the $times array, somewhere near the end of the getLagTimes function. it could be at the end of the existing $serverIndexes foreach() loop to avoid adding an extra loop.

LoadBalancer.php: should be changed from 'Server #$i is not replicating" to "Server #$i replication status undeterminable, removing from pool" or similar. That is, it shouldn't jump to conclusions, and it should explicitly state that the server is being removed from the list.

LoadBalancer.php: might be worth considering renaming the 'wgReadOnly' status set by the getReaderIndex() function, in the big do..while loop. Possibly something to point administrators at wgDebugLogFile for more info? Not sure how much we want to expose to end users. This only makes sense if some of the above changes are incorporated... otherwise the debug log file isn't all that helpful.


Finally, the install/update documentation should be updated to reflect that the PROCESS privilege is needed on multi-database systems for detection of slave lag.

I don't know when or why this broke, but I can see that a good bit of things were shuffled around in 1.16->1.17, and our 1.16.5 install on MySQL 5.0.77 works properly.

1.16 on MySQL 5.1 works. Strangely, as far as I can tell it appears to do the same basic thing, but the 'Server X is not replicating' error is never logged, the wiki doesn't go into read-only mode, and the slave server get queries normally.
Comment 1 Sam Reed (reedy) 2011-08-06 01:21:16 UTC
Note in 1.18 this was changed to be able to use SHOW SLAVE STATUS for newer versions of MySQL

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/db/DatabaseMysql.php?annotate=93628#l373

But was commented out due to it causing errors for other people
Comment 2 nmaul 2011-08-06 01:35:02 UTC
Neither is a particular guaranteed method for measuring lag. "Seconds behind master" is generally okay, but it actually measure the time between the IO thread and the SQL thread on the slave. If the IO thread is behind the master, this measurement won't get notice. That's kinda rare, but more common in high latency or lossy networks.

The most guaranteed method I know of is to make a change on the master and time how long it takes to appear on the slave. Maatkit has a good general test of this that works with all types of replication (even offline mysqldump-restore style "replication"): http://www.maatkit.org/doc/mk-heartbeat.html.

The basic strategy is to update a row on the master, with the current timestamp. Then on the slave, check the current system time and compare with the latest timestamp in the row. The difference is the replication lag. This relies on having synchronized clocks, obviously.

A similar mechanism could probably be adopted. Besides the obvious win of truly accurate timing, it also requires no special permissions, as both processlist and 'show slave status' would need to have.
Comment 3 Sam Reed (reedy) 2011-08-06 01:43:26 UTC
Please feel free to supply patches :)
Comment 4 nmaul 2011-08-15 20:47:45 UTC
(In reply to comment #3)
> Please feel free to supply patches :)

Will do. Personal life getting in the way lately. :)
Comment 5 Mark A. Hershberger 2011-08-16 19:15:20 UTC
Please don't reset priorities to "unprioritized".  I've changed it to "low" priority.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links