Last modified: 2014-08-26 17:48:43 UTC

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T50694, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48694 - Show replication lags in Ganglia
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Depends on:
Blocks: labs-replication
Reported: 2013-05-21 21:09 UTC by Tim Landscheidt
Modified: 2014-08-26 17:48 UTC (History)


Description Tim Landscheidt 2013-05-21 21:09:24 UTC
The replication lags of the database servers should be shown in Ganglia (cf. http://toolserver.org/~bryan/stats/replag/ for the Toolserver counterpart).
Comment 1 Tim Landscheidt 2013-05-22 17:42:39 UTC
As a test, I have set up ~scfc/bin/replagstats to run every minute.  The statistics are available at http://ganglia.wmflabs.org/ -> tools -> tools-login -> "Replication Lags metrics".
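For illustration only, here is a minimal sketch of how a script like this could hand one lag value to Ganglia through the gmetric command-line tool. The actual contents of replagstats are not shown in this report, so the function name `gmetric_argv`, the metric name `replag_s1` and the group "Replication Lags" are assumptions, and the lag value is passed in rather than measured.

```python
import shlex


def gmetric_argv(name, lag_seconds, group="Replication Lags"):
    """Build a gmetric command line that reports one lag value.

    The metric name and the group are hypothetical; only the gmetric
    options themselves (--name, --value, --type, --units, --group)
    come from the gmetric tool.
    """
    if lag_seconds < 0:
        raise ValueError("lag must not be negative")
    return [
        "gmetric",
        "--name", name,
        "--value", str(int(lag_seconds)),  # whole seconds are enough here
        "--type", "uint32",
        "--units", "seconds",
        "--group", group,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(shlex.quote(arg) for arg in gmetric_argv("replag_s1", 42)))
```

Run once a minute, a command built this way would add one data point per database to the graphs.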
Comment 3 Marc A. Pelletier 2013-11-22 18:59:45 UTC
Looks like it's working fine to me.
Comment 4 Tim Landscheidt 2014-01-12 23:11:12 UTC
(In reply to comment #3)
> Looks like it's working fine to me.

No, as discussed on IRC, it's still running under my personal account.

As it would be useful to show replication lag for every MariaDB slave, I wanted to discuss this as a wider change with Asher.  But:

a) the chance never came about, and
b) it's already there!  For db1035, go to http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1035.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 and search for "mysql_slave_lag".

However, this isn't available for labsdb* yet (cf. http://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=labsdb1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2). At the moment it can't be enabled anyway, as the monitoring for db1035 et al. assumes that only /one/ MariaDB instance runs on any server, while on labsdb* there are several, so mysql_slave_lag & Co. need to be prefixed by, for example, "s1_".
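To make the prefixing concrete, a small sketch follows. It is not taken from the existing monitoring code; the helper `instance_metric` and the assumption that instances are named "s1", "s2" and so on are mine.

```python
import re

# Assumption: instances are identified by names such as "s1", "s2", ...
_INSTANCE_NAME = re.compile(r"s[0-9]+")


def instance_metric(instance, metric="mysql_slave_lag"):
    """Return the metric name prefixed with its instance, e.g. "s1_mysql_slave_lag"."""
    if not _INSTANCE_NAME.fullmatch(instance):
        raise ValueError("unexpected instance name: %r" % (instance,))
    return "%s_%s" % (instance, metric)


if __name__ == "__main__":
    for instance in ("s1", "s2", "s3"):
        print(instance_metric(instance))
```

With names built like this, several instances on one server no longer overwrite each other's values.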

So to resolve this bug, we need to:

a) refactor the monitoring bits and pieces so that they handle multiple instances on one server,
b) enable such monitoring for labsdb*, and
c) create a ganglia::view where *_mysql_slave_lag for labsdb* is combined in one report so that the information isn't scattered over three pages and literally hundreds of graphs.
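As a rough sketch of step c), the following picks the *_mysql_slave_lag metrics of the labsdb* hosts out of a list of (host, metric) pairs so that they can be shown together. This is plain Python and not the ganglia::view definition itself; the function name `lag_metrics_by_host` and the sample data are assumptions.

```python
import re

# Matches per-instance lag metrics such as "s1_mysql_slave_lag".
LAG_METRIC = re.compile(r"s[0-9]+_mysql_slave_lag")


def lag_metrics_by_host(metrics):
    """Group per-instance lag metrics of labsdb* hosts by host.

    `metrics` is an iterable of (host, metric_name) pairs; everything that
    is not a labsdb* host or not a prefixed lag metric is ignored.
    """
    report = {}
    for host, name in metrics:
        if host.startswith("labsdb") and LAG_METRIC.fullmatch(name):
            report.setdefault(host, []).append(name)
    return {host: sorted(names) for host, names in sorted(report.items())}


if __name__ == "__main__":
    sample = [  # made-up sample data
        ("labsdb1001.eqiad.wmnet", "s2_mysql_slave_lag"),
        ("labsdb1001.eqiad.wmnet", "s1_mysql_slave_lag"),
        ("labsdb1001.eqiad.wmnet", "cpu_user"),
        ("db1035.eqiad.wmnet", "mysql_slave_lag"),
    ]
    print(lag_metrics_by_host(sample))
```

The resulting mapping is what a combined report would iterate over instead of three host pages.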
Comment 5 Tim Landscheidt 2014-05-06 16:42:46 UTC
(I moved ~scfc/bin/replagstats to ~tools.admin/bin/, rewrote it from a cron job to a continuous job and started it with jstart.)
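For readers unfamiliar with the difference, a continuous job keeps running and does its own pacing instead of being started once a minute by cron. The loop below is a minimal sketch of that pattern, not the actual script; `run_continuously`, `collect` and the 60-second interval are assumptions (the interval mirrors the "every minute" of the earlier cron setup).

```python
import time


def run_continuously(collect, interval=60.0, iterations=None,
                     sleep=time.sleep, clock=time.monotonic):
    """Call collect() repeatedly, aiming for one call per `interval` seconds.

    `iterations=None` means run forever; `sleep` and `clock` can be
    replaced to try the loop out without waiting.
    """
    done = 0
    while iterations is None or done < iterations:
        started = clock()
        collect()
        done += 1
        # Sleep only for the rest of the interval so that slow
        # collections do not push every later run back.
        sleep(max(0.0, interval - (clock() - started)))


if __name__ == "__main__":
    run_continuously(lambda: print("collecting replication lags"),
                     interval=1.0, iterations=3)
```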
Comment 6 Tim Landscheidt 2014-05-06 20:38:13 UTC
(I needed to group the statistics at Ganglia under the virtual host "tools-replags".)
