Last modified: 2013-06-10 23:06:13 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the pages displaying bug reports and their history, links might be broken. See T50338, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48338 - Show grid status in Ganglia
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-05-10 23:20 UTC by Tim Landscheidt
Modified: 2013-06-10 23:06 UTC (History)

Description Tim Landscheidt 2013-05-10 23:20:10 UTC
Addshore has made a nice graph at http://tools.wmflabs.org/addshore/toolslab/ that shows the number of jobs running on the grid.  It would be preferable to have this properly integrated in Ganglia.

The "official" contrib repo has SGE functionality at https://github.com/ganglia/gmetric/tree/master/hpc/sge_jobs (cf. also http://comments.gmane.org/gmane.comp.monitoring.ganglia.general/1920).

If I understand the Puppet structure correctly, both sge_jobs.sh and jobqueue_report.php need to be puppetized on the Ganglia side, but I'm confused by ganglia and ganglia_new.
Comment 1 Tim Landscheidt 2013-05-11 02:10:58 UTC
1. Apparently, the Puppet modules are structured the other way round: A module typically has a ::monitoring class that adds the gathering thingy to the node.  At Tools, the proper class would probably be gridengine::master::monitoring to be deployed exactly once per SGE cluster.

2. No report has to be defined on the Ganglia side at all.  If one feeds it data, it will make sense of it on its own.

3. As a test, I have set up ~scfc/bin/sge_jobs.pl to be run on tools-login every fifteen minutes.  It gathers information on pending, running and error jobs, and submits it to Ganglia.  The graphs can be found at http://ganglia.wmflabs.org -> tools -> tools-login -> sge_pending/sge_running/sge_error.  I intend to leave it running for a few days before puppetizing.
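
For illustration, a job like the one described in point 3 could be sketched roughly as follows. This is not the actual ~scfc/bin/sge_jobs.pl (that script is Perl and is not shown in this report); it is a minimal Python sketch that assumes the default column layout of `qstat -u '*'` (state in the fifth column) and a simplified mapping of state codes to the three metrics. The function names are hypothetical; the metric names are the ones mentioned above.

```python
import subprocess


def count_job_states(qstat_output):
    """Count pending, running and error jobs in the text output of qstat."""
    counts = {"sge_pending": 0, "sge_running": 0, "sge_error": 0}
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the header,
        # the dashed separator and blank lines.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[4]
        # Assumed, simplified mapping: any "E" means error (e.g. "Eqw"),
        # "r" or "t" means running/transferring, "qw" means pending.
        if "E" in state:
            counts["sge_error"] += 1
        elif "r" in state or "t" in state:
            counts["sge_running"] += 1
        elif "qw" in state:
            counts["sge_pending"] += 1
    return counts


def gmetric_command(name, value):
    """Build the gmetric call that submits one data point to Ganglia."""
    return ["gmetric", "--name", name, "--value", str(value),
            "--type", "uint32", "--units", "jobs"]


def report():
    """Query the grid once and submit the three counters."""
    output = subprocess.run(["qstat", "-u", "*"], capture_output=True,
                            text=True, check=True).stdout
    for name, value in count_job_states(output).items():
        subprocess.run(gmetric_command(name, value), check=True)
```

Calling report() from cron at the desired interval would then submit one data point per metric on each run.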
Comment 2 Addshore 2013-05-11 09:02:25 UTC
Looks lovely, I am guessing it is not exactly an intensive task so could we have it running minutely?
Comment 3 Tim Landscheidt 2013-05-11 14:55:57 UTC
(In reply to comment #2)
> Looks lovely, I am guessing it is not exactly an intensive task so could we
> have it running minutely?

On the grid side, yes, but I don't know how much load this causes on Ganglia (three data points per minute -> 180 per hour -> etc., graph generation may take longer, etc.), so before "turning it to 11" I'd prefer Ryan's okay.
Comment 4 Addshore 2013-05-12 01:50:21 UTC
Compared to the amount of data Ganglia already receives about Labs, I don't think it will have much effect :)
Comment 5 Ryan Lane 2013-05-14 03:46:41 UTC
This is cool. I have a feeling you can run it more often than that without much impact. A ton of data is already sent to the servers.
Comment 6 Tim Landscheidt 2013-05-14 16:09:46 UTC
(In reply to comment #5)
> This is cool. I have a feeling you can run it more often than that without
> much impact. A ton of data is already sent to the servers.

Okay, increased it to an update every minute.
Comment 7 Gerrit Notification Bot 2013-05-19 04:02:17 UTC
Related URL: https://gerrit.wikimedia.org/r/64511 (Gerrit Change I48a65620d2fa5ee0fa3d147f9157af60c44c31c3)
Comment 8 Tim Landscheidt 2013-06-10 23:06:13 UTC
Gerrit change #64511 (and the fix in Gerrit change #67899) got merged, so the status is now available at <http://ganglia.wmflabs.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=tools&h=tools-master&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4#mg_SGE_div>.

I've removed my cron job on tools-login.
