Last modified: 2013-06-25 15:00:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T50696, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48696 - Make qacct usable
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: High enhancement
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-05-21 21:39 UTC by Tim Landscheidt
Modified: 2013-06-25 15:00 UTC (History)

Description Tim Landscheidt 2013-05-21 21:39:33 UTC
qacct is probably the weapon of choice to determine resource consumption by a job as it determines it in the same way as the grid (hah!).  It would be very useful to make at least its output for single jobs ("qacct -j JOBID") available.
Comment 1 Peter Bena 2013-06-08 10:18:57 UTC
Coren, is there anything preventing this from happening?
Comment 2 Tim Landscheidt 2013-06-19 18:31:20 UTC
After looking at http://arc.liv.ac.uk/SGE/howto/nfsreduce.html, it seems possible to distribute the master's accounting file to the individual hosts and let qacct use these locally.  Coren said on IRC that ostensibly the accounting file does not contain private information (format is very simple and line-based; cf. /sge/GE/default/common/accounting on Toolserver).

AFAICS, qacct is only really useful on hosts that can also submit jobs, so it would probably make sense to hinge the distribution on gridengine::submit_host, as a cron job that calls rsync every x minutes (at the very least once per hour).

Someone needs to figure out what the correct command line is, and after deployment, we need to document that job information might take x minutes to show up in qacct.
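
Since the accounting file is line-based, records for a single job could also be picked out of a local copy directly. The following is a minimal sketch, assuming the colon-separated layout described in accounting(5) with the job number in the sixth field (worth checking against the installed Grid Engine version). The function name accounting_lines_for_job is made up for illustration.

```shell
# Sketch only: print the raw accounting records for one job.
# Assumption: fields are colon-separated and field 6 is the job number,
# as described in accounting(5). Lines starting with "#" are skipped.
accounting_lines_for_job() {
    acct_file=$1
    job_id=$2
    awk -F: -v id="$job_id" '$0 !~ /^#/ && $6 == id' "$acct_file"
}

# Hypothetical usage on a submit host:
#   accounting_lines_for_job /var/lib/gridengine/default/common/accounting 12345
```

This only shows that a local copy is enough to look a job up; qacct itself would still be the tool for properly formatted output.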
Comment 3 Peter Bena 2013-06-20 12:37:26 UTC
eh... and why not allow people to just ssh to tools-master and query it directly?
Comment 4 Tim Landscheidt 2013-06-24 12:59:47 UTC
After some further brainstorming, directly rsyncing from tools-master to the hosts would make things a bit more complicated as essentially we would need to allow root ssh between hosts, which is a bit scary.  But Coren in another context reminded me that we have /data/project/.system, so we could "cp -f /var/lib/gridengine/default/common/accounting /data/project/.system/accounting.tmp && mv -f /data/project/.system/accounting.tmp /data/project/.system/accounting" on tools-master and "cp /data/project/.system/accounting /var/lib/gridengine/default/common/accounting.tmp && mv -f /var/lib/gridengine/default/common/accounting.tmp /var/lib/gridengine/default/common/accounting" on the submit hosts (".tmp" for atomicity).
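
The copy-then-rename idea above can be sketched as a small shell function. This is only an illustration and not necessarily what the merged change does. The name publish_atomically is hypothetical, and the sketch assumes the temporary file sits in the same directory (and so on the same filesystem) as the destination, which is what lets mv rename the file instead of copying it.

```shell
# Sketch only: copy a file to "<destination>.tmp" and rename it into place,
# so readers never see a half-written destination file.
# Assumption: "<destination>.tmp" is on the same filesystem as the destination.
publish_atomically() {
    src=$1
    dst=$2
    tmp="$dst.tmp"
    cp -f "$src" "$tmp" && mv -f "$tmp" "$dst"
}

# Hypothetical usage, with the paths mentioned in this comment:
#   on tools-master:
#     publish_atomically /var/lib/gridengine/default/common/accounting \
#                        /data/project/.system/accounting
#   on a submit host:
#     publish_atomically /data/project/.system/accounting \
#                        /var/lib/gridengine/default/common/accounting
```

If the cp fails, the mv is not attempted and the previous destination file is left untouched.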

Exporting the accounting file with NFS from tools-master could introduce some nasty locks, and aliasing "qacct" to "qacct -f /data/project/.system/accounting" in /etc/profile could cause problems when someone doesn't use an interactive shell to call qacct.
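
One conceivable alternative to the alias, sketched here under assumptions and not taken from the merged change, is a small wrapper that always passes the shared file to qacct via -f. Put into an executable script on PATH rather than into /etc/profile, it would not depend on the shell being interactive. QACCT_BIN, ACCOUNTING_FILE and qacct_shared are hypothetical names introduced for this example.

```shell
# Sketch only: call qacct against the shared accounting file.
# Assumptions: ACCOUNTING_FILE and QACCT_BIN are illustrative variables;
# the default path is the one proposed in this comment.
ACCOUNTING_FILE=${ACCOUNTING_FILE:-/data/project/.system/accounting}
QACCT_BIN=${QACCT_BIN:-qacct}

qacct_shared() {
    # Forward all arguments unchanged after the -f option.
    "$QACCT_BIN" -f "$ACCOUNTING_FILE" "$@"
}

# Hypothetical usage:
#   qacct_shared -j JOBID
```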

I wanted to try to set up a puppetmaster::self on toolsbeta, so I think I'll use this for testing.
Comment 5 Gerrit Notification Bot 2013-06-25 14:32:12 UTC
Related URL: https://gerrit.wikimedia.org/r/70425 (Gerrit Change I29f0e42e4f49a406565344c31a7c93924bcd7408)
Comment 6 Gerrit Notification Bot 2013-06-25 14:32:15 UTC
Related URL: https://gerrit.wikimedia.org/r/70425 (Gerrit Change I29f0e42e4f49a406565344c31a7c93924bcd7408)
Comment 7 Marc A. Pelletier 2013-06-25 15:00:02 UTC
Fixed by https://gerrit.wikimedia.org/r/#/c/70425/ (merged)
