Last modified: 2013-06-25 15:00:02 UTC
qacct is probably the tool of choice for determining a job's resource consumption, since it measures it the same way the grid does (hah!). It would be very useful to make at least its output for single jobs ("qacct -j JOBID") available.
Coren, is there anything preventing this from happening?
After looking at http://arc.liv.ac.uk/SGE/howto/nfsreduce.html, it seems possible to distribute the master's accounting file to the individual hosts and let qacct read it locally. Coren said on IRC that the accounting file ostensibly does not contain private information (the format is very simple and line-based; cf. /sge/GE/default/common/accounting on Toolserver). AFAICS, qacct is only really useful on hosts that can also submit jobs, so it would probably make sense to hinge the distribution on gridengine::submit_host, as a cron job that calls rsync every x minutes (at the very least once per hour). Someone needs to figure out the correct command line, and after deployment we need to document that job information might take up to x minutes to show up in qacct.
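For illustration, a minimal sketch of pulling a single job's raw record out of such a line-based file without qacct. It assumes the colon-separated layout described in accounting(5), with the job number in the sixth field and comment lines starting with "#"; the function name accounting_for_job is made up for this example. qacct -j remains the proper tool, since it also labels and formats the fields.

```shell
#!/bin/sh
# Hypothetical helper: print the raw accounting record(s) for one job.
# Usage: accounting_for_job FILE JOBID
# Assumption: fields are colon-separated and field 6 is the job number,
# as described in accounting(5); lines starting with "#" are comments.
accounting_for_job() {
    awk -F: -v job="$2" '!/^#/ && $6 == job' "$1"
}
```

Called as, for example, accounting_for_job /var/lib/gridengine/default/common/accounting JOBID, this prints nothing if the job has not been written to the file yet.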
eh... and why not allow people to just ssh to tools-master and query it directly?
After some further brainstorming: rsyncing directly from tools-master to the hosts would complicate things, because we would essentially need to allow root ssh between hosts, which is a bit scary. But Coren reminded me in another context that we have /data/project/.system, so we could run "cp -f /var/lib/gridengine/default/common/accounting /data/project/.system/accounting.tmp && mv -f /data/project/.system/accounting.tmp /data/project/.system/accounting" on tools-master and "cp /data/project/.system/accounting /var/lib/gridengine/default/common/accounting.tmp && mv -f /var/lib/gridengine/default/common/accounting.tmp /var/lib/gridengine/default/common/accounting" on the submit hosts (the ".tmp" step is there for atomicity, so that readers never see a half-copied file). Exporting the accounting file via NFS from tools-master could introduce some nasty locks, and aliasing "qacct" to "qacct -f /data/project/.system/accounting" in /etc/profile could cause problems when someone calls qacct from something other than an interactive shell. I wanted to try to set up a puppetmaster::self on toolsbeta, so I think I'll use that for testing.
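The two cp/mv pairs above follow the same pattern, which could be sketched as one small helper. This is only an illustration of the idea, not the change that was eventually merged; the function name publish_atomically is made up here. The temporary file is deliberately created next to the destination, because mv is only an atomic rename when both names are on the same filesystem.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DEST so that readers of DEST only ever
# see the old complete file or the new complete file, never a partial copy.
# Usage: publish_atomically SRC DEST
publish_atomically() {
    src=$1
    dest=$2
    tmp="$dest.tmp"   # same directory as DEST, hence same filesystem
    # The slow part (cp) writes to the .tmp name; the rename is the
    # only step that touches DEST. If cp fails, DEST is left untouched.
    cp -f "$src" "$tmp" && mv -f "$tmp" "$dest"
}

# On tools-master (paths as quoted in the comment above):
#   publish_atomically /var/lib/gridengine/default/common/accounting \
#       /data/project/.system/accounting
# On each submit host:
#   publish_atomically /data/project/.system/accounting \
#       /var/lib/gridengine/default/common/accounting
```

With the file copied to its default location on the submit hosts, a plain "qacct -j JOBID" works there without needing the "qacct -f" alias.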
Related URL: https://gerrit.wikimedia.org/r/70425 (Gerrit Change I29f0e42e4f49a406565344c31a7c93924bcd7408)
Fixed by https://gerrit.wikimedia.org/r/#/c/70425/ (merged)