Last modified: 2014-01-30 00:43:11 UTC
I need a grid node for checking weblinks in the German Wikipedia article namespace, with sufficient memory to check 200 links in parallel via an array job, run at an interval of two weeks. See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20131123.txt for the calculation. The node should be named tools-exec-giftbot-01.
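For reference, the submission could look roughly like this (a sketch only; the script name check_links.sh and the 200-way split illustrate the array-job idea, not the bot's actual code):

| tools.giftbot@tools-login:~$ qsub -t 1-200 -N linkcheck -b y ./check_links.sh

With -t 1-200, gridengine schedules 200 tasks of a single job; each task sees its index in $SGE_TASK_ID, so the link list can be partitioned across tasks without 200 separate qsub calls.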
Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200? I ask because the grid isn't really saturated; http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=tools&h=tools-master&v=0&m=sge_pending&r=hour&z=default&jr=&js=&st=1387968592&z=large shows that the number of pending jobs is almost always 0.
(In reply to comment #1)
> Reading the IRC log, I don't quite understand why you need a *node* of your
> own. Apparently, you want to run 200 jobs in parallel, and the problem is the
> 12 concurrent jobs/user limit. So you really want to have the limit for your
> bot raised to 200?
> [...]

Just checked: Currently the limit seems to be defined by:

| scfc@tools-login:~$ qconf -srqs
| {
|    name         jobs
|    description  NONE
|    enabled      FALSE
|    limit        users {*} queues {continuous,task} to jobs=16
| }
| scfc@tools-login:~$

*but* this a) is "enabled FALSE" and b) apparently allows *32* jobs per user even in one queue ("for NR in {1..100}; do qsub -q task -b y sleep 1m; done"). I changed "enabled" to "TRUE" and added a first rule:

| scfc@tools-login:~$ sudo qconf -srqs
| {
|    name         jobs
|    description  NONE
|    enabled      TRUE
|    limit        users scfc to jobs=200
|    limit        users {*} queues {continuous,task} to jobs=16
| }
| scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back. Further digging brought up:

| scfc@tools-login:~$ qconf -ssconf
| [...]
| maxujobs                          32
| [...]

Ah! I'll test how to set per-user quotas over the next few days.
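(Sketch only, not yet tried on tools-master: if maxujobs in the scheduler configuration is the binding cap, it is a single global setting, so lifting it for one user would presumably mean setting it to 0, i.e. unlimited, via

| scfc@tools-login:~$ sudo qconf -msconf
| [...]
| maxujobs                          0
| [...]

and then enforcing per-user caps through resource quota sets like the "jobs" RQS above.)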
It was Coren's idea to use a dedicated node, so that the number of jobs would be unlimited. If you really want to raise the tool's limit instead, please raise it to 200 plus the usual 32 (i.e. 232): I have other scripts that need to run and should not be delayed by a 10-day period of blocked grid resources.
The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based allocation works on a worst-case memory footprint basis, which we /know/ does not apply when you control the executable). Compare tools-webgrid-01, where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared between every job.
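To put rough, made-up numbers on that: if each of the 200 jobs requested, say, h_vmem=256M (an illustrative figure, not giftbot's actual requirement), e.g.

| $ qsub -t 1-200 -l h_vmem=256M -b y ./check_links.sh

the scheduler would reserve 200 x 256 MB, about 50 GB of virtual memory, even though the shared executable pages mean the real footprint is far smaller. That is why such jobs would clog the normal nodes, while a dedicated node can safely ignore the worst case.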
(In reply to comment #4)
> The reason why I'd dedicate a node to this is that, for 200 jobs sharing
> executables, the VMEM-based resource allocation would /vastly/ overcommit
> and clog the normal nodes (VMEM-based functions on a worst-case memory
> footprint basis, which we /know/ is not the case when you control the
> executable). Compare tools-webgrid-01 where a *lot* of jobs are running
> with large VMEM limits; this is possible because we know that the lighttpd
> footprint is shared between every job.

In this case, you're right :-).
Queue 'gift' was created with a medium instance and is accessible to local-giftbot.