Last modified: 2014-01-30 00:43:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T60949, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 58949 - create grid node for checking weblinks
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Unprioritized
Severity: normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Depends on:
Blocks:
 
Reported: 2013-12-24 21:47 UTC by Giftpflanze
Modified: 2014-01-30 00:43 UTC
CC: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Giftpflanze 2013-12-24 21:47:17 UTC
I need a grid node for checking weblinks in the German Wikipedia article namespace, with sufficient memory to check 200 links in parallel with an array job at an interval of two weeks. See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20131123.txt for the calculation. The grid node should be named tools-exec-giftbot-01.
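For context, a request like this would normally be met by submitting the work to the grid engine as an array job; the following is only a sketch, and the wrapper script name check_links.sh, the per-task memory request and the job name are assumptions, not details from this report:

  # Submit 200 array tasks; each task reads $SGE_TASK_ID to pick its batch of links.
  # -tc caps how many tasks may run at the same time.
  qsub -t 1-200 -tc 200 -l h_vmem=256M -N giftbot-weblinks check_links.sh

Run as an array job, the tasks queue independently, so the scheduler does not need 200 free slots at once before anything starts.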
Comment 1 Tim Landscheidt 2013-12-25 10:52:16 UTC
Reading the IRC log, I don't quite understand why you need a *node* of your own.  Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit.  So you really want to have the limit for your bot raised to 200?

I ask because the grid isn't really saturated; http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=tools&h=tools-master&v=0&m=sge_pending&r=hour&z=default&jr=&js=&st=1387968592&z=large shows that the number of pending jobs is almost always 0.
Comment 2 Tim Landscheidt 2013-12-25 11:37:04 UTC
(In reply to comment #1)
> Reading the IRC log, I don't quite understand why you need a *node* of your
> own.  Apparently, you want to run 200 jobs in parallel, and the problem is
> the 12 concurrent jobs/user limit.  So you really want to have the limit for
> your bot raised to 200?

> [...]

Just checked: Currently the limit seems to be defined by:

| scfc@tools-login:~$ qconf -srqs
| {
|    name         jobs
|    description  NONE
|    enabled      FALSE
|    limit        users {*} queues {continuous,task} to jobs=16
| }
| scfc@tools-login:~$

*but* this rule set a) is "enabled FALSE", and b) the grid apparently allows *32* jobs per user even in one queue ("for NR in {1..100}; do qsub -q task -b y sleep 1m; done").

I changed "enabled" to "TRUE" and added a first rule:

| scfc@tools-login:~$ sudo qconf -srqs
| {
|    name         jobs
|    description  NONE
|    enabled      TRUE
|    limit        users scfc to jobs=200
|    limit        users {*} queues {continuous,task} to jobs=16
| }
| scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back.

Further digging brought up:

| scfc@tools-login:~$ qconf -ssconf
| [...]
| maxujobs                          32
| [...]

Ah!  I'll test how to set per-user quotas over the next few days.
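From the above, raising a single user's ceiling apparently needs two pieces: a higher global per-user cap in the scheduler configuration, plus a user-specific rule ahead of the catch-all rule in the resource quota set. The following is only a sketch of that idea, not a configuration that was applied here; the user name and numbers are placeholders:

  # maxujobs is the global per-user cap found above; it lives in the
  # scheduler configuration:
  sudo qconf -msconf            # raise "maxujobs 32" to e.g. 200

  # Within a resource quota set the first matching rule wins, so a rule for
  # one user placed before the {*} rule raises only that user's limit:
  sudo qconf -mrqs jobs
  #    limit   users local-giftbot to jobs=200
  #    limit   users {*} queues {continuous,task} to jobs=16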
Comment 3 Giftpflanze 2013-12-25 12:37:40 UTC
It was Coren's idea to use a dedicated node, so that the number of jobs would be unlimited. Also, if you really want to raise the tool's limit instead, raise it to 200 plus the usual 32 (i.e. 232). I have other scripts that need to be run and should not be delayed by a 10-day period of blocked grid resources.
Comment 4 Marc A. Pelletier 2013-12-25 15:04:02 UTC
The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based allocation works on a worst-case memory footprint basis, which we /know/ is not the case when you control the executable).

Compare tools-webgrid-01, where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared across all the jobs.
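To put rough, made-up numbers on that: 200 tasks each reserving, say, 256 MB of h_vmem would book about 50 GB against the normal nodes on paper, even though a shared executable means the real resident footprint would be far smaller; hence the dedicated node.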
Comment 5 Tim Landscheidt 2014-01-03 14:56:33 UTC
(In reply to comment #4)
> The reason why I'd dedicate a node to this is that, for 200 jobs sharing
> executables, the VMEM-based resource allocation would /vastly/ overcommit
> and clog the normal nodes (VMEM-based allocation works on a worst-case
> memory footprint basis, which we /know/ is not the case when you control
> the executable).

> Compare tools-webgrid-01, where a *lot* of jobs are running with large VMEM
> limits; this is possible because we know that the lighttpd footprint is
> shared across all the jobs.

In this case, you're right :-).
Comment 6 Marc A. Pelletier 2014-01-09 18:44:36 UTC
Queue 'gift' was created with a medium instance and is accessible to local-giftbot.
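With the dedicated queue in place, the array job from the original request could be pointed at it explicitly; again only a sketch, reusing the hypothetical check_links.sh wrapper from above:

  # Restrict all 200 array tasks to the dedicated 'gift' queue:
  qsub -q gift -t 1-200 -N giftbot-weblinks check_links.sh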
