Last modified: 2014-05-12 23:15:22 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T64942, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 62942 - Normal jobs sometimes run on tools-webgrid-tomcat.eqiad.wmflabs
Status: ASSIGNED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Normal minor
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2014-03-21 23:23 UTC by Merlijn van Deen (test)
Modified: 2014-05-12 23:15 UTC (History)
Description Merlijn van Deen (test) 2014-03-21 23:23:54 UTC
nl:User:Valhallasw-toolserver-botje runs from a crontab on tools-login under the nlwikibots service group:

0 * * * * qsub $HOME/bin/tvpupdater > /dev/null

The job is queued hourly, but it only edits pages on nlwiki at midnight local time (Europe/Amsterdam). This currently corresponds to 23:00 UTC, and will correspond to 22:00 UTC once daylight saving time starts.
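
To make the time arithmetic concrete, here is a minimal sketch of how an hourly job could decide whether it is the midnight run in Europe/Amsterdam. The function name `is_local_midnight_hour` is hypothetical and is not taken from the actual bot; the sketch assumes Python 3.9+ with system time zone data available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

AMSTERDAM = ZoneInfo("Europe/Amsterdam")


def is_local_midnight_hour(now_utc: datetime) -> bool:
    """Return True when the given UTC time falls in the 00:xx hour in Amsterdam."""
    return now_utc.astimezone(AMSTERDAM).hour == 0


# Winter time (CET, UTC+1): the 23:00 UTC run is the midnight run.
print(is_local_midnight_hour(datetime(2014, 3, 21, 23, 0, 2, tzinfo=timezone.utc)))
# Summer time (CEST, UTC+2): the 22:00 UTC run is the midnight run.
print(is_local_midnight_hour(datetime(2014, 4, 1, 22, 0, 2, tzinfo=timezone.utc)))
```

Checking the local hour rather than hard-coding a UTC hour means the crontab entry does not need to change when daylight saving time starts or ends.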

In the edit message, the bot reports the host at which it is currently running:
https://nl.wikipedia.org/wiki/Speciaal:Bijdragen/Valhallasw-toolserver-botje
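
As a rough illustration of the host reporting described above, a bot could build its edit summary like this. This is a minimal sketch: the function name `edit_summary` and the summary wording are hypothetical, and the actual bot's format is not shown in this report.

```python
import socket


def edit_summary(action: str) -> str:
    """Build an edit summary that records the host the job is running on."""
    # socket.gethostname() returns the name of the machine executing the job,
    # which makes it visible afterwards which grid node picked the job up.
    return f"{action} (running on {socket.gethostname()})"


print(edit_summary("Bot: update"))
```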

The expected behavior is for the job to run on one of the tools-exec-* hosts, but it often runs on the tools-webgrid-tomcat host instead.


$HOME/bin/tvpupdater sets several SGE parameters:

#$ -l h_rt=0:30:00  # max runtime
#$ -l virtual_free=25M # max memory use, excluding shared libs, toolserver
#$ -l h_vmem=256M # max memory use, including shared libs, Tools Labs
#$ -l arch=* # may run on both Linux and Solaris
#$ -N tvpupdater-valhallasw # job name, ends with the owner's name
#$ -M valhallasw@arctus.nl
#$ -m a # only send mail on abort (e.g. because the runtime limit was exceeded)
#$ -b y # run from the network drive instead of copying the file
#$ -o /dev/null # output to /dev/null
#$ -e $HOME/tvpupdater-valhallasw.err

The script then changes directory and invokes ~/bots/tvpupdater/runbot, which activates a virtualenv and in turn starts the actual bot script.



I was able to find the job numbers for two runs on the tomcat host. Their qacct data is shown below.

qacct -j 40719
==============================================================
qname        webgrid-tomcat
hostname     tools-webgrid-tomcat.eqiad.wmflabs
group        tools.nlwikibots
owner        tools.nlwikibots
project      NONE
department   defaultdepartment
jobname      tvpupdater-valhallasw
jobnumber    40719
taskid       undefined
account      sge
priority     0
qsub_time    Mon Mar 17 23:00:02 2014
start_time   Mon Mar 17 23:00:13 2014
end_time     Mon Mar 17 23:00:45 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 32
ru_utime     0.196
ru_stime     0.056
ru_maxrss    16456
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    12120
ru_majflt    0
ru_nswap     0
ru_inblock   8
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     639
ru_nivcsw    78
cpu          0.252
mem          0.019
io           0.002
iow          0.000
maxvmem      209.957M
arid         undefined

qacct -j 64375
==============================================================
qname        webgrid-tomcat
hostname     tools-webgrid-tomcat.eqiad.wmflabs
group        tools.nlwikibots
owner        tools.nlwikibots
project      NONE
department   defaultdepartment
jobname      tvpupdater-valhallasw
jobnumber    64375
taskid       undefined
account      sge
priority     0
qsub_time    Fri Mar 21 23:00:02 2014
start_time   Fri Mar 21 23:00:17 2014
end_time     Fri Mar 21 23:00:43 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 26
ru_utime     0.216
ru_stime     0.080
ru_maxrss    16444
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    12106
ru_majflt    2
ru_nswap     0
ru_inblock   64
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     506
ru_nivcsw    171
cpu          0.296
mem          0.021
io           0.002
iow          0.000
maxvmem      209.945M
arid         undefined
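
Records like the two above can be checked mechanically. The following is a minimal sketch that parses `qacct -j` style output and lists jobs whose host does not match the expected tools-exec-* pattern. The helper names `parse_qacct` and `misplaced` are hypothetical, and `SAMPLE` is abridged from the two records shown in this report.

```python
from typing import Dict, List

# Abridged from the two qacct records in this report.
SAMPLE = """\
==============================================================
qname        webgrid-tomcat
hostname     tools-webgrid-tomcat.eqiad.wmflabs
jobname      tvpupdater-valhallasw
jobnumber    40719
qsub_time    Mon Mar 17 23:00:02 2014
==============================================================
qname        webgrid-tomcat
hostname     tools-webgrid-tomcat.eqiad.wmflabs
jobname      tvpupdater-valhallasw
jobnumber    64375
qsub_time    Fri Mar 21 23:00:02 2014
"""


def parse_qacct(text: str) -> List[Dict[str, str]]:
    """Split qacct output into one dict per job record."""
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("====="):
            # A separator line starts a new record.
            records.append({})
            continue
        if not line.strip() or not records:
            continue
        # The key is the first word; the rest of the line is the value.
        key, _, value = line.partition(" ")
        records[-1][key] = value.strip()
    return records


def misplaced(records: List[Dict[str, str]],
              expected_prefix: str = "tools-exec-") -> List[Dict[str, str]]:
    """Return the records whose hostname does not start with expected_prefix."""
    return [r for r in records
            if not r.get("hostname", "").startswith(expected_prefix)]


for rec in misplaced(parse_qacct(SAMPLE)):
    print(rec["jobnumber"], rec["qname"], rec["hostname"])
```
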
Comment 1 Tim Landscheidt 2014-05-12 23:15:22 UTC
Job #190385 seems to be related somehow. After running "sudo qmod -cj 190385", tools-master's /var/spool/gridengine/qmaster/messages shows:

| 05/12/2014 22:59:51|worker|tools-master|W|job 190385.1 failed on host tools-webgrid-tomcat.eqiad.wmflabs general searching requested shell because: 05/12/2014 22:59:50 [3838:26265]: execvp(/var/spool/gridengine/execd/tools-webgrid-tomcat/job_scripts/190385, "/var/spool/gridengine/execd/tools-webgrid-tomcat/job_scripts/190385") failed: No such file or directory
| 05/12/2014 22:59:51|worker|tools-master|W|rescheduling job 190385.1

I don't know whether the job would have started on a normal exec node (presumably it would, because if this were a systematic error affecting all jobs, users would be shouting *very* loudly), but the coincidence is certainly interesting.
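
For anyone wanting to look for more such events, here is a minimal sketch that splits lines like the two quoted above into fields and extracts rescheduled job numbers. The field layout (timestamp, thread, host, level, message separated by "|") is inferred only from those two lines, and the names `QmasterMessage`, `parse_message`, and `rescheduled_job` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class QmasterMessage(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message(line: str) -> Optional[QmasterMessage]:
    """Parse one messages line; return None if it does not have five fields."""
    line = line.strip()
    if line.startswith("| "):
        # Drop the quoting prefix used when the lines were pasted above.
        line = line[2:]
    parts = line.split("|", 4)
    if len(parts) != 5:
        return None
    return QmasterMessage(*parts)


RESCHEDULE = re.compile(r"rescheduling job (\d+)\.(\d+)")


def rescheduled_job(msg: QmasterMessage) -> Optional[int]:
    """Return the job number if the message reports a rescheduling."""
    match = RESCHEDULE.search(msg.text)
    return int(match.group(1)) if match else None


line = "| 05/12/2014 22:59:51|worker|tools-master|W|rescheduling job 190385.1"
msg = parse_message(line)
if msg is not None:
    print(msg.timestamp, msg.level, rescheduled_job(msg))
```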
