Last modified: 2013-12-16 07:47:46 UTC

This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T59479, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 57479 - Unable to mount /public directories on queue nodes
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2013-11-23 09:06 UTC by bgwhite
Modified: 2013-12-16 07:47 UTC (History)

Description bgwhite 2013-11-23 09:06:39 UTC
On tools-login, I can see this file:  /public/datasets/public/pdcwiki/20131115/pdcwiki-20131115-pages-articles.xml.bz2

However, the queue nodes cannot see the file.  I get "Can't open input file /data/project/checkwiki/dumps/pdcwiki-20131115-pages-articles.xml.bz2: No such file or directory"  

If I copy the file to /data/project/checkwiki, the queue can see the file and run normally.   

I think the problem happened around 0z.  Some programs started at 0:00z and 0:01z just fine.  Some programs started at 0:03z and they died.  Any programs since also die.

Bryan
Comment 1 Marc A. Pelletier 2013-11-25 15:26:05 UTC
I'm not sure I understand your bug report: the error message you mention does not match the file you expected (/public/datasets/... vs. /data/project). It looks like your tool isn't trying to read the file from where you expect it to?
Comment 2 bgwhite 2013-11-25 19:01:36 UTC
It would have been helpful if I hadn't pasted the result of a test run.

Several queue machines cannot see /public/datasets/public/dumps/*  
Note:  This doesn't affect all queue machines.   

I meant to say "Can't open input file /public/datasets/public/dumps/pdcwiki-20131115-pages-articles.xml.bz2: No such file or directory"

I tested things out on the 24th and still got "Can't open input file /public/datasets/public/gdwiki/20131123/gdwiki-20131123-pages-articles.xml.bz2: No such file or directory."

I just tested things again and I still get the same error.
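
Since only some queue machines are affected, it can help for a job to name the host it ran on when its input is missing. A minimal sketch, assuming the job is started from a bash script; `require_input` is a hypothetical helper name, not part of any existing tool:

```shell
#!/bin/bash
# Hypothetical helper: fail early, naming the execution host, if an input
# file is not readable from the node the job landed on.
require_input() {
    local path="$1"
    if [ ! -r "$path" ]; then
        # uname -n prints the node's host name.
        echo "$(uname -n): cannot read input file $path" >&2
        return 1
    fi
}
```

A job script would call, for example, `require_input /public/datasets/public/pdcwiki/20131115/pdcwiki-20131115-pages-articles.xml.bz2 || exit 1` before starting the real work, so the error log shows which node could not see the file.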
Comment 3 Tim Landscheidt 2013-11-26 01:28:17 UTC
On tools-exec-01, /public is empty and /home looks rather sparse:

| scfc@tools-login:~$ ssh tools-exec-01 ls -l /public
| total 0
| scfc@tools-login:~$ ssh tools-exec-01 ls -l /home
| total 20
| drwx------ 3 dapete  wikidev 4096 Nov 24 17:43 dapete
| drwxr-xr-x 2 gmetric gmetric 4096 Feb 27  2013 gmetric
| drwx------ 4 marc    wikidev 4096 Nov 22 17:09 marc
| drwx------ 3 scfc    wikidev 4096 Nov 26 01:16 scfc
| drwxr-xr-x 3 ubuntu  ubuntu  4096 Feb 27  2013 ubuntu
| scfc@tools-login:~$

Also, on tools-exec-01 (at least) my home directory is not my "real" one:

| scfc@tools-exec-01:~$ ll /home/scfc
| total 32
| drwx------ 3 scfc wikidev 4096 Nov 26 01:16 ./
| drwxr-xr-x 7 root root    4096 Nov 25 22:50 ../
| -rw------- 1 scfc wikidev  242 Nov 26 01:16 .bash_history
| -rw------- 1 scfc wikidev  220 Nov 25 22:50 .bash_logout
| -rw------- 1 scfc wikidev 3387 Nov 25 22:50 .bashrc
| drwx------ 2 scfc wikidev 4096 Nov 25 22:50 .cache/
| -rw------- 1 root root      43 Nov 26 01:12 .lesshst
| -rw------- 1 scfc wikidev  675 Nov 25 22:50 .profile
| scfc@tools-exec-01:~$

I don't see any differences in /etc/auto* compared to tools-exec-02, but it looks like the automounts for /home and /public aren't working (/data/project is mounted fine).
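
One further check, beyond comparing /etc/auto*, is whether the kernel's mount table lists an autofs mount for each path at all. An empty `ls` is only a hint, because an autofs indirect map without browse mode lists nothing until a key is accessed. A minimal sketch, assuming Linux and bash; `has_autofs_mount` is a hypothetical helper name, and nothing here is taken from the actual node configuration:

```shell
#!/bin/bash
# Hypothetical helper: succeed if mount point $1 is listed with filesystem
# type "autofs" in the mount table $2 (defaults to /proc/mounts on Linux).
has_autofs_mount() {
    local mountpoint="$1" table="${2:-/proc/mounts}"
    local dev dir fstype rest
    # Without a readable mount table there is nothing to check.
    [ -r "$table" ] || return 2
    # Each line is: device, mount point, filesystem type, options, ...
    while read -r dev dir fstype rest; do
        if [ "$dir" = "$mountpoint" ] && [ "$fstype" = "autofs" ]; then
            return 0
        fi
    done < "$table"
    return 1
}

# Report the state of the two suspect paths on the current host.
for path in /home /public; do
    if has_autofs_mount "$path"; then
        echo "$(uname -n): $path has an autofs mount"
    else
        echo "$(uname -n): $path has no autofs mount"
    fi
done
```

Running the same script on tools-exec-01 and tools-exec-02 would show whether the automounter registered its mount points on the broken node.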
Comment 4 Marc A. Pelletier 2013-11-26 03:05:34 UTC
Yeah, I just checked and it's definitely broken.

Annoyingly, autofs is really bad at restarting if any mounts are active, so there is little to do but drain the node from all jobs and wait for it to be idle before forcibly restarting it.

I'm going to remove it from the queue allocation now and let it drain; it'll take a while before every job goes away (I don't want to disrupt running tools), but it won't get assigned new jobs in the meantime, so nothing will hit the broken /public.
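
The drain step could look roughly like the following. This is a sketch under assumptions: it supposes the queue is managed by a Grid Engine scheduler (this report does not name the scheduler), and the queue name `task` is hypothetical. By default it only prints the commands it would run:

```shell
#!/bin/bash
# Sketch only: assumes Grid Engine's qmod and qstat; the queue name "task"
# is hypothetical. With DRY_RUN=1 (the default here) commands are printed
# instead of being executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

drain_node() {
    local instance="$1@$2"   # queue instance, written as queue@host
    # Disable the queue instance: running jobs continue, no new ones start.
    run qmod -d "$instance"
    # List the jobs of all users that are still on that queue instance.
    run qstat -u '*' -q "$instance"
}

drain_node task tools-exec-01
```

Once the job listing is empty, the node is idle and can be restarted without disrupting running tools.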
Comment 5 Tim Landscheidt 2013-12-16 07:47:46 UTC
tools-exec-01 was restarted early December and /home and /public seem to be properly mounted.
