Last modified: 2013-09-17 11:17:02 UTC
Around 06:45 UTC, /data/project disappeared for all instances in the dumps project with the error: Transport endpoint is not connected.

I've followed the steps in https://wikitech.wikimedia.org/wiki/Help:Shared_storage#Troubleshooting, including a reboot, but the error seems to be persistent and/or not related to my instance:

[2013-09-15 07:41:38.598177] E [socket.c:1715:socket_connect_finish] 0-dumps-project-client-0: connection to 10.0.0.41:24007 failed (Connection refused)
[2013-09-15 07:41:38.598968] E [socket.c:1715:socket_connect_finish] 0-dumps-project-client-1: connection to 10.0.0.42:24007 failed (Connection refused)
[2013-09-15 07:41:38.599003] E [afr-common.c:3665:afr_notify] 0-dumps-project-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-09-15 07:41:38.604444] I [fuse-bridge.c:4191:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-09-15 07:41:38.604886] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.17
[2013-09-15 07:41:38.605293] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)

By the way, while I was investigating which command to use to properly mount a volume, I found this comment, which suggests we shouldn't use Ubuntu's packages but the newer ones from <https://launchpad.net/~semiosis/+archive/ubuntu-glusterfs-3.3>: <http://unix-heaven.org/comment/1854#comment-1854>. They are supposed to solve some of the issues we have.
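For anyone triaging a similar client log, here is a minimal sketch of how the refused connections could be pulled out of it. It assumes the line layout seen in the excerpt above (timestamp, one-letter level, source location, context, message); the names LOG_LINE, REFUSED and refused_endpoints are made up for this illustration and are not part of GlusterFS.

```python
import re

# Assumed layout, taken from the excerpt in this report:
# [timestamp] LEVEL [file.c:line:function] context: message
LOG_LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<src>[^\]]+)\] "
    r"(?P<ctx>\S+): "
    r"(?P<msg>.*)"
)

# Message text as it appears in the two socket_connect_finish lines.
REFUSED = re.compile(
    r"connection to (?P<host>[\d.]+):(?P<port>\d+) failed \(Connection refused\)"
)


def refused_endpoints(lines):
    """Return (host, port) pairs for error-level 'Connection refused' lines."""
    found = []
    for line in lines:
        parsed = LOG_LINE.match(line.strip())
        if parsed is None or parsed.group("level") != "E":
            continue  # not a log line in the assumed layout, or not an error
        refused = REFUSED.search(parsed.group("msg"))
        if refused:
            found.append((refused.group("host"), int(refused.group("port"))))
    return found


if __name__ == "__main__":
    # Hypothetical usage: path as named in the attachment on this bug.
    with open("/var/log/glusterfs/data-project.log") as handle:
        for host, port in refused_endpoints(handle):
            print(host, port)
```

If every server backing the volume shows up in that list, the problem is most likely on the server side rather than on the instance.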
Created attachment 13286: dumps-1.pmtpa.wmflabs:/var/log/glusterfs/data-project.log
Is this failure happening in a particular project, or in /all/ projects?
(In reply to comment #2)
> Is this failure happening in a particular project, or in /all/ projects?

I've asked on the labs-l mailing list but didn't get an answer. I also don't remember whether I have access to other projects, or how to find a list of the projects I have access to.
Oh, sorry, you said 'dumps'. Should be fixed -- please close this bug if you can confirm.
Yes! Thank you so much. :D
For the record, on the instance that I hadn't rebooted, glusterfs was extremely slow for a while, apparently for as long as labstore1 and labstore2 were very busy (in terms of network and CPU) communicating with each other and with the instance's glusterfs process. (Extremely slow as in ls on a directory with 50 files taking many minutes.)

Now I'm also seeing weird errors, like this file which exists and doesn't exist at the same time, but I expect a reboot will fix it:

$ ls 2011/2011-10-01.csv
2011/2011-10-01.csv
nemobis@dumps-2:/data/project/commonsgrab2$ ls /data/project/commonsgrab2/2011/2011-10-01.csv
ls: cannot access /data/project/commonsgrab2/2011/2011-10-01.csv: Input/output error
nemobis@dumps-2:/data/project/commonsgrab2$ stat /data/project/commonsgrab2/2011/2011-10-01.csv
stat: cannot stat `/data/project/commonsgrab2/2011/2011-10-01.csv': Input/output error
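In case it helps others tell the two failure modes in this report apart from a script, here is a rough sketch. It relies on the fact that, on Linux, "Transport endpoint is not connected" is the message for ENOTCONN and "Input/output error" is the message for EIO. The function name probe and the returned labels are invented for this illustration.

```python
import errno
import os


def probe(path):
    """Classify the result of stat() on a path under a network mount."""
    try:
        os.stat(path)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            # "Transport endpoint is not connected": the mount itself is gone.
            return "mount disconnected"
        if exc.errno == errno.EIO:
            # "Input/output error": the mount answers, but this entry is broken.
            return "I/O error"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Anything else is reported by its symbolic errno name.
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Paths taken from this report.
    for path in ("/data/project",
                 "/data/project/commonsgrab2/2011/2011-10-01.csv"):
        print(path, "->", probe(path))
```

A "mount disconnected" result on /data/project would correspond to the original outage, while "I/O error" on a single file corresponds to the leftover inconsistency described above.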