Last modified: 2013-09-17 11:17:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56143, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 54143 - GlusterFS appears to be down (Transport endpoint is not connected, All subvolumes are down)


Summary:	GlusterFS appears to be down (Transport endpoint is not connected, All subvol...

Status:	RESOLVED FIXED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	Infrastructure
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized critical
Target Milestone:	---
Assigned To:	Ryan Lane

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-09-15 07:57 UTC by Nemo
Modified:	2013-09-17 11:17 UTC
CC List:	3 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
dumps-1.pmtpa.wmflabs:/var/log/glusterfs/data-project.log (2.33 MB, text/x-log) 2013-09-15 07:58 UTC, Nemo

Description Nemo 2013-09-15 07:57:00 UTC

Around 06:45 UTC, /data/project disappeared for all instances in the dumps project with the error: Transport endpoint is not connected.

I've followed the steps in https://wikitech.wikimedia.org/wiki/Help:Shared_storage#Troubleshooting, including a reboot, but the error seems to be persistent and/or unrelated to my instance:

[2013-09-15 07:41:38.598177] E [socket.c:1715:socket_connect_finish] 0-dumps-project-client-0: connection to 10.0.0.41:24007 failed (Connection refused)
[2013-09-15 07:41:38.598968] E [socket.c:1715:socket_connect_finish] 0-dumps-project-client-1: connection to 10.0.0.42:24007 failed (Connection refused)
[2013-09-15 07:41:38.599003] E [afr-common.c:3665:afr_notify] 0-dumps-project-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2013-09-15 07:41:38.604444] I [fuse-bridge.c:4191:fuse_graph_setup] 0-fuse: switched to graph 0
[2013-09-15 07:41:38.604886] I [fuse-bridge.c:3376:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.17
[2013-09-15 07:41:38.605293] W [fuse-bridge.c:513:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
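
As a rough illustration of how the excerpt above can be read, the following sketch pulls the refused endpoints out of client log lines shaped like the ones in this report. The regular expression and the function name `failed_endpoints` are assumptions made for this example and are not part of GlusterFS; other GlusterFS versions may format these messages differently.

```python
import re

# Pattern modelled only on the "socket_connect_finish" error lines quoted in
# this report; it is an assumption that other log files use the same layout.
CONNECT_FAILURE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+E\s+\[[^\]]*socket_connect_finish\]\s+"
    r"(?P<client>\S+):\s+connection to (?P<host>[\d.]+):(?P<port>\d+) "
    r"failed \((?P<reason>[^)]*)\)"
)


def failed_endpoints(log_lines):
    """Return (client, host, port, reason) for each failed connection line."""
    failures = []
    for line in log_lines:
        match = CONNECT_FAILURE.match(line.strip())
        if match:
            failures.append(
                (match["client"], match["host"], int(match["port"]), match["reason"])
            )
    return failures


if __name__ == "__main__":
    # Two lines copied from the log excerpt in this report.
    sample = [
        "[2013-09-15 07:41:38.598177] E [socket.c:1715:socket_connect_finish] "
        "0-dumps-project-client-0: connection to 10.0.0.41:24007 failed (Connection refused)",
        "[2013-09-15 07:41:38.598968] E [socket.c:1715:socket_connect_finish] "
        "0-dumps-project-client-1: connection to 10.0.0.42:24007 failed (Connection refused)",
    ]
    for client, host, port, reason in failed_endpoints(sample):
        print(f"{client}: {host}:{port} -> {reason}")
```

If every client of the replicated volume shows up in the result, as both do here, that matches the "All subvolumes are down" message that follows in the log, and suggests the problem is on the server side rather than on the instance.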

By the way, while investigating which command to use to mount a volume properly, I found a comment suggesting that we should use the newer packages from <https://launchpad.net/~semiosis/+archive/ubuntu-glusterfs-3.3> rather than Ubuntu's: <http://unix-heaven.org/comment/1854#comment-1854>. They are supposed to solve some of the issues we have.

Comment 1 Nemo 2013-09-15 07:58:28 UTC

Created attachment 13286
dumps-1.pmtpa.wmflabs:/var/log/glusterfs/data-project.log

Comment 2 Andrew Bogott 2013-09-15 20:11:38 UTC

Is this failure happening in a particular project, or in /all/ projects?

Comment 3 Nemo 2013-09-15 20:17:02 UTC

(In reply to comment #2)
> Is this failure happening in a particular project, or in /all/ projects?

I've asked on the labs-l mailing list but didn't get an answer. I also don't remember whether I have access to other projects, or how to list the projects I can access.

Comment 4 Andrew Bogott 2013-09-15 20:17:32 UTC

Oh, sorry, you said 'dumps'.  Should be fixed -- please close this bug if you can confirm.

Comment 5 Nemo 2013-09-15 20:20:58 UTC

Yes! Thank you so much. :D

Comment 6 Nemo 2013-09-17 11:17:02 UTC

For the record, on the instance that I hadn't rebooted, glusterfs was extremely slow for a while, apparently for as long as labstore1 and labstore2 were very busy (in terms of network and CPU) communicating with each other and with the instance's glusterfs process. (Extremely slow as in ls on a directory with 50 files taking many minutes.)

Now I'm also seeing weird errors like the following, where a file seems to exist and not exist at the same time, but I expect a reboot will fix it:
$ ls 2011/2011-10-01.csv
2011/2011-10-01.csv
nemobis@dumps-2:/data/project/commonsgrab2$ ls /data/project/commonsgrab2/2011/2011-10-01.csv
ls: cannot access /data/project/commonsgrab2/2011/2011-10-01.csv: Input/output error
nemobis@dumps-2:/data/project/commonsgrab2$ stat /data/project/commonsgrab2/2011/2011-10-01.csv
stat: cannot stat `/data/project/commonsgrab2/2011/2011-10-01.csv': Input/output error
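
To find other files in the same state, one option is to list a directory and then try to stat each name, collecting those that fail. The sketch below is only an illustration of that idea; the helper name `unstatable_entries` is made up for this example, and it assumes the listing itself still succeeds, as it did in the transcript above.

```python
import os


def unstatable_entries(directory):
    """Return (name, errno) for entries that are listed but cannot be stat'ed."""
    problems = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            os.stat(path)
        except OSError as err:
            # On the affected mount this would be EIO (Input/output error).
            problems.append((name, err.errno))
    return problems


if __name__ == "__main__":
    # Checks the current working directory; pass another path as needed.
    for name, code in unstatable_entries("."):
        print(f"{name}: {os.strerror(code)}")
```

An empty result means every listed name could be stat'ed at the time of the check. It does not rule out other kinds of inconsistency on the volume.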
