Last modified: 2013-08-26 13:16:35 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from the display of bug reports and their history, links might be broken. See T54500, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 52500 - Broken Disk Controller is Broken
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: Infrastructure
Version: unspecified
Hardware: All All
Importance: Highest critical
Target Milestone: ---
Assigned To: Ryan Lane
Depends on:
Blocks:
Reported: 2013-08-03 12:49 UTC by Johannes Kroll (WMDE)
Modified: 2013-08-26 13:16 UTC (History)
CC List: 10 users

Description Johannes Kroll (WMDE) 2013-08-03 12:49:16 UTC
One of the storage disk controllers has been broken for some time, leading to persistent problems on Labs, such as delays or timeouts while copying files and the web server becoming unresponsive.

The problem has been known to the admins for a while and discussed in #wikimedia-labs for at least 2 weeks.

This is something that should be fixed right away, obviously. People will want to use Labs at Wikimania. Talks and demos rely on the web server. You should /not/ wait until Wikimania is over.
Comment 1 Sam Reed (reedy) 2013-08-03 15:04:48 UTC
CC'ing Marc and Chris too.

Where is the problematic server? If EQIAD, Chris will need to deal with it, else delegate to Steve for Tampa.
Comment 2 Johannes Kroll (WMDE) 2013-08-03 15:54:33 UTC
(In reply to comment #1)
> CC'ing Marc and Chris too.
> 
> Where is the problematic server? If EQIAD, Chris will need to deal with it,
> else delegate to Steve for Tampa.

Coren said that the controller has to be switched. Supposedly he knows which.
Comment 3 Addshore 2013-08-03 16:04:56 UTC
Labs is slowly becoming less and less usable. Below are some I/O timing values I have grabbed from the -labs IRC channel.

Today Written and deleted 4 bytes on /data/project in 00:02:22.7323050
31st Written and deleted 4 bytes on /data/project in 00:01:39.6748470
30th Written and deleted 4 bytes on /data/project in 00:01:00.5361310
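
For anyone wanting to reproduce measurements like the ones above, here is a minimal sketch of a probe that writes a few bytes, deletes them, and reports the elapsed time. It is not the tool that produced the IRC lines; the function names `time_write_delete` and `format_elapsed` are hypothetical, and the output format is only an approximation of the one quoted above.

```python
import os
import sys
import tempfile
import time


def time_write_delete(directory, payload=b"test"):
    """Write payload to a new file in directory, delete it, return elapsed seconds."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # ask the OS to push the data to the storage backend
    finally:
        os.close(fd)
        os.remove(path)
    return time.monotonic() - start


def format_elapsed(seconds):
    """Format seconds as HH:MM:SS with seven fractional digits."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:010.7f}"


if __name__ == "__main__":
    # Pass the directory to probe (e.g. /data/project) as the first argument;
    # the fallback to the system temp directory is an assumption for local runs.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    elapsed = time_write_delete(target)
    print(f"Written and deleted 4 bytes on {target} in {format_elapsed(elapsed)}")
```

Running it periodically against the NFS mount and against a local disk would show whether the slowdown is specific to the shared storage.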

I am not sure if I have accidentally selected ones that slowly get worse... but that's what it looks like!

I am currently trying to actively develop on Labs, and that sure is what it feels like!
Comment 4 Ryan Lane 2013-08-03 19:49:56 UTC
The problem isn't as cut and dry as replacing the controller. It could be a kernel bug, it could be a controller issue, it could be an issue with the firmware on the controller. Over the past few months we've been working on tracking it down and fixing it, but it's not simple.
Comment 5 Marc A. Pelletier 2013-08-03 19:53:52 UTC
It could also be an issue with the cable, the shelf, or the disks themselves.

The two principal issues that complicate matters are that (a) any change to any of those components requires (sometimes significant) downtime of the NFS server, impacting much of Labs, and (b) the problem only surfaces in the presence of significant usage, making it hard to test outside of production.

We're going to be using the opportunity that Wikimania offers us (many of the ops together in a single place) to sit down and work out scenarios for testing and fixing the issue.
Comment 6 Johannes Kroll (WMDE) 2013-08-26 12:53:28 UTC
Well, the last time I asked you about the problem, Marc, you said that it was a problem with the controller, and that you were going to replace it after Wikimania.

Has there been any news since then?
Comment 7 Marc A. Pelletier 2013-08-26 13:16:35 UTC
The controller at issue is no longer in use while we determine exactly which of the previously mentioned issues is the root problem; the NFS has been completely stable since (at the cost of slightly less disk space and some feature reduction in the underlying filesystem).

We will return to the external storage array once we have eliminated the problem with it.
