Last modified: 2013-05-13 17:53:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T50257, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48257 - http connections to European Bits server often time out for some users since ~2013-05-06
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: wmf-deployment
Hardware: All All
Importance: Highest critical with 1 vote
Target Milestone: ---
Assigned To: Nobody
Keywords: ops
Depends on:
Blocks:

Reported: 2013-05-08 13:24 UTC by TMg
Modified: 2013-05-13 17:53 UTC (History)
CC List: 11 users



Attachments
Bits and Meta subdomain requests time out (124.85 KB, image/png)
2013-05-10 10:45 UTC, TMg

Description TMg 2013-05-08 13:24:12 UTC
The server http://bits.wikimedia.org/ has been insanely slow for two days. Requests almost never return anything; they time out instead. This leaves all MediaWiki projects (including Commons) naked, without any CSS (except for my user CSS).

Maybe a DNS issue?

Is there a DoS going on?

I'm sure this is not an issue on my side because I tested this on different computers using different internet connections. It's the same everywhere.

I'm in Germany. Here is the relevant part of a tracert:

C:\>tracert bits.wikimedia.org
Tracing route to bits-lb.esams.wikimedia.org [91.198.174.233]:
[...]
  8    50 ms    51 ms    52 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
  9    58 ms    56 ms    59 ms  xe-5-1-0-core0.nknik.nl.as6908.net [62.149.50.42]
 10    54 ms    55 ms    54 ms  xe-0-0-1.cr2-knams.wikimedia.org [78.41.155.38]
 11    57 ms    56 ms    56 ms  bits-lb.esams.wikimedia.org [91.198.174.233]
Trace complete.

I can't explain why the tracert looks so good. Requesting any bits URL in the browser almost always times out.
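
A clean tracert is not necessarily a contradiction here: tracert on Windows probes with ICMP echo packets, whereas a browser needs a TCP handshake on port 80, and the two can behave differently. A minimal sketch of how the TCP side could be measured directly, assuming Python 3; `probe_tcp` and `failure_rate` are hypothetical helper names, not tools used in this report:

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=5.0,
              connect=socket.create_connection):
    """Attempt `attempts` TCP handshakes with host:port.

    Returns one entry per attempt: the handshake duration in seconds,
    or None if the attempt failed (timeout, refused, unreachable, ...).
    `connect` is injectable so the loop can be exercised without a network.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # socket.timeout and socket.gaierror are both OSError subclasses.
            results.append(None)
            continue
        conn.close()
        results.append(time.monotonic() - start)
    return results


def failure_rate(results):
    """Fraction of attempts that failed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r is None) / len(results)
```

For example, `failure_rate(probe_tcp("bits.wikimedia.org", 80))` would give the share of failed handshakes on port 80, which is a more direct measure of this problem than the per-hop latencies above.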
Comment 1 Niklas Laxström 2013-05-08 13:43:40 UTC
It seems completely down at the moment:

Failed to load resource: the server responded with a status of 503 (Service Unavailable)
Comment 2 TMg 2013-05-09 16:21:25 UTC
To let you know: It's much better now but not solved. Currently it feels like 5% of the requests in the German Wikipedia time out. Nothing happens for a minute, and then a "server does not respond" message is shown. When I try again it works most of the time. Some edits are lost because of this. Multiple users reported the same problem.
Comment 3 Tomasz W. Kozlowski 2013-05-09 16:26:10 UTC
Raising priority then, adding the 'ops' keyword.
Comment 4 Andre Klapper 2013-05-10 01:16:47 UTC
(In reply to comment #2)
> Multiple users reported the same problem.

URLs welcome, as I haven't seen anything on the usual Commons forums that I try to follow.


There was an outage on Wednesday, 13:30 - 14:00 UTC, due to a memcached server going offline. "As usual this caused all kinds of cascading failures on other clusters such as Squid/Varnish. When not overloaded, these clusters would only serve cached pages at that point."
That would not cover "since 2 days" but that's what people immediately mentioned when I brought up this bug report in the operations channel.


I'm currently also in Germany and I ran "mtr" on my Linux machine for a while:


        My traceroute  [v0.82]
embrace.foo (0.0.0.0)                             Fri May 10 02:59:33 2013
Resolver: Received error response 2. (server failure)
                                  Packets               Pings
 Host                         Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. fritz.box                  0.0%   158    1.1   9.1   1.0 606.0  58.0
 2. 217.0.117.142              0.0%   158   20.1  30.4  19.1 529.5  48.8
 3. 87.186.195.6               0.0%   158   22.3  31.6  20.4 434.4  39.0
 4. hh-ea4-i.HH.DE.NET.DTAG.DE 0.6%   158   27.7  41.4  27.0 338.0  35.6
 5. 194.25.208.234             0.0%   158   30.2  45.8  27.7 1023.  81.4
    80.156.160.242
    80.150.168.162
    80.156.163.126
 6. hbg-bb1-link.telia.net     0.0%   158   27.4  48.4  27.2 999.9  80.3
    hbg-bb1-link.telia.net
    hbg-bb1-link.telia.net
 7. adm-bb3-link.telia.net     0.0%   158   33.5  50.3  32.6 1029. 102.1
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
    adm-bb3-link.telia.net
 8. adm-b5-link.telia.net      0.0%   158   35.1  51.0  33.9 943.4  92.1
 9. wikimedia-ic-129908-
            adm-b3.c.telia.net 5.7%   158   37.0  49.0  34.6 846.6  68.5
10. bits.esams.wikimedia.org   0.0%   158   35.2  47.8  35.2 746.2  69.9
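
The columns of the mtr report can be reproduced from raw round-trip samples, which helps when reading it: a handful of very slow replies is enough to push Wrst towards 1000 ms and inflate StDev while Avg stays moderate. A rough sketch, assuming Python 3; `summarize` is a hypothetical helper, and using the population standard deviation is an assumption, since nothing here says how mtr computes StDev:

```python
import statistics


def summarize(samples):
    """Summarise round-trip times in ms; None marks a lost packet.

    Returns a dict keyed by the column names of the mtr report.
    Assumption: StDev is taken as the population standard deviation.
    """
    replies = [s for s in samples if s is not None]
    sent = len(samples)
    summary = {
        "Loss%": 100.0 * (sent - len(replies)) / sent if sent else 0.0,
        "Snt": sent,
    }
    if replies:
        summary.update({
            "Last": replies[-1],
            "Avg": statistics.mean(replies),
            "Best": min(replies),
            "Wrst": max(replies),
            "StDev": statistics.pstdev(replies),
        })
    return summary
```

With nineteen replies of 35 ms and a single reply of 1000 ms, `summarize` reports an Avg of 83.25 ms, so the Wrst and StDev columns say more about occasional stalls than Avg does.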
Comment 5 Bawolff (Brian Wolff) 2013-05-10 02:07:57 UTC
(In reply to comment #0)
> The server http://bits.wikimedia.org/ is insanely slow since two days.
> Requests almost never return anything. The requests time out instead. This
> leaves all Mediawiki projects (including Commons) naked without any CSS
> (except for my user CSS).
> 

I would note that your user CSS is served via bits. Which URLs specifically are timing out, or is it random?


> There was an outage on Wednesday, 13:30 - 14:00 UTC, due to a memcached server
> going offline.

Shouldn't these sorts of things show up in the server admin log...
Comment 6 kimmo.virtanen 2013-05-10 05:01:18 UTC
I answered this before on bug 42653 (comments 14 - 17), but I will repeat the key points here too. It seems that HTTP on bits-lb.esams.wikimedia.org is broken. The IP itself answers to ping, and https links are working fine.

E.g. this works:
-  curl -i https://bits.wikimedia.org/

This fails most of the time:
-  curl -i http://bits.wikimedia.org/

The error is:
curl: (7) Failed to connect to 2620:0:862:ed1a::a: Network is unreachable
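
The curl error names an IPv6 address, while the tracert in the description reached an IPv4 address, so it helps to look at the two address families separately. A minimal sketch, assuming Python 3; `group_by_family` and `resolve` are hypothetical helpers:

```python
import socket


def group_by_family(addrinfo):
    """Group socket.getaddrinfo() tuples into IPv4 and IPv6 address lists."""
    labels = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    groups = {"IPv4": [], "IPv6": []}
    for family, _type, _proto, _canonname, sockaddr in addrinfo:
        label = labels.get(family)
        address = sockaddr[0]
        # getaddrinfo can return the same address more than once.
        if label and address not in groups[label]:
            groups[label].append(address)
    return groups


def resolve(host, port):
    """Look up `host` and return its addresses grouped by family."""
    info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return group_by_family(info)
```

Each address returned can then be tested on its own, which is what curl's -4 and -6 options do for one family at a time.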
Comment 8 Bawolff (Brian Wolff) 2013-05-10 05:36:39 UTC
Out of curiosity, does
 curl -i -4 http://bits.wikimedia.org/
also give you errors?
Comment 9 kimmo.virtanen 2013-05-10 05:53:12 UTC
Yes
Comment 10 Bawolff (Brian Wolff) 2013-05-10 05:55:36 UTC
(In reply to comment #9)
> Yes

To clarify, does it give the same error (it definitely should not)
Comment 11 kimmo.virtanen 2013-05-10 05:59:48 UTC
> To clarify, does it give the same error (it definitely should not)

Error message is: 
curl -i -4 http://bits.wikimedia.org/
curl: (7) Failed connect to bits.wikimedia.org:80; Connection timed out
Comment 12 kimmo.virtanen 2013-05-10 06:10:34 UTC
And when the connection works, the response is pretty much instant. So it is not that the HTTP server is too slow; it simply either works or doesn't.

Example response from an HTTP query that worked:
- time curl -i -4 http://bits.wikimedia.org/

HTTP/1.1 200 OK
Server: Apache
Last-Modified: Thu, 12 Aug 2010 16:12:20 GMT
ETag: "b2-48da2a1772100"
Content-Type: text/html
X-Varnish: 1991165982
Via: 1.1 varnish
Content-Length: 178
Accept-Ranges: bytes
Date: Fri, 10 May 2013 06:05:39 GMT
X-Varnish: 3599832084
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: sq67 miss (0), cp3022 miss (0)

<html>
	<head><title>bits and pieces</title>
		<meta http-equiv="refresh" content="1;url=http://www.wikimedia.org/" />
	</head>
<body>
bits and pieces live here!
</body>
</html>

real	0m0.281s
user	0m0.004s
sys	0m0.004s
Comment 13 TMg 2013-05-10 10:25:40 UTC
At the moment all Wikimedia projects are kind of dead and unusable because of this. Here are some example URLs that all time out:

http://bits.wikimedia.org/de.wikipedia.org/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector&*
http://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector&*
http://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.ReferenceTooltips%2Ccharinsert%2Ctoolbaralert2%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.vector&only=styles&skin=vector&*

It's like comment #12 said. Some requests to bits.wikimedia.org return immediately, some requests take a very long time (about 30 seconds) and some requests never return (time out).

Again, I'm sitting in Germany.

C:\>tracert bits.wikimedia.org
Tracing route to bits-lb.esams.wikimedia.org [91.198.174.233]:
[...]
  8    53 ms    51 ms    51 ms  ge0-1-0-cr0.ixf.de.as6908.net [80.81.192.244]
  9    58 ms    57 ms    57 ms  xe-5-1-0-core0.nknik.nl.as6908.net [62.149.50.42]
 10    55 ms    55 ms    55 ms  xe-0-0-1.cr2-knams.wikimedia.org [78.41.155.38]
 11    59 ms    55 ms    57 ms  bits-lb.esams.wikimedia.org [91.198.174.233]
Comment 14 TMg 2013-05-10 10:45:42 UTC
Created attachment 12291
Bits and Meta subdomain requests time out

Here is a screenshot from the Opera Dragonfly debugger. Please note that it's not only bits.wikimedia.org (all URLs that start with load.php); some meta.wikimedia.org URLs time out as well.
Comment 15 TMg 2013-05-10 11:00:41 UTC
(In reply to comment #6)
> https links are working fine.

Wow, you are right. The problem is immediately solved when I switch from http to https. I guess this is the reason why most of the users can't reproduce my problem.

https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Wikipedia-Server_sterbenslahm
Comment 16 TMg 2013-05-10 11:39:27 UTC
And now both http and https have the same problem and are unusable. Guys, what's going on?

Here are some https example URLs that time out:

https://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=de&modules=startup&only=scripts&skin=vector&*
https://bits.wikimedia.org/commons.wikimedia.org/load.php?debug=false&lang=de&modules=site&only=scripts&skin=vector&*
https://de.wikipedia.org/

Raising importance.
Comment 17 Nemo 2013-05-10 11:45:55 UTC
(In reply to comment #16)
> And now both http and https have the same problem and are unusable.

That may or may not be related. For me in Italy, HTTPS has been down for about 20 minutes, while HTTP sometimes loads after a long time (with or without styles).
Comment 18 kimmo.virtanen 2013-05-10 14:55:23 UTC
Another easy personal workaround is to switch to Google's DNS servers, so that bits.wikimedia.org resolves to bits-lb.eqiad.wikimedia.org, which works fine. This is one reason why the problem is mainly in Europe.
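
Which data centre a client ends up at can be read from the canonical name that bits.wikimedia.org resolves to. A small sketch, assuming Python 3 and assuming the `<service>.<site>.wikimedia.org` naming pattern seen in this report (esams, eqiad); `site_label` and `current_site` are hypothetical helpers, and these 2013 hostnames may no longer resolve:

```python
import socket


def site_label(canonical_name):
    """Return the site part of '<service>.<site>.wikimedia.org', else None.

    Assumption: the naming pattern is inferred from the hostnames quoted
    in this report, e.g. 'bits-lb.esams.wikimedia.org' -> 'esams'.
    """
    parts = canonical_name.rstrip(".").lower().split(".")
    if len(parts) == 4 and parts[2:] == ["wikimedia", "org"]:
        return parts[1]
    return None


def current_site(host="bits.wikimedia.org", resolver=socket.gethostbyname_ex):
    """Resolve `host` and report the site label of its primary name.

    gethostbyname_ex returns (hostname, aliases, ipv4_addresses); the
    resolver is injectable so this can be tried without DNS access.
    """
    canonical_name, _aliases, _addresses = resolver(host)
    return site_label(canonical_name)
```

The default resolver uses whatever DNS server the operating system is configured with, so switching DNS servers can change the result without any change to the code.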
Comment 19 Andre Klapper 2013-05-10 17:36:51 UTC
The URLs in comment 13 and comment 16 load fine for me in Firefox 18 (same when using http:// instead of https://), no matter how often I reload, and I am currently based in Germany too.

I assume you bypass the cache when trying to reload these URLs?
http://en.wikipedia.org/wiki/Wikipedia:Bypass_your_cache


Summarizing the aforementioned VP/forum threads (thanks for the links!):
* https://en.wikipedia.org/wiki/Wikipedia:Help_desk#Problems_getting_pages_to_load states "Now very suddenly working again" today by reporter.
* Both https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Wikipedia-Server_sterbenslahm and http://fi.wikipedia.org/wiki/Wikipedia:Kahvihuone_%28tekniikka%29#Wikipedian_hidastelu also imply that only http:// is affected, but the Finnish thread has had no new comments since May 8th, when the outage happened (see comment 4), so it's unclear whether there are still bigger problems.


As I don't yet see any indication that a large number of users in Europe are affected by this, I'll set this back to "highest" priority and "critical".
Comment 20 TMg 2013-05-10 20:52:10 UTC
(In reply to comment #19)
> Firefox

I'm sure the browser does not matter. I tried both Firefox and Opera.

> I assume you bypass the cache when trying to reload these URLs?

Yes, I know that and tried everything. In this case bypassing the browser cache made the problem worse. I tried to do the opposite, forcing the browser to never reload these resources if they are in the cache. But it seems there is no setting to do this. As far as I understand the browser always does a HEAD request to check if the cached resources changed. Some of these HEAD requests timed out.
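
On the cache behaviour: browsers typically revalidate a cached resource with a conditional GET (If-None-Match / If-Modified-Since) rather than a HEAD request, and the server answers 304 Not Modified if nothing changed. Either way a round trip to bits is needed, which may be why pages with cached resources could still stall. A sketch of such a request, assuming Python 3; `revalidation_request` is a hypothetical helper, and the ETag and date are the ones from the response shown in comment 12:

```python
import urllib.request


def revalidation_request(url, etag=None, last_modified=None):
    """Build (but do not send) a conditional GET for a cached resource."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers, method="GET")


request = revalidation_request(
    "http://bits.wikimedia.org/",
    etag='"b2-48da2a1772100"',
    last_modified="Thu, 12 Aug 2010 16:12:20 GMT",
)
# Sending it with urllib.request.urlopen(request, timeout=...) is expected
# to raise urllib.error.HTTPError with code 304 when the copy is still valid.
```

The request is only constructed here, not sent, so the sketch can be inspected without touching the network.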

Currently everything seems to work. Both http and https.

I still think there was an overload, maybe caused by a DoS. We will see if the problem comes back every 24 hours.
Comment 21 Mark Bergsma 2013-05-13 17:53:41 UTC
We migrated the network in Europe (esams) to a new topology on Friday (May 10th), which probably also explains why this hasn't been happening since.
