Last modified: 2014-06-01 11:08:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links may be broken; only bug reports and their history are displayed. See T52891, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 50891 - Intermittent "502 Bad Gateway" errors on Wikimedia wikis via HTTPS
Intermittent "502 Bad Gateway" errors on Wikimedia wikis via HTTPS
Status: RESOLVED WORKSFORME
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-07-07 16:49 UTC by MZMcBride
Modified: 2014-06-01 11:08 UTC (History)
CC List: 6 users

See Also:
Web browser: ---
Mobile Platform: ---


Attachments
Screenshot of 502 Bad Gateway error on https://en.wikipedia.org (109.76 KB, image/png)
2013-07-21 18:02 UTC, MZMcBride
Details
Capturing https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=SSL+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report for reference (43.52 KB, image/png)
2013-07-22 21:57 UTC, MZMcBride
Details

Description MZMcBride 2013-07-07 16:49:02 UTC
Intermittently, I get 502 Bad Gateway errors using HTTPS on en.wikipedia.org while logged in. The footer reads "nginx/1.1.19" or similar. <https://en.wikipedia.org/wiki/List_of_Anything_Muppets> is a sample URL. A browser window refresh solves the issue, but we should investigate and address what's causing these intermittent errors.
Comment 1 MZMcBride 2013-07-21 18:01:19 UTC
I just got this again at this URL: <https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=250&offset=0&target=http%3A%2F%2Ftoolserver.org%2F~mzmcbride%2Fcgi-bin%2Fwatcher>.

---
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
---

I'll upload a screenshot momentarily.
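
As a side note, a minimal client-side sketch of how such intermittent 502s could be sampled is below. It is not part of the original report; the URL, polling interval, and timeout are arbitrary examples, and it assumes Python 3 with only the standard library.

---
#!/usr/bin/env python3
# Hypothetical sketch, not from the bug report: repeatedly fetch a page over
# HTTPS and log any non-200 status, to get a rough idea of how often the
# intermittent 502s occur. URL, interval, and timeout are arbitrary examples.
import time
import urllib.error
import urllib.request

URL = "https://en.wikipedia.org/wiki/Main_Page"   # example page
INTERVAL = 30                                     # seconds between requests

def check_once(url):
    """Return the HTTP status code for a single GET request."""
    try:
        with urllib.request.urlopen(url, timeout=15) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as 502 raise HTTPError; err.code is the status.
        return err.code

while True:
    status = check_once(URL)
    if status != 200:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "HTTP", status, "from", URL)
    time.sleep(INTERVAL)
---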
Comment 2 MZMcBride 2013-07-21 18:02:41 UTC
Created attachment 12910 [details]
Screenshot of 502 Bad Gateway error on https://en.wikipedia.org
Comment 3 Kunal Mehta (Legoktm) 2013-07-22 07:06:11 UTC
Just got one of these on https://meta.wikimedia.org/wiki/User:MF-Warburg/abuse. Reloading fixed the issue.

The source was basically the same as in comment 1, but no <!-- html comments -->.
Comment 4 Kunal Mehta (Legoktm) 2013-07-22 20:07:44 UTC
Just got it on https://en.wikipedia.org/wiki/File:%22Modernism%22_oil_painting_by_Fred_Sexton,_circa_1940s.png. As usual, the page refresh fixed it.
Comment 5 Ryan Lane 2013-07-22 20:42:01 UTC
It looks like traffic through the ssl cluster has doubled in the past month. eqiad hardware is being overloaded. We're adding some more nodes to the cluster.
Comment 7 Ryan Lane 2013-07-23 00:20:24 UTC
Two new ssl servers were just pooled in eqiad. We'll need to do this in esams eventually as well, but they have newer/better hardware. Please let me know if you're still having this issue.
Comment 8 Ryan Lane 2013-07-23 00:37:33 UTC
I had to depool them due to issues with ipv6. I'll update the ticket when they are repooled.
Comment 9 Ryan Lane 2013-07-23 07:14:35 UTC
They are repooled now and everything should be working. I'll close this as fixed. Please re-open if it's not.
Comment 10 MZMcBride 2013-09-25 02:12:51 UTC
https://upload.wikimedia.org/wikipedia/mediawiki/9/91/Agora_specs.pdf gave me a 502 Bad Gateway error today.
Comment 11 James Alexander 2013-09-25 02:50:58 UTC
Yeah, I had a 501 on Meta itself (the whole page) twice this evening too. Looking at it through Firebug, I get multiple 502s from upload as I wander the sites. The best page to test that I've found is the Meta front page, because it has a ton of images, but I've seen 3-4 on random enwiki pages too, and usually at least one on any page with images at all.
Comment 12 James Alexander 2013-09-25 02:51:20 UTC
501 --> 502 <sigh>.
Comment 13 Andre Klapper 2013-11-22 15:30:01 UTC
MZMcBride: Has this happened recently?
Comment 14 Risker 2013-12-15 00:26:01 UTC
(In reply to comment #13)
> MZMcBride: Has this happened recently?

I can confirm that I received a "502 Bad Gateway" (nginx/1.1.19) today on enwiki when following a perfectly fine link.  The second time I followed the link, it took me where it was supposed to. 

Just for the record, the link was https://en.wikipedia.org/w/index.php?title=User_talk%3ARYasmeen_%28WMF%29&diff=586114761&oldid=586070808
Comment 15 MZMcBride 2013-12-18 14:16:58 UTC
This hasn't happened recently for me. I wonder if this bug report should re-focus on better logging/monitoring of 502s.
Comment 16 MZMcBride 2013-12-18 14:19:27 UTC
Ori or Nemo: do you know if we graph this data (users hitting nginx gateway timeout errors --> 502s) anywhere or if it would be possible to do so?
Comment 17 Nemo 2013-12-18 14:30:04 UTC
(In reply to comment #16)
> Ori or Nemo: do you know if we graph this data (users hitting nginx gateway
> timeout errors --> 502s) anywhere or if it would be possible to do so?

Presumably they appear in https://gdash.wikimedia.org/dashboards/reqerror/ mixed with all the 5xx (I'm not able to assess how complete/precise this report is)?
If this intermittent problem has the same cause as similar recent problems, i.e. network links at capacity, it might be more fruitful to set up a network monitoring tool like https://monitor.archive.org/weathermap/weathermap.html
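
For reference, a minimal sketch of the kind of 502-specific graphing asked about in comment 16 might look like the following. It is purely illustrative: it assumes nginx access-log lines in the common log format arriving on stdin and a statsd daemon on localhost:8125, neither of which is confirmed by this report.

---
#!/usr/bin/env python3
# Illustrative sketch only: count 502 responses seen in nginx access logs and
# emit a statsd counter so they can be graphed separately from other 5xx.
# Assumes common/combined log format on stdin and statsd on localhost:8125.
import re
import socket
import sys

# In the common/combined log format the status code follows the quoted request.
STATUS_RE = re.compile(r'" (\d{3}) ')

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for line in sys.stdin:
    match = STATUS_RE.search(line)
    if match and match.group(1) == "502":
        # statsd counter metric in the form name:value|c
        sock.sendto(b"nginx.status.502:1|c", ("127.0.0.1", 8125))
---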
Comment 18 Andre Klapper 2014-03-17 15:22:22 UTC
(In reply to MZMcBride from comment #15)
> This hasn't happened recently for me. 

If that's still the case I propose RESOLVED WORKSFORME.

> I wonder if this bug report should
> re-focus on better logging/monitoring of 502s.

We have only 5xx monitoring - if you have specific recommendations, could you put them into separate enhancement requests?
Comment 19 Nemo 2014-06-01 11:08:40 UTC
(In reply to Andre Klapper from comment #18)
> (In reply to MZMcBride from comment #15)
> > This hasn't happened recently for me. 
> 
> If that's still the case I propose RESOLVED WORKSFORME.

Yeah, this report is not particularly actionable by now. Icinga reports were also added and are regularly acted upon, for instance:

23.48 <+icinga-wm_> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
00.03 <+icinga-wm_> RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
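
For illustration, the threshold logic behind such an alert could look roughly like the sketch below. The 250/500 req/min thresholds are taken from the quoted messages; the 1% limit and the example samples are assumptions, not the actual check configuration.

---
#!/usr/bin/env python3
# Illustrative sketch of an Icinga-style threshold check over per-minute
# 5xx request rates. Thresholds mirror the quoted alert (250/500 req/min);
# the 1% limit and the example samples are assumptions.

def classify(samples, warn=250.0, crit=500.0, max_pct=1.0):
    """Return an Icinga-style state string for a series of 5xx req/min samples."""
    if not samples:
        return "UNKNOWN: no data"
    pct_crit = 100.0 * sum(1 for s in samples if s > crit) / len(samples)
    pct_warn = 100.0 * sum(1 for s in samples if s > warn) / len(samples)
    if pct_crit > max_pct:
        return "CRITICAL: %.2f%% of data exceeded the critical threshold [%.1f]" % (pct_crit, crit)
    if pct_warn > max_pct:
        return "WARNING: %.2f%% of data exceeded the warning threshold [%.1f]" % (pct_warn, warn)
    return "OK: Less than %.2f%% data above the threshold [%.1f]" % (max_pct, warn)

# Example: a mostly quiet window with a single spike of 5xx responses.
print(classify([12, 30, 25, 700, 18, 9]))   # -> CRITICAL: 16.67% ...
---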
