Last modified: 2014-06-01 11:08:40 UTC
Intermittently, I get 502 Bad Gateway errors using HTTPS on en.wikipedia.org while logged in. The footer reads "nginx/1.1.19" or similar. <https://en.wikipedia.org/wiki/List_of_Anything_Muppets> is a sample URL. A browser window refresh resolves the error, but we should investigate and address what's causing these intermittent failures.
I just got this again at this URL: <https://en.wikipedia.org/w/index.php?title=Special:LinkSearch&limit=250&offset=0&target=http%3A%2F%2Ftoolserver.org%2F~mzmcbride%2Fcgi-bin%2Fwatcher>.

---
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.1.19</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
(the padding comment above appears six times in the original response)
---

I'll upload a screenshot momentarily.
Created attachment 12910 [details] Screenshot of 502 Bad Gateway error on https://en.wikipedia.org
Just got one of these on https://meta.wikimedia.org/wiki/User:MF-Warburg/abuse. Reloading fixed the issue. The source was basically the same as in comment 1, but no <!-- html comments -->.
Just got it on https://en.wikipedia.org/wiki/File:%22Modernism%22_oil_painting_by_Fred_Sexton,_circa_1940s.png. As usual, the page refresh fixed it.
It looks like traffic through the SSL cluster has doubled in the past month, and the eqiad hardware is being overloaded. We're adding some more nodes to the cluster.
Created attachment 12925 [details] Capturing https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=SSL+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report for reference
Two new ssl servers were just pooled in eqiad. We'll need to do this in esams eventually as well, but they have newer/better hardware. Please let me know if you're still having this issue.
I had to depool them due to issues with ipv6. I'll update the ticket when they are repooled.
They are repooled now and everything should be working. I'll close this as fixed. Please re-open if it's not.
https://upload.wikimedia.org/wikipedia/mediawiki/9/91/Agora_specs.pdf gave me a 502 Bad Gateway error today.
Yeah, I had a 501 on meta itself (the whole page) twice this evening too. Looking at it through Firebug, I get multiple 502s from upload as I wander the sites. The best page to test that I've found is the meta front page, because it has a ton of images, but I've seen 3-4 on random enwiki pages too, and usually at least 1 on any page with images.
501 --> 502 <sigh>
MZMcBride: Has this happened recently?
(In reply to comment #13) > MZMcBride: Has this happened recently? I can confirm that I received a "502 Bad Gateway" (nginx/1.1.19) today on enwiki when following a perfectly fine link. The second time I followed the link, it took me where it was supposed to. Just for the record, the link was https://en.wikipedia.org/w/index.php?title=User_talk%3ARYasmeen_%28WMF%29&diff=586114761&oldid=586070808
This hasn't happened recently for me. I wonder if this bug report should re-focus on better logging/monitoring of 502s.
Ori or Nemo: do you know if we graph this data (users hitting nginx gateway timeout errors --> 502s) anywhere or if it would be possible to do so?
(In reply to comment #16) > Ori or Nemo: do you know if we graph this data (users hitting nginx gateway > timeout errors --> 502s) anywhere or if it would be possible to do so? Presumably they appear in https://gdash.wikimedia.org/dashboards/reqerror/ mixed with all the 5xx (I'm not able to assess how complete/precise this report is)? If this intermittent problem has the same cause that such problems have had lately, i.e. network links at capacity, it might be more fruitful to try and set up a monitoring tool for the network like https://monitor.archive.org/weathermap/weathermap.html
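Since gdash lumps 502s in with all 5xx responses, one way to break them out would be to tally status codes directly from nginx access logs. A minimal sketch, assuming combined-format log lines (the regex and sample lines are illustrative, not the production log format):

```python
import re
from collections import Counter

# Simplified pattern: grabs the status-code field that follows the
# quoted request line in a combined-format access log.
STATUS_RE = re.compile(r'" (\d{3}) ')

def count_statuses(lines):
    """Tally HTTP status codes from nginx access-log lines."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical sample lines for illustration.
sample = [
    '1.2.3.4 - - [01/Jun/2014:11:08:40 +0000] "GET /wiki/Foo HTTP/1.1" 200 1234',
    '1.2.3.4 - - [01/Jun/2014:11:08:41 +0000] "GET /wiki/Bar HTTP/1.1" 502 568',
    '1.2.3.4 - - [01/Jun/2014:11:08:42 +0000] "GET /wiki/Baz HTTP/1.1" 502 568',
]
counts = count_statuses(sample)
print(counts["502"])  # → 2
```

Per-minute counts like these could then be pushed to a graphing backend, which would let 502s be monitored separately from other 5xx codes.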
(In reply to MZMcBride from comment #15) > This hasn't happened recently for me. If that's still the case I propose RESOLVED WORKSFORME. > I wonder if this bug report should > re-focus on better logging/monitoring of 502s. We have only 5xx monitoring - if you have specific recommendations, could you put them into separate enhancement requests?
(In reply to Andre Klapper from comment #18) > (In reply to MZMcBride from comment #15) > > This hasn't happened recently for me. > > If that's still the case I propose RESOLVED WORKSFORME. Yeah, this report is not particularly actionable by now. Icinga reports were also added and are regularly acted upon, for instance:

23.48 <+icinga-wm_> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
00.03 <+icinga-wm_> RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
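The icinga lines above suggest a check that alerts when too large a fraction of recent data points exceed a req/min threshold. A minimal illustrative sketch of that logic (thresholds, fractions, and function names here are assumptions, not the production check):

```python
def classify(points, threshold, critical_frac):
    """Classify a window of per-minute 5xx counts.

    Returns ("CRITICAL", frac) when more than critical_frac of the
    points exceed threshold, else ("OK", frac). This is a simplified
    stand-in for the real icinga/graphite check.
    """
    over = sum(1 for p in points if p > threshold)
    frac = over / len(points)
    state = "CRITICAL" if frac > critical_frac else "OK"
    return state, frac

# Hypothetical window: 2 of 14 samples spike above 500 req/min.
points = [120, 80, 650, 90, 70, 60, 510, 40, 30, 20, 10, 15, 25, 35]
state, frac = classify(points, threshold=500.0, critical_frac=0.05)
print(state, round(frac * 100, 2))  # → CRITICAL 14.29
```

With a lower recovery threshold (as in the RECOVERY line, [250.0]), the same window function would flip back to OK once the spike ages out of the window.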