Since 2013-12-10 we're seeing between 15 and 40 lines per day in the sampled-1000 cache logs that have “nan” in the “Request service time” column [1]. For the mobile-sampled-100 cache logs it is 10 to 30 such lines per day [2]. The zero logs show no affected lines so far. The affected lines come with “-/” in the “Request status / HTTP status code” column, and “-” in the “Reply size” column.

[1] The first occurrence for the sampled-1000 stream can be obtained by running

  zgrep 49094481570 /a/squid/archive/sampled/sampled-1000.tsv.log-20131210.gz

on stat1002. That line is for cp1050.eqiad.wmnet on 2013-12-10T03:36:04.

[2] The first occurrence for the mobile-sampled-100 stream can be obtained by running

  zgrep 9650126710 /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20131210.gz

on stat1002. That line is for cp1047.eqiad.wmnet on 2013-12-10T02:01:24.
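For a rough daily count of affected lines, a minimal sketch (assuming the “Request service time” is the fourth tab-separated column of these TSVs; that field index is an assumption about the layout, so adjust it if your copy differs):

  # Count, per archived log file, the lines whose service-time field is literally "nan".
  # Field index 4 is an assumption about the TSV column layout.
  for f in /a/squid/archive/sampled/sampled-1000.tsv.log-*.gz; do
    printf '%s\t' "$f"
    zcat "$f" | awk -F'\t' '$4 == "nan"' | wc -l
  done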
The host names of the affected lines range over all hosts in upload-eqiad, upload-esams, and mobile-eqiad, but no other cache cluster. The request methods of the affected lines look mangled: HEAD and GET account for ~55% of the affected lines; for the other ~45%, the logged request method looks like a fragment of the HTTP headers [1].

For about 8% of the affected lines, the URL column points to the mobile version of jawiki or ptwiki; none of those show a User-Agent, Referer, or Accept-Language. For >80% of the affected lines, the URL column is upload.wikimedia.org. For the remaining ~10%, the URL column does not make much sense, for example “http://-en-US,en;q=0.8”, so here too header information spilled over into the column.

The requested URLs are not distributed all over the place, but rather isolated. For example, the requested URL for ~33% of the affected lines was http://upload.wikimedia.orgHTTP/1.1 and for ~12% it was http://upload.wikimedia.org/wikipedia/en/c/ca/2M_TV_(logo).gif"" . The requests are not clustered around a certain date/time, but span the whole range between 2013-12-10 and now. The referrers are “-” for ~60% of the affected lines, and some URL from mobile enwiki for ~32%. The logged User-Agents and Accept-Languages look sane.

[1] Here are some examples:

  Accept-Language:
  x-Mercedes-Benz_Three_Point_Star.jpg
  exagonal_Icon.svg.png
  .svg.png
  Referer:
  HTTP/1.1

To see more of them, run

  cut -f 8 /home/qchris/nan.tsv | sort | uniq -c | sort -n

on stat1002.
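The distributions above can be reproduced in the same way from the extracted affected lines. A sketch (assuming /home/qchris/nan.tsv keeps the full TSV column layout; the field indices 1 for the host name and 9 for the URL are assumptions about that layout):

  # Affected lines per cache host (field 1 assumed to hold the host name)
  cut -f 1 /home/qchris/nan.tsv | sort | uniq -c | sort -rn

  # Most common URL-column values (field 9 assumed to hold the URL)
  cut -f 9 /home/qchris/nan.tsv | sort | uniq -c | sort -rn | head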
Hi Christian, Erik -- what's the impact of this? Do we need to spend some time working with ops to track it down? thanks, -Toby
Per Andrew/Stefan -- this ends up causing problems with Hive. Magnus has created a patch; we hope to get this deployed this week.
(In reply to comment #2)
> what's the impact of this?

It's not super critical. But since it seems to be more than just the nans, more lines than the initial few may be affected.

> Do we need to spend some time working with ops to track it down?

Ops already know and have had a first look :-) It seems only the nans started on 2013-12-10. Some bogus requests were there before, but it seems they were treated differently. Not sure yet.
(In reply to comment #3)
> Per Andrew/Stefan -- this ends up causing problems with Hive.
>
> Magnus has created a patch; we hope to get this deployed this week.

Yes, but that'll only address the issue on the varnishkafka side. varnishncsa (and hence udp2log, webstatscollector, and the TSVs) does not benefit from this fix.
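Until the varnishncsa side is fixed too, downstream consumers of the TSVs could drop the affected lines before ingestion. A workaround sketch, not an official fix (again assuming the service time is the fourth tab-separated column; the output file name is just an example):

  # Strip lines with a literal "nan" service time before feeding the TSV to Hive etc.
  # Field index 4 is an assumption about the column layout.
  zcat /a/squid/archive/sampled/sampled-1000.tsv.log-20131210.gz \
    | awk -F'\t' '$4 != "nan"' > sampled-1000.tsv.log-20131210.cleaned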
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1391