
Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T61645, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 59645 - Lines with “nan” for “Request service time” column and empty HTTP status code in cache logs
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-01-04 14:46 UTC by christian
Modified: 2014-09-19 19:49 UTC
CC: 5 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description christian 2014-01-04 14:46:04 UTC
Since 2013-12-10 we're seeing between 15 and 40 lines per day in the
sampled-1000 cache logs that have “nan” in the “Request service time”
column [1].
For the mobile-sampled-100 cache logs it is 10 to 30 such lines per day [2].
The zero logs have not shown any affected lines so far.

The affected lines come with “-/” for the “Request status / HTTP
status code” column, and “Reply size” is “-”.



[1] The first occurrence for the sampled-1000 stream can be obtained by
running

  zgrep 49094481570 /a/squid/archive/sampled/sampled-1000.tsv.log-20131210.gz

on stat1002. That line is for cp1050.eqiad.wmnet on
2013-12-10T03:36:04.

[2] The first occurrence for the mobile-sampled-100 stream can be
obtained by running

  zgrep 9650126710 /a/squid/archive/mobile/mobile-sampled-100.tsv.log-20131210.gz

on stat1002. That line is for cp1047.eqiad.wmnet on
2013-12-10T02:01:24.
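
For a rough per-day count, a pipeline like the following could be run on
stat1002; the service-time field index (4) is an assumption about the TSV
layout, not something stated in this report:

  # count lines per daily archive whose (assumed) service-time field is "nan"
  for f in /a/squid/archive/sampled/sampled-1000.tsv.log-201312*.gz; do
    echo -n "$f: "
    zcat "$f" | awk -F'\t' '$4 == "nan"' | wc -l
  done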
Comment 1 christian 2014-01-04 14:46:24 UTC
The host names of the affected lines range over all hosts in
upload-eqiad, upload-esams, and mobile-eqiad, but over no other caches.

The request methods of the affected lines look mangled. HEAD+GET
accounts for ~55% of the affected lines. For the other ~45% of the
affected lines, the logged request method looks like part of an HTTP
header [1].

For about 8% of the affected lines, the URL column is the mobile
version of jawiki or ptwiki. None of those 8% show a User-Agent,
Referrer, or Accept-Language.

For >80% of the affected lines, the URL column is upload.wikimedia.org.

For the remaining ~10% of affected lines, the URL column does not make
much sense; they are for example “http://-en-US,en;q=0.8”. So here, too,
header information spilled over into this column.

The requested URLs are not distributed all over the place, but rather
concentrated on a few values. For example, the requested URL for ~33%
of the affected lines was:
  http://upload.wikimedia.orgHTTP/1.1
For ~12% it was:
  http://upload.wikimedia.org/wikipedia/en/c/ca/2M_TV_(logo).gif""

And those requests are not clustered around a certain date / time, but
span the whole date / time range between 2013-12-10 and now.

The referrers are “-” for ~60% of the affected lines, and for ~32% some
URL from mobile enwiki.

The logged User-Agents and Accept-Languages look sane.



[1] Here are some examples:
  Accept-Language:
  x-Mercedes-Benz_Three_Point_Star.jpg
  exagonal_Icon.svg.png
  .svg.png
  Referer:
  HTTP/1.1

To see more of them, run
  cut -f 8 /home/qchris/nan.tsv | sort | uniq -c | sort -n
on stat1002.
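
As a sketch, the URL breakdown above could be reproduced the same way as the
request-method breakdown; column 9 for the URL is an assumption about the TSV
layout, analogous to column 8 for the request method:

  # frequency of (assumed) URL column values among the affected lines
  cut -f 9 /home/qchris/nan.tsv | sort | uniq -c | sort -rn | head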
Comment 2 Toby Negrin 2014-01-07 14:18:18 UTC
Hi Christian, Erik -- what's the impact of this? Do we need to spend some time working with ops to track it down?

thanks,

-Toby
Comment 3 Toby Negrin 2014-01-07 15:10:39 UTC
Per Andrew/Stefan -- this ends up causing problems with Hive.

Magnus has created a patch; we hope to get this deployed this week.
Comment 4 christian 2014-01-07 21:32:47 UTC
(In reply to comment #2)
> what's the impact of this?

It's not super critical.
But since it seems to be more than just the nans, more than the initial
few lines may be affected.

> Do we need to spend some time working with ops to track it down?

Ops already know and had a first look :-)

It seems only the nans started on 2013-12-10. Some bogus requests were
there before, but they seem to have been treated differently.
Not sure yet.
Comment 5 christian 2014-01-07 21:34:10 UTC
(In reply to comment #3)
> Per Andrew/Stefan -- this ends up causing problems with Hive.
> 
> Magnus has created a patch; we hope to get this deployed this week.

Yes, but that'll only address the issue on the varnishkafka side.

varnishncsa (and hence udp2log, webstatscollector, and the TSVs) does not
benefit from this fix.
Comment 6 Bingle 2014-01-28 21:57:03 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1391


