Last modified: 2014-08-14 11:24:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T71371, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 69371 - zero.log contains duplicate host in logs
zero.log contains duplicate host in logs
Status: RESOLVED INVALID
Product: Analytics
Classification: Unclassified
General/Unknown (Other open bugs)
unspecified
All All
: Unprioritized normal
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-08-10 18:23 UTC by Yuri Astrakhan
Modified: 2014-08-14 11:24 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Graph of the bad host counts per day as detected in zero logs (78.18 KB, image/png)
2014-08-11 17:58 UTC, Yuri Astrakhan
Details

Description Yuri Astrakhan 2014-08-10 18:23:25 UTC
yurik@stat1002:/a/squid/archive/zero$ zgrep orghttp zero.tsv.log-20140810.gz

produces a significant number of broken entries, whose URL is duplicated:

http://en.m.wikipedia.orghttp://en.m.wikipedia.org/
http://en.m.wikipedia.orghttp://en.m.wikipedia.org/favicon.ico
etc

The first entry is just the host with protocol, the second is identical protocol+host with the rest of the URL.

Stats in the last few days (I also ran it for the august of last year - similar numbers, so this is not a recent bug)

yurik@stat1002:/a/squid/archive/zero$ zgrep -c  orghttp zero.tsv.log-201408*
zero.tsv.log-20140801.gz:183
zero.tsv.log-20140802.gz:215
zero.tsv.log-20140803.gz:167
zero.tsv.log-20140804.gz:170
zero.tsv.log-20140805.gz:144
zero.tsv.log-20140806.gz:235
zero.tsv.log-20140807.gz:272
zero.tsv.log-20140808.gz:259
zero.tsv.log-20140809.gz:169
zero.tsv.log-20140810.gz:173


Host distribution:
$ sort tmp | uniq --count
      1 cp1046.eqiad.wmnet
      4 cp1047.eqiad.wmnet
      8 cp1059.eqiad.wmnet
      6 cp1060.eqiad.wmnet
     51 cp3011.esams.wikimedia.org
     58 cp3012.esams.wikimedia.org
     62 cp3013.esams.wmnet
     57 cp3014.esams.wmnet
      4 cp4011.ulsfo.wmnet
      1 cp4012.ulsfo.wmnet
      4 cp4019.ulsfo.wmnet
      3 cp4020.ulsfo.wmnet
Comment 1 nuria 2014-08-11 16:08:11 UTC
Looking into these. Will look to see if there is anything common to all urls with this string 'orghttp'. We should probably entertain the possibility of thee being actual real requests.
Comment 2 nuria 2014-08-11 17:09:26 UTC
This problem is also present on sampled logs, mobile and api since records are kept on stats1002. That means since about May 2013. I imagine there must have been another bug fixed on this regard a while back as it seems real visible. Will try to do some searches on bugzilla.
Comment 3 Kevin Leduc 2014-08-11 17:31:55 UTC
If you compare the number of occurrences to the number of lines in the file, this seems to happen less than 0.025% of the time.  This seems insignificant.  Could this just be corrupt data to ignore?

kleduc@stat1002:/a/squid/archive/zero$ find . -name "zero.tsv.log-201408*" |xargs wc -l
    750507 ./zero.tsv.log-20140801.gz
    858969 ./zero.tsv.log-20140802.gz
    911424 ./zero.tsv.log-20140803.gz
    924049 ./zero.tsv.log-20140804.gz
    990494 ./zero.tsv.log-20140805.gz
   1099479 ./zero.tsv.log-20140806.gz
   1080449 ./zero.tsv.log-20140807.gz
   1103330 ./zero.tsv.log-20140808.gz
   1214806 ./zero.tsv.log-20140809.gz
   1247318 ./zero.tsv.log-20140810.gz
   1266218 ./zero.tsv.log-20140811.gz
Comment 4 Yuri Astrakhan 2014-08-11 17:57:09 UTC
Sure, we could ignore it, but i am worried if this could point to some bigger issue.  Parsing all zero.tsv* files i noticed a large number of other strange items - highly broken URLs that still return miss/200 result, some of which are images.  Please take a look at my home dir (i think its public) at

  yurik@stat1002:~/zero-sms/scripts$ grep '/200' *.txt

Various IP addresses and random hosts (e.g. 0.facebook.com) keep appearing, and get resolved just fine by the backend, even though they clearly shouldn't.

My code does regex substitute on all URLs now:  ^(https?://.+)\1   ->   \1
But not counting those, there are 35000 other host parsing errors in the logs, with some big spikes (will attach graph of them).
Comment 5 Yuri Astrakhan 2014-08-11 17:58:02 UTC
Created attachment 16166 [details]
Graph of the bad host counts per day as detected in zero logs
Comment 6 nuria 2014-08-12 10:58:33 UTC
>Parsing all zero.tsv* files i noticed a large number 
>of other strange items - highly broken URLs that still return miss/200 result.

I think we are missing issues here. I will address this comment but let's try to focus on 'orghttp' problem first.

The fact that "highly broken" urls return miss/200 doesn't seem that strange if we are caching negative image lookups on the edge-caches and returning 200 for those (negative lookup meaning an image request we do not have in origin). So, without knowing more about our caching infrastructure, I would not flag this as a problem. Now, let's get back to this once we have taken a deeper look at "orghttp" issue. 


>Various IP addresses and random hosts (e.g. 0.facebook.com) keep appearing, 
>and get resolved just fine by the backend, even though they clearly shouldn't.

There are requests with "0.facebook.com" on the User_Agent (facebook-bot) and requests with "facebook" on the referrer. I see nothing obviously wrong with either.

After there are (very few) requests like : http://zero.wikipedia.org/uc6.ucweb.com:80/?0.facebook.com/?zout=1&ucid=6883926810-f3e4e4a2 that seem to come from the same kind of phone.
Comment 7 nuria 2014-08-12 11:01:42 UTC
I think we are missing issues here -> "I think we are MIXing issues"
Comment 8 nuria 2014-08-12 19:42:39 UTC
Please take a look at our logging format on varninsh:

https://git.wikimedia.org/blob/operations%2Fpuppet/production/modules%2Fvarnish%2Ffiles%2Fvarnishncsa.default#L9

The interesting part: http://%{Host}i%U%q

%{Host}i : logs host header
%U request The request URL without any query string. Defaults to '-' if not known.
%q The query string, if no query string exists, an empty string.

Looking at patterns on urls with the "orghttp" string in zero logs I see little in common, the big majority of requests with faulty logging are for "text/html" content type, under http and they are cache misses (200/miss).
Comment 9 nuria 2014-08-13 09:36:29 UTC
Summing up, somehow we are generating requests in the client like: "http://es.m.wikipedia.org/http://es.m.wikipedia.org/wiki/Wikipedia:Portada"

which, according to http are valid requests, those requests depending on the target project are responded in different manners, 404, 304, miss/200

Overall, we do no think this is a problem when it comes to login requests. It might be a problem on the client as those requests should not be generated but with the format we are login if requests like "http://es.m.wikipedia.org/http://es.m.wikipedia.org/wiki/Wikipedia:Portada"
come along they will be logged like: http://es.m.wikipedia.orghttp://es.m.wikipedia.org/wiki/Wikipedia:Portada
Comment 10 nuria 2014-08-13 16:16:12 UTC
Need to look at IP ranges as I looked at languages and wikipedia's for geographic commonality and that might not be the best.
Comment 11 nuria 2014-08-13 16:51:26 UTC
I take my prior comment back, this looks like a proxy issue not client issue. Data 
below for requests that match "orghttp" in the month of August thus far in zero, mobile and sampled.

1. zero 
Most requests come from  41.202.219.* (41.202.219.66 for example) in Cameroon 
<snipet>     
     16 101.222.232.111
     34 101.222.224.76
     43 41.202.219.77
     57 41.202.219.76
     59 41.202.219.65
     66 41.202.219.79
     80 41.202.219.68
     84 41.202.219.71
     88 41.202.219.78
     93 41.202.219.75
     96 41.202.219.69
    100 41.202.219.72
    101 41.202.219.74
    102 41.202.219.64
    132 41.202.219.67
    132 41.202.219.73
    134 41.202.219.70
    170 41.202.219.66
</snipet>

2.mobile
<snipet>
     12 67.205.74.205
     14 122.96.59.102
     20 41.202.219.74
     83 202.67.40.50
    104 202.67.41.51
   4368 203.223.95.248 -> Bangladesh
</snipet>


3.sampled
<snipet>
     75 208.43.239.42
     89 194.116.187.25
    176 195.74.38.153
    264 180.179.40.100
    282 172.15.215.113
    784 202.67.40.50
    804 203.223.95.248
   1036 202.67.41.51
   3674 91.121.6.4 
</snipet>
Comment 12 nuria 2014-08-13 17:25:56 UTC
IPs with most issues in zero are not the most used IPs so, again, this points to a proxy issue.
Comment 13 Yuri Astrakhan 2014-08-13 19:39:17 UTC
Nuria, are you saying one of our proxies is causing this? Or is it some common proxy software that many carriers are using that sets incorrect HOST value when forwarding request?
Comment 14 nuria 2014-08-13 19:41:47 UTC
Well, neither. By looking at the data looks to be caused by a proxy but I do not think is "common" software as the percentage data affected seems pretty small.
Comment 15 nuria 2014-08-14 11:24:09 UTC
You can reproduce one of this requests by using telnet as follows. Please note "host" and "url below

[prompt]$ telnet 91.198.174.204 80
Trying 91.198.174.204...
Connected to mobile-lb.esams.wikimedia.org.
Escape character is '^]'.
GET http://fr.m.wikipedia.org/favicon.ico HTTP/1.1
Host: fr.m.wikipedia.org
Connection: close
 
HTTP/1.1 200 OK
Server: Apache
X-Content-Type-Options: nosniff
Expires: Fri, 15 Aug 2014 11:12:17 +0000
Content-Type: image/x-icon
X-Varnish: 2057974935 2057973396, 2628462505, 3056161189 3056101957
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
Content-Length: 15086
Accept-Ranges: bytes
Date: Thu, 14 Aug 2014 11:13:43 GMT
Age: 86
Connection: close
X-Cache: cp1047 hit (1), cp3014 miss (0), cp3011 frontend hit (8)
Cache-Control: s-maxage=3600, max-age=3600
Set-Cookie: GeoIP=ES:Madrid:40.4086:-3.6922:v4; Path=/; Domain=.wikipedia.org

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links