Last modified: 2013-10-05 17:31:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T55806, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 53806 - Clarify zero markers in squid logs for IPs that should not get one


Summary:	Clarify zero markers in squid logs for IPs that should not get one

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:	54779
Blocks:
	Show dependency tree / graph

Reported:	2013-09-05 17:19 UTC by Diederik van Liere
Modified:	2013-10-05 17:31 UTC (History)
CC List:	8 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Diederik van Liere 2013-09-05 17:19:51 UTC

In the squid logs at /a/squid/archive/zero/zero.tsv.log-* on stat1002, we have zero markers for IPs that should not get one.

Examples

Run

zgrep 198278752 /a/squid/archive/zero/zero.tsv.log-20130731.gz
on stats1002.

The printed request holds a zero marker from Beeline Russia (zero=250-99), but the client IP is from Austria and not listed in Beeline Russia's IPs.

The request does not hold a Forwarded request.

The problem is not limited to Beeline Russia or Austrian IPs, as can be seen by running

zgrep 2215263992 /a/squid/archive/zero/zero.tsv.log-20130731.gz
on stat1002.

Comment 1 Diederik van Liere 2013-09-05 17:21:59 UTC

Prioritization and scheduling of this bug is tracked on Mingle card https://mingle.corp.wikimedia.org/api/v2/projects/analytics//https://mingle.corp.wikimedia.org/api/v2/projects/analytics/cards/1137.xml

Comment 2 Diederik van Liere 2013-09-05 21:46:16 UTC

Prioritization and scheduling of this bug is tracked on Mingle card
https://mingle.corp.wikimedia.org/projects/analytics/cards/1137 (correct link)

Comment 3 Toby Negrin 2013-09-10 20:50:31 UTC

From Mark Bergsma:

Hi Diederik,

On Sep 5, 2013, at 12:13 PM, Diederik van Liere <dvanliere@wikimedia.org> wrote:
> We are seeing some odd tagging of Zero traffic where requests are being tagged as Zero traffic while it does not fall within the cidr range of a mobile carrier.
>
> Example:
>
> cp3011.esams.wikimedia.org        198278752        2013-07-30T06:47:59        0.000113249        178.115.XXX.XXX        hit/200        345        GET        http://ru.m.wikipedia.org/w/api.php?format=json&search=%D0%A1&action=opensearch&namespace=0&limit=15        -        application/json; charset=utf-8        http://ru.m.wikipedia.org/wiki/%d0%a8%d0%b2%d0%b5%d1%86%d0%b8%d1%8f        -        Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; NOKIA; Lumia 610)        de-AT        zero=250-99

[snip]

> P.S.: As the examples show, the issue is not simply the X-Forwarded-For header, as for example for [1], neither 66.54.XXX.XXX, nor 1.38.XXX.XXX is in the ranges of Aircell India, and the Beeline example above comes without a X-Forwarded-For setting.
>
> [1] Here is a recent example for Aircell India:
> cp1046.eqiad.wmnet      4331766631      2013-09-04T06:31:38     0.000135422     66.54.XXX.XXX     hit/200 20      GET     http://meta.m.wikimedia.org/wiki/Special:RecordImpression?result=hide&reason=empty&country=IN&uselang=en&project=wikipedia&db=enwiki&bucket=0&anonymous=true&device=unknown    -       image/png     http://en.m.wikipedia.org/wiki/Sexy      1.38.XXX.XXX      Mozilla/5.0 (Series40; Nokia206/03.58; Profile/MIDP-2.1 Configuration/CLDC-1.1) Gecko/20100401 S40OviBrowser/2.2.0.0.36        en      zero=404-01

Thanks for flagging this. I'll investigate it when I get an opportunity.

I've already looked into the XFF header, which might be the culprit still. We (almost) always add an XFF header at the start of a request. However, unfortunately it's not being logged, as the logging program just checks what headers were received from the client, not what we add in VCL...

Comment 4 Andrew Otto 2013-09-11 18:42:56 UTC

Ok, so Adam and Christian and I just went through the VCL logic a bit.  I don't think I have a full handle on all the steps varnish takes to tag things, but it doesn't seem like this particular issue is caused by a mistaken XFF problem.  vcl_recv calls spoof_clientip (which conditionally replaces the client_ip with a value from XFF) BEFORE it calls tag_carrier.  At this point, it doesn't actually matter if the client_ip was not set correctly by spoof_clientip.  Whatever client_ip is set to, tag_carrier should use that to conditionally set the X-CS header.

We're checking examples where the client_ip in the varnishncsa logs does belong to the tagged X-CS acl.  And we have no idea how that could happen.

Ja, there might be some weird logic happening in spoof_clientip that is not working with XFF properly, but that seems irrelevant to this problem.

How could X-CS not match the client_ip?

Comment 5 christian 2013-09-11 21:36:46 UTC

Looking further, we do not log the X-Analytics header of the request, but
the X-Analytics header of the response. Could it be that the caching
interferes here?

Doing some quick check for the 404-01 carrier (India) seems to
somewhat support this, as ~30% of hits come misattributed, while only
~1% of the misses do. It a least shows a really heavy bias that hits
are more affected than misses.

Could it be (wild guess) that upon a cache hit, varnish uses the
X-Analytics header of the cached document?





If the caching is not the issue... I am not sure if this is related,
but X-Analytics column in the logs is not always
  zero=XXX-XX
We also have log lines where the X-Analytics column
  zero=XXX-XX;mf-m=a
  zero=XXX-XX;mf-m=b
and in the sampled-1000 (not limited to mobile or zero) stream we also
see some
  mf-m=a
  mf-m=b
. I could not find this 'mf-m' in the puppet repo. And each of those
'mf-m' log lines is for a 'pass/200' request.

Comment 6 Diederik van Liere 2013-09-11 21:41:29 UTC

mf-m=a and mf-m=b are set by the MobileFrontEnd and indicate whether the alpha (a) or beta (b) MobileFrontend are used. Remember, X-Analytics is a list of <key,value> pairs of settings.

Comment 7 Toby Negrin 2013-09-11 22:21:18 UTC

Yuri has updated the mechanism by which traffic is assigned to Zero, so this may cause the symptom of this bug to go away.

Yuri -- can you let us know when we can check the data again to see if this is true?

Diederik believes Phaidon and Mark are still looking at this bug since it could be quite serious.

Comment 8 Andre Klapper 2013-09-16 08:34:52 UTC

(In reply to comment #7)
> Yuri -- can you let us know when we can check the data again to see if this
> is true?

Yuri: Any estimated timeframe?

Comment 9 Toby Negrin 2013-09-16 17:15:26 UTC

Current status (per Deiderik):

1. Phaidon/Mark debugging varnish

2. Yuri has the ball on the deploying the netmapper code.

Comment 10 Toby Negrin 2013-09-25 22:58:15 UTC

What's the current status on this? I know that various folks have been working on it.

thanks,

-Toby

Comment 11 dr0ptp4kt 2013-09-26 00:29:23 UTC

(In reply to comment #10)
> What's the current status on this? I know that various folks have been
> working
> on it.
> 
> thanks,
> 
> -Toby



If the netmapper stuff doesn't get rolled into production and fix this issue before I get to it, I'm going to provide the suggested configuration for logging a few extra fields in Varnish logs on a particular node in the cluster instead of all nodes (it sounds like turning this on for all nodes could be a challenge for other log consumer jobs). I'll need help from Andrew on actually getting it up and running for one node.

Comment 12 Diederik van Liere 2013-09-26 00:44:31 UTC

Coordination of this bug is happening on the Ops list because neither Mark, Brandon nor Faidon are CC'ed on this bug. I sent this update to the Ops list yesterday. As everybody is CC'ed on the Ops thread lets have the discussion there. 


I spoke with Brandon and Yuri and a quick update about the attempt to fix incorrect tagging of W0 traffic:

- we have the new varnish packages (with netmapper rolled inside)  on the mobile caches now, but not the other caches (where we don't need it at this time anyways)
- Brandon is checking whether varnish is acting normal (so far so good it seems)

Next steps (assuming that Varnish will continue to act normal):
- Brandon will make changes to the VCL config to replace the current hardcoded mobile ACLs with netmapper
- finish the rollout of the upgraded varnish package, from 3.0.3-rc1-wm14 to -wm16, so that the netmapper software is included on all varnish boxes

Comment 13 Gerrit Notification Bot 2013-09-26 13:18:08 UTC

Change 86079 had a related patch set uploaded by Faidon Liambotis:
varnish: remove X-CS/Analytics tag from vcl_fetch

https://gerrit.wikimedia.org/r/86079

Comment 14 Gerrit Notification Bot 2013-09-26 13:21:13 UTC

Change 86079 merged by Faidon Liambotis:
varnish: remove X-CS/Analytics tag from vcl_fetch

https://gerrit.wikimedia.org/r/86079

Comment 15 Diederik van Liere 2013-09-26 13:48:52 UTC

Patch has been merged.

Comment 16 christian 2013-09-26 15:33:36 UTC

Since the bug is closed, how did verify the problem actually has been solved?

Comment 17 christian 2013-09-29 12:49:35 UTC

Reopening, since (although the mistagged log lines went to almost zero) we do
not observe the expected drop in requests per carrier.

Additionally, we also see IPv6 addresses (without X-Forward-For) coming with
zero tags [1]. Such lines should not receive zero markers, as carriers do not
come with IPv6 addresses up to now. Those IPv6 zero lines are just a few per
day, but they might hint at a further issue in the logging infrastructure.



[1] Run for example on stat1002:
  zgrep 3779252990 /a/squid/archive/zero/zero.tsv.log-20130928.gz

Comment 18 christian 2013-09-29 17:42:07 UTC

(In reply to comment #17)
> Additionally, we also see IPv6 addresses (without X-Forward-For) coming with
> zero tags [...]

Those were just IPv4 in IPv6 addresses.
Netmapper seems to be aware of them, while varnish acls were not.

Thanks Faidon for pointing that out!

Comment 19 christian 2013-10-05 17:23:43 UTC

Although there are no more mistagged lines since 2013-09-26 22:57,
there was some further discussion outside of bz, as we did not see the
expected change in number of total zero requests.

Neither a change in total zero requests per carrier.

Neither a change in the distribution of User-Agents, Accept-Language,
or X-Forwarded-For distribution per carrier.

This does not match the bug description of the zero tags being wrong.

Adam however came up with a different explanation: Maybe the zero tags
were not wrong, but the client ips were wrong. This would nicely explain
what we saw before and after the fixes. We're tagging the same traffic
as before, but this time with correct IP addresses. And code-wise, it
seems plausible as well, as the switch to netmapper in
  1755cfe5d556f20acae6a9c3cb59e9b38e1327e3
removed the hacky spoof_clientip function.

Although the above commit has only been merged on 2013-09-26 23:22
(~25 minutes /after/ the last mistagged line), Adam's explanation
seems nevertheless plausible, as the fixes were tested manually on the
varnishes beforehand.

It seems the combination of Faidon's and Brandon's patches finally
fixed the bug for good, and allow to explain the data that we're seeing.

Thanks Faidon and Brandon!

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links