Last modified: 2014-03-21 17:46:49 UTC
Beginning with the 2014-03-21 files, zero tags may come doubled, like zero=250-99;zero=250-99 instead of zero=250-99. (I could not find doubled tags whose MCC/MNC values differ.)

At least the following are affected:

  /a/squid/archive/zero/zero.tsv.log-20140321.gz
  /a/squid/archive/sampl mobile/mobile-sampled-100.tsv.log-20140321.gz
  /a/squid/archive/sampled/sampled-1000.tsv.log-20140321.gz
  /a/log/webrequest/zero/zero.tsv.log-20140321.gz
  /a/log/webrequest/mobile/mobile-sampled-100.tsv.log-20140321.gz
  Raw data in Hadoop Hive's webrequest table

Since the first occurrence was on 2014-03-21T00:15:41, https://gerrit.wikimedia.org/r/#/c/119795/ may be relevant (it mangles zero tags and was merged around that time).
Patch in gerrit: https://gerrit.wikimedia.org/r/#/c/120010/
Patch was merged; please close the bug if the duplicates disappear. Is there an easy way to clean up the logs / Hadoop?
I checked the live udp2log stream: no more doubled zero tags appear now that the above fix has been merged.
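Regarding the question above about cleaning up the already-written logs: a minimal sketch of how the doubled tags could be collapsed, assuming the tags are ';'-separated (as in zero=250-99;zero=250-99) and that a repeat is always an exact copy, which matches what was observed in this bug. The function names and the tag_column parameter are hypothetical; which TSV column actually holds the tags would have to be checked against the real files.

```python
def dedupe_tags(field: str) -> str:
    """Drop exact repeats of ';'-separated tags, keeping the first
    occurrence of each tag and preserving the original order."""
    seen = set()
    kept = []
    for tag in field.split(";"):
        if tag not in seen:
            seen.add(tag)
            kept.append(tag)
    return ";".join(kept)


def clean_line(line: str, tag_column: int) -> str:
    """Dedupe the tags in one tab-separated log line.

    tag_column is a hypothetical 0-based index of the field holding the
    tags; it is an assumption, not taken from the real log layout.
    The trailing newline, if any, is stripped from the result.
    """
    fields = line.rstrip("\n").split("\t")
    fields[tag_column] = dedupe_tags(fields[tag_column])
    return "\t".join(fields)


if __name__ == "__main__":
    print(dedupe_tags("zero=250-99;zero=250-99"))
```

Only exact repeats are removed, so two zero tags with differing MCC/MNC values (none were found so far) would both be kept and could be inspected by hand.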