Last modified: 2014-10-31 12:54:58 UTC
The bits and upload webrequest partitions [1] for 2014-10-30T21/1H have not been marked successful. What happened?

[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 08:12:06 // exit code: 130
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh
+------------------+--------+--------+--------+--------+
| Date             | bits   | mobile | text   | upload |
+------------------+--------+--------+--------+--------+
[...]
| 2014-10-30T19/1H | .      | .      | .      | .      |
| 2014-10-30T20/1H | .      | .      | .      | .      |
| 2014-10-30T21/1H | X      | .      | .      | X      |
| 2014-10-30T22/1H | .      | .      | .      | .      |
| 2014-10-30T23/1H | .      | .      | .      | .      |
[...]
+------------------+--------+--------+--------+--------+
Statuses:
    . --> Partition is ok
    M --> Partition manually marked ok
    X --> Partition is not ok (duplicates, missing, or nulls)
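For reference, duplicates of this kind can be located per cache host by counting repeated sequence numbers in the raw data. A minimal sketch, assuming the raw import lives in a table like wmf_raw.webrequest with hostname and sequence fields and webrequest_source/year/month/day/hour partitions (the exact table and field names are an assumption here):

# Count duplicate sequence numbers per cache host for the flagged hour.
# Table/field names are assumed, not confirmed in this ticket.
hive -e "
  SELECT hostname,
         COUNT(*)                            AS messages,
         COUNT(DISTINCT sequence)            AS distinct_sequences,
         COUNT(*) - COUNT(DISTINCT sequence) AS duplicates
  FROM   wmf_raw.webrequest
  WHERE  webrequest_source = 'bits'
    AND  year = 2014 AND month = 10 AND day = 30 AND hour = 21
  GROUP  BY hostname
  HAVING COUNT(*) > COUNT(DISTINCT sequence);
"

Running the same query with webrequest_source = 'upload' covers the second flagged partition.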
For bits, it only affected cp3020. The affected period is 2014-10-30T21:25:41/2014-10-30T21:26:26. No lost messages, only 70660 duplicates, which is <2 seconds worth of data for bits.

For upload, it only affected cp3018. The affected period is 2014-10-30T21:25:18/2014-10-30T21:26:10. No lost messages, only 34087 duplicates, which is <2 seconds worth of data for upload.

(How such per-host windows can be derived is sketched at the end of this comment.)

I could not find anything relevant in puppet, nor in SAL. It's again only esams.

According to ganglia, the Max of kafka.rdkafka.brokers.*.rtt.avg went up during that time on
* cp3018 to 6.0M for analytics1018
* cp3020 to 12.6M for analytics1018
But other caches had even higher Max values for that average (
* cp3019 had 36.7M for analytics1021
* cp3010 had 11.8M for analytics1021
* cp3010 had 8.8M for analytics1022
), yet did not show duplicates.

According to ganglia, the Max of kafka.rdkafka.brokers.*.outbuf_cnt went up during that time on
* cp3018 to 334.9 for analytics1022 (not analytics1018! It had a Max of only 28.4 for analytics1018)
* cp3020 to 720.8 for analytics1018
But cp3019 had 479 for analytics1021 (i.e. a similar Max value), yet did not show duplicates.
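The affected periods and duplicate counts above come from the timestamps of the duplicated sequence numbers per host. A sketch, again against the assumed wmf_raw.webrequest schema (dt taken to be the request timestamp field):

# Affected window and number of extra copies per host.
# Inner query groups by (hostname, sequence); any group with cnt > 1
# is a duplicated message, and its timestamps bound the window.
hive -e "
  SELECT hostname,
         MIN(first_dt)       AS window_start,
         MAX(last_dt)        AS window_end,
         SUM(cnt) - COUNT(*) AS extra_copies
  FROM (
    SELECT hostname, sequence,
           MIN(dt) AS first_dt, MAX(dt) AS last_dt, COUNT(*) AS cnt
    FROM   wmf_raw.webrequest
    WHERE  webrequest_source = 'upload'
      AND  year = 2014 AND month = 10 AND day = 30 AND hour = 21
    GROUP  BY hostname, sequence
  ) seqs
  WHERE  cnt > 1
  GROUP  BY hostname;
"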