Last modified: 2014-10-30 18:54:28 UTC
The bits webrequest partition [1] for 2014-10-26T21/1H has not been marked successful.
What happened?

[1] _________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 07:51:53 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh
  +------------------+--------+--------+--------+--------+
  | Date             |  bits  | mobile |  text  | upload |
  +------------------+--------+--------+--------+--------+
[...]
  | 2014-10-26T19/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T20/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T21/1H |    X   |    .   |    .   |    .   |
  | 2014-10-26T22/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T23/1H |    .   |    .   |    .   |    .   |
[...]
  +------------------+--------+--------+--------+--------+

Statuses:
  . --> Partition is ok
  M --> Partition manually marked ok
  X --> Partition is not ok (duplicates, missing, or nulls)
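For readers unfamiliar with what "not ok (duplicates, missing, or nulls)" could mean in practice, here is a minimal sketch of such a check. It is not the actual implementation behind dump_webrequest_status.sh. It assumes each record carries a hostname and a per-host, monotonically increasing sequence number; the function name `check_partition` and the `(hostname, sequence)` record shape are hypothetical and chosen only for illustration.

```python
from collections import Counter


def check_partition(records):
    """Classify one hourly partition as ok / not ok.

    `records` is an iterable of (hostname, sequence) tuples.
    ASSUMPTION: sequence numbers increase by 1 per request on each host,
    so a gap means lost records and a repeat means a duplicate.
    """
    nulls = 0
    seqs_by_host = {}
    for host, seq in records:
        if host is None or seq is None:
            nulls += 1  # a record with a missing field counts as a null
            continue
        seqs_by_host.setdefault(host, []).append(seq)

    report = {}
    for host, seqs in seqs_by_host.items():
        counts = Counter(seqs)
        # Every occurrence beyond the first of a sequence number is a duplicate.
        duplicates = sum(c - 1 for c in counts.values())
        # Distinct numbers expected between the lowest and highest one seen.
        expected = max(seqs) - min(seqs) + 1
        missing = expected - len(counts)
        report[host] = {"duplicates": duplicates, "missing": missing}

    ok = nulls == 0 and all(
        r["duplicates"] == 0 and r["missing"] == 0 for r in report.values()
    )
    return ok, nulls, report
```

Calling `check_partition` on the records of a single hour returns whether the partition would count as ok, plus a per-host breakdown, which is the kind of information that lets one narrow a failed partition down to a single host.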
Only cp3019 is affected. For that host, ~55 seconds' worth of data was lost in the ~1 minute between 2014-10-26T21:16:22 and 2014-10-26T21:17:24. I could not find any changes in puppet, dns, or SAL that look relevant. cp3019 (like all other esams caches) is gone from ganglia, so it's hard for non-Ops to see further data from cp3019 itself. Icinga shows the “Varnishkafka Delivery Errors” service in status WARNING since 2014-10-24 17:11:57 (but the same holds true for the other esams caches too).
Kafka logs did not show peculiar entries in the relevant period of time.
ganglia shows data for esams caches again, but the data between ~2014-10-24T12 and ~2014-10-27T16 is missing (a window that includes the minute where we had the cp3019 issues). Judging from the cumulative counters, neither varnish nor varnishkafka got restarted on cp3019. ottomata ... since I cannot find any explanation, does cp3019 or 2014-10-26T21:16 ring a bell for you? Was there some other migration/testing/network issue that I am missing?
ottomata had a look at the logs on cp3019 and said that there were produce errors about full buffers. So we're writing it off as temporary network issues for now.