For the hour 2014-10-08T23:xx:xx, bits, text, and upload [1] were not marked successful. What happened?

[1] _________________________________________________________________

qchris@stat1002 // jobs: 0 // time: 10:55:54 // exit code: 0
cwd: ~/cluster-scripts
./dump_webrequest_status.sh
+---------------------+--------+--------+--------+--------+
| Date                | bits   | text   | mobile | upload |
+---------------------+--------+--------+--------+--------+
[...]
| 2014-10-08T21:xx:xx | .      | .      | .      | .      |
| 2014-10-08T22:xx:xx | .      | .      | .      | .      |
| 2014-10-08T23:xx:xx | X      | X      | .      | X      |
| 2014-10-09T00:xx:xx | .      | .      | .      | .      |
| 2014-10-09T01:xx:xx | .      | .      | .      | .      |
[...]
+---------------------+--------+--------+--------+--------+

Statuses:
  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)
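For context, those statuses roughly reflect the per-host sequence numbers that varnishkafka attaches to each request: for every host in an hour, the row count is compared against the distinct sequence numbers and the span they cover, so any surplus shows up as duplicates and any gap as missing, and an hour is flagged X as soon as a cluster shows duplicates, missing rows, or NULL sequence numbers. A minimal sketch of that bookkeeping (not the production check; input format and names are assumptions):

#!/usr/bin/env python
# Minimal sketch (not the production check): derive per-host duplicate
# and missing counts for one hour of one webrequest stream from
# (hostname, sequence) pairs. Input format and names are assumptions.
from collections import defaultdict

def sequence_stats(records):
    """records: iterable of (hostname, sequence_number) tuples."""
    per_host = defaultdict(list)
    for host, seq in records:
        per_host[host].append(seq)
    stats = {}
    for host, seqs in per_host.items():
        distinct = set(seqs)
        duplicates = len(seqs) - len(distinct)
        # Gaps inside the observed sequence range count as missing;
        # losses right at the hour boundary are not visible this way.
        missing = (max(distinct) - min(distinct) + 1) - len(distinct)
        stats[host] = {'duplicate': duplicates, 'missing': missing}
    return stats

if __name__ == '__main__':
    demo = [('cp4005', 1), ('cp4005', 2), ('cp4005', 2), ('cp4005', 5)]
    print(sequence_stats(demo))
    # -> {'cp4005': {'duplicate': 1, 'missing': 2}}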
For bits and upload we saw both duplicates and missing requests; for text we only saw duplicates. The affected period was 23:02:00 -- 23:11:00. It seems to have been a ulsfo glitch, as only ulsfo hosts were affected. In total there were ~5M duplicates and ~2M missing:

+---------+--------+-------------+-----------+
| cluster | host   | # duplicate | # missing |
+---------+--------+-------------+-----------+
| bits    | cp4001 |      183537 |         0 |
| bits    | cp4002 |      517220 |    218080 |
| bits    | cp4003 |      381150 |    143408 |
| bits    | cp4004 |      266275 |         0 |
| text    | cp4008 |       26116 |         0 |
| text    | cp4009 |      215291 |         0 |
| text    | cp4010 |      126667 |         0 |
| text    | cp4016 |        1904 |         0 |
| text    | cp4018 |      167577 |         0 |
| upload  | cp4005 |      592259 |    352364 |
| upload  | cp4006 |      581932 |    340563 |
| upload  | cp4007 |      507600 |    299497 |
| upload  | cp4013 |      592460 |    389971 |
| upload  | cp4014 |      408414 |     61688 |
| upload  | cp4015 |      560605 |    291017 |
+---------+--------+-------------+-----------+

Since being ~7M requests off is within tolerance for our streams, I marked the streams "ok" by hand.

On the Kafka brokers, the only thing that looked related was exceptions like

[2014-10-08 23:46:04,585] 1658280892 [kafka-request-handler-9] ERROR kafka.server.KafkaApis - [KafkaApi-21] Error when processing fetch request for partition [webrequest_upload,5] offset 34174421023 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 34174421023 but we only have log segments in the range 37690961900 to 40340044788.
        at kafka.log.Log.read(Log.scala:380)
[...]

i.e. fetch requests for offsets outside the range of log segments the broker still holds. They occurred (a small tally sketch follows the IRC excerpt below):

  6 times on analytics1012, affecting each of webrequest_upload,3, webrequest_upload,7, and webrequest_upload,11 twice;
  6 times on analytics1018, affecting each of webrequest_upload,0, webrequest_upload,4, and webrequest_upload,8 twice;
  6 times on analytics1021, affecting each of webrequest_upload,1, webrequest_upload,5, and webrequest_upload,9 twice;
  6 times on analytics1022, affecting each of webrequest_upload,2, webrequest_upload,6, and webrequest_upload,10 twice.

All those 24 exceptions were around 23:46.

Checking the affected caches in Ganglia, I noticed that some readings are missing around that time.

SAL did not show anything relevant, but the #wikimedia-operations channel had

[22:59:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:00:55] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[...]
[23:03:04] <icinga-wm> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[23:03:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[23:05:16] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[...]
[23:07:04] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:10:14] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[23:12:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[...]
[23:15:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[23:15:05] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[23:16:14] <icinga-wm> PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[...]
[23:16:45] <icinga-wm> PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[23:17:45] <icinga-wm> PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[...]
[23:18:54] <icinga-wm> PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[...]
[23:19:05] <icinga-wm> PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:19:26] <icinga-wm> PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[...]
[23:23:56] <icinga-wm> RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures

Afterwards services recovered. So it looks like a general ULSFO issue.
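For reference, the per-broker exception tallies above can be reproduced by counting the ERROR lines per partition in each broker's server log; a minimal sketch, assuming the log is available as a local file (the path is whatever you pass on the command line):

#!/usr/bin/env python
# Minimal sketch: count Kafka "Error when processing fetch request"
# lines per topic partition. The log file location is an assumption;
# the line format follows the excerpt quoted above.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r"Error when processing fetch request for partition \[([^\]]+)\]")

counts = Counter()
with open(sys.argv[1]) as log:  # e.g. a copy of the broker's server log
    for line in log:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

for partition, occurrences in sorted(counts.items()):
    print("%s\t%d" % (partition, occurrences))

Run against each of the four brokers' logs, this prints one line per partition (e.g. "webrequest_upload,5" followed by its count), which is how the per-partition counts above read.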