Last modified: 2014-10-21 10:43:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from displaying bug reports and their history, links might be broken. See T73876, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71876 - Raw webrequest partitions for 2014-10-08T23:xx:xx not marked successful
Status: RESOLVED WONTFIX
Product: Analytics
Classification: Unclassified
Component: Refinery (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 72298
Reported: 2014-10-09 11:42 UTC by christian
Modified: 2014-10-21 10:43 UTC
CC List: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description christian 2014-10-09 11:42:35 UTC
For the hour 2014-10-08T23:xx:xx, bits, text, and upload [1] were not
marked successful.

What happened?



[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 10:55:54 // exit code: 0
cwd: ~/cluster-scripts
./dump_webrequest_status.sh 
  +---------------------+--------+--------+--------+--------+
  | Date                |  bits  |  text  | mobile | upload |
  +---------------------+--------+--------+--------+--------+
[...]
  | 2014-10-08T21:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-10-08T22:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-10-08T23:xx:xx |    X   |    X   |    .   |    X   |    
  | 2014-10-09T00:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-10-09T01:xx:xx |    .   |    .   |    .   |    .   |    
[...]
  +---------------------+--------+--------+--------+--------+


Statuses:

  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)
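For context, the duplicate/missing check presumably works off a
per-host sequence number in the raw webrequest data. A minimal Python
sketch, assuming records expose (hostname, sequence) pairs -- this is
not the actual refinery implementation:

  from collections import Counter, defaultdict

  def partition_status(records):
      """records: iterable of (hostname, sequence) pairs for one hour."""
      seqs_by_host = defaultdict(Counter)
      for host, seq in records:
          seqs_by_host[host][seq] += 1

      status = {}
      for host, counts in seqs_by_host.items():
          # A sequence number seen more than once counts as a duplicate.
          duplicates = sum(c - 1 for c in counts.values())
          # Gaps between the smallest and largest sequence number count as missing.
          expected = max(counts) - min(counts) + 1
          missing = expected - len(counts)
          status[host] = (duplicates, missing)
      return status

A partition would then be flagged 'X' if any of its hosts shows
duplicates, missing sequence numbers, or NULLs.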
Comment 1 christian 2014-10-09 11:43:02 UTC
For bits and upload we saw both duplicates and missing.
For text we only saw duplicates.
The affected period was 23:02:00 -- 23:11:00.

It seems to have been a ulsfo glitch, as only ulsfo hosts were
affected.

In total ~5M duplicates and ~2M missing:

+---------+--------+-------------+-----------+
| cluster | host   | # duplicate | # missing |
+---------+--------+-------------+-----------+
| bits    | cp4001 |      183537 |         0 |
| bits    | cp4002 |      517220 |    218080 |
| bits    | cp4003 |      381150 |    143408 |
| bits    | cp4004 |      266275 |         0 |
| text    | cp4008 |       26116 |         0 |
| text    | cp4009 |      215291 |         0 |
| text    | cp4010 |      126667 |         0 |
| text    | cp4016 |        1904 |         0 |
| text    | cp4018 |      167577 |         0 |
| upload  | cp4005 |      592259 |    352364 |
| upload  | cp4006 |      581932 |    340563 |
| upload  | cp4007 |      507600 |    299497 |
| upload  | cp4013 |      592460 |    389971 |
| upload  | cp4014 |      408414 |     61688 |
| upload  | cp4015 |      560605 |    291017 |
+---------+--------+-------------+-----------+

Since being ~7M off is within the acceptable range for our streams, I
marked the streams "ok" by hand.
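As a sanity check, the ~5M / ~2M figures follow directly from the
table above (Python, rows copied verbatim):

  rows = [
      ("bits",   "cp4001", 183537,      0),
      ("bits",   "cp4002", 517220, 218080),
      ("bits",   "cp4003", 381150, 143408),
      ("bits",   "cp4004", 266275,      0),
      ("text",   "cp4008",  26116,      0),
      ("text",   "cp4009", 215291,      0),
      ("text",   "cp4010", 126667,      0),
      ("text",   "cp4016",   1904,      0),
      ("text",   "cp4018", 167577,      0),
      ("upload", "cp4005", 592259, 352364),
      ("upload", "cp4006", 581932, 340563),
      ("upload", "cp4007", 507600, 299497),
      ("upload", "cp4013", 592460, 389971),
      ("upload", "cp4014", 408414,  61688),
      ("upload", "cp4015", 560605, 291017),
  ]

  total_duplicates = sum(r[2] for r in rows)   # 5,129,007 -> ~5M
  total_missing    = sum(r[3] for r in rows)   # 2,096,588 -> ~2M

Together that is the ~7M mentioned above.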



On the Kafka brokers, the only things that looked related were
exceptions like

  [2014-10-08 23:46:04,585] 1658280892 [kafka-request-handler-9] ERROR kafka.server.KafkaApis  - [KafkaApi-21] Error when processing fetch request for partition 
  [webrequest_upload,5] offset 34174421023 from consumer with correlation id 2
  kafka.common.OffsetOutOfRangeException: Request for offset 34174421023 but we only have log segments in the range 37690961900 to 40340044788.
          at kafka.log.Log.read(Log.scala:380)
          [...]

6 times on analytics1012 affecting each of
                      webrequest_upload,3
                      webrequest_upload,7
                      webrequest_upload,11
                   twice.
6 times on analytics1018 affecting each of
                      webrequest_upload,0
                      webrequest_upload,4
                      webrequest_upload,8
                   twice.
6 times on analytics1021 affecting each of
                      webrequest_upload,1
                      webrequest_upload,5
                      webrequest_upload,9
                   twice.
6 times on analytics1022 affecting each of
                      webrequest_upload,2
                      webrequest_upload,6
                      webrequest_upload,10
                   twice.

All those 24 exceptions were around 23:46.
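For reference, the per-broker/per-partition tally above can be
reproduced by scanning the broker logs for the fetch-error line shown
in the excerpt. A rough Python sketch; the log paths are placeholders,
not the actual broker log locations:

  import re
  from collections import Counter

  FETCH_ERROR = re.compile(
      r"Error when processing fetch request for partition\s*"
      r"\[(?P<topic>\w+),(?P<partition>\d+)\]"
  )

  def count_fetch_errors(log_paths):
      """Tally fetch-error entries per (log file, topic, partition)."""
      counts = Counter()
      for path in log_paths:
          with open(path) as f:
              for m in FETCH_ERROR.finditer(f.read()):
                  counts[(path, m.group("topic"), int(m.group("partition")))] += 1
      return counts

Per the counts above, each of the four brokers would show six matches:
two for each affected webrequest_upload partition.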

Checking the affected caches in Ganglia, I noticed that some
readings are missing around that time.

SAL did not show anything relevant, but the #wikimedia-operations
channel had:

[22:59:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR  
[...]
[23:00:55] <icinga-wm>	 PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192  
[...]
[23:03:04] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]  
[23:03:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8  with snmp version 2  
[23:05:16] <icinga-wm>	 RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0  
[...]
[23:07:04] <icinga-wm>	 PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR  
[...]
[23:10:14] <icinga-wm>	 PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192  
[23:12:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8  with snmp version 2  
[...]
[23:15:05] <icinga-wm>	 PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR  
[23:15:05] <icinga-wm>	 RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0  
[23:16:14] <icinga-wm>	 PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail  
[...]
[23:16:45] <icinga-wm>	 PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail  
[23:17:45] <icinga-wm>	 PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail  
[...]
[23:18:54] <icinga-wm>	 PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail  
[...]
[23:19:05] <icinga-wm>	 PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures  
[23:19:26] <icinga-wm>	 PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures  
[...]
[23:23:56] <icinga-wm>	 RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures  

Afterwards, services recovered, so this looks like a general ulsfo issue.


