For the hour 2014-10-08T23:xx:xx, bits, text, and upload [1] were not marked successful. What happened?

[1] _________________________________________________________________

qchris@stat1002 // jobs: 0 // time: 10:55:54 // exit code: 0
cwd: ~/cluster-scripts
./dump_webrequest_status.sh
+---------------------+--------+--------+--------+--------+
| Date                | bits   | text   | mobile | upload |
+---------------------+--------+--------+--------+--------+
[...]
| 2014-10-08T21:xx:xx | .      | .      | .      | .      |
| 2014-10-08T22:xx:xx | .      | .      | .      | .      |
| 2014-10-08T23:xx:xx | X      | X      | .      | X      |
| 2014-10-09T00:xx:xx | .      | .      | .      | .      |
| 2014-10-09T01:xx:xx | .      | .      | .      | .      |
[...]
+---------------------+--------+--------+--------+--------+

Statuses:
  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)
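For context, those statuses roughly reflect the per-host sequence numbers that varnishkafka attaches to each request: for every host in an hour, the row count is compared against the distinct sequence numbers and the span they cover, so any surplus shows up as duplicates and any gap as missing, and an hour is flagged X as soon as a cluster shows duplicates, missing rows, or NULL sequence numbers. A minimal sketch of that bookkeeping (not the production check; input format and names are assumptions):

#!/usr/bin/env python
# Minimal sketch (not the production check): derive per-host duplicate
# and missing counts for one hour of one webrequest stream from
# (hostname, sequence) pairs. Input format and names are assumptions.
from collections import defaultdict

def sequence_stats(records):
    """records: iterable of (hostname, sequence_number) tuples."""
    per_host = defaultdict(list)
    for host, seq in records:
        per_host[host].append(seq)
    stats = {}
    for host, seqs in per_host.items():
        distinct = set(seqs)
        duplicates = len(seqs) - len(distinct)
        # Gaps inside the observed sequence range count as missing;
        # losses right at the hour boundary are not visible this way.
        missing = (max(distinct) - min(distinct) + 1) - len(distinct)
        stats[host] = {'duplicate': duplicates, 'missing': missing}
    return stats

if __name__ == '__main__':
    demo = [('cp4005', 1), ('cp4005', 2), ('cp4005', 2), ('cp4005', 5)]
    print(sequence_stats(demo))
    # -> {'cp4005': {'duplicate': 1, 'missing': 2}}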
For bits and upload we saw both duplicates and missing requests; for text we only saw duplicates. The affected period was 23:02:00 -- 23:11:00. It seems to have been a ulsfo glitch, as only ulsfo hosts were affected. In total there were ~5M duplicates and ~2M missing:

+---------+--------+-------------+-----------+
| cluster | host   | # duplicate | # missing |
+---------+--------+-------------+-----------+
| bits    | cp4001 |      183537 |         0 |
| bits    | cp4002 |      517220 |    218080 |
| bits    | cp4003 |      381150 |    143408 |
| bits    | cp4004 |      266275 |         0 |
| text    | cp4008 |       26116 |         0 |
| text    | cp4009 |      215291 |         0 |
| text    | cp4010 |      126667 |         0 |
| text    | cp4016 |        1904 |         0 |
| text    | cp4018 |      167577 |         0 |
| upload  | cp4005 |      592259 |    352364 |
| upload  | cp4006 |      581932 |    340563 |
| upload  | cp4007 |      507600 |    299497 |
| upload  | cp4013 |      592460 |    389971 |
| upload  | cp4014 |      408414 |     61688 |
| upload  | cp4015 |      560605 |    291017 |
+---------+--------+-------------+-----------+

Since being ~7M requests off is within tolerance for our streams, I marked the streams "ok" by hand.

On the Kafka brokers, the only thing that looked related was exceptions like

[2014-10-08 23:46:04,585] 1658280892 [kafka-request-handler-9] ERROR kafka.server.KafkaApis - [KafkaApi-21] Error when processing fetch request for partition [webrequest_upload,5] offset 34174421023 from consumer with correlation id 2
kafka.common.OffsetOutOfRangeException: Request for offset 34174421023 but we only have log segments in the range 37690961900 to 40340044788.
        at kafka.log.Log.read(Log.scala:380)
[...]

i.e. fetch requests for offsets outside the range of log segments the broker still holds. They occurred (a small tally sketch follows the IRC excerpt below):

  6 times on analytics1012, affecting each of webrequest_upload,3, webrequest_upload,7, and webrequest_upload,11 twice;
  6 times on analytics1018, affecting each of webrequest_upload,0, webrequest_upload,4, and webrequest_upload,8 twice;
  6 times on analytics1021, affecting each of webrequest_upload,1, webrequest_upload,5, and webrequest_upload,9 twice;
  6 times on analytics1022, affecting each of webrequest_upload,2, webrequest_upload,6, and webrequest_upload,10 twice.

All those 24 exceptions were around 23:46.

Checking the affected caches in Ganglia, I noticed that some readings are missing around that time.

SAL did not show anything relevant, but the #wikimedia-operations channel had

[22:59:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:00:55] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[...]
[23:03:04] <icinga-wm> PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[23:03:55] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[23:05:16] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[...]
[23:07:04] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[...]
[23:10:14] <icinga-wm> PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192
[23:12:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[...]
[23:15:05] <icinga-wm> PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR
[23:15:05] <icinga-wm> RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0
[23:16:14] <icinga-wm> PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[...]
[23:16:45] <icinga-wm> PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[23:17:45] <icinga-wm> PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail
[...]
[23:18:54] <icinga-wm> PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[...]
[23:19:05] <icinga-wm> PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:19:26] <icinga-wm> PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures
[...]
[23:23:56] <icinga-wm> RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures

Afterwards services recovered. So it looks like a general ULSFO issue.
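For reference, the per-broker exception tallies above can be reproduced by counting the ERROR lines per partition in each broker's server log; a minimal sketch, assuming the log is available as a local file (the path is whatever you pass on the command line):

#!/usr/bin/env python
# Minimal sketch: count Kafka "Error when processing fetch request"
# lines per topic partition. The log file location is an assumption;
# the line format follows the excerpt quoted above.
import re
import sys
from collections import Counter

PATTERN = re.compile(
    r"Error when processing fetch request for partition \[([^\]]+)\]")

counts = Counter()
with open(sys.argv[1]) as log:  # e.g. a copy of the broker's server log
    for line in log:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1

for partition, occurrences in sorted(counts.items()):
    print("%s\t%d" % (partition, occurrences))

Run against each of the four brokers' logs, this prints one line per partition (e.g. "webrequest_upload,5" followed by its count), which is how the per-partition counts above read.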