Last modified: 2014-10-21 13:14:53 UTC
None of the webrequest partitions [1] for 2014-10-20T13/1H have been marked successful.

What happened?

[1] _________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 09:43:10 // exit code: 0
cwd: ~/refinery/hive/webrequest
~/cluster-scripts/dump_webrequest_status.sh
  +------------------+--------+--------+--------+--------+
  | Date             |  bits  | mobile |  text  | upload |
  +------------------+--------+--------+--------+--------+
[...]
  | 2014-10-20T11/1H |    .   |    .   |    .   |    .   |
  | 2014-10-20T12/1H |    .   |    .   |    .   |    .   |
  | 2014-10-20T13/1H |    X   |    X   |    X   |    X   |
  | 2014-10-20T14/1H |    .   |    .   |    .   |    .   |
  | 2014-10-20T15/1H |    .   |    .   |    .   |    .   |
[...]
  +------------------+--------+--------+--------+--------+

Statuses:

  . --> Partition is ok
  M --> Partition manually marked ok
  X --> Partition is not ok (duplicates, missing, or nulls)
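For longer status dumps, the rows that are not ok can be picked out mechanically. The following is only a minimal sketch: it assumes the output keeps the pipe-delimited layout shown in this report, and `failed_partitions` is a hypothetical helper name, not part of refinery or the cluster scripts.

```python
# Hypothetical helper; assumes the table layout shown in this report.
SAMPLE = """
+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+
| 2014-10-20T12/1H |    .   |    .   |    .   |    .   |
| 2014-10-20T13/1H |    X   |    X   |    X   |    X   |
| 2014-10-20T14/1H |    .   |    .   |    .   |    .   |
+------------------+--------+--------+--------+--------+
"""


def failed_partitions(table_text):
    """Map each partition to the list of sources whose status is 'X'."""
    header = None
    failed = {}
    for line in table_text.splitlines():
        line = line.strip()
        # Skip borders, "[...]" elisions, and blank lines.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0] == "Date":
            header = cells[1:]  # source names: bits, mobile, text, upload
            continue
        if header is None:
            continue  # data row before any header; ignore it
        bad = [name for name, status in zip(header, cells[1:]) if status == "X"]
        if bad:
            failed[cells[0]] = bad
    return failed


if __name__ == "__main__":
    for partition, sources in failed_partitions(SAMPLE).items():
        print(partition, "not ok for:", ", ".join(sources))
```

Rows marked `M` (manually marked ok) are deliberately treated as ok here, following the status legend.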
The affected period is 2014-10-20T13:07:11--2014-10-20T13:25:38.
It affected only ulsfo caches, but all ulsfo caches.

The affected period shows around 2M duplicates, which are worth
  * 79 seconds of ulsfo data, or
  * 15 seconds of total data.

The affected period shows around 27M missing lines, which are worth
  * 16 minutes of ulsfo data, or
  * 3 minutes of total data.

Ops reported [1] that network issues between ulsfo and eqiad started at 13:07.
This aligns with and explains the issues that we're seeing.

[1] https://lists.wikimedia.org/mailman/private/ops/2014-October/042274.html
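As a rough sanity check on the figures above, the line counts and their "worth" in time can be turned back into implied request rates. This is only a sketch: the inputs are the rounded numbers from this report, so the rates are approximations, and `implied_rate` is a hypothetical helper name.

```python
def implied_rate(lines, seconds):
    """Lines per second implied by 'N lines are worth S seconds of data'."""
    return lines / seconds


# Rounded figures taken from the report above.
DUPLICATES = 2_000_000
MISSING = 27_000_000

dup_ulsfo = implied_rate(DUPLICATES, 79)         # 79 seconds of ulsfo data
dup_total = implied_rate(DUPLICATES, 15)         # 15 seconds of total data
missing_ulsfo = implied_rate(MISSING, 16 * 60)   # 16 minutes of ulsfo data
missing_total = implied_rate(MISSING, 3 * 60)    # 3 minutes of total data

if __name__ == "__main__":
    print(f"ulsfo rate from duplicates: {dup_ulsfo:,.0f} lines/s")
    print(f"ulsfo rate from missing:    {missing_ulsfo:,.0f} lines/s")
    print(f"total rate from duplicates: {dup_total:,.0f} lines/s")
    print(f"total rate from missing:    {missing_total:,.0f} lines/s")
    # Share of total traffic served by ulsfo, implied by each pair of figures.
    print(f"ulsfo share (duplicates):   {dup_ulsfo / dup_total:.1%}")
    print(f"ulsfo share (missing):      {missing_ulsfo / missing_total:.1%}")
```

Both pairs of figures imply that ulsfo carries roughly a fifth of total traffic, so the duplicate and missing estimates are consistent with each other within rounding.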