
Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than bug reports and their history may be broken. See T74548, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 72548 - Raw webrequest bits partition for 2014-10-26T21/1H not marked successful
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Refinery
Version: unspecified
Hardware/OS: All / All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 72298
Reported: 2014-10-27 07:53 UTC by christian
Modified: 2014-10-30 18:54 UTC
CC: 7 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description christian 2014-10-27 07:53:12 UTC
The bits webrequest partition [1] for 2014-10-26T21/1H has not been marked
successful.

What happened?


[1]
_________________________________________________________________
qchris@stat1002 // jobs: 0 // time: 07:51:53 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh 
  +------------------+--------+--------+--------+--------+
  | Date             |  bits  | mobile |  text  | upload |
  +------------------+--------+--------+--------+--------+
[...]
  | 2014-10-26T19/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T20/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T21/1H |    X   |    .   |    .   |    .   |
  | 2014-10-26T22/1H |    .   |    .   |    .   |    .   |
  | 2014-10-26T23/1H |    .   |    .   |    .   |    .   |
[...]
  +------------------+--------+--------+--------+--------+


Statuses:
  . --> Partition is ok
  M --> Partition manually marked ok
  X --> Partition is not ok (duplicates, missing, or nulls)
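
For context, a minimal sketch of the kind of per-host sequence-number check that could decide between "." and "X" above. The record layout and the check itself are illustrative assumptions, not the actual refinery / dump_webrequest_status.sh implementation:

# Hypothetical illustration only: flag an hourly partition as not ok ('X')
# on null, duplicate, or missing sequence numbers for any host.
from collections import defaultdict

def partition_status(records):
    """records: iterable of (hostname, sequence_number) for one hour."""
    by_host = defaultdict(list)
    for host, seq in records:
        if seq is None:                      # null sequence number
            return 'X'
        by_host[host].append(seq)
    for host, seqs in by_host.items():
        if len(seqs) != len(set(seqs)):      # duplicates
            return 'X'
        if len(seqs) != max(seqs) - min(seqs) + 1:   # gaps -> missing requests
            return 'X'
    return '.'

# Example: cp3019 is missing sequence numbers 3..7 within the hour.
print(partition_status([("cp3019", s) for s in (1, 2, 8, 9)]))  # -> 'X'
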
Comment 1 christian 2014-10-27 08:22:12 UTC
Only cp3019 is affected. For that host, about 55 seconds' worth of data got lost
in the ~1 minute between 2014-10-26T21:16:22 and 2014-10-26T21:17:24.

I could not find any changes in puppet, DNS, or SAL that look relevant.

cp3019 (like all other esams caches) is gone from ganglia, so it is hard
for non-Ops to get further data from cp3019 itself.

Icinga shows the “Varnishkafka Delivery Errors” service in WARNING state
since 2014-10-24 17:11:57 (but the same holds true for the other esams
caches too).
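
(For reference, the window between those two timestamps is 62 seconds; a trivial check, using only the timestamps quoted above:)

from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S"
gap = (datetime.strptime("2014-10-26T21:17:24", fmt)
       - datetime.strptime("2014-10-26T21:16:22", fmt))
print(gap.total_seconds())   # 62.0 seconds, of which ~55s of data were lost
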
Comment 2 christian 2014-10-27 08:31:45 UTC
Kafka logs did not show peculiar entries in the relevant period of time.
Comment 3 christian 2014-10-28 22:26:12 UTC
Ganglia shows data for the esams caches again, but the data between
~2014-10-24T12 and ~2014-10-27T16 is missing (which includes the
minute where we had the cp3019 issue).
Judging from the cumulative counters (which a restart would have reset),
neither varnish nor varnishkafka got restarted on cp3019.

ottomata ... since I cannot find any explanation, does cp3019
or 2014-10-26T21:16 ring a bell for you?

Was there some other migration/testing/network issue that I am missing?
Comment 4 christian 2014-10-30 18:54:28 UTC
ottomata had a look at the logs on cp3019 and said that there were
produce errors about full buffers.
So we're writing this off as a temporary network issue for now.
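
As a rough illustration of that failure mode: while the brokers are unreachable, the producer's bounded local buffer absorbs new requests only for a limited time before messages have to be dropped. The numbers below are made-up placeholders, not cp3019's actual buffer size or request rate:

# Back-of-the-envelope sketch: how long a bounded producer buffer can
# absorb traffic during a network outage before it starts dropping.
# Both numbers are illustrative assumptions, not real cp3019 settings.
buffer_capacity_msgs = 100_000     # assumed local queue limit
produce_rate_msgs_per_s = 5_000    # assumed per-host request rate

seconds_until_full = buffer_capacity_msgs / produce_rate_msgs_per_s
print("buffer full after ~%.0f s; messages produced after that are lost "
      "until delivery to the brokers resumes" % seconds_until_full)
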
