Last modified: 2014-11-10 15:59:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T71615, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 69615 - Make webrequest partition validation handle races between time and sequence numbers
Make webrequest partition validation handle races between time and sequence n...
Status: PATCH_TO_REVIEW
Product: Analytics
Classification: Unclassified
Refinery (Other open bugs)
unspecified
All All
: Unprioritized normal
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks: 70085
  Show dependency treegraph
 
Reported: 2014-08-15 16:57 UTC by christian
Modified: 2014-11-10 15:59 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description christian 2014-08-15 16:57:06 UTC
The timestamp reported by varnish is taken when the request arrives.
The sequence number reported by varnish is taken when the response is
sent.

So when requests Foo and Bar arrive at the same cache, and Foo arrives
slightly before Bar, but Bar get its response sooner, it may occur that
the requests get logged as:

  +--------------+----------+-----------------+
  | request name |     Time | Sequence number |
  +--------------+----------+-----------------+
  | ...          | 09:59:59 |            4708 |
  | ...          | 09:59:59 |            4709 |
  | Foo          | 09:59:59 |            4711 |
  | Bar          | 10:00:00 |            4710 |
  | ...          | 10:00:00 |            4712 |
  | ...          | 10:00:00 |            4713 |
  +--------------+----------+-----------------+

Since we partition by timestamp, the partition for the 9th hour would
look like

  +--------------+----------+-----------------+
  | request name |     Time | Sequence number |
  +--------------+----------+-----------------+
  | ...          | 09:59:59 |            4708 |
  | ...          | 09:59:59 |            4709 |
  | Foo          | 09:59:59 |            4711 |
  +--------------+----------+-----------------+

(hence appearing to miss sequence number 4710), and the partition for
the 10th hour would look like:

  +--------------+----------+-----------------+
  | request name |     Time | Sequence number |
  +--------------+----------+-----------------+
  | Bar          | 10:00:00 |            4710 |
  | ...          | 10:00:00 |            4712 |
  | ...          | 10:00:00 |            4713 |
  +--------------+----------+-----------------+

(hence appearing to miss sequence number 4711).

So both partitions look like they'd be missing lines when being looked
at in isolation, and our per partition monitoring flags them both as
faulty.

But when looking at both partitions combined, no line is actually
missing, and the monitoring could flag them as ok.

In the past 2 weeks, we had two such occasions (one for bits, one for
upload).

The manual fix is simple: Generate the _SUCCESS file by hand for both
partitions.

Let's improve our monitoring to be aware of such races and check
for them automatically.

(One way would be that if the naive validation fails, Oozie's <error
to="..."> would no longer go to "kill", but to a follow-up step that
check specifically for this race.)
Comment 1 Dan Andreescu 2014-08-15 17:24:19 UTC
nice find!
Comment 2 christian 2014-08-21 16:25:51 UTC
Just to help us keep track how relevant this is, it again happened on

  2014-08-17T20:xx:xx/2014-08-17T21:xx:xx (on upload)
  2014-08-18T20:xx:xx/2014-08-18T21:xx:xx (on upload)
Comment 3 Gerrit Notification Bot 2014-08-21 16:37:32 UTC
Change 155578 had a related patch set uploaded by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

https://gerrit.wikimedia.org/r/155578
Comment 4 Gerrit Notification Bot 2014-08-21 16:38:22 UTC
Change 155578 abandoned by QChris:
DO NOT MERGE HiveQL file to check for sequence stats across hours

Reason:
Just parked code to aid manual checking across hours

https://gerrit.wikimedia.org/r/155578
Comment 5 christian 2014-08-25 09:28:52 UTC
Happened again for:

  2014-08-23T16:xx:xx/2014-08-23T17:xx:xx (on upload)
Comment 6 christian 2014-09-03 09:23:49 UTC
Happened again for:

  2014-08-31T15:xx:xx/2014-08-31T16:xx:xx (on upload)
  2014-09-02T17:xx:xx/2014-09-02T18:xx:xx (on upload)
Comment 7 christian 2014-09-05 20:53:58 UTC
Happened again for:

  2014-09-04T22:xx:xx/2014-09-04T23:xx:xx (on upload)
Comment 8 christian 2014-09-29 18:35:53 UTC
Happened again for:

  2014-09-28T14:xx:xx/2014-09-28T15:xx:xx (on upload)
Comment 9 christian 2014-10-10 11:08:46 UTC
Happened again for:

  2014-10-09T17:xx:xx/2014-10-09T18:xx:xx (on upload)
Comment 10 Toby Negrin 2014-10-14 18:21:49 UTC
This looks like something we should resource. Christian, how does this line up with the other things that you are working on in severity?
Comment 11 christian 2014-10-15 10:44:41 UTC
In terms of severity and other work, it does not line up too well.

And since we're suffering more severe issues, it has not been fixed yet.

But since this race-condition a distraction nonetheless, it's one of the
things I meant when bringing up “work on cluster infrastructure” during
the discussion of goals for this quarter.

So, for me, it's also well within scope, and the fix will happen once the more
pressing things are done.
Comment 12 Toby Negrin 2014-10-15 15:55:30 UTC
ok -- makes sense. Thanks Christian.
Comment 13 christian 2014-10-22 10:15:50 UTC
Happened again for:

  2014-10-21T14/2H (on upload)
Comment 14 christian 2014-10-22 12:04:56 UTC
Happened again for:

  2014-10-22T06/2H (on bits)
Comment 15 christian 2014-10-27 07:21:08 UTC
Happened again for:

  2014-10-24T19/2H (on bits)
Comment 16 christian 2014-10-29 15:18:31 UTC
Happened again for:

  2014-10-29T01/2H (on upload)
Comment 17 christian 2014-10-29 17:29:33 UTC
Happened again for:

  2014-10-29T08/2H (on bits)
Comment 18 christian 2014-11-06 10:55:21 UTC
Happened again for:

  2014-11-05T14/2H (on bits)
Comment 19 christian 2014-11-08 20:54:02 UTC
Happened again for:

  2014-11-08T00/2H (on bits)
Comment 20 christian 2014-11-08 21:22:56 UTC
Happened again for:

  2014-11-08T12/2H (on text) <-- first time on text
Comment 21 christian 2014-11-10 15:00:57 UTC
Happened again for:

  2014-11-09T21/2H (on upload)
Comment 22 christian 2014-11-10 15:59:12 UTC
Happened again for:

  2014-11-10T10/2H (on upload)

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links