Last modified: 2014-09-02 13:17:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72118, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 70118 - Webstatscollector's pagecounts file shaky between 2014-08-24 14:00 and 2014-08-27 21:00


Summary:	Webstatscollector's pagecounts file shaky between 2014-08-24 14:00 and 2014-0...

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	General/Unknown (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:	70053
Blocks:
	Show dependency tree / graph

Reported:	2014-08-28 09:28 UTC by christian
Modified:	2014-09-02 13:17 UTC (History)
CC List:	5 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description christian 2014-08-28 09:28:40 UTC

TL;DR: For the files between 2014-08-24 14:00 and 2014-08-27 21:00
webstatscollector output are showing some irregularities. This affects
both the pagecounts and projectcount files from

https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/

and all services that process them.

--------------------------------------------

The pagecount file 20140824 14:00 [1] did not show
irregularities. But the file for one day later show a drop of 80% for
a few pages.

As gadolinium (the host that is writing the pagecount files) showed a
high and still increasing process count for no longer needed services
(bug 70053), those services got turned off around 2014-08-26 19:00:00.

But although more resources were freed on gadolinium, its
webstatscollector's collector process degraded further. Since the
service did not recover (UDP Receive Buffers filled up again and
again, disks could not take the write load, and the service gathered
95GB of virtual memory since its last restart), the service got
restarted on 2014-08-28 ~15:32.

Since the restart did not relax the situation either, the service was
put on tmpfs 2014-08-28 ~19:48, which reduced load on the disks, and
made the service work again. The files starting at 2014-08-27 21:00
are good again.

Thanks ottomata for all those fixes!
More details are in the corresponding IRC channel logs [3].

Closer investigation of the files between 20140824 14:00 and
2014-08-27 21:00 is still pending.

[1] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz
[2] https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz
[3] http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140827.txt

Comment 1 christian 2014-08-28 11:06:07 UTC

(In reply to christian from comment #0)
> [...] restarted on 2014-08-28 ~15:32.

That timestamp is meant to read "2014-08-27 ~15:32." (one day before).

> [...] put on tmpfs 2014-08-28 ~19:48, which [...]

That timestamp is meant to read "2014-08-27 ~19:48" (one day before).

Comment 2 Toby Negrin 2014-08-28 16:57:16 UTC

Any ideas as to the root cause? Do you think this has been an ongoing problem that we haven't noticed?

Comment 3 christian 2014-08-28 22:22:07 UTC

(In reply to Toby Negrin from comment #2)
> Any ideas as to the root cause?

Yes. Bug 70053.
Due to the increasing number of processes, less and less resources
were available for other processes, and at some point it started to
affect other services like webstatscollector.



> Do you think this has been an ongoing
> problem that we haven't noticed?

We have no official monitoring of webstatscollector.
So we have no way of telling.
Hence, we also have no way of being alerted about such issues in the
first place.

It would be a nice quarterly goal to have proper, official monitoring
in place ;-) That would allow us to answer such questions.


Mhmm ... that said ... no it's not an ongoing problem.
Data up to 2014-08-24 14:00 is within the usual expectations.

Comment 4 christian 2014-09-02 13:17:27 UTC

(In reply to christian from comment #0)
> Closer investigation of the files between 20140824 14:00 and
> 2014-08-27 21:00 is still pending.

The ending of the interval is crisp:
The last bad file is:
  https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-200000.gz
The first file that's good again is:
  https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140827-210000.gz

But due to the lack of granular monitoring, and hard boundaries,
finding a crisp beginning of the interval of bad data did not work
out. The last known to be good file is:
  https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140824-140000.gz
The first known to be bad file is:
  https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-08/pagecounts-20140825-140000.gz

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links