Last modified: 2014-08-26 19:04:37 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T50894, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 48894 - Include pagecounts dumps in datasets
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: tools
Version: unspecified
Hardware: All
OS: All
Importance: High enhancement
Target Milestone: ---
Assigned To: Marc A. Pelletier
QA Contact:
Duplicates: 67909
Depends on:
Blocks:
Reported: 2013-05-28 13:00 UTC by Addshore
Modified: 2014-08-26 19:04 UTC
CC List: 11 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Addshore 2013-05-28 13:00:36 UTC
Include the dumps from http://dumps.wikimedia.org/other/pagecounts-raw/ in /public/datasets/public/
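For context, consuming these dumps directly over HTTP looks roughly like the sketch below. The year/month path layout, the sample filename, and the four-column line format (project, page title, view count, bytes transferred) are assumptions based on the pagecounts-raw convention, not details stated in this bug:

# Fetch one hour of pagecounts data over HTTP (filename assumed).
wget -q http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-05/pagecounts-20130501-000000.gz
# Each line should look like: <project> <page_title> <views> <bytes>
zcat pagecounts-20130501-000000.gz | head -n 3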
Comment 1 Peter Bena 2013-06-08 10:06:53 UTC
Unattended for weeks; boosting priority.
Comment 2 Peter Bena 2013-06-26 09:40:49 UTC
+ Ariel

bump
Comment 3 Ariel T. Glenn 2013-06-26 12:39:16 UTC
This could be done but it's 3.1T (and will only get bigger); is there space for this, Ryan?
Comment 4 Marc A. Pelletier 2013-07-02 16:53:30 UTC
There is space, though 3.1T is big enough that I'd like to see if we can somehow manage to share access to a single copy rather than duplicate it around.
Comment 5 Marc A. Pelletier 2013-07-02 16:54:21 UTC
... wait, that's already accessible through HTTP; why doesn't that suffice?
Comment 6 Ariel T. Glenn 2013-07-03 08:31:45 UTC
(In reply to comment #5)
> ... wait, that's already accessible through HTTP; why doesn't that suffice?

So everyone downloads their own copy, 3.1T worth, and puts it where?

It makes sense to me that we have one shared copy accessible to the lab projects.

If folks don't need the whole thing but only the most recent x days/weeks, I can arrange for that, as we do with the dumps, to save space.
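A retention window like that could be a simple scheduled cleanup. A rough sketch, where the mirror root and the 90-day window are both hypothetical:

# Hypothetical cleanup: keep only the most recent 90 days of files.
MIRROR=/public/datasets/public/pagecounts-raw   # assumed mirror root
find "$MIRROR" -type f -name 'pagecounts-*' -mtime +90 -delete
# Drop any YYYY/YYYY-MM directories emptied by the cleanup.
find "$MIRROR" -type d -empty -delete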
Comment 7 Yuvi Panda 2013-07-23 00:16:52 UTC
*bump*
Comment 8 Ariel T. Glenn 2013-07-23 09:28:14 UTC
Did we decide folks need the whole 3.1T?
Comment 9 Addshore 2013-07-23 09:35:15 UTC
I can't remember who initially asked me about this.

I imagine a year would suffice; how large would that be?
Comment 10 Yuvi Panda 2013-10-20 18:05:18 UTC
If it is no problem getting all 3.1T of it in Labs, we should!
Comment 11 Yuvi Panda 2013-10-22 18:42:31 UTC
Talked with apergos about it, and I've gotten a good idea of what needs doing. Will try to do a puppet patchset in a while.

Assuming that 3.1T on NFS won't be an issue... :D
Comment 12 Diederik van Liere 2013-10-22 18:45:54 UTC
@Yuvi, Apergos: can we please coordinate this, as the Analytics team is working on a MySQL setup with this data?
Comment 13 Diederik van Liere 2013-10-22 18:49:10 UTC
We have an RFC for making this pageview data queryable: https://www.mediawiki.org/wiki/Analytics/Hypercube
Comment 14 Gerrit Notification Bot 2013-10-22 20:30:45 UTC
Change 91293 had a related patch set uploaded by Yuvipanda:
dumps: Copy pagecounts data to public labs nfs too

https://gerrit.wikimedia.org/r/91293
Comment 15 Gerrit Notification Bot 2013-10-24 13:05:48 UTC
Change 91293 merged by ArielGlenn:
dumps: Copy pagecounts data to public labs nfs too

https://gerrit.wikimedia.org/r/91293
Comment 16 Marc A. Pelletier 2013-10-29 13:45:06 UTC
Now available in /public/pagecounts/pagecounts-raw/
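(The merged change isn't reproduced here, but the general shape of such a copy job is a periodic rsync onto the labs NFS share. A sketch, where the source path is an assumption and the destination is the path above:)

# Hypothetical mirror job from the dumps host to the labs NFS share.
SRC=/data/xmldatadumps/public/other/pagecounts-raw/   # assumed source path
DST=/public/pagecounts/pagecounts-raw/                # destination per comment 16
rsync -a --delete "$SRC" "$DST"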
Comment 17 Ariel T. Glenn 2014-03-18 11:12:20 UTC
So the new labs NFS is not as big as the old one, and the pagecounts will only grow in size. I'd like to revisit how far back we keep the old files. Marc, what space do we have available, and what gets used?
Comment 18 Ariel T. Glenn 2014-04-17 10:02:46 UTC
Folks, if I don't hear back on this soon, I'm going to whack files so only the last year is there; we're down to 300 GB or so on that filesystem.
Comment 19 Tim Landscheidt 2014-04-17 12:45:41 UTC
Just to clarify, are we talking about:

| scfc@tools-login:~$ df -h /public/dumps/pagecounts-raw
| Filesystem                       Size  Used Avail Use% Mounted on
| labstore.svc.eqiad.wmnet:/dumps  9.1T  8.1T  1.1T  89% /public/dumps
| scfc@tools-login:~$

If so, I'd rather add a terabyte or two than delete stuff until [[mw:Analytics/Hypercube]] is available.
Comment 20 Ariel T. Glenn 2014-04-17 12:51:39 UTC
I don't know how feasible that is. Marc?
Comment 21 Marc A. Pelletier 2014-04-17 13:06:34 UTC
On the dumps FS? Fairly hard: it doesn't live on the shelves, since it didn't need the same level of redundancy, and it fully fills its RAID.

There are three things there, however, so perhaps we can move one to another filesystem.  The dumps currently occupy 4.4T, and the pagecounts 3.7T.

In a pinch, I have the /scratch filesystem, which has the same properties and has some 7T available.
Comment 22 Marc A. Pelletier 2014-07-11 16:16:44 UTC
There will be a new server allocated for dumps and pagecounts, which will give us several times more space than we need and will return that space from labstore1001 back to the usable pool.

More news soon.
Comment 23 Marc A. Pelletier 2014-07-12 02:44:49 UTC
*** Bug 67909 has been marked as a duplicate of this bug. ***
Comment 24 Tim Landscheidt 2014-07-16 17:56:08 UTC
New server tracked in RT #7578.
Comment 25 jeremyb 2014-08-23 08:01:01 UTC
(In reply to Tim Landscheidt from comment #24)
> New server tracked in RT #7578.

I can't see that. I assume it's in procurement.

There are several more relevant tickets, e.g. RT #7948 and RT #8090.

None have been updated for a couple of weeks; maybe Wikimania/travel got in the way.

Looks like we got at least as far as an OS install, then hit issues with puppet/DNS, and then idk.
Comment 26 Marc A. Pelletier 2014-08-26 19:04:37 UTC
After a couple of odd issues with the underlying filesystem that took us several days to fix, the server is back online with the dumps.

The pagecounts aren't /quite/ done copying yet, but they're up to 2014 and should be done soon.

For reference, the canonical location is:

/public/dumps/pagecounts-raw/
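For a quick sanity check from a Tools host once the copy finishes, something like this should work. A sketch; the per-month layout and the specific filename are assumptions from the pagecounts-raw convention:

# Count views of one page in one hour, reading straight off NFS.
F=/public/dumps/pagecounts-raw/2014/2014-01/pagecounts-20140101-000000.gz
zcat "$F" | awk '$1 == "en" && $2 == "Main_Page" {print $3}'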
