Last modified: 2014-01-03 16:00:24 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T61384, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 59384 - DBQ-129 create a list of total individual user edits per month for English wikipedia
Status: RESOLVED FIXED
Product: Tool Labs tools
Classification: Unclassified
Component: Database Queries (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Unprioritized major
Target Milestone: ---
Assigned To: Bugzilla Bug Importer (valhallasw)
Depends on:
Blocks:

Reported: 2014-01-03 15:59 UTC by Bugzilla Bug Importer (valhallasw)
Modified: 2014-01-03 16:00 UTC
CC List: 0 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Bugzilla Bug Importer (valhallasw) 2014-01-03 15:59:55 UTC
This issue was converted from https://jira.toolserver.org/browse/DBQ-129.
Summary: create a list of total individual user edits per month for English wikipedia
Issue type: Task - A task that needs to be done.
Priority: Major
Status: Done
Assignee: Hoo man <hoo@online.de>

-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Mon, 21 Mar 2011 23:38:58
-------------------------------------------------------------------------------

Output 3 columns: userid, month, totaledits_this_month for English wikipedia from the start, for users with non-zero edits. IP edits can be disregarded.  
Having now looked at the database definition I believe that the sql statement I need executed is:  
select rev_user, left(rev_timestamp,6), count(*) from revision where rev_user!=0 group by rev_user,left(rev_timestamp,6);

If doing this in parts, an extra condition such as left(rev_timestamp,4)='2005' is needed (see the sketch below).  
Thanks
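
For reference, a minimal sketch of the per-year batched form of the query above, assuming the standard MediaWiki revision table (the year 2005 is only an example):

    -- One yearly batch of the requested query; rev_user = 0 marks IP
    -- edits, so they are excluded.
    SELECT rev_user,
           LEFT(rev_timestamp, 6) AS month,
           COUNT(*) AS totaledits_this_month
    FROM revision
    WHERE rev_user != 0
      AND LEFT(rev_timestamp, 4) = '2005'
    GROUP BY rev_user, LEFT(rev_timestamp, 6);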

Comment 1 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:59:57 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Tue, 22 Mar 2011 23:35:23
-------------------------------------------------------------------------------

distinguish IP & registered users
Comment 2 Bugzilla Bug Importer (valhallasw) 2014-01-03 15:59:58 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Wed, 23 Mar 2011 19:08:33
-------------------------------------------------------------------------------

So if I'm understanding you right, you want it this way:  
user_id, user_name, edits  
for every month from the start of enwiki on?  
(user_name is required if you want IP edits as well, because IPs don't have a user_id)

If so, you will have to be patient, because this will take quite a while (I have to run it per month so as not to kill the server :P)  
Also, I will produce a separate file for every month, if that's OK
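
A minimal sketch of what a per-month query that keeps IP edits might look like, assuming the toolserver-era revision table where rev_user_text carries the IP address or user name (the month 2005-03 is only an example):

    -- Hypothetical IP-inclusive variant: anonymous edits have
    -- rev_user = 0, so rev_user_text is needed to tell them apart.
    SELECT rev_user,
           rev_user_text,
           COUNT(*) AS edits
    FROM revision
    WHERE rev_timestamp LIKE '200503%'
    GROUP BY rev_user, rev_user_text;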

Comment 3 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:00 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Wed, 23 Mar 2011 21:16:03
-------------------------------------------------------------------------------

Not interested in IP edits, so rev_user != 0, and so I don't need user names - but I do need the month in the output.  
There are about 650,000 current users, growing approximately linearly since 2005, so that makes 23m possible entries; but probably most ids only appear in a single month, so the actual output will be much smaller, say 2m records. At ~20 chars per record that's about 40 MB of output before compression.  
Can we compromise on yearly batches? I'm trying to nail down the problems with the growth of users, mainly in the period 2005-2008, so maybe yearly from 2005, with 2001-2004 in one batch? But if you have to take smaller bites, so be it.
Comment 4 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:02 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Thu, 24 Mar 2011 15:03:26
-------------------------------------------------------------------------------

I just wanted to let you know that I've started the script ;)

I'll let you know when it's finished :)

Comment 5 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:03 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Tue, 29 Mar 2011 12:20:20
-------------------------------------------------------------------------------

Have you any idea how long this might take? I presume no partial results will be available.  
I wonder slightly about the tactic of splitting the task up too finely: it's a straight query without joins, and doing it in too small chunks (i.e. 120 chunks rather than 10) might lead to a significant increase in overall runtime.
Comment 6 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:05 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Tue, 29 Mar 2011 12:47:27
-------------------------------------------------------------------------------

Rest assured, I know what I'm doing (this isn't my first time using SQL :P)... running it in one go would take days, and we have an active query killer, so queries usually can't run for more than 5-6 hours; it would also eat a lot of RAM ;)

I'm currently at 2008-09 and I'm confident that I can give you the results tomorrow (in one big file and in smaller per-month files)

Sorry that it's taking so long, but I first chose the wrong (a slower) DB server, so I lost a lot of time :(

Comment 7 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:06 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Tue, 29 Mar 2011 13:15:38
-------------------------------------------------------------------------------

Thanks - wasn't doubting your competence! I'm aware that estimating resource requirements in large databases is a fine art.
Comment 8 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:08 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Thu, 31 Mar 2011 12:32:28
-------------------------------------------------------------------------------

I'm sorry, but I calculated the ETA wrong (I thought the number of contributors would at most double after late 2005, but it grew even further), so my script is only at 2010-02 at the moment.

Therefore I can't give you the results today :(

I've now recalculated the ETA and it comes out at over 40 hours, so you will have to wait until Saturday (April 2)

Comment 9 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:10 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Thu, 31 Mar 2011 14:29:40
-------------------------------------------------------------------------------

Is it possible to release the early years now? I had thought of putting in a report for the UK Wiki conference and the deadline is 2 Apr so I need to do at least the 2007 analysis soon.
Comment 10 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:11 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Thu, 31 Mar 2011 14:39:24
-------------------------------------------------------------------------------

Sure, I can give you the results up to 2010-02 in a few hours (I still have to re-run some sub-queries because the query killer killed some of the earlier ones) ;)

Do you still have a need for the rest of the data then (after April 2)?

Comment 11 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:13 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Thu, 31 Mar 2011 15:34:24
-------------------------------------------------------------------------------

Hmm, it's obviously getting harder :(  
It seems that the whole DB is getting too large to get proper stats from these days. It's impossible to get any detail beyond the start of 2010. So - yes, why don't we call it a day at 2010-02?

Comment 12 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:14 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 01 Apr 2011 11:45:03
-------------------------------------------------------------------------------

A few more days and it would have been finished...

Code (I wrote it in PHP because I was too lazy for bash :P, but that doesn't matter):
    
    #!/usr/bin/php -q
    <?php
    // Splits the work into one query per month so no single query runs long
    // enough to be hit by the query killer; writes one result file per month.
    $current = '201103';                                  // last month to process (loop stops after this)
    $month = '05';                                        // first processed month is 2007-06 (incremented below)
    $year = '2007';
    $log_file = '/mnt/user-store/hoo/dbq-129.log.txt';    // log file path (not used below)
    $save_file_pre = '/mnt/user-store/hoo/dbq/dbq-129-';  // per-month output file prefix
    $db = 'enwiki_p';
    $server = '10.24.1.33';
    while (true) {
        // Advance to the next month, rolling over into the next year.
        $month++;
        if ($month > 12) {
            $month = '01';
            $year++;
        }
        if (strlen($month) == 1) {
            $month = '0' . $month;                        // zero-pad to match rev_timestamp
        }
        if ($current < $year . $month) {
            break;
        }
        // Retry the month until the output file is non-empty (the query
        // killer can abort a run and leave an empty file behind).
        while (true) {
            $query = "SELECT /* SLOW_OK */ rev_user, COUNT(*) as edits, LEFT(rev_timestamp, 6) as month FROM revision WHERE rev_timestamp LIKE '" . $year . $month . "%' AND rev_user != 0 GROUP BY rev_user;";
            $file = $save_file_pre . $year . '-' . $month . '.txt';
            exec('mysql --host=' . $server . ' --database=' . $db . ' -e"' . $query . '" | cat > ' . $file);
            clearstatcache();                             // filesize() caches results; reset before re-checking
            if (filesize($file) != 0) {
                break;
            }
        }
    }
    ?>
    

Result:  
http://toolserver.org/~hoo/dbq/dbq-129/ (dir with the per month files)  
http://toolserver.org/~hoo/dbq/dbq-129/full.txt (everything in one file (plain text, ~125MB); I haven't tested it, but I hope it's going to import fine ;) )

Comment 13 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:16 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Fri, 01 Apr 2011 13:04:32
-------------------------------------------------------------------------------

Thanks, I've downloaded it and it looks fine.  
However, I asked for it not to include IP edits, but it appears to, and I have no way of eliminating them.  
The stats say that by Feb 2010 there were 573,891 Wikipedians, but the file includes 3,328,228 user ids for 7,704,721 monthly records. Looking at the ids, there are a lot of 8-digit ids even in 2001, most of which appear for very short periods.  
Since I'm particularly interested in people who've created ids but never made more than a few edits, it's crucial that I can eliminate IP edits. Is there any way to do that with the file you've given me?
Comment 14 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:18 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 01 Apr 2011 13:17:10
-------------------------------------------------------------------------------

The files seem to be correct; all IP edits have rev_user = 0. The strange user_id thing seems fine as well, e.g. user "11419925", who made 2 edits in 2001-02:
    
    mysql> SELECT * FROM user WHERE user_id = '11419925';
    +----------+-------------------+-------------------+----------------+
    | user_id  | user_name         | user_registration | user_editcount |
    +----------+-------------------+-------------------+----------------+
    | 11419925 | Cobrand.bomis.com | 20100112120114    |              0 |
    +----------+-------------------+-------------------+----------------+
    1 row in set (0.02 sec)
    

And on-wiki you can see them (two edits to ZeuS):  
http://en.wikipedia.org/wiki/Special:Contributions/Cobrand.bomis.com

Things like this are caused by imports from the "old" Wikipedia; e.g. for the example above, the edits are from: http://nostalgia.wikipedia.org/w/index.php?title=ZeuS&action=history  
Such edits are linked to users with the same name, even if those users registered years later ;)
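
A rough way to spot such imported edits, assuming the standard revision and user tables: look for revisions timestamped before the account's registration date.

    -- Sketch: revisions dated before the user's registration are
    -- almost certainly imported from elsewhere.
    SELECT r.rev_user, u.user_name, r.rev_timestamp, u.user_registration
    FROM revision r
    JOIN user u ON u.user_id = r.rev_user
    WHERE r.rev_user != 0
      AND u.user_registration IS NOT NULL
      AND r.rev_timestamp < u.user_registration
    LIMIT 10;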

Comment 15 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:20 UTC
-------------------------------------------------------------------------------
From: Chris55 <moschris@googlemail.com>
Date: Fri, 01 Apr 2011 13:56:59
-------------------------------------------------------------------------------

Hmm, the data is a lot noisier than I'd hoped. This page - http://en.wikipedia.org/wiki/Special:Statistics - says there are 14m registered users, and the number surely can't have quadrupled in the last year. It says 453m edits, whereas the sum in the file you've given me is 230m, which is nearer to being plausible but still not reassuring. The edits by the top users are nothing like the published figures (which start at 2.8m), but I guess those don't include bots.  
Oh well, I'll continue to see if I can get anything meaningful out...
Comment 16 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:21 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 01 Apr 2011 14:06:12
-------------------------------------------------------------------------------

Well, of course the revision count differs, for three reasons: it ends at 2010-02, we excluded IPs, and the archive table (which holds the revisions of deleted pages) hasn't been included.  
And yes, bots are included (this is only avoidable with a join against the user_groups table, which would be quite slow)

If you need it, I can give you a list of all bots, so that you can easily exclude them from your data...

I'm now going to import the data onto my workstation myself, just to take a quick look at it
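
For completeness, a sketch of the slower bot-excluding variant mentioned above, assuming the standard user_groups table (the 'bot' group; the month 2009-07 is only an example):

    -- Hypothetical per-month query that drops edits by accounts in the
    -- 'bot' group; the extra join is what makes it slow.
    SELECT r.rev_user, COUNT(*) AS edits
    FROM revision r
    LEFT JOIN user_groups ug
           ON ug.ug_user = r.rev_user AND ug.ug_group = 'bot'
    WHERE r.rev_timestamp LIKE '200907%'
      AND r.rev_user != 0
      AND ug.ug_user IS NULL
    GROUP BY r.rev_user;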
Comment 17 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:23 UTC
-------------------------------------------------------------------------------
From: Hoo man <hoo@online.de>
Date: Fri, 01 Apr 2011 14:38:35
-------------------------------------------------------------------------------

I've just noticed the following about the imported data: the full.txt file didn't include the two 2010 months :(  
I've updated it now ;)

(The script I used to merge them, http://toolserver.org/~hoo/dbq/dbq-129/merge.sh, used **200** as the pattern to find the files, but the 2010 file names don't contain 200. I've now changed the script.)

Comment 18 Bugzilla Bug Importer (valhallasw) 2014-01-03 16:00:24 UTC
This bug was imported as RESOLVED. The original assignee has therefore not been
set, and the original reporters/responders have not been added as CC, to
prevent bugspam.

If you re-open this bug, please consider adding these people to the CC list:
Original assignee: hoo@online.de
CC list: hoo@online.de
