Bug 58531 - [grabbers] PHP Fatal error: Out of memory in mediawikibot.class.php
Status: RESOLVED FIXED
Product: Utilities
Classification: Unclassified
Component: grabbers
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: major
Assigned To: Nobody - You can work on this!
Reported: 2013-12-16 13:08 UTC by Nemo
Modified: 2014-03-04 08:34 UTC
CC List: 3 users

Description Nemo 2013-12-16 13:08:26 UTC
With
    php grabText.php --url=http://wikihow.com/api.php
I got
    PHP Fatal error:  Out of memory (allocated 1851260928) (tried to allocate 32 bytes)
in mediawikibot.class.php.

This wiki is rather huge, with about 3.5M pages (I was hoping to have better success than with dumpgenerator.py). Is there any way to limit the amount of data stored in RAM?
Comment 1 Kunal Mehta (Legoktm) 2013-12-16 16:45:00 UTC
How far did the script make it? What was the last output it gave before running out of memory?
Comment 2 Nemo 2013-12-16 17:14:34 UTC
(In reply to comment #1)
> How far did the script make it? What was the last output it gave before
> running out of memory?

It got through 260k pages in ns0, then printed a count for the next namespace, and presumably it was getting titles for the third.
Comment 3 Isarra 2013-12-16 18:29:30 UTC
So it looks like it's running out of memory storing the initial list before it starts grabbing the pages (the workflow being that it gets the list of namespaces, then the list of pages, and then fetches the revisions themselves and inserts everything into the database)? That would certainly be understandable at these sizes, so it should probably... what, write the page list to a temp file or something?

What's the best practice here? What should it be doing?
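
A rough sketch of the pattern described above, assuming the grabber walks list=allpages and keeps every title in one array before fetching any revisions. This is illustrative only, not the actual mediawikibot.class.php code, and it assumes the modern continue-style API:

    <?php
    // Hypothetical sketch of the memory-hungry pattern: every title from
    // list=allpages is collected into a single array before any revisions
    // are fetched. On a ~3.5M-page wiki this array alone can exhaust RAM.
    $api = 'http://wikihow.com/api.php';
    $titles = [];
    $continue = [];

    do {
        $params = array_merge([
            'action'   => 'query',
            'list'     => 'allpages',
            'aplimit'  => 500,
            'format'   => 'json',
            'continue' => '',
        ], $continue);

        $result = json_decode(file_get_contents($api . '?' . http_build_query($params)), true);

        foreach ($result['query']['allpages'] as $page) {
            $titles[] = $page['title'];   // grows without bound
        }

        $continue = isset($result['continue']) ? $result['continue'] : [];
    } while ($continue !== []);

    // Only after the whole list is in memory does revision grabbing start.
    echo count($titles) . " titles collected\n";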
Comment 4 Kunal Mehta (Legoktm) 2014-01-03 08:57:57 UTC
I think we should insert into the database in batches so we don't have to store everything in memory, just a few parts.
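
A minimal sketch of that batched approach: hand each allpages batch to the database (or any callback) as soon as it arrives, so at most one batch lives in memory at a time. The function and callback names here are illustrative, not the actual grabText.php code:

    <?php
    // Process each batch immediately instead of building one huge array.
    function grabPagesInBatches($api, callable $insertBatch, $batchSize = 500)
    {
        $continue = [];
        do {
            $params = array_merge([
                'action'   => 'query',
                'list'     => 'allpages',
                'aplimit'  => $batchSize,
                'format'   => 'json',
                'continue' => '',
            ], $continue);

            $result = json_decode(file_get_contents($api . '?' . http_build_query($params)), true);

            // Insert this batch (e.g. into the page table), then let it be freed.
            $insertBatch($result['query']['allpages']);

            $continue = isset($result['continue']) ? $result['continue'] : [];
        } while ($continue !== []);
    }

    // Example use: count pages batch by batch, holding only one batch at a time.
    $total = 0;
    grabPagesInBatches('http://wikihow.com/api.php', function (array $pages) use (&$total) {
        $total += count($pages);
        echo "processed $total pages so far\n";
    });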
Comment 5 Gerrit Notification Bot 2014-01-03 09:10:58 UTC
Change 105153 had a related patch set uploaded by Legoktm:
grabText: Don't store entire list of pages in memory

https://gerrit.wikimedia.org/r/105153
Comment 6 Gerrit Notification Bot 2014-01-07 13:10:52 UTC
Change 105153 merged by Jack Phoenix:
grabText: Don't store entire list of pages in memory

https://gerrit.wikimedia.org/r/105153
Comment 7 Andre Klapper 2014-03-01 01:34:34 UTC
Can somebody confirm this is FIXED by the commit in comment 6?
Nemo?
Comment 8 Nemo 2014-03-04 08:34:36 UTC
I don't know; my local checkout no longer works. Tentatively closing.
