Last modified: 2014-10-21 12:10:11 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and beyond displaying bug reports and their history, links may be broken. See T74209, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 72209 - testExturlusage takes forever on test.wikipedia
Status: RESOLVED FIXED
Product: Pywikibot
Classification: Unclassified
Component: tests
Version: core-(2.0)
Hardware: All
OS: All
Priority: Unprioritized
Severity: normal
Target Milestone: ---
Assigned To: Pywikipedia bugs
Depends on:
Blocks:

Reported: 2014-10-18 13:29 UTC by John Mark Vandenberg
Modified: 2014-10-21 12:10 UTC
CC: 1 user

See Also:
Web browser: ---
Mobile Platform: ---

Attachments: (none)

Description John Mark Vandenberg 2014-10-18 13:29:59 UTC
testExturlusage uses:

    for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5):

This returns quickly on test.wikidata, as there is very little matching data:

https://test.wikidata.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com

All of the other Travis build platforms also provide the five requested records in a reasonable period of time.

test.wikipedia, however, has a lot of matching data:

https://test.wikipedia.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com

PageGenerator on test.wikipedia yields four results after a few API calls. After the fourth result, however, it backs off to requesting data with a geulimit of 1, producing the following request/response sequence:

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'20'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 21}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'21'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 22}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'22'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 23}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'23'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']}
{u'query-continue': {u'exturlusage': {u'geuoffset': 24}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

It then proceeds to iterate continuously, seemingly forever (I killed it after 10 minutes).
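
For reference, a minimal sketch reproducing the behaviour above (assumes a configured pywikibot install; the print call is only illustrative):

    import pywikibot

    mysite = pywikibot.Site('test', 'wikipedia')
    # With matches sparse in namespaces 2 and 3, the generator backs off to
    # geulimit=1 and issues thousands of near-empty API requests before the
    # fifth result arrives.
    for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5):
        print(link)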
Comment 1 Mpaa 2014-10-18 22:19:53 UTC
There are more results, but much further on; I do not know whether you got that far.

Try different values of geuoffset (e.g. 10000) in the link below.
Between 12000 and 12500 the results run out.

https://test.wikipedia.org/w/api.php?inprop=protection&geuprotocol=http&maxlag=5&generator=exturlusage&format=jsonfm&geuquery=www.google.com&prop=info|imageinfo|categoryinfo&meta=userinfo&indexpageids=&geulimit=5000&geuoffset=10000&action=query&geunamespace=2|3&iiprop=timestamp|user|comment|url|size|sha1|metadata&uiprop=blockinfo|hasmsg

{
    "query-continue": {
        "exturlusage": {
            "geuoffset": 10500
        }
    },
    "warnings": {
        "exturlusage": {
            "*": "geulimit may not be over 500 (set to 5000) for users"
        }
    },
    "query": {
        "pageids": [
            "12828"
        ],
        "pages": {
            "12828": {
                "pageid": 12828,
                "ns": 2,
                "title": "User:\u05dc\u05e2\u05e8\u05d9 \u05e8\u05d9\u05d9\u05e0\u05d4\u05d0\u05e8\u05d8/monobook.js",
                "contentmodel": "javascript",
                "pagelanguage": "en",
                "touched": "2012-04-10T19:34:24Z",
                "lastrevid": 112424,
                "counter": "",
                "length": 4432,
                "protection": []
            }
        },
        "userinfo": {
            "id": 25083,
            "name": "Mpaa"
        }
    }
}
Between 12000 and 12500 the results run out; the response contains no query-continue:

{
    "warnings": {
        "exturlusage": {
            "*": "geulimit may not be over 500 (set to 5000) for users"
        }
    },
    "query": {
        "userinfo": {
            "id": 25083,
            "name": "Mpaa"
        }
    }
}
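
For anyone re-checking this, a small sketch for probing the offsets manually (assumes the requests library; parameters mirror the URL above):

    import requests

    # Step geuoffset manually to find where the sparse region ends.
    # geulimit is capped at 500 for non-bot accounts (see the warning above).
    params = {
        'action': 'query',
        'format': 'json',
        'generator': 'exturlusage',
        'geuquery': 'www.google.com',
        'geuprotocol': 'http',
        'geunamespace': '2|3',
        'geulimit': 500,
        'geuoffset': 10000,
    }
    r = requests.get('https://test.wikipedia.org/w/api.php', params=params).json()
    print(r.get('query-continue'), list(r.get('query', {}).get('pages', ())))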
Comment 2 Mpaa 2014-10-18 22:25:59 UTC
A possible strategy could be to increase new_limit when the code reaches this condition in api.py, line 1090:

else:
    # if query-continue is present, self.resultkey might not have been
    # fetched yet
    if "query-continue" not in self.data:
        # No results.
        return
    # --> here: start to increase the limit?
    # (the tricky part is keeping the total number returned equal to count)

It tries to fetch only the number of elements left to reach 5.
When 1 is reached, it stays there for 12000 queries:

*********        500 500 5 5 0          **********
[[test:User:Nip]]
[[test:User:TeleComNasSprVen]]
*********        500 500 5 3 2          **********
[[test:User:MaxSem/wap]]
*********        500 500 5 2 3          **********
*********        500 500 5 2 3          **********
*********        500 500 5 2 3          **********
*********        500 500 5 2 3          **********
*********        500 500 5 2 3          **********
*********        500 500 5 2 3          **********
[[test:User:HersfoldCiteBot/Citation errors needing manual review]]
*********        500 500 5 1 4          **********
*********        500 500 5 1 4          **********
*********        500 500 5 1 4          **********
......
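
A rough sketch of this idea (hypothetical helper, not the actual api.py code): when a continuation round returns no rows, grow the per-request limit instead of pinning it to the number of items still wanted.

    # Sketch only: skip sparse regions in large steps rather than one
    # offset at a time.
    def next_limit(current_limit, rows_received, still_wanted, api_max=500):
        if rows_received == 0:
            # Sparse region: double the window, capped at the API maximum.
            return min(current_limit * 2, api_max)
        # Dense region: only ask for what is still needed.
        return min(still_wanted, api_max)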
Comment 3 John Mark Vandenberg 2014-10-18 22:59:13 UTC
(In reply to Mpaa from comment #2)
> It tries to fetch only the number of elements left to reach 5.
> When 1 is reached, it stays there for 12000 queries ..

But MW doesn't return one row, as requested..?

Yes, pywikibot will need to detect that it is 'getting nowhere slowly' and exponentially increase the new_limit until it finds data or reaches the end of the dataset.
Comment 4 Mpaa 2014-10-19 08:50:21 UTC
(In reply to John Mark Vandenberg from comment #3)
> (In reply to Mpaa from comment #2)
> > It tries to fetch only the number of elements left to reach 5.
> > When 1 is reached, it stays there for 12000 queries ..
> 
> But MW doesn't return one row, as requested..?
I meant that it will keep sending requests with geulimit=1, so to get to offset 12000 it will send 12000 requests.
Each response advances the offset one row at a time, containing just the query-continue data:
{u'exturlusage': {u'geuoffset': 24}}
{u'exturlusage': {u'geuoffset': 25}}
...

> Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and
> exponentially increase the new_limit until it finds data or end of dataset.
Comment 5 John Mark Vandenberg 2014-10-19 08:54:19 UTC
geulimit=1 says the client wants only 1 record.

The MW API isn't returning one record.  It is moving the cursor forward by one and returning zero records.

It feels like MW is interpreting 'geulimit=1' as 'only look at one record, and return the data only if it meets the request criteria'.
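
My reading of this, as a simplified model (fetch_externallinks and the variables here are hypothetical):

    # Simplified model (assumption): under $wgMiserMode the limit bounds how
    # many rows are scanned, not how many matching rows are returned.
    rows = fetch_externallinks(offset=geuoffset, limit=geulimit)  # hypothetical
    matches = [r for r in rows if r.namespace in geunamespace]
    # With geulimit=1, 'matches' is empty whenever the single scanned row
    # falls outside the requested namespaces; the offset still advances by 1.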
Comment 6 John Mark Vandenberg 2014-10-19 09:03:28 UTC
The API documentation explains it:

  eunamespace         - The page namespace(s) to enumerate.
                        NOTE: Due to $wgMiserMode, using this may result in fewer than "eulimit" results
                        returned before continuing; in extreme cases, zero results may be returned
                        Values (separate with '|'): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ..
                        Maximum number of values 50 (500 for bots)
Comment 7 Gerrit Notification Bot 2014-10-19 10:10:20 UTC
Change 167438 had a related patch set uploaded by Mpaa:
api.py: increase api limits when data are sparse

https://gerrit.wikimedia.org/r/167438
Comment 8 Mpaa 2014-10-19 10:15:14 UTC
Yes, that is what I meant.
Comment 9 Gerrit Notification Bot 2014-10-21 05:39:44 UTC
Change 167438 merged by jenkins-bot:
Increase limits in QueryGenerator when data are sparse

https://gerrit.wikimedia.org/r/167438
