Last modified: 2014-10-21 12:10:11 UTC
testExturlusage uses:

    for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5)

This returns quickly on test.wikidata, as there is very little matching data:
https://test.wikidata.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com
All of the other Travis build platforms also provide the five requested records in a reasonable period of time. test.wikipedia, however, has a lot of matching data:
https://test.wikipedia.org/w/index.php?title=Special%3ALinkSearch&target=http%3A%2F%2Fwww.google.com
On test.wikipedia the PageGenerator yields four results after a few API calls, but after the fourth result it has backed off to requesting data with a geulimit of 1, resulting in the following request/response sequence:

    {'inprop': [u'protection'],
     'geuprotocol': [u'http'],
     'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'],
     'maxlag': ['5'],
     u'geuoffset': [u'20'],
     'generator': [u'exturlusage'],
     'format': ['json'],
     'prop': [u'info', u'imageinfo', u'categoryinfo'],
     'meta': ['userinfo'],
     'indexpageids': [u''],
     u'geulimit': [u'1'],
     'action': [u'query'],
     u'geunamespace': [u'2', u'3'],
     'geuquery': [u'www.google.com'],
     'uiprop': ['blockinfo', 'hasmsg']}

    {u'query-continue': {u'exturlusage': {u'geuoffset': 21}},
     u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

The subsequent requests are identical except that geuoffset advances by one each time (21, 22, 23, ...), and each response again contains no pages, only query-continue data:

    {u'query-continue': {u'exturlusage': {u'geuoffset': 22}},
     u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

    {u'query-continue': {u'exturlusage': {u'geuoffset': 23}},
     u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

    {u'query-continue': {u'exturlusage': {u'geuoffset': 24}},
     u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}

It then proceeds to iterate continuously, seemingly forever. (I killed it after 10 minutes.)
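The behaviour above can be reproduced with a small simulation (hypothetical helper, not pywikibot code): once the generator has four of the five requested items, it only ever asks for the one item still needed, so the cursor crawls forward one row per request.

```python
def simulate_requests(matches, total, max_limit=500):
    """Simulate a QueryGenerator-style loop that only ever asks for
    the number of items still needed (total - count).

    `matches` is the set of offsets whose rows satisfy the namespace
    filter; the simulated server scans `limit` rows from `offset` and
    returns only the matching ones (miser-mode filtering), so a
    request can legitimately return zero rows.
    """
    count = 0
    offset = 0
    requests = 0
    end = max(matches) + 1
    while count < total and offset < end:
        limit = min(max_limit, total - count)   # shrinks as results arrive
        returned = [o for o in range(offset, offset + limit) if o in matches]
        count += len(returned)
        offset += limit                          # cursor always advances by `limit`
        requests += 1
    return requests

# Four matches early on, the fifth some 12000 rows later: with the
# limit stuck at 1, thousands of round trips are needed to reach it.
sparse = {1, 2, 3, 4, 12000}
print(simulate_requests(sparse, total=5))
```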
There are more results much further on; I do not know whether you reached them. Try different values of geuoffset (e.g. geuoffset=10000) in the link below. Between offsets 12000 and 12500 the results finish.

https://test.wikipedia.org/w/api.php?inprop=protection&geuprotocol=http&maxlag=5&generator=exturlusage&format=jsonfm&geuquery=www.google.com&prop=info|imageinfo|categoryinfo&meta=userinfo&indexpageids=&geulimit=5000&geuoffset=10000&action=query&geunamespace=2|3&iiprop=timestamp|user|comment|url|size|sha1|metadata&uiprop=blockinfo|hasmsg

    {
        "query-continue": {
            "exturlusage": {
                "geuoffset": 10500
            }
        },
        "warnings": {
            "exturlusage": {
                "*": "geulimit may not be over 500 (set to 5000) for users"
            }
        },
        "query": {
            "pageids": [ "12828" ],
            "pages": {
                "12828": {
                    "pageid": 12828,
                    "ns": 2,
                    "title": "User:\u05dc\u05e2\u05e8\u05d9 \u05e8\u05d9\u05d9\u05e0\u05d4\u05d0\u05e8\u05d8/monobook.js",
                    "contentmodel": "javascript",
                    "pagelanguage": "en",
                    "touched": "2012-04-10T19:34:24Z",
                    "lastrevid": 112424,
                    "counter": "",
                    "length": 4432,
                    "protection": []
                }
            },
            "userinfo": {
                "id": 25083,
                "name": "Mpaa"
            }
        }
    }

Between 12000 and 12500 they finish:

    {
        "warnings": {
            "exturlusage": {
                "*": "geulimit may not be over 500 (set to 5000) for users"
            }
        },
        "query": {
            "userinfo": {
                "id": 25083,
                "name": "Mpaa"
            }
        }
    }
A possible strategy could be to increase new_limit when the code reaches this condition in api.py, line 1090:

    else:
        # if query-continue is present, self.resultkey might not have been
        # fetched yet
        if "query-continue" not in self.data:
            # No results.
            return
        # --> start to increase the counter here?

The tricky part is to keep the total number of items returned equal to count. The generator tries to fetch only the number of elements left to reach 5; once the limit reaches 1, it stays there for some 12000 queries:

    ********* 500 500 5 5 0 **********
    [[test:User:Nip]]
    [[test:User:TeleComNasSprVen]]
    ********* 500 500 5 3 2 **********
    [[test:User:MaxSem/wap]]
    ********* 500 500 5 2 3 **********
    ********* 500 500 5 2 3 **********
    ********* 500 500 5 2 3 **********
    ********* 500 500 5 2 3 **********
    ********* 500 500 5 2 3 **********
    ********* 500 500 5 2 3 **********
    [[test:User:HersfoldCiteBot/Citation errors needing manual review]]
    ********* 500 500 5 1 4 **********
    ********* 500 500 5 1 4 **********
    ********* 500 500 5 1 4 **********
    ......
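One way to implement that idea (a sketch only, not the patch that was eventually merged; `next_limit` is a hypothetical helper) is to grow the request limit geometrically whenever a continuation round trip returns no data, and fall back to asking for exactly what is left once rows appear again:

```python
def next_limit(prev_limit, items_returned, remaining, max_limit=500):
    """Pick the limit for the next continuation request.

    If the previous request returned nothing (sparse data under miser
    mode), double the limit so the server scans more rows per round
    trip; otherwise ask only for the items still needed, capped at
    the API maximum.
    """
    if items_returned == 0:
        return min(prev_limit * 2, max_limit)
    return min(remaining, max_limit)

# Starting from a backed-off limit of 1 with no data coming back,
# the limit recovers to the 500 cap within a handful of requests.
limits = []
limit = 1
for _ in range(10):
    limit = next_limit(limit, items_returned=0, remaining=1)
    limits.append(limit)
print(limits)
```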
(In reply to Mpaa from comment #2)
> It tries to fetch only the number of elements left to reach 5.
> When 1 is reached, it stays there for 12000 queries ..

But MW doesn't return one row, as requested..?

Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and exponentially increase the new_limit until it finds data or reaches the end of the dataset.
(In reply to John Mark Vandenberg from comment #3)
> (In reply to Mpaa from comment #2)
> > It tries to fetch only the number of elements left to reach 5.
> > When 1 is reached, it stays there for 12000 queries ..
>
> But MW doesnt return one row, as requested..?

I meant that it will keep sending requests with geulimit=1, so to get to offset 12000 it will send some 12000 requests. Each response contains no rows, just query-continue data:

    {u'exturlusage': {u'geuoffset': 24}}
    {u'exturlusage': {u'geuoffset': 25}}
    ...

> Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and
> exponentially increase the new_limit until it finds data or end of dataset.
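The cost difference is easy to quantify. Assuming the next match sits roughly 12000 rows past the cursor (as observed above), a fixed limit of 1 costs one request per row, while doubling the limit up to the 500-per-request user cap reaches the same point in a few dozen requests:

```python
rows_to_scan = 12000

# Fixed geulimit=1: the cursor advances one row per request.
fixed_requests = rows_to_scan

# Doubling the limit from 1 up to the user cap of 500, then
# continuing to scan 500 rows per request.
doubling_requests = 0
limit, scanned = 1, 0
while scanned < rows_to_scan:
    scanned += limit
    doubling_requests += 1
    limit = min(limit * 2, 500)

print(fixed_requests, doubling_requests)
```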
geulimit=1 says the client wants only 1 record. The MW API isn't returning one record; it is moving the cursor forward by one and returning zero records. It feels like MW is interpreting 'geulimit=1' as 'only look at one record, and return the data if it meets the request criteria'.
The API documentation explains it:

    eunamespace - The page namespace(s) to enumerate.
                  NOTE: Due to $wgMiserMode, using this may result in fewer
                  than "eulimit" results returned before continuing; in
                  extreme cases, zero results may be returned.
                  Values (separate with '|'): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                  10, 11, 12, 13, 14, 15, ..
                  Maximum number of values 50 (500 for bots)
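In other words, under $wgMiserMode the namespace filter is applied after the rows are read: the server scans up to eulimit rows of the external-links table, discards those outside the requested namespaces, and returns whatever is left, possibly nothing, together with a continue offset. A sketch of that server-side behaviour (hypothetical data and function, not MediaWiki code):

```python
def exturlusage_query(rows, offset, limit, namespaces):
    """Miser-mode sketch: scan `limit` rows starting at `offset`,
    then filter by namespace *after* the scan, so fewer than `limit`
    rows (possibly zero) come back, plus a continue offset."""
    window = rows[offset:offset + limit]
    results = [r for r in window if r["ns"] in namespaces]
    continue_offset = offset + limit if offset + limit < len(rows) else None
    return results, continue_offset

# Ten rows, only the last in namespace 2: asking for 3 rows at
# offset 0 returns zero results but still moves the cursor forward.
rows = [{"title": "Page%d" % i, "ns": 0} for i in range(9)]
rows.append({"title": "User:Example", "ns": 2})
results, cont = exturlusage_query(rows, offset=0, limit=3, namespaces=[2, 3])
print(results, cont)   # zero results, continue offset 3
```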
Change 167438 had a related patch set uploaded by Mpaa: api.py: increase api limits when data are sparse https://gerrit.wikimedia.org/r/167438
Yes, that is what I meant.
Change 167438 merged by jenkins-bot: Increase limits in QueryGenerator when data are sparse https://gerrit.wikimedia.org/r/167438