Last modified: 2014-11-09 20:50:02 UTC
(Creating bug from https://gerrit.wikimedia.org/r/#/c/149645/) In May 2014 the changeset https://gerrit.wikimedia.org/r/#/c/130301/ broke 'category.py remove' for categories that contain only files and no pages. It is also a significant performance regression: 'categoryinfo' asks the server to count all of the category members, which is slow, only to return counts, and category.py then needs to iterate over the category members anyway, so it must hit the server again.
I'm confused: I don't see a way to get the counts without hitting the server, and the server must be hit at least once to get the pages or categories in the "worst case". So the previous handling, without 'categoryinfo', seems to be the better solution. Of course it could be written a bit more Pythonically:

    articles = cat.articles()
    if articles:
        # remove them
    else:
        # "complain"

Even the change in 'CategoryTreeRobot' should be reverted, because the database should be the fastest way to get the number of articles in a category. Speaking of the database: it seems to be the only way to avoid hitting the server in the "best case" (|articles| == 0), but 'CategoryMoveRobot' doesn't use the database, and the database is only used for read-only operations. The changes above line 644 (old)/642 (new) look fine to me, though. Now of course I'm fairly new to the bot's internals, so my assessment is probably off.
Your assessment is exactly the same as mine. CC'ing the people involved in the previous changeset in case they can see some problem with our assessment. I didn't +2 your changeset because it looks to me like the previous changeset needs to be mostly reverted, except for the formatting changes. Do you want to put up a 'revert' patch for review? (Yes, the server must be hit at least once. Invoking cat.categoryinfo is the unnecessary performance regression: the client logic needs all of the contents anyway, so it isn't useful to ask the server to compute the counts when the client can do it.) I believe the database is only used for read-only operations because stale data isn't a significant problem there. Write operations need to use fresh data to minimise errors.
cat.articles() is a generator, and bool(cat.articles()) is always True.

You may try a simple performance test like:

    >>> import pwb, pywikibot as py
    >>> s = py.Site()
    >>> c = py.Category(s, 'Mann')
    >>> c.categoryinfo['pages']
    456626

You'll get the result above immediately.

    >>> g = c.articles()
    >>> len(set(g))

You may wait several minutes to get a result. I've given up after 5.
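A minimal demonstration of the point about truthiness (plain Python, no pywikibot required; `empty_gen` is just an illustrative stand-in for `cat.articles()`):

```python
def empty_gen():
    """A generator function that yields nothing."""
    return
    yield  # the bare yield makes this function a generator

g = empty_gen()
print(bool(g))   # True: a generator object is always truthy,
                 # even when it will yield no items
print(list(g))   # []
```

So `if articles:` on a generator can never take the else branch, regardless of what the category contains.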
(In reply to xqt from comment #3)
> cat.articles() is a generator, and bool(cat.articles()) is always True.
>
> You may try a simple performance test like:
>
> >>> import pwb, pywikibot as py
> >>> s = py.Site()
> >>> c = py.Category(s, 'Mann')
> >>> c.categoryinfo['pages']
> 456626
>
> You'll get the result above immediately.

I see the server doesn't spend much time working on this, as it loads the data from the category table: https://www.mediawiki.org/wiki/Manual:Category_table However, it is another API call to the server, which adds more network latency (think of someone in the jungle of Borneo on a 3G modem), server workload, delays (an issue on slower wikis), etc.

> >>> g = c.articles()
> >>> len(set(g))
>
> You may wait several minutes to get a result. I've given up after 5.

len(set(g)) is obviously not the best way to determine whether the generator has any results. The best way is, I think, to process the generator with an else block to detect when it yielded no results. There are other tricks to determine instantaneously whether the generator is empty.
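One such trick, sketched as a hypothetical helper (check_empty is not a pywikibot function): pull the first item with next() and a sentinel, then chain it back in front of the rest, so only one item is fetched up front.

```python
from itertools import chain

_SENTINEL = object()

def check_empty(gen):
    """Return (is_empty, gen2), where gen2 yields the same items as gen.

    Illustrative helper only. At most one item is fetched eagerly,
    so a huge category is never iterated just to test for emptiness.
    """
    first = next(gen, _SENTINEL)
    if first is _SENTINEL:
        return True, iter(())
    return False, chain([first], gen)

empty, pages = check_empty(iter([]))
print(empty)                 # True
empty, pages = check_empty(p for p in ('A', 'B'))
print(empty, list(pages))    # False ['A', 'B']
```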
Okay, my mistake: 'bool(cat.articles())' returns True even when the generator yields nothing. But you could then do something like:

    empty = True
    for page in cat.articles():
        empty = False
        break
    if empty:
        # complain

Now I'm not sure whether this is slower (but it doesn't appear to be). And maybe there is a more Pythonic way. A for-else does not distinguish between an empty list and a non-empty one.
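Both points above can be checked with plain iterables, independent of pywikibot (a sketch; `is_empty` is just an illustrative name):

```python
# The else clause of a for loop runs whenever the loop finishes
# without break -- which includes the zero-iteration case, so it
# cannot by itself tell an empty iterable from a non-empty one:
for page in []:
    pass
else:
    print('else ran even though the iterable was empty')

# The flag approach from above, with the early exit folded into a
# return, so at most one item is ever pulled from the iterator:
def is_empty(iterable):
    for _ in iterable:
        return False  # same effect as setting the flag and breaking
    return True

print(is_empty(iter([])))      # True
print(is_empty(iter([1, 2])))  # False
```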
(In reply to Fabian from comment #5)
> A for-else does not distinguish between an empty list and a non-empty one.

This will work:

    for i, page in enumerate(generator, 1):
        ...
    else:
        if 'i' not in locals():
            ...

(I don't like locals() being used.) There are many other ways to do it, including wrapping the generator in a generator class that allows peeking.
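A sketch of such a wrapper class (hypothetical; pywikibot does not ship a `PeekableGenerator`):

```python
from itertools import chain

class PeekableGenerator:
    """Wrap an iterator so emptiness can be tested before iterating.

    Illustrative only: the constructor eagerly fetches just the first
    item, so bool() works without exhausting the underlying iterator.
    """

    _SENTINEL = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._first = next(self._it, self._SENTINEL)

    def __bool__(self):
        return self._first is not self._SENTINEL

    def __iter__(self):
        if self._first is self._SENTINEL:
            return iter(())
        return chain([self._first], self._it)

pages = PeekableGenerator(p for p in ('A', 'B'))
print(bool(pages))                        # True
print(list(pages))                        # ['A', 'B']
print(bool(PeekableGenerator(iter([]))))  # False
```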
https://gerrit.wikimedia.org/r/149645/ has been merged; is this fixed?