Last modified: 2014-11-09 20:50:02 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T70705, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68705 - 'category remove' regression when category contains only files
Status: RESOLVED FIXED
Product: Pywikibot
Classification: Unclassified
Component: category.py
Version: core-(2.0)
Hardware/OS: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Pywikipedia bugs
Depends on:
Blocks:
Reported: 2014-07-28 02:12 UTC by John Mark Vandenberg
Modified: 2014-11-09 20:50 UTC
CC: 4 users

See Also:
Web browser: ---
Mobile Platform: ---


Description John Mark Vandenberg 2014-07-28 02:12:55 UTC
(Creating bug from https://gerrit.wikimedia.org/r/#/c/149645/)

In May 2014 this changeset broke 'category.py remove' for categories with only files and no pages.
https://gerrit.wikimedia.org/r/#/c/130301/

It is also a significant performance regression: 'categoryinfo' requires the server to fetch all of the category members, which is slow, only to return counts, while category.py still needs to iterate over the category members itself and therefore must hit the server again.
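
For illustration, the double round trip looks roughly like this (a sketch only; the category name is arbitrary and this is not the actual category.py code):

import pywikibot

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Mann')

# First API request: prop=categoryinfo, the server computes the member counts
counts = cat.categoryinfo

# Second API request (and following): list=categorymembers, the client
# fetches the members it has to iterate over anyway
for article in cat.articles():
    pass  # remove the category from each article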
Comment 1 Fabian 2014-07-28 11:46:18 UTC
I'm confused: I don't see a possibility to get the counts without hitting the server. And the server must be hit at least once to get the pages or categories in the “worst case”.

So the previous handling without using 'categoryinfo' seems to be the better solution. Of course it could be written a bit more pythonically by simply using

articles = cat.articles()
if articles:
    # remove them
else:
    # "complain"

Even the change in 'CategoryTreeRobot' should be reverted, because the database should be the fastest way to get the number of articles in a category.

Speaking of the database: it seems to be the only way to avoid hitting the server in the “best case” (|articles| == 0). But 'CategoryMoveRobot' doesn't use the database, and the database is only used for read-only operations.

But the changes above line 644 (old)/642 (new) look fine to me.

Now of course I'm fairly new to the bot's internals, so my assessment is probably off.
Comment 2 John Mark Vandenberg 2014-07-28 13:42:47 UTC
Your assessment is exactly the same as mine.  CC:ing the people involved in the previous changeset in case they can see some problem with our assessment.

I didn't +2 your changeset because it looks to me like the previous changeset needs to be mostly reverted, except for the formatting changes. Do you want to put up a 'revert' patch for review?

(Yes, the server must be hit at least once. Invoking cat.categoryinfo is the unnecessary performance regression: the client logic needs all of the contents anyway, so it isn't useful to ask the server to compute the counts when the client can do it.)

I believe the database is only used for read-only operations because it isn't a significant problem if read-only operations use stale data. Write operations need to use fresh data to minimise errors.
Comment 3 xqt 2014-07-28 15:32:53 UTC
cat.articles() is a generator and bool(cat.articles()) is always True.

You may try a simple performance test like:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> c = py.Category(s, 'Mann')
>>> c.categoryinfo['pages']
456626

You'll get the result above immediately


>>> g = c.articles()
>>> len(set(g))

You may wait several minutes to get a result. I've given up after 5.
Comment 4 John Mark Vandenberg 2014-07-28 15:56:05 UTC
(In reply to xqt from comment #3)
> cat.articles() is a generator and bool(cat.articles()) is always True.
> 
> You may try a simple performance test like:
> 
> >>> import pwb, pywikibot as py
> >>> s = py.Site()
> >>> c = py.Category(s, 'Mann')
> >>> c.categoryinfo['pages']
> 456626
> 
> You'll get the result above immediately

I see the server doesn't spend much time working on this, as it loads the data from the category table.

https://www.mediawiki.org/wiki/Manual:Category_table

However, it is another API call to the server, which adds more network latency (think of someone in the jungle of Borneo on a 3G modem), server workload, and delays (an issue on slower wikis), etc.
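
For the record, the extra request behind cat.categoryinfo is a single query of the form (host and title are placeholders):

/w/api.php?action=query&prop=categoryinfo&titles=Category:Mann

whereas iterating the members goes through list=categorymembers, which returns the member pages themselves.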

> 
> >>> g = c.articles()
> >>> len(set(g))
> 
> You may wait several minutes to get a result. I've given up after 5.

len(set(g)) is obviously not the best way to determine if the generator has any results.

The best way is, I think, to process the generator with an else block to detect if the generator had no results.  There are other tricks to instantaneously determine if the generator is empty.
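
For example, one such trick could look like this (just a sketch; the helper name is made up and not part of pywikibot): pull the first item with next() and, if there is one, chain it back in front of the remaining items.

import itertools

def articles_or_none(gen):
    # Return an equivalent iterator, or None if gen yields nothing.
    try:
        first = next(gen)
    except StopIteration:
        return None
    return itertools.chain([first], gen)

# assuming cat is a pywikibot.Category
members = articles_or_none(cat.articles())
if members is None:
    pass  # "complain"
else:
    for page in members:
        pass  # remove the category from each page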
Comment 5 Fabian 2014-07-28 17:07:05 UTC
Okay, my mistake: 'bool(cat.articles())' returns True since it's not None. But you could then do something like:

empty = True
for page in cat.articles():
    empty = False
if empty:
    # complain

Now I'm not sure if this is slower (but it doesn't appear so). And maybe there is a more pythonic way. The for-else does not distinguish between an empty list and a non-empty one.
Comment 6 John Mark Vandenberg 2014-07-28 17:37:29 UTC
(In reply to Fabian from comment #5)
> The for-else does not distinguish between an empty list
> and a non-empty one.

This will work

for i, page in enumerate(generator, 1):
    ....
else:
    if 'i' not in locals():
        ...

(I don't like locals() being used)

There are many other ways to do it, including wrapping the generator in a generator class that allows peeking.
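
A rough sketch of that last idea (the class name and interface are made up here, not something that exists in the framework):

class PeekableGenerator(object):
    # Wrap a generator so it can be checked for emptiness before iterating.

    def __init__(self, generator):
        self._generator = generator
        self._buffer = []

    def is_empty(self):
        # Try to cache one item; if there is none, the generator is empty.
        if not self._buffer:
            try:
                self._buffer.append(next(self._generator))
            except StopIteration:
                return True
        return False

    def __iter__(self):
        # Yield any cached item first, then the rest of the generator.
        while self._buffer:
            yield self._buffer.pop(0)
        for item in self._generator:
            yield item

Usage would then be something like: gen = PeekableGenerator(cat.articles()); complain if gen.is_empty(), otherwise iterate over gen as usual.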
Comment 7 Ricordisamoa 2014-08-08 09:58:54 UTC
https://gerrit.wikimedia.org/r/149645/ merged, is this fixed?
