Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; bug reports are now handled in Wikimedia Phabricator. This static website is read-only and kept for historical purposes: it is not possible to log in, and except for displaying bug reports and their history, links may be broken. See T69117, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 67117 - Performance review of PubSubHubbub extension
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: PubSubHubbub (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 67623
Reported: 2014-06-26 09:10 UTC by Sebastian Brückner
Modified: 2014-09-17 07:04 UTC
CC: 12 users
See Also:
Web browser: ---
Mobile Platform: ---

Attachments: (none)

Description Sebastian Brückner 2014-06-26 09:10:01 UTC
The PubSubHubbub extension (https://www.mediawiki.org/wiki/Extension:PubSubHubbub) needs to be reviewed before deployment on Wikidata.
Comment 1 Aaron Schulz 2014-07-08 21:52:34 UTC
A few things:
* This seems to be missing an ArticleRevisionVisibilitySet handler
* http_post() needs to handle $wgHTTPProxy
* The maximum job attempts for the queue will have to be set very high to avoid update losses (maxTries)
* NewRevisionFromEditComplete and other hooks trigger before COMMIT, so the jobs should probably be delayed (using 'jobReleaseTimestamp'); see the sketch after this list. Moving them post-COMMIT is not an option since a network partition could cause nothing to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).
* We tend to run lots of jobs for one wiki at a time. http_post() could benefit from some sort of singleton on the curl handle instead of closing it each time. See http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-php-curl-library.
* createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break" statement. Also, I'm not sure how well that works: there can only be one job in the queue for hitting the URL that returns this result, but it only covers the last 60 seconds of changes. Also, it selects rc_timestamp but does not currently use it. Is it OK if the Hub misses a bunch of changes from this (e.g. are the per-Title jobs good enough)?
* It's curious that the hub is supposed to talk back to a special page; why not an API page instead?
* The Link headers also go there. What is the use of these? Also, since they'd take 30 days to apply to all pages (the varnish cache TTL), it would be a pain to change them. They definitely need to be stable.
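
Regarding the delayed-job point above, a minimal sketch of what the enqueue could look like. The PubSubHubbubNotifyJob class, the 'pubSubHubbubNotify' job type and the 'hubUrl' parameter are illustrative assumptions, not the extension's actual names, and the configured queue class must support delayed jobs for 'jobReleaseTimestamp' to take effect:

    // In the NewRevisionFromEditComplete handler: push the notification job with
    // a release timestamp a few seconds in the future, so the surrounding DB
    // transaction has time to COMMIT before a runner claims the job.
    // $title comes from the hook; $hubUrl from extension config (hypothetical).
    $job = new PubSubHubbubNotifyJob( $title, array(
        'hubUrl' => $hubUrl,                  // hypothetical parameter
        'jobReleaseTimestamp' => time() + 10, // do not run for ~10 seconds
    ) );
    JobQueueGroup::singleton()->push( $job );

    // In site configuration: raise maxTries for this job type so transient hub
    // failures are retried instead of silently dropping the update.
    $wgJobTypeConf['pubSubHubbubNotify'] = array(
        'maxTries' => 100,
    ) + $wgJobTypeConf['default'];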

Come to think of it, it seems like we need to send the following events to the hub:
* New revisions
* Page (un)deletions
* Revision (un)deletions
* Page moves
All of the above leave either edit or log entries in recent changes. Page moves only leave one at the old title...though rc_params can be inspected to get the new title. I wonder if, instead of a job per title, there could instead be a single job that sends all changes since the "last update time" and updates the "last update time" on success (a rough sketch follows at the end of this comment). The advantages would be:
a) Far fewer jobs needed
b) All updates would be batched
c) Supporting more hubs is easier since only another job and time position is needed (rather than N jobs for each hub for each title)
Of course I may have missed something.
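
To make the shape of that single-job idea concrete, a very rough sketch (the class, the job type, the 'psh_hub_position' table and notifyHub() are all hypothetical, not the extension's code):

    class PubSubHubbubBatchJob extends Job {
        public function __construct( Title $title, array $params ) {
            parent::__construct( 'pubSubHubbubBatch', $title, $params );
            $this->removeDuplicates = true; // at most one queued job per hub
        }

        public function run() {
            $dbw = wfGetDB( DB_MASTER );
            // Hypothetical one-row-per-hub table holding the "last update time".
            $since = $dbw->selectField( 'psh_hub_position', 'php_timestamp',
                array( 'php_hub' => $this->params['hub'] ), __METHOD__ );

            $res = $dbw->select( 'recentchanges',
                array( 'rc_namespace', 'rc_title', 'rc_timestamp' ),
                array( 'rc_timestamp > ' . $dbw->addQuotes( $since ) ),
                __METHOD__,
                array( 'ORDER BY' => 'rc_timestamp ASC', 'LIMIT' => 500 )
            );

            $titles = array();
            $newPos = $since;
            foreach ( $res as $row ) {
                // Deduplicate titles while advancing the position marker.
                $key = $row->rc_namespace . ':' . $row->rc_title;
                $titles[$key] = Title::makeTitle( $row->rc_namespace, $row->rc_title );
                $newPos = $row->rc_timestamp;
            }

            if ( $titles && !$this->notifyHub( $this->params['hub'], $titles ) ) {
                return false; // let the queue retry (maxTries must be generous)
            }

            // Only bump the position once the hub acknowledged the whole batch.
            $dbw->update( 'psh_hub_position',
                array( 'php_timestamp' => $newPos ),
                array( 'php_hub' => $this->params['hub'] ), __METHOD__ );
            return true;
        }

        // Placeholder for the actual batched HTTP POST(s) to the hub.
        private function notifyHub( $hub, array $titles ) {
            // ... build the payload and POST it (cf. http_post()) ...
            return true;
        }
    }
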
Comment 2 Alexander Lehmann 2014-07-10 15:12:52 UTC
Hi Aaron.

Thanks for your notes. Some questions about them are inline.

(In reply to Aaron Schulz from comment #1)
> A few things:
> * This seems to be missing an ArticleRevisionVisibilitySet handler
> * http_post() needs to handle $wgHTTPProxy
> * The maximum job attempts for the queue will have to be set very high to
> avoid update losses (maxTries)
> * NewRevisionFromEditComplete and other hooks trigger before COMMIT so the
> jobs should probably be delayed (using 'jobReleaseTimestamp'). Moving them
> post-COMMIT is not an option since the network partition could cause nothing
> to be enqueued (the reverse, a job and no COMMIT, is wasteful but harmless).

I'm not sure if I understood the problem correctly. How should I set the time period of the delay? Should the delay adjust dynamically, or should we set a fixed period of time?

> * We tend to run lots of jobs for one wiki at a time. http_post() could
> benefit from some sort of singleton on the curl handle instead of closing it
> each time. See
> http://stackoverflow.com/questions/972925/persistent-keepalive-http-with-the-
> php-curl-library.
> * createLastChangesOutput() should use LIMIT+DISTINCT instead of a "break"
> statement. Also, I'm not sure how well that works. There can only be one job
> for hitting the URL that returns this result in the queue, but it only does
> the last 60 seconds of changes. Also, it selects rc_timestamp but does not
> use it now. Is it OK if the Hub misses a bunch of changes from this (e.g.
> are the per-Title jobs good enough?)
> * It's curious that the hub is supposed to talk back to a special page, why
> not an API page instead? 

The extension provides data in the MediaWiki XML export format; I don't know if the API is intended for this format. Does it make a real difference compared to using a special page?

> * The Link headers also go there. What is the use of these? Also, since
> they'd take 30 days to apply to all pages (the varnish cache TTL), it would
> be a pain to change them. They definitely need to be stable.
> 

They need to be there according to the PubSubHubbub specification [1], but they are stable.

> Come to think of it, it seems like we need to send the following events to
> the hub:
> * New revisions
> * Page (un)deletions
> * Revision (un)deletions
> * Page moves
> All of the above leave either edit or log entries in recent changes. Page
> moves only leave one at the old title...though rc_params can be inspected to
> get the new title. I wonder if instead of a job per title if there can
> instead be a single job that sends all changes since the "last update time"
> and updates the "last update time" on success. The advantages would be:
> a) Far fewer jobs needed
> b) All updates would be batched
> c) Supporting more hubs is easier since only another job and time position
> is needed (rather than N jobs for each hub for each title)
> Of course I may have missed something.

Unfortunately, there is no way to identify requests from the hubs; theoretically, anyone can call the PubSubHubbub export interface. Therefore, we do not know the "last update time". The hub cannot provide the time, because the resource is identified only by its exact URL.
Comment 3 Aaron Schulz 2014-07-11 21:49:23 UTC
* The delay could in theory be based on the current max slave lag...but a simple approach would be to just pick a value. 10 seconds would be very safe under non-broken circumstances.
* It's not a huge deal, but it would be nice to use the API (which also works with things like OAuth if we wanted hubs with more access). Brad might have an opinion on special page vs API. I don't feel strongly, but we should get it right given the Link header and cache interaction.
* You could still use the recent changes table to reduce the number of jobs. There could be at most one de-duplicated job per hub that would grab all title changes since the last time and send them to the hub. When the job succeeds (no HTTP errors posting the changed URIs), it could bump the time. The time position could be kept in a simple DB table. The job could be delayed to give it time to cover a larger time range. The range could be "last time" to present (or a smaller range to limit the number of items, especially for the new hub case). Maybe 5 minutes. It could hopefully batch more titles into one or a few HTTP requests (or use pipelining or at least some curl_multi); a rough sketch follows below.
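
A rough illustration of the curl_multi idea, posting one batch of notifications in parallel rather than one blocking request per title (pshPostBatch(), $hubUrl and $payloads are made-up names; this is not the extension's http_post()):

    function pshPostBatch( $hubUrl, array $payloads, $proxy = false ) {
        $multi = curl_multi_init();
        $handles = array();

        foreach ( $payloads as $i => $payload ) {
            $ch = curl_init( $hubUrl );
            curl_setopt( $ch, CURLOPT_POST, true );
            curl_setopt( $ch, CURLOPT_POSTFIELDS, $payload );
            curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
            if ( $proxy ) {
                curl_setopt( $ch, CURLOPT_PROXY, $proxy ); // honour $wgHTTPProxy
            }
            curl_multi_add_handle( $multi, $ch );
            $handles[$i] = $ch;
        }

        // Drive all transfers to completion.
        do {
            $status = curl_multi_exec( $multi, $running );
            if ( $running ) {
                curl_multi_select( $multi );
            }
        } while ( $running && $status === CURLM_OK );

        $ok = true;
        foreach ( $handles as $ch ) {
            $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
            $ok = $ok && $code >= 200 && $code < 300;
            curl_multi_remove_handle( $multi, $ch );
            curl_close( $ch );
        }
        curl_multi_close( $multi );

        return $ok; // let the calling job fail (and retry) if any post failed
    }
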
Comment 4 Brad Jorsch 2014-07-14 17:48:01 UTC
(In reply to Aaron Schulz from comment #3)
> * It's not a huge deal, but it would be nice to use the API (which also
> works with things like OAuth if we wanted hubs with more access). Brad might
> have an opinion on special page vs API. I don't feel strongly, but we should
> get it right given the Link header and cache interaction.

OAuth works with special pages too.

It sounds like the general idea is that the extension could be configured for general users to use? Then an API endpoint wouldn't necessarily be out of place. On the other hand, is there anything wrong with the existing special page implementation? If not, is there a reason not to just stay with that?
Comment 5 Aaron Schulz 2014-07-15 22:26:12 UTC
I don't think it works on special pages. I asked Chris about that a few days back.
Comment 6 Brad Jorsch 2014-07-16 13:25:51 UTC
I tested it before posting comment 4, just to be sure. It does.

You can now test it yourself. Go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=authorize, then go to http://tools.wmflabs.org/oauth-hello-world/index.php?action=testspecial which will dump the response received when hitting Special:MyPage.
Comment 7 Daniel Kinzler 2014-07-16 14:28:52 UTC
The updates use the XML dump format. That's not really compatible with the way the API structures XML output, and if JSON is requested from the API, it would have to be encoded as a single string value. That's really annoying.

Also, API modules generally shouldn't mess with HTTP headers, which is necessary for PuSH, as far as I understand.

A Special page seems to fit the bill much better on the technical level, though it's a bit awkward conceptually. There are several special pages that generate non-HTML output though. 

Hm, perhaps the best way to implement this would be a custom action?...
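
For what it's worth, a bare-bones sketch of the special page route, mainly to show that the skin can be bypassed and the Link headers set directly. The class name, the placeholder hub URL and the use of WikiExporter here are illustrative assumptions, not the extension's actual code:

    class SpecialPubSubHubbubFeed extends SpecialPage {
        public function __construct() {
            parent::__construct( 'PubSubHubbubFeed' );
        }

        public function execute( $subPage ) {
            // Bypass OutputPage/skin rendering entirely.
            $this->getOutput()->disable();

            $response = $this->getRequest()->response();
            $response->header( 'Content-Type: application/xml; charset=utf-8' );

            // Discovery headers required by the PuSH spec (placeholder hub URL).
            $hubUrl  = 'https://pubsubhubbub.example.org/';
            $selfUrl = $this->getPageTitle( $subPage )->getCanonicalURL();
            $response->header( 'Link: <' . $hubUrl . '>; rel="hub"' );
            $response->header( 'Link: <' . $selfUrl . '>; rel="self"' );

            $title = Title::newFromText( $subPage );
            if ( !$title ) {
                $response->header( 'HTTP/1.1 404 Not Found' );
                return;
            }

            // One way to emit the XML dump format for a single page; the default
            // WikiExporter sink writes straight to the output buffer.
            $exporter = new WikiExporter( wfGetDB( DB_SLAVE ), WikiExporter::CURRENT );
            $exporter->openStream();
            $exporter->pageByTitle( $title );
            $exporter->closeStream();
        }
    }
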
Comment 8 Dimitris Kontokostas 2014-09-17 07:04:44 UTC
Without having looked at the code, we had a few comments on this extension that also relate to performance [1][2].

The basic question is whether the client (wiki mirror) can delay the poll (e.g. 5-10 minutes) and aggregate multiple edits of the same page into one. OAI had native support for this.

The reason is that in DBpedia Live we had a lot of scaling issues when the database was under constant update (even with aggregated edits and a 5-10 minute poll delay) while using the MW API in parallel to get data for extraction.

I don't know whether this is supported by the PubSubHubbub spec, but the client could do some aggregation manually (see the sketch below).
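
A toy illustration of what we mean by manual aggregation on the consumer side (readNotificationsFor() and processPage() are made-up stand-ins, not real APIs):

    // Buffer incoming notifications for one poll window and collapse repeated
    // edits of the same page into a single fetch.
    $window  = 300;        // e.g. a 5 minute delay before acting
    $pending = array();    // title => latest notification seen in this window

    foreach ( readNotificationsFor( $window ) as $note ) {
        $pending[ $note['title'] ] = $note; // later edits overwrite earlier ones
    }

    foreach ( $pending as $title => $note ) {
        processPage( $title ); // one extraction/API fetch per page per window
    }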

We have another question about hub data persistence, but we are not sure which thread is more appropriate for it.

[1] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000533.html
[2] https://lists.wikimedia.org/pipermail/wikidata-tech/2014-July/000539.html
