Last modified: 2014-11-13 15:56:01 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T57889, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 55889 - Improve support for asynchronous requests (saving/preloading pages)
Status: PATCH_TO_REVIEW
Product: Pywikibot
Classification: Unclassified
Component: General (Other open bugs)
Version: core-(2.0)
Hardware/OS: All / All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Pywikipedia bugs
Depends on: 55220
Blocks: (none)
 
Reported: 2013-10-18 18:04 UTC by Strainu
Modified: 2014-11-13 15:56 UTC
CC: 1 user

See Also:
Web browser: ---
Mobile Platform: ---
Huggle Beta Tester: ---


Attachments

Description Strainu 2013-10-18 18:04:25 UTC
Currently, the generator functions use yield, which is not thread-safe. PWB should offer a thread-safe version using one of the many interesting suggestions from http://www.dabeaz.com/generators/Generators.pdf (or any other method :P)
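For illustration, one of the simpler ideas from those slides is to serialize next() calls with a lock so several worker threads can share one generator. This is a minimal sketch; the ThreadSafeIterator name is hypothetical, not an existing PWB class:

```python
import threading

class ThreadSafeIterator:
    """Wrap an iterator/generator so multiple threads can pull items safely."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:          # only one thread advances the generator at a time
            return next(self._it)

def pages():
    # stand-in for a page generator
    for title in ("A", "B", "C"):
        yield title

safe = ThreadSafeIterator(pages())
results = []

def worker():
    for item in safe:
        results.append(item)      # list.append is atomic in CPython

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# each item is consumed exactly once across the three threads
```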
Comment 1 Merlijn van Deen (test) 2013-10-18 18:15:02 UTC
What is the goal you want to achieve by this? Remember that threads in Python are useless for computations, due to the GIL.
Comment 2 Strainu 2013-10-18 21:05:26 UTC
As I understand it, I/O happens outside of the GIL. Since the API requests (and more precisely the connections to the servers) are the most time-consuming part of many of my robots, being able to make requests from several threads should somewhat improve performance (as long as the throttling is not too aggressive).

I've noticed that the preloading limit is now only 50 pages, making this problem even more pressing for many small pages. It's probably also a good idea for things like image upload/download.

If it helps, we could run some tests to see whether performance improves for a simple file downloader.
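As a rough illustration of the claim that threaded I/O helps despite the GIL, here is a sketch that simulates slow requests with sleep() (which, like waiting on a socket, releases the GIL); the URLs and timings are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Simulated I/O-bound request; a real HTTP call would release the
    GIL while waiting on the socket, just as sleep() does here."""
    time.sleep(0.2)
    return url

# hypothetical URLs for the experiment
urls = ["https://example.org/page/%d" % i for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.monotonic() - start
# 8 requests x 0.2 s of waiting finish in roughly 0.4 s with 4 workers,
# instead of ~1.6 s sequentially
```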
Comment 3 Merlijn van Deen (test) 2013-10-18 21:26:46 UTC
I see. There are already some features in place, but we may not be using asynchronous requests at every point where they might be useful.

First of all, connections should be re-used - this is already a feature in the httplib2 library.

The next layer, comms.threadedhttp, supports asynchronous requests ('futures' would be a closer term - basically, you create a request and then wait for a lock to be released). However, I don't think we use this feature anywhere, as it's not exposed in the higher-up layers.
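The "create a request, then wait for a lock to be released" pattern described here can be sketched roughly as follows; all names are illustrative and not the actual comms.threadedhttp API:

```python
import threading

class AsyncRequest:
    """Sketch of a future-like request: submit() fires the work on a
    background thread, result() blocks until the event is set."""

    def __init__(self, url):
        self.url = url
        self.data = None
        self._done = threading.Event()   # "released" when the worker finishes

    def _run(self):
        # stand-in for the real HTTP round-trip
        self.data = "response for %s" % self.url
        self._done.set()

    def submit(self):
        threading.Thread(target=self._run).start()
        return self

    def result(self):
        self._done.wait()                # block until the request completes
        return self.data

req = AsyncRequest("https://example.org/w/api.php").submit()
# ... the caller can do other work here ...
print(req.result())
```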

For saving pages, which (I think) is the most relevant place for async requests, we already have support: requests that do not return a reply that has to be handled can be executed asynchronously - see Page.put_async.


For pagegenerators, we might be able to win a bit by requesting the (i+1)th page before returning the i-th page (or, for the PreloadingGenerator, by requesting the (i+1)th batch before all pages from the i-th batch have been returned).
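The batch-lookahead idea could be sketched with a bounded queue and a producer thread that fetches the next batch while the current one is being consumed. This is an illustration, not the actual PreloadingGenerator:

```python
import queue
import threading

def preloading(batches, lookahead=1):
    """Yield items from `batches` while a background thread fetches up to
    `lookahead` batches ahead of the consumer."""
    q = queue.Queue(maxsize=lookahead)
    _END = object()                      # sentinel marking exhaustion

    def producer():
        for batch in batches:
            q.put(batch)                 # blocks once `lookahead` batches are ready
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is _END:
            return
        yield from batch

def fake_api_batches():
    # stand-in for API requests returning 2 pages per batch
    for i in range(3):
        yield ["Page%d-%d" % (i, j) for j in range(2)]

print(list(preloading(fake_api_batches())))
```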
Comment 4 Strainu 2013-10-18 22:25:17 UTC
(In reply to comment #3)
> The next layer, comms.threadedhttp, supports asynchronous requests. [...] I don't think we use this feature anywhere, as
> it's not exposed in the higher-up layers.

I've noticed that while writing the answer to Gerard's questions today :)

> For saving pages, which (I think) is the most relevant place for async
> requests, we already have support, where requests that do not return a
> reply that has to be handled can be handled asynchronously - see
> Page.put_async.

I've experimented with put_async with mixed results. When the upload works, it's mostly OK; however, when one request hits an error (like a 504 from the server), it just keeps trying again and again, keeping the thread blocked.

Instead, the request should probably be de-queued, processed and, if a callback has been registered, the callback should be called in order to allow the bot to re-queue the request. This, however, could cause trouble if the order of the requests is important. The bot can receive a callback, but AFAIK it cannot remove already queued requests. Also, what happens if no callback has been registered? Should we simply re-queue the request? I don't have a perfect solution at this time, but this is a point that should be considered. 
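The de-queue-and-callback behaviour suggested above might look roughly like this sketch; every name here is illustrative and not pywikibot's actual API:

```python
import queue
import threading

def save_worker(q, do_save):
    """On error, de-queue the failed request and invoke a caller-supplied
    callback (instead of retrying forever), so the bot can decide
    whether to re-queue it."""
    while True:
        item = q.get()
        if item is None:                 # sentinel: shut the worker down
            q.task_done()
            return
        page, text, callback = item
        try:
            do_save(page, text)
        except Exception as err:
            if callback is not None:
                callback(page, text, err)   # e.g. log it or re-queue it
            # with no callback registered, the request is simply dropped here
        q.task_done()

failures = []

def on_error(page, text, err):
    failures.append((page, str(err)))

def flaky_save(page, text):
    # stand-in for the API call; fails for one page
    if page == "Bad":
        raise RuntimeError("504 Gateway Timeout")

q = queue.Queue()
threading.Thread(target=save_worker, args=(q, flaky_save), daemon=True).start()
q.put(("Good", "wikitext", on_error))
q.put(("Bad", "wikitext", on_error))
q.put(None)
q.join()
print(failures)   # the failed save was reported once, not retried forever
```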

Another possible issue, that PWB can't really do much about, is that one can get a 504 even if the save is successful, making the re-queueing useless. I don't have a good solution for that either, but we could consult with the Wikimedia developers.

> For pagegenerators, we might be able to win a bit by requesting the (i+1)th
> page before returning the i-th page (or, for the PreloadingGenerator, by
> requesting the (i+1)th batch before all pages from the i-th batch have been
> returned).

This should be especially useful if it can be controlled by the user. Do you have any ideas on how to do this?

I think there were some good ideas brought up on this bug. Should we start a thread on the mailing list so we can gather more input on this?
Comment 5 Gerrit Notification Bot 2014-11-13 15:55:59 UTC
Change 172023 had a related patch set uploaded by John Vandenberg:
Asynchronous HTTP requests

https://gerrit.wikimedia.org/r/172023
