Last modified: 2012-05-22 21:08:36 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links may be broken. See T36785, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 34785 - Some API action=upload & filekey requests take >30 seconds to complete
Status: RESOLVED FIXED
Product: MediaWiki
Classification: Unclassified
Component: Uploading (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: High major (vote)
Target Milestone: Future release
Assigned To: Aaron Schulz
Whiteboard: platformeng
Depends on:
Blocks:
Reported: 2012-02-29 00:05 UTC by ineuw
Modified: 2012-05-22 21:08 UTC
CC: 12 users

See Also:
Web browser: ---
Mobile Platform: ---
Huggle Beta Tester: ---


Attachments
Screenshot of the error message (25.88 KB, image/jpeg)
2012-02-29 00:05 UTC, ineuw

Description ineuw 2012-02-29 00:05:34 UTC
Created attachment 10126 [details]
Screenshot of the error message

Twice I received this error message; please see the attached .jpg screenshot. The message reports a failed upload, but the upload is actually successful.
Comment 1 Bawolff (Brian Wolff) 2012-03-01 00:20:42 UTC
Including the actual error message in the bug title (UploadStashFileNotFoundException) for easier searching and for people too lazy to look at the image ;)
Comment 2 Aaron Schulz 2012-03-01 00:33:21 UTC
How often does this occur?
Comment 3 Mark A. Hershberger 2012-03-01 21:44:15 UTC
Resetting milestone until we figure out how to reproduce this more dependably.
Comment 4 Rob Lanphier 2012-03-01 22:56:48 UTC
Roan has been working on this one.
Comment 5 ineuw 2012-03-17 05:54:55 UTC
Just completed the upload of 19 greyscale images, total size 7.1 MB, and two of them reported the same error, although the images were in fact uploaded, not rejected. Unfortunately I can't provide any technical data.
Comment 6 Max Semenik 2012-03-18 23:04:39 UTC
Just encountered it, upping severity:

== Request ==

POST /w/api.php HTTP/1.1 
User-Agent: Opera/9.80 (Windows NT 6.1; U; ru) Presto/2.10.229 Version/11.61
Host: commons.wikimedia.org
Accept-Language: ru,en;q=0.9
Accept-Encoding: gzip, deflate
Referer: https://commons.wikimedia.org/wiki/Special:UploadWizard
Cookie: *
Connection: Keep-Alive
Content-Length: 955
Content-Type: application/x-www-form-urlencoded
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Content-Transfer-Encoding: binary

== POST variables ==

action	upload
format	json
filekey	10firwjkg56g.caqj15.20471.jpg
filename	Vladislav_Kovalyov_execution_notice.jpg
text	=={{int:filedesc}}== {{Information |description={{en|1=Notice sent to the mother of the excuted 2011 Minsk metro bombing perpetrator Vladislav Kovalyov}} |date=2012-03-16 |source=http://charter97.org/photos/20120317_prigovor.jpg |author=Belarussian Supreme Court |permission= |other_versions= |other_fields= }} =={{int:license-header}}== {{subst:Custom license marker added by UW}} {{PD-BY-exempt}} [[Category:Uploaded with UploadWizard]] [[Category:Death penalty]] [[Category:2012 in Belarus]] [[Category:2011 Minsk Metro bombing]] 
summary	User created page with UploadWizard (alpha)
token	9bf46413397017ca89f840746cf1dcc2+\

== Response ==
HTTP/1.1 200 OK 
Server: nginx/0.7.65
Date: Sat, 17 Mar 2012 10:48:26 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
X-Content-Type-Options: nosniff
Cache-Control: private
MediaWiki-API-Error: internal_api_error_UploadStashFileNotFoundException
Content-Encoding: gzip
Vary: Accept-Encoding
X-Vary-Options: Accept-Encoding;list-contains=gzip
Content-Length: 452
X-Cache: MISS from sq62.wikimedia.org
X-Cache-Lookup: MISS from sq62.wikimedia.org:3128
X-Cache: MISS from amssq38.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq38.esams.wikimedia.org:3128
X-Cache: MISS from amssq33.esams.wikimedia.org
X-Cache-Lookup: MISS from amssq33.esams.wikimedia.org:80
Via: 1.1 sq62.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq38.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq33.esams.wikimedia.org:80 (squid/2.7.STABLE9)

== Body ==

{"servedby":"srv195","error":{"code":"internal_api_error_UploadStashFileNotFoundException","info":"Exception Caught: key '10firwjkg56g.caqj15.20471.jpg' not found in stash","*":"\n\n#0 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/upload\/UploadStash.php(135): UploadStash->getFile('10firwjkg56g.ca...')\n#1 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/upload\/UploadFromStash.php(79): UploadStash->getMetadata('10firwjkg56g.ca...')\n#2 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiUpload.php(308): UploadFromStash->initialize('10firwjkg56g.ca...', 'Vladislav_Koval...')\n#3 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiUpload.php(64): ApiUpload->selectUploadModule()\n#4 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(706): ApiUpload->execute()\n#5 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(360): ApiMain->executeAction()\n#6 \/usr\/local\/apache\/common-local\/php-1.19\/includes\/api\/ApiMain.php(344): ApiMain->executeActionWithErrorHandling()\n#7 \/usr\/local\/apache\/common-local\/php-1.19\/api.php(117): ApiMain->execute()\n#8 \/usr\/local\/apache\/common-local\/live-1.5\/api.php(3): require('\/usr\/local\/apac...')\n#9 {main}\n\n"}}
Comment 7 Erik Moeller 2012-05-08 05:31:42 UTC
Here's an example of this bug from user-testing:

https://accounts.usertesting.com/ViewVideo.aspx?file=t0uwDTLDpas%3d

Firefox 12 user. File first fails with "unknown error", then on retry with the stash error. What's noticeable is that the final step takes a long time before it errors out. These are just image files, albeit large ones.

This is definitely a high priority bug that needs some deeper investigation, probably by Aaron as well since it might be FileBackend/SWIFT related. I'll poke at it as well to see if I can produce it at all in local testing.
Comment 8 Erik Moeller 2012-05-08 05:43:42 UTC
I should note that the above user actually successfully uploaded the balloon picture which reported the error. (I deleted the file, so you can no longer see it.) The timing was the same as the other uploads. So UW apparently correctly uploaded the file but reported an API error.
Comment 9 Erik Moeller 2012-05-09 06:27:00 UTC
Here's what appears to be happening. The MediaWiki API JavaScript convenience functions in core (resources/mediawiki.api/mediawiki.api.js) have a default timeout of 30 seconds. This causes uploads to report an error if the final step runs longer than 30 seconds. The uploads still go through successfully regardless, which is why "Retry" shows a stash error (by that point the file has already been removed from the stash).

We can "fix" this by overriding the default timeout, but we should determine why the action=upload for a file (even one of just a few hundred KB) in the stash takes upwards of 30 seconds. This is a severe performance regression that seems to have started around the 1.19 deployment.
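The race Erik describes can be modeled in a few lines of Python (an illustrative sketch, not MediaWiki code; the function and the stash representation are invented for this example). The server always finishes the publish and consumes the stash entry, so a retry with the same filekey hits the "not found in stash" error:

```python
def simulate_upload(stash, filekey, server_duration_s, client_timeout_s=30):
    """Return what the client observes for one action=upload attempt."""
    if filekey not in stash:
        return "internal_api_error_UploadStashFileNotFoundException"
    # The server always completes the publish and consumes the stash entry,
    # whether or not the client is still listening.
    stash.discard(filekey)
    if server_duration_s > client_timeout_s:
        # Client-side XHR timeout fires first; the success response is lost.
        return "timeout"
    return "success"

stash = {"10firwjkg56g.caqj15.20471.jpg"}
# First attempt: server takes 45 s, client gives up at 30 s.
first = simulate_upload(stash, "10firwjkg56g.caqj15.20471.jpg", server_duration_s=45)
# Retry: the stash entry was already consumed by the first attempt.
retry = simulate_upload(stash, "10firwjkg56g.caqj15.20471.jpg", server_duration_s=1)
```

This matches the observed behavior: "unknown error" (timeout) on the first attempt, then the stash error on retry, even though the file was published.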
Comment 10 Erik Moeller 2012-05-10 02:06:08 UTC
We deployed the timeout increase; please report if you still see this problem in production. We're further investigating the reasons for the slow API response. site_stats contention issues are currently a prime suspect.
Comment 11 Nemo 2012-05-10 10:12:06 UTC
Now I managed to upload <http://commons.wikimedia.org/wiki/File:Wikimedia_Italia_-_WikiGuida_4_-_Wikiquote.ogv>, so I suppose the timeout increase worked for me.
Comment 12 Erik Moeller 2012-05-11 10:00:06 UTC
Sorry, the previous patch was broken and actually had no effect on the timeout. This one should do the trick once deployed: https://gerrit.wikimedia.org/r/#/c/3808/
Comment 13 Erik Moeller 2012-05-11 18:19:08 UTC
Deployed. Confirmed that the timeout issue is fixed and users should no longer see the error message in the last step. 

However, we still need to figure out why the last step is taking so long.
Comment 14 Erik Moeller 2012-05-16 23:11:23 UTC
Updating bug description. This is not inherently an UploadWizard issue, although UploadWizard may exacerbate contention issues because it does batch uploading.
Comment 15 Rob Lanphier 2012-05-17 01:50:27 UTC
Latest status on this:  we're currently testing a change that should fix this problem.  Aaron has enabled this on mediawiki.org, and we'll likely make this change on commons tomorrow.

If our theory is correct, the problem is caused when page editing or image uploading requires an update to the site statistics.  The code that fixes this is deployed, but needs to be configured.

Tech minutia follows:

The fix we're relying on is here:  https://gerrit.wikimedia.org/r/#/c/7724

The configuration change needs to be made, changing $wgSiteStatsAsyncFactor.

You can see which wikis this has been deployed to by looking for wgSiteStatsAsyncFactor here:
http://wikitech.wikimedia.org/view/Server_admin_log
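As a rough illustration of why this helps, here is a Python sketch of batched counter updates (the class name and the exact semantics of the factor are assumptions for illustration; the real $wgSiteStatsAsyncFactor mechanism in MediaWiki differs in detail). Instead of every edit or upload immediately updating the shared site_stats row, increments are buffered and flushed once per `factor` increments, so contention on the hot row drops proportionally:

```python
class BufferedSiteStats:
    def __init__(self, factor):
        self.factor = factor   # flush once per `factor` increments
        self.pending = 0       # increments not yet written
        self.committed = 0     # value in the shared DB row
        self.writes = 0        # number of contended row updates

    def increment(self):
        self.pending += 1
        if self.pending >= self.factor:
            self.flush()

    def flush(self):
        # One contended write covers `factor` logical increments.
        self.committed += self.pending
        self.pending = 0
        self.writes += 1

stats = BufferedSiteStats(factor=10)
for _ in range(100):
    stats.increment()
```

With a factor of 10, 100 increments cost only 10 contended writes to the shared row.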
Comment 16 Rob Lanphier 2012-05-17 01:55:09 UTC
Gah, copied the wrong link.  Two relevant revs:
https://gerrit.wikimedia.org/r/#/c/7136/
https://gerrit.wikimedia.org/r/#/c/7244/
Comment 17 Erik Moeller 2012-05-19 00:30:08 UTC
$wgSiteStatsAsyncFactor is now set to 1 for Wikimedia Commons. 

I'm no longer able to get 15-30 second requests, although I did catch a 12 second call for a 400K filekey upload that was part of a large batch (most of which was processed very quickly). So it's possible this issue is not (fully) resolved yet.

Let's keep the bug open for now -- if you see the issue in the wild (manifests as last step of Upload Wizard uploads taking a very long time), please report it here.
Comment 18 Aaron Schulz 2012-05-19 01:03:29 UTC
Asher and I are still tracking down additional problems. I've made https://gerrit.wikimedia.org/r/8006 to add more profiling. The problem has been tracked down to LocalFile::upload.
Comment 19 Aaron Schulz 2012-05-22 15:56:42 UTC
(In reply to comment #18)
> Asher and I our still tracking down additional problems. I've made
> https://gerrit.wikimedia.org/r/8006 to add more profiling. The problem is
> tracked down to LocalFile::upload.

For upload profiling: the edit portion tends to be around <=1 sec at all times. p90 for purging from Swift goes up to 11 sec for much of the day. As a workaround, perhaps the thumbnail purging could be skipped for brand-new uploads, since there should be nothing to purge.

The problem (same as bug 34717), is that the Swift container DBs have extremely poor performance and while likely need to be mapped to SSD devices instead of HDD.
SELECT queries go fast when the sqlite3 db file is paged in memory, but slow otherwise (much of the time). The slowness just comes from the listings, the actual concurrent DELETEs are very fast (<1 second even for p90).
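Aaron's proposed workaround can be sketched as follows (hypothetical Python, not the actual LocalFile code; all names here are invented): purge thumbnails only when the upload replaces an existing file, since a brand-new file has no thumbnails to purge.

```python
def publish_upload(existing_files, name, purge_thumbnails):
    """Publish one upload; purge thumbnails only on re-upload."""
    is_reupload = name in existing_files
    existing_files.add(name)
    if is_reupload:
        # Slow path: Swift container listing + concurrent DELETEs.
        purge_thumbnails(name)
    return is_reupload

purged = []
files = set()
publish_upload(files, "New.jpg", purged.append)   # brand-new upload: no purge
publish_upload(files, "New.jpg", purged.append)   # re-upload: purge fires
```

This keeps the slow container-listing path off the critical path for the common first-upload case while preserving correctness for re-uploads.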
Comment 20 Aaron Schulz 2012-05-22 15:57:28 UTC
(In reply to comment #19)
> The problem (same as bug 34717), is that the Swift container DBs have extremely
> poor performance and while likely need to be mapped to SSD devices instead of
> HDD.
*will likely*
Comment 21 Max Semenik 2012-05-22 16:18:51 UTC
They're using SQLite with no other options? That's stupid cause even if you put the DB on the fastest drive on the Earth, it will still suck because SQLite had never been intended for any serious level of concurrency, and it's not a bug but a design decision.
Comment 22 Aaron Schulz 2012-05-22 20:03:25 UTC
Per the LocalFile graphs at https://graphite.wikimedia.org, this was fixed for brand new uploads. Re-uploads will still have potential slowness.
Comment 23 Aaron Schulz 2012-05-22 20:12:09 UTC
(In reply to comment #21)
> They're using SQLite with no other options? That's stupid cause even if you put
> the DB on the fastest drive on the Earth, it will still suck because SQLite had
> never been intended for any serious level of concurrency, and it's not a bug
> but a design decision.

The concurrency limitations are probably dealt with by the fact that we shard the containers for the large wikis. This also mitigates the B-tree index update slowness (proportional to the no. of items in the container object list table).

Since the container service and object service are running on the same boxes (and the latter gets used far more, on far more data), no DB files can stay in the page cache. The random seeks can't be handled well by SATA drives.

Asher is also floating the idea of trying some hacks with http://www.kernel.org/doc/man-pages/online/pages/man2/posix_fadvise.2.html for the sake of investigation.
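The container sharding Aaron mentions can be illustrated with a small Python sketch (the shard-key scheme and container naming here are assumptions for illustration, not the precise MediaWiki/Swift configuration): hashing the file name spreads objects across many containers, which keeps each container's listing database small and bounds the B-tree update cost.

```python
import hashlib

def shard_container(wiki, filename, shard_bits=8):
    """Map a file to one of 2**shard_bits containers by hash prefix."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Use the first hex chars of the hash as the shard suffix
    # (4 bits per hex char).
    shard = digest[: shard_bits // 4]
    return f"{wiki}-local-public.{shard}"

c = shard_container("commonswiki", "Example.jpg")
```

The mapping is deterministic, so reads and writes for a given file always hit the same (small) container.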
Comment 24 Erik Moeller 2012-05-22 21:08:36 UTC
Confirmed that new file uploads are now zippy, yay. 

Closing this bug as fixed; additional improvements should be discussed in separate bugs.
