Last modified: 2014-09-16 19:36:48 UTC
When a bot or user visits a wiki's Special:NewFiles page (and some other pages, like this one), missing thumbnails are created on the fly. This can potentially flood the server(s) with thumbnail-creation jobs, which slow down the wiki or potentially take out its ability to serve web pages. GWToolset can create this situation when it uploads several large media files at once; see http://lists.wikimedia.org/pipermail/glamtools/2014-May/000135.html. During the Zürich hackathon I spoke with Aaron Schulz, Faidon Liambotis, and Brion Vibber about approaches to dealing with this issue. In summary, the idea Aaron came up with is to create the initial thumbnails when the original media file is downloaded to the wiki, and to block the appearance of the title on the new files page (and anywhere else) until the thumbnails and the title creation/edit have completed. Aaron thought, and Faidon and I agree, that further throttling of GWToolset will not help resolve the issue. I am currently looking into implementing this approach and will use this bug to track activity on it.
According to Gergo, a workaround (not a fix) is in https://gerrit.wikimedia.org/r/#/c/132111/ and https://gerrit.wikimedia.org/r/#/c/132112/. Related: bug 49118, triggered by https://commons.wikimedia.org/wiki/Commons:Village_pump/Archive/2014/05#Images_so_big_they_break_Commons.3F (and to a very small extent also bug 52045).
My initial thoughts on how to approach this relied on methods within thumb.php, but those are not accessible to jobs run in the job queue. Another approach, discussed with Gilles and Gergo on IRC, involves uploading the media file to an upload stash, creating thumbnails based on that stashed file, and only then creating the title for the media file. This requires re-architecting the way the job queue jobs currently run, which I don't have time to work on at the moment. Will try to get to this when time permits.
The consensus on the ops list was that https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume uploads, and bug 52045 probably would not help much. The current plan is to:

* extract a large thumbnail from the file, and use that thumbnail to create smaller thumbnails (possibly in a chain, i.e. use some of those smaller thumbnails to create even smaller thumbnails)
* make this thumbnail generation happen immediately after upload
* limit the number of expensive thumbnail generations that can happen in parallel
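The first bullet (chained downscaling) can be sketched as a render plan: each thumbnail is derived from the smallest already-rendered image that is still at least as wide, so only the first step ever reads the expensive original. A Python sketch, with hypothetical names and no claim to match the eventual implementation:

```python
# Sketch of the "thumbnail chain" idea: only the largest thumbnail is
# rendered from the original; every smaller size is derived from an
# already-rendered intermediate.

def plan_thumbnail_chain(original_width, target_widths):
    """Return a list of (target_width, source_width) render steps."""
    steps = []
    available = [original_width]  # widths we can scale from
    for w in sorted(target_widths, reverse=True):
        # pick the smallest available source that is >= the target
        source = min(s for s in available if s >= w)
        steps.append((w, source))
        available.append(w)
    return steps
```

For a 10000px-wide original and targets of 120, 800, and 3000 pixels, only the 3000px step touches the original; the 800px thumb is scaled from the 3000px one, and the 120px thumb from the 800px one.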
I recently realized that we still download the source file even if it's above $wgMaxImageArea (e.g. https://commons.wikimedia.org/wiki/File:Map_of_New-York_Bay_and_Harbor_and_the_environs_-_founded_upon_a_trigonometrical_survey_under_the_direction_of_F._R._Hassler,_superintendent_of_the_Survey_of_the_Coast_of_the_United_States;_NYPL1696369.tiff is a 540 MB file, which takes 37 seconds just to get to the error message saying we aren't even going to attempt to thumbnail the file). I've submitted https://gerrit.wikimedia.org/r/135101 to fix this.

I've missed much of the events that unfolded around this situation. Looking back in the mailing list archives, I'm not even clear whether it is Swift being overloaded, or the time taken to actually thumbnail the image, that's the problem (or both. Or something else). One of the earlier emails says:

>We just had a brief imagescaler outage today at approx. 11:20 UTC that
>was investigated and NYPL maps were found to be the cause of the outage.
>Besides the complete outage of imagescaling, Swift's (4Gbps) bandwidth
>was saturated again, which would cause slowdowns and timeouts in file
>serving as well.

So possibly (correct me if I'm off base here) it's just the Swift network connection being overloaded, which in turn makes the image scalers wait longer for the original image asset to be delivered to them, causing them to be overloaded. If so, the fact that we are fetching the original >100 MB source file only to not even try to scale it, and doing so repeatedly until 4 attempts at a specific file width trigger attempt-failures that stop it for an hour on that particular size only, may be a very significant contributor to the situation.

The attempt-failures thing only increments the cache key after the attempt failed.
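The fix described above amounts to a metadata-only precheck, so an oversized original is rejected before it is ever fetched from storage. A hedged Python sketch; the constant, the limit value, and the function names are illustrative stand-ins, not MediaWiki's actual code:

```python
# Sketch: check stored dimensions against a max-area limit *before*
# downloading the multi-hundred-MB original. All names are hypothetical.

MAX_IMAGE_AREA = 50_000_000  # stand-in value for MediaWiki's $wgMaxImageArea

def can_attempt_thumbnail(width, height, max_area=MAX_IMAGE_AREA):
    """Cheap metadata-only check; no source download required."""
    return width * height <= max_area

def render_thumbnail(meta, fetch_original):
    if not can_attempt_thumbnail(meta["width"], meta["height"]):
        # Fail fast: the 37-second download never happens.
        return None
    return scale(fetch_original(meta))  # expensive path

def scale(data):
    return b"thumb:" + data[:8]  # placeholder for real scaling
```

The point is that the dimensions are already in the image table, so the rejection costs a metadata lookup instead of a 540 MB transfer.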
Given it was taking ~38 seconds just to download the file to the image scaler (in the case I tried), a lot of people could try to render that file in that time before the key is incremented (still limited by the pool counter, though). Maybe that key should be incremented at the beginning of the request. Sure, in certain situations a couple of people might get an error during the couple of seconds it takes a good file to render, but that would only last a couple of seconds, and it would much more quickly limit the damage that a stampede of people requesting a hard-to-render file could do.
I was reading over the thread on multimedia - I'm not entirely sure the Special:NewFiles theory makes sense; I think it's more likely someone viewed a category of the tiff uploads from GWToolset, or something like that. So we have this graph of April 21, with a peak from about 2:55 to 3:20 UTC:

http://lists.wikimedia.org/pipermail/multimedia/attachments/20140420/35015082/attachment-0001.png

However, when you look at the uploads from around that time, the peak in large tiff uploads does not correspond with the peak in the graph:

MariaDB [commonswiki_p]> select substring( img_timestamp, 9, 3 ) "time",
    count(*) "# images",
    round( max( img_width*img_height/1000000 ) ) "max Mpx",
    round( avg( img_width*img_height/1000000 ) ) "avg mpx",
    round( avg( img_size/(1024*1024) ) ) "avg MB",
    round( sum( img_size/(1024*1024) ) ) "total mb",
    round( max( img_size/(1024*1024) ) ) "max mb"
  from image
  where img_timestamp > '20140421010000'
    and img_timestamp < '20140421050000'
    and img_minor_mime = 'tiff'
    and img_user_text = 'Fæ'
  group by substring( img_timestamp, 1, 11 );
+------+----------+---------+---------+--------+----------+--------+
| time | # images | max Mpx | avg mpx | avg MB | total mb | max mb |
+------+----------+---------+---------+--------+----------+--------+
| 010  |       40 |      60 |      42 |    121 |     4822 |    172 |
| 011  |       40 |      39 |      39 |    110 |     4409 |    112 |
| 012  |       19 |      60 |      42 |    120 |     2280 |    172 |
| 013  |       37 |      60 |      60 |    171 |     6328 |    173 |
| 014  |       17 |      60 |      60 |    172 |     2916 |    173 |
| 015  |       20 |      60 |      60 |    171 |     3427 |    173 |
| 020  |       35 |      60 |      60 |    171 |     5986 |    173 |
| 021  |       15 |      60 |      60 |    170 |     2555 |    172 |
| 022  |       26 |      60 |      60 |    172 |     4463 |    173 |
| 023  |       18 |      60 |      60 |    171 |     3079 |    173 |
| 030  |        6 |      60 |      59 |    170 |     1018 |    173 |
| 032  |        5 |      60 |      60 |    171 |      857 |    173 |
| 033  |        2 |      60 |      60 |    172 |      343 |    173 |
+------+----------+---------+---------+--------+----------+--------+
13 rows in set (0.01 sec)

That is, between 2:50 and 3:20 there were a total of 6 tiff files uploaded by Fæ with GWToolset (out of 141 total uploads in that time period, 4.2%), compared to, say, 1:00-1:30, which didn't have a spike but had 99 tiff files uploaded by Fæ (compared to 373 total, 27%). If it was caused by viewing Special:NewFiles, I would expect the spike to come when the 99 tiffs were uploaded rather than when the 6 tiffs were uploaded.

Which leads me to suspect the issue was not with people viewing Special:NewFiles a lot, but maybe viewing something else that had a lot of uncached thumbnail hits associated with it. Maybe the category for the batch upload, which would have up to 200 images on it (probably a lot of them over $wgMaxImageArea, triggering what I mentioned in comment 4, and the rest simply never viewed before), was viewed by several people at the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was linked in the VP at the time (although it had been for about a day); maybe somebody just hit reload on that page repetitively for some unknown reason and that overloaded things. Or something.

With all that said, I guess even if it wasn't Special:NewFiles, it probably doesn't change much, as it's still related to on-demand thumbnailing.
(In reply to Bawolff (Brian Wolff) from comment #5)
> I was reading over the thread on multimedia - I'm not entirely sure the
> Special:Newfiles theory makes sense, I think its more likely someone maybe
> viewed a category of the tiff uploads from gwtoolset or something like that.
<snip>
>
> Which leads me to suspect the issue was not with people viewing
> Special:NewFiles a lot, but maybe viewing something else that had a lot of
> uncached thumbnail hits associated. Maybe the category for the batch upload,
> which would have up to 200 images on it, probably a lot over the
> $wgMaxImageArea so triggering what I mentioned in comment 4 - and the rest
> might simply have not been viewed before, was viewed by several someones at
> the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was
> linked in the VP at the time (although it had been for about a day), maybe
> somebody just hit reload on that page repetitively for some unknown reason
> and that overloaded things. Or something.
>
> With all that said, I guess even if it wasn't Special:Newfiles, it probably
> doesn't change much as its still related to on-demand thumbnailing.

You could be on to something. For example, all of the thumbnails in [[commons:Category:Sanborn maps of Staten Island]] are broken when you go to view an image in full resolution. It doesn't have to be someone hitting reload repeatedly; the call for the thumb regenerates on its own once it fails. For example:

https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff.jpg

I can open that up in a background browser tab and it just keeps hitting the server over and over with thumbnail requests.
(In reply to Keegan Peterzell from comment #6)
> https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/
> Staten_Island%2C_Plate_No.
> _12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
> 9.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No.
> _12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
> 9.tiff.jpg
>
> I can open that up in a background browser tab and it just keeps hitting the
> server over and over for thumbnail requests.

I should clarify: my browser (Chrome 34.0.1847.137 m) gives different behaviors when I open up images from that gallery. One image failed upon its own refresh call six times before halting and returning the proper error message ("There have been too many recent failed attempts (4 or more) to render this thumbnail. Please try again later."). Another image reloaded only twice before halting, with no error message. Yet another image just keeps reloading without the error message.
(In reply to Keegan Peterzell from comment #7)
> I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
> behaviors when I open up images from that gallery. One image failed upon its
> own refresh call six times before halting and returning the proper error
> message (There have been too many recent failed attempts (4 or more) to
> render this thumbnail. Please try again later.) Another image reloaded only
> twice before halting with no error message. Yet another image just keep
> reloading without the error message.

And by "without the error message", I mean that the server is leaving the field blank:

Error generating thumbnail

Error creating thumbnail:
(In reply to Keegan Peterzell from comment #8)
> (In reply to Keegan Peterzell from comment #7)
> > I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
> > behaviors when I open up images from that gallery. One image failed upon its
> > own refresh call six times before halting and returning the proper error
> > message (There have been too many recent failed attempts (4 or more) to
> > render this thumbnail. Please try again later.) Another image reloaded only
> > twice before halting with no error message. Yet another image just keep
> > reloading without the error message.
>
> And by without the error message, I mean that the server is leaving the
> field blank.
>
> Error generating thumbnail
>
> Error creating thumbnail:

Well, the blank error message is consistent with an out-of-memory error for a tiff file (since the process gets killed and doesn't output anything to stdout; other formats return the exit code, but tiff doesn't). However, your web browser is not supposed to be loading the page over and over again by itself. My copy of Chrome doesn't do that.

-----

Furthermore, looking at the IRC logs - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140421.txt - the servers had issues until 14:50 UTC on April 21, which is long after Fae's uploads stopped and were off Special:NewFiles/Special:ListFiles. Similarly for the outage at 11:20 UTC on May 11 - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140511.txt - [[commons:file:Bronx,_V._12,_Double_Page_Plate_No._273_%28Map_bounded_by_Whiting_Ave.,_Ewen_Ave.,_Warren_Ave.,_Hudson_River%29_NYPL2001533.tiff]] is mentioned, which is one of the images uploaded back on April 21, so definitely not on Special:NewFiles. (Also, that file is over the $wgMaxImageArea, so Gerrit change #135101 would have stopped that particular file from causing a problem. Of course, the IRC log is unclear on whether that was the main file causing problems or just one example of many files being requested at the time.)
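The blank-error behaviour described above is what you would expect from a killed subprocess: it exits abnormally with nothing on stderr, so an error reporter that only relays captured output shows an empty message. An illustrative Python sketch (not MediaWiki's actual handler code; the function name is hypothetical):

```python
# Sketch: a killed scaler process produces no output, so relaying only
# the captured stderr yields a blank "Error creating thumbnail:" message.
import subprocess

def describe_failure(cmd):
    """Run a command; return None on success, else an error description."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return None
    # A killed process (negative returncode on POSIX) usually has empty
    # output, which would render as an error message with no detail.
    detail = proc.stderr.strip()
    return detail or "(no output; exit status %d)" % proc.returncode
```

Reporting the exit status in the no-output case (as the tiff handler apparently does not) would at least distinguish "killed" from other silent failures.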
(In reply to Tisza Gergő from comment #3)
> The consensus on the ops list was that
> https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume
> uploads, and bug 52045 probably would not help much. The current plan is to
>
> * extract a large thumbnail from the file, and use that thumbnail to create
> smaller thumbnails (possibly in a chain, i.e. use some of those smaller
> thumbnails to create even smaller thumbnails)

I sort of did this for tiff as part of the work to make vips work on tiffs - see Gerrit change #135289.
With these Gerrit patches merged and deployed to production, is it time for Fae to re-try one of his large tiff uploads?

* https://gerrit.wikimedia.org/r/#/c/107419/
* https://gerrit.wikimedia.org/r/#/c/127642/
* https://gerrit.wikimedia.org/r/#/c/132111/
* https://gerrit.wikimedia.org/r/#/c/135701/
* https://gerrit.wikimedia.org/r/#/c/135702/
* https://gerrit.wikimedia.org/r/#/c/135976/

Or do these also need to be deployed to production before we try testing large tiffs again?

* https://gerrit.wikimedia.org/r/#/c/135703
* https://gerrit.wikimedia.org/r/#/c/135704
Sorry for the slow response, I got un-CCed from this bug somehow.

(In reply to dan from comment #11)
> with these gerrit patches merged, and deployed onto production, is it time
> for fae to re-try one of his large tiff uploads?

The changes you mention don't really help:

> * https://gerrit.wikimedia.org/r/#/c/107419/
> * https://gerrit.wikimedia.org/r/#/c/127642/

These only help with thumbnails which completely fail to render, and even for those they have limited effect (as Bawolff pointed out above, the rendering would still take up time and memory until the failure threshold is hit). Also, the first was merged long ago, and the second right after the first outage, so they did not stop the second one.

> * https://gerrit.wikimedia.org/r/#/c/132111/
> * https://gerrit.wikimedia.org/r/#/c/135701/
> * https://gerrit.wikimedia.org/r/#/c/135702/
> * https://gerrit.wikimedia.org/r/#/c/135976/

These don't really do anything without the two pending ones you mention. (Sorry to be so sluggish on this - we were distracted by troubles with the MediaViewer rollout on enwiki. Also, Gilles is on vacation next week, so unless someone else is willing to review them, not much will happen. I hope to get them merged the following week.)

Bawolff's $wgMaxImageArea patch might help somewhat: https://gerrit.wikimedia.org/r/#/c/135101/ - not sure if the files involved in the second outage were that large, though.

The multi-step scaling patches might also help, once they get merged:
https://gerrit.wikimedia.org/r/#/c/135289/
https://gerrit.wikimedia.org/r/#/c/135008/
(the second one is only for JPEGs at the moment, though)
(In reply to Bawolff (Brian Wolff) from comment #4)
> The attempt-failures thing only increments the cache key after the attempt
> failed. Given it was taking ~ 38 seconds just to download the file to the
> image scaler (in the case I tried), a lot of people could try and render
> that file in that time before the key is incremented (still limited by the
> pool counter though). Maybe that key should be incremented at the beginning
> of the request.

That would be a semaphore, basically (except that its value would decrease with failures). Isn't that what the FileRender poolcounter does already?
(In reply to Tisza Gergő from comment #13)
> (In reply to Bawolff (Brian Wolff) from comment #4)
> > The attempt-failures thing only increments the cache key after the attempt
> > failed. Given it was taking ~ 38 seconds just to download the file to the
> > image scaler (in the case I tried), a lot of people could try and render
> > that file in that time before the key is incremented (still limited by the
> > pool counter though). Maybe that key should be incremented at the beginning
> > of the request.
>
> That would be a semaphore, basically (except that its value would decrease
> with failures). Isn't that what the FileRender poolcounter does already?

Yes. You're right.
Cc'ing Sam here because I don't know where else to, about:

<samwilson> one thing i've been tinkering with is a system of generating thumbnails offline and plonking them in their correct locations. that'd reduce a pile of the out-of-memory things i see on DH [DreamHost] sites.
(Thanks for the heads-up re this, Nemo.)

My thing isn't really a fix! It's just a simple way for the site administrator to be told that some thumbnail is missing, and where it should go in the filesystem, so that they can generate it locally (e.g. with Gimp or whatnot) and upload it (via some easy interface, although I've not considered that bit; scp is my usual). So, not really a help. But good for memory-poor places like DreamHost!
(In reply to Sam Wilson from comment #16)
> (Thanks for the heads-up re this, Nemo.)
>
> My thing isn't really a fix! It's just a simple way for the site
> administrator to be told that some thumbnail is missing, and where it should
> go in the filesystem, so that they can generate it locally (i.e. Gimp or
> whatnot) and upload it (via some easy interface, although I've not
> considered that bit; scp is my usual).
>
> So, not really a help. But good for memory-poor places like Dreamhost!

[Slightly off topic] How memory-poor is DreamHost?
Their shared hosting: 90M. Actually, I think the imagemagick failures are also the processes running too long and being kissed.
Agh, *killed*. Unless DH is the mafia I guess...