Last modified: 2012-08-26 18:31:08 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T36695, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 34695 - Some corrupt thumbs remain from initial Swift deploy
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Media storage
Version: unspecified
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Ben Hartshorne
Duplicates: 34611
Depends on:
Blocks:
Reported: 2012-02-24 19:45 UTC by Rob Lanphier
Modified: 2012-08-26 18:31 UTC (History)
CC List: 5 users


Description Rob Lanphier 2012-02-24 19:45:17 UTC
Ralf Schmitt 2012-02-24 09:40:59 UTC reported in bug 34611#c3 :

btw, upload.wikimedia.org is currently serving corrupt thumb images
(see below). What makes you think that you solved the problem?

,----
| wget -S http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Commons-emblem-disambig-notice.svg/1200px-Commons-emblem-disambig-notice.svg.png
| --2012-02-24 10:30:55--  http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Commons-emblem-disambig-notice.svg/1200px-Commons-emblem-disambig-notice.svg.png
| Resolving upload.wikimedia.org... 208.80.152.211
| Connecting to upload.wikimedia.org|208.80.152.211|:80... connected.
| HTTP request sent, awaiting response...
|   HTTP/1.0 200 OK
|   Last-Modified: Thu, 02 Feb 2012 17:10:31 GMT
|   Accept-Ranges: bytes
|   Content-Type: image/png
|   Content-Length: 102400
|   Date: Mon, 20 Feb 2012 03:49:30 GMT
|   Age: 366058
|   X-Cache: HIT from sq83.wikimedia.org
|   X-Cache-Lookup: HIT from sq83.wikimedia.org:3128
|   X-Cache: MISS from sq84.wikimedia.org
|   X-Cache-Lookup: MISS from sq84.wikimedia.org:80
|   Connection: keep-alive
| Length: 102400 (100K) [image/png]
| Saving to: `1200px-Commons-emblem-disambig-notice.svg.png'
|
| 100%[======================================>] 102,400      112K/s   in 0.9s
|
| 2012-02-24 10:30:56 (112 KB/s) - `1200px-Commons-emblem-disambig-notice.svg.png' saved [102400/102400]
|
| [py27]  ~/t/ % md5sum 1200px-Commons-emblem-disambig-notice.svg.png
| 4a42cbe023060d011d6dc1f92572eb1c  1200px-Commons-emblem-disambig-notice.svg.png
| [py27]  ~/t/ % display 1200px-Commons-emblem-disambig-notice.svg.png
| display: Expected 8192 bytes; found 3893 bytes `1200px-Commons-emblem-disambig-notice.svg.png' @ warning/png.c/MagickPNGWarningHandler/1754.
| display: Read Exception `1200px-Commons-emblem-disambig-notice.svg.png' @ error/png.c/MagickPNGErrorHandler/1728.
| display: corrupt image `1200px-Commons-emblem-disambig-notice.svg.png' @ error/png.c/ReadPNGImage/3695.
`----
Comment 1 Ralf Schmitt 2012-02-24 20:25:46 UTC
*** Bug 34611 has been marked as a duplicate of this bug. ***
Comment 2 Ralf Schmitt 2012-02-24 20:29:17 UTC
Adding my comment from Bug 34611:

But I guess not all of the corrupt images have been removed. Judging from the
ones I looked at today these were all .svg images and the truncated files have
a filesize that is a multiple of 4096.
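
The observation above suggests a cheap heuristic for spotting candidates. The following is only a sketch, not anything used in this bug: it flags sizes that are a multiple of 4096 and, for PNG thumbnails, checks whether the chunk structure ends before the IEND chunk. The function names are hypothetical, and CRCs are deliberately not verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
BLOCK_SIZE = 4096  # truncated files were observed at multiples of this size


def is_suspect_size(size: int) -> bool:
    """Return True if the size is a non-zero multiple of BLOCK_SIZE."""
    return size > 0 and size % BLOCK_SIZE == 0


def looks_truncated(data: bytes) -> bool:
    """Return True if PNG data ends before a complete IEND chunk."""
    if not data.startswith(PNG_SIGNATURE):
        return True
    pos = len(PNG_SIGNATURE)
    # Each chunk: 4-byte length, 4-byte type, payload, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            return True  # chunk claims more bytes than the file holds
        if chunk_type == b"IEND":
            return False
        pos = end
    return True  # ran out of data without reaching IEND
```

A size that is a multiple of 4096 is only a hint, since intact files can have such a size by chance; the structural check is what separates the two cases.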
Comment 3 Ben Hartshorne 2012-02-24 22:35:28 UTC
The initial run to purge broken thumbnails reduced our incidence from about 1.5% of all thumbnails to 0.003%, but I believe there are still a few left.  I am currently working on a slow process to cull the rest (this will likely run for at least 2 weeks to complete).  Though it will take a long time I think it's ok given the low incidence. 

Re: Ralf's comment "what makes you think you solved the problem?": we were able to recreate the issue by initiating a connection to Swift requesting a thumbnail that doesn't already exist and closing the connection before the entire thumbnail is returned.  Closing the client side early resulted in a truncated file getting written to Swift.  We adjusted the code in Swift to pay attention to the Content-Length and ETag headers (if they exist) and at the same time adjusted the code on ms5 (Swift's current backend) and the image scalers to create Content-Length and ETag headers whenever possible.  After making these changes, closing the client connection prematurely resulted in nothing getting written to Swift instead of a truncated image.  The PUT to Swift would fail because the closed connection meant that the data pushed into the system did not match whichever headers were available.
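
The check described above can be illustrated with a small sketch. This is not the code that was deployed; it assumes the headers arrive as a plain dict with exactly these key spellings and that the ETag is the hex MD5 of the body, which is how Swift computes it for ordinary objects. The function name is hypothetical.

```python
import hashlib


def upload_is_complete(body: bytes, headers: dict) -> bool:
    """Return False if the body contradicts any header that is present."""
    expected_length = headers.get("Content-Length")
    if expected_length is not None and int(expected_length) != len(body):
        return False  # fewer (or more) bytes arrived than were announced

    expected_etag = headers.get("ETag")
    if expected_etag is not None:
        actual = hashlib.md5(body).hexdigest()
        # ETags are sometimes quoted; compare the bare hex digest.
        if actual != expected_etag.strip('"').lower():
            return False

    # With no headers at all there is nothing to check against.
    return True
```

With either header present, a body cut short by a dropped connection fails the comparison, so the write can be rejected instead of stored.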

While we can never be absolutely sure that a different bug with the same symptoms doesn't also exist, all my tests so far have been unable to recreate truncated images in Swift.  Additionally, I installed a process to monitor roughly 30% of all newly created Swift objects and check them against the copy on ms5 to identify any new incidence of the same (or similar) bugs.  This monitoring process hasn't seen any truncated images appear since we deployed the fix to the dropped connection bug.
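
As a rough illustration of that kind of monitoring (the actual process is not shown in this bug), the sketch below samples a fraction of new objects and compares each against a reference copy. The two stores are modelled as plain dicts from object name to bytes, which is purely an assumption for the example, as are the function and parameter names.

```python
import hashlib
import random

SAMPLE_RATE = 0.30  # roughly 30% of newly created objects


def find_mismatches(new_objects, reference, rng, rate=SAMPLE_RATE):
    """Return names of sampled objects that differ from the reference copy.

    new_objects and reference map object name -> bytes (hypothetical
    stand-ins for Swift and the ms5 backend).
    """
    mismatched = []
    for name, data in new_objects.items():
        if rng.random() >= rate:
            continue  # not part of this sample
        original = reference.get(name)
        if original is None:
            mismatched.append(name)
        elif hashlib.md5(data).digest() != hashlib.md5(original).digest():
            mismatched.append(name)
    return mismatched
```

Passing the random generator in explicitly keeps the sampling reproducible when testing.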

The file referenced in this bug (the Commons emblem) was created truncated in Swift prior to the deploy of the fix for the dropped connection bug, so it is a leftover remnant rather than a new example.

I'll close this bug when the final cleanup of the remaining broken thumbnails is complete.
Comment 4 Ben Hartshorne 2012-02-24 23:11:00 UTC
(oh, I forgot; in the meantime, if there are specific images you find that are truncated, please feel free to ?action=purge on them.  That will clear up the problem for a specific image that's affecting you while I continue to do the more complete scan of all thumbnails.)
Comment 5 Ralf Schmitt 2012-02-25 01:11:03 UTC
The best I can (sanely) do here is purge all images that have a filesize which is a multiple of 4096. But, I think you should be able to do that with much less overhead.
Comment 6 Ralf Schmitt 2012-02-27 20:28:05 UTC
*** Bug 34611 has been marked as a duplicate of this bug. ***
Comment 7 Ralf Schmitt 2012-02-28 08:25:13 UTC
Doesn't the "?action=purge" open up a good opportunity for a DoS attack?
We already know that the current system can't handle the load generated by the pdf cluster if all of the thumbnails have to be regenerated.
Comment 8 Mark A. Hershberger 2012-02-28 18:43:26 UTC
taking 1.19 milestone off of this bug since we have it mostly solved and it'll take longer than this Wednesday to fix.
Comment 9 Rob Lanphier 2012-04-26 23:53:28 UTC
Update on this issue.  Ben wrote a 'delete-old-objects' script just before going out-of-office for a while, which will delete all thumbnails generated before February 5.  Leslie has taken over the process of running this, which is a long running process, but is 70% (?) done now.  Basically, there are 5 Swift backend boxes, and the process has run on #1-3 already, it's running on #4, so #5 is the only one left untouched.

After this process is done, there may be a *few* images left (since I think there's a grey zone between February 5 and when we're much more certain that things are fixed), so there may be another much shorter pass that's needed.  Ben should be back in the office to finish this off, do some verification, and then mark this bug fixed, asking for independent confirmation.
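
The 'delete-old-objects' script itself is not shown here. As a minimal sketch of the selection it is described as doing, assuming each object is available as a (name, last-modified) pair with a timezone-aware timestamp, the cutoff filter could look like this; the function name and the year in the cutoff (taken from the dates in this bug) are assumptions.

```python
from datetime import datetime, timezone

# "Before February 5"; the year is inferred from the dates in this bug.
CUTOFF = datetime(2012, 2, 5, tzinfo=timezone.utc)


def select_for_deletion(objects, cutoff=CUTOFF):
    """Return names of objects last modified strictly before the cutoff.

    objects is an iterable of (name, last_modified) pairs.
    """
    return [name for name, modified in objects if modified < cutoff]
```

Anything modified on or after the cutoff is left alone, which is why a grey zone of later truncated thumbnails can remain.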
Comment 10 Ben Hartshorne 2012-05-10 22:33:40 UTC
Update:

The script to delete truncated images has run completely a few times and eliminated most of the truncated images.  There were some left and with more digging I found that they were objects in swift that do not exist in the container listings.  (as though you can read a file in a directory but when you list the contents of the directory you don't see the file.)  

I've started a process that is crawling every object in Swift and testing to verify that it is present in the container listing, deleting those that aren't.  So far it has found tens of objects per Commons shard that aren't listed in the containers - about 0.04%.  Based on the progress of the script so far I expect it should take about 12 days to complete the sweep.  Results so far show that the most recent file that exists but is not listed in a container is from 2012-03-20, so it seems that whatever triggered the bug that allowed them to exist is no longer happening.

Note that most of the objects missing from the container listings are not actually truncated images, but it is best to purge them anyway, since they will still cause trouble if the original image is updated.  In other words, there are two problems: truncated images and objects missing from the container listing.  When both problems affect a single file, the symptom is a truncated file that can't be purged.
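
The consistency check described in this comment amounts to a set difference between what is stored and what is listed. A minimal sketch, assuming both sides are already available as collections of object names (how they are gathered from Swift is not covered, and the function name is hypothetical):

```python
def find_unlisted(stored_objects, container_listing):
    """Return the stored object names that are absent from the listing."""
    listed = set(container_listing)
    return sorted(name for name in stored_objects if name not in listed)
```

Objects returned by such a check are readable but invisible to anything that works from the listing, which is why a normal purge never reaches them.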
Comment 11 Aaron Schulz 2012-08-06 18:13:37 UTC
Is this bug still there?
Comment 12 Aaron Schulz 2012-08-26 18:31:08 UTC
After many runs of the cleaner script, and given that we have long since disabled the PUT code in rewrite.py that caused problems and I haven't seen reports of this occurring, I'm closing this bug.
