Last modified: 2013-08-04 17:12:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T54056, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 52056 - Uninstall UploadBlacklist extension from Wikimedia wikis
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Site requests
Version: wmf-deployment
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: technical-debt
Depends on: 44975
Blocks:
Reported: 2013-07-25 21:14 UTC by MZMcBride
Modified: 2013-08-04 17:12 UTC (History)
CC List: 15 users

Description MZMcBride 2013-07-25 21:14:16 UTC
The AbuseFilter extension can block uploads based on SHA-1 hash. I don't believe the UploadBlacklist extension is still necessary and I propose removing the following lines:

from <https://noc.wikimedia.org/conf/CommonSettings.php.txt>:

---
include( $IP . '/extensions/UploadBlacklist/UploadBlacklist.php' );

# Upload spam system
// SHA-1 hashes of blocked files:
# FIXME should check file size too
$ubUploadBlacklist = array(
	// Goatse:
	'aebbf277146e497c036937d3c3d6d0cac49a37a8', // 20050901082002!Patoo.jpg
	// Spam:
	'7740dab676725bcf6ea58b03b231aa4ec6c7ff34', // Austriaflaggemodern.jpg
	'1f1c44af6ee4f6e4b6cb48b892e625fa52238bd1', // Nostalgieplattenspielerei.jpg
	'e6eb4549756b88e2c69171ffbd278be51c3e2bfe', // Patioboy.jpg
	'eeb9b16edb9b5e9c58f47a558589e7eb970f32c0', // Shoessss.jpg, 73464736474847367.jpg
	'14e4858e63b008a7e087f2b90d3f57c021ab0f78', // Vacuumbigmell.jpg
	'f989e303ef505c4706db42d5cdad67841042e2b9', // 998_pre_1.jpg
	// Ass pus:
	'27979159b13b819d1bf62e1071a0c2a54b373ed5', // Squish.png
	'7176aeddf3d7d8aada785721773ffeb7ee7b292e', // 20050905221505!Linguistics_stub.png *
	'27979159b13b819d1bf62e1071a0c2a54b373ed5', // 20050905235133!Leaf.png
	'bb3acc61413ef813453a4b0c0198e30b2cd8fcf9', // Kitty100.jpg
	'855e55c4925644aeaef262ef25dd00815761c076', // Wikipedia-logo-100px
	'203bc24e5291e543779201734c49cfd88fcb2445', // Wikipodia-logo.png
	'14d2a0c0f3081815d04493f72ab5970c51422bc7', // Bung.jpg
	'3c610bc87d0ba49467c6f2d3cfba4b3321f6b351', // Blue_morpho_butterfly_300x271.png
	'7176aeddf3d7d8aada785721773ffeb7ee7b292e', // 20050905235450!Blue_morpho_butterfly_300x271.png
	'7a7f9d7ef52ed8967cb6b0171ef8d45e2a0c68b9', // Leaf.png
	'1ecfaf883c4130e1827290ad063158d0037631e6', // Wikimedia-button1.png
	'1c73d6596685175a8af6b08508468252c4dff8e2', // Windbuchencom.jpg
	'203bc24e5291e543779201734c49cfd88fcb2445', // Leaf.png
	'95d825bcf01ca3e553f4175dd7238ff12ba1d153', // 20050915055251!New_Orleans_Survivor_Flyover.jpg
	'bbd292d917d7fa7dec9a524de77ca39bd8cdf738', // 20050915060435!New_Orleans_Survivor_Flyover.jpg

	// Some singnet guy
	'bed74eef04f5b54884dc650679e5688c7c1f74cb', // Peniscut.jpg
);
---

from <https://noc.wikimedia.org/conf/InitialiseSettings.php.txt>:

'UploadBlacklist' => "udp://$wmfUdp2logDest/upload-blacklist",

This will help reduce our technical debt.
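
For illustration only, the kind of check this configuration drives can be sketched as follows: hash the uploaded bytes with SHA-1 and refuse the upload if the digest is on a list. This is a minimal Python sketch, not the UploadBlacklist extension's code (which is PHP). `BLOCKED`, `DEMO_PAYLOAD` and `is_blocked` are hypothetical names, the blocked entry is derived from a demo payload rather than taken from the list above, and the size comparison is one assumed way of addressing the FIXME about checking file size.

```python
import hashlib

# Hypothetical demo payload standing in for a blocked file's bytes.
DEMO_PAYLOAD = b"example blocked upload"

# Hypothetical blacklist: SHA-1 hex digest -> expected file size in bytes.
BLOCKED = {hashlib.sha1(DEMO_PAYLOAD).hexdigest(): len(DEMO_PAYLOAD)}


def is_blocked(data, blocked=BLOCKED):
    """Return True if the upload's SHA-1 digest and size both match a blocked entry."""
    digest = hashlib.sha1(data).hexdigest()
    # dict.get returns None for unknown digests, which never equals a length.
    return blocked.get(digest) == len(data)


print(is_blocked(DEMO_PAYLOAD))         # exact copy of the blocked file
print(is_blocked(DEMO_PAYLOAD + b"x"))  # one extra byte gives a different digest
```

An exact-match list like this only catches byte-identical re-uploads, which is the limitation discussed later in this report.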
Comment 1 MZMcBride 2013-07-25 21:48:59 UTC
Bug 44975 is a soft dependency, not a hard dependency.
Comment 2 Andre Klapper 2013-07-26 12:46:49 UTC
This likely needs broader discussion.
Comment 3 MZMcBride 2013-07-26 16:33:45 UTC
(In reply to comment #2)
> This likely needs broader discussion.

http://lists.wikimedia.org/pipermail/wikitech-l/2013-July/070796.html
Comment 4 Alex Monk 2013-07-26 18:23:06 UTC
Might be worth waiting for global AbuseFilters.
Comment 5 MZMcBride 2013-07-27 01:45:49 UTC
https://gerrit.wikimedia.org/r/76229
Comment 6 MZMcBride 2013-07-27 01:46:16 UTC
https://gerrit.wikimedia.org/r/76230
Comment 7 Antoine "hashar" Musso (WMF) 2013-07-30 01:26:10 UTC
Sounds good to me. The hashes listed in the settings files are more than 8 years old and there are only a few entries, so I guess that proves UploadBlacklist is not useful anymore :-)

I will be happy to see it gone.
Comment 8 Chad H. 2013-07-30 01:32:53 UTC
(In reply to comment #1)
> Bug 44975 is a soft dependency, not a hard dependency.

I'd disagree right now. Considering these are currently blocked on *all* wikis, having to go around and add AF rules for each wiki to block these same images would be a huge waste of time.

Other than that, totally in favor of killing this.
Comment 9 MZMcBride 2013-07-30 03:36:26 UTC
(In reply to comment #8)
> I'd disagree right now. Considering these are currently blocked on *all*
> wikis, having to go around and add AF rules for each wiki to block these same
> images would be a huge waste of time.

Respectfully, I think you're making a fatal assumption here: the current blacklist isn't ever being hit. According to Reedy's examination of the UploadBlacklist logs, there have been 0 hits this year, as I understand it.

While the current blacklist is indeed global, there's been no evidence presented that there will be any need to go around adding AbuseFilter rules related to globally blacklisted images to any wiki. (This of course side-steps the point that many wikis disable local uploads altogether and rely on a single wiki [Wikimedia Commons].)
Comment 10 Chad H. 2013-07-30 03:45:39 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > I'd disagree right now. Considering these are currently blocked on *all*
> > wikis, having to go around and add AF rules for each wiki to block these same
> > images would be a huge waste of time.
> 
> Respectfully, I think you're making a fatal assumption here: the current
> blacklist isn't ever being hit. According to Reedy's examination of the
> UploadBlacklist logs, there have been 0 hits this year, as I understand it.
> 

Not hitting doesn't mean people wouldn't try if the blacklist was gone. Maybe they gave up long ago ;-)

> While the current blacklist is indeed global, there's been no evidence
> presented that there will be any need to go around adding AbuseFilter rules
> related to globally blacklisted images to any wiki. (This of course side-steps
> the point that many wikis disable local uploads altogether and rely on a
> single wiki [Wikimedia Commons].)

This is true. Blacklisting on Commons would cover a great many cases.
Comment 11 Antoine "hashar" Musso (WMF) 2013-07-30 11:20:23 UTC
We do indeed have a central AbuseFilter database. So I guess it would be all about adding the existing hashes to a new global rule :-]

I have no idea who can create the new rule though.
Comment 12 Chris Steipp 2013-07-30 12:56:25 UTC
Once the blocker for this bug (bug 44975) is finished then we'll add the hashes as global rules.
Comment 13 MZMcBride 2013-07-31 19:14:51 UTC
(In reply to comment #12)
> Once the blocker for this bug (bug 44975) is finished then we'll add the
> hashes as global rules.

No, we really won't.

Creating and deploying an AbuseFilter filter (particularly a global filter) requires a demonstration of active abuse. There's no such demonstration here (cf. comment 9).
Comment 14 MZMcBride 2013-07-31 19:34:52 UTC
(In reply to comment #5)
> https://gerrit.wikimedia.org/r/76229

Merged and deployed.

(In reply to comment #6)
> https://gerrit.wikimedia.org/r/76230

Merged and deployed.

This bug is resolved/fixed (cf. [[Special:Version]]). Thank you, Reedy!
Comment 15 Philippe Verdy 2013-08-04 16:39:17 UTC
It's so easy to derive a variant of a spammed image by changing a few random bits in it (including within invisible embedded metadata such as camera info or the creator software version string, or by adding some randomly selected image backgrounds around the bad image) that I think it is superfluous to check the SHA1 digital signature to detect spammed images.
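
The point about changed bits can be checked with a small Python sketch. It assumes nothing about real image formats: `original` is just a stand-in byte string, and `flip_bit` is a hypothetical helper that inverts one bit.

```python
import hashlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


original = bytes(range(256)) * 4     # stand-in for an image file's bytes
modified = flip_bit(original, 1000)  # same length, a single bit changed

h1 = hashlib.sha1(original).hexdigest()
h2 = hashlib.sha1(modified).hexdigest()

print(h1)
print(h2)
print("digests match:", h1 == h2)
```

The two inputs differ by one bit, yet an exact-match lookup on the digest treats them as unrelated files.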

SHA1 is the wrong method to identify spammed images. A method based on image subsampling would be much better: apply a distance threshold on color plane values, ignore all metadata fields, but take the ICC profiles into account to produce the accurate final color before subsampling.

Images could be identified by creating identifiable bounding boxes between the most contrasting pixels, in order to eliminate the effect of image realignment with custom internal margins of variable sizes. Once this is done, the subsampling can be correctly aligned to a box of 512x512 pixels (if the image is not square, its minimum width/height will be set between 256 and 512, and the maximum will be set to a multiple of 512, creating a horizontal or vertical band of 512x512 squares). SHA1 can then be computed on subblocks of 8x8 pixels to count the number of common subblocks, giving a score for possible copies.

Above some threshold, this score would raise an alert for human inspection in a specific category or report showing the two images (the one identified as spammed or infringing a copyright, and the new image).
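
The last step of that proposal (per-block SHA1 and a count of common subblocks) can be sketched in Python under strong assumptions: both images are already decoded, aligned and scaled to the same dimensions, and are given as 2D lists of grayscale values from 0 to 255. The bounding-box alignment and ICC handling are not attempted, and `block_digests` and `common_fraction` are hypothetical names.

```python
import hashlib

BLOCK = 8  # sub-block edge in pixels, as suggested in the comment above


def block_digests(pixels):
    """Map (block_row, block_col) -> SHA-1 digest for each full 8x8 block."""
    height, width = len(pixels), len(pixels[0])
    digests = {}
    for top in range(0, height - BLOCK + 1, BLOCK):
        for left in range(0, width - BLOCK + 1, BLOCK):
            block = bytes(
                pixels[y][x]
                for y in range(top, top + BLOCK)
                for x in range(left, left + BLOCK)
            )
            digests[(top // BLOCK, left // BLOCK)] = hashlib.sha1(block).hexdigest()
    return digests


def common_fraction(a, b):
    """Fraction of a's blocks whose digest equals b's digest at the same position."""
    da, db = block_digests(a), block_digests(b)
    if not da:
        return 0.0
    same = sum(1 for key, digest in da.items() if db.get(key) == digest)
    return same / len(da)


# Synthetic 16x16 image (four 8x8 blocks) and a copy with one pixel changed.
known_bad = [[(7 * x + 13 * y) % 256 for x in range(16)] for y in range(16)]
candidate = [row[:] for row in known_bad]
candidate[0][0] = 255  # alters only the top-left block

print(common_fraction(known_bad, candidate))  # 3 of 4 blocks match: 0.75
```

Because each block is still hashed exactly, any change inside a block (for example from recompression) changes that block's digest, so the score would likely drop sharply on re-encoded copies.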

Newer algorithms probably exist to help match comparable images. For example, Google is able to recognize people's faces or monuments automatically from any photo, using heuristic methods that can correct for differences in lighting, changes of resolution, image cropping, border decorations, slight rotations...

Many spammed images also display text (e.g. domain names or tiny URLs), and OCR could recognize that text as an additional method to identify spam (we could also forbid the display of external URLs, notably those hosted on tiny URL providers).

Is there work somewhere on automating recognition of image subjects, and a way to develop an extension that compares new incoming images with some well-known bad images, on a special page where the problematic images are not publicly downloadable/reusable, so that Commons does not become the distribution vector, notably for phishing emails? Do we monitor security alerts about phishing emails containing images that could be hosted on Commons or on another wiki?

Can we also develop identification mechanisms for other media types (notably PDF, ePUB, audio and video) without using the basic SHA1 signature?
Comment 16 Bawolff (Brian Wolff) 2013-08-04 17:12:18 UTC
Patches welcome. Or even a link to a description of an algorithm (preferably along with analysis of how effective the algorithm is) that can generate some sort of hash from an image that stays the same for things like resizing (or recompressing) the image, and is very efficient to compute.
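
As a rough illustration of the kind of hash asked for here, one commonly described approach is a perceptual "average hash". The Python sketch below is not something proposed in this report or adopted by MediaWiki, and it comes with no analysis of effectiveness. It assumes the image is already decoded into a 2D list of grayscale values that is at least 8 pixels in each dimension, and `average_hash` and `hamming` are hypothetical names.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a 2D grayscale grid."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            region = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(region) / len(region))  # mean brightness of the cell
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the overall mean.
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic 32x32 gradient, a 2x nearest-neighbour upscale, and an inverted copy.
small = [[(x + y) * 4 for x in range(32)] for y in range(32)]
big = [[small[y // 2][x // 2] for x in range(64)] for y in range(64)]
inverted = [[255 - p for p in row] for row in small]

print(hamming(average_hash(small), average_hash(big)))
print(hamming(average_hash(small), average_hash(inverted)))
```

Matching would then use a small Hamming-distance threshold instead of exact equality. What threshold works, and how well this survives recompression of real photos, would need the analysis requested above.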

It should be noted that upload blacklist was never an antispam measure. It was meant to prevent malicious (but stupid) users from uploading very disturbingly graphic images.
