Last modified: 2011-11-25 07:43:29 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T34478, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 32478 - AbuseFilter not setting utf-8 flag
AbuseFilter not setting utf-8 flag
Status: NEW
Product: MediaWiki
Classification: Unclassified
Database (Other open bugs)
unspecified
All All
: Normal normal (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-11-18 17:15 UTC by Ben Hartshorne
Modified: 2011-11-25 07:43 UTC (History)
4 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Ben Hartshorne 2011-11-18 17:15:15 UTC
The wiki page http://wikitech.wikimedia.org/view/Text_storage_data shows the change in the number of text storage objects of each type over the last 18 months.  Some of those object types are changing when they shouldn't be.  As written in that page, "The rise in "object/historyblobcurstub" doesn't really make sense. The rise in "gzip,external/simple pointer" is concerning."

Find out why they're changing and fix it.
Comment 1 Brion Vibber 2011-11-18 19:37:49 UTC
object/historyblobcurstub entries should only exist on old stuff that referenced the MediaWiki 1.4-era "cur" table.

You should absolutely *never* see an increase in those objects unless something weird is happening like existing ones are getting duplicated somehow in the course of trying to merge data...

It should be possible to compare their contents to see if the new ones are unique or duplicates?


'gzip,external/simple pointer' sounds like the default that gets created when saving a new revision, so if there's editing activity and in the absence of anything else I'd expect those to go up over time.
Comment 2 Brion Vibber 2011-11-18 19:48:58 UTC
Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that might be a bit worrying actually. :)

Shouldn't occur on new entries unless there's some config special case off the top of my head. Whether those entries are problematic or not depends on what the $wgLegacyEncoding setting is on the sites those blobs belong to.
Comment 3 Platonides 2011-11-18 22:31:05 UTC
(In reply to comment #2)
> Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that
> might be a bit worrying actually. :)
> 
> Shouldn't occur on new entries unless there's some config special case off the
> top of my head. Whether those entries are problematic or not depends on what
> the $wgLegacyEncoding setting is on the sites those blobs belong to.

This is interesting. I went to inquiry about a just-produced one (eswiki)

+----------+---------------------+
| old_id   | old_flags           |
+----------+---------------------+
| 51956028 | utf-8,gzip,external |
| 51956027 | utf-8,gzip,external |
| 51956026 | utf-8,gzip,external |
| 51956025 | utf-8,gzip,external |
| 51956024 | utf-8,gzip,external |
| 51956023 | gzip,external       |
| 51956022 | utf-8,gzip,external |
| 51956021 | utf-8,gzip,external |
| 51956020 | utf-8,gzip,external |
| 51956019 | utf-8,gzip,external |
+----------+---------------------+


It turns out it doesn't (apparently) have revision:

select rev_id, rev_text_id, old_flags  from revision join text on (rev_text_id=old_id) where rev_id <= 51506304 order by rev_id desc limit 10;

+----------+-------------+---------------------+
| rev_id   | rev_text_id | old_flags           |
+----------+-------------+---------------------+
| 51506304 |    51956026 | utf-8,gzip,external |
| 51506303 |    51956025 | utf-8,gzip,external |
| 51506302 |    51956024 | utf-8,gzip,external | <--
| 51506301 |    51956022 | utf-8,gzip,external | <-- 
| 51506300 |    51956021 | utf-8,gzip,external |
| 51506299 |    51956020 | utf-8,gzip,external |
| 51506298 |    51956019 | utf-8,gzip,external |
| 51506297 |    51956018 | utf-8,gzip,external |
| 51506296 |    51956016 | utf-8,gzip,external |
| 51506295 |    51956015 | utf-8,gzip,external |
+----------+-------------+---------------------+

Special:Recentchanges doesn't show anything suspicious around those two entries.

There were three Abusefilter hits at that time, Especial:AbuseLog/1077805-1077807 and it more or less correlates with the number of gzip,external entries.

AbuseFilter is indeed storing items in text table, *and not setting utf-8 flag for them*.
So I think all of them will be AbuseFilter hits, which are utf-8 but not marked as such, not content in legacy encoding.
Comment 4 Platonides 2011-11-18 22:39:44 UTC
Yes, select afl_var_dump from abuse_filter_log order by afl_id desc confirms that suspicion.
Comment 5 Brion Vibber 2011-11-18 23:28:49 UTC
Good catch, Platonides!

Looks like this is where AbuseFilter stores variable state dumps on filter matches so they can be checked out later.


AbuseFilter::storeVarDump() and AbuseFilter::loadVarDump() are manually using the text table and ExternalStore -- this probably should be using a common interface underneath Revision's use of the same.

Since it doesn't use the Revision code paths right now the var dumps won't be run through $wgLegacyEncoding conversion on load, but if they get refactored into a common code path, that conversion might start happening before a refactored AbuseFilter::loadVarDump gets its data back.

This should not be fatal but would at least cause those old entries to display incorrectly on some sites[1] where non-ASCII chars were contained in the text.

[1] from http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
'wgLegacyEncoding' => array(
    'enwiki' => 'windows-1252',
    'dawiki' => 'windows-1252',
    'svwiki' => 'windows-1252',
    'nlwiki' => 'windows-1252',

    'dawiktionary' => 'windows-1252',
    'svwiktionary' => 'windows-1252',

    'default' => false,
),
// all other sites will not attempt conversion


A bigger worry is that batch recompression or other maintenance work might try to renormalize those entries in a way that AbuseFilter's code path doesn't recognize.

Switching to using a common code path would avoid having to worry about new data formats (or saving the wrong data format, as it currently does!) but we'd have to fix the old data entries or devise a workaround.
Comment 6 Platonides 2011-11-20 23:35:12 UTC
Fixing those should be easy enough. What about also adding to them an abusefilter flag?
That way at least maintenance scripts could easily recognise and skip them.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links