Last modified: 2013-08-30 05:48:31 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T31564, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 29564 - Bad UTF-8 in ThreadSignature breaks display and can't be exported
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: LiquidThreads
Version: unspecified
Hardware: All All
Importance: Low major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
URL: https://hu.wikipedia.org/wiki/Speci%C...
Keywords:
Depends on:
Blocks: 29821
 
Reported: 2011-06-24 13:30 UTC by Marcin Cieślak
Modified: 2013-08-30 05:48 UTC (History)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Dump of the text node of page 803932 (821 bytes, text/plain)
2013-01-24 10:42 UTC, Marcin Cieślak
XML dump of <page id="803932"/> (2.51 KB, application/xml)
2013-01-24 11:29 UTC, Marcin Cieślak

Description Marcin Cieślak 2011-06-24 13:30:47 UTC
Export of one of the discussion threads (this is page ID 803932 in huwiki_p):

https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)

contains invalid (probably truncated) UTF-8 in the thread poster's signature.

Hexdump of the export page reveals:

00000be0  74 3b 67 72 65 65 6e 26  71 75 6f 74 3b 20 66 61  |t;green&quot; fa|
00000bf0  63 65 3d 26 71 75 6f 74  3b 4c 75 63 69 64 61 20  |ce=&quot;Lucida |
00000c00  63 61 6c 6c 69 67 72 61  70 68 79 26 71 75 6f 74  |calligraphy&quot|
00000c10  3b 26 67 74 3b ce 93 ce  bf cf 85 ce b2 ce b2 ce  |;&gt;...........|
00000c20  bf cf 82 20 ce 98 ce b9  ce bb ce bf ce 3c 2f 54  |... .........</T|
00000c30  68 72 65 61 64 53 69 67  6e 61 74 75 72 65 3e 0a  |hreadSignature>.|

The 0xCE byte at offset 0x00000c2c is a UTF-8 lead byte and must be followed by a continuation byte to form a valid sequence; here it is followed directly by "<" (0x3C).
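
The dangling byte can be confirmed with a strict decoder. The following is a minimal Python sketch, not part of the original report; `tail` holds the Greek-text bytes copied from the hexdump above, and the helper name `find_bad_utf8` is made up for illustration.

```python
# Greek-text bytes of the signature as they appear in the hexdump,
# ending in the dangling 0xCE lead byte.
tail = bytes.fromhex(
    "ce93 cebf cf85 ceb2 ceb2 cebf cf82 20 ce98 ceb9 cebb cebf ce"
)


def find_bad_utf8(data: bytes):
    """Return (offset, reason) of the first malformed sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


problem = find_bad_utf8(tail)
print(problem)
```

For this slice the reported offset is the final byte, that is, the 0xCE immediately before `</ThreadSignature>`. Dropping that byte makes the slice decode cleanly.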

The XML dump process fails silently. The last page in these dumps:

http://download.wikimedia.org/huwiki/20110531/huwiki-20110531-pages-articles.xml.bz2

http://download.wikimedia.org/huwiki/20110614/huwiki-20110614-pages-articles.xml.bz2

is page ID 803931; no XML follows it, so the whole dump is not well-formed XML.

It gets compressed via bzip2, though. 

This problem was reported on the pywikipedia mailing list by Bináris:

http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/11335
Comment 1 Marcin Cieślak 2011-06-24 13:49:28 UTC
It looks like the database entries got truncated at the 256th byte:

> select thread_signature  from thread where thread_root=803932 \G
*************************** 1. row ***************************
thread_signature: <span title="bétaverzió"> <!--<font style="text-decoration: blink;">--><font color="red">♥</font><font color="white">♥</font><font color="green">♥</font> </font> [[User:Gubbubu|<font color="green" face="Lucida calligraphy">Γουββος ΘιλοÎ

The "thread_signature" field is a TINYBLOB (http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/LiquidThreads/lqt.sql?revision=72707&view=markup), but evidently no attempt is made to truncate UTF-8 content at a character boundary.

This means that the database entries need to be fixed first; adding the "shell" keyword and bumping priority.
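
Truncating at a character boundary can be sketched as follows. This is an illustrative Python sketch, not the LiquidThreads code (which is PHP); the function name `truncate_utf8` is hypothetical, and the 255-byte default reflects the maximum size of a MySQL TINYBLOB.

```python
TINYBLOB_MAX = 255  # maximum byte length of a MySQL TINYBLOB column


def truncate_utf8(text: str, max_bytes: int = TINYBLOB_MAX) -> bytes:
    """Encode text as UTF-8 and cut it to max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # falls inside a multi-byte character; step back to that character's start.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


# 254 ASCII bytes followed by one 2-byte character: 256 bytes in total.
signature = "x" * 254 + "Θ"
stored = truncate_utf8(signature)
print(len(stored))
```

A plain byte-wise cut at 255 would keep the lead byte of the last character and drop its continuation byte, which is the failure seen in the row above.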
Comment 2 Brion Vibber 2011-06-24 18:00:41 UTC
So we can split this into a few separate parts:
* saving data into thread_signature fails to properly truncate long strings
* LQT's extension to XML export fails to run UTF-8 validation & cleanup on output
* old db entries potentially ought to get cleaned up (shell issue, but probably mostly irrelevant if the above is fixed)
Comment 3 Brion Vibber 2011-06-24 18:13:47 UTC
r90723 fixes the XML export on trunk; the one-line fix will be easy to merge to deployment.

Applies UtfNormal::cleanUp() on the XML chunk that LQT adds into the output stream; this is already applied on the rest of the export data via WikiExporter's xmlsafe() escaping wrapper etc.
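
The effect of such a cleanup on a truncated sequence can be approximated in Python. This is only an analogy: `UtfNormal::cleanUp()` is PHP and also performs Unicode normalization, which is not reproduced here, and the helper name `clean_utf8` is hypothetical.

```python
def clean_utf8(data: bytes) -> bytes:
    """Replace each malformed UTF-8 sequence with U+FFFD (bytes EF BF BD)."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# End of the broken signature: a dangling 0xCE right before the closing tag.
broken = b"\xce\xbb\xce\xbf\xce</ThreadSignature>"
cleaned = clean_utf8(broken)
print(cleaned)
```

The replacement bytes `ef bf bd` are the same ones that appear before `</ThreadSignature>` in the later hexdump in comment 10.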
Comment 4 Marcin Cieślak 2011-06-24 18:32:51 UTC
Thanks for looking at this quickly.

I just went through the LQT wikis using the toolserver databases, issuing a query:

select thread_id, thread_signature from thread where length(thread_signature)=255;

sql enwikinews_p < problem.sql
sql enwiktionary_p < problem.sql
sql mediawikiwiki_p < problem.sql
sql ptwikibooks_p < problem.sql
sql strategywiki_p < problem.sql
sql sewikimedia_p < problem.sql
sql svwikisource_p < problem.sql
sql wikimania2010wiki_p < problem.sql
sql wikimania2011wiki_p < problem.sql

officewiki_p couldn't be checked because we don't have this one :)

A few wikis have signatures that long stored, but the above case in huwiki
is the only one that ends with a broken UTF-8 sequence. Many signatures in the other databases are encoded as HTML entities, so they cannot break UTF-8 this way.

So it seems that only one row, with thread_id = 1288, needs to be updated in the huwiki_p database.
Comment 5 Ariel T. Glenn 2012-10-31 09:25:28 UTC
Are the current dumps still missing a bunch of pages (as described in the original report)?

What content should go into the thread_signature field for thread_id 1288 in order to fix this manually for the one row?
Comment 6 Andre Klapper 2013-01-18 15:50:01 UTC
Marcin: Could you answer comment 5, please?
Comment 7 Marcin Cieślak 2013-01-24 09:11:50 UTC
1. I just checked the current dump, and it no longer appears to be truncated after the above-mentioned page; however, I currently can't find page ID 803931 in it. I'll double-check, but a simple pywikipedia loop:


Python 2.7.3 (default, Sep 17 2012, 21:25:11)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xmlreader
>>> z = xmlreader.XmlDump("huwiki-20121021-pages-articles.xml.bz2")
>>> for i in z.parse():
...     if i.id == 803931:
...             print repr(i)
...
Reading XML dump...

does not seem to give any results.

2. To fix this entry in the database I would simply remove the last byte of the "thread_signature" field. Or maybe the whole Greek text can be removed and
this:

[[User:Gubbubu|<font color="green" face="Lucida
calligraphy">Γουββος ΘιλοÎ

changed to

[[User:Gubbubu|Gubbubu]]

or something like that.
Comment 8 Marcin Cieślak 2013-01-24 09:43:11 UTC
Sorry, I used the wrong dump above; I have now tried this, with 0 results:

import xmlreader
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in [803931, 803932]:
        print repr(i)
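
One possible explanation for the empty result, offered as an assumption rather than a confirmed diagnosis: the loop in comment 9 finds the page when the ID is given as the string "803932", which suggests the dump reader yields page IDs as strings. A string never equals an integer in Python, so a membership test against integers would silently match nothing. A self-contained sketch (the variable names are made up for illustration):

```python
# Assumption: the dump reader yields page IDs as strings, not integers.
page_id = "803932"

int_match = page_id in [803931, 803932]        # str vs. int: never equal
str_match = page_id in ["803931", "803932"]    # str vs. str: matches

print(int_match, str_match)
```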
Comment 9 Marcin Cieślak 2013-01-24 10:42:53 UTC
Created attachment 11679 [details]
Dump of the text node of page 803932

Attached please find the result of running:

import xmlreader
out = open("803932.txt", "w")
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in ["803932"]:
        out.write(i.text.encode("utf-8"))
        break
out.close()

Interestingly, this body looks more complete than what is actually displayed under the URL of this bug. Is the output prepared for export of better quality than the rendered wiki page?
Comment 10 Marcin Cieślak 2013-01-24 11:29:33 UTC
Created attachment 11680 [details]
XML dump of <page id="803932"/>

This is the node taken from the uncompressed dump.

It seems that the <ThreadSignature> part looks correct now:

00000380  62 75 7c 26 6c 74 3b 66  6f 6e 74 20 63 6f 6c 6f  |bu|&lt;font colo|
00000390  72 3d 26 71 75 6f 74 3b  67 72 65 65 6e 26 71 75  |r=&quot;green&qu|
000003a0  6f 74 3b 20 66 61 63 65  3d 26 71 75 6f 74 3b 4c  |ot; face=&quot;L|
000003b0  75 63 69 64 61 20 63 61  6c 6c 69 67 72 61 70 68  |ucida calligraph|
000003c0  79 26 71 75 6f 74 3b 26  67 74 3b ce 93 ce bf cf  |y&quot;&gt;.....|
000003d0  85 ce b2 ce b2 ce bf cf  82 20 ce 98 ce b9 ce bb  |......... ......|
000003e0  ce bf ef bf bd 3c 2f 54  68 72 65 61 64 53 69 67  |.....</ThreadSig|
000003f0  6e 61 74 75 72 65 3e 0a  3c 2f 44 69 73 63 75 73  |nature>.</Discus|

We have a few more bytes from the signature available, and XML tools no longer complain about UTF-8.
Comment 11 Marcin Cieślak 2013-01-24 14:44:06 UTC
To sum up:

1) The dump looks okay.

2) I am confused about the actual information in the database: the toolserver replica still shows truncated bytes in the database, and the web page itself, as well as [[Special:Export]], shows truncated wikitext.
Comment 12 Nemo 2013-08-30 05:33:44 UTC
(In reply to comment #11)
> 2) I am confused about the actual information in the database: toolserver
> replica still shows truncated bytes in the database and the webpage itself
> shows truncated wikitext as well as [[Special:Export]].

To clarify: the output is truncated only just before </ThreadSignature> and continues after it. We also don't see the signature displayed after that point, so this is a user-facing problem.

I'm reducing severity and updating the bug summary now that the export works.
