Last modified: 2013-07-17 14:26:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T48867, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 46867 - Incomplete population of wb_terms table
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: master
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Wikidata bugs
Reported: 2013-04-03 22:37 UTC by Sam Reed (reedy)
Modified: 2013-07-17 14:26 UTC (History)

Description Sam Reed (reedy) 2013-04-03 22:37:50 UTC
So, due to many hours of replag (replication lag), which is only going to get worse for the next 7-8 hours (meaning at least 12 hours of replag), I've cancelled the current run of rebuildTermsSearchKey.php.

Whilst trying to work out where to start again from:

mysql:wikiadmin@db35 [wikidatawiki]> select min(term_row_id) from wb_terms where term_search_key = '';
+------------------+
| min(term_row_id) |
+------------------+
|           247135 |
+------------------+
1 row in set (1 min 9.97 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select * from wb_terms where term_row_id > 247130 limit 10;
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
| term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text          | term_search_key    |
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
|      247131 |          41253 | item             | bn            | alias     | Movie theaters     | movie theaters     |
|      247132 |          41253 | item             | bn            | alias     | Movie house        | movie house        |
|      247133 |          41253 | item             | bn            | alias     | Exhibition         | exhibition         |
|      247134 |          41253 | item             | bn            | alias     | Film theatre       | film theatre       |
|      247135 |          41253 | item             | bn            | alias     | �                   |                    |
|      247136 |          41253 | item             | bn            | alias     | সিনেমা             | সিনেমা             |
|      247137 |          41253 | item             | bn            | alias     | Film exhibitor     | film exhibitor     |
|      247138 |          41253 | item             | bn            | alias     | Matinee            | matinee            |
|      247139 |          41253 | item             | bn            | alias     | Picture house      | picture house      |
|      247140 |          41253 | item             | bn            | alias     | Moviegoer          | moviegoer          |
+-------------+----------------+------------------+---------------+-----------+--------------------+--------------------+
10 rows in set (0.05 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select min(term_row_id) from wb_terms where term_row_id > 247140 AND term_search_key = '';
+------------------+
| min(term_row_id) |
+------------------+
|           254476 |
+------------------+
1 row in set (15.35 sec)

mysql:wikiadmin@db35 [wikidatawiki]> select * from wb_terms where term_row_id = 254476;
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
| term_row_id | term_entity_id | term_entity_type | term_language | term_type | term_text | term_search_key |
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
|      254476 |          41607 | item             | bn            | alias     | �          |                 |
+-------------+----------------+------------------+---------------+-----------+-----------+-----------------+
1 row in set (0.00 sec)


These term_text values show as a square box in my shell, and their resulting term_search_key is ''.

This makes manually finding a starting point difficult, as shown above. --only-missing would help, but it's still going to go through the process of finding all these rows that are apparently still '', attempting to repopulate them, and then finding the next one. This might take a while.
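
The concern above can be sketched as a resumable batch loop. This is only an illustration under stated assumptions, not how rebuildTermsSearchKey.php works: the row shape, `rebuild_missing`, and `make_key` are hypothetical names, and an in-memory list stands in for the wb_terms table. The idea is to page strictly forward by row id, so a row whose key stays '' is recorded once and never selected again.

```python
def rebuild_missing(rows, make_key, batch_size=100, start_after=0):
    """Fill empty keys in id order.

    rows: list of dicts shaped like {"id": int, "text": str, "key": str}
          (a hypothetical stand-in for wb_terms rows).
    make_key: hypothetical function that turns the text into a search key.
    Returns (updated_count, skipped_ids, last_id_seen).
    """
    updated = 0
    skipped = []
    last_id = start_after
    ordered = sorted(rows, key=lambda r: r["id"])
    while True:
        # Continue strictly after the last row seen, so rows that cannot
        # be fixed do not show up again in the next batch.
        batch = [r for r in ordered
                 if r["id"] > last_id and r["key"] == ""][:batch_size]
        if not batch:
            break
        for row in batch:
            key = make_key(row["text"])
            if key == "":
                skipped.append(row["id"])  # report instead of retrying
            else:
                row["key"] = key
                updated += 1
            last_id = row["id"]
    return updated, skipped, last_id
```

Passing the returned `last_id` back in as `start_after` would let an interrupted run resume without rescanning earlier rows.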

So my first point is, why is the term_search_key coming out as ''? Is this correct? If necessary, we can try to get the results dumped somewhere so we can work out what said character is. Or, with the IDs above, you might be able to find out through the end user interface.
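
One way to work out what such a character is, once the term_text value has been dumped as a string, is to list its code points. The following is a minimal sketch; `describe` is a hypothetical helper, and the left-to-right mark in the usage line is only an example input, not the character actually stored in these rows.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            # Control characters have no name, so supply a placeholder.
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Example input only: an invisible left-to-right mark.
print(describe("\u200e"))
```

A general category of Cc (control) or Cf (format) would point to an invisible character that a terminal may render as a box.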


I can/will start the script again when the replag is fixed. In the meantime, finding out whether the above is right, wrong, or something we don't care about would be useful.
Comment 1 Daniel Kinzler 2013-04-11 12:58:37 UTC
My guess is that the input is a unicode control character that gets stripped in the normalization process. We shouldn't really accept that kind of thing as input, but apparently we do.

If this is true, the '' key is technically correct. But we could hack it to just use the original string in that case. Not sure what's the Right Thing here.
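
The guess above, and the proposed fallback, can be illustrated with a small sketch. It assumes a normalization that drops control (Cc) and format (Cf) characters, trims, and lowercases; the real Wikibase normalization may differ, and `normalize_key` and `key_with_fallback` are hypothetical names.

```python
import unicodedata


def normalize_key(text):
    """Hypothetical normalization: drop Cc/Cf characters, trim, lowercase."""
    kept = "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return kept.strip().lower()


def key_with_fallback(text):
    """Use the original string when normalization leaves nothing."""
    key = normalize_key(text)
    return key if key else text


print(repr(normalize_key("Movie house")))   # ordinary alias
print(repr(normalize_key("\u200e")))        # only a format character
print(repr(key_with_fallback("\u200e")))    # falls back to the input
```

With the fallback, the key is never '', so a query for term_search_key = '' would no longer return such rows. Whether a key made of invisible characters is useful for searching is a separate question.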
Comment 2 Sam Reed (reedy) 2013-04-18 20:58:34 UTC
http://p.defau.lt/?R6bnyOoKgSyCXvU7kZ_m6A
Comment 3 Sam Reed (reedy) 2013-04-18 20:59:06 UTC
There are currently 365 rows that won't populate: http://p.defau.lt/?Qzt3dpKhjAIOGG_YrrIvNw
Comment 4 Sam Reed (reedy) 2013-06-22 19:12:49 UTC
reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.
reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.
reedy@tin:/a/common$ mwscript extensions/Wikibase/repo/maintenance/rebuildTermsSearchKey.php wikidatawiki --force --only-missing
Updated 100 search keys, up to row 85621099.
Updated 100 search keys, up to row 115374456.
Updated 100 search keys, up to row 142209402.
Updated 51 search keys, up to row 151258620.
Done. Updated 351 search keys.
Comment 5 Daniel Kinzler 2013-06-24 10:46:15 UTC
Part of the issue is that preg_replace apparently returns an empty string if it encounters a bad unicode sequence anywhere in the input.
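
For context, the PHP manual describes preg_replace as returning null on error, and a null behaves like an empty string once it is used as one. Whether that is the exact path taken here is an assumption. The general remedy of dropping invalid byte sequences before normalizing can be sketched in Python as follows; `trim_bad_utf8` and `safe_key` are hypothetical names, and this is not the code from the Gerrit changes.

```python
def trim_bad_utf8(raw):
    """Decode bytes as UTF-8, silently dropping invalid sequences."""
    return raw.decode("utf-8", errors="ignore")


def safe_key(raw):
    """Hypothetical key builder: clean the bytes first, then normalize."""
    return trim_bad_utf8(raw).strip().lower()


# A stray 0xFF byte is never valid in UTF-8; a strict decode would raise
# UnicodeDecodeError, while the cleaned value keeps the readable text.
print(repr(safe_key(b"Film\xff theatre")))
```

Input that consists only of invalid bytes still produces an empty key, so reporting and skipping such rows remains useful alongside the cleanup.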
Comment 6 Gerrit Notification Bot 2013-06-24 10:48:32 UTC
Related URL: https://gerrit.wikimedia.org/r/70139 (Gerrit Change I702e01b3f021bb2e86fb309e0d51db2a10475ac2)
Comment 7 Gerrit Notification Bot 2013-06-24 10:50:02 UTC
Related URL: https://gerrit.wikimedia.org/r/70140 (Gerrit Change Iedd9cc3b56c0db2e5ed6c02a398d7c35b1c96a1b)
Comment 8 Gerrit Notification Bot 2013-07-10 09:27:51 UTC
Change 70139 merged by Jeroen De Dauw:
(bug 46867) trim bad utf-8 sequences before normalizing.

https://gerrit.wikimedia.org/r/70139
Comment 9 Gerrit Notification Bot 2013-07-10 10:18:10 UTC
Change 70140 merged by Denny Vrandecic:
(bug 46867) skip bad search keys and report them.

https://gerrit.wikimedia.org/r/70140
Comment 10 Daniel Kinzler 2013-07-10 11:02:37 UTC
Sam: please confirm that the issue is now solved, so we can set this to "verified".
Comment 11 abraham.taherivand 2013-07-17 14:26:18 UTC
Verified during the Wikidata demo time on July 17th.
