Last modified: 2014-11-17 10:35:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T43577, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 41577 - Use normalized search key in term search index
Status: VERIFIED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: WikidataRepo
Version: unspecified
Hardware: All All
Importance: Highest critical with 1 vote
Target Milestone: ---
Assigned To: Wikidata bugs
Keywords: i18n, performance
Depends on:
Blocks:
Reported: 2012-10-31 10:39 UTC by Daniel Kinzler
Modified: 2014-11-17 10:35 UTC (History)
CC List: 7 users

Description Daniel Kinzler 2012-10-31 10:39:01 UTC
The term search index currently converts values to UTF-8 on the fly (and then to lower case) to perform comparisons. That means a full table scan followed by a filesort on a table that is likely to contain tens of millions of rows. That's likely to kill the DB server.

To avoid this, there should be a dedicated search key column holding the normalized key (similar to the way a search key column is used for category sorting and finding external links). The same normalization shall apply to the index term when inserted and the search term when generating the query. In particular, the following normalization shall apply:

* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace chars)
* lower case (ideally, using the implementation from the appropriate Language class)
* optionally, apply a configurable regular expression for stripping separators (e.g. by default stripping all internal whitespace and hyphens, so "foobar" will match "foo-bar" and "foo bar")

This will provide case-insensitive matches with some flexibility regarding whitespace, etc. If only exact matches are desired, the "soft" result could be filtered programmatically before returning it to the caller.
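
The normalization steps listed above could be sketched roughly as follows. This is only an illustration in Python, not the code that was merged: the names `normalize_search_key`, `filter_exact` and `SEPARATOR_PATTERN` are hypothetical, Python's `str.lower()` stands in for the language-aware lower-casing of the Language class, and "exact" is assumed here to mean equality after NFC normalization and trimming only.

```python
import re
import unicodedata

# Hypothetical default for the configurable separator pattern:
# strip all internal whitespace and hyphens.
SEPARATOR_PATTERN = re.compile(r"[\s\-]+")


def normalize_search_key(text, separator_pattern=SEPARATOR_PATTERN):
    """Build the key stored in the search key column and used in queries."""
    key = unicodedata.normalize("NFC", text)  # Unicode normalization (NFC)
    key = key.strip()                         # strips Unicode whitespace too
    key = key.lower()                         # stand-in for Language-aware lower-casing
    if separator_pattern is not None:         # optional separator stripping
        key = separator_pattern.sub("", key)
    return key


def filter_exact(candidates, search_term):
    """Filter "soft" matches down to exact ones (assumed: NFC + trim only)."""
    wanted = unicodedata.normalize("NFC", search_term).strip()
    return [
        c for c in candidates
        if unicodedata.normalize("NFC", c).strip() == wanted
    ]
```

The same function would be applied both when a term is written to the index and when a search term is turned into a query, so that `normalize_search_key("Foo-Bar")` and `normalize_search_key(" foo bar ")` produce the same key and the comparison can use a plain index lookup.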
Comment 1 Daniel Kinzler 2012-10-31 10:39:35 UTC
Marking as critical because this issue poses a serious risk to database cluster performance.
Comment 2 jeblad 2012-10-31 11:15:44 UTC
Names are a problem because they often contain letters from other languages. It is common to strip the diacritics from accented letters and use the base form. In normalized form it should be simple to write a function that does this for all languages. This must also be done for the search term, and it implies that the search will use several terms.
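
One common way to reduce accented letters to their base form is to decompose the string and drop the combining marks. The sketch below is an assumption about how that could look in Python; the comment does not specify an approach, and `strip_diacritics` is a hypothetical name.

```python
import unicodedata


def strip_diacritics(text):
    """Return text with combining marks removed (e.g. accents, umlauts)."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character with a non-zero canonical combining class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in NFC again.
    return unicodedata.normalize("NFC", stripped)
```

Letters that have no canonical decomposition, such as "ø" or "Ł", pass through unchanged with this approach, so it is not a complete solution for all languages.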
Comment 3 Jeroen De Dauw 2012-10-31 14:06:44 UTC
https://gerrit.wikimedia.org/r/#/c/30984/
Comment 4 Andre Klapper 2012-11-04 18:34:55 UTC
Patch in Gerrit has the status "Merged". Is there something left or can this be closed as RESOLVED FIXED?
Comment 5 Anja Jentzsch 2012-11-29 12:43:32 UTC
Verified during the Wikidata demo for sprint 21.
