Last modified: 2014-11-17 10:35:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T43577, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 41577 - Use normalized search key in term search index


Summary:	Use normalized search key in term search index

Status:	VERIFIED FIXED

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	WikidataRepo
Version:	unspecified
Hardware:	All All

Importance:	Highest critical with 1 vote
Target Milestone:	---
Assigned To:	Wikidata bugs

URL:
Whiteboard:
Keywords:	i18n, performance

Depends on:
Blocks:

Reported:	2012-10-31 10:39 UTC by Daniel Kinzler
Modified:	2014-11-17 10:35 UTC
CC List:	7 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Daniel Kinzler 2012-10-31 10:39:01 UTC

The term search index currently converts values to utf8 (and then to lower case) on the fly to perform comparisons. That means a full table scan followed by a filesort on a table that is likely to contain tens of millions of rows. That's likely to kill the DB server.

To avoid this, there should be a dedicated search key column holding the normalized key (similar to the way a search key column is used for category sorting and finding external links). The same normalization shall apply to the index term when inserted and the search term when generating the query. In particular, the following normalization shall apply:

* Unicode normalization (NFC)
* trim leading and trailing whitespace (ideally, all Unicode whitespace chars)
* lower case (ideally, using the implementation from the appropriate Language class)
* optionally, apply a configurable regular expression for stripping separators (e.g. per default stripping all internal whitespace and hyphens, so "foobar" will match "foo-bar" and "foo bar")

This will provide case-insensitive matches with some flexibility regarding whitespace, etc. If only exact matches are desired, the "soft" result could be filtered programmatically before returning it to the caller.
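
The normalization steps listed above, plus the optional exact-match filter, could be sketched roughly as follows. This is a minimal Python illustration, not the code from the actual patch: the names normalize_search_key, SEPARATOR_PATTERN and filter_exact are hypothetical, and the language-independent str.lower() stands in for the lower-casing of the appropriate Language class.

```python
import re
import unicodedata

# Hypothetical default separator pattern: whitespace and hyphens,
# matching the "foo-bar" / "foo bar" example in the description.
SEPARATOR_PATTERN = re.compile(r"[\s\-]+")


def normalize_search_key(text, separators=SEPARATOR_PATTERN):
    """Build a normalized key for both the indexed term and the search term."""
    key = unicodedata.normalize("NFC", text)  # Unicode normalization (NFC)
    key = key.strip()  # str.strip() with no arguments removes Unicode whitespace
    key = key.lower()  # assumption: generic lower-casing, not per-language
    if separators is not None:
        key = separators.sub("", key)  # optional separator stripping
    return key


def filter_exact(candidates, search_term):
    """Reduce a "soft" result list to terms that match the search term exactly."""
    return [term for term in candidates if term == search_term]
```

Under these assumptions, "foobar", "Foo-Bar" and "foo bar" all map to the key "foobar", so a plain indexed equality comparison on the key column would find all three. Passing separators=None keeps internal whitespace and hyphens in the key.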

Comment 1 Daniel Kinzler 2012-10-31 10:39:35 UTC

Marking as critical because this poses a serious risk to database cluster performance.

Comment 2 jeblad 2012-10-31 11:15:44 UTC

Names are a problem because they often contain letters from other languages. It is common to strip the accents from accented letters and use the base form. On normalized text it should be simple to write a function that does this for all languages. This must also be done for the search term, and it implies that the search will use several terms.
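
One common way to do the accent stripping suggested here is to decompose the text (NFD), drop the combining marks, and recompose. The Python sketch below only illustrates that idea; strip_accents and search_variants are hypothetical names, and nothing in this report confirms that this approach was adopted.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their base form where Unicode decomposes them."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero canonical combining class.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)


def search_variants(term):
    """Return the lower-cased term and, if it differs, its accent-free form."""
    exact = unicodedata.normalize("NFC", term).lower()
    folded = strip_accents(exact)
    return [exact] if exact == folded else [exact, folded]
```

With this sketch, search_variants("Müller") yields both "müller" and "muller", which is one reading of "the search will use several terms". Letters that have no Unicode decomposition, such as "Ø" or "Ł", are left unchanged and would need a separate mapping.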

Comment 3 Jeroen De Dauw 2012-10-31 14:06:44 UTC

https://gerrit.wikimedia.org/r/#/c/30984/

Comment 4 Andre Klapper 2012-11-04 18:34:55 UTC

Patch in Gerrit has the status "Merged". Is there something left or can this be closed as RESOLVED FIXED?

Comment 5 Anja Jentzsch 2012-11-29 12:43:32 UTC

Verified in Wikidata demo time for sprint 21
