Last modified: 2013-12-13 20:38:52 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T42133, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 40133 - Guillemets ("french quotes") are not tokenized as word boundaries
Status: UNCONFIRMED
Product: MediaWiki
Classification: Unclassified
Component: Search
Version: 1.19.2
Hardware: All All
Importance: Normal minor
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: i18n
Reported: 2012-09-10 14:54 UTC by Kaspar Manz
Modified: 2013-12-13 20:38 UTC
CC List: 5 users

Comment 1 Dereckson 2012-09-10 16:25:45 UTC
Hello,

Thank you for your bug report.

Google seems to ignore French quotes, as demonstrated in the following URL:
https://www.google.com/search?btnG=1&pws=0&q=%22%C2%AB+Top+Dogs+%C2%BB%22
Comment 2 Kaspar Manz 2012-09-11 07:40:01 UTC
Yes – I guess I should clarify:

Phrases on wiki pages that are set between guillemets can't be found by the internal search, probably because the search index saves those phrases as "«Top" and "Dogs»" instead of "Top" and "Dogs".

Example: The page http://tls.theaterwissenschaft.ch/wiki/Volker_Hesse contains the sentence:

"H. inszenierte unter anderem 1993 die deutschsprachige Erstaufführung von Tony Kushners «Angels in America», die Dürrenmatt-Collage «Fritz», 1995 Coline Serreaus «Weissalles und Dickedumm», 1996 die Uraufführung von Hürlimanns «Carleton», 1997 Schnitzers «Liebelei» und wurde mit den Ensembleprojekten «In Sekten» (1994) und «Top Dogs» (1996, Textgrundlage: →Urs Widmer) an das Berliner Theatertreffen eingeladen."

Searching for "Ensembleprojekten", "Textgrundlage" or "Berliner Theatertreffen" will find this page. Searching for "Fritz", "Weissalles", "Carleton" or "Liebelei" yields no results, as those words are all set between guillemets in the aforementioned sentence. Searching for those last words with the guillemets in place (as "«Fritz»", "«Carleton»" or "«Liebelei»") does produce results.
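
To illustrate the suspected cause, here is a minimal Python sketch. It assumes an index that splits on whitespace and strips only ASCII punctuation; this is a guess at the behaviour, not MediaWiki's actual indexing code, and the function name `naive_tokens` is made up for the example.

```python
import string

def naive_tokens(text):
    """Split on whitespace and strip ASCII punctuation from both ends.

    Hypothetical model of the index, not MediaWiki's real tokenizer.
    Guillemets are not ASCII punctuation, so they stay glued to the word.
    """
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:
            tokens.append(token)
    return tokens

# Shortened excerpt of the sentence quoted above.
sentence = (
    "die Dürrenmatt-Collage «Fritz», 1995 und «Top Dogs» "
    "(1996, Textgrundlage: Urs Widmer)"
)
index = set(naive_tokens(sentence))

print("fritz" in index)          # False: the bare word is not indexed
print("«fritz»" in index)        # True: only the form with guillemets is
print("textgrundlage" in index)  # True: ASCII punctuation is stripped
```

Under this assumption the index holds "«top" and "dogs»" rather than "top" and "dogs", which would match the search results described above.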
Comment 3 Nemo 2013-12-13 20:38:52 UTC
(In reply to comment #2)
> Phrases on wiki pages that are set between guillemets can't be found by the
> internal search, probably because the search index saves those phrases as
> "«Top" and "Dogs»" instead of "Top" and "Dogs".

OK. This depends on the tokenization system being used: it's probably not an issue with Lucene or Cirrus/Elasticsearch. Which search backend is that wiki using? How much control do we have over tokenization in the standard MediaWiki search, which IIRC may use MySQL directly or something like that?
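
For comparison, a rough Python sketch of a tokenizer that treats every non-word character as a boundary. This only illustrates the general idea; it is not how Lucene, Elasticsearch or the MySQL-based search actually analyze text, and `word_tokens` is a hypothetical helper.

```python
import re

def word_tokens(text):
    """Return lowercase runs of word characters.

    For str patterns in Python 3, \\w matches Unicode letters, digits and
    the underscore, so « and » are treated as boundaries like any other
    punctuation.
    """
    return [t.lower() for t in re.findall(r"\w+", text)]

tokens = word_tokens("Schnitzers «Liebelei» und «Top Dogs» (1996)")
print(tokens)
# ['schnitzers', 'liebelei', 'und', 'top', 'dogs', '1996']
```

With this kind of tokenization, a search for "Liebelei" or "Top Dogs" would match the quoted sentence. Note that it also splits hyphenated words such as "Dürrenmatt-Collage" into two tokens, which may or may not be desirable.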
