Last modified: 2014-04-17 20:50:10 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T62484, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 60484 - CirrusSearch: Don't index text in visually hidden elements


Summary:	CirrusSearch: Don't index text in visually hidden elements

Status:	NEW

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	CirrusSearch (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Normal normal (vote)
Target Milestone:	---
Assigned To:	Nik Everett

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2014-01-27 18:56 UTC by Nik Everett
Modified:	2014-04-17 20:50 UTC (History)
CC List:	10 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
Add an attachment (proposed patch, testcase, etc.)

Description Nik Everett 2014-01-27 18:56:00 UTC

Cirrus shouldn't index the contents of visually hidden elements.  I haven't confirmed that we do this but I wouldn't be surprised.  We're not going to boot a browser and run javascript use that to figure out if the contents are hidden, but we should be able to do something.

Comment 1 Nik Everett 2014-02-10 18:23:10 UTC

I'd like to get to work on this but I'm not really sure what classes we normally use for hidden text.

Comment 2 Chad H. 2014-02-10 18:48:01 UTC

I'm wondering if we should avoid using any existing classes for this and go a completely different route. Introduce something like a "nosearch" class and document it as being completely ignored by search.

Comment 3 Nik Everett 2014-02-10 19:37:38 UTC

I'll just do the nosearch class thing.  This also has the neat side effect of making it possible to highlight all text on the page that is not included in search by injecting css like so:
.nosearch {
    background-color: red;
}

Comment 4 Gerrit Notification Bot 2014-02-10 19:49:48 UTC

Change 112500 had a related patch set uploaded by Manybubbles:
Don't search text wrapped in nosearch class

https://gerrit.wikimedia.org/r/112500

Comment 5 Nemo 2014-02-18 15:18:04 UTC

(In reply to Chad H. from comment #2)
> I'm wondering if we should avoid using any existing classes for this and go
> a completely different route. Introduce something like a "nosearch" class
> and document it as being completely ignored by search.

How many such classes do we have, is there a list somewhere? Do we have an estimate of how much overlap there would be with "noprint" and "nomobile"? There were some complaints when "nomobile" was "spinned off" from noprint, IIRC. Users who complained about excessive indexing might be a suitable target whom to ask an estimate of where they'd like to add such a class on a live wiki.

Comment 6 James Salsman 2014-02-19 07:23:11 UTC

(In reply to Nik Everett from comment #0)
> Cirrus shouldn't index the contents of visually hidden elements.

Many of the "hat" templates used to elide comments are referred to as "archive" templates, but the templates are not removed when the pages are actually archived.

For that reason, I recommend not excluding them from search.

Was there a different use case?

Comment 7 Nik Everett 2014-02-19 15:14:33 UTC

The biggest case I can think of for excluding text from search is the license information on commons. Please take that as an example. Maybe it is the only example I think it is pretty important.
1. The license information doesn't add a whole lot to the result. Try searching commons with Cirrus for "distribute", "transmit", or "following" and you'll very quickly start to see the text of the CC license. And the searches find 14 million results. Heaven forbid you want to find "distributed transmits" or something. You'll almost exclusively get the license highlighted and you'll still find 14 million results. This isn't _horrible_ because the top results all have "distribute" or "transmit" in the title but it isn't great.
2. Knock on effect from #2: because relevance is calculated based on the inverse of the number of documents that contain the word the then every term in the CC license is worth less then words not in the license. I can't point to any example of why that is bad but I feel it in my bones. Feel free to ignore this. I'm probably paranoid.
3. Entirely self serving: given #1, the contents of the license take up an awful lot of space for very little benefit. If I had more space I could make Cirrus a beta on more wikis. It is kind of a lame reason and I'm attacking the space issue from other angles so maybe it'll be moot long before we get this deployed and convince the community that it is worth doing.
4. Really really self serving: if .nosearch is the right solution and is useful then it is super duper easy to implement. Like one line of code, a few tests, and bam. Its already done, just waiting to be rebased and merged. It was so easy it would have taken longer to estimate the effort then to propose an implementation.
I really wouldn't be surprised if someone couldn't come up with great reason why #1 is silly and we just shouldn't do it.

The big problem with the nosearch class implementation is that it'd be pretty simple to abuse and hard to catch the abuse because the text is still on the page. One of the nice things about the solution is you could use a web browser's debugger to highlight all the text excluded from search by writing a simple CSS class.

Comment 8 Nemo 2014-02-19 15:33:05 UTC

(In reply to Nik Everett from comment #7).
> [...] 3.  Entirely self serving: given #1, the contents of the license take up an
> awful lot of space for very little benefit. [...]

Is the RDF markup in the HTML also indexed, currently? That's a good part of the HTML, a fee hundreds bytes like this:
    <p><span class="licensetpl_link" style="display:none;">http://creativecommons.org/licenses/by/2.0</span> <span class="licensetpl_short" style="display:none;">CC-BY-2.0</span> <span class="licensetpl_long" style="display:none;">Creative Commons Attribution 2.0</span> <span class="licensetpl_link_req" style="display:none;">true</span><span class="licensetpl_attr_req" style="display:none;">true</span></p>

Are those already ignored due to the display:none?

The rest of the license is indeed not covered by noprint, noclass or "Exclude in print" :/, though a lot of stuff is on other projects, see e.g. https://en.wikipedia.org/wiki/Category:Exclude_in_print

Comment 9 Chad H. 2014-02-19 15:48:46 UTC

(In reply to Nemo from comment #8)
> Are those already ignored due to the display:none?
> 

No. That would be nice, and I imagine we could detect the naïve case of it being in a style="" tag pretty easily. It's a whole lot harder when it's in CSS.

Comment 10 Nik Everett 2014-02-19 17:21:48 UTC

Removing PATCH_TO_REVIEW as we're going to have a few patches I think.

Nemo pointed out to me on irc that we're indexing all the welcome messages too.  Those are good candidates to skip if possible.

Comment 11 Gerrit Notification Bot 2014-04-17 20:50:10 UTC

Change 112500 abandoned by Manybubbles:
Don't search text wrapped in nosearch class

Reason:
Replacing in favor of I70be621d2702409d88befd8f7ba0936cbc4912ef.

It won't save us any space but it'll improve snippets and relevance.  Hopefully.

https://gerrit.wikimedia.org/r/112500

Wikimedia Bugzilla is closed!

Search

Personal tools

Navigation

Links