Last modified: 2014-06-30 02:22:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T54905, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 52905 - should include link URLs in search?
Status: NEW
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nik Everett
Duplicates: 59205
Reported: 2013-08-16 00:15 UTC by Sumana Harihareswara
Modified: 2014-06-30 02:22 UTC
CC List: 5 users

Description Sumana Harihareswara 2013-08-16 00:15:32 UTC
1. Note that https://test2.wikipedia.org/w/index.php?title=Birch_beer&oldid=57684 includes a link to growstuff.org .

2. Search test2wiki for "growstuff.org" - https://test2.wikipedia.org/w/index.php?search=growstuff.org&title=Special%3ASearch 

3. Empty results set.

What is the desired behavior here? If a page does not *mention* growstuff.org but does *link* to it, should we include it in the results set?
Comment 1 Nik Everett 2013-08-19 15:32:01 UTC
I _think_ we should include it.  One way to think about this: if we did include it, how would you like it highlighted?  Another thing to consider is that we're mostly optimized for searching for words, and we might not be able to recognize a URL in the token stream well enough to avoid splitting it and (heaven forbid) stemming it.

Like Bug 53013, my gut says set the priority to low because we're mostly concerned with searching words.  So I'm setting the priority to low.  We should revisit this once we're comfortable with other issues.
Comment 2 Chad H. 2013-12-20 20:06:34 UTC
I've been pondering this, and I'm not convinced we should index it. I can't think of a sane way of doing so, or how to reinsert it into the content (which we've already stripped of all wikitext and html).

We have Special:LinkSearch, does it not work?
Comment 3 Nik Everett 2014-01-02 13:32:42 UTC
*** Bug 59205 has been marked as a duplicate of this bug. ***
Comment 4 Nik Everett 2014-01-02 13:41:02 UTC
Bug 59205 showed us that folks do expect link searches to work.  Options:

0.  Do nothing.
1.  Detect a link in the search and send people to Special:LinkSearch.  If folks are searching for full uris without extra terms this would probably work.
2.  Index links in their own multivalued field like section heading but with a uri or non-splitting analyzer and display them like file contents matches.  Search them all the time.  This would find links to places in the results.
3.  #2 but only search them with terms that "look like" uris.  This one makes more sense if users are searching for whole uris AND other terms at the same time.
4.  Figure out some way to get the uris back into the text but strip them out on matches for which they were not explicitly searched.  This would produce results similar to what works now but is technically more difficult (changes to how we get parsed output, changes to cirrus, probably changes to Elasticsearch to strip the uris during the highlighting phase).
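
Options 1 and 3 both depend on deciding whether a search term "looks like" a uri. A minimal Python sketch of one possible heuristic follows. The scheme list and the function names `looks_like_uri` and `partition_terms` are assumptions made for illustration; nothing in this thread says what rule CirrusSearch would actually use.

```python
from urllib.parse import urlsplit

# Hypothetical heuristic: a term "looks like" a uri when it carries a known
# scheme and a host part. The scheme list is an assumption for illustration.
URI_SCHEMES = {"http", "https", "ftp"}


def looks_like_uri(term: str) -> bool:
    """Return True when the term parses as scheme://host[...]."""
    try:
        parts = urlsplit(term)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
        return False
    return parts.scheme.lower() in URI_SCHEMES and bool(parts.netloc)


def partition_terms(query: str):
    """Split a whitespace-separated query into (uri_terms, other_terms)."""
    terms = query.split()
    uri_terms = [t for t in terms if looks_like_uri(t)]
    other_terms = [t for t in terms if not looks_like_uri(t)]
    return uri_terms, other_terms
```

Under this heuristic a bare "growstuff.org" has no scheme and is treated as an ordinary word, while "http://growstuff.org" is treated as a uri.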
Comment 5 Nik Everett 2014-01-02 22:39:27 UTC
Chad got us indexing the links:  https://gerrit.wikimedia.org/r/#/c/104986/

Now I'll grab searching them.
Comment 6 Nik Everett 2014-01-02 22:40:33 UTC
I'm going to shoot for option #3 in comment 4.  So we'll only look in the link field if one of the terms looks like a URI.
Comment 7 Nik Everett 2014-01-03 15:48:19 UTC
An important point I didn't realize at first:  if a term "looks like" a link, we can't just search the links.  We have to OR that together with searching the text.  No big deal, just more syntax we have to send to Elasticsearch.
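
As a rough illustration of that OR, the sketch below builds an Elasticsearch query body as a plain Python dict. The field names `text` and `outgoing_link`, the prefix check in `looks_like_uri`, and the overall query shape are assumptions for the example, not what the abandoned patch actually sent.

```python
import json

TEXT_FIELD = "text"           # hypothetical name of the analyzed page-text field
LINK_FIELD = "outgoing_link"  # hypothetical name of the non-splitting link field


def looks_like_uri(term: str) -> bool:
    # Deliberately simple stand-in for whatever "looks like a uri" rule is used.
    return term.lower().startswith(("http://", "https://"))


def clause_for_term(term: str) -> dict:
    """Ordinary terms search the text; uri-like terms search text OR links."""
    text_clause = {"match": {TEXT_FIELD: term}}
    if not looks_like_uri(term):
        return text_clause
    link_clause = {"term": {LINK_FIELD: term}}
    return {
        "bool": {
            "should": [text_clause, link_clause],
            "minimum_should_match": 1,
        }
    }


def build_query(query: str) -> dict:
    """Require every term, letting uri-like terms match in either field."""
    return {"query": {"bool": {"must": [clause_for_term(t) for t in query.split()]}}}


if __name__ == "__main__":
    print(json.dumps(build_query("birch http://growstuff.org"), indent=2))
```

The `term` clause matches the whole stored value, which fits a field that is not split into words.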
Comment 8 Nik Everett 2014-01-03 15:51:04 UTC
Another point: Sumana's original query still wouldn't find her growstuff link.  You'd have to search for it as http://growstuff.org.  Still, we're better off than we were.
Comment 9 Gerrit Notification Bot 2014-01-03 16:18:27 UTC
Change 105202 had a related patch set uploaded by Manybubbles:
Search links

https://gerrit.wikimedia.org/r/105202
Comment 10 Quiddity 2014-02-17 21:33:07 UTC
(In reply to Nik Everett from comment #8)
> Another point: Sumana's original query still wouldn't find her growstuff
> link.  You'd have to search for it as http://growstuff.org.  Still, we're
> better off than we were.

Just a note that being able to search for partial URL strings is quite useful when trying to combat spam, or to update links to sites that reorganized their directory structure without leaving proper redirects.

Hence, option 1 from comment #4 might be a good addition. Thanks!
Comment 11 Gerrit Notification Bot 2014-02-20 21:38:50 UTC
Change 105202 abandoned by Manybubbles:
Search links

https://gerrit.wikimedia.org/r/105202
Comment 12 Dan Garry 2014-02-20 21:41:37 UTC
The patch was abandoned as it wasn't relevant.

We will possibly redirect users to [[Special:LinkSearch]] if they type a URL into the search box, as it will serve the user's needs.
Comment 13 Quiddity 2014-06-30 02:22:55 UTC
Now that insource: is available, it is at least possible to find the desired content. E.g. https://test2.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3Agrowstuff.org&fulltext=Search
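
For anyone scripting this, a small Python sketch that builds the same kind of search URL is shown below. The helper name `insource_search_url` is made up for the example; the query parameters are copied from the URL above.

```python
from urllib.parse import urlencode


def insource_search_url(needle: str,
                        base: str = "https://test2.wikipedia.org/w/index.php") -> str:
    """Build a Special:Search URL that runs an insource: query for `needle`."""
    params = {
        "title": "Special:Search",
        "profile": "default",
        "search": f"insource:{needle}",
        "fulltext": "Search",
    }
    # urlencode percent-encodes the colons, matching the URL quoted above.
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(insource_search_url("growstuff.org"))
```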

Perhaps we could somehow add "insource:" as an option (or text-hint) at Advanced Search, in order to remind editors of that feature?  (Because only crazy people like me are actually going to hunt their way to [[mw:Help:CirrusSearch#insource:]] ;)
