Last modified: 2014-06-27 12:01:40 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T67783, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 65783 - Allow search in the raw wiki text source via insource:
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All All
Importance: Unprioritized normal with 1 vote
Assigned To: Nobody - You can work on this!
Reported: 2014-05-26 22:11 UTC by Thiemo Mättig
Modified: 2014-06-27 12:01 UTC (History)

Description Thiemo Mättig 2014-05-26 22:11:20 UTC
The German community held a vote on a "technical wish list", and "source search" made it into the top 20 wishes. See [[w:de:WP:Umfragen/Technische_Wünsche/Suche#Wünsche]].

I had the chance to talk with one of the current CirrusSearch developers, and we think this should be fairly easy to implement: in addition to the current field (which contains only the visible text), we could add a second field that contains the plain, untransformed wikitext. I suggest a keyword "insource:..." for searching this field. This could be very powerful in combination with the existing "hastemplate:...".

Possible problems:

1. This will roughly double the size of the index. Is this worth it?
2. Stemming should be disabled on this field, if that's possible. And it probably needs a few more tweaks.
3. Searching for special characters can't work, right?
4. Can this still work if we switch to Parsoid some day? It should, right?
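
As a rough illustration of the proposed keyword, the sketch below builds a MediaWiki API search URL that combines "insource:" with "hastemplate:". The endpoint, the helper name build_source_search_url and the sample values are assumptions made for this example, and the quoting of the search terms is one plausible way to write the query, not something specified in this report.

```python
from urllib.parse import urlencode

# Assumed example endpoint; any wiki running CirrusSearch would do.
API_ENDPOINT = "https://de.wikipedia.org/w/api.php"


def build_source_search_url(source_text, template=None, limit=10):
    """Build a search URL for raw wikitext (hypothetical helper).

    Neither argument is escaped here, so values containing double
    quotes would need extra handling.
    """
    # Quote the text so a multi-word phrase is searched as one unit.
    terms = ['insource:"{}"'.format(source_text)]
    if template:
        # Optionally restrict results to pages that use a given template.
        terms.append('hastemplate:"{}"'.format(template))
    params = {
        "action": "query",
        "list": "search",
        "srsearch": " ".join(terms),
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_source_search_url("colspan", template="Infobox"))
```

The URL is only constructed, not requested, so the sketch runs without network access.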
Comment 1 Gerrit Notification Bot 2014-06-05 18:49:02 UTC
Change 137733 had a related patch set uploaded by Manybubbles:
Basic insource support

https://gerrit.wikimedia.org/r/137733
Comment 2 Gerrit Notification Bot 2014-06-17 21:54:37 UTC
Change 137733 merged by jenkins-bot:
Insource support

https://gerrit.wikimedia.org/r/137733
Comment 3 MZMcBride 2014-06-18 00:57:08 UTC
(In reply to Gerrit Notification Bot from comment #2)
> Change 137733 merged by jenkins-bot:
> Insource support
> 
> https://gerrit.wikimedia.org/r/137733

Whoa, really?
Comment 4 Chad H. 2014-06-18 03:40:14 UTC
(In reply to MZMcBride from comment #3)
> (In reply to Gerrit Notification Bot from comment #2)
> > Change 137733 merged by jenkins-bot:
> > Insource support
> > 
> > https://gerrit.wikimedia.org/r/137733
> 
> Whoa, really?

Yep, should start making its way live with the next wmf branch come Thursday.
Comment 5 Nik Everett 2014-06-18 12:58:02 UTC
(In reply to Chad H. from comment #4)
> (In reply to MZMcBride from comment #3)
> > (In reply to Gerrit Notification Bot from comment #2)
> > > Change 137733 merged by jenkins-bot:
> > > Insource support
> > > 
> > > https://gerrit.wikimedia.org/r/137733
> > 
> > Whoa, really?
> 
> Yep, should start making its way live with the next wmf branch come Thursday.

Caveats for regexes:
1.  It's kinda slow.
2.  We only allow 2 concurrent queries at a time.
3.  We have a maximum queue of 10.  This is to keep more than 12 Apaches from getting stuck waiting for it.
4.  Syntax error feedback is only OK, not great.
5.  If you fill up the queue then you won't get a useful error message.
6.  No highlighting of results at all.  Something I'll work on fixing in the next couple of weeks.
7.  It's going to take some time after the initial release for all pages to be indexed.  We didn't have the source indexed before, so we'll have to regenerate all the documents, and we didn't write anything fancy to do just the source, so we'll end up rerendering everything.  It's slow, but it'll work.
8.  The regex language is actually Lucene's regex, which is designed to be efficient rather than super expressive.  I chose it because it's safe.
9.  Other stuff I don't remember?
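
The limits in caveats 2 and 3 (2 running queries plus a queue of 10, so at most 12 waiting callers) can be pictured with the minimal sketch below. It is not how CirrusSearch or Elasticsearch enforce the limit; the class RegexQueryGate and the exception QueueFull are hypothetical names used only to show the idea of rejecting requests once both the running slots and the queue are full.

```python
import threading
from contextlib import contextmanager


class QueueFull(Exception):
    """Raised when the running slots and the waiting queue are both full."""


class RegexQueryGate:
    """Hypothetical admission gate: a few running queries plus a bounded queue."""

    def __init__(self, max_running=2, max_queued=10):
        # Admission slots cover running plus waiting callers (2 + 10 = 12).
        self._admission = threading.BoundedSemaphore(max_running + max_queued)
        # Execution slots cover only the queries that are actually running.
        self._running = threading.BoundedSemaphore(max_running)

    @contextmanager
    def slot(self):
        # Reject immediately rather than letting more callers pile up.
        if not self._admission.acquire(blocking=False):
            raise QueueFull("regex search queue is full")
        try:
            # Blocks while max_running queries are already active.
            with self._running:
                yield
        finally:
            self._admission.release()


if __name__ == "__main__":
    gate = RegexQueryGate()
    with gate.slot():
        print("running a (pretend) regex query")
```

A caller that receives QueueFull would be turned away right away instead of waiting, which is the behaviour caveat 3 describes for the 13th concurrent request.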

Docs are here:   https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures#insource:

We were tired of waiting for ops to build out infrastructure for easy copying to labs.  So we figured we'd just make it in prod and limit it to a few executors.  Hopefully everything will be just fine.  We might decide, but haven't yet, that it'd be best to limit it to users with a permission, or signed-in users, or something.  We'd only do that if we saw that it was crushing us or that some asshole was keeping the queue full so that no legitimate users could use it.
Comment 6 Chad H. 2014-06-18 15:24:45 UTC
(In reply to Nik Everett from comment #5)
> We were tired of waiting for ops to build out infrastructure for easy
> copying to labs.

Well it's a lower priority for Swift than say image storage, so I understand the delay. We still want it though for backups and labs :)
Comment 7 Nik Everett 2014-06-18 15:26:38 UTC
(In reply to Chad H. from comment #6)
> (In reply to Nik Everett from comment #5)
> > We were tired of waiting for ops to build out infrastructure for easy
> > copying to labs.
> 
> Well it's a lower priority for Swift than say image storage, so I understand
> the delay. We still want it though for backups and labs :)

Yeah!  I totally want backups!  I just was tired of waiting for it for regexes.  Hopefully it won't turn out to be a mistake.
