Last modified: 2014-09-24 17:19:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T72919, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70919 - Byte counts under-reported in search results
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Component: CirrusSearch
Version: master
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:

Reported: 2014-09-17 00:19 UTC by Yusuke Matsubara
Modified: 2014-09-24 17:19 UTC (History)
CC List: 3 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Description Yusuke Matsubara 2014-09-17 00:19:48 UTC
It seems that CirrusSearch (as of the version deployed on English Wikipedia) has a problem with byte counts. I assume they should reflect the wiki text size shown in the history page, but CirrusSearch reports sizes consistently smaller than that.

For https://en.wikipedia.org/wiki/Western_Star (833 bytes),

https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=CirrusSearch
=> 741 B (116 words)

https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=LuceneSearch
=> 833 B (115 words)
Comment 1 Chad H. 2014-09-19 17:52:06 UTC
If I had to guess I'd say we're reporting the pre-expansion size. I'll have a look at this next week.
Comment 2 Nik Everett 2014-09-19 18:06:57 UTC
Cirrus's byte count is just PHP's strlen function on the text field, which is probably wrong now that we're stripping out 'aux_text'. Should we make sure that it's the length of the wikitext or of the rendered text? Yusuke Matsubara, what do you use the length for? I'm curious because that'll inform what it should be. When we started Cirrus we didn't think anyone really used the field, so we just took a guess at how to implement it and never wrote any regression tests for it.

You can see what Cirrus stores for the page here:
https://en.wikipedia.org/wiki/Western_Star?action=cirrusdump
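
To illustrate the suspected cause, here is a minimal Python sketch (not the actual CirrusSearch code, which is PHP). It assumes that measuring the byte length of a field after some auxiliary text has been moved out of it gives a smaller number than measuring the full text. The helper `php_strlen` and the sample strings are hypothetical and only mimic the byte-counting behaviour of PHP's strlen on UTF-8 input.

```python
def php_strlen(s: str) -> int:
    """Byte length of the UTF-8 encoding, mimicking PHP's strlen on a UTF-8 string."""
    return len(s.encode("utf-8"))


# Hypothetical page content; this is NOT the real Western Star article.
main_text = "Western Star may refer to several things."
aux_text = "Caption: a truck built by Western Star."
full_text = main_text + "\n" + aux_text

size_full = php_strlen(full_text)         # size measured on the whole text
size_after_strip = php_strlen(main_text)  # size measured after the auxiliary part was removed
missing = size_full - size_after_strip    # bytes that go unreported

print(size_full, size_after_strip, missing)
```

The gap equals the size of the stripped part plus the separator, so any page with auxiliary text would be under-reported in the same way.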
Comment 3 Chad H. 2014-09-19 18:24:46 UTC
I think it should be the size of the full post-expanded text...we should be able to fetch that from the Revision or Page objects we have on hand during indexing.
Comment 4 Nik Everett 2014-09-19 18:26:13 UTC
You want it with the html?
Comment 5 Chad H. 2014-09-19 18:43:25 UTC
(In reply to Nik Everett from comment #4)
> You want it with the html?

Actually probably not. Revision stores it as the strlen() of the wikitext. We just need to get that length before stripping aux, like you said.
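
A rough sketch of that ordering, again in Python rather than the PHP used by the extension: the size is taken from the full wikitext first, and only afterwards is the auxiliary text separated out. Everything here is an assumption made for illustration, including the names `IndexedDoc`, `split_aux` and `build_doc`, the field name `text_bytes`, and the rule that lines starting with `[[File:` count as auxiliary.

```python
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    text: str        # main text kept for searching
    aux_text: str    # auxiliary text stored separately
    text_bytes: int  # size shown in search results


def byte_len(s: str) -> int:
    # Bytes of the UTF-8 encoding, comparable to PHP's strlen.
    return len(s.encode("utf-8"))


def split_aux(wikitext: str) -> tuple[str, str]:
    # Hypothetical rule: treat file/image lines as auxiliary text.
    lines = wikitext.splitlines()
    main = [line for line in lines if not line.startswith("[[File:")]
    aux = [line for line in lines if line.startswith("[[File:")]
    return "\n".join(main), "\n".join(aux)


def build_doc(wikitext: str) -> IndexedDoc:
    size = byte_len(wikitext)  # measure BEFORE stripping anything
    main, aux = split_aux(wikitext)
    return IndexedDoc(text=main, aux_text=aux, text_bytes=size)


wikitext = (
    "'''Western Star''' may refer to:\n"
    "[[File:Example.jpg|thumb|A caption]]\n"
    "* a newspaper\n"
)
doc = build_doc(wikitext)
print(doc.text_bytes, byte_len(doc.text))
```

With this ordering the reported size no longer depends on how much text is later moved into the auxiliary field.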
Comment 6 Gerrit Notification Bot 2014-09-24 16:40:48 UTC
Change 162627 had a related patch set uploaded by Chad:
Use proper page sizes

https://gerrit.wikimedia.org/r/162627
Comment 7 Gerrit Notification Bot 2014-09-24 17:08:08 UTC
Change 162627 merged by jenkins-bot:
Use proper page sizes

https://gerrit.wikimedia.org/r/162627
Comment 8 Chad H. 2014-09-24 17:19:44 UTC
Should be fixed; sizes will slowly correct as the patch goes out and pages are reindexed.
