Last modified: 2014-09-24 17:19:44 UTC
It seems that CirrusSearch (as of the version deployed on English Wikipedia) has a problem with byte counts. I assume they should reflect the wikitext size shown on the history page, but CirrusSearch reports sizes consistently smaller than that. For https://en.wikipedia.org/wiki/Western_Star (833 bytes):
https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=CirrusSearch => 741 B (116 words)
https://en.wikipedia.org/w/index.php?search=western+star&title=Special%3ASearch&fulltext=1&srbackend=LuceneSearch => 833 B (115 words)
If I had to guess I'd say we're reporting the pre-expansion size. I'll have a look at this next week.
Cirrus's byte count is just PHP's strlen function applied to the text field, which is probably wrong now that we're stripping out 'aux_text'. Should it be the length of the wikitext or of the rendered text? Yusuke Matsubara, what do you use the length for? I'm curious because that'll inform what it should be. When we started Cirrus we didn't think anyone really used the field, so we just took a guess at how to implement it and never wrote any regression tests for it. You can see what Cirrus stores for the page here: https://en.wikipedia.org/wiki/Western_Star?action=cirrusdump
I think it should be the size of the full post-expanded text. We should be able to fetch that from the Revision or Page objects we have on hand during indexing.
You want it with the html?
(In reply to Nik Everett from comment #4)
> You want it with the html?
Actually probably not. Revision stores it as the strlen() of the wikitext. We just need to get that length before stripping aux, like you said.
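To illustrate the ordering issue discussed above, here is a rough Python sketch (not the actual CirrusSearch code, which is PHP). The names build_doc_buggy, build_doc_fixed, AUX_PATTERN and the "text_bytes" key are hypothetical, and the regex only stands in for whatever is really moved into 'aux_text'. It assumes the byte count is meant to match strlen() of the wikitext, i.e. the UTF-8 byte length.

```python
import re

# Hypothetical stand-in for the content that gets moved into 'aux_text'.
# The real extraction is different; this only makes the example runnable.
AUX_PATTERN = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def byte_length(text: str) -> int:
    """UTF-8 byte count, analogous to PHP's strlen() on a UTF-8 string."""
    return len(text.encode("utf-8"))


def build_doc_buggy(wikitext: str) -> dict:
    # Measures the size AFTER stripping, so the reported size is too small.
    text = AUX_PATTERN.sub("", wikitext)
    return {"text": text, "text_bytes": byte_length(text)}


def build_doc_fixed(wikitext: str) -> dict:
    # Measures the size BEFORE stripping, so it matches the wikitext length.
    size = byte_length(wikitext)
    text = AUX_PATTERN.sub("", wikitext)
    return {"text": text, "text_bytes": size}


wikitext = "{{Infobox|name=Western Star}}\n'''Western Star''' is a café."
buggy = build_doc_buggy(wikitext)
fixed = build_doc_fixed(wikitext)
print("buggy:", buggy["text_bytes"], "fixed:", fixed["text_bytes"])
```

Both variants index the same stripped text; only the point at which the length is taken differs, which is the whole of the fix proposed here.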
Change 162627 had a related patch set uploaded by Chad: Use proper page sizes https://gerrit.wikimedia.org/r/162627
Change 162627 merged by jenkins-bot: Use proper page sizes https://gerrit.wikimedia.org/r/162627
Should be fixed; sizes will slowly correct as the patch goes out and pages are reindexed.