Last modified: 2013-10-29 16:34:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T47983, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 45983 - Enable creation of dumps dedicated to feeding a search index
Status: RESOLVED INVALID
Product: MediaWiki
Classification: Unclassified
Component: ContentHandler
Version: 1.21.x
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Wikidata bugs
Depends on:
Blocks: 42234
Reported: 2013-03-11 10:19 UTC by Daniel Kinzler
Modified: 2013-10-29 16:34 UTC (History)

Description Daniel Kinzler 2013-03-11 10:19:56 UTC
Some search backends, like LuceneSearch, rely on XML dumps to build the search index. The indexer has no knowledge of content models, so it will index everything in the dump as-is. For non-text content models, this means it will index the serialized form, which will often lead to bad results (see bug 42234).

To solve this, a brief discussion on wikitech-l suggested adding an option to the dump creation process that would write generated text, instead of raw serialized data, into the dumps. This option could then be used to create dumps specifically for rebuilding a search index. See http://www.gossamer-threads.com/lists/wiki/wikitech/340638

The Content interface already defines the method getTextForSearchIndex() for generating such pseudo-content. It only needs to be hooked up to dump generation.
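
A rough, language-neutral sketch of the idea in Python (not MediaWiki's actual PHP code): the exporter picks between the serialized form and the search text depending on a flag. The class names, the for_search_index flag, and export_revision_text are made up for illustration; only serialize and getTextForSearchIndex correspond to methods on the real Content interface.

```python
import json
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class WikitextContent:
    """Text-based content: the serialized form is already searchable text."""
    text: str

    def serialize(self) -> str:
        return self.text

    def get_text_for_search_index(self) -> str:
        return self.text


@dataclass
class JsonEntityContent:
    """Hypothetical non-text content model stored as structured data."""
    label: str
    description: str

    def serialize(self) -> str:
        # Raw storage form; indexing this as-is gives poor search results.
        return json.dumps({"label": self.label, "description": self.description})

    def get_text_for_search_index(self) -> str:
        # Plain text generated for the indexer.
        return f"{self.label}\n{self.description}"


def export_revision_text(content, for_search_index: bool = False) -> str:
    """Return the <text> element for one revision (hypothetical helper)."""
    if for_search_index:
        body = content.get_text_for_search_index()
    else:
        body = content.serialize()
    return f"<text>{escape(body)}</text>"
```

With the flag off, the dump stays as it is today; with the flag on, an indexer that knows nothing about content models still receives plain text for every page.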
Comment 1 Ariel T. Glenn 2013-03-11 17:42:08 UTC
The work should be done in Export.php, I suppose, because then all the actual dump infrastructure will 'just work'. Additionally, someone using Special:Export would be able to export content in this format (given the right checkboxes).
Comment 2 Daniel Kinzler 2013-10-29 16:34:21 UTC
Since the move from Lucene to ElasticSearch, this is no longer an issue.
