Last modified: 2014-03-07 18:34:13 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T64209, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 62209 - feature request: Text extraction from custom wiki markup
Status: RESOLVED WONTFIX
Product: MediaWiki extensions
Classification: Unclassified
Component: TextExtracts
Version: unspecified
Hardware: All
OS: All
Importance: Unprioritized enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-03-04 16:00 UTC by Dimitris Kontokostas
Modified: 2014-03-07 18:34 UTC
CC: 1 user

See Also:
Web browser: ---
Mobile Platform: ---


Attachments

Description Dimitris Kontokostas 2014-03-04 16:00:08 UTC
Hi,

This is a very interesting project for DBpedia [1]. We already extract abstracts from articles (e.g. [2]), but up to now we have hacked into the MW core to get them [3].

Looking at the code, I noticed that you parse the whole page in order to get section 0. That is a very expensive operation for us. We usually take just the part of the wiki markup that we want to extract and use this API call to get it:

api.php?format=xml&action=parse&prop=text&title=[...]&text=[...]

Then our hacked MW engine returns it as clean text. As you can probably guess, title is used to resolve self-references like {{PAGENAME}}, and text is the part of the page markup we want to extract text from.

So, to get to the point: is this feasible in your extension? With some guidance from your side, we can also work on this.

[1] http://dbpedia.org
[2] http://dbpedia.org/page/Berlin
[3] https://github.com/dbpedia/extraction-framework/wiki/Dbpedia-Abstract-Extraction-step-by-step-guide#wiki-prepare-mediawiki---configuration-and-settings
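For illustration, here is a minimal Python sketch of the call above, using only standard action=parse parameters; the endpoint URL and helper name are just placeholders:

import requests

def parse_fragment(api_url, title, wikitext):
    """Render a wikitext fragment to HTML via action=parse.

    title lets the parser resolve self-references such as {{PAGENAME}};
    wikitext is the part of the page markup to render.
    """
    resp = requests.get(api_url, params={
        "action": "parse",
        "format": "json",
        "title": title,
        "text": wikitext,
        "prop": "text",
    })
    resp.raise_for_status()
    # with format=json, the rendered HTML sits under parse.text["*"]
    return resp.json()["parse"]["text"]["*"]

# html = parse_fragment("https://en.wikipedia.org/w/api.php",
#                       "Berlin", "'''{{PAGENAME}}''' is ...")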
Comment 1 Max Semenik 2014-03-04 20:43:42 UTC
1) If you specify &exintro, only the intro will be parsed.
2) TE operates only on the HTML returned by the parser; doing anything with wikitext directly would essentially be a different extension. What do you mean by "custom wiki markup"?
Comment 2 Dimitris Kontokostas 2014-03-04 21:25:45 UTC
Thanks,

I already saw the &exintro option, so one question to understand it:

When I use this call: http://en.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=&explaintext=&titles=Athens

does this extension load the whole page, convert it to HTML, and then return the first section?
If not, this extension is perfect for our purpose and you can skip the rest :)

If yes, we would like to avoid loading the whole page, as it would slow down our extraction.

What we do so far is take the wiki markup of the page up to the first section and feed it into the MW "parse" API call [1], which normally returns HTML. Then we hack into the MW core to return cleaned text instead.

So, the request is to add "text" and "title" parameters to your API. When they are given, instead of parsing the page identified by title, you would parse the "text" parameter ("title" is only used for magic words like {{PAGENAME}}), get the HTML, and clean it the same way you do now.

Cheers,
Dimitris

[1] https://www.mediawiki.org/wiki/API:Parsing_wikitext#parse
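The same extracts query from a script, as a minimal sketch built on the standard TextExtracts parameters (exintro, explaintext); the helper name is a placeholder:

import requests

def intro_extract(api_url, title):
    """Fetch the plain-text intro of a page via prop=extracts."""
    resp = requests.get(api_url, params={
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": "",      # only the content before the first section heading
        "explaintext": "",  # return plain text instead of HTML
        "titles": title,
    })
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    # a single title was requested, so take the one page entry
    return next(iter(pages.values()))["extract"]

# print(intro_extract("https://en.wikipedia.org/w/api.php", "Athens"))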
Comment 3 Max Semenik 2014-03-05 09:37:17 UTC
(In reply to Dimitris Kontokostas from comment #2)
> does this extension loads the whole page, convert it to html and then return
> the first section? 

Once again, 
> 1) If you specify &exintro only intro will be parsed.
Comment 4 Dimitris Kontokostas 2014-03-07 15:13:07 UTC
Thanks again,

Still, is it possible to add these two parameters?
The current setting works for us, but it would suit us better if we had the text/title option.

That way we would only have to load the templates into the database and feed the text to the API; otherwise we need to load the whole dump.

If you agree to this request, we can work on this addition.
Comment 5 Max Semenik 2014-03-07 18:34:13 UTC
I don't think that turning TE into yet another wikitext parsing facility is the way we want it to evolve. You can do it trivially on your own infrastructure, though, using the ExtractFormatter class.
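Client-side, that could look roughly like the following: render the wikitext fragment with action=parse (as in the sketch under the description), then flatten the returned HTML to plain text. The tag-stripping below is a crude stand-in for what ExtractFormatter does server-side, not the extension's actual logic:

from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Crudely keep only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_text(html):
    """Flatten rendered HTML into whitespace-normalized plain text."""
    collector = TextCollector()
    collector.feed(html)
    return " ".join("".join(collector.parts).split())

# html = parse_fragment(api_url, "Berlin", intro_wikitext)  # sketch above
# abstract = clean_text(html)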
