Last modified: 2013-08-16 17:09:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T49956, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 47956 - Authorship Tracking
Status: UNCONFIRMED
Product: MediaWiki extensions
Classification: Unclassified
Component: Extensions requests (Other open bugs)
Version: unspecified
Hardware/OS: All / All
Importance: Low enhancement (vote)
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2013-05-02 00:03 UTC by Michael
Modified: 2013-08-16 17:09 UTC (History)
CC: 3 users
See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments: (none)

Description Michael 2013-05-02 00:03:51 UTC
We propose to implement authorship tracking for the text of Wikipedia. The goal is to annotate every word of Wikipedia content with the revision in which it was inserted and the author who created it.

We have been developing robust and efficient algorithms for computing the authorship information. The algorithms compare each new revision of a page with all previous revisions, and attribute any new content in the latest revision to its earliest plausible match in previous content. In this way, if content is deleted (e.g. by a vandal, or in the course of a dispute) and later re-inserted, it is still correctly attributed to its original author. To achieve an efficient implementation, the algorithm keeps a specially encoded summary of the history of a wiki page. The size of this summary is proportional to the amount of change the page has undergone; because we drop information on content that has been absent for more than 90 days and more than 100 edits, the summary is on average about 10 times the size of a typical revision. When a user creates a new revision, the algorithm:

    Reads the page summary
    Computes the authorship for the new revision, and stores it
    Stores an updated summary of the history that also includes the new revision.

The process takes about one second of processing time per revision, including the time to serialize and deserialize the summary, which is generally the dominant cost.
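
For illustration, here is a minimal Python sketch of the per-revision bookkeeping described above. It is not the actual implementation: the real algorithm matches sequences of tokens rather than individual words, and all names and data layouts below are assumptions made for this sketch.

    from dataclasses import dataclass

    # Thresholds from the description above: drop dead content that has been
    # absent for more than 90 days and more than 100 edits.
    ABSENCE_SECONDS = 90 * 86400
    ABSENCE_EDITS = 100

    @dataclass(frozen=True)
    class Origin:
        rev_id: int
        author: str

    def update_page_summary(summary, new_tokens, rev_id, author, now, edit_count):
        """One per-revision update step (word-level simplification).

        summary = {"live": {token: Origin},
                   "dead": [(token, Origin, last_seen_time, last_seen_edit)]}
        """
        live, dead = summary["live"], summary["dead"]
        new_live, attribution = {}, []

        for tok in new_tokens:
            if tok in live:
                origin = live[tok]                   # content kept from the previous revision
            else:
                # Re-inserted content is attributed to its earliest plausible
                # previous occurrence, not to the current editor.
                candidates = [d for d in dead if d[0] == tok]
                if candidates:
                    origin = min(candidates, key=lambda d: d[1].rev_id)[1]
                else:
                    origin = Origin(rev_id, author)  # genuinely new content
            new_live[tok] = origin
            attribution.append((tok, origin))

        # Content removed by this revision is remembered for possible re-insertion.
        for tok, origin in live.items():
            if tok not in new_live:
                dead.append((tok, origin, now, edit_count))

        # Trim the summary so its size stays bounded.
        summary["dead"] = [d for d in dead
                           if not (now - d[2] > ABSENCE_SECONDS
                                   and edit_count - d[3] > ABSENCE_EDITS)]
        summary["live"] = new_live
        return attribution, summary

The trimming step is what keeps the summary roughly proportional to the amount of change a page has seen (about 10 times the size of a typical revision on average, per the figures above).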

The algorithm code is already available, and it works. What we propose to do in this Summer of Code project is to make the algorithm run on the actual Wikipedia, integrating it with the production environment, text store, temporary database tables, etc., as required to make it work for as many language editions of Wikipedia as possible or desired.
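
As a rough idea of the integration side, the sketch below shows one possible, purely hypothetical storage layout: one serialized summary blob per page and one attribution blob per revision. The table and column names are made up for illustration and are not part of the proposal; the real work would use whatever production text store and database layout is appropriate.

    import json
    import sqlite3
    import zlib

    # Hypothetical temporary tables for the authorship data.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS authorship_summary (
        page_id INTEGER PRIMARY KEY,
        summary BLOB NOT NULL            -- compressed, serialized page summary
    );
    CREATE TABLE IF NOT EXISTS authorship_revision (
        rev_id      INTEGER PRIMARY KEY,
        page_id     INTEGER NOT NULL,
        attribution BLOB NOT NULL        -- per-token (origin revision, author)
    );
    """

    def save_summary(conn, page_id, summary):
        # `summary` is assumed to be JSON-serializable in this sketch.
        blob = zlib.compress(json.dumps(summary).encode())
        conn.execute("INSERT OR REPLACE INTO authorship_summary (page_id, summary) "
                     "VALUES (?, ?)", (page_id, blob))

    def load_summary(conn, page_id):
        row = conn.execute("SELECT summary FROM authorship_summary WHERE page_id = ?",
                           (page_id,)).fetchone()
        return json.loads(zlib.decompress(row[0])) if row else None

    # Example setup:
    conn = sqlite3.connect("authorship.db")
    conn.executescript(SCHEMA)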

Detailed Information:

https://www.mediawiki.org/wiki/User:Mshavlovsky/Authorship_Tracking
Comment 1 Andre Klapper 2013-05-02 13:08:01 UTC
Hi Michael,
is there a specific reason why you filed this under the "WikidataRepo" component? I am asking because neither your comment here in Bugzilla nor the wiki page with the detailed information mentions Wikidata at all.
Comment 2 Michael 2013-05-02 16:48:45 UTC
Andre,
I filed this under the "WikidataRepo" by mistake.
I changed it.
Thanks.
Comment 3 jeblad 2013-08-16 14:31:08 UTC
I would like some numbers for this algorithm in those cases where no previous analysis has been done for an article. That is: what is the expected processing time for typical articles, how is the worst case handled, and what is the worst case?

I've been experimenting with methods for extracting the amount of contributions for some time, and as far as I know there is no single method that can be said to be "correct", or even to be more than indicative of authorship.
Comment 4 Michael 2013-08-16 17:09:39 UTC
I am not sure what numbers you want, but you can find a detailed analysis of the algorithm in this paper: http://dl.acm.org/citation.cfm?id=2488419


