Last modified: 2013-08-16 17:09:39 UTC
We propose to implement authorship tracking for the text of Wikipedia. The goal is to annotate every word of Wikipedia content with the revision in which it was inserted, and the author who created it.

We have been developing robust and efficient algorithms for computing this authorship information. The algorithms compare each new revision of a page with all previous revisions, and attribute any new content in the latest revision to its earliest plausible match in previous content. In this way, if content is deleted (e.g. by a vandal, or in the course of a dispute) and later re-inserted, it is still correctly attributed to its original author.

To achieve an efficient implementation, the algorithm keeps a specially encoded summary of the history of a wiki page. The size of this summary is proportional to the amount of change the page has undergone; since we drop information on content that has been absent for longer than 90 days and more than 100 edits, the summary is on average about 10 times the size of a typical revision.

When a user creates a new revision, the algorithm:
1. Reads the page summary.
2. Computes the authorship for the new revision, and stores it.
3. Stores an updated summary of the history that also includes the new revision.

The process takes about one second of processing time per revision, including the time to serialize and deserialize the summary, which is generally the dominant cost. The algorithm code is already available, and it works.

What we propose to do in this Summer of Code project is to make the algorithm run on the actual Wikipedia: integrating it with the production environment, text store, temporary database tables, etc., as required to make it work for as many language editions of Wikipedia as possible or desired.

Detailed information: https://www.mediawiki.org/wiki/User:Mshavlovsky/Authorship_Tracking
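The update cycle described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the real algorithm matches runs of text against all previous revisions, whereas this sketch attributes individual tokens via a dictionary summary; all names (`process_revision`, `Origin`, `MAX_ABSENT_EDITS`) are assumptions for illustration.

```python
from collections import namedtuple

# Earliest plausible origin of a token, plus the last edit it was seen in.
Origin = namedtuple("Origin", "revision author last_seen")

MAX_ABSENT_EDITS = 100  # the project also applies a 90-day window


def process_revision(summary, new_tokens, revision_id, author, edit_index):
    """One update cycle: read the page summary, compute authorship for the
    new revision, and return the updated summary plus the attribution."""
    authorship = []
    for token in new_tokens:
        origin = summary.get(token)
        if origin is None:
            # No earlier plausible match: attribute to this revision.
            origin = Origin(revision_id, author, edit_index)
        else:
            # Surviving or re-inserted content keeps its original author.
            origin = origin._replace(last_seen=edit_index)
        summary[token] = origin
        authorship.append((token, origin.revision, origin.author))
    # Prune content that has been absent for more than MAX_ABSENT_EDITS edits,
    # keeping the summary size proportional to the page's recent change.
    summary = {t: o for t, o in summary.items()
               if edit_index - o.last_seen <= MAX_ABSENT_EDITS}
    return summary, authorship


# Example: content added by the first author stays attributed to them.
s = {}
s, a1 = process_revision(s, ["the", "cat"], 1, "alice", 1)
s, a2 = process_revision(s, ["the", "cat", "sat"], 2, "bob", 2)
# "the" and "cat" remain attributed to revision 1 / alice; "sat" goes to bob.
```

The key property the sketch preserves is that attribution looks for the earliest plausible match in previous content, so re-inserted text is never credited to the editor who merely restored it.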
Hi Michael, is there a specific reason why you filed this under the "WikidataRepo" component? Asking as your comment here in Bugzilla and also the wiki page with the detailed information do not mention Wikidata at all.
Andre, I filed this under "WikidataRepo" by mistake. I have changed it. Thanks.
I would like some numbers for this algorithm in those cases where no previous analysis has been done for an article. That is: what is the expected processing time for some of the most typical articles, how is the worst case handled, and what is the worst case? I've been experimenting with methods for extracting the amount of contributions for some time, and as far as I know there is no single method that can be said to be "correct", or even be more than indicative of authorship.
I am not sure exactly which numbers you want, but you can find a detailed analysis of the algorithm in this paper: http://dl.acm.org/citation.cfm?id=2488419