Last modified: 2014-08-29 15:21:53 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T56328, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 54328 - Make it possible for edit diff to be provided as a raw text
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: API
Version: unspecified
Hardware: All All
Importance: Normal enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks: 55793
Reported: 2013-09-19 14:59 UTC by Peter Bena
Modified: 2014-08-29 15:21 UTC (History)


Description Peter Bena 2013-09-19 14:59:46 UTC
Currently it is possible to request the HTML source of a diff. It would be far more useful if the diff could be retrieved as raw text, so that, for example, in XML format we would have:

<diff>
<original>Previous text</original>
<newtext>New text</newtext>
</diff>
Comment 1 Brad Jorsch 2013-09-19 15:45:36 UTC
The data structure would have to be rather more complicated than that. At first guess, something along the lines of (in JSON):

 "diff": [
     { "line": 1, "type": "context", "content": "Line" },
     { "line": 2, "type": "removed", "old": "Line" },
     { "line": 2, "type": "added", "new": "Line" },
     { "line": 3, "type": "context", "content": "Line" },
     { "line": 47, "type": "context", "content": "Line" },
     { "line": 48, "type": "changed", "old": "Line", "new": "Line" },
     { "line": 49, "type": "context", "content": "Line" }
 ]
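
For illustration only, a minimal Python sketch of how a client-side script might produce entries of roughly this shape. It uses the standard difflib module rather than MediaWiki's diff engine, the function name line_diff is hypothetical, and it emits every unchanged line as context instead of trimming to a few surrounding lines:

```python
from difflib import SequenceMatcher


def line_diff(old_text, new_text):
    """Return a list of line-based diff entries (hypothetical format)."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    entries = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines, numbered by their position in the new text.
            for offset, line in enumerate(new[j1:j2]):
                entries.append(
                    {"line": j1 + offset + 1, "type": "context", "content": line}
                )
        else:
            # "replace", "delete" and "insert" all reduce to removed + added.
            for offset, line in enumerate(old[i1:i2]):
                entries.append(
                    {"line": i1 + offset + 1, "type": "removed", "old": line}
                )
            for offset, line in enumerate(new[j1:j2]):
                entries.append(
                    {"line": j1 + offset + 1, "type": "added", "new": line}
                )
    return entries
```

This sketch never emits a "changed" entry; pairing a removed line with an added line is left out to keep it short.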

If you want indication in the line of what changed for "changed" types, that's another complication. Instead of just "Line" it would have to be an array of fragments. One simple way might be that even array indexes are unchanged and odd are changed:

   "old": [
       "foo bar ",
       "",
       "quux ",
       "poop"
   ],
   "new": [
       "foo bar ",
       "baz ",
       "quux ",
       "etc."
   ]

That might indicate that "baz" was inserted into the list and "poop" at the end was replaced with "etc.". Or maybe it would be better to combine "old" and "new" into one data structure somehow.
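
As a rough sketch of how a client could consume the even/odd convention described above, assuming it were adopted as written (the function name render_fragments and the bracket markers are hypothetical, not part of any API):

```python
def render_fragments(fragments, open_mark="[", close_mark="]"):
    """Join fragments, wrapping the changed (odd-index) ones in markers."""
    parts = []
    for index, text in enumerate(fragments):
        if index % 2 == 1 and text:
            # Odd index: changed text. Empty strings are placeholders only.
            parts.append(open_mark + text + close_mark)
        else:
            # Even index: unchanged text, copied through as-is.
            parts.append(text)
    return "".join(parts)
```

With the arrays above, the "old" side would render as "foo bar quux [poop]" and the "new" side as "foo bar [baz ]quux [etc.]".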

Also, keep in mind that lots of little objects can use a surprising amount of memory (see bug 53663).
Comment 2 Peter Bena 2013-10-17 13:53:14 UTC
I think that, to begin with, splitting the new text from the old text would be enough. Right now it's hard to find out what was added by the user and what was there before they edited the page.
Comment 3 Aaron Halfaker 2014-08-29 08:28:38 UTC
1. Character vs. line offset
I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno, since linebreaks are defined differently between systems. Character offsets would also allow us to make changes to our diff detection strategy without changing the output.

2. Machine readable vs. human readable diffs
Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context.  A common format that I'm familiar with would be something like this:

a = "These are wrd."
b = "These are words."
{
  diff: [
    {
      op: "equal",
      a_start: 0,
      a_end: 10,
      b_start: 0,
      b_end: 10
    },
    {
      op: "remove",
      a_start: 10,
      a_end: 13,
      b_start: 10,
      b_end: 10,
      content: "wrd",
    },
    {
      op: "insert",
      a_start: 13,
      a_end: 13,
      b_start: 10,
      b_end: 15,
      content: "words",
    },
    {
      op: "equal",
      a_start: 13,
      a_end: 14,
      b_start: 15,
      b_end: 16
    }
  ]
}
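
To show what "the full set of operations used to recreate a revision" means in practice, here is a small Python sketch that rebuilds b from a and the operation list above. The function name apply_ops is hypothetical, and the semantics (copy on "equal", skip on "remove", append content on "insert") are an assumption read off the example rather than a defined format:

```python
def apply_ops(a, ops):
    """Rebuild the new text from the old text and a list of operations."""
    pieces = []
    for op in ops:
        kind = op["op"]
        if kind == "equal":
            # Copy the unchanged span straight from the old text.
            pieces.append(a[op["a_start"]:op["a_end"]])
        elif kind == "insert":
            # New text is carried in the operation itself.
            pieces.append(op["content"])
        elif kind == "remove":
            # Removed text contributes nothing to the new revision.
            continue
        else:
            raise ValueError("unknown operation: %r" % kind)
    return "".join(pieces)


# The operations from the example above, as Python dicts.
EXAMPLE_OPS = [
    {"op": "equal", "a_start": 0, "a_end": 10, "b_start": 0, "b_end": 10},
    {"op": "remove", "a_start": 10, "a_end": 13, "b_start": 10, "b_end": 10,
     "content": "wrd"},
    {"op": "insert", "a_start": 13, "a_end": 13, "b_start": 10, "b_end": 15,
     "content": "words"},
    {"op": "equal", "a_start": 13, "a_end": 14, "b_start": 15, "b_end": 16},
]
```

Under those assumptions, apply_ops("These are wrd.", EXAMPLE_OPS) yields "These are words.".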

3. Compressed format
I don't see the value in compressing the format given that the API doesn't really let you query for more than one diff at a time and diffs tend to be represented in few operations.  However, we could simply represent each operation as a tuple with agreed upon field order:


    {
      op: "insert",
      a_start: 13,
      a_end: 13,
      b_start: 15,
      b_end: 18,
      content: "foo"
    }

could be 

    [
      "insert",
      13,
      13,
      15,
      18,
      "foo"
    ]

or, if we really want to get a tight format (since the rest of the fields are derivable in a sequence of operations):

   [
     "insert",
     15,
     "foo"
   ]
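
A minimal Python sketch of the tuple form with an agreed-upon field order, for illustration only. The names FIELD_ORDER, compress_op and expand_op are hypothetical, and the tighter three-element form is not attempted here because its derivation rules are not spelled out:

```python
# Agreed-upon field order; "content" is last so it can be omitted for "equal".
FIELD_ORDER = ("op", "a_start", "a_end", "b_start", "b_end", "content")


def compress_op(op):
    """Turn an operation dict into a positional list."""
    return [op[field] for field in FIELD_ORDER if field in op]


def expand_op(values):
    """Turn a positional list back into an operation dict."""
    # zip stops at the shorter input, so a missing "content" is simply absent.
    return dict(zip(FIELD_ORDER, values))
```

Round-tripping the insert operation above through compress_op and expand_op gives back the original dict.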
Comment 4 Brad Jorsch 2014-08-29 15:21:53 UTC
(In reply to Aaron Halfaker from comment #3)
> 1. Character vs. line offset
> I'd much rather represent diffs based on a character offset I'm afraid of
> representing position with something like lineno since linebreaks are
> differently defined between systems.

Isn't that an argument for line-based rather than character-based offsets?
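
One way to read this question, sketched in Python with made-up sample strings: when the same content is stored with different linebreak conventions, a character offset moves but a line number does not. The helper names are hypothetical, and the sketch assumes lines are split with str.splitlines(), which treats both "\n" and "\r\n" as a linebreak:

```python
# The same three lines with Unix and Windows linebreaks.
unix_text = "a\nb\nc"
windows_text = "a\r\nb\r\nc"


def char_offset(text, needle):
    """0-based character offset of the first occurrence of needle."""
    return text.index(needle)


def line_number(text, needle):
    """1-based number of the first line equal to needle."""
    return text.splitlines().index(needle) + 1
```

Here char_offset gives 4 for "c" in unix_text but 6 in windows_text, while line_number gives 3 for both.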

>  Character offsets would also allow us
> to make changes to our diff detection strategy without changing the output.
> 
> 2. Machine readable vs. human readable diffs
> Machine readable diff opcode formats tend to represent the full set of
> operations used to recreate a revision -- not just the context.

OTOH, what is the usual use of querying the diffs? I suspect that more often the client wants to display a human-readable diff to the end user, rather than do the equivalent of the 'patch' utility on an already-downloaded local copy of the article.

> and diffs tend to be represented in few operations.

On talk pages, maybe. But someone heavily copyediting an article is likely to generate a huge number of operations. With the way the diff algorithm works, even some simple edits will generate many operations as it tries to match up individual letters in the old vs new paragraphs.
