Last modified: 2014-08-29 15:21:53 UTC
Currently it is possible to request the HTML source of a diff. It would be far more useful if the diff could be retrieved in a structured form, so that, for example, in XML format we could have:

  <diff>
    <original>Previous text</original>
    <newtext>New text</newtext>
  </diff>
The data structure would have to be rather more complicated than that. At first guess, something along the lines of (in JSON):

  "diff": [
    { "line": 1, "type": "context", "content": "Line" },
    { "line": 2, "type": "removed", "old": "Line" },
    { "line": 2, "type": "added", "new": "Line" },
    { "line": 3, "type": "context", "content": "Line" },
    { "line": 47, "type": "context", "content": "Line" },
    { "line": 48, "type": "changed", "old": "Line", "new": "Line" },
    { "line": 49, "type": "context", "content": "Line" }
  ]

If you want indication in the line of what changed for "changed" types, that's another complication. Instead of just "Line" it would have to be an array of fragments. One simple way might be that even array indexes are unchanged and odd are changed:

  "old": [ "foo bar ", "", "quux ", "poop" ],
  "new": [ "foo bar ", "baz ", "quux ", "etc." ]

That might indicate that "baz" was inserted into the list and "poop" at the end was replaced with "etc.". Or maybe it would be better to combine "old" and "new" into one data structure somehow.

Also, keep in mind that lots of little objects can use a surprising amount of memory (see bug 53663).
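As a rough illustration of the even/odd fragment idea above, here is a minimal Python sketch. It is not how MediaWiki computes diffs: it assumes a simple whitespace tokenizer and uses Python's standard difflib in place of the real diff engine, and the names tokenize and fragment_arrays are made up for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into words, each keeping its trailing whitespace (an assumed scheme)."""
    return re.findall(r"\S+\s*", text)


def fragment_arrays(old_tokens, new_tokens):
    """Return (old, new) fragment lists: even indexes unchanged, odd indexes changed."""
    old, new = [], []
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        changed = tag != "equal"
        # Pad with an empty fragment when the next index has the wrong parity,
        # e.g. when the very first operation is a change.
        if (len(old) % 2 == 1) != changed:
            old.append("")
            new.append("")
        old.append("".join(old_tokens[i1:i2]))
        new.append("".join(new_tokens[j1:j2]))
    return old, new


old, new = fragment_arrays(tokenize("foo bar quux poop"),
                           tokenize("foo bar baz quux etc."))
print(old)
print(new)
```

For the example strings in this comment, the two lists come out the same as the "old" and "new" arrays shown above. Keeping the two arrays the same length means index i in one always lines up with index i in the other.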
I think that, to begin with, splitting the new text from the old text would be enough. Right now it is hard to find out what was added by the user and what was there before they edited the page.
1. Character vs. line offset

I'd much rather represent diffs based on a character offset. I'm afraid of representing position with something like lineno since linebreaks are differently defined between systems. Character offsets would also allow us to make changes to our diff detection strategy without changing the output.

2. Machine readable vs. human readable diffs

Machine readable diff opcode formats tend to represent the full set of operations used to recreate a revision -- not just the context. A common format that I'm familiar with would be something like this:

  a = "These are wrd."
  b = "These are words."

  {
    diff: [
      { op: "equal", a_start: 0, a_end: 10, b_start: 0, b_end: 10 },
      { op: "remove", a_start: 10, a_end: 13, b_start: 10, b_end: 10, content: "wrd" },
      { op: "insert", a_start: 13, a_end: 13, b_start: 10, b_end: 15, content: "words" },
      { op: "equal", a_start: 13, a_end: 14, b_start: 15, b_end: 16 }
    ]
  }

3. Compressed format

I don't see the value in compressing the format, given that the API doesn't really let you query for more than one diff at a time and diffs tend to be represented in few operations. However, we could simply represent each operation as a tuple with an agreed-upon field order:

  { op: "insert", a_start: 13, a_end: 13, b_start: 15, b_end: 18, content: "foo" }

could be

  [ "insert", 13, 13, 15, 18, "foo" ]

or, if we really want to get a tight format (since the rest of the fields are derivable in a sequence of operations):

  [ "insert", 15, "foo" ]
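To make the opcode format above concrete, here is a small Python sketch, offered only as an illustration. It assumes Python's standard difflib.SequenceMatcher as a stand-in for whatever diff engine the API would use, so the operation boundaries it picks may differ from the hand-written example (difflib matches individual characters rather than words). The function names char_opcodes, to_tuple and apply_ops are hypothetical. Offsets here are Python string indexes (Unicode code points), which is itself an assumption about what "character offset" should mean.

```python
from difflib import SequenceMatcher


def char_opcodes(a, b):
    """Return equal/remove/insert operations with character offsets into a and b."""
    ops = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag == "equal":
            ops.append({"op": "equal", "a_start": a_start, "a_end": a_end,
                        "b_start": b_start, "b_end": b_end})
            continue
        if a_end > a_start:  # difflib "delete" or the old half of "replace"
            ops.append({"op": "remove", "a_start": a_start, "a_end": a_end,
                        "b_start": b_start, "b_end": b_start,
                        "content": a[a_start:a_end]})
        if b_end > b_start:  # difflib "insert" or the new half of "replace"
            ops.append({"op": "insert", "a_start": a_end, "a_end": a_end,
                        "b_start": b_start, "b_end": b_end,
                        "content": b[b_start:b_end]})
    return ops


def to_tuple(op):
    """Compact form with a fixed field order; content is None for "equal"."""
    return [op["op"], op["a_start"], op["a_end"],
            op["b_start"], op["b_end"], op.get("content")]


def apply_ops(a, ops):
    """Rebuild the new text from the old text plus the operations."""
    out = []
    for op in ops:
        if op["op"] == "equal":
            out.append(a[op["a_start"]:op["a_end"]])
        elif op["op"] == "insert":
            out.append(op["content"])
        # "remove" contributes nothing to the new text
    return "".join(out)


a = "These are wrd."
b = "These are words."
ops = char_opcodes(a, b)
for op in ops:
    print(to_tuple(op))
print(apply_ops(a, ops) == b)
```

Because the operations cover the whole of both texts, a client can rebuild b from a and the operations alone, which is the property that distinguishes this format from a context-only, human-readable diff.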
(In reply to Aaron Halfaker from comment #3)
> 1. Character vs. line offset
> I'd much rather represent diffs based on a character offset. I'm afraid of
> representing position with something like lineno since linebreaks are
> differently defined between systems.

Isn't that an argument for line-based rather than character-based offsets?

> Character offsets would also allow us
> to make changes to our diff detection strategy without changing the output.
>
> 2. Machine readable vs. human readable diffs
> Machine readable diff opcode formats tend to represent the full set of
> operations used to recreate a revision -- not just the context.

OTOH, what is the usual use of querying the diffs? I suspect it's more often that the client wants to display a human-readable diff to the end user than that the client wants to do the equivalent of the 'patch' utility on an already-downloaded local copy of the article.

> and diffs tend to be represented in few operations.

On talk pages, maybe. But someone heavily copyediting an article is likely to generate a huge number of operations. With the way the diff algorithm works, even some simple edits will generate many operations as it tries to match up individual letters in the old vs. new paragraphs.
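The line-ending question raised at the top of this comment can be shown with a short Python sketch. It assumes the same text stored once with "\n" and once with "\r\n" line endings; the helper name locate is made up for this example, and it only handles a needle that sits within a single line.

```python
def locate(text, needle):
    """Return (character offset, 1-based line number) of the first match."""
    offset = text.index(needle)  # raises ValueError if needle is absent
    # str.splitlines() treats "\n" and "\r\n" alike, so line numbers are stable.
    for lineno, line in enumerate(text.splitlines(), start=1):
        if needle in line:
            return offset, lineno
    raise ValueError("needle spans more than one line")


unix_text = "first\nsecond\nthird"
dos_text = "first\r\nsecond\r\nthird"

print(locate(unix_text, "third"))  # (13, 3)
print(locate(dos_text, "third"))   # (15, 3)
```

The line number is the same in both copies while the character offset shifts by one for every preceding "\r", so character offsets are only comparable if both sides agree on how line endings are stored.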