Last modified: 2014-02-12 23:35:39 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T47161, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 45161 - Kill all non-trivial parsers in $.tablesorter
Kill all non-trivial parsers in $.tablesorter
Status: NEW
Product: MediaWiki
Classification: Unclassified
Interface (Other open bugs)
1.21.x
All All
: Normal normal with 1 vote (vote)
: ---
Assigned To: Nobody - You can work on this!
: javascript
Depends on:
Blocks: code_quality 31601
  Show dependency treegraph
 
Reported: 2013-02-19 18:29 UTC by Bartosz Dziewoński
Modified: 2014-02-12 23:35 UTC (History)
7 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Bartosz Dziewoński 2013-02-19 18:29:51 UTC
$.tablesorter does a lot of guesswork trying to infer what does the content in a table mean. It sometimes work, mostly for English; however, trying to (for example) guess if a column contains dates just plain doesn't work for some, if not most, languages.

Apart from numbers, $.tablesorter tries to guess dates and times in four distinct formats, currencies, IP addresses, and for some reasons URLs.

I'd kill everything but detecting and sorting numeric values, and even this I'd do differently (using "natural sorting" - identifying runs of digits and non-digits and sorting them separately). [http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html]

For sorting any remotely complicated data one should use a data-sort-value attribute on the <td> (which will be used as the sortkey for given cell), or a template that wraps it.

Bugs probably caused by or related to this misfeature (I didn't double-check): bug 15404, bug 31137, bug 34475, bug 39069, bug 42097, bug 42607, bug 42924, bug 43165.
Comment 1 Brad Jorsch 2013-02-20 17:11:08 UTC
That's not particularly user-friendly. If a column contains dates that can be parsed, we shouldn't be forcing users to add a data-sort-value for every cell to get it to sort right.

Maybe the guessing could be toned down or removed though. I haven't looked to see how well/poorly it works.
Comment 2 Bartosz Dziewoński 2013-02-20 18:10:44 UTC
Let's go through all the parsers, then.

* 'text', obviously. It also has some rudimentary collation capabilities (which have to be configured per-wiki), and while extremely limited this is actually useful and used e.g. on pl.wikipedia. This one is good.

* 'IPAddress'. It actually only supports IPv4 addresses. If natural sorting was implemented, it would be entirely 100% useless. Kill it.

* 'currency'. By $.tablesorter definitions this actually means pounds, dollars, euros and yens; all other currency signs are not detected and treated as 'text'. It would be similarly superseded by natural sorting. Kill it.

* 'url'. It just removes ftp:, file:, http: and https: protocols and then sorts as text. I have absolutely no idea why would anybody consider this useful. Kill it.

* 'isoDate'. That is YYYY-M(M)-D(D), where (X) means that this part is optional. Would be similarly superseded by natural sorting, kill it.

* 'usLongDate'. It uses a 114-character regex I'm not going to decipher right now, but the regex ends with (AM|PM). English-specific, completely useless for other languages, kill it.

* 'date'. It uses some generated regexes based on wgMonthNames and wgMonthNamesShort. Assumes that all languages use a XXX[,.'-/]XXX[,.'-/]YY(YY) date formats, where X can be either D or M depending on wgDefaultDateFormat and wgContentLanguage==en. I have no idea why exactly those separators were used. Probably doesn't work for most languages; I didn't test carefully, but bug 42607 is due to this malfunctioning. Kill it.

* 'time'. Matches a HH:MM time with optional AM/PM. Again, would be obsoleted by natural sorting, let's kill it.

* 'number'. Would be obsoleted by natural sorting, let's kill it.

Is that enough justification?
Comment 3 Brad Jorsch 2013-02-20 18:44:06 UTC
You seem to be making some major assumptions on how this natural sorting will work, particularly with respect to punctuation, that may or may not be warranted.

(In reply to comment #2)
> 
> * 'IPAddress'. It actually only supports IPv4 addresses. If natural sorting
> was
> implemented, it would be entirely 100% useless. Kill it.

Should be improved to support IPv6, which by no stretch will be handled correctly by "natural sorting".

I also note that "10.123.234.255" in locales that use '.' for grouping by thousands looks like the number 10123234225. While "255.0.0.1" does not look like any number, so it would probably be interpreted as 255 followed by some other junk, and be wrongly sorted before 10.123.234.255.

And in locales that use '.' for a decimal separator, is "10.23.45.6" the numbers "10.23" and "45.6" separated by a period, or is it four integers? This makes a difference if the table also contains "10.3.0.0".

> * 'currency'. By $.tablesorter definitions this actually means pounds,
> dollars,
> euros and yens; all other currency signs are not detected and treated as
> 'text'. It would be similarly superseded by natural sorting. Kill it.

If it's actually the case that your natural sorting handles it.

> * 'url'. It just removes ftp:, file:, http: and https: protocols and then
> sorts
> as text. I have absolutely no idea why would anybody consider this useful.
> Kill
> it.

That does seem like an odd choice. Perhaps the implementors should be asked for a reason behind it.

> * 'isoDate'. That is YYYY-M(M)-D(D), where (X) means that this part is
> optional. Would be similarly superseded by natural sorting, kill it.

It also handles YYYY/MM/DD.

> * 'usLongDate'. It uses a 114-character regex I'm not going to decipher right
> now, but the regex ends with (AM|PM). English-specific, completely useless
> for
> other languages, kill it.
>
> * 'date'. It uses some generated regexes based on wgMonthNames and
> wgMonthNamesShort. Assumes that all languages use a
> XXX[,.'-/]XXX[,.'-/]YY(YY)
> date formats, where X can be either D or M depending on wgDefaultDateFormat
> and
> wgContentLanguage==en. I have no idea why exactly those separators were used.
> Probably doesn't work for most languages; I didn't test carefully, but bug
> 42607 is due to this malfunctioning. Kill it.

*Lots* of assumptions there.

What we probably need are specific parsers for various date formats, and/or a way for a wiki to call $.tablesorter.addDateParser( 'name', 'format-string' ) to easily add additional date parsers for their language, and/or a general parser that reads the format string from a "data-date-format" property on the header cell.

> * 'time'. Matches a HH:MM time with optional AM/PM. Again, would be obsoleted
> by natural sorting, let's kill it.

In what world would your natural sorting not put 2AM after 1PM? Or is it smart enough to decide it's looking at a time, as well?

> * 'number'. Would be obsoleted by natural sorting, let's kill it.

Your natural sorting is smart enough to handle numbers like 6.022e23 correctly? Interesting.

> Is that enough justification?

Not really.

All in all, it seems that the existing parsers started out as relatively English-centric and were then adjusted to try to address other situations. But getting rid of all that to rely entirely on a poorly-specified "natural sort" algorithm doesn't seem like much of an improvement.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links