Last modified: 2014-11-19 10:22:05 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. Logging in is not possible, and apart from the bug reports and their history, links may be broken. See T49733, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 47733 - Word count is wrong, does not recognize non-ASCII characters
Status: PATCH_TO_REVIEW
Product: MediaWiki extensions
Classification: Unclassified
Component: ArticleFeedbackv5
Version: master
Hardware: All All
Importance: Lowest minor
Target Milestone: ---
Assigned To: Matthias Mullie
Depends on:
Blocks:
 
Reported: 2013-04-26 15:34 UTC by TMg
Modified: 2014-11-19 10:22 UTC


Description TMg 2013-04-26 15:34:32 UTC
The following example is reported as containing 42 words, but I count 40.

http://de.wikipedia.org/wiki/Spezial:Artikelr%C3%BCckmeldungen_v5/Yellowstone-Nationalpark/04f917900607eb1692a1842b2b77d79c

I think the current count searches for words made only of the letters a to z. Because of this, a German word like "schönen" is counted as two words.

The best solution would be to use \p{L} instead of \w or [a-z] in the regular expression. Please note that this does not work in JavaScript.

http://www.regular-expressions.info/unicode.html
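
To make the difference concrete, here is a minimal sketch in Python (not the extension's PHP code). Python's standard re module has no \p{L}, so the class [^\W\d_] is used as an approximate stand-in for "any Unicode letter". The sample text and the function name are made up for illustration.

```python
import re

# ASCII-only letters, as suspected in the report
ASCII_WORD = re.compile(r"[A-Za-z]+")
# Word characters minus digits and underscore: an approximation of \p{L}
UNICODE_WORD = re.compile(r"[^\W\d_]+")


def count_words(pattern, text):
    """Count the non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


sample = "einen schönen Gruß"  # hypothetical sample text
ascii_count = count_words(ASCII_WORD, sample)      # "schönen" splits at "ö"
unicode_count = count_words(UNICODE_WORD, sample)  # one match per word
print(ascii_count, unicode_count)
```

With the ASCII pattern, "schönen" is matched as "sch" and "nen", which is how a single word ends up counted twice.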
Comment 1 Matthias Mullie 2013-04-30 12:06:10 UTC
ß & ö are indeed the culprits.
PHP's native str_word_count is used, which isn't mb-safe.
However, a regex that matches letters (including those with diacritics) is not ideal either: it would count words like "you're" and hyphenated words as multiple words, and other languages quite possibly have similar combinations with other characters. That would just replace one bad solution with another sub-optimal one.
Perhaps we should split on whitespace, discard all tokens that contain no letters, and count what remains?
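
A rough sketch of that idea, written in Python purely for illustration (the extension itself is PHP, and the function name and sample sentence here are assumptions):

```python
def count_tokens_with_letters(text):
    """Split on whitespace and count the tokens that contain at least one letter."""
    # str.split() with no argument splits on any run of whitespace
    tokens = text.split()
    # str.isalpha() is Unicode-aware, so "ö" and "ß" count as letters
    return sum(1 for token in tokens if any(ch.isalpha() for ch in token))


sample = "you're well-known in Köln - 42 !"  # hypothetical sample text
token_count = count_tokens_with_letters(sample)
print(token_count)
```

Here "you're" and "well-known" each stay a single token, while "-", "42" and "!" are dropped because they contain no letters.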

Besides, the character length is wrong too, but switching strlen for mb_strlen should do the trick.
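
The byte-versus-character difference can be sketched as follows. This is Python, not the extension's PHP: encoding to UTF-8 and taking the length stands in for a byte-oriented strlen, and len() on the string stands in for mb_strlen. The sample text is an assumption.

```python
text = "schönen Gruß"  # hypothetical sample text

# Number of bytes in the UTF-8 encoding ("ö" and "ß" take two bytes each)
byte_length = len(text.encode("utf-8"))
# Number of characters (Unicode code points)
char_length = len(text)

print(byte_length, char_length)
```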
Comment 2 TMg 2013-05-02 13:08:58 UTC
Splitting on whitespace is not good because some users write like this,without spaces.The count will be wrong.

I don't think the word count is essential information. Pretty much every solution will be wrong for some language, so don't put time into this. I suggest choosing one of these very simple solutions:

a) Remove the word count. Stick with the characters (but switch to mb_strlen, of course).

b) Don't change the code but change the message to "Approx. 42 words". Maybe add a max(min($count, 10), round($count / 10) * 10) function and make it "Approx. 40 words".
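
For illustration, the rounding expression from option b) could behave like the following Python sketch. The function name is made up. Integer arithmetic is used instead of Python's round(), because Python rounds halves to the nearest even number, whereas the expression above presumably relies on PHP's round(), which rounds halves away from zero.

```python
def approximate_count(count):
    """Round a non-negative count to the nearest ten, keeping small counts usable."""
    # Nearest multiple of ten, with halves rounded up (45 -> 50)
    rounded = (count + 5) // 10 * 10
    # For counts below ten, min() passes the exact value through,
    # so the result never drops to zero for a non-empty text
    return max(min(count, 10), rounded)


for n in (3, 7, 42, 45):
    print(n, approximate_count(n))
```

Counts below 5 are shown exactly, 5 to 9 become 10, and everything else is rounded to the nearest ten, so 42 becomes 40.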
Comment 3 Gerrit Notification Bot 2013-05-06 13:55:49 UTC
Related URL: https://gerrit.wikimedia.org/r/62435 (Gerrit Change I84a72f3894fb19d2834719ebf253e33f2d436d8e)
Comment 4 Matthias Mullie 2013-05-06 14:00:55 UTC
I've removed it completely. I first decided to go with b), but even an approximate value makes less and less sense the more a language relies on multibyte characters (e.g. Chinese). Since it's a useless metric, I think it's best to remove it.
Comment 5 Gerrit Notification Bot 2014-03-31 09:51:01 UTC
Change 62435 abandoned by Matthias Mullie:
(bug 47733) Word count is wrong, does not recognize non-ASCII characters

Reason:
AFT is unmaintained, these patches are not going to get reviewed

https://gerrit.wikimedia.org/r/62435
