Last modified: 2012-12-13 11:16:56 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T38439, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 36439 - Setting labels should normalize some things, API should return the actual label on success


Summary:	Setting labels should normalize some things, API should return the actual lab...

Status:	VERIFIED FIXED

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	WikidataRepo (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Highest normal (vote)
Target Milestone:	---
Assigned To:	Wikidata bugs

URL:
Whiteboard:	storypoints: 5
Keywords:

Depends on:
Blocks:

Reported:	2012-05-02 13:24 UTC by Daniel A. R. Werner
Modified:	2012-12-13 11:16 UTC (History)
CC List:	4 users (show)

See Also:	36432
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Daniel A. R. Werner 2012-05-02 13:24:15 UTC

When setting the label of an item (via API), some normalization should be done. The case I am thinking about right now is about having several spaces within the label like "My    Item" where the four spaces should be replaced with one. This would be consistent with MediaWiki page titles where the same thing is done.

Also, the API should return the label when setting it, so we can grab it and display it to the user in the ui accordingly.

Comment 1 jeblad 2012-05-04 07:17:43 UTC

See also "Bug 36432 - Normalize titles and namespaces". Whitespace (and also the underscore) is stripped from the start and end of the text, in some places it is also stripped within the text, some upper-/lowercasing is done, and so forth.

It is not clear where ordinary normalization should be done, that is, in the API or in the WikibaseItem.

If the strings somehow change before, during or after storing, both the pre- and post-normalized forms should be reported, so the UI can adjust itself accordingly.

Comment 2 jeblad 2012-05-10 11:40:25 UTC

The API does report the new values "as is" after they are set, in the same style as the rest of the API, but this is somewhat cumbersome to unwind later. The values are reported in a "normalized" structure with "from" and "to" if they differ, but this can later lead to an inconsistency if several language attributes are set at the same time.

A better solution would be to unconditionally report back the structure as it actually is after the changes.

Comment 3 denny vrandecic 2012-06-20 22:02:14 UTC

This actually does work for the labels if I am not mistaken -- but it does not seem to work for descriptions and aliases.

Comment 4 jeblad 2012-06-20 22:39:09 UTC

There is a very rudimentary mechanism in place for labels. I propose we do something similar to titles for the labels and aliases, but I am less sure how harshly we should normalize the description. I'm tempted to do something similar to the summary, that is, allow links but disallow templates.

Comment 5 denny vrandecic 2012-06-21 12:52:24 UTC

The following normalization should be done for Labels, Descriptions, and Aliases:
* Unicode normalization of the labels to be done on the Repo.
* Trimming
* Internal whitespace compression

The UI should display the returned normalized value.

Comment 6 denny vrandecic 2012-06-28 11:35:46 UTC

Picked up for Sprint 8.

Comment 7 jeblad 2012-06-29 10:06:52 UTC

Note
* the vast majority of input data is already in form C, using precomposed
  characters
* Form C is supposed to be relatively lossless, with the only changes being
  invisible transformations between base character + combining character
  sequences and precomposed chars. In theory text should never change
  appearance because it's been normalized to form C.
* and further, the W3C recommends it

http://www.mediawiki.org/wiki/Unicode_normalization_considerations#What_is_it.3F

This means that an accented character works if it can be normalized into a precomposed character. For example, O₂ and O² work because they can be normalized into precomposed characters. The code point U+030A COMBINING RING ABOVE preceded by an a might be interpreted as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE, but it can also be interpreted as an a followed by a small ring. The same thing happens with a lot of accented letters.

There is also the problem of similar-looking characters, as the following shows:

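package main

import "fmt"

func main() {
    // U+212B ANGSTROM SIGN, encoded as UTF-8.
    a1 := string([]byte{0xe2, 0x84, 0xab})
    // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, encoded as UTF-8.
    a2 := string([]byte{0xc3, 0x85})
    // The two strings look the same but compare unequal byte for byte.
    fmt.Println(a1, a2, a1 == a2)
}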

Prints:

Å Å false

One character is the Angstrom sign while the other is an A with a ring above, which is the usual character in Danish and Norwegian.

For now the aliases, labels and descriptions will be normalized to form C; the text will then be trimmed of leading and trailing whitespace, and internal whitespace will be compressed. Whitespace will only be handled for a limited set of whitespace characters.

Comment 8 denny vrandecic 2012-07-02 12:38:16 UTC

Thanks for the write up!

What would

 toNFC(a1) == toNFC(a2)

return?

Comment 9 jeblad 2012-07-04 10:41:59 UTC

Accidentally done in https://gerrit.wikimedia.org/r/#/c/14032/

Comment 10 jeblad 2012-07-04 11:19:54 UTC

Normalization of aliases is done in https://gerrit.wikimedia.org/r/#/c/13492/

Comment 11 jeblad 2012-07-04 17:30:51 UTC

Some results from normalization:

Source   - encoded          - normalized    - comment
Åland    - %C3%85land       - %C3%85land    - code point for the character
Åland    - A%CC%8Aland      - %C3%85land    - combining ring above
Ångstrom - %E2%84%ABngstrom - %C3%85ngstrom - the initial letter is the code point for a unit

So it seems that our current normalization (form C) rewrites a capital letter A followed by a combining ring above into a single valid code point.

"Characters are decomposed and then recomposed by canonical equivalence."

It seems it will only fail in cases with multiple combining characters, but I'm not sure whether that will ever happen.

In my opinion, this works now, case closed.

See also http://en.wikipedia.org/wiki/Unicode_normalization#Normalization

Comment 12 jeblad 2012-07-04 17:55:42 UTC

Just for the record, conversion of the initial letter in Ångstrøm into a normal codepoint for Å seems a little bit weird.

Comment 13 Anja Jentzsch 2012-11-29 12:37:21 UTC

Verified in Wikidata demo time for sprint 8
