Last modified: 2014-09-05 12:27:27 UTC
The LicenseUrl value has a trailing '\n' character, making it an invalid URL. Example: https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=timestamp|user|userid|comment|parsedcomment|url|size|dimensions|sha1|mime|thumbmime|mediatype|metadata|extmetadata|archivename|bitdepth|uploadwarning&iilimit=10&titles=File%3ACar_Park_entrance_at_Ealing_Magistrates_Court_-_geograph.org.uk_-_1729971.jpg
Looking around further, '\n' is added to several values. See <https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&prop=imageinfo&format=json&iiprop=extmetadata&iilimit=10&titles=File%3ACompans%20lake%20-%20Anas%20platyrhynchos%2007.JPG>:
"Credit": { "value": "\nSelf-photographed", "source": "commons-desc-page", "hidden": "" },
"LicenseUrl": { "value": "http://creativecommons.org/licenses/by-sa/3.0\n", "source": "commons-desc-page", "hidden": "" },
"LicenseShortName": { "value": "CC-BY-SA-3.0\n", "source": "commons-desc-page", "hidden": "" },
"UsageTerms": { "value": "Creative Commons Attribution-Share Alike 3.0\n", "source": "commons-desc-page", "hidden": "" },
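For anyone wanting to check other files, the affected fields can be spotted with a small client-side check. This is only a sketch: the helper name `untrimmed_fields` is made up for illustration, and the sample dictionary simply repeats the values quoted above rather than calling the API.

```python
def untrimmed_fields(extmetadata):
    """Return the names of extmetadata fields whose string value has
    leading or trailing whitespace (hypothetical helper, not part of any API)."""
    return sorted(
        name
        for name, entry in extmetadata.items()
        if isinstance(entry.get("value"), str)
        and entry["value"] != entry["value"].strip()
    )


# Sample data copied from the response quoted in this report.
sample = {
    "Credit": {"value": "\nSelf-photographed", "source": "commons-desc-page", "hidden": ""},
    "LicenseUrl": {"value": "http://creativecommons.org/licenses/by-sa/3.0\n", "source": "commons-desc-page", "hidden": ""},
    "LicenseShortName": {"value": "CC-BY-SA-3.0\n", "source": "commons-desc-page", "hidden": ""},
    "UsageTerms": {"value": "Creative Commons Attribution-Share Alike 3.0\n", "source": "commons-desc-page", "hidden": ""},
}

print(untrimmed_fields(sample))
```

All four sample fields are reported, since each carries either a leading or a trailing newline.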
Change 97743 had a related patch set uploaded by Gergő Tisza: Trim HTML-based metadata values https://gerrit.wikimedia.org/r/97743
Change 97743 abandoned by Gergő Tisza: Trim HTML-based metadata values Reason: Abandoning this change since InformationParser has been completely rewritten in the meantime. https://gerrit.wikimedia.org/r/97743
Change 120948 had a related patch set uploaded by Gergő Tisza: Clean parsed HTML https://gerrit.wikimedia.org/r/120948
Change 120948 merged by jenkins-bot: Clean parsed HTML https://gerrit.wikimedia.org/r/120948
This issue is occurring again. See e.g. https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=commonmetadata|extmetadata&iilimit=1&titles=File%3ALandsort%20Lighthouse%20August%202013%2009.jpg where "LicenseShortName": { "value": "CC-BY-SA-3.0\n", "source": "commons-desc-page", "hidden": "" }, "UsageTerms": { "value": "Creative Commons Attribution-Share Alike 3.0\n", "source": "commons-desc-page", "hidden": "" }, "LicenseUrl": { "value": "http://creativecommons.org/licenses/by-sa/3.0\n", "source": "commons-desc-page", "hidden": "" },
Looking at the HTML source of the example above [1], there is no trace of these newline characters. So this might not be a cleaning/trimming issue in the TemplateParser; could the newlines instead be inserted by it?
[1] https://commons.wikimedia.org/wiki/File:Landsort_Lighthouse_August_2013_09.jpg
*** Bug 69497 has been marked as a duplicate of this bug. ***
As stated in bug 69497, these newlines are in the license template, and the code doing the HTML scraping there had better remove them.
The code that removes them is in https://gerrit.wikimedia.org/r/#/c/120948/1/TemplateParser.php which at a glance seems correct to me. Also, Lokal_Profil is right that the newline is not always present in the HTML code. I'll test locally with the examples mentioned here.
This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim if there are no leading blanks (or no trailing blanks). And watch out for the greedy (.*), that also looks wrong.
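The failure modes described above can be reproduced with a short sketch. It uses Python's `re` module rather than the PHP in TemplateParser.php, and it assumes the pattern was applied as a substitution that keeps only the captured group; both function names are invented here, and `fixed_trim` is just one possible correction, not necessarily what the eventual patch does.

```python
import re


def buggy_trim(value):
    # Same pattern as quoted above: requires whitespace on BOTH ends to match.
    return re.sub(r'^\s+(.*)\s+$', r'\1', value)


def fixed_trim(value):
    # One possible fix: strip leading and trailing whitespace independently.
    return re.sub(r'\A\s+|\s+\Z', '', value)


# Trailing newline only: the pattern does not match, so nothing is trimmed.
print(repr(buggy_trim("CC-BY-SA-3.0\n")))       # 'CC-BY-SA-3.0\n'

# Leading newline only: again no match, value is returned unchanged.
print(repr(buggy_trim("\nSelf-photographed")))  # '\nSelf-photographed'

# Whitespace on both ends: greedy (.*) swallows all but one trailing blank.
print(repr(buggy_trim("  a b  ")))              # 'a b '

# The alternation-based version handles all three cases.
print(repr(fixed_trim("CC-BY-SA-3.0\n")))       # 'CC-BY-SA-3.0'
print(repr(fixed_trim("\nSelf-photographed")))  # 'Self-photographed'
print(repr(fixed_trim("  a b  ")))              # 'a b'
```

The first case matches the values seen in the API output: with no leading whitespace, the trailing newline survives untouched.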
(In reply to Tisza Gergő from comment #10)
> Also, Lokal_Profil is right that the newline is
> not always present in the HTML code. I'll test locally with the examples
> mentioned here.
Not correct. See https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=extmetadata&format=jsonfm&titles=File:Landsort_Lighthouse_August_2013_09.jpg
Returns the same trailing newlines for UsageTerms and LicenseUrl.
Change 155901 had a related patch set uploaded by TheDJ: TemplateParser: Fix whitespace trim https://gerrit.wikimedia.org/r/155901
Change 155901 merged by jenkins-bot: TemplateParser: Fix whitespace trim https://gerrit.wikimedia.org/r/155901
(In reply to Lupo from comment #11)
> This code does _not_ look good. '/^\s+(.*)\s+$/' is wrong. It fails to trim
> if there are no leading blanks (or no trailing blanks). And watch out for
> the greedy (.*), that also looks wrong.
D'oh, that was stupid. Thanks for fixing, Lupo & TheDJ!
*** Bug 66652 has been marked as a duplicate of this bug. ***