Last modified: 2014-09-10 11:53:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72607, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 70607 - replace.py does not recognize "\r\n" pattern


Summary:	replace.py does not recognize "\r\n" pattern

Status:	RESOLVED WORKSFORME

Product:	Pywikibot
Classification:	Unclassified
Component:	Other scripts (Other open bugs)
Version:	core-(2.0)
Hardware:	All All

Importance:	Lowest enhancement
Target Milestone:	---
Assigned To:	Pywikipedia bugs

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-09-09 14:42 UTC by JAn Dudík
Modified:	2014-09-10 11:53 UTC (History)
CC List:	3 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description JAn Dudík 2014-09-09 14:42:59 UTC

In compat: 
replace.py -regex -nocase -file:aa.log  "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat"  -summary:"řádková verze {{Commonscat}}"

Getting 60 pages from wikipedia:cs...
...
No changes were necessary in [[Roman Polák (lední hokejista)]]


>>> Roman Polanski <<<
- {{Commonscat|Roman Polanski}}
+ * {{Commonscat|Roman Polanski}}


In core, the same command:
pwb.py replace -regex -nocase -file:aa.log  "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat"  -summary:"řádková verze {{Commonscat}}"

Retrieving 50 pages from wikipedia:cs.
...
No changes were necessary in [[Roman Polanski]]
No changes were necessary in [[Roman Polák (lední hokejista)]]
No changes were necessary in [[Roman Romaněnko]]


Why?

Comment 1 JAn Dudík 2014-09-09 20:49:15 UTC

After some testing: core does not recognize \r\n, only \n.

Comment 2 Merlijn van Deen (test) 2014-09-09 21:20:15 UTC

There is a bug in Compat's PreloadingpageGenerator which makes it return page content (incorrectly) with '\r\n' instead of '\n'. compat's page.get() /does/ return '\n' by default.

I think using \n makes much more sense (and note that this works for both \n *and* \r\n due to python's universal newlines system), so I'm not even sure whether we should support \r\n at all.

Marking it as low-priority feature request for now.

Comment 3 Fabian 2014-09-09 21:22:36 UTC

Is that a problem of the bot then? Shouldn't it suffice to edit the regex? If you want to be sure, you could use (?:\r|\r\n|\n) instead of exactly \r\n.
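A minimal sketch of how the alternation suggested in this comment behaves, using only Python's `re` module. The variable names are illustrative. One caveat: alternatives are tried left to right, so when nothing follows the group, `\r` wins over `\r\n` on Windows-style text. Inside a longer pattern the engine backtracks and still finds a match, but listing `\r\n` first avoids the surprise.

```python
import re

# Alternation as written in the comment: '\r' is tried before '\r\n'.
as_written = re.compile(r"(?:\r|\r\n|\n)")
# Same alternatives, longest first.
longest_first = re.compile(r"(?:\r\n|\r|\n)")

# On its own, the first variant consumes only the '\r' of a '\r\n' pair.
short_match = as_written.match("\r\n").group()
full_match = longest_first.match("\r\n").group()

# When more pattern follows, backtracking makes both variants succeed.
text = "odkazy ==\r\n{{Commonscat"
found_as_written = re.search(r"==(?:\r|\r\n|\n)\{\{", text) is not None
found_longest_first = re.search(r"==(?:\r\n|\r|\n)\{\{", text) is not None

print(repr(short_match), repr(full_match), found_as_written, found_longest_first)
```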

Comment 4 Merlijn van Deen (test) 2014-09-09 21:40:13 UTC

Okay, I'm a bit confused about the newlines now, as

 re.match(r'\n', '\r\n')

does not work. However,

 python replace.py -lang:cs -regex -nocase -page:"Roman Polanski"  '==\s*Externí odkazy(.*?)\n\{\{Commonscat' '== Externí odkazy\1\n* {{Commonscat'  -summary:"řádková verze {{Commonscat}}"

*did* work in compat (i.e. the variant without \r in it). I'm not sure why, though.
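A minimal sketch of the two observations in this comment, using only Python's `re` module. The sample page text is made up for illustration and is not the real wiki page, and `re.IGNORECASE` stands in for `-nocase` as an assumption about how the script compiles the pattern. It also shows one possible explanation, not confirmed in this thread, for why the `\n`-only pattern still matched text containing `\r\n`: `.` matches every character except `\n`, so the lazy `(.*?)` group can absorb the `\r`.

```python
import re

# Hypothetical page text with '\r\n' line endings (not the real page).
text_crlf = "== Externí odkazy ==\r\n{{Commonscat|Roman Polanski}}\r\n"

# re.match anchors at position 0, where the character is '\r', not '\n'.
plain_newline_match = re.match(r"\n", "\r\n")

# The pattern from this comment, without '\r'.
pattern = re.compile(r"==\s*Externí odkazy(.*?)\n\{\{Commonscat", re.IGNORECASE)
match = pattern.search(text_crlf)

# The group expands lazily until '\n' can match, so it ends with '\r'.
captured = match.group(1)
print(plain_newline_match, repr(captured))
```

If this is what happened, the replacement `\1` would carry the `\r` along, so the line ending in the output would be unchanged.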

Comment 5 xqt 2014-09-10 11:53:18 UTC

compat retrieves \r\n as the line separator via special export, whereas core always gets \n. See also the config.line_separator variable.

You may use \r?\n for the regex for both framework branches.
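A minimal sketch of that suggestion, using only Python's `re` module rather than replace.py itself. The helper name `fix_commonscat` and the sample text are hypothetical, and `re.IGNORECASE` stands in for `-nocase` as an assumption about how the script compiles the pattern.

```python
import re

# Pattern from the report, with '\r?\n' in place of '\r\n'.
PATTERN = re.compile(r"==\s*Externí odkazy(.*?)\r?\n\{\{Commonscat", re.IGNORECASE)
REPLACEMENT = r"== Externí odkazy\1\n* {{Commonscat"


def fix_commonscat(text):
    """Turn a bare {{Commonscat}} line after the heading into a list item."""
    return PATTERN.sub(REPLACEMENT, text)


# Hypothetical page text in both line-ending styles.
lf = "== Externí odkazy ==\n{{Commonscat|Roman Polanski}}\n"
crlf = lf.replace("\n", "\r\n")

fixed_lf = fix_commonscat(lf)
fixed_crlf = fix_commonscat(crlf)
print(repr(fixed_lf))
print(repr(fixed_crlf))
```

Note that the replacement writes a plain `\n` after the heading, so `\r\n` input ends up with one `\n` line ending at the edited position while the rest of the text keeps `\r\n`.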
