Last modified: 2014-09-10 11:53:18 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T72607, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70607 - replace.py does not recognize "\r\n" pattern
Status: RESOLVED WORKSFORME
Product: Pywikibot
Classification: Unclassified
Component: Other scripts
Version: core-(2.0)
Hardware: All All
Importance: Lowest enhancement
Target Milestone: ---
Assigned To: Pywikipedia bugs
Reported: 2014-09-09 14:42 UTC by JAn Dudík
Modified: 2014-09-10 11:53 UTC (History)

Description JAn Dudík 2014-09-09 14:42:59 UTC
In compat: 
replace.py -regex -nocase -file:aa.log  "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat"  -summary:"řádková verze {{Commonscat}}"

Getting 60 pages from wikipedia:cs...
...
No changes were necessary in [[Roman Polák (lední hokejista)]]


>>> Roman Polanski <<<
- {{Commonscat|Roman Polanski}}
+ * {{Commonscat|Roman Polanski}}


In core, the same command:
pwb.py replace -regex -nocase -file:aa.log  "==\s*Externí odkazy(.*?)\r\n\{\{Commonscat" "== Externí odkazy\1\n* {{Commonscat"  -summary:"řádková verze {{Commonscat}}"

Retrieving 50 pages from wikipedia:cs.
...
No changes were necessary in [[Roman Polanski]]
No changes were necessary in [[Roman Polák (lední hokejista)]]
No changes were necessary in [[Roman Romaněnko]]


Why?
Comment 1 JAn Dudík 2014-09-09 20:49:15 UTC
After some testing: core does not recognize \r\n, only \n.
Comment 2 Merlijn van Deen (test) 2014-09-09 21:20:15 UTC
There is a bug in Compat's PreloadingpageGenerator which makes it return page content (incorrectly) with '\r\n' instead of '\n'. compat's page.get() /does/ return '\n' by default.

I think using \n makes much more sense (and note that this works for both \n *and* \r\n due to python's universal newlines system), so I'm not even sure whether we should support \r\n at all.

Marking it as a low-priority feature request for now.
Comment 3 Fabian 2014-09-09 21:22:36 UTC
Is that a problem with the bot, then? Shouldn't it suffice to edit the regex? If you want to be sure, you could use (?:\r|\r\n|\n) instead of exactly \r\n.
Comment 4 Merlijn van Deen (test) 2014-09-09 21:40:13 UTC
Okay, I'm a bit confused about the newlines now, as

 re.match(r'\n', '\r\n')

does not work. However,

 python replace.py -lang:cs -regex -nocase -page:"Roman Polanski"  '==\s*Externí odkazy(.*?)\n\{\{Commonscat' '== Externí odkazy\1\n* {{Commonscat'  -summary:"řádkováverze {{Commonscat}}"

*did* work in compat (i.e. the variant without \r in it). I'm not sure why, though.
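
One plausible explanation, sketched here in Python with a made-up sample text (the real page content is not shown in this report): without the DOTALL flag, "." matches every character except \n, so the lazy group (.*?) can absorb the \r and the following \n in the pattern then matches. This is only an illustration of the regex behaviour, not a trace of what compat actually did.

```python
import re

# Hypothetical sample text with Windows-style line endings (an assumption;
# the actual wikitext of the page is not quoted in this report).
crlf_text = "== Externí odkazy ==\r\n{{Commonscat|Roman Polanski}}"

# re.match anchors at position 0, where the character is '\r', so this fails.
assert re.match(r"\n", "\r\n") is None

# The variant of the pattern without \r, as used in the command above.
pattern = re.compile(r"==\s*Externí odkazy(.*?)\n\{\{Commonscat", re.IGNORECASE)

match = pattern.search(crlf_text)
captured = match.group(1)  # the group swallows the trailing '\r'

# \1 re-inserts the captured text, including the '\r', before the new '\n'.
result = pattern.sub(r"== Externí odkazy\1\n* {{Commonscat", crlf_text)
```

In this sketch the \r ends up inside group 1 and is written back by \1, so the \r\n line ending survives the replacement.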
Comment 5 xqt 2014-09-10 11:53:18 UTC
compat retrieves page text via Special:Export, which delivers \r\n as the line ending, whereas core always gets \n. See also the config.line_separator variable.

You may use \r?\n in the regex so that it works in both framework branches.
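
A minimal sketch of that suggestion in plain Python, outside of Pywikibot. The helper name fix_commonscat and both sample texts are hypothetical; only the pattern and the replacement string come from the report, with \r\n relaxed to \r?\n.

```python
import re

# Pattern from the report with the line ending relaxed to \r?\n.
PATTERN = re.compile(r"==\s*Externí odkazy(.*?)\r?\n\{\{Commonscat", re.IGNORECASE)
REPLACEMENT = r"== Externí odkazy\1\n* {{Commonscat"


def fix_commonscat(text):
    """Turn a bare {{Commonscat}} after the heading into a list item."""
    return PATTERN.sub(REPLACEMENT, text)


# Hypothetical page texts: one as core would see it, one as compat would.
lf_text = "== Externí odkazy ==\n{{Commonscat|Roman Polanski}}"
crlf_text = "== Externí odkazy ==\r\n{{Commonscat|Roman Polanski}}"

lf_result = fix_commonscat(lf_text)
crlf_result = fix_commonscat(crlf_text)
```

Because the replacement writes a plain \n, the matched \r\n in the second sample is replaced by \n, so both inputs give the same output in this sketch.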
