Last modified: 2012-10-23 17:16:26 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in and, except for displaying bug reports and their history, links might be broken. See T41756, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 39756 - CSV log processing garbled on page names with '@'


Summary:	CSV log processing garbled on page names with '@'

Status:	RESOLVED INVALID

Product:	MediaWiki extensions
Classification:	Unclassified
Component:	E3 Experiments (Other open bugs)
Version:	master
Hardware:	All All

Importance:	Unprioritized normal (vote)
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-08-29 03:37 UTC by spage
Modified:	2012-10-23 17:16 UTC (History)
CC List:	5 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description spage 2012-08-29 03:37:54 UTC

While processing the output of (I think) ct2csv.py, I found that some of the CSV lines have an excess field. In all cases (see Google Doc) some script split the Page field in two.

For example, row 29063:

enwiki,editEvent@1,2012-08-27 11:39:57,1,8EQA1K5WdqfHMcH8TtsIoLiSDt5Bam2L4,0,27881,629,341,151,wpPreview,update,false,Einstein,Home,505769298

That page name was originally "Einstein@Home". The same happens with "News @ 1", "Folding@Home", etc.

I think the fix is to implement backslash escaping, and test it end-to-end.  The original page name should have been encoded as
  this@that@Einstein\@Home@other
and then subsequent processing should only split on '@', not '\@'.  Zero-width negative look-behind FTW: conceptually you're splitting on "'@' not preceded by a backslash", which disregarding escaping is (?<!\)@; in Python perhaps it's re.split(r'(?<!\\)' + re.escape(sep), line), since str.split() doesn't take a regex.
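
A minimal sketch of the escaping round trip described above, assuming '@' is the field separator of the intermediate format. The helper names escape_field and split_record are hypothetical and are not taken from ct2csv.py or any other E3 script.

```python
import re

SEP = "@"  # field separator assumed from the example in this report


def escape_field(value, sep=SEP):
    """Backslash-escape every separator inside a single field value."""
    return value.replace(sep, "\\" + sep)


def split_record(record, sep=SEP):
    """Split on separators not preceded by a backslash, then unescape."""
    # Negative look-behind: match sep only when the previous char is not '\'.
    pattern = r"(?<!\\)" + re.escape(sep)
    return [part.replace("\\" + sep, sep) for part in re.split(pattern, record)]


fields = ["this", "that", "Einstein@Home", "other"]
record = SEP.join(escape_field(f) for f in fields)  # this@that@Einstein\@Home@other
decoded = split_record(record)  # yields the original four fields again
```

This sketch does not escape the backslash itself, so a field that ends in a literal backslash would still be split incorrectly. A complete scheme would probably also need to escape '\' as '\\' and scan the record character by character instead of relying on a single look-behind.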

You could instead convert @ to some crazy escape using some encoding system, like &#64;, but that just introduces more complexity.

Comment 1 Ori Livneh 2012-10-23 17:16:26 UTC

We're no longer using this homebrew encoding format, fortunately.
