Last modified: 2012-10-23 17:16:26 UTC
While processing the output of (I think) ct2csv.py, I found that some of the CSV lines have an excess field. In all cases (see Google Doc) some script split the Page field into two. For example, row 29063:

enwiki,editEvent@1,2012-08-27 11:39:57,1,8EQA1K5WdqfHMcH8TtsIoLiSDt5Bam2L4,0,27881,629,341,151,wpPreview,update,false,Einstein,Home,505769298

That page name was originally "Einstein@Home". Likewise "News @ 1", "Folding@Home", etc.

I think the fix is to implement backslash escaping and test it end-to-end. The original page name should have been encoded as

this@that@Einstein\@Home@other

and then subsequent processing should split only on '@', not on '\@'.

Zero-width negative look-behind FTW: conceptually you're splitting on "'@' not preceded by a backslash", which as a regex is (?<!\\)@ (the backslash itself has to be escaped). In Python, str.split() does not take a regex, so it's perhaps re.split(r'(?<!\\)' + re.escape(sep), line).

You could instead convert @ to some crazy escape using some encoding system, like &#64;, but that just introduces more complexity.
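
A minimal sketch of the escaping round trip suggested above, in Python. The function names (escape_field, join_fields, split_fields) are hypothetical and do not come from ct2csv.py or any existing script; the only assumption about the format is that '@' is the field separator.

```python
import re

SEP = "@"  # assumed field separator of the homebrew format


def escape_field(value, sep=SEP):
    # Prefix every separator inside a field with a backslash.
    return value.replace(sep, "\\" + sep)


def join_fields(fields, sep=SEP):
    # Encode a record: escape each field, then join with the separator.
    return sep.join(escape_field(f, sep) for f in fields)


def split_fields(line, sep=SEP):
    # Split only on separators NOT preceded by a backslash
    # (zero-width negative look-behind), then undo the escaping.
    parts = re.split(r"(?<!\\)" + re.escape(sep), line)
    return [p.replace("\\" + sep, sep) for p in parts]


encoded = join_fields(["this", "that", "Einstein@Home", "other"])
decoded = split_fields(encoded)
```

Here encoded is this@that@Einstein\@Home@other, and decoded gives back the four original fields. The sketch does not escape the backslash itself, so a field that ends in a literal backslash would still be misread; a complete scheme would need to escape backslashes too.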
We're no longer using this homebrew encoding format, fortunately.