Last modified: 2013-12-20 21:40:30 UTC
Since 2013-11-13 we have been seeing lines like cp1046.eqiad.wmnet 1008323724 201cp3011.esams.wikimedia.org 6978832710 2013-11-15T19:29:25 [...] in the mobile and zero logs. There, part of a log line (part past 201 in the third column) got overwritten by a new line (cp3011.esams[...]). All 16 fields of this new line gets written, followed by a line feed. The next line is a normal log line again. For each file we had between 5 to 15 such lines. We did not see such lines for the sampled-1000 logs.
Checking for mangled lines which field got overwritten showed that the probability of a field to get overwritten directly corresponded to the field length. ~41% of the collisions happened for minutes 0, and 59. ~54% of the collisions happened for minutes 29, and 30. So ~95% of the collisions happened at the times puppet is run.
It seems puppet injected a new udp2log filter configuration file and restarted udp2log on oxygen (mobile and zero logs get written there) upon some puppet runs. The culprit seems to be serializing the hash of mobile hosts into the filter configuration without sorting them. So upon each puppet run, the serialized string of mobile hosts stands a chance of producing the hosts in a different order. Since, commit 55ee2641ad2d74adf8b9d8353f63b87f6e007348 of the puppet repo, the list gets sorted before serialization. Hence, the serialized string should not change between puppet runs, if the mobile hosts do not change. After merging the above change, we do not see mangled log lines any longer. Last mangled log lines are for 2013-12-16 17:30. All other udp2log generated log files generated on oxygen have been affected as well (e.g.: edit logs). But the problem is now fixed for them as well.