Last modified: 2014-02-14 12:48:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T57709, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 55709 - texvc failure due to missing MW cgroup
Status: NEW
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All
OS: All
Importance: Normal major with 2 votes
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops
Duplicates: 54367
Depends on:
Blocks:
Reported: 2013-10-14 19:25 UTC by Umherirrender
Modified: 2014-02-14 12:48 UTC
CC List: 9 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Umherirrender 2013-10-14 19:25:22 UTC
More and more pages on dewiki show the error "Fehler beim Parsen(Unbekannter Fehler)" (English: "Failed to parse(unknown error)").

After a purge, edit, or null edit, the error disappears and a PNG is shown, but it is also possible that after a purge the PNG disappears and the error is shown.

MathJax works; only the PNG option is affected.

Google has also indexed some of those pages:
https://www.google.de/#q=%22Fehler+beim+Parsen%22+site:de.wikipedia.org
https://www.google.de/#q=%22Failed+to+parse(unknown+error)%22+site:en.wikipedia.org

Please have a look. Thanks.
Comment 1 MZMcBride 2013-10-14 21:47:17 UTC
Probably an ops or shell issue?
Comment 2 Faidon Liambotis 2013-10-14 21:51:41 UTC
It's not a thumb (or ops/shell) issue -- the output that I see is e.g.
<dl>
<dd><strong class='error'>Fehler beim Parsen(Unbekannter Fehler): \sigma\frown\psi:=(-1)^{pq}\psi(\sigma\circ\iota_{0\ldots q})\sigma\circ\iota_{q\ldots p}</strong></dd>
</dl>
so clearly it never gets to this point.
Comment 3 Faidon Liambotis 2013-10-15 07:33:36 UTC
I quickly retracted that comment of mine on IRC last (European) night when I saw the code :) A little more investigation happened at #wikimedia-tech. From what I can see in the SAL after I left, Tim found it was a missing cgroups issue on mw1035, mw1145, mw1078, mw1152 and mw1150, which means the "texvc" invocation failed.

Tim apparently fixed the situation manually, so the effects of this bug should no longer be occurring, but we need a proper fix in place so this won't happen again. The cgroup issue is recurring (https://gerrit.wikimedia.org/r/#/c/83067/ was the latest attempt to fix it) and we have yet to find an optimal solution. On the plus side, there were a couple of additions on the logging side because of this...
Comment 4 Andre Klapper 2013-10-15 12:42:27 UTC
Thanks Faidon!


For the record: the topic was also brought up on https://de.wikipedia.org/w/index.php?title=Portal_Diskussion:Mathematik&oldid=123471154#r.C3.A4tselhafte_tex-fehler
Comment 5 Tim Starling 2013-10-18 02:48:57 UTC
(In reply to comment #3)
> I quickly retracted that comment of mine on IRC last (European) night when I
> saw the code :)  I little more investigation happened at #wikimedia-tech --
> from what I can see on the SAL after I left, Tim found it was a missing
> cgroups issue on mw1035, mw1145, mw1078, mw1152, mw1150, which means the
> "texvc" invocation failed.

Actually, pretty much all of the apaches had the issue. Those 5 apaches didn't even have the cgroup filesystem mounted, indicating that cgconfig hadn't been started. On the rest of the apaches, cgconfig had been started but not mw-cgroup.
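The two failure modes above (cgroup filesystem never mounted vs. mounted but mw-cgroup never started) can be told apart per host. A minimal sketch in Python, assuming cgroup v1 and assuming the memory controller is the one of interest (this thread later mentions a `memory:mediawiki` group); the function name is hypothetical. It parses the text of /proc/mounts, whose fields are device, mount point, fstype, options, dump, pass:

```python
from typing import Optional


def cgroup_mountpoint(mounts_text: str, controller: str = "memory") -> Optional[str]:
    """Return the mount point of the cgroup v1 hierarchy carrying `controller`,
    or None if no such hierarchy is mounted.

    `mounts_text` is the content of /proc/mounts (hypothetical helper).
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        # cgroup v1 lists the attached controllers among the mount options
        if fstype == "cgroup" and controller in options.split(","):
            return mountpoint
    return None
```

Passing it the contents of /proc/mounts returns None on a host where, as on the five apaches above, the cgroup filesystem was never mounted; a non-None result only shows the hierarchy exists, not that mw-cgroup created its group.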

When this bug occurred just now on mw1109, I ran "initctl log-priority debug" then stopped and started cgconfig a couple of times. The logs showed the started/stopped events going to cgred, but not to mw-cgroup. When I edited mw-cgroup.conf with an irrelevant change, the syslog showed a configuration reload due to inotify, and after that, mw-cgroup was started and stopped as expected. So my suspicion is that at some point, something went wrong with cgconfig or mw-cgroup or both, which, in combination with a bug in upstart, caused mw-cgroup to stop receiving events.
Comment 6 Tim Starling 2013-10-21 00:15:35 UTC
379 out of 391 apaches show the old config when you run "initctl show-config mw-cgroup", i.e. without the "stop on" trigger. This is apparently because upstart config changes only take effect after the job is stopped or restarted.
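Counting hosts that still run the stale job definition can be scripted. A hedged Python sketch: it assumes `initctl show-config <job>` prints the job name followed by indented `start on` / `stop on` lines, and both function names are hypothetical:

```python
import subprocess


def loaded_config(job: str) -> str:
    """Return the job configuration upstart currently has loaded for `job`."""
    result = subprocess.run(
        ["initctl", "show-config", job],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_stop_on(show_config_output: str) -> bool:
    """True if the loaded config includes a "stop on" condition.

    Assumes conditions appear as indented "stop on ..." lines.
    """
    return any(line.strip().startswith("stop on")
               for line in show_config_output.splitlines())
```

Running `has_stop_on(loaded_config("mw-cgroup"))` on each apache would flag the hosts where the edited file on disk has not yet replaced the in-memory definition.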

The stop script of mw-cgroup will routinely fail since rmdir() fails with EBUSY if there are any tasks remaining in the cgroup. The cgdelete command can be used instead -- it attempts to move any tasks in the cgroup to the parent cgroup before executing the rmdir(). However, this is still prone to failure if new tasks are added to the cgroup while the cgdelete command is in progress:

# cgdelete -r memory:mediawiki
cgdelete: cannot remove group 'mediawiki': Device or resource busy

cgclear, which is run from the stop script of the cgconfig upstart job, is also prone to failing for the same reason. This would not be a problem if cgconfig and mw-cgroup were started on boot and never touched again, but of course the cgroup-bin package has a prerm trigger which stops the cgconfig job, causing breakage on upgrade. Says mw1109:

2013-09-09 18:41:49 upgrade cgroup-bin 0.37.1-1ubuntu10 0.37.1-1ubuntu10.1

Per usual Debian convention, to fix a bug in cgrulesengd it is necessary to unmount and remount the entire cgroup filesystem.
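The migrate-then-rmdir approach described above for cgdelete, and the race that defeats it, can be sketched as follows. This is a hypothetical Python illustration assuming the cgroup v1 `tasks` file interface, not libcgroup's actual code; the function name, retry count and delay are made up. Retrying on EBUSY narrows the window in which a new task can join the group but does not close it:

```python
import errno
import os
import time
from typing import Callable


def remove_cgroup(child: str, parent: str, attempts: int = 5,
                  delay: float = 0.1,
                  rmdir: Callable[[str], None] = os.rmdir) -> int:
    """Move all tasks from cgroup dir `child` to `parent`, then rmdir `child`.

    Retries when rmdir fails with EBUSY (a task joined after the migration
    pass). Returns the attempt number that succeeded; re-raises the final
    error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        with open(os.path.join(child, "tasks")) as f:
            pids = [line.strip() for line in f if line.strip()]
        # cgroup v1 moves one task per write(), so each PID gets its own
        # os.write() call on the parent's "tasks" file.
        fd = os.open(os.path.join(parent, "tasks"), os.O_WRONLY)
        try:
            for pid in pids:
                try:
                    os.write(fd, pid.encode() + b"\n")
                except OSError as e:
                    if e.errno != errno.ESRCH:  # ESRCH: task already exited
                        raise
        finally:
            os.close(fd)
        try:
            rmdir(child)
            return attempt
        except OSError as e:
            # Only EBUSY is worth retrying; give up on the last attempt.
            if e.errno != errno.EBUSY or attempt == attempts:
                raise
            time.sleep(delay)
```

Injecting `rmdir` keeps the retry logic testable outside a real cgroup hierarchy; on a live host the default `os.rmdir` is used.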
Comment 7 Gerrit Notification Bot 2013-10-22 01:35:46 UTC
Change 91115 had a related patch set uploaded by Tim Starling:
Improve logging for wfShellExec()

https://gerrit.wikimedia.org/r/91115
Comment 8 Sam Reed (reedy) 2013-10-23 15:45:13 UTC
*** Bug 54367 has been marked as a duplicate of this bug. ***
Comment 9 Tim Starling 2013-10-29 02:28:00 UTC
I suggest migrating Apache to upstart and having it stop if mw-cgroup stops.
Comment 10 Gerrit Notification Bot 2013-10-29 22:42:19 UTC
Change 91115 merged by jenkins-bot:
Improve logging for wfShellExec() and ignore missing cgroup

https://gerrit.wikimedia.org/r/91115
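The summary of the merged change ("ignore missing cgroup") describes falling back rather than failing hard. wfShellExec() itself is PHP in MediaWiki core; the following Python is only a hypothetical sketch of that policy, not the patch's code, and the function and logger names are invented:

```python
import logging
import os
from typing import Optional

log = logging.getLogger("shell-exec")  # hypothetical logger name


def usable_cgroup(cgroup_dir: Optional[str]) -> Optional[str]:
    """Return `cgroup_dir` if it exists; otherwise log and return None so
    the caller runs the command without cgroup confinement instead of
    failing (as texvc did in this bug)."""
    if not cgroup_dir:
        return None  # no cgroup configured at all
    if os.path.isdir(cgroup_dir):
        return cgroup_dir
    log.warning("cgroup %s is missing; running command unconfined", cgroup_dir)
    return None
```

The trade-off is that a command then runs without its cgroup limits, which is why the logging half of the same change matters: the warning is what reveals that a host has lost its cgroup.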
Comment 11 Andre Klapper 2014-02-14 12:48:12 UTC
[Patch merged; resetting bug status]
