Last modified: 2014-10-09 23:01:41 UTC
I think I may have tracked down a cause of Trusty VMs (and labs instances) occasionally locking up. The php5 package shipped with Ubuntu 14.04 includes a cron job to clean up stale session files stored on disk. A similar cron job has been present in previous versions of Ubuntu, but the version shipped with 14.04 attempts to fix a long-standing bug <https://bugs.debian.org/626640> by scanning for sessions in active use before purging. To do this, the upstream maintainer decided that running `/usr/bin/lsof -w -l +d "/var/lib/php5"` would be a great way to find out if any processes had session files open. As pointed out in an upstream bug report <https://bugs.launchpad.net/ubuntu/+source/php5/+bug/1356113>, lsof can have a highly variable runtime cost depending on the processes running at the time it is invoked.

I have now several times caught multiple instances of the /usr/lib/php5/sessionclean script running on my VM. Since cron only invokes this script at :09 and :39 of each hour, seeing multiple instances implies that the older one has been running for at least 30 minutes.

The Ubuntu 12.04 version of the cleanup script was inlined in /etc/cron.d/php5:

    [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete

The proposed fix in the upstream Debian package is at <http://anonscm.debian.org/cgit/pkg-php/php.git/tree/debian/sessionclean>. This fix is not currently scheduled for backporting to Ubuntu 14.04.

I think we should either:

1) revert the cron job to the 12.04 version,
2) backport the Debian fix as part of our Puppet configuration, or
3) remove the cleanup job entirely as part of our Puppet configuration.

Since we do not currently set $wgSessionsInObjectCache, we are actively using files to store PHP sessions. I think this implies that option 3 would be less than optimal.
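For readers who find the 12.04 one-liner hard to parse, here is a minimal sketch of the same approach written as a shell function. The function name `purge_stale_sessions` and its two arguments are hypothetical, and it uses `-exec` with the full path instead of `-execdir`, so it is an illustration of the technique rather than the packaged script. It assumes `find` supports `-cmin` and `-delete`, and that `fuser` is installed.

```shell
#!/bin/sh
# Hypothetical helper modelled on the Ubuntu 12.04 cron line quoted above.
#   $1 = session directory (e.g. /var/lib/php5)
#   $2 = maximum session lifetime in minutes
purge_stale_sessions() {
    dir="$1"
    max_minutes="$2"

    # Nothing to do if the directory is missing.
    [ -d "$dir" ] || return 0

    # Look only at regular files directly inside $dir whose status changed
    # more than $max_minutes ago. `fuser -s` exits 0 when some process has
    # the file open, so `! -exec fuser ...` keeps files that are in use and
    # lets -delete remove the rest.
    find "$dir" -depth -mindepth 1 -maxdepth 1 -type f \
        -cmin "+$max_minutes" \
        ! -exec fuser -s {} \; -delete 2>/dev/null
}
```

A call such as `purge_stale_sessions /var/lib/php5 "$(/usr/lib/php5/maxlifetime)"` would mirror the original cron entry. Unlike the lsof-based scan, `fuser` is only run once per candidate file, and only for files that are already past the lifetime.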
I set up a brand new VM and left it running overnight. When I checked it today I found this in the process list:

    $ pstree
    init─┬─VBoxService───7*[{VBoxService}]
         ├─acpid
         ├─apache2───5*[apache2]
         ├─atd
         ├─cron───20*[cron───sh───sessionclean─┬─awk]
         │                                     ├─lsof───lsof]
         │                                     └─xargs]

That is 20 copies of the session cleanup script running simultaneously.
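A quick way to check another host for the same pile-up, without reading pstree output by eye, might look like the sketch below. The function name `stale_sessionclean` is hypothetical. It assumes a `ps` that supports the `etimes` (elapsed seconds) output field and that the script shows up under the command name `sessionclean`, as it does in the pstree output.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of sessionclean processes that have
# been running longer than a threshold (in seconds).
# Reads lines of the form "<pid> <elapsed-seconds> <command-name>" on stdin.
stale_sessionclean() {
    # Default to 1800 seconds: cron starts a new run every 30 minutes,
    # so anything older than that overlaps with a later run.
    threshold="${1:-1800}"
    awk -v t="$threshold" '$3 == "sessionclean" && $2 + 0 > t + 0 { print $1 }'
}
```

Example usage, under the same assumptions: `ps -eo pid=,etimes=,comm= | stale_sessionclean 1800`. Any PID printed belongs to a run that was still going when cron started the next one.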
Change 164877 had a related patch set uploaded by BryanDavis: Backport sessionclean from Debian package https://gerrit.wikimedia.org/r/164877
Change 164877 merged by jenkins-bot: Backport sessionclean from Debian package https://gerrit.wikimedia.org/r/164877