Last modified: 2014-10-24 13:45:44 UTC
Every few days it is going critical. The last two criticals were:

* September 17 19:00
* September 22 11:20

$ df -h
..
tmpfs 512M 505M 7.8M 99% /var/lib/jenkins-slave/tmpfs
..

Example contents:

[18:25 UTC] krinkle at lanthanum.eqiad.wmnet in /var/lib/jenkins-slave/tmpfs
$ l
mediawiki-core-extensions-integration/
mediawiki-core-install-sqlite/
mediawiki-core-phpunit-api/
mediawiki-core-phpunit-databaseless/
mediawiki-core-phpunit-misc/
mediawiki-core-regression-REL1_23/
mwext-Flow-qunit/
mwext-WikimediaEvents-testextension/
parsoidsvc-php-parsertests/

mediawiki-core-regression-master:
build7546.sqlite
build7547.sqlite
build7550.sqlite

mediawiki-vendor-integration:
total 21M
MW_PHPUnit_ExifRotationTest_pCtdaJ/
MW_PHPUnit_TextPassDumperTest_Cpe5LO/
MW_PHPUnit_TextPassDumperTest_J1dvmI
MW_PHPUnit_TextPassDumperTest_QmhN3D
MW_PHPUnit_TextPassDumperTest_rYX7QZ/
277K Sep 22 17:26 build1787.sqlite
291K Sep 22 17:35 build1788.sqlite
291K Sep 22 17:42 build1793.sqlite
291K Sep 22 17:54 build1798.sqlite
291K Sep 22 18:02 build1809.sqlite
265K Sep 22 18:03 build1812.sqlite
271K Sep 22 18:05 build1813.sqlite
291K Sep 22 18:12 build1814.sqlite
291K Sep 22 18:25 build1819.sqlite
4.1M Aug 20 15:08 mw-A8c7xJ
4.1M Aug 20 15:12 mw-BxZ3Ez
4.1M Aug 26 13:42 mw-R4MZCN
4.1M Aug 26 13:42 mw-tbdCvd
0 Sep 20 18:21 transform_f2c5b84944be-1.jpg

I've purged a bunch of files for now, which possibly broke a few currently running builds.

Problems:

* The tmpfs partition is way too small (~ 500MB).
* Stuff isn't being purged.
* These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
* These are files only needed for the duration of the test and should be removed right after a test has run.
> * The tmpfs partition is way too small (~ 500MB).

/var/lib/jenkins/tmpfs is only 512MB because it is a tmpfs, hence it consumes RAM.

> * Stuff isn't being purged.

At least sqlite files are purged since https://gerrit.wikimedia.org/r/#/c/102149/ :

mw-install-sqlite.sh:find "$SQLITE_DIR" -type f -name '*.sqlite' -mmin +60 -delete

> * These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
> * These are files only needed for the duration of the test and should be removed right after a test has run.

It seems that is covered by bug 68563 "Jenkins: point TMP/TEMP to workspace and delete it after build completion".

Looking on gallium, the main offenders are the qunit jobs: each consumes ~ 7MB and we have ten of them for mediawiki-core-qunit. It seems we had a surge of tests running over an hour.

The find -mtime 60 is pretty lame. Since then I found a way to have a task run on build completion, which is the 'postbuildscript' publisher. The qunit jobs already have such a macro, qunit-cleanup (in macro.yaml), so we can just add a step that would delete the sqlite file.
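For reference, the age-based purge quoted above can be tried out in isolation. This is only a sketch: purge_old_sqlite and the scratch directory are hypothetical names for illustration, not what mw-install-sqlite.sh actually contains beyond the find expression quoted in this comment. It assumes GNU findutils and coreutils (for -delete, -mmin and touch -d). Note that -mmin counts minutes, whereas -mtime counts days.

```shell
#!/bin/bash
# Sketch of an age-based purge. purge_old_sqlite is a hypothetical helper;
# only the find expression mirrors the one quoted from mw-install-sqlite.sh.
purge_old_sqlite() {
    local dir=$1
    local max_age_min=${2:-60}
    # -mmin +N matches files last modified more than N minutes ago.
    find "$dir" -type f -name '*.sqlite' -mmin "+$max_age_min" -delete
}

# Demo in a scratch directory (not the real tmpfs).
demo_dir=$(mktemp -d)
touch "$demo_dir/fresh.sqlite"
touch -d '2 hours ago' "$demo_dir/stale.sqlite"   # GNU touch syntax
purge_old_sqlite "$demo_dir" 60
ls "$demo_dir"
```

After the run only fresh.sqlite should be left. The weakness is visible here too: anything younger than the cutoff stays on the tmpfs, so a surge of builds within the hour still fills it up.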
And again...

ssh lanthanum.eqiad.wmnet
cd /var/lib/jenkins/tmpfs
sudo -su jenkins-slave
ll
rm -rf *@* mwext-*
ll
(In reply to Antoine "hashar" Musso from comment #1)
> The find -mtime 60 is pretty lame. Since then I found a way to have a task
> to run on build completion which is the 'postbuildscript' publisher. The
> qunit jobs already have such a macro qunit-cleanup (in macro.yaml), so we
> can just add a step that would delete the sqlite file.

Cool. Let's see if we can update our macros that create tmp dbs to use postbuildscript to clean them up.

I guess keeping it in tmpfs is useful for now. We can just do an 'rm -rf' of the containing directory since it's tied to the workspace-id (jobname[@concurrency]), so there are no parallel conflicts.
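A rough sketch of that per-workspace 'rm -rf', to make the idea concrete. The function name cleanup_workspace_tmp, the demo paths, and deriving the workspace-id from the basename of $WORKSPACE are assumptions for illustration; the real macros may do this differently. WORKSPACE itself is an environment variable Jenkins sets for each build.

```shell
#!/bin/bash
# Sketch: remove the scratch directory belonging to one workspace only.
# cleanup_workspace_tmp is a hypothetical helper, not an existing script.
cleanup_workspace_tmp() {
    local root=$1 workspace_id=$2
    # Refuse empty arguments so we never 'rm -rf' the tmpfs root itself.
    if [ -z "$root" ] || [ -z "$workspace_id" ]; then
        echo "cleanup_workspace_tmp: missing argument" >&2
        return 1
    fi
    rm -rf -- "${root:?}/${workspace_id:?}"
}

# Demo in a scratch directory standing in for the tmpfs mount.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/mwext-Flow-qunit" "$demo_root/mwext-Flow-qunit@2"
touch "$demo_root/mwext-Flow-qunit@2/build1.sqlite"

# Hypothetical workspace path of a concurrent build (jobname@2).
WORKSPACE=/srv/jenkins/workspace/mwext-Flow-qunit@2
cleanup_workspace_tmp "$demo_root" "$(basename "$WORKSPACE")"
ls "$demo_root"
```

Only the directory of the finishing build is removed; the sibling mwext-Flow-qunit directory of a parallel run is left alone, which is the point of keying on the workspace-id.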
RT #8582 is for requesting more RAM in those 2 machines.
We can use a postbuilder publisher that executes a shell script to tear down the database. This should be done in jobs that create sqlite databases, such as the ones using the macro prepare-mediawiki-qunit (being renamed to prepare-mediawiki).
That is surely annoying but not that critical imho. Now that I have completed the Zuul cloner sprint, I will adjust the Jenkins jobs to delete the sqlite file on completion (as suggested in comment #1).
Change 167948 had a related patch set uploaded by Hashar: mw-install-sqlite: clear sqlite DB after 20 mins https://gerrit.wikimedia.org/r/167948
Change 167948 merged by jenkins-bot: mw-install-sqlite: clear sqlite DB after 20 mins https://gerrit.wikimedia.org/r/167948
(In reply to Gerrit Notification Bot from comment #8)
> Change 167948 merged by jenkins-bot:
> mw-install-sqlite: clear sqlite DB after 20 mins
>
> https://gerrit.wikimedia.org/r/167948

Deployed. But that is a lame workaround.
Change 168558 had a related patch set uploaded by Hashar: Refactor mw sqlite related env variables https://gerrit.wikimedia.org/r/168558
Change 168558 merged by jenkins-bot: Refactor mw sqlite related env variables https://gerrit.wikimedia.org/r/168558
Change 168562 had a related patch set uploaded by Hashar: mw-teardown.sh: to be run after mw jobs https://gerrit.wikimedia.org/r/168562
Change 168562 merged by jenkins-bot: mw-teardown.sh: to be run after mw jobs https://gerrit.wikimedia.org/r/168562
Change 168566 had a related patch set uploaded by Hashar: Mediawiki teardown publisher https://gerrit.wikimedia.org/r/168566
Change 168566 merged by jenkins-bot: Mediawiki teardown publisher https://gerrit.wikimedia.org/r/168566
The patch above causes jobs to delete the sqlite file on completion. That should keep tmpfs usage at a minimum level now. Leaving the bug open for a while though.
Let's just assume this is fixed for now. Additionally, I have manually cleared the tmpfs partitions on both gallium and lanthanum. If that occurs again, one can still reopen the bug.