Last modified: 2014-10-24 13:45:44 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T73128, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 71128 - Jenkins: lanthanum/gallium tmpfs are filling up with stale tmp files
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration
Version: wmf-deployment
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Antoine "hashar" Musso (WMF)
Depends on:
Blocks:
Reported: 2014-09-22 18:31 UTC by Krinkle
Modified: 2014-10-24 13:45 UTC (History)


Description Krinkle 2014-09-22 18:31:52 UTC
Every few days it goes critical. The last two criticals were:

* September 17 19:00
* September 22 11:20


$ df -h
..
 tmpfs           512M  505M  7.8M  99% /var/lib/jenkins-slave/tmpfs
..
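
The check above can also be scripted. A minimal sketch, assuming a 90% threshold; the function names mount_usage_pct and check_mount are hypothetical and not part of any existing monitoring setup:

```shell
# Print the usage percentage (digits only) of the filesystem holding a path.
mount_usage_pct() {
    # df -P prints one line per filesystem; column 5 is the capacity, e.g. "99%".
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report OK/CRITICAL for a path. The threshold (default 90) is an assumption.
check_mount() {
    local path="$1" threshold="${2:-90}" pct
    pct=$(mount_usage_pct "$path")
    # Bail out if df gave us nothing numeric (for example, the path is missing).
    case "$pct" in
        ''|*[!0-9]*) echo "UNKNOWN: cannot read usage of $path" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "$threshold" ]; then
        echo "CRITICAL: $path at ${pct}%"
        return 1
    fi
    echo "OK: $path at ${pct}%"
}
```

Usage would be something like `check_mount /var/lib/jenkins-slave/tmpfs 90`.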

Example contents:

[18:25 UTC] krinkle at lanthanum.eqiad.wmnet in /var/lib/jenkins-slave/tmpfs
$ l
 mediawiki-core-extensions-integration/
 mediawiki-core-install-sqlite/
 mediawiki-core-phpunit-api/
 mediawiki-core-phpunit-databaseless/
 mediawiki-core-phpunit-misc/
 mediawiki-core-regression-REL1_23/
 mwext-Flow-qunit/
 mwext-WikimediaEvents-testextension/
 parsoidsvc-php-parsertests/

mediawiki-core-regression-master:
 build7546.sqlite
 build7547.sqlite
 build7550.sqlite


mediawiki-vendor-integration:
total 21M
 MW_PHPUnit_ExifRotationTest_pCtdaJ/
 MW_PHPUnit_TextPassDumperTest_Cpe5LO/
 MW_PHPUnit_TextPassDumperTest_J1dvmI
 MW_PHPUnit_TextPassDumperTest_QmhN3D
 MW_PHPUnit_TextPassDumperTest_rYX7QZ/
 277K Sep 22 17:26 build1787.sqlite
 291K Sep 22 17:35 build1788.sqlite
 291K Sep 22 17:42 build1793.sqlite
 291K Sep 22 17:54 build1798.sqlite
 291K Sep 22 18:02 build1809.sqlite
 265K Sep 22 18:03 build1812.sqlite
 271K Sep 22 18:05 build1813.sqlite
 291K Sep 22 18:12 build1814.sqlite
 291K Sep 22 18:25 build1819.sqlite
 4.1M Aug 20 15:08 mw-A8c7xJ
 4.1M Aug 20 15:12 mw-BxZ3Ez
 4.1M Aug 26 13:42 mw-R4MZCN
 4.1M Aug 26 13:42 mw-tbdCvd
 0 Sep 20 18:21 transform_f2c5b84944be-1.jpg

I've purged a bunch of files for now, possibly breaking a few currently running builds.

Problems:
* The tmpfs partition is way too small (~ 500MB).
* Stuff isn't being purged.
* These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
* These are files only needed for the duration of the test and should be removed right after a test has run.
Comment 1 Antoine "hashar" Musso (WMF) 2014-09-22 18:51:22 UTC
> * The tmpfs partition is way too small (~ 500MB).

/var/lib/jenkins/tmpfs is only 512MB because it is a tmpfs and therefore consumes RAM.

> * Stuff isn't being purged.

At least sqlite files are purged since https://gerrit.wikimedia.org/r/#/c/102149/ :

  mw-install-sqlite.sh:find "$SQLITE_DIR" -type f -name '*.sqlite' -mmin +60 -delete
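
To illustrate what that find invocation does, here is a minimal sketch wrapped in a function. The name purge_stale_sqlite and its arguments are hypothetical; only the find expression itself comes from mw-install-sqlite.sh:

```shell
# Hypothetical wrapper: delete *.sqlite files older than a number of minutes.
purge_stale_sqlite() {
    local dir="$1" max_age_min="${2:-60}"
    # -mmin +N matches files last modified more than N minutes ago,
    # so databases of builds started within the window are left alone.
    find "$dir" -type f -name '*.sqlite' -mmin +"$max_age_min" -delete
}
```

Note that this only removes files matching '*.sqlite'; other leftovers such as the MW_PHPUnit_* directories would not be touched.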

> * These are not regular build artefacts (which Jenkins stores separately and we do have them expire/purge properly).
> * These are files only needed for the duration of the test and should be removed right after a test has run.

That seems to be covered by bug 68563 "Jenkins: point TMP/TEMP to workspace and delete it after build completion".

Looking at gallium, the main offenders are the qunit jobs: each consumes ~7MB and we have ten of them for mediawiki-core-qunit. It seems we had a surge of tests running over the course of an hour.
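
One way to find the main offenders is to rank the per-job directories by size. A minimal sketch; the function name largest_dirs is hypothetical:

```shell
# Hypothetical helper: list the largest immediate subdirectories of a root.
largest_dirs() {
    local root="$1" count="${2:-5}"
    # du -sk prints "size-in-KiB<TAB>path" per directory; sort largest first.
    du -sk -- "$root"/*/ 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_dirs /var/lib/jenkins-slave/tmpfs 10` would list the ten biggest job directories.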



The find -mmin +60 is pretty lame. Since then I found a way to run a task on build completion: the 'postbuildscript' publisher. The qunit jobs already have such a macro, qunit-cleanup (in macro.yaml), so we can just add a step that deletes the sqlite file.
Comment 2 Krinkle 2014-09-29 23:33:49 UTC
And again...



ssh lanthanum.eqiad.wmnet
cd /var/lib/jenkins/tmpfs
sudo -su jenkins-slave
ll
rm -rf *@* mwext-*
ll
Comment 3 Krinkle 2014-09-29 23:39:11 UTC
(In reply to Antoine "hashar" Musso from comment #1)
> The find -mmin +60 is pretty lame. Since then I found a way to run a task
> on build completion: the 'postbuildscript' publisher. The qunit jobs
> already have such a macro, qunit-cleanup (in macro.yaml), so we can just
> add a step that deletes the sqlite file.

Cool. Let's see if we can update our macros that create tmp dbs to use postbuildscript to clean them up.

I guess keeping it in tmpfs is useful for now. We can just do an 'rm -rf' of the containing directory since it's tied to the workspace-id (jobname[@concurrency]), so there are no parallel conflicts.
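
A minimal sketch of such a cleanup step, assuming the tmpfs root and the workspace-id are passed in as arguments. The function name teardown_workspace_tmp is hypothetical and this is not the actual script that was later merged:

```shell
# Hypothetical teardown step: remove the tmpfs directory of one workspace.
teardown_workspace_tmp() {
    local tmpfs_root="$1" workspace_id="$2"
    # Refuse empty or path-like ids so rm -rf can never leave our own directory.
    case "$workspace_id" in
        ''|.|..|*/*)
            echo "refusing unsafe workspace id: '$workspace_id'" >&2
            return 1
            ;;
    esac
    # ${tmpfs_root:?} aborts if the root is empty, instead of deleting "/<id>".
    rm -rf -- "${tmpfs_root:?}/${workspace_id}"
}
```

Because each concurrent run has its own workspace-id (for example jobname@2), removing that one directory should not affect other running builds.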
Comment 4 Sam Reed (reedy) 2014-10-07 19:38:42 UTC
RT #8582 requests more RAM for those 2 machines.
Comment 5 Antoine "hashar" Musso (WMF) 2014-10-07 20:00:42 UTC
We can use a postbuild publisher that executes a shell script to tear down the database. This should be done in jobs that create sqlite databases, such as the ones using the macro prepare-mediawiki-qunit (being renamed to prepare-mediawiki).
Comment 6 Antoine "hashar" Musso (WMF) 2014-10-07 20:30:54 UTC
That is surely annoying but not that critical imho.

Now that I have completed the Zuul cloner sprint, I will adjust the Jenkins jobs to delete the sqlite file on completion (as suggested in comment #1).
Comment 7 Gerrit Notification Bot 2014-10-21 21:07:50 UTC
Change 167948 had a related patch set uploaded by Hashar:
mw-install-sqlite: clear sqlite DB after 20 mins

https://gerrit.wikimedia.org/r/167948
Comment 8 Gerrit Notification Bot 2014-10-21 21:08:21 UTC
Change 167948 merged by jenkins-bot:
mw-install-sqlite: clear sqlite DB after 20 mins

https://gerrit.wikimedia.org/r/167948
Comment 9 Antoine "hashar" Musso (WMF) 2014-10-21 21:11:05 UTC
(In reply to Gerrit Notification Bot from comment #8)
> Change 167948 merged by jenkins-bot:
> mw-install-sqlite: clear sqlite DB after 20 mins
> 
> https://gerrit.wikimedia.org/r/167948

Deployed, but that is a lame workaround.
Comment 10 Gerrit Notification Bot 2014-10-24 09:22:19 UTC
Change 168558 had a related patch set uploaded by Hashar:
Refactor mw sqlite related env variables

https://gerrit.wikimedia.org/r/168558
Comment 11 Gerrit Notification Bot 2014-10-24 09:23:31 UTC
Change 168558 merged by jenkins-bot:
Refactor mw sqlite related env variables

https://gerrit.wikimedia.org/r/168558
Comment 12 Gerrit Notification Bot 2014-10-24 09:30:48 UTC
Change 168562 had a related patch set uploaded by Hashar:
mw-teardown.sh: to be run after mw jobs

https://gerrit.wikimedia.org/r/168562
Comment 13 Gerrit Notification Bot 2014-10-24 09:31:25 UTC
Change 168562 merged by jenkins-bot:
mw-teardown.sh: to be run after mw jobs

https://gerrit.wikimedia.org/r/168562
Comment 14 Gerrit Notification Bot 2014-10-24 09:54:54 UTC
Change 168566 had a related patch set uploaded by Hashar:
Mediawiki teardown publisher

https://gerrit.wikimedia.org/r/168566
Comment 15 Gerrit Notification Bot 2014-10-24 10:04:43 UTC
Change 168566 merged by jenkins-bot:
Mediawiki teardown publisher

https://gerrit.wikimedia.org/r/168566
Comment 16 Antoine "hashar" Musso (WMF) 2014-10-24 10:26:15 UTC
The patches above cause jobs to delete the sqlite file on completion. That should keep tmpfs usage at a minimal level now.

Leaving the bug open for a while though.
Comment 17 Antoine "hashar" Musso (WMF) 2014-10-24 13:45:44 UTC
Let's just assume this is fixed for now. Additionally, I have manually cleared the tmpfs partitions on both gallium and lanthanum.

If that occurs again, one can still reopen the bug.
