Last modified: 2014-11-20 13:37:41 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and beyond displaying bug reports and their history, links might be broken. See T75472, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 73472 - [OPS] Jenkins: puppet master fills /var on labs with yaml reports
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Continuous integration
Version: unspecified
Hardware: All  OS: All
Importance: High normal
Target Milestone: ---
Assigned To: Yuvi Panda
Keywords: ops
Depends on:
Blocks:
Reported: 2014-11-15 15:15 UTC by Krinkle
Modified: 2014-11-20 13:37 UTC
8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Krinkle 2014-11-15 15:15:21 UTC
integration-puppetmaster has a /var of only 1.9GB and most of it is filled up with:

 110M /var/lib/git
 1.2G /var/lib/puppet

Looking at the disk usage graphs, one can see it has close to 0 free space and is constantly going up and down every few hours.

https://tools.wmflabs.org/nagf/?project=integration#h_integration-puppetmaster_disk

The increases are puppet runs writing YAML files to /var/lib/puppet/reports. The decreases are the cleanup cronjob running.

https://github.com/wikimedia/operations-puppet/blob/fcf51231/modules/puppetmaster/manifests/scripts.pp#L35

I ran it manually once to delete reports older than 2 hours instead of 36 hours, and /var/lib/puppet shrank from 1.2G to 115M.
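The cleanup in that cronjob is a plain mtime-based `find`. A minimal sketch of the same retention logic, run against a scratch directory rather than the real /var/lib/puppet/reports, with a 2-hour (120-minute) cutoff:

```shell
#!/bin/sh
# Scratch-directory demo of the report-retention logic: files whose
# mtime is older than 120 minutes are deleted, newer ones are kept.
# (touch -d is GNU coreutils.)
dir=$(mktemp -d)
touch "$dir/fresh.yaml"                   # modified now -> kept
touch -d '3 hours ago' "$dir/stale.yaml"  # older than cutoff -> deleted
find "$dir" -type f -mmin +120 -delete
ls "$dir"
```

The production cronjob has the same shape, just pointed at /var/lib/puppet/reports with a +2160 (36-hour) cutoff.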
Comment 1 Antoine "hashar" Musso (WMF) 2014-11-18 08:26:36 UTC
The /var on labs is indeed only 2GB.  puppetmaster reports take 600MB of disk right now.

modules/puppetmaster/manifests/scripts.pp has a cronjob 'removeoldreports' which removes the reports after 2160 minutes (36 hours). I am wondering whether we could use hiera() to set a lower retention time and run puppet less often.   CCing Giuseppe and Yuvi.


/var is /dev/vda2 , I am wondering whether it can be extended somehow.  CCing Andrew B and Marc-André.
Comment 2 Marc A. Pelletier 2014-11-18 14:45:30 UTC
(In reply to Antoine "hashar" Musso (WMF) from comment #1)
> /var is /dev/vda2 , I am wondering whether it can be extended somehow. 
> CCing Andrew B and Marc-André.

The latest images, through some rather ugly trickery, have /var on a logical volume and thus are expandable at will.  No such luck for the older images which have physical partitions.
Comment 3 Antoine "hashar" Musso (WMF) 2014-11-18 15:51:57 UTC
I looked at the state of the beta cluster puppet master (deployment-salt).

There, /var/lib is a symlink to /srv/var-lib/, which gives more free space.  The puppet master is configured with the logstash reports processor, which explains why nothing is written to disk.
Comment 4 Gerrit Notification Bot 2014-11-18 15:53:16 UTC
Change 174132 had a related patch set uploaded by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132
Comment 5 Antoine "hashar" Musso (WMF) 2014-11-18 15:56:46 UTC
(In reply to Antoine "hashar" Musso (WMF) from comment #3)
> I looked at the state of the beta cluster puppet master (deployment-salt).
> 
> There, /var/lib is a symlink to /srv/var-lib/, which gives more free space.
> The puppet master is configured with the logstash reports processor, which
> explains why nothing is written to disk.

On beta we have a patch to send reports to logstash, which skips writing reports to disk: https://gerrit.wikimedia.org/r/#/c/143788/10/modules/puppetmaster/templates/30-logstash.conf.erb,unified
Comment 6 Greg Grossmeier 2014-11-18 18:05:42 UTC
(In reply to Antoine "hashar" Musso (WMF) from comment #1)
> The /var on labs is indeed only 2GB.  puppetmaster reports takes 600MB of
> disk right now.

Can we not just increase the size of the beta cluster instances' diskspace? We've run into this issue many many many many times and playing whack-a-mole with symlinks and cronjobs to move data around is not sustainable.
Comment 7 Andrew Bogott 2014-11-18 18:09:41 UTC
Greg --

For new instances /var/log is somewhat resizeable.  For existing instances you can remount /var/log but that's very messy since every service expects to already have an open file and a directory in /var/log.
Comment 8 Greg Grossmeier 2014-11-18 18:30:00 UTC
(In reply to Andrew Bogott from comment #7)
> Greg --
> 
> For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances to be $large-enough-to-not-matter?

> For existing instances
> you can remount /var/log but that's very messy since every service expects
> to already have an open file and a directory in /var/log.

Worst case scenario is creating a second instance of whatever with a larger disk, moving traffic to it, then shutting down the old one, right? Not saying we should do that soon, but... continued hacks like this are hurting the stability of Beta Cluster (as opposed to addressing the real underlying issue of too little space on the VMs we use for our integration environment which everyone depends on daily).
Comment 9 Andrew Bogott 2014-11-18 18:34:51 UTC
(In reply to Greg Grossmeier from comment #8)
> (In reply to Andrew Bogott from comment #7)
> > Greg --
> > 
> > For new instances /var/log is somewhat resizeable.
> 
> How much? Can we just change the default for new deployment-prep instances
> to be $large-enough-to-not-matter?

Resizeable up to the available space selected when the instance was originally created.

It should be possible to set up sizing of /var/log based on project.  I'll have a look at that if that's the direction you want to go.
 
> Worst case scenario is creating a second instance of whatever with a larger
> disk, moving traffic to it, then shutting down the old one, right? 

That's correct.  In perfect-puppet-land, doing that should be trivial, but I've been led to understand that in the real world it's a big pain.

> (as opposed to addressing the real underlying
> issue of too little space on the VMs we use for our integration environment
> which everyone depends on daily).

One might argue that the 'real problem' is unbounded log growth, and that beta just displays the symptoms sooner than production.  But I don't know if the issue really is unbounded growth or if growth is bounded properly but just bounded outside the capacity of existing instances.
Comment 10 Greg Grossmeier 2014-11-18 18:47:37 UTC
(In reply to Andrew Bogott from comment #9)
> (In reply to Greg Grossmeier from comment #8)
> > (In reply to Andrew Bogott from comment #7)
> > > Greg --
> > > 
> > > For new instances /var/log is somewhat resizeable.
> > 
> > How much? Can we just change the default for new deployment-prep instances
> > to be $large-enough-to-not-matter?
> 
> Resizeable up to the available space selected when the instance was
> originally created.
> 
> It should be possible to set up sizing of /var/log based on project.  I'll
> have a look at that if that's the direction you want to go.

I guess we should weigh this ^ against the unbounded growth concern below.

> > Worst case scenario is creating a second instance of whatever with a larger
> > disk, moving traffic to it, then shutting down the old one, right? 
> 
> That's correct.  In perfect-puppet-land, doing that should be trivial, but
> I've been led to understand that in the real world it's a big pain.

Sadly true, but that also points out other legitimate bugs :)

> > (as opposed to addressing the real underlying
> > issue of too little space on the VMs we use for our integration environment
> > which everyone depends on daily).
> 
> One might argue that the 'real problem' is unbounded log growth, and that
> beta just displays the symptoms sooner than production.  But I don't know if
> the issue really is unbounded growth or if growth is bounded properly but
> just bounded outside the capacity of existing instances.

Touché. But I'm still worried about all the differences between prod and beta that cause surprises :/
Comment 11 Greg Grossmeier 2014-11-19 16:34:49 UTC
Just keeping the heat on this bug, we had an outage this morning (times in Eastern US):
07:49 < icinga-wm> PROBLEM - BetaLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki01.diskspace._var.byte_avail.value (33.33%)

That probably caused the outage (the only other thing around that time is bug 73567, which hasn't been fixed/reverted, yet beta is back up).

I *really really really* want to just throw hardware at the problem, but that's a pain given how OpenStack/Beta work, and I'm getting annoyed by all the warnings that we can't do anything else about. Our (Release Engineering's) job is not to rework prod logging policies on a case-by-case basis to make them work in Beta. Continued diff creation for reasons like that only complexifies (it's a word) things.
Comment 12 Antoine "hashar" Musso (WMF) 2014-11-19 19:44:25 UTC
I know about two reasons why the HHVM application servers on beta cluster fill /var/:

Bug 73262 - hhvm apache fills /var/log/apache2 with access logs

They need to send their logs to syslog (which would thus end up in the logstash instance) instead of writing debug / access logs to disk.


There is some bug I can't find: the HHVM coredumps end up under /var/ as well, when they should be saved to /data/project (since we care about them) and garbage collected automatically (Bryan wrote a cron to handle that).

Finally there is this bug, with puppet filling the puppet master disk, which is being worked on by Yuvi.



Sorry for hijacking this bug. I can't firefight all the issues nor triage / set priority on bugs flagged hhvm.
Comment 13 Gerrit Notification Bot 2014-11-20 07:33:15 UTC
Change 174132 merged by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132
Comment 14 Yuvi Panda 2014-11-20 07:40:14 UTC
Can someone with projectadmin on integration project edit https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:

"puppetmaster::scripts::keep_report_minutes": 360

This will keep reports only for 6 hours.
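Assuming the cron keeps its mtime-based `find`, a hypothetical dry run of the 360-minute cutoff on a scratch directory (standing in for /var/lib/puppet/reports) shows which files the new retention would remove, without deleting anything:

```shell
#!/bin/sh
# Dry run: print what a 360-minute (6-hour) retention would delete.
# (touch -d is GNU coreutils.)
dir=$(mktemp -d)
touch "$dir/recent.yaml"                # within 6 hours -> kept
touch -d '7 hours ago' "$dir/old.yaml"  # past the cutoff -> listed
would_delete=$(find "$dir" -type f -mmin +360)
echo "$would_delete"
```

Dropping the `-delete` flag from the real cron command gives the same kind of preview on the live reports directory.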
Comment 15 Antoine "hashar" Musso (WMF) 2014-11-20 13:27:05 UTC
(In reply to Yuvi Panda from comment #14)
> Can someone with projectadmin on integration project edit
> https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:
> 
> "puppetmaster::scripts::keep_report_minutes": 360
> 
> This will keep reports only for 6 hours.

I have copy-pasted it on:
https://wikitech.wikimedia.org/wiki/Hiera:Integration

Updated the git repo on integration-puppetmaster.eqiad.wmflabs to include the above Gerrit change and ran puppet.  The puppet crontab still has the old entry:

 # crontab -l -u puppet |egrep -v ^#
 27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete

:-/
Comment 16 Antoine "hashar" Musso (WMF) 2014-11-20 13:37:41 UTC
<yuvipanda>	 hashar: bah, typo on my end. it's 'keep_reports_minutes' (s after report)

I have re-edited the wiki page and ran puppet again:

 Notice: /Stage[main]/Puppetmaster::Scripts/Cron[removeoldreports]/command:
 command changed
     'find /var/lib/puppet/reports -type f -mmin +2160 -delete'
  to 'find /var/lib/puppet/reports -type f -mmin +360 -delete'



  # crontab -l -u puppet |egrep -v ^#
  27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +360 -delete

That solves the issue for the 'integration' project.


I did the same for 'deployment-prep' ( https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=135116&oldid=134263 ) and it is all happy as well.


Thank you Yuvi!
