Last modified: 2014-09-18 20:21:20 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72971, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70971 - puppet agent on labs-vagrant instance using 99% of CPU, looking for git::clone.rb
puppet agent on labs-vagrant instance using 99% of CPU, looking for git::clon...
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Infrastructure (Other open bugs)
unspecified
All All
: Unprioritized normal
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-09-17 22:55 UTC by spage
Modified: 2014-09-18 20:21 UTC (History)
5 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
puppet agent --debug output (590.27 KB, text/plain)
2014-09-18 01:15 UTC, Bryan Davis
Details

Description spage 2014-09-17 22:55:51 UTC
After successfully provisioning the labs-vagrant instance flow-tests.wmflabs (see bug 70967), I rebooted it.

It came up OK but terribly slow. `top` showed puppet agent at 99% of CPU for minutes.  I killed it, then the next cron.d run also chewed up CPU. I killed that and ran the puppet agent command line from a terminal; it gets so far as

Info: Applying configuration version '1410992606'
Notice: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]/returns: executed successfully

and never prints anything more, while at 99% of CPU.

strace shows it's endlessly stat()ing the same set of 18 paths for lib/puppet/type/git::clone.rb. It stat()s "/var/lib/puppet/lib/puppet/type/git::clone.rb" twice for each time it stat()s the other 17 paths, such as "/usr/lib/ruby/1.9.1/puppet/type/git::clone.rb", "/usr/lib/ruby/1.9.1/x86_64-linux/puppet/type/git::clone.rb", etc.
Comment 1 spage 2014-09-17 23:23:09 UTC
$ sudo timeout  -k 300 1800 puppet agent --onetime --debug --verbose --no-daemonize --no-splay --show_diff  2>&1 | tee /tmp/puppet.out

shows it gets to 

Debug: /Stage[main]/Labs_lvm/Exec[create-volume-group]/unless:   vd:r/w:772:-1:0:1:1:-1:0:1:1:8970240:4096:2190:2190:0:DE0xG4-xoJp-kANc-YU6o-RPD3-BqFG-UHi7RW
Debug: Failed to load library 'msgpack' for feature 'msgpack'
Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw
Debug: Exec[migrate legacy files](provider=posix): Executing check '/usr/bin/test -d /mnt/vagrant'
Debug: Executing '/usr/bin/test -d /mnt/vagrant'

and then no more output and puppet at 99% of CPU.
Comment 2 Bryan Davis 2014-09-18 01:12:55 UTC
I'm pretty sure that "Debug: Executing '/usr/bin/test -d /mnt/vagrant'" is not what's really running while the system goes nuts. I noticed the exact same behavior on sul-test.eqiad.wmflabs (strace full of puppet looking for puppet/type/git::clone.rb in every location that ruby might be storing a puppet type plugin.

I let my process run with debug logging enabled and after a very long wait the next thing it spit out was:

Debug: Exec[diamond_sudo_for_puppet_sudo_linting](provider=posix): Executing check
 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/visudo -cf /etc/s
udoers.d/50_diamond_sudo_for_puppet || exit 0'


Here's a bit more context around the pause:
Debug: Failed to load library 'msgpack' for feature 'msgpack'
Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw
Debug: Exec[migrate legacy files](provider=posix): Executing check '/usr/bin/test
-d /mnt/vagrant'
Debug: Executing '/usr/bin/test -d /mnt/vagrant'
#
# ... pauses here for at least 10 minutes, possibly longer
#
Debug: Exec[diamond_sudo_for_puppet_sudo_linting](provider=posix): Executing check 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/visudo -cf /etc/s
udoers.d/50_diamond_sudo_for_puppet || exit 0'
Debug: Executing 'test -e /etc/sudoers.d/50_diamond_sudo_for_puppet && /usr/sbin/v
isudo -cf /etc/sudoers.d/50_diamond_sudo_for_puppet || exit 0'
Debug: /Stage[main]/Role::Labs::Instance/Diamond::Collector::Minimalpuppetagent[mi
nimal-puppet-agent]/Admin::Sudo[diamond_sudo_for_puppet]/Exec[diamond_sudo_for_pup
pet_sudo_linting]/unless: /etc/sudoers.d/50_diamond_sudo_for_puppet: parsed OK
Debug: Failed to load library 'msgpack' for feature 'msgpack'
Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw
Debug: Failed to load library 'msgpack' for feature 'msgpack'


This also seems like a red herring. I'll upload the full --debug log so others can take a look at it.
Comment 3 Bryan Davis 2014-09-18 01:15:24 UTC
Created attachment 16509 [details]
puppet agent --debug output

Output of:

sudo puppet agent --onetime --verbose --no-daemonize --splay --splaylimit 59 --show_diff --debug

On sul-test.eqiad.wmflabs ending near Thu Sep 18 01:14:52 UTC 2014
Comment 4 Bryan Davis 2014-09-18 02:23:26 UTC
Possibly related to upstream puppet bug <https://tickets.puppetlabs.com/browse/PUP-2924>
Comment 5 Bryan Davis 2014-09-18 04:09:31 UTC
I ran another test with this command:

$ TZ=UTC strace /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose --no-daemonize --no-splay --debug 2>&1 | tee /tmp/loud-puppet-strace.log

This is nuts:

$ grep stat\( loud-puppet-strace.log | grep git::clone | wc -l
2293677

$ grep stat\( loud-puppet-strace.log | grep :: | grep .rb | wc -l
2329047

$ grep stat\( loud-puppet-strace.log | grep :: | sed -n 's/^.*\/type\/\([^"]*\).*/\1/p' | sort | uniq -c
    567 admin::sudo.rb
    162 admissible_admin::sudo.rb
    972 admissible_apt::conf.rb
    324 admissible_apt::pin.rb
    162 admissible_apt::puppet.rb
    162 admissible_apt::repository.rb
    324 admissible_base::syslogs::syslogs::readable.rb
    162 admissible_diamond::collector::minimalpuppetagent.rb
    648 admissible_diamond::collector.rb
    162 admissible_ferm::rule.rb
    162 admissible_generic::upstart_job.rb
    162 admissible_git::clone.rb
    162 admissible_labs_lvm::volume.rb
   1134 admissible_nrpe::check.rb
   1134 admissible_nrpe::monitor_service.rb
    162 admissible_rsyslog::conf.rb
    324 admissible_salt::grain.rb
    324 admissible_sysctl::conffile.rb
    324 admissible_sysctl::parameters.rb
   3402 apt::conf.rb
    918 apt::pin.rb
    891 apt::puppet.rb
    729 apt::repository.rb
    702 base::syslogs::syslogs::readable.rb
    162 completed_admin::sudo.rb
    972 completed_apt::conf.rb
    324 completed_apt::pin.rb
    162 completed_apt::puppet.rb
    162 completed_apt::repository.rb
    324 completed_base::syslogs::syslogs::readable.rb
    162 completed_diamond::collector::minimalpuppetagent.rb
    648 completed_diamond::collector.rb
    162 completed_ferm::rule.rb
    162 completed_generic::upstart_job.rb
    162 completed_git::clone.rb
    162 completed_labs_lvm::volume.rb
   1134 completed_nrpe::check.rb
   1134 completed_nrpe::monitor_service.rb
    162 completed_rsyslog::conf.rb
    324 completed_salt::grain.rb
    324 completed_sysctl::conffile.rb
    324 completed_sysctl::parameters.rb
    513 diamond::collector::minimalpuppetagent.rb
   1782 diamond::collector.rb
    459 ferm::rule.rb
    405 generic::upstart_job.rb
2293353 git::clone.rb
    756 labs_lvm::volume.rb
   3024 nrpe::check.rb
   4347 nrpe::monitor_service.rb
    432 rsyslog::conf.rb
   1107 salt::grain.rb
    864 sysctl::conffile.rb
    864 sysctl::parameters.rb
Comment 6 Bryan Davis 2014-09-18 05:27:27 UTC
Another debug run done with:

$ TZ=UTC /usr/bin/ruby /usr/bin/puppet agent --onetime --verbose --no-daemonize --no-splay --debug --trace --evaltrace --noop 2>&1 | tee /tmp/puppet-noop.log

This one shows the long pause where nothing is logged (which lines up with the massive number of stat calls in other runs) as happening between this block:

^[[0;32mInfo: Git::Clone[vagrant]: Starting to evaluate the resource^[[0m
^[[0;32mInfo: Git::Clone[vagrant]: Evaluated in 0.01 seconds^[[0m

And this block:

^[[0;32mInfo: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Starting to evaluate the resource^[[0m
^[[0;32mInfo: /Stage[main]/Labs_vagrant/File[/srv/vagrant]: Evaluated in 0.00 seconds^[[0m

This is potentially more revealing of what is going on. The File[/srv/vagrant] resource does a recursive chown on the directory where MWV has been checked out. It also requires Git::Clone['vagrant']. My new guess at what is happening is that under the hood puppet creates a separate file resource for each file that is found in the recursive directory search. Each of these in turn inherits the require constraint. As puppet augments it's DAG with these nodes each one tries to figure out if Git::Clone is a type which leads to the insane number of stat calls.

I think there are a couple of things that could be done here (besides some upstream patches to be smarter about things in general), the require on the file could be switched to be a before on the clone. My guess is that will make a big difference in the number of stat calls. The other additional thing that could be done would be to change the recursive ownership management to an exec of a find command to switches the ownership of files under the target directory. Having seen what this seems to be doing I think that would actually end up being quite a bit faster than letting puppet make a ton of ruby objects (1 for each file?) and evaluate them in series.
Comment 7 Gerrit Notification Bot 2014-09-18 17:51:00 UTC
Change 161276 had a related patch set uploaded by BryanDavis:
Stop making so many virtual resources

https://gerrit.wikimedia.org/r/161276
Comment 8 Gerrit Notification Bot 2014-09-18 18:01:31 UTC
Change 161276 merged by Andrew Bogott:
Stop making so many virtual resources

https://gerrit.wikimedia.org/r/161276
Comment 9 Bryan Davis 2014-09-18 18:03:41 UTC
(In reply to Bryan Davis from comment #4)
> Possibly related to upstream puppet bug
> <https://tickets.puppetlabs.com/browse/PUP-2924>

The upstream bug was supposedly fixed with <https://github.com/puppetlabs/puppet/pull/2979> but still the behavior we used was silly.
Comment 10 Bryan Davis 2014-09-18 20:21:20 UTC
After this patch landed, /var/log/puppet.log says:

  Notice: Finished catalog run in 60.99 seconds

Before that every puppet run from cron was killed with "Notice: Caught TERM; calling stop" before actually finishing.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links