Last modified: 2014-07-04 16:18:14 UTC
I created a (vanilla, no configuration, smallest image) instance yesterday ("icinga-scfc-test") and the initial Puppet run didn't finish. So I waited several hours, but still no luck. Manual Puppet runs ("sudo puppetd -tv") showed:
| err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
| Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
| To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
| On the master:
|   puppet cert clean i-00000906.pmtpa.wmflabs
| On the agent:
|   rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
|   puppet agent -t
I deleted the instance, created another one ("icinga-scfc-test2"), and ran into the same situation again. I deleted that instance too, created a bigger one ("icinga-scfc-test3"), and the error occurred there as well (after waiting several hours in each case). I reported this error in December (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1976), but back then it apparently resolved itself after waiting (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1977), while now there doesn't seem to be any light at the end of the tunnel. (i-00000906 is not (and was not then) the name of any of the created instances, but refers to labs-vmbuilder-precise.)
I just now tried creating a new instance, and it came up fine, and the puppet cert worked. I do see the error on icinga-scfc-test3, though. I want to think that this is some kind of occasional error that happens when there's an ID collision or when an old ID is used. But the fact that it's complaining about an ID different from the instance is very strange. I'll investigate further... in the meantime, though, if you create yet another instance, most likely it'll work :/
OK, on a working instance:
# ls -ltra /var/lib/puppet/ssl/certs
total 16
-rw-r--r-- 1 puppet puppet 847 Feb 15 08:42 ca.pem
-rw-r--r-- 1 puppet puppet 883 Feb 15 08:43 i-00000a65.pmtpa.wmflabs.pem
On icinga-scfc-test3:
# ls -ltra /var/lib/puppet/ssl/certs
total 20
-rw-r--r-- 1 puppet puppet 847 Feb 14 21:31 ca.pem
-rw-r----- 1 puppet puppet 883 Feb 14 21:32 i-00000a64.pmtpa.wmflabs.pem
-rw-r--r-- 1 puppet puppet 883 Feb 14 21:35 i-00000906.pmtpa.wmflabs.pem
Now my theory is that early in its life an instance thinks that its ID is i-00000906 (inherited by mistake from the original image build), and that if a user forces a puppet run during that early stage it tries to create a cert for the wrong ID and is forever after doomed. Is that possibly what happened here? Changing the certname in /etc/puppet/puppet.conf to the actual instance ID seems to resolve the problem. (Another possibility, testing a weaker theory -- were specific puppet classes selected via the wikitech GUI before this instance was able to complete a puppet run?)
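The certname workaround mentioned above can be sketched as a plain text rewrite. This is only an illustration, not the fix that was deployed: the helper name `set_certname` is hypothetical, and it assumes puppet.conf carries a simple `certname = ...` line like the one seen in this ticket. It works on a string and does not touch /etc/puppet/puppet.conf or any certificates.

```python
import re


def set_certname(conf_text: str, instance_fqdn: str) -> str:
    """Return conf_text with every 'certname = ...' line set to instance_fqdn.

    Hypothetical helper for illustration only; it rewrites text in memory
    and leaves all other lines unchanged.
    """
    # Keep the 'certname = ' prefix (group 1) and replace the rest of the line.
    pattern = re.compile(r"^([ \t]*certname[ \t]*=[ \t]*).*$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + instance_fqdn, conf_text)


if __name__ == "__main__":
    # Stale certname inherited from the image build, as seen on the broken instances.
    broken = "[agent]\ncertname = i-00000906.pmtpa.wmflabs\n"
    print(set_certname(broken, "i-00000a64.pmtpa.wmflabs"))
```

Any cert already generated for the stale name would presumably still need to be cleaned up as described in the Puppet error message.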
Let's start with the last bit: No, I didn't even open the configuration field :-). I usually run "sudo puppetd -tv" on the first login just because the initial motd is still the Ubuntu one. But just now I created another instance ("ici"; the web form is really quick to react to Enter keys :-)), logged in, looked for a Puppet agent running ("ps auxfwww | fgrep puppet"), found none, looked in /var/lib/puppet/ssl/certs:
| scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
| total 8
| -rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
| scfc@ici:~$
and ran Puppet - boom:
| scfc@ici:~$ sudo puppetd -tv
| info: Creating a new SSL key for i-00000906.pmtpa.wmflabs
| info: Caching certificate for i-00000906.pmtpa.wmflabs
| err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
| Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
| To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
| On the master:
|   puppet cert clean i-00000906.pmtpa.wmflabs
| On the agent:
|   rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
|   puppet agent -t
| Exiting; failed to retrieve certificate and waitforcert is disabled
| scfc@ici:~$
Afterwards:
| scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
| total 12
| -rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:46 i-00000906.pmtpa.wmflabs.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
| scfc@ici:~$
So I created another instance ("ici2"), didn't do anything but looked at /etc/puppet/puppet.conf:
| certname = i-00000906.pmtpa.wmflabs
Should that really be there?
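The manual check done here (compare the certname in puppet.conf and the files under /var/lib/puppet/ssl/certs with the instance's real ID) could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, the inputs are passed in as plain strings and lists rather than read from disk, and ca.pem is treated as the only legitimate certificate besides the instance's own.

```python
import re
from typing import List, Optional


def configured_certname(conf_text: str) -> Optional[str]:
    """Return the value of the first 'certname = ...' line, or None if absent."""
    match = re.search(r"^[ \t]*certname[ \t]*=[ \t]*(\S+)", conf_text, re.MULTILINE)
    return match.group(1) if match else None


def unexpected_certs(cert_files: List[str], expected_fqdn: str) -> List[str]:
    """List cert files that belong neither to the CA nor to expected_fqdn."""
    allowed = {"ca.pem", expected_fqdn + ".pem"}
    return [name for name in cert_files if name not in allowed]


if __name__ == "__main__":
    # Values taken from the "ici" transcript in this ticket.
    expected = "i-00000a66.pmtpa.wmflabs"
    conf = "certname = i-00000906.pmtpa.wmflabs\n"
    listing = [
        "ca.pem",
        "i-00000906.pmtpa.wmflabs.pem",
        "i-00000a66.pmtpa.wmflabs.pem",
    ]
    if configured_certname(conf) != expected:
        print("certname mismatch:", configured_certname(conf))
    print("unexpected certs:", unexpected_certs(listing, expected))
```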
I've looked at this quite a bit now, but still have no good solution. The problem seems limited to this particular project. I suspect that the very first step of instance startup ('firstboot.sh') is not running, since that should set up puppet.conf properly. That or there's some kind of early LDAP failure. I'll look at this more as soon as I have a chance.
So... here's what I think is happening:
1) Puppet can't run on instances in the 'nagios' project. I think this is because of a name conflict... there seems to be a new /etc/sudoers.d/nagios file defined in production puppet which collides with the standard /etc/sudoers.d/<projectname> sudoers file. (All of this is speculation, I haven't looked for the offending class yet.)
2) Being unable to complete a puppet run, puppet.conf was never updated by puppet.
3) #2 shouldn't have mattered because in theory our image automatically sets up puppet.conf. But the image was broken due to the issue fixed in https://gerrit.wikimedia.org/r/#/c/113788/
With that fix, new instances now throw the following error, which is what led me to speculate about step 1:
err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/etc/sudoers.d/nagios] is already defined in file /etc/puppet/manifests/sudo.pp at line 11; cannot redefine at /etc/puppet/manifests/sudo.pp:23 on node i-00000a70.pmtpa.wmflabs
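The suspected collision in step 1 can be illustrated with a small check. This is a sketch of the speculation above, not of the actual Puppet manifests: the function names are hypothetical, "tools" is a made-up second project name, and the only assumption taken from this ticket is that each project gets a sudoers file at /etc/sudoers.d/<projectname>.

```python
from typing import Iterable, List

SUDOERS_DIR = "/etc/sudoers.d"


def project_sudoers_path(project: str) -> str:
    """Path of the per-project sudoers file, following /etc/sudoers.d/<projectname>."""
    return f"{SUDOERS_DIR}/{project}"


def colliding_projects(projects: Iterable[str], other_sudoers_files: Iterable[str]) -> List[str]:
    """Return projects whose sudoers path is already declared elsewhere."""
    taken = set(other_sudoers_files)
    return [p for p in projects if project_sudoers_path(p) in taken]


if __name__ == "__main__":
    # A file declared by some other class, as suspected in this comment.
    declared_elsewhere = ["/etc/sudoers.d/nagios"]
    print(colliding_projects(["nagios", "tools"], declared_elsewhere))
```

A project name that matches an already declared file would produce the kind of "Duplicate definition" error quoted above, which is consistent with the problem being limited to the 'nagios' project.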
Assuming this is still a problem, does somebody plan to work on this or is everybody busy with the Eqiad migration?
I'm not actively working on it. The easy fix is to not have a project called 'nagios' :) So far no one has claimed ownership of the 'nagios' project, which means it will probably be shut down in the migration, at which point this will be largely moot I think.
(In reply to Andrew Bogott from comment #7)
> So far no one has claimed ownership of the 'nagios' project, which means it
> will probably be shut down in the migration
Do you know if this happened, and does this make this ticket obsolete?
abogott: Do you know if this happened, and does this make this ticket obsolete?
Yes, I just now deleted the 'nagios' project as it had no instances.