Last modified: 2014-07-04 16:18:14 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T63413, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 61413 - New instances are stuck in "The certificate retrieved from the master does not match the agent's private key."


Summary:	New instances are stuck in "The certificate retrieved from the master does no...

Status:	RESOLVED WONTFIX

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	General (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized blocker
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:	60112

Reported:	2014-02-15 08:30 UTC by Tim Landscheidt
Modified:	2014-07-04 16:18 UTC (History)
CC List:	3 users (show)


Description Tim Landscheidt 2014-02-15 08:30:56 UTC

I created a (vanilla, no configuration, smallest image) instance yesterday ("icinga-scfc-test") and the initial Puppet run didn't finish.  So I waited several hours, but still no luck.  Manual Puppet runs ("sudo puppetd -tv") showed:

| err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
| Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
| To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
| On the master:
|   puppet cert clean i-00000906.pmtpa.wmflabs
| On the agent:
|   rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
|   puppet agent -t

I deleted the instance, created another one ("icinga-scfc-test2"), and ran into the same situation again.  I deleted the instance, created a bigger one ("icinga-scfc-test3"), and the error occurred there as well (after waiting several hours in each case).

I reported this error in December (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1976), but then it apparently resolved itself after waiting (cf. http://permalink.gmane.org/gmane.org.wikimedia.labs/1977), while now there doesn't seem to be any light at the end of the tunnel.  (i-00000906 is not (and was not then) the name of any of the created instances, but refers to labs-vmbuilder-precise.)

Comment 1 Andrew Bogott 2014-02-15 10:47:13 UTC

I just now tried creating a new instance, and it came up fine, and the puppet cert worked.  I do see the error on icinga-scfc-test3, though.

I want to think that this is some kind of occasional error that happens when there's an ID collision or when an old ID is used.  But the fact that it's complaining about an ID different from the instance is very strange.  I'll investigate further...  in the meantime, though, if you create yet another instance, most likely it'll work :/

Comment 2 Andrew Bogott 2014-02-15 10:57:00 UTC

OK, on a working instance:

# ls -ltra /var/lib/puppet/ssl/certs
total 16
-rw-r--r-- 1 puppet puppet  847 Feb 15 08:42 ca.pem
-rw-r--r-- 1 puppet puppet  883 Feb 15 08:43 i-00000a65.pmtpa.wmflabs.pem

On icinga-scfc-test3:

# ls -ltra /var/lib/puppet/ssl/certs
total 20
-rw-r--r-- 1 puppet puppet  847 Feb 14 21:31 ca.pem
-rw-r----- 1 puppet puppet  883 Feb 14 21:32 i-00000a64.pmtpa.wmflabs.pem
-rw-r--r-- 1 puppet puppet  883 Feb 14 21:35 i-00000906.pmtpa.wmflabs.pem

Now my theory is that early in its life an instance thinks that its ID is i-00000906 (inherited by mistake from the original image build), and that if a user forces a puppet run during that early stage it tries to create a cert for the wrong ID and is forever after doomed.  Is that possibly what happened here?  Changing the certname in /etc/puppet/puppet.conf to the actual instance ID seems to resolve the problem.
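
The difference between the two listings above can be checked mechanically. The following is only a minimal sketch: `stray_certs` is a hypothetical helper (not part of Puppet), and it assumes that i-00000a64 is the real ID of icinga-scfc-test3, which the listing suggests but does not prove.

```python
def stray_certs(cert_files, instance_fqdn):
    """Return .pem files that belong to neither the CA nor this instance."""
    expected = {"ca.pem", instance_fqdn + ".pem"}
    return sorted(
        name for name in cert_files
        if name.endswith(".pem") and name not in expected
    )


# File names copied from the icinga-scfc-test3 listing above.
listing = [
    "ca.pem",
    "i-00000a64.pmtpa.wmflabs.pem",
    "i-00000906.pmtpa.wmflabs.pem",
]

# Assumption: i-00000a64 is the instance's own ID.
print(stray_certs(listing, "i-00000a64.pmtpa.wmflabs"))
```

On the working instance the same check would return an empty list, since only ca.pem and the instance's own certificate are present.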

(Another possibility, testing a weaker theory -- were specific puppet classes selected via the wikitech GUI before this instance was able to complete a puppet run?)

Comment 3 Tim Landscheidt 2014-02-15 11:53:28 UTC

Let's start with the last bit: No, I didn't even open the configuration field :-).

I usually run "sudo puppetd -tv" on the first login just because the initial motd is still the Ubuntu one.

But just now I created another instance ("ici"; the web form is really quick to react to Enter keys :-)), logged in, looked for a Puppet agent running ("ps auxfwww | fgrep puppet"), found none, looked in /var/lib/puppet/ssl/certs:

| scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
| total 8
| -rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
| scfc@ici:~$

and ran Puppet - boom:

| scfc@ici:~$ sudo puppetd -tv
| info: Creating a new SSL key for i-00000906.pmtpa.wmflabs
| info: Caching certificate for i-00000906.pmtpa.wmflabs
| err: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
| Certificate fingerprint: 05:91:9E:EE:6C:28:8B:24:FE:19:39:66:03:93:6C:44
| To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
| On the master:
|   puppet cert clean i-00000906.pmtpa.wmflabs
| On the agent:
|   rm -f /var/lib/puppet/ssl/certs/i-00000906.pmtpa.wmflabs.pem
|   puppet agent -t

| Exiting; failed to retrieve certificate and waitforcert is disabled
| scfc@ici:~$

Afterwards:

| scfc@ici:~$ sudo ls -l /var/lib/puppet/ssl/certs
| total 12
| -rw-r--r-- 1 puppet puppet 847 Feb 15 11:37 ca.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:46 i-00000906.pmtpa.wmflabs.pem
| -rw-r----- 1 puppet puppet 883 Feb 15 11:38 i-00000a66.pmtpa.wmflabs.pem
| scfc@ici:~$

So I created another instance ("ici2"), didn't do anything but looked at /etc/puppet/puppet.conf:

| certname = i-00000906.pmtpa.wmflabs

Should that really be there?
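
A quick way to spot this on a fresh instance would be to compare the certname line against the instance's own ID. This is a rough sketch under stated assumptions: `read_certname` and `certname_matches` are hypothetical helpers, the configuration is passed in as text rather than read from /etc/puppet/puppet.conf, and the expected ID used below is the one from "ici" (i-00000a66), chosen purely for illustration because the ID of "ici2" is not given here.

```python
import re


def read_certname(puppet_conf_text):
    """Return the value of the first uncommented 'certname = ...' line, or None."""
    for line in puppet_conf_text.splitlines():
        match = re.match(r"\s*certname\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


def certname_matches(puppet_conf_text, instance_fqdn):
    """True if puppet.conf names this instance, False otherwise."""
    return read_certname(puppet_conf_text) == instance_fqdn


# The line found in /etc/puppet/puppet.conf on the new instance.
conf = "certname = i-00000906.pmtpa.wmflabs\n"

# Illustrative expected ID only (taken from "ici", not "ici2").
print(read_certname(conf))
print(certname_matches(conf, "i-00000a66.pmtpa.wmflabs"))
```

A result of False here means the agent will request a certificate under the wrong name, which is what the failed run above shows.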

Comment 4 Andrew Bogott 2014-02-15 14:59:10 UTC

I've looked at this quite a bit now, but still have no good solution.  The problem seems limited to this particular project.  I suspect that the very first step of instance startup ('firstboot.sh') is not running, since that should set up puppet.conf properly.  That or there's some kind of early LDAP failure.  I'll look at this more as soon as I have a chance.

Comment 5 Andrew Bogott 2014-02-17 17:04:37 UTC

So... here's what I think is happening:

1)  Puppet can't run on instances in the 'nagios' project.  I think this is because of a name conflict... there seems to be a new /etc/sudoers.d/nagios file defined in production puppet which collides with the standard /etc/sudoers.d/<projectname> sudoers file.  (All of this is speculation, I haven't looked for the offending class yet.)

2)  Being unable to complete a puppet run, puppet.conf was never updated by puppet.

3)  #2 shouldn't have mattered because in theory our image automatically sets up puppet.conf.  But the image was broken due to the issue fixed in https://gerrit.wikimedia.org/r/#/c/113788/

With that fix, new instances now throw the following error, which is what led me to speculate about step 1:

err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/etc/sudoers.d/nagios] is already defined in file /etc/puppet/manifests/sudo.pp at line 11; cannot redefine at /etc/puppet/manifests/sudo.pp:23 on node i-00000a70.pmtpa.wmflabs
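
The suspected collision can be illustrated outside Puppet. The sketch below is not how sudo.pp works internally; it only models the speculation in step 1, namely that a fixed /etc/sudoers.d/nagios file and the per-project /etc/sudoers.d/<projectname> file resolve to the same path. `sudoers_collisions` and the project name "exampleproject" are made up for illustration.

```python
def sudoers_collisions(static_files, project_names):
    """Return projects whose per-project sudoers path is already defined elsewhere."""
    defined = set(static_files)
    return sorted(
        project for project in project_names
        if "/etc/sudoers.d/" + project in defined
    )


# The statically defined file named in the error message.
static = ["/etc/sudoers.d/nagios"]

# 'nagios' is the affected project; 'exampleproject' is a hypothetical one.
print(sudoers_collisions(static, ["nagios", "exampleproject"]))
```

Only a project whose name matches an already-defined file name is affected, which would fit the observation that the problem seems limited to this one project.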

Comment 6 Andre Klapper 2014-03-19 13:58:15 UTC

Assuming this is still a problem, does somebody plan to work on this or is everybody busy with the Eqiad migration?

Comment 7 Andrew Bogott 2014-03-19 14:03:34 UTC

I'm not actively working on it.  The easy fix is to not have a project called 'nagios' :)

So far no one has claimed ownership of the 'nagios' project, which means it will probably be shut down in the migration, at which point this will be largely moot I think.

Comment 8 Andre Klapper 2014-05-17 17:32:37 UTC

(In reply to Andrew Bogott from comment #7)
> So far no one has claimed ownership of the 'nagios' project, which means it
> will probably be shut down in the migration

Do you know if this happened, and does this make this ticket obsolete?

Comment 9 Andre Klapper 2014-07-04 11:44:35 UTC

abogott: Do you know if this happened, and does this make this ticket obsolete?

Comment 10 Andrew Bogott 2014-07-04 16:18:14 UTC

Yes, I just now deleted the 'nagios' project as it had no instances.
