Last modified: 2014-03-26 20:21:55 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T64771, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 62771 - WMFLabs: Auto-creation of home directories broken (new members and instances unable to login)


Summary:	WMFLabs: Auto-creation of home directories broken (new members and instances ...

Status:	RESOLVED FIXED

Product:	Wikimedia Labs
Classification:	Unclassified
Component:	General (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	High major
Target Milestone:	---
Assigned To:	Marc A. Pelletier

Reported:	2014-03-18 06:09 UTC by Krinkle
Modified:	2014-03-26 20:21 UTC
CC List:	8 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description Krinkle 2014-03-18 06:09:34 UTC

Lately, home directory creation for users has been broken.

I got errors in two scenarios:

* Added user 'rxy' as a member of the 'cvn' group.
* When he connects to a pre-existing pmtpa instance that I (krinkle) can log in to fine, he gets:

$ ssh cvn-app2.pmtpa.wmflabs
(..)
Creating directory '/home/rxy'.
Unable to create and initialize directory '/home/rxy'.

* Created a new instance in eqiad.
* When I connect to it 15 minutes after its creation, I still get:

$ ssh cvn-app3.eqiad.wmflabs
(..)
Creating directory '/home/krinkle'.
Unable to create and initialize directory '/home/krinkle'.


I've updated my .ssh/config to the most recent version of the example at https://wikitech.wikimedia.org/wiki/Help:Access#ProxyCommand, but that only made it worse (the main difference is using bastion-eqiad instead of bastion2.pmtpa).

With that config I can't even connect to it:

$ ssh cvn-app3.eqiad.wmflabs 
channel 0: open failed: connect failed: Connection timed out
ssh_exchange_identification: Connection closed by remote host

Trying manually:

$ ssh -A bastion1.eqiad.wmflabs
krinkle at bastion1.eqiad.wmflabs in ~
$ ping cvn-app3
PING cvn-app3.eqiad.wmflabs (10.68.16.170) 56(84) bytes of data.
64 bytes from cvn-app3.eqiad.wmflabs (10.68.16.170): icmp_req=1 ttl=64 time=2.26
64 bytes from cvn-app3.eqiad.wmflabs (10.68.16.170): icmp_req=2 ttl=64 time=0.71
$ ssh cvn-app3
ssh: connect to host cvn-app3 port 22: Connection timed out
$ ssh cvn-app3.eqiad.wmflabs
ssh: connect to host cvn-app3.eqiad.wmflabs port 22: Connection timed out
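
The transcript above shows a host that answers ping while TCP port 22 times out. A minimal sketch of checking that one port directly from the bastion, without waiting on a full ssh attempt, might look like the following. The function name `port_open` and its arguments are illustrative assumptions; it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any Labs tooling.
# port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT succeeds within the timeout,
# non-zero if it is refused, unreachable, or times out.
port_open() {
    local host=$1 port=$2 wait=${3:-5}
    # /dev/tcp/HOST/PORT is handled by bash itself; the connection is
    # closed again as soon as the child shell exits.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `port_open cvn-app3.eqiad.wmflabs 22 && echo reachable || echo "port 22 not reachable"` would separate "instance is up but sshd is not reachable" from a general network problem.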

Comment 1 Andrew Bogott 2014-03-19 05:02:47 UTC

The pmtpa issue is an unrelated and soon-to-be-moot gluster failure.

The eqiad issue I've seen before, but don't know how to fix (other than by waiting and rebooting.)  Perhaps Coren will have time to debug this sometime soon...

Comment 2 Daniel Zahn 2014-03-19 05:31:22 UTC

Confirmed. I had the exact same issue today with 2 newly created eqiad instances. The problem disappeared after rebooting the second one a second time (or so).

Comment 3 Marc A. Pelletier 2014-03-19 12:53:43 UTC

The nature of the problem is known (the instance attempts to mount /home and /data/project before the NFS server has updated its ACLs for it, then caches the negative result for some time), but a proper fix hasn't been found yet.

I have some ideas on how to prevent this from happening that I will be trying today.

In the meantime, rebooting at least 10 minutes after the issue occurs, then waiting at least another 20 minutes, seems to be sufficient to let the ACL time out.
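
For anyone with working access to an affected instance, a rough way to see how /home and /data/project ended up mounted is to read the kernel's mount table. The sketch below is only an illustration: the function name `mount_state` and its optional second argument (a mounts file, useful for testing) are assumptions, and it reports only the mount options, not whether the NFS server would actually grant access.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any Labs tooling.
# mount_state MOUNTPOINT [MOUNTS_FILE]
# Prints "rw", "ro", or "absent" for the given mount point, based on a
# /proc/mounts-style table (defaults to /proc/mounts).
mount_state() {
    local target=$1
    local table=${2:-/proc/mounts}
    awk -v target="$target" '
        # Field 2 is the mount point, field 4 the comma-separated options.
        # If a mount point is listed more than once, the last entry wins.
        $2 == target {
            state = "rw"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") state = "ro"
        }
        END { print (state == "" ? "absent" : state) }
    ' "$table"
}
```

Running `mount_state /home` and `mount_state /data/project` after a reboot gives a quick indication of whether the filesystems came up read-write, read-only, or not at all.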

Comment 4 Krinkle 2014-03-22 01:42:22 UTC

(In reply to Marc A. Pelletier from comment #3)
> The nature of the problem is known (the instance attempts to mount /home and
> /data/project before the NFS server has updated its ACLs for it, then caches
> the negative result for some time), but a proper fix hasn't been found yet.
> 
> I have some ideas on how to prevent this from happening that I will be
> trying today.
> 
> In the meantime, doing a reboot at least 10 minutes after the issue occurs
> then waiting at least another 20 minutes seem to be sufficient to let the
> ACL time out.

I rebooted cvn-app3 shortly after I created it, and it still wasn't working, so I reported this bug.

I rebooted it again yesterday, and again just now. Still getting:

krinkle at KrinkleMac in ~ $ ssh cvn-app3.eqiad.wmflabs 
channel 0: open failed: connect failed: Connection timed out
ssh_exchange_identification: Connection closed by remote host


This could be unrelated, but the instance has also shown no signs of life in Ganglia at any point since its creation, including during creation itself:
http://ganglia.wmflabs.org/latest/?c=cvn&h=cvn-app3

Comment 5 Marc A. Pelletier 2014-03-26 20:21:55 UTC

The race condition has been prevented for new images (that is, attempts to mount a filesystem before it has been made available rw will now fail rather than mount readonly); subsequent puppet runs will try again.

This prevents the fundamental issue (and the annoying caching that makes it so slow to clear) on new instances. Existing instances are not covered and will still require some manual manipulation.
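
The "fail now, try again on a later run" behaviour described above can be illustrated with a generic retry loop. This is only a sketch of the pattern and not how puppet or the Labs images implement it; the function name `retry_until` and its arguments are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating fail-then-retry; not the puppet logic.
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts.
# Returns 0 on the first success, 1 after MAX_ATTEMPTS failures.
retry_until() {
    local max=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if (( attempt >= max )); then
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}
```

As a usage example, `retry_until 6 300 test -w /home` would check every five minutes, up to six times, whether /home has become writable for the current user.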
