Last modified: 2014-10-03 08:38:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and beyond displaying bug reports and their history, links may be broken. See T73431, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71431 - deployment-rsync01 20GB hard drive is too small
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta) (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Bryan Davis
Depends on:
Blocks:
Reported: 2014-09-29 21:26 UTC by Sam Reed (reedy)
Modified: 2014-10-03 08:38 UTC (History)
8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
disk space percent free graph (19.82 KB, image/png)
2014-10-01 15:10 UTC, Greg Grossmeier

Description Sam Reed (reedy) 2014-09-29 21:26:22 UTC
Exactly what it says on the tin.

It's causing automatic code updates to break.
Comment 1 Antoine "hashar" Musso (WMF) 2014-09-30 07:18:47 UTC
deployment-rsync01.eqiad.wmflabs ( https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000002f4.eqiad.wmflabs ) is an m1.small instance with a 20GB disk allocation, partitioned as:

hashar@deployment-rsync01:~$ df -h -x nfs
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                           7.6G  2.1G  5.2G  29% /
udev                                998M   12K  998M   1% /dev
tmpfs                               401M  316K  401M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                               1002M     0 1002M   0% /run/shm
/dev/vda2                           1.9G  647M  1.2G  36% /var
cgroups                            1002M     0 1002M   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  8.5G  6.1G  1.9G  77% /srv

The scap process most probably filled /srv because of the l10n cache, but that has since been cleaned up.


I am not sure which files need to be cleaned up. We can do that either in scap itself or in the Jenkins job beta-scap-eqiad.
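As a generic starting point for this kind of investigation (these are standard commands, not something run in the original comment), the largest directories on the small partition can be listed before deciding what to clean up:

```shell
# Generic sketch: find what is filling a small partition such as /srv.
# Largest top-level directories first, then the free space on the mount.
du -x -d1 -h /srv 2>/dev/null | sort -rh | head
df -h /srv 2>/dev/null || df -h /
```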
Comment 2 Greg Grossmeier 2014-09-30 15:25:06 UTC
Let's not make the Jenkins beta-scap-eqiad job diverge from prod (at all).
<voice="ori">Let's make the Beta Cluster like prod, not make more hacks that are different.</voice>
Comment 3 Sam Reed (reedy) 2014-09-30 15:46:55 UTC
How long did it take to break?

I deleted a weird tmp dir, killed the whole cache dir, and re-ran sync-common, which freed up ~2G of space.

I'm wondering whether this is a one-off or whether it will break again quickly; in that case, we should reinstall it on a larger instance.
Comment 4 Antoine "hashar" Musso (WMF) 2014-10-01 13:42:56 UTC
It seems the root cause of the issue was LDAP being upgraded/unreachable intermittently over the past few days.  As a result, when Puppet runs it considers that the mwdeploy/l10nupdate users (among others) do not exist and thus creates local copies of them.  Whenever LDAP comes back, we end up with files having conflicting UIDs.  That most probably confuses rsync.

Bryan deleted the local users yesterday.  He also cleaned up some old 'common' directories which were left around, thus reclaiming a huge amount of disk space.

So it is all fixed for now.

Puppet creating local users when LDAP is unreachable is documented at https://bugzilla.wikimedia.org/show_bug.cgi?id=71480 .
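The kind of leftovers this LDAP/local-user conflict produces can be spotted with a standard `find` predicate (a hedged sketch; `/srv` is just the path relevant to this bug, adjust as needed):

```shell
# List files whose owner or group no longer resolves to a known user/group --
# exactly what deleting duplicate local accounts can leave behind.
find /srv -xdev \( -nouser -o -nogroup \) -ls 2>/dev/null | head
```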
Comment 5 Greg Grossmeier 2014-10-01 15:10:11 UTC
Created attachment 16640 [details]
disk space percent free graph

So it appears that things are stable again, disk-space-free-wise.

Also, does the drop in available disk space around Sept 11th correlate with anything we should worry about?

I'm inclined to close this bug for now if we aren't realistically going to hit the limit any time soon (and since we hit the limit this time due to an unrelated breakage we needed to catch anyway).
Comment 6 Bryan Davis 2014-10-01 15:22:14 UTC
(In reply to Antoine "hashar" Musso from comment #4)
> It seems the root cause of the issue was LDAP being
> upgraded/unreachable intermittently over the past few days.  As a result,
> when Puppet runs it considers that the mwdeploy/l10nupdate users (among
> others) do not exist and thus creates local copies of them.  Whenever LDAP
> comes back, we end up with files having conflicting UIDs.  That most
> probably confuses rsync.

This was an issue across several hosts in the beta cluster, but it turned out to be unrelated to the disk space issues on rsync01.

> Bryan deleted the local users yesterday.  He also cleaned up some old
> 'common' directories which were left around, thus reclaiming a huge amount
> of disk space.

This was the real problem. When I originally added scap deployment to beta, I found that the primary disks of all the hosts that needed copies of MediaWiki were too small to comfortably hold a full sync. I added secondary LVM volumes to all of these hosts, mounted on /srv (or made /srv a symlink to /mnt/srv if a volume was already attached at /mnt). Then I created a symlink from /usr/local/apache/common-local to /srv/common-local, where the synced tree from deployment-bastion would be stored.

Recently Ori dove into operations/puppet and started cleaning up the legacy file paths (/a/common, /usr/local/apache), replacing them with more modern locations. /usr/local/apache/common and /usr/local/apache/common-local (the former was a symlink to the latter) were replaced with /srv/mediawiki. When these changes hit beta, things mostly just worked because puppet and scap worked together to create the right content in the right place.

A side effect of this change finally bit us on rsync01. No puppet code was added to clean up the old /srv/common-local sync target, which left ~3G of files on each scap target host. For the deployment-mediawiki* hosts this was not a big deal: the secondary disk on those hosts is 68G, leaving lots of space for the new copy of everything. On deployment-rsync01, however, /srv is an 8.5G partition, so 3G is a significant chunk of the available drive space.

I have deleted /srv/common-local from all of the hosts in beta.
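The cleanup described here can be sketched roughly as follows (the paths are from the comment, but the safety guard is my addition and not part of what was actually run):

```shell
# Remove the stale scap sync target only if the legacy symlink no longer
# points at it. Paths from the bug; the readlink check is an added precaution.
old=/srv/common-local
link=/usr/local/apache/common-local
if [ -e "$old" ] && [ "$(readlink -f "$link" 2>/dev/null)" != "$old" ]; then
    rm -rf "$old"
fi
df -h /srv 2>/dev/null || true   # confirm the ~3G was reclaimed
```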
Comment 7 Antoine "hashar" Musso (WMF) 2014-10-03 08:38:04 UTC
Thanks Bryan for the detailed explanation :-)
