Last modified: 2014-10-03 08:38:04 UTC

Wikimedia Bugzilla is closed!

Wikimedia has migrated from Bugzilla to Phabricator. Bug reports are now handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and beyond displaying bug reports and their history, links may be broken. See T73431, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 71431 - deployment-rsync01 20GB hard drive is too small
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: deployment-prep (beta) (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Bryan Davis
Depends on:
Blocks:
Reported: 2014-09-29 21:26 UTC by Sam Reed (reedy)
Modified: 2014-10-03 08:38 UTC (History)
8 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
disk space percent free graph (19.82 KB, image/png)
2014-10-01 15:10 UTC, Greg Grossmeier

Description Sam Reed (reedy) 2014-09-29 21:26:22 UTC
Exactly what it says on the tin.

It's causing automatic code updates to break.
Comment 1 Antoine "hashar" Musso (WMF) 2014-09-30 07:18:47 UTC
deployment-rsync01.eqiad.wmflabs ( https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000002f4.eqiad.wmflabs ) is an m1.small instance with a 20GB disk allocation, partitioned as:

hashar@deployment-rsync01:~$ df -h -x nfs
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda1                           7.6G  2.1G  5.2G  29% /
udev                                998M   12K  998M   1% /dev
tmpfs                               401M  316K  401M   1% /run
none                                5.0M     0  5.0M   0% /run/lock
none                               1002M     0 1002M   0% /run/shm
/dev/vda2                           1.9G  647M  1.2G  36% /var
cgroups                            1002M     0 1002M   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk  8.5G  6.1G  1.9G  77% /srv

The scap process most probably filled /srv because of the l10n cache, but that has since been cleaned up.


I am not sure which files need to be cleaned up. We can do that either in scap itself or in the Jenkins job beta-scap-eqiad.
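As a generic starting point for this kind of investigation (these are standard commands, not something run in the original comment), the largest directories on the small partition can be listed before deciding what to clean up:

```shell
# Generic sketch: find what is filling a small partition such as /srv.
# Largest top-level directories first, then the free space on the mount.
du -x -d1 -h /srv 2>/dev/null | sort -rh | head
df -h /srv 2>/dev/null || df -h /
```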
Comment 2 Greg Grossmeier 2014-09-30 15:25:06 UTC
Let's not make the Jenkins beta-scap-eqiad job diverge from prod (at all).
<voice="ori">Let's make the Beta Cluster like prod, not make more hacks that are different.</voice>
Comment 3 Sam Reed (reedy) 2014-09-30 15:46:55 UTC
How long did it take to break?

I deleted a weird tmp dir, killed the whole cache dir, and re-ran sync-common, which freed up ~2G of space.

I'm wondering whether this is a one-off or whether it will break again quickly; in that case, we should reinstall it on a larger instance.
Comment 4 Antoine "hashar" Musso (WMF) 2014-10-01 13:42:56 UTC
It seems the root cause of the issue was LDAP being upgraded/unreachable intermittently over the past few days.  As a result, when Puppet runs it considers that the mwdeploy/l10nupdate users (among others) do not exist and thus creates local copies of them.  Whenever LDAP comes back, we end up with files having conflicting UIDs.  That most probably confuses rsync.

Bryan deleted the local users yesterday.  He also cleaned up some old 'common' directories which were left around, thus reclaiming a huge amount of disk space.

So it is all fixed for now.

Puppet creating local users when LDAP is unreachable is documented at https://bugzilla.wikimedia.org/show_bug.cgi?id=71480 .
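The kind of leftovers this LDAP/local-user conflict produces can be spotted with a standard `find` predicate (a hedged sketch; `/srv` is just the path relevant to this bug, adjust as needed):

```shell
# List files whose owner or group no longer resolves to a known user/group --
# exactly what deleting duplicate local accounts can leave behind.
find /srv -xdev \( -nouser -o -nogroup \) -ls 2>/dev/null | head
```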
Comment 5 Greg Grossmeier 2014-10-01 15:10:11 UTC
Created attachment 16640 [details]
disk space percent free graph

So it appears that things are stable again, disk-space-free-wise.

Also, does the drop in available disk space around Sept 11th correlate with anything we should worry about?

I'm inclined to close this bug for now if we aren't realistically going to hit the limit any time soon (and since we hit the limit this time due to an unrelated breakage we needed to catch anyway).
Comment 6 Bryan Davis 2014-10-01 15:22:14 UTC
(In reply to Antoine "hashar" Musso from comment #4)
> It seems the root cause of the issue was LDAP being
> upgraded/unreachable intermittently over the past few days.  As a result,
> when Puppet runs it considers that the mwdeploy/l10nupdate users (among
> others) do not exist and thus creates local copies of them.  Whenever LDAP
> comes back, we end up with files having conflicting UIDs.  That most
> probably confuses rsync.

This was an issue across several hosts in the beta cluster, but it turned out to be unrelated to the disk space issues on rsync01.

> Bryan deleted the local users yesterday.  He also cleaned up some old
> 'common' directories which were left around, thus reclaiming a huge amount
> of disk space.

This was the real problem. When I originally added scap deployment to beta, I found that the primary disks of all the hosts that needed copies of MediaWiki were too small to comfortably hold a full sync. I added secondary LVM volumes to all of these hosts, mounted on /srv (or made /srv a symlink to /mnt/srv if a volume was already attached at /mnt). Then I created a symlink from /usr/local/apache/common-local to /srv/common-local, where the synced tree from deployment-bastion would be stored.

Recently Ori dove into operations/puppet and started cleaning up the legacy file paths (/a/common, /usr/local/apache), replacing them with more modern locations. /usr/local/apache/common and /usr/local/apache/common-local (the former was a symlink to the latter) were replaced with /srv/mediawiki. When these changes hit beta, things mostly just worked because puppet and scap worked together to create the right content in the right place.

A side effect of this change finally bit us on rsync01. No puppet code was added to clean up the old /srv/common-local sync target, which left ~3G of files on each scap target host. For the deployment-mediawiki* hosts this was not a big deal: the secondary disk on those hosts is 68G, leaving lots of space for the new copy of everything. On deployment-rsync01, however, /srv is an 8.5G partition, so 3G is a significant chunk of the available drive space.

I have deleted /srv/common-local from all of the hosts in beta.
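The cleanup described here can be sketched roughly as follows (the paths are from the comment, but the safety guard is my addition and not part of what was actually run):

```shell
# Remove the stale scap sync target only if the legacy symlink no longer
# points at it. Paths from the bug; the readlink check is an added precaution.
old=/srv/common-local
link=/usr/local/apache/common-local
if [ -e "$old" ] && [ "$(readlink -f "$link" 2>/dev/null)" != "$old" ]; then
    rm -rf "$old"
fi
df -h /srv 2>/dev/null || true   # confirm the ~3G was reclaimed
```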
Comment 7 Antoine "hashar" Musso (WMF) 2014-10-03 08:38:04 UTC
Thanks Bryan for the detailed explanation :-)
