Last modified: 2014-07-28 20:06:09 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70444, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 68444 - WMFLabs: Diamond not running / won't start
Status: RESOLVED FIXED
Product: Wikimedia Labs
Classification: Unclassified
Component: Infrastructure
Version: unspecified
Hardware: All All
Importance: Unprioritized critical
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-07-23 15:43 UTC by Krinkle
Modified: 2014-07-28 20:06 UTC

Description Krinkle 2014-07-23 15:43:19 UTC
Looking at Graphite, the values for the cvn instances all appear constant (CPU, memory, time since the last Puppet run, everything).

For example:
http://graphite.wmflabs.org/render/?width=578&height=289&from=00%3A00_20140723&until=23%3A45_20140723&hideLegend=false&target=cvn.*.cpu.total.user.value

Checking a local instance (e.g. cvn-dev.eqiad.wmflabs), I see that the diamond log directory has been idle for the past 16 days:

$ l /var/log/diamond/
total 77M
drwxr-xr-x  2 diamond root    4.0K Jul  7 14:04 ./
drwxr-xr-x 16 root    root    4.0K Jul 23 06:45 ../
-rw-r--r--  1 diamond nogroup 6.6M Jul  7 15:42 archive.log
-rw-r--r--  1 diamond nogroup  11M Jul  2 23:59 archive.log.2014-07-02
-rw-r--r--  1 diamond nogroup  11M Jul  3 23:59 archive.log.2014-07-03
-rw-r--r--  1 diamond nogroup  11M Jul  4 23:59 archive.log.2014-07-04
-rw-r--r--  1 diamond nogroup  10M Jul  5 23:59 archive.log.2014-07-05
-rw-r--r--  1 diamond nogroup  11M Jul  6 23:59 archive.log.2014-07-06
-rw-r--r--  1 diamond nogroup 1.4M Jul  8 17:03 diamond.log
-rw-r--r--  1 diamond nogroup  19M Jul  7 14:03 diamond.log.2014-07-06
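
A staleness check like the one done by eye above can be scripted. The following is only a minimal sketch: `log_age_days` is a hypothetical helper name, and it assumes GNU coreutils (`stat -c`), as found on the Ubuntu instances discussed here.

```shell
# Print the whole number of days since FILE was last modified.
# Assumption: GNU stat is available; log_age_days is a hypothetical helper.
log_age_days() {
  lad_mtime=$(stat -c %Y "$1") || return 1   # mtime as seconds since the epoch
  lad_now=$(date +%s)                        # current time, same unit
  echo $(( (lad_now - lad_mtime) / 86400 ))  # 86400 seconds per day
}
```

For example, `log_age_days /var/log/diamond/diamond.log` would print how many days ago the collector last wrote to its log.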

And there is no diamond process running:

$ ps -u diamond f
(empty)

$ ps aux | grep diamond | grep -v grep
(empty)

$ service diamond status
diamond stop/waiting

$ service diamond start
start: Rejected send message, 1 matched rules; type="method_call", sender=":1.49" (uid=2008 pid=6910 comm="start diamond ") interface="com.ubuntu.Upstart0_6.Job" member="Start" error name="(unset)" requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init")

$ service diamond status
diamond stop/waiting
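
The rejection message above records the caller as uid=2008, which suggests the start request was sent by a non-root user; that would be one possible reason for the rejection, separate from why the service was stopped in the first place. As a rough sketch (the helper name `sender_uid` is hypothetical), the uid can be pulled out of such a message like this:

```shell
# Read an Upstart "Rejected send message" error on stdin and print the
# uid of the sender. sender_uid is a hypothetical helper name.
sender_uid() {
  # Anchor on sender="..." so the destination's (uid=0 ...) is not matched.
  sed -n 's/.*sender="[^"]*" (uid=\([0-9]*\) .*/\1/p'
}
```

A result other than 0 means the request did not come from root, so retrying the same command with root privileges would be a reasonable next step before digging further.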



Graphite continues to register data from the instance (the last known value, repeated). That seems like a bug in the aggregator, because the instance hasn't produced any values for over 16 days.
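
Because the repeated value hides the outage, a flat series is worth flagging automatically. The sketch below assumes the render URL is fetched with `format=csv` appended, which is expected to yield one `target,timestamp,value` row per datapoint; `is_flat` is a hypothetical helper, and a flat series can of course also be legitimate.

```shell
# Succeed (exit 0) when every non-empty value in Graphite CSV output on
# stdin is identical, i.e. the series is a flat line.
# Assumption: the value is the last comma-separated field of each row.
is_flat() {
  awk -F, '
    $NF != "" {
      if (!seen) { first = $NF; seen = 1 }    # remember the first value
      else if ($NF != first) { varied = 1 }   # any different value: not flat
    }
    END { exit (seen && !varied) ? 0 : 1 }    # no data at all counts as not flat
  '
}
```

Piping the CSV output of a render request through `is_flat` and alerting on success would have surfaced the stalled cvn metrics much earlier than a visual check.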

And of course, aside from Graphite being lied to by the aggregator (which makes it hard to monitor and notice that the instance was down), why won't the diamond process start?

Puppet is running fine (no errors), and the drives are fine too:
$ df -h
Filesystem                                     Size  Used Avail Use% Mounted on
/dev/vda1                                      7.6G  1.3G  5.9G  19% /
udev                                           2.0G   12K  2.0G   1% /dev
tmpfs                                          396M  288K  396M   1% /run
none                                           5.0M     0  5.0M   0% /run/lock
none                                           2.0G     0  2.0G   0% /run/shm
/dev/vda2                                      1.9G  525M  1.3G  29% /var
labstore.svc.eqiad.wmnet:/dumps                9.1T  9.1T     0 100% /public/dumps
labstore.svc.eqiad.wmnet:/project/cvn/project   30T   17T   14T  57% /data/project
labstore.svc.eqiad.wmnet:/project/cvn/home      30T   17T   14T  57% /home
labstore.svc.eqiad.wmnet:/scratch              7.3T  2.6T  4.7T  36% /data/scratch
labstore.svc.eqiad.wmnet:/keys                 960M   39M  921M   5% /public/keys
labstore.svc.eqiad.wmnet:/backups               20T  3.0G   20T   1% /public/backups
/dev/mapper/vd-second--local--disk              29G  172M   27G   1% /srv
Comment 1 Chase 2014-07-23 15:59:54 UTC
From IRC:

<chasemp> so the "I haven't dug in" disclaimer :)
<chasemp> but I will say
<chasemp> we whitelisted projects that diamond is enabled for in puppet
<chasemp> so if it's not in the array in manifests/role/diamond.pp
<chasemp> it's expected to be stopped
<chasemp> and also one of the undesirable features of lots of the statsd implementations
<chasemp> is this continuity idea where they keep flushing stats even without a real source
<chasemp> to keep whisper files from having invalid xfactor ratios
<chasemp> which is...insane
<chasemp> but this sounds exactly like that and is a big reason I did my own thing statsd wise in the past
Comment 2 Gerrit Notification Bot 2014-07-23 16:15:07 UTC
Change 148689 had a related patch set uploaded by Krinkle:
diamond: Enable for 'cvn' project in labs

https://gerrit.wikimedia.org/r/148689
Comment 3 Gerrit Notification Bot 2014-07-28 17:07:20 UTC
Change 148689 merged by coren:
diamond: Enable for 'cvn' project in labs

https://gerrit.wikimedia.org/r/148689
Comment 4 Krinkle 2014-07-28 20:06:09 UTC
Thx.
