Last modified: 2014-09-06 19:58:54 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T72076, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 70076 - Internal DNS look-ups fail every once in a while
Status: REOPENED
Product: Wikimedia Labs
Classification: Unclassified
Component: Infrastructure
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Marc A. Pelletier
Reported: 2014-08-27 02:13 UTC by Tim Landscheidt
Modified: 2014-09-06 19:58 UTC (History)

Description Tim Landscheidt 2014-08-27 02:13:24 UTC
DNS look-ups from Labs instances for the IP addresses of Labs instances fail every once in a while (all times UTC):

| tools-webproxy.eqiad.wmflabs : Aug 24 20:09:33 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 25 18:23:00 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 25 23:28:06 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| Date: Sat, 23 Aug 2014 20:27:11 +0000 (3 days, 5 hours ago)

| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name

| Date: Tue, 26 Aug 2014 06:39:11 +0000 (19 hours, 30 minutes ago)

| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
Comment 1 Bryan Davis 2014-08-27 02:30:38 UTC
This happens several times each day during the beta-scap-eqiad Jenkins job [0] as well:

ssh: Could not resolve hostname deployment-mediawiki02.eqiad.wmflabs: Name or service not known

The failing host name varies from run to run, and the job generally works if it is re-run immediately upon notification of failure.

[0]: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/
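Since an immediate re-run usually succeeds, a caller could retry the lookup itself before giving up. The following is only a minimal sketch of that idea, not what the Jenkins job or scap actually does; the function name `resolve_with_retry` and its parameters are hypothetical.

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay=1.0, resolver=socket.gethostbyname):
    """Resolve host, retrying when the lookup fails transiently.

    The resolver argument is injectable so the retry logic can be
    exercised without a real DNS server (an assumption for this sketch).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host)
        except socket.gaierror as exc:
            # "Name or service not known" and similar errors land here.
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

A retry like this only hides the symptom; it does not reduce the load on the DNS server.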
Comment 2 Tim Landscheidt 2014-08-27 15:36:29 UTC
Today the frequency of those occurrences has increased quite a bit at Tools:

| tools-webproxy.eqiad.wmflabs : Aug 27 00:03:35 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 01:25:37 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 02:25:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 03:40:40 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 06:24:43 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 11:39:49 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| tools-webproxy.eqiad.wmflabs : Aug 27 14:25:02 : diamond : unable to resolve host tools-webproxy.eqiad.wmflabs

| Date: Wed, 27 Aug 2014 07:05:11 +0000 (8 hours, 15 minutes ago)

| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name

I don't recall hosts other than tools-master and tools-webproxy being significantly affected in the past. Since tools-webproxy probably does a lot of lookups for its log files, my assumption is that the DNS server or OpenStack's or Ubuntu's network layer throttles queries per client host.

It may be worth a test to install a caching dnsmasq locally to see if that solves the problem.  In that case, non-cached queries need to be forwarded to the Labs DNS server so that the special rewrites in openstack::network-service are honoured.
Comment 3 Marc A. Pelletier 2014-08-27 17:00:27 UTC
dnsmasq is a [bleep]ing piece of unreliable [bleep] that crumbles under the lightest load.  I've been meaning to have a real DNS server in labs for a while now, and this increase in failures just bumped that up in priority.
Comment 4 Tim Landscheidt 2014-09-02 04:33:30 UTC
I ran "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all Tools instances to get an idea of the order of magnitude between tools-master, tools-webproxy and the rest.  To display:

| scfc@tools-login:~$ sudo iptables -nvxL
| Chain INPUT (policy ACCEPT 20004 packets, 10193318 bytes)
|     pkts      bytes target     prot opt in     out     source               destination         

| Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
|     pkts      bytes target     prot opt in     out     source               destination         

| Chain OUTPUT (policy ACCEPT 19480 packets, 4383218 bytes)
|     pkts      bytes target     prot opt in     out     source               destination         
|      139     9403            udp  --  *      *       0.0.0.0/0            10.68.16.1           udp dpt:53
| scfc@tools-login:~$
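To compare instances, the packet and byte counters of that rule could be pulled out of the `iptables -nvxL` output programmatically. This is a rough sketch that assumes the output layout shown above (counters in the first two columns, the rule matching UDP port 53 to the DNS server); the function name `dns_rule_counters` is hypothetical.

```python
def dns_rule_counters(output, dns_server="10.68.16.1"):
    """Return (packets, bytes) for the UDP port 53 rule, or None if absent.

    Expects text in the format printed by `iptables -nvxL`; a leading
    "| " quote prefix on each line is tolerated.
    """
    for line in output.splitlines():
        fields = line.lstrip("| ").split()
        # Rule lines start with two numeric columns: pkts and bytes.
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        if dns_server in fields and "udp" in fields and "dpt:53" in fields:
            return int(fields[0]), int(fields[1])
    return None
```

Applied to the output above, this would yield 139 packets and 9403 bytes for tools-login.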
Comment 5 Marc A. Pelletier 2014-09-02 13:56:53 UTC
Replacing dnsmasq is... more complicated than reasonable because of the way it's being invoked and managed by Openstack.

As a first pass, I'm going to enable local caching of name resolution; I expect this will lighten the load on it by an order of magnitude or more and will make resolution more robust even if it does falter.
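To illustrate why local caching should cut the query load, here is a minimal sketch of a time-limited lookup cache. It is not how nscd is implemented or configured; the class name `CachingResolver`, the 60-second lifetime, and the `upstream_queries` counter are assumptions made for the example.

```python
import socket
import time


class CachingResolver:
    """Cache successful lookups for ttl seconds to spare the upstream server."""

    def __init__(self, resolver=socket.gethostbyname, ttl=60.0, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # host -> (address, expiry time)
        self.upstream_queries = 0

    def resolve(self, host):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[1] > now:
            # Still fresh: answer locally without touching the DNS server.
            return entry[0]
        self.upstream_queries += 1
        address = self._resolver(host)
        self._cache[host] = (address, now + self._ttl)
        return address
```

Repeated lookups of the same name within the lifetime reach the upstream server only once, which is the effect hoped for on hosts such as tools-webproxy that resolve the same few names over and over.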
Comment 6 Gerrit Notification Bot 2014-09-02 13:57:59 UTC
Change 157816 had a related patch set uploaded by coren:
Labs: provide saner nscd defaults

https://gerrit.wikimedia.org/r/157816
Comment 7 Gerrit Notification Bot 2014-09-02 15:09:32 UTC
Change 157816 merged by coren:
Labs: provide saner nscd defaults

https://gerrit.wikimedia.org/r/157816
Comment 8 Tim Landscheidt 2014-09-02 18:56:04 UTC
I reset the counters at 16:15Z because after the merge data from before and after is hard to compare :-).

I suggest that we revisit this after Thursday (2014-09-04); if the error mails have stopped then (or significantly decreased), I would consider this issue fixed.
Comment 9 Tim Landscheidt 2014-09-04 16:45:40 UTC
I haven't seen any errors since Tuesday morning, so the change of the nscd configuration seems to have fixed the issue.
Comment 10 jeremyb 2014-09-06 19:03:29 UTC
Just recurred
Comment 11 Tim Landscheidt 2014-09-06 19:58:54 UTC
(In reply to jeremyb from comment #10)
> Just recurred

To be precise:

| [...]
| Date: Sat, 06 Sep 2014 19:02:02 +0000 (50 minutes, 59 seconds ago)

| error: unable to send message to qmaster using port 6444 on host "tools-master.eqiad.wmflabs": can't resolve host name
| [...]

If that remains the only occurrence, I would still consider this fixed.
