Last modified: 2014-09-02 10:34:17 UTC
Reported today at various times, for several domains and by several users. I also saw inconsistent output when querying and pinging the domains from different machines. After some time, individual domains started working again.

Affected domains include:
* bits.beta.wmflabs.org
* tools-login.wmflabs.org
* simple.wikipedia.beta.wmflabs.org
* icinga.wmflabs.org
* ganglia.wmflabs.org
* bots.wmflabs.org

Excerpts from IRC:
* <se4598> I can't reach bits.beta. ping: unknown host bits.beta.wmflabs.org on 2 independent PCs.
* <scfc_de> se4598: Hard to say, but looks certainly odd; IIRC "unknown host" would imply an authoritative answer from the WMF DNS server (as different from "couldn't resolve", which means the DNS server wasn't reachable). Coren rebooted wikitech about 15:20Z, and I think the LDAP server that provides the DNS records is located there as well, so that could explain some of that, but I wouldn't assume that the negative answers are cached for nearly
* <se4598> scfc_de: icinga.wmflabs.org resolves (win7) on nslookup, but not on ping. On remote unix also icinga.wmflabs.org has address 208.80.155.156 but ping: unknown host icinga.wmflabs.org
* <se4598> What does it mean when "host <.....>" doesn't give an output but simply returns? Happens to me on one machine for "host icinga.wmflabs.org"
* <Withoutaname> se4598: actually I don't know if this is relevant but downforeveryoneorjustme.com is also reporting errors
* <se4598> Withoutaname: just pinged ganglia (a not already tested domain) first try: unknown host, 3 seconds later second try: it works.

See today's IRC log from #wikimedia-labs (I can't link it because bots.wmflabs.org doesn't resolve for me at the moment).
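To make the "first try fails, second try works" behaviour easier to document, the lookups could be repeated from a script. The following is only a rough sketch: the function names (`resolve_once`, `probe`, `flapping`) and the attempt count and delay are made up for illustration. It uses `socket.gethostbyname`, which goes through the system resolver (the same path `ping` uses), so its results may differ from `host` or `nslookup`, which query the DNS server directly.

```python
import socket
import time
from typing import Callable, Dict, List, Optional

# Domains listed in this report.
HOSTS = [
    "bits.beta.wmflabs.org",
    "tools-login.wmflabs.org",
    "simple.wikipedia.beta.wmflabs.org",
    "icinga.wmflabs.org",
    "ganglia.wmflabs.org",
    "bots.wmflabs.org",
]


def resolve_once(host: str) -> Optional[str]:
    """Return one IPv4 address for host, or None if the lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return None


def probe(
    hosts: List[str],
    attempts: int = 3,          # illustrative default, not from the report
    delay: float = 3.0,         # seconds between rounds, also illustrative
    resolver: Callable[[str], Optional[str]] = resolve_once,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[Optional[str]]]:
    """Resolve every host `attempts` times and record each answer."""
    results: Dict[str, List[Optional[str]]] = {h: [] for h in hosts}
    for round_no in range(attempts):
        for h in hosts:
            results[h].append(resolver(h))
        if round_no < attempts - 1:
            sleep(delay)
    return results


def flapping(results: Dict[str, List[Optional[str]]]) -> List[str]:
    """Hosts that resolved on some attempts but failed on others."""
    return [
        h for h, answers in results.items()
        if any(a is None for a in answers) and any(a is not None for a in answers)
    ]


if __name__ == "__main__":
    res = probe(HOSTS)
    for host, answers in res.items():
        print(host, answers)
    print("intermittent:", flapping(res))
```

A host that shows up under "intermittent" resolved in at least one round and failed in at least one other, which matches what was seen with ganglia.wmflabs.org above.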
To elaborate: I'm wondering what happens (happened) when wikitech/virt* is unavailable:
- Will the DNS server query only one LDAP server, or fall back to the other? (I think there are two LDAP servers, but I may be wrong about that.)
- Do both LDAP servers have the same data?
- What does the DNS server return if no LDAP server is available?
Can you try going to the domains directly over HTTPS and report what happens? You may need to explicitly mark the certificates as trusted.
Oh, never mind, I saw ghosts of a totally different bug and jumped to conclusions. IGNORE ME!
The relevant IRC log is http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20140408.txt (hurray, I can reach bots again). Our posts related to this report are between 18:24:12 and 22:14:26 UTC (log time).
We restarted virt1000 last night (heartbleed) and that probably caused this outage due to the stupid order-of-startup bug with pdns vs. opendj. I've opened bug 63717 about that.
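For illustration only, one generic way to avoid this kind of startup race is to make the dependent service wait until the backend accepts TCP connections before it starts. The sketch below is not what runs on virt1000 and is not the fix tracked in bug 63717; the helper name `wait_for_port`, the host, and port 389 (the standard LDAP port, assumed here for opendj) are all assumptions.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0,
                  connect=socket.create_connection,
                  clock=time.monotonic, sleep=time.sleep):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if `timeout` seconds pass without a successful connect."""
    deadline = clock() + timeout
    while True:
        try:
            conn = connect((host, port), timeout=interval)
        except OSError:
            # Backend not accepting connections yet: give up after the
            # deadline, otherwise wait a bit and try again.
            if clock() >= deadline:
                return False
            sleep(interval)
        else:
            conn.close()
            return True


if __name__ == "__main__":
    # Hypothetical usage: check the LDAP port before starting the DNS server.
    if wait_for_port("localhost", 389, timeout=120.0):
        print("LDAP is accepting connections; safe to start the DNS server")
    else:
        print("LDAP did not come up in time")
```

A successful TCP connect only shows that the port is open, not that the directory is ready to answer queries, so a real check would probably need an actual LDAP query.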
I assume this has been fixed by Andrew restarting pdns in the meantime; the underlying problem will be dealt with in bug #63717.