Last modified: 2014-02-18 20:31:28 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T42514, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 40514 - "Couldn't resolve host 'ms-fe.pmtpa.wmnet'" CloudFiles errors
Status: RESOLVED WORKSFORME
Product: Wikimedia
Classification: Unclassified
Component: DNS
Version: wmf-deployment
Hardware: All All
Importance: High normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Keywords: ops
Depends on:
Blocks:
Reported: 2012-09-25 21:10 UTC by Aaron Schulz
Modified: 2014-02-18 20:31 UTC (History)

Description Aaron Schulz 2012-09-25 21:10:48 UTC
Lots of spam in swift-backend.log on fluorine, all coming from precise job runners (nothing else seems to be affected):

2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:21:46 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/514fa51b83364f1ee36033b141081616.map"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.
2012-09-25 19:22:24 mw8 zhwiki: InvalidResponseException in 'SwiftFileBackend::doGetFileStat' (given '{"src":"mwstore:\/\/local-swift\/timeline-render\/42983552dc252c7e27508bb31d6940a3.err"}'): Invalid response (0): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response.

It seems somewhat random in that running eval.php on those boxes and manually doing swift calls via CloudFiles seems to work fine.
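
To check whether the failures really are limited to particular job runners, the log lines can be grouped by reporting host and wiki. The following is a minimal Python sketch, not part of the original report: the regular expression is inferred only from the four sample lines above, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this report; the real
# swift-backend.log format may contain variants this does not match.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<wiki>\S+): "
    r".*\(curl error: (?P<code>\d+)\) "
    r"Couldn't resolve host '(?P<target>[^']+)'"
)


def summarize(lines):
    """Count resolve failures per (reporting host, wiki, unresolved host)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match["host"], match["wiki"], match["target"])
            counts[key] += 1
    return counts
```

Feeding it the lines of swift-backend.log (for example via `summarize(open(path))`) would give one counter entry per job runner and wiki, which makes it easy to see whether any non-precise host shows up.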
Comment 1 Aaron Schulz 2012-11-08 00:21:53 UTC
Looking at this and the "swift" log via fenari, it seems like these requests are not even hitting swift (or if they are, they must be dying hard enough that no response is given and nothing is logged).
Comment 2 Aaron Schulz 2012-11-08 00:52:35 UTC
The error log has died down after ms-be3 was pulled out. This might come back easily...until the replacement hardware is up.
Comment 3 Aaron Schulz 2012-11-28 22:45:18 UTC
I improved the errors for auth requests and they are now "Couldn't resolve host 'ms-fe.pmtpa.wmnet'", so these are all the same error.
Comment 4 Aaron Schulz 2012-11-29 18:15:51 UTC
Switched auth URL to an IP to avoid dns lookups for auth requests. I'll see if this works around the dns problems or pushes them down the road.
Comment 5 Aaron Schulz 2012-12-03 19:43:53 UTC
(In reply to comment #4)
> Switched auth URL to an IP to avoid dns lookups for auth requests. I'll see if
> this works around the dns problems or pushes them down the road.

The can is down the road :)

This really needs an ops person to look at.
Comment 6 Faidon Liambotis 2012-12-16 15:34:43 UTC
Sorry for not updating the ticket earlier -- I actually attempted to debug this and chatted with Aaron about it last week or the week before.

I've verified that at the time errors were spawned, DNS replies were coming into the system. Also, it's peculiar how no other infrastructure seems to be affected, not even the application server ones (this apparently affects only job runners). It's also something that has manifested recently, possibly after the precise upgrade.

I have some suspicions that it may be curl-related (curl has an internal DNS cache that is enabled by default, so it's not just simple libc resolver calls).

I've asked Aaron to isolate the code in question and produce some kind of script that we can run repeatedly to reproduce the problem under strace/gdb, rather than attaching them to random jobrunners and hoping we catch it. The issue happens on jobrunners, so it runs under php cli, and the environment shouldn't be that different.
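
As an illustration of the kind of repeatable probe described here, the sketch below resolves a host name in a loop and records which attempts fail. It is an assumption-laden stand-in, not the script that was requested: it is written in Python and goes through the libc resolver via `socket.getaddrinfo`, so it does not exercise PHP's curl bindings or curl's internal DNS cache. The function name `probe` and its parameters are hypothetical.

```python
import socket
import time


def probe(host, attempts, delay=0.0, resolve=socket.getaddrinfo):
    """Resolve host repeatedly; return (attempt index, error) for each failure.

    `resolve` is injectable so the loop can be tested without network access.
    """
    failures = []
    for attempt in range(attempts):
        try:
            resolve(host, None)
        except socket.gaierror as exc:
            # gaierror is what getaddrinfo raises on a failed lookup.
            failures.append((attempt, str(exc)))
        if delay:
            time.sleep(delay)
    return failures
```

Running something like `probe("ms-fe.pmtpa.wmnet", 10000, delay=0.01)` under strace on an affected job runner would show whether plain resolver calls ever fail there. If they never do, that would be consistent with the suspicion that the problem sits in the curl layer rather than in libc.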
Comment 7 Andre Klapper 2013-03-25 14:04:28 UTC
(In reply to comment #6 by Faidon)
> I've asked Aaron to isolate the code in question and produce some kind of
> script that we can run repeatedly, reproduce and run under strace/gdb, rather
> than trying to attach them on random jobrunners and hope we catch it.

Faidon / Aaron: Has this happened yet?
Comment 8 Andre Klapper 2013-04-25 11:45:14 UTC
(In reply to comment #6 by Faidon)
> I've asked Aaron to isolate the code in question and produce some kind of
> script that we can run repeatedly, reproduce and run under strace/gdb, rather
> than trying to attach them on random jobrunners and hope we catch it.

Faidon / Aaron: Has this happened yet?
Comment 9 Andre Klapper 2013-08-14 14:12:31 UTC
(In reply to comment #6 by Faidon)
> I've asked Aaron to isolate the code in question and produce some kind of
> script that we can run repeatedly, reproduce and run under strace/gdb, rather
> than trying to attach them on random jobrunners and hope we catch it.

Faidon / Aaron: Has this happened yet?
Comment 10 Aaron Schulz 2013-08-15 05:57:38 UTC
Tried that a long time ago, didn't work.
Comment 11 Antoine "hashar" Musso (WMF) 2013-12-15 13:30:01 UTC
This is still occurring from time to time :(

$ zgrep -c 'resolve host' swift-backend.log-201312*
swift-backend.log-20131201.gz:0
swift-backend.log-20131202.gz:0
swift-backend.log-20131203.gz:0
swift-backend.log-20131204.gz:0
swift-backend.log-20131205.gz:0
swift-backend.log-20131206.gz:0
swift-backend.log-20131207.gz:0
swift-backend.log-20131208.gz:0
swift-backend.log-20131209.gz:0
swift-backend.log-20131210.gz:51
swift-backend.log-20131211.gz:115
swift-backend.log-20131212.gz:35
swift-backend.log-20131213.gz:0
swift-backend.log-20131214.gz:0
swift-backend.log-20131215.gz:0
$
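
For tracking this over a longer period, the same per-file count can be produced programmatically. The following is a rough Python equivalent of the `zgrep -c` call above, offered as a sketch: the function name `count_matches` is invented here, and it assumes the rotated logs are gzip-compressed text.

```python
import gzip
from pathlib import Path


def count_matches(paths, needle="resolve host"):
    """Return {file name: number of lines containing needle}, like zgrep -c."""
    counts = {}
    for path in sorted(paths):
        # errors="replace" keeps the scan going if a log line is not valid UTF-8.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            counts[Path(path).name] = sum(1 for line in handle if needle in line)
    return counts
```

Calling it with `Path(".").glob("swift-backend.log-201312*")` would yield a dictionary keyed by file name, which is easier to sum or plot than the raw shell output.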
Comment 12 Aaron Schulz 2014-02-18 20:31:28 UTC
Not seeing these anymore
