Last modified: 2014-04-21 18:40:45 UTC
Created attachment 15055: diagram showing failed connection attempts of some jobs around 2014-04-08

Sporadically, an attempt of a Hadoop task fails with an error message like

Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010

See for example http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED .

The failed attempts are correctly restarted by Hadoop and eventually succeed. But as the cluster is now pretty clean and not under heavy load from other jobs, I would not expect to see such failures at all.

I cannot recall having seen the error message for Hive queries; up to now, I have only seen such failed attempts in tasks of camus webrequest importer jobs. However, it does not matter whether the job is a full run importing the whole seven days' worth of mobile request traffic (e.g. job_1387838787660_2971 above) or just an import of the last hour (e.g. http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED ).

I briefly scanned the attempts of recently run applications, and there seems to be a pattern: connecting to analytics10{11,16,17} is more likely to fail than connecting to other machines [1]. I am not sure whether this is a misinterpretation, as it may be time/scheduling dependent, but it looks strange. (See attachment failures.png for dot output of the failed connection attempts.)
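For anyone who wants to repeat the scan over more attempts, here is a minimal sketch of pulling the firstBadLink address out of error messages. It assumes the messages have already been collected as plain strings (for example copied from the job history pages); the function name `first_bad_links` and the `sample` list are made up for illustration, and the pattern is based only on the single message quoted above.

```python
import re
from collections import Counter

# Pattern derived from the one error message quoted in this report;
# other Hadoop versions may word the message differently.
BAD_LINK = re.compile(
    r"Bad connect ack with firstBadLink as (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def first_bad_links(messages):
    """Count how often each (ip, port) shows up as firstBadLink."""
    counts = Counter()
    for message in messages:
        match = BAD_LINK.search(message)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Hypothetical input: the message from this report, plus an unrelated line.
sample = [
    "Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010",
    "some unrelated log line",
]

for (ip, port), n in first_bad_links(sample).most_common():
    print(ip, port, n)
```

Mapping the IP addresses back to host names is left out here, since that mapping is not part of this report.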
[1]
+---------------------------------------+---------------+---------------+
| Attempt                               | Source        | Destination   |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
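To make the suspected pattern easier to check, the table above can be tallied per host. This is only a quick sketch: the (source, destination) pairs are copied by hand from the table, and the names `FAILED`, `dest_counts` and `src_counts` are made up for illustration.

```python
from collections import Counter

# (source, destination) pairs, in the same order as the table rows.
FAILED = [
    ("analytics1013", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1019", "analytics1011"),
    ("analytics1011", "analytics1017"),
    ("analytics1011", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1015", "analytics1011"),
    ("analytics1017", "analytics1011"),
    ("analytics1011", "analytics1018"),
    ("analytics1012", "analytics1016"),
    ("analytics1020", "analytics1011"),
    ("analytics1018", "analytics1011"),
    ("analytics1013", "analytics1016"),
    ("analytics1015", "analytics1017"),
    ("analytics1015", "analytics1011"),
]

# How often each host appears on either end of a failed connection.
dest_counts = Counter(dst for _, dst in FAILED)
src_counts = Counter(src for src, _ in FAILED)

print("failures by destination:")
for host, n in dest_counts.most_common():
    print(f"  {host}: {n}")

print("failures by source:")
for host, n in src_counts.most_common():
    print(f"  {host}: {n}")
```

In this sample, 14 of the 15 failed connections go to analytics1011 (6), analytics1016 (4) or analytics1017 (4), while the sources are spread over eight hosts. With only 15 rows this is not conclusive, and it may still be a time/scheduling effect.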
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1535