Last modified: 2014-04-04 22:04:27 UTC
While walking through the Hadoop applications from early April 2014 (until 2014-04-03 09:00) on [1], it seems that applications failed if and only if they were started on analytics1012:8042 [2].

I also checked about a dozen succeeded applications (hence started on nodes other than analytics1012:8042), and their subordinate MapReduce jobs again failed if and only if they ran on analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?

[1] http://analytics1010.eqiad.wmnet:8088/cluster

[2] The URLs for the corresponding failed applications are:
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786

[3] For example, application 1387838787660_2796 [4] was started on analytics1015:8042 and hence succeeded. But it had one failed map attempt, which was again on analytics1012:8042 [5]. Such failed subordinate MapReduce jobs on analytics1012:8042 fail with notes about timeouts, for example:
    AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs

[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796

[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED
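The "failed if and only if started on analytics1012:8042" observation could be checked mechanically rather than by clicking through the web UI. The following is only a rough sketch: it assumes the application records have already been fetched and parsed into dictionaries, and the field names `amHostHttpAddress` and `finalStatus` are modeled on the YARN ResourceManager REST API and should be treated as assumptions for this cluster's Hadoop version. The two sample records come from the report above; the helper names are hypothetical.

```python
from collections import defaultdict


def status_counts_by_node(apps):
    """Count final statuses per node the application master ran on."""
    counts = defaultdict(lambda: defaultdict(int))
    for app in apps:
        counts[app["amHostHttpAddress"]][app["finalStatus"]] += 1
    # Convert nested defaultdicts to plain dicts for easy printing/comparison.
    return {node: dict(statuses) for node, statuses in counts.items()}


def failed_iff_on_node(apps, node):
    """True when every app failed exactly when it was started on `node`."""
    return all(
        (app["finalStatus"] == "FAILED") == (app["amHostHttpAddress"] == node)
        for app in apps
    )


# Two records taken from the report; field names are assumptions.
SAMPLE_APPS = [
    {
        "id": "application_1387838787660_2843",
        "amHostHttpAddress": "analytics1012:8042",
        "finalStatus": "FAILED",
    },
    {
        "id": "application_1387838787660_2796",
        "amHostHttpAddress": "analytics1015:8042",
        "finalStatus": "SUCCEEDED",
    },
]

if __name__ == "__main__":
    print(status_counts_by_node(SAMPLE_APPS))
    print(failed_iff_on_node(SAMPLE_APPS, "analytics1012:8042"))
```

A per-node tally like this only shows correlation with the node; it says nothing about why attempts on that node time out.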
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1522
Bug 63472 might be related.
This matches some anecdotal evidence from Oliver that there were problems with the analytics2012 node. Diederik updated the Java version IIRC. I do not know how he made this change. I suspect the fastest way forward with this node is to decommission it and repave it, because we don't really know what Diederik did with it. Perhaps Puppet can tell us if the versions are different?
(In reply to Toby Negrin from comment #3)
> This matches some anecdotal evidence from Oliver that there were problems
> with the analytics2012 node.

Yep. I reported this a while ago, but it looks like the bug turned out to be a pair of bugs ("analytics1012 keeps dropping jobs" and "INSERT OVERWRITE doesn't work") and the second one masked the first.

> Diederik updated the java version IIRC. I do not know how he made this
> change.

Not sure of the details, but I'm pretty sure he just went into the box and upgraded by hand.
YES! Found it. /etc/hosts had a bad IP listed on analytics1012 for itself. Fixed and things look much better now!
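For anyone wanting to catch this kind of misconfiguration on other nodes, a check along these lines might help. It is a minimal sketch, not what was actually run here: it parses hosts-file text, collects the addresses listed for a hostname, and compares them against the addresses the caller says the machine really has. The function names are hypothetical, and the sample addresses are documentation-range placeholders (192.0.2.0/24), not the real ones from analytics1012.

```python
def ips_for_host(hosts_text, hostname):
    """Return every address that a hosts file maps to `hostname`."""
    ips = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            ips.append(ip)
    return ips


def self_entry_is_consistent(hosts_text, hostname, interface_ips):
    """True if all addresses listed for `hostname` belong to the machine.

    `interface_ips` is supplied by the caller (e.g. collected from the
    node's network configuration); this sketch does not discover it.
    """
    return all(ip in interface_ips for ip in ips_for_host(hosts_text, hostname))


# Placeholder data: the addresses are illustrative, not from the cluster.
BAD_HOSTS = """\
127.0.0.1   localhost
192.0.2.99  analytics1012.eqiad.wmnet analytics1012  # stale entry
"""

GOOD_HOSTS = """\
127.0.0.1   localhost
192.0.2.12  analytics1012.eqiad.wmnet analytics1012
"""

if __name__ == "__main__":
    actual = {"192.0.2.12"}
    print(self_entry_is_consistent(BAD_HOSTS, "analytics1012", actual))
    print(self_entry_is_consistent(GOOD_HOSTS, "analytics1012", actual))
```

If a node deliberately maps its own name to a loopback address, the caller would need to include that address in `interface_ips` to avoid a false alarm.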