Last modified: 2014-04-04 22:04:27 UTC

This is a read-only archive of Wikimedia Bugzilla. See T65470, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 63470 - analytics1012 fails Hadoop applications and jobs
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody
Reported: 2014-04-03 10:13 UTC by christian
Modified: 2014-04-04 22:04 UTC (History)

Description christian 2014-04-03 10:13:14 UTC
When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems applications failed if and
only if they were started on analytics1012:8042 [2].

I also checked about a dozen succeeded applications (which hence were
started on nodes other than analytics1012:8042), and their subordinate
MapReduce jobs again failed if and only if they ran on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?
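
The "failed if and only if started on analytics1012:8042" observation can be checked mechanically once the node and outcome of each application are written down. The following is only a minimal sketch: the record layout, the function names, and the sample data are assumptions for illustration and were not taken from the cluster.

```python
from collections import defaultdict

def failure_counts(records):
    """Return {node: (failed, total)} for (node, succeeded) records."""
    counts = defaultdict(lambda: [0, 0])
    for node, succeeded in records:
        counts[node][1] += 1
        if not succeeded:
            counts[node][0] += 1
    return {node: tuple(c) for node, c in counts.items()}

def only_failing_node(records):
    """Return the one node where every run failed while every other node
    had no failures at all, or None if the data shows no such pattern."""
    stats = failure_counts(records)
    bad = [n for n, (failed, total) in stats.items() if failed == total]
    clean = [n for n, (failed, total) in stats.items() if failed == 0]
    if len(bad) == 1 and len(bad) + len(clean) == len(stats):
        return bad[0]
    return None

# Hypothetical sample: (node the application was started on, succeeded?)
sample = [
    ("analytics1012:8042", False),
    ("analytics1015:8042", True),
    ("analytics1012:8042", False),
    ("analytics1015:8042", True),
]
print(only_failing_node(sample))
```

A result of None means failures are spread over several nodes or mixed with successes on the same node, in which case a single bad host is a less likely explanation.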



[1] http://analytics1010.eqiad.wmnet:8088/cluster


[2] The URLs for the corresponding failed applications are
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786


[3] For example, application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. But it had one failed map
attempt, which was again on analytics1012:8042 [5].

Such failed subordinate MapReduce jobs on analytics1012:8042 fail
with notes about timeouts, for example:
  AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs
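
To collect such timeout notes across many attempts, the diagnostic line can be split into its parts. This is a rough sketch that assumes every line follows the shape of the example above; the function name and the field names in the returned dictionary are made up for illustration.

```python
import re

# Pattern modelled on the single diagnostic line quoted above.
ATTEMPT_RE = re.compile(
    r"AttemptID:attempt_(?P<cluster_ts>\d+)_(?P<job_seq>\d+)_(?P<task_type>[mr])_"
    r"(?P<task_num>\d+)_(?P<attempt_num>\d+) Timed out after (?P<secs>\d+) secs"
)

def parse_timeout(line):
    """Return the parts of a timeout diagnostic, or None if it does not match."""
    m = ATTEMPT_RE.search(line)
    if m is None:
        return None
    return {
        "job_id": f"job_{m['cluster_ts']}_{m['job_seq']}",
        "task_type": "map" if m["task_type"] == "m" else "reduce",
        "task_num": int(m["task_num"]),
        "attempt_num": int(m["attempt_num"]),
        "timeout_secs": int(m["secs"]),
    }

line = "AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs"
print(parse_timeout(line))
```

The derived job ID matches the job referenced in [5], which makes it easy to join these lines against the job history pages.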


[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796


[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED
Comment 1 Bingle 2014-04-03 10:15:23 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1522
Comment 2 christian 2014-04-03 10:49:29 UTC
Bug 63472 might be related.
Comment 3 Toby Negrin 2014-04-03 15:03:16 UTC
This matches some anecdotal evidence from Oliver that there were problems with the analytics1012 node.

Diederik updated the java version IIRC. I do not know how he made this change.

I suspect the fastest way forward with this node is to decommission it and repave it because we don't really know what Diederik did with it. Perhaps Puppet can tell us if the versions are different?
Comment 4 Oliver Keyes 2014-04-03 16:44:34 UTC
(In reply to Toby Negrin from comment #3)
> This matches some anecdotal evidence from Oliver that there were problems
> with the analytics1012 node.
> 
Yep. I reported this a while ago, but it looks like the bug turned out to be a pair of bugs ("analytics1012 keeps dropping jobs" and "INSERT OVERWRITE doesn't work") and the second one masked the first.

> Diederik updated the java version IIRC. I do not know how he made this
> change.
> 

Not sure of the details, but I'm pretty sure he just went into the box and upgraded by hand.
Comment 5 Andrew Otto 2014-04-04 22:04:27 UTC
YES!  Found it.  /etc/hosts had a bad IP listed on analytics1012 for itself.  Fixed and things look much better now!
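
A check along the following lines could catch this kind of stale self-entry on other nodes. It is a sketch only: the function names are invented here, the expected address has to come from a trusted source such as DNS or the inventory, and the addresses in the sample are documentation placeholders (192.0.2.0/24), not the real ones from analytics1012.

```python
def ips_for_host(hosts_text, hostname):
    """Return every IP that an /etc/hosts-style text maps to hostname."""
    ips = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            ips.append(ip)
    return ips

def self_entry_ok(hosts_text, hostname, expected_ip):
    """True if no hosts entry for hostname disagrees with expected_ip.
    A host with no entry at all passes, since lookups then fall through."""
    return all(ip == expected_ip for ip in ips_for_host(hosts_text, hostname))

# Placeholder content; on a real node this would be read from /etc/hosts.
sample_hosts = """\
127.0.0.1   localhost
192.0.2.99  analytics1012   # stale address (placeholder)
"""
print(ips_for_host(sample_hosts, "analytics1012"))
print(self_entry_ok(sample_hosts, "analytics1012", "192.0.2.12"))
```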
