Last modified: 2014-04-04 22:04:27 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from bug reports and their history, links might be broken. See T65470, the corresponding Phabricator task, for complete and up-to-date bug report information.

Bug 63470 - analytics1012 fails Hadoop applications and jobs


Summary:	analytics1012 fails Hadoop applications and jobs

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	General/Unknown
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-04-03 10:13 UTC by christian
Modified:	2014-04-04 22:04 UTC
CC List:	4 users

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Description christian 2014-04-03 10:13:14 UTC

When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems applications failed if and
only if they were started on analytics1012:8042 [2].

I also checked about a dozen succeeded applications (hence started on
nodes other than analytics1012:8042), and their subordinate MapReduce
jobs again failed if and only if they were run on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042 ?
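
The "failed if and only if started on analytics1012:8042" observation can be checked mechanically once the application list is available as records. The following is only a minimal sketch: the function names and the (application id, node, status) tuple layout are assumptions made for illustration, the status strings are assumed labels, and the sample records are the few that this report spells out.

```python
from collections import defaultdict


def outcomes_by_node(apps):
    """Count final statuses per node.

    `apps` is an iterable of (application_id, node, final_status) tuples.
    This tuple layout is an assumption made for this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for _app_id, node, status in apps:
        counts[node][status] += 1
    return {node: dict(statuses) for node, statuses in counts.items()}


def fails_iff_on_node(apps, node):
    """True if every app on `node` failed and no app elsewhere failed."""
    return all((n == node) == (status == "FAILED") for _id, n, status in apps)


# Records taken from this report; "FAILED"/"SUCCEEDED" are assumed labels.
sample = [
    ("application_1387838787660_2843", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2837", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2796", "analytics1015:8042", "SUCCEEDED"),
]

summary = outcomes_by_node(sample)
suspect = fails_iff_on_node(sample, "analytics1012:8042")
```

With the sample above, `suspect` is True, and `summary` shows the per-node tallies that back it up.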



[1] http://analytics1010.eqiad.wmnet:8088/cluster


[2] The URLs for the corresponding failed applications are
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786


[3] So for example application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. But it had one failed map
attempt, which was again on analytics1012:8042 [5].

Such failed subordinate MapReduce jobs on analytics1012:8042 fail
with notes about timeouts, for example:
  AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs
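
To tie such a timeout note back to its job, the attempt ID can be split into its parts. This is a rough sketch, assuming the note always has the shape shown above; the regular expressions, the function name, and the returned dictionary keys are made up for illustration.

```python
import re

# attempt_<cluster timestamp>_<job sequence>_<m|r>_<task number>_<attempt number>
ATTEMPT_RE = re.compile(
    r"attempt_(?P<cluster_ts>\d+)_(?P<job_seq>\d+)"
    r"_(?P<task_type>[mr])_(?P<task>\d+)_(?P<attempt>\d+)"
)
TIMEOUT_RE = re.compile(r"Timed out after (?P<secs>\d+) secs")


def parse_timeout_note(line):
    """Return the parts of a timeout note, or None if the line does not match."""
    a = ATTEMPT_RE.search(line)
    t = TIMEOUT_RE.search(line)
    if a is None or t is None:
        return None
    return {
        # Keep the job sequence as text so any zero padding survives.
        "job_id": f"job_{a['cluster_ts']}_{a['job_seq']}",
        "task_type": "map" if a["task_type"] == "m" else "reduce",
        "task": int(a["task"]),
        "attempt": int(a["attempt"]),
        "timeout_secs": int(t["secs"]),
    }


note = "AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs"
parsed = parse_timeout_note(note)
```

For the note quoted above, `parsed["job_id"]` is `job_1387838787660_2796`, the same job that the job history URL in [5] refers to.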


[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796


[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED

Comment 1 Bingle 2014-04-03 10:15:23 UTC

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1522

Comment 2 christian 2014-04-03 10:49:29 UTC

Bug 63472 might be related.

Comment 3 Toby Negrin 2014-04-03 15:03:16 UTC

This matches some anecdotal evidence from Oliver that there were problems with the analytics1012 node.

Diederik updated the java version IIRC. I do not know how he made this change.

I suspect the fastest way forward with this node is to decommission it and repave it because we don't really know what Diederik did with it. Perhaps puppet can tell us if the versions are different?

Comment 4 Oliver Keyes 2014-04-03 16:44:34 UTC

(In reply to Toby Negrin from comment #3)
> This matches some anecdotal evidence from Oliver that there were problems
> with the analytics1012 node.
> 
Yep. I reported this a while ago, but it looks like the bug turned out to be a pair of bugs ("analytics1012 keeps dropping jobs" and "INSERT OVERWRITE doesn't work") and the second one masked the first.

> Diederik updated the java version IIRC. I do not know how he made this
> change.
> 

Not sure of the details, but I'm pretty sure he just went into the box and upgraded by hand.

Comment 5 Andrew Otto 2014-04-04 22:04:27 UTC

YES!  Found it.  /etc/hosts had a bad IP listed on analytics1012 for itself.  Fixed and things look much better now!
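
A check for this kind of problem could look roughly like the sketch below, which scans hosts-file text for the entries naming a given host and compares them with the address the host is expected to have. The function names are hypothetical, the fully qualified name for analytics1012 is assumed from the other hostnames in this report, and the addresses are placeholders from the documentation range, not the real ones.

```python
def ips_for_host(hosts_text, hostname):
    """Return every IP that the hosts-file text maps to `hostname`."""
    ips = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if hostname in names:
            ips.append(ip)
    return ips


def self_entry_is_correct(hosts_text, hostname, expected_ip):
    """True if the host is listed and every listing uses `expected_ip`."""
    found = ips_for_host(hosts_text, hostname)
    return bool(found) and all(ip == expected_ip for ip in found)


# Hypothetical hosts-file content; 192.0.2.x are placeholder addresses.
bad_hosts = """\
127.0.0.1   localhost
192.0.2.99  analytics1012.eqiad.wmnet analytics1012  # wrong entry for itself
"""

ok = self_entry_is_correct(bad_hosts, "analytics1012", "192.0.2.12")
```

Here `ok` is False because the listed address differs from the expected one. On a real node the text would come from reading /etc/hosts and the expected address from DNS or the inventory.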
