Last modified: 2014-04-21 18:40:45 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T65693, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 63693 - Attempts of Hadoop tasks randomly fail "Bad connect ack with firstBadLink as $SOME_CLUSTER_IP"
Status: NEW
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-04-08 19:15 UTC by christian
Modified: 2014-04-21 18:40 UTC (History)
CC List: 3 users



Attachments
diagram showing failed connection attempts of some jobs around 2014-04-08 (75.03 KB, image/png)
2014-04-08 19:15 UTC, christian

Description christian 2014-04-08 19:15:39 UTC
Created attachment 15055
diagram showing failed connection attempts of some jobs around 2014-04-08

Sporadically, an attempt of a Hadoop task fails with an error message
like

  Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010

See for example

  http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED

Hadoop correctly restarts the failed attempts, and they eventually
succeed. But since the cluster is now fairly clean and not under heavy
load from other jobs, I would not expect to see such failures at
all.
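
To collect the affected nodes from many task logs, the address in the
error message can be extracted mechanically. The following is only a
minimal sketch: the helper name first_bad_link and the regular
expression are illustrative assumptions, not part of Hadoop or of any
existing tooling on the cluster, and it only handles IPv4 addresses in
the form quoted above.

```python
import re

# Matches the HDFS client error quoted above. The pattern and the group
# names are assumptions made for this sketch (IPv4 "host:port" only).
BAD_LINK = re.compile(
    r"Bad connect ack with firstBadLink as "
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def first_bad_link(message):
    """Return (host, port) of the reported bad link, or None if absent."""
    match = BAD_LINK.search(message)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


line = ("Error: java.io.IOException: Bad connect ack with "
        "firstBadLink as 10.64.36.116:50010")
print(first_bad_link(line))
```

Applied to each failed attempt's log line, this yields the address the
client could not reach, which can then be mapped to a host name.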

I cannot recall having seen this error message for Hive queries; so
far, I have only seen such failed attempts in tasks of camus
webrequest importer jobs. However, it does not matter whether the job
is a full run importing the whole seven days' worth of mobile request
traffic (e.g., job_1387838787660_2971 above) or just an import of the
last hour (e.g.,

  http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED
).

I briefly scanned the attempts of recently run applications, and there
seems to be a pattern: connecting to analytics10{11,16,17} fails more
often than connecting to other machines [1]. I am not sure whether
this is a misinterpretation, as it may depend on timing or scheduling,
but it looks strange. (See attachment failures.png for dot output of
the failed connection attempts.)





[1]
+---------------------------------------+---------------+---------------+
| Attempt                               | Source        | Destination   |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
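
The apparent pattern in table [1] can be checked by tallying the
failed attempts per host. This is a rough sketch: the list below is
copied by hand from the table, and the names FAILED and tally are
illustrative only.

```python
from collections import Counter

# (source, destination) pairs copied from table [1], in row order.
FAILED = [
    ("analytics1013", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1019", "analytics1011"),
    ("analytics1011", "analytics1017"),
    ("analytics1011", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1015", "analytics1011"),
    ("analytics1017", "analytics1011"),
    ("analytics1011", "analytics1018"),
    ("analytics1012", "analytics1016"),
    ("analytics1020", "analytics1011"),
    ("analytics1018", "analytics1011"),
    ("analytics1013", "analytics1016"),
    ("analytics1015", "analytics1017"),
    ("analytics1015", "analytics1011"),
]


def tally(pairs):
    """Count failed attempts per source host and per destination host."""
    by_source = Counter(src for src, _ in pairs)
    by_destination = Counter(dst for _, dst in pairs)
    return by_source, by_destination


by_source, by_destination = tally(FAILED)
for host, count in by_destination.most_common():
    print(host, count)
```

For the rows above, 14 of the 15 failed attempts have analytics1011,
analytics1016, or analytics1017 as destination. With a sample this
small, that may still be an artifact of timing or scheduling.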
Comment 1 Bingle 2014-04-08 19:20:24 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1535
