Last modified: 2014-04-04 22:05:00 UTC
"Ended Job = job_1387838787660_1390 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_1390/
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec"

This happens to different types of queries, at different times, and doesn't seem to bear any relation to the query itself; I reran the query that generated the error /this/ time immediately after it errored out, and it worked fine.
(Presumably the actual error console can break the errors down by task and so provide more useful data than 'code 2')
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1440
This bug (or class of bug) has continued to make itself known. It's particularly concerning and frequent when running queries that contain subqueries, since those are executed as multiple jobs, which increases the probability that one will fail - and if any ONE element fails, the whole query fails. As an example, I've been running variants of:

INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip FROM (
  SELECT ip AS distip, COUNT(*) AS count
  FROM wmf.webrequest_mobile
  WHERE year = 2014 AND month = 1 AND day = 20
    AND content_type IN ('text/html\; charset=utf-8','text/html\; charset=iso-8859-1','text/html\; charset=UTF-8','text/html')
  GROUP BY ip
  HAVING COUNT(*) >= 2
) sub1
LIMIT 10000;

and I've had three failures out of the previous four queries (which, since each query spawns two jobs, works out as 3 failures out of 8 jobs). Syntactically valid queries failing seemingly at random with no explanation is a pretty substantial blocker to being able to rely on Hive for production tasks.
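If the per-job failure probability is really the issue, one mitigation (a sketch only, not something I've verified on this cluster; the staging table name is made up) is to split the statement into two independent steps, so a transient failure costs only one stage and each statement can be retried on its own:

```sql
-- Hypothetical two-step variant of the query above. The staging table
-- name (distinct_ip_staging) is invented for illustration.
-- Step 1: materialise the aggregation.
CREATE TABLE ironholds.distinct_ip_staging AS
SELECT ip AS distip, COUNT(*) AS count
FROM wmf.webrequest_mobile
WHERE year = 2014 AND month = 1 AND day = 20
  AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1',
                       'text/html\; charset=UTF-8', 'text/html')
GROUP BY ip
HAVING COUNT(*) >= 2;

-- Step 2: take the sample into the target table; if this fails,
-- only this cheap step needs rerunning, not the full scan.
INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip FROM ironholds.distinct_ip_staging LIMIT 10000;
```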
There were indeed some issues with analytics1012: it was running an old version of Java. Ottomata has resolved that, and I ran your query successfully. @Oliver: can you run your query again to confirm that the issue has been resolved?
Now fixed; Analytics 1012 had an outdated version of Java.
Still broken, still on analytics1012 - see task 1387838787660_1540. Most helpfully, the error message was "Application application_1387838787660_1540 failed 1 times due to . Failing the application."
Digging through more log files, I found:

2014-02-14 01:05:07,873 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1387838787660_1547_r_000542_0:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Unable to rename output from: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_task_tmp.-mr-10002/_tmp.000542_0 to: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_tmp.-mr-10002/000542_0

This maps to a known Hive issue: https://issues.apache.org/jira/browse/HIVE-4605

@Oliver: can you rerun the query without the OVERWRITE statement and see if that solves the problem?
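For reference, the suggested rerun without OVERWRITE would look roughly like this (a sketch: INSERT INTO appends rather than replacing, so the target table may need emptying first for a clean comparison):

```sql
-- Same query as before, but appending instead of overwriting, to test
-- whether the HIVE-4605 rename failure is specific to INSERT OVERWRITE.
INSERT INTO TABLE ironholds.distinct_ip
SELECT distip FROM (
  SELECT ip AS distip, COUNT(*) AS count
  FROM wmf.webrequest_mobile
  WHERE year = 2014 AND month = 1 AND day = 20
    AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1',
                         'text/html\; charset=UTF-8', 'text/html')
  GROUP BY ip
  HAVING COUNT(*) >= 2
) sub1
LIMIT 10000;
```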
Otto -- can you just pull this machine from the cluster? It's causing a lot of problems and we should repave it or something. thanks, -Toby
Oliver's most recent issue doesn't seem to have anything to do with analytics1012 anymore. He's still having problems, just not related to his initial report. There's also this issue: https://issues.apache.org/jira/browse/HIVE-3828
Ooh; plausible. Thanks for the explanation :). I'm confused as to why it's only /sometimes/ failing, though.
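One possible explanation for the intermittency (an assumption on my part, not confirmed for this cluster): HIVE-4605-style rename failures are often attributed to speculative execution, where duplicate task attempts race to write the same temporary output path, and whichever attempt loses the race fails its rename. Since the race only sometimes happens, the failure only sometimes appears. If that is the cause here, a commonly suggested workaround is to disable speculative execution for the session (property names vary by Hadoop version):

```sql
-- Workaround sketch: turn off speculative task attempts for this session
-- so only one attempt writes each _tmp output file. These are the
-- pre-YARN (MR1) property names; newer Hadoop uses
-- mapreduce.map.speculative / mapreduce.reduce.speculative.
SET mapred.map.tasks.speculative.execution=false;
SET mapred.reduce.tasks.speculative.execution=false;
```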
Btw, the analytics1012 problem is fixed, woo!