Last modified: 2014-04-04 22:05:00 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and except for displaying bug reports and their history, links might be broken. See T63100, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61100 - Hive freezes starting a query, and produces the following error...
Status: REOPENED
Product: Analytics
Classification: Unclassified
Component: Refinery
Version: unspecified
Hardware: All All
Importance: Unprioritized major
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-02-08 23:41 UTC by Oliver Keyes
Modified: 2014-04-04 22:05 UTC
CC List: 6 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---



Description Oliver Keyes 2014-02-08 23:41:19 UTC
"Ended Job = job_1387838787660_1390 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_1390/
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0:  HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec"

This happens to different types of queries, at different times, and doesn't seem to bear any relation to the query itself; I reran the query that generated the error /this/ time immediately after it errored out, and it worked fine.
Comment 1 Oliver Keyes 2014-02-08 23:41:40 UTC
(Presumably the actual error console can break the errors down by task and so provide more useful data than 'code 2')
Comment 2 Bingle 2014-02-08 23:45:39 UTC
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1440
Comment 3 Oliver Keyes 2014-02-12 00:08:40 UTC
This bug (or class of bug) has continued to make itself known. It's particularly concerning and frequent when running queries that contain subqueries, since each such query is run as multiple jobs, which increases the probability that one will fail - and if any ONE element fails, it all fails. As an example, I've been running variants of:

INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip
FROM (SELECT ip AS distip, COUNT(*) AS count FROM wmf.webrequest_mobile
      WHERE year = 2014 AND month = 1 AND day = 20
        AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1', 'text/html\; charset=UTF-8', 'text/html')
      GROUP BY ip HAVING COUNT(*) >= 2) sub1 LIMIT 10000;

and I've had three failures out of the previous four queries (which, with subqueries, works out as 3/8). Syntactically valid queries failing seemingly at random, with no explanation, is a pretty substantial blocker to being able to rely on Hive for production tasks.
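The reasoning above (more jobs per query means more chances for one of them to fail) can be made concrete with a small model. This is a minimal sketch under a loudly stated assumption: every job fails independently with the same hypothetical probability p, so a query compiled into k jobs succeeds only if all k succeed and therefore fails with probability 1 - (1 - p)^k. The function name and the sample values of p are illustrative only, not measurements from the cluster; 0.375 simply mirrors the 3/8 figure quoted above.

```python
def query_failure_probability(p: float, k: int) -> float:
    """Probability that a query made of k jobs fails.

    Assumes (hypothetically) that each job fails independently with
    probability p, and that any single job failure aborts the whole query.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # The query succeeds only if all k jobs succeed: (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative per-job failure rates; compare a single-job query (k=1)
    # with a query whose subquery adds a second job (k=2).
    for p in (0.05, 0.1, 0.375):
        one_job = query_failure_probability(p, 1)
        two_jobs = query_failure_probability(p, 2)
        print(f"p={p}: 1 job fails {one_job:.1%}, 2 jobs fail {two_jobs:.1%}")
```

For small p, 1 - (1 - p)^2 = 2p - p^2 is close to 2p, so under this model a second job roughly doubles the chance of a failed query. The independence assumption is a simplification: failures caused by a single misbehaving node would be correlated rather than independent.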
Comment 4 Diederik van Liere 2014-02-12 18:27:29 UTC
There were indeed some issues with analytics1012: it was running an old version of Java. Ottomata has resolved that, and I ran your query successfully.
@Oliver: can you run your query again to confirm that the issue has been resolved?
Comment 5 Oliver Keyes 2014-02-13 19:56:54 UTC
Now fixed; analytics1012 had an outdated version of Java.
Comment 6 Oliver Keyes 2014-02-13 23:14:56 UTC
Still broken, still on analytics1012 - see task 1387838787660_1540. Most helpfully, the error message was " Application application_1387838787660_1540 failed 1 times due to . Failing the application. "
Comment 7 Diederik van Liere 2014-02-14 17:20:38 UTC
Digging through more log files, I found:

2014-02-14 01:05:07,873 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1387838787660_1547_r_000542_0: Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Unable to rename output from: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_task_tmp.-mr-10002/_tmp.000542_0 to: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_tmp.-mr-10002/000542_0


Which maps to a Hive issue: https://issues.apache.org/jira/browse/HIVE-4605

@Oliver: can you rerun the query without the OVERWRITE statement and see if that solves the problem?
Comment 8 Toby Negrin 2014-02-19 01:20:11 UTC
Otto -- can you just pull this machine from the cluster? It's causing a lot of problems and we should repave it or something.

thanks,

-Toby
Comment 9 Andrew Otto 2014-02-19 14:40:10 UTC
Oliver's most recent issue doesn't seem to have anything to do with analytics1012 anymore.  He's still having problems, just not related to his initial report.

There's also this issue:
https://issues.apache.org/jira/browse/HIVE-3828
Comment 10 Oliver Keyes 2014-02-19 17:16:15 UTC
Ooh; plausible. Thanks for the explanation :). I'm confused as to why it's only /sometimes/ failing, though.
Comment 11 Andrew Otto 2014-04-04 22:05:00 UTC
Btw, the analytics1012 problem is fixed, woo!
