Last modified: 2014-04-04 22:05:00 UTC
"Ended Job = job_1387838787660_1390 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_1390/
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec"

This happens to different types of queries, at different times, and doesn't seem to bear any relation to the query itself; I reran the query that generated the error /this/ time immediately after it errored out, and it worked fine.
(Presumably the actual error console can break the errors down by task and so provide more useful data than 'code 2')
Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1440
This bug (or class of bug) has continued to make itself known. It's particularly concerning and frequent when running queries that contain subqueries, since those are executed as multiple jobs, which increases the probability that one will fail - and if any ONE element fails, the whole query fails. As an example, I've been running variants of:

INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip FROM (
  SELECT ip AS distip, COUNT(*) AS count
  FROM wmf.webrequest_mobile
  WHERE year = 2014 AND month = 1 AND day = 20
    AND content_type IN ('text/html\; charset=utf-8','text/html\; charset=iso-8859-1','text/html\; charset=UTF-8','text/html')
  GROUP BY ip
  HAVING COUNT(*) >= 2
) sub1
LIMIT 10000;

and I've had three failures out of the previous four queries (which, since each query spawns two jobs, works out as 3 failures out of 8 jobs). Syntactically valid queries failing seemingly at random with no explanation is a pretty substantial blocker to being able to rely on Hive for production tasks.
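If the per-job failure probability is really the issue, one mitigation (a sketch only, not something I've verified on this cluster; the staging table name is made up) is to split the statement into two independent steps, so a transient failure costs only one stage and each statement can be retried on its own:

```sql
-- Hypothetical two-step variant of the query above. The staging table
-- name (distinct_ip_staging) is invented for illustration.
-- Step 1: materialise the aggregation.
CREATE TABLE ironholds.distinct_ip_staging AS
SELECT ip AS distip, COUNT(*) AS count
FROM wmf.webrequest_mobile
WHERE year = 2014 AND month = 1 AND day = 20
  AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1',
                       'text/html\; charset=UTF-8', 'text/html')
GROUP BY ip
HAVING COUNT(*) >= 2;

-- Step 2: take the sample into the target table; if this fails,
-- only this cheap step needs rerunning, not the full scan.
INSERT OVERWRITE TABLE ironholds.distinct_ip
SELECT distip FROM ironholds.distinct_ip_staging LIMIT 10000;
```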
There were indeed some issues with analytics1012: it was running an old version of Java. Ottomata has resolved that, and I ran your query successfully. @Oliver: can you run your query again to confirm that the issue has been resolved?
Now fixed; Analytics 1012 had an outdated version of Java.
Still broken, still on analytics1012 - see task 1387838787660_1540. Most helpfully, the error message was "Application application_1387838787660_1540 failed 1 times due to . Failing the application."
Digging through more log files, I found:

2014-02-14 01:05:07,873 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1387838787660_1547_r_000542_0:
Error: java.lang.RuntimeException: Hive Runtime Error while closing operators: Unable to rename output from: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_task_tmp.-mr-10002/_tmp.000542_0 to: hdfs://kraken/tmp/hive-ironholds/hive_2014-02-14_00-38-53_191_252484601784449773/_tmp.-mr-10002/000542_0

This maps to a known Hive issue: https://issues.apache.org/jira/browse/HIVE-4605

@Oliver: can you rerun the query without the OVERWRITE statement and see if that solves the problem?
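For reference, the suggested rerun without OVERWRITE would look roughly like this (a sketch: INSERT INTO appends rather than replacing, so the target table may need emptying first for a clean comparison):

```sql
-- Same query as before, but appending instead of overwriting, to test
-- whether the HIVE-4605 rename failure is specific to INSERT OVERWRITE.
INSERT INTO TABLE ironholds.distinct_ip
SELECT distip FROM (
  SELECT ip AS distip, COUNT(*) AS count
  FROM wmf.webrequest_mobile
  WHERE year = 2014 AND month = 1 AND day = 20
    AND content_type IN ('text/html\; charset=utf-8', 'text/html\; charset=iso-8859-1',
                         'text/html\; charset=UTF-8', 'text/html')
  GROUP BY ip
  HAVING COUNT(*) >= 2
) sub1
LIMIT 10000;
```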
Otto -- can you just pull this machine from the cluster? It's causing a lot of problems and we should repave it or something. thanks, -Toby
Oliver's most recent issue doesn't seem to have anything to do with analytics1012 anymore. He's still having problems, just not related to his initial report. There's also this issue: https://issues.apache.org/jira/browse/HIVE-3828
Ooh; plausible. Thanks for the explanation :). I'm confused as to why it's only /sometimes/ failing, though.
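One possible explanation for the intermittency (an assumption on my part, not confirmed for this cluster): HIVE-4605-style rename failures are often attributed to speculative execution, where duplicate task attempts race to write the same temporary output path, and whichever attempt loses the race fails its rename. Since the race only sometimes happens, the failure only sometimes appears. If that is the cause here, a commonly suggested workaround is to disable speculative execution for the session (property names vary by Hadoop version):

```sql
-- Workaround sketch: turn off speculative task attempts for this session
-- so only one attempt writes each _tmp output file. These are the
-- pre-YARN (MR1) property names; newer Hadoop uses
-- mapreduce.map.speculative / mapreduce.reduce.speculative.
SET mapred.map.tasks.speculative.execution=false;
SET mapred.reduce.tasks.speculative.execution=false;
```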
Btw, the analytics1012 problem is fixed, woo!