Last modified: 2014-06-27 17:49:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T67420, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 65420 - Hive queries inconsistently failing
Status: NEW
Product: Analytics
Classification: Unclassified
Component: Refinery (Other open bugs)
Version: unspecified
Hardware: All
OS: All
Importance: High critical
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-05-16 21:44 UTC by Oliver Keyes
Modified: 2014-06-27 17:49 UTC
CC: 7 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
Stderr (4.62 MB, text/plain)
2014-06-04 16:24 UTC, Oliver Keyes
Details
Errors (1.62 KB, text/plain)
2014-06-27 17:49 UTC, Oliver Keyes
Details

Description Oliver Keyes 2014-05-16 21:44:25 UTC
(yes, another of Those issues)

A set of unrelated queries[0] is inconsistently failing on Hive. This isn't a tremendous problem, or wouldn't be, if the task view page weren't seemingly deleted when the query fails; it just 404s out. Dan thinks it might be a problem with the transfer to a single, unified table.

[0] set hive.mapred.mode = nonstrict;
ADD JAR /usr/lib/hcatalog/share/hcatalog/hcatalog-core-0.5.0-cdh4.3.1.jar;
SELECT *
  FROM (
    SELECT dt,ip AS IP FROM wmf.webrequest_text
    WHERE year = 2014
    AND month = 05
    AND content_type RLIKE('text/html')
    AND ip NOT RLIKE(':')
    ORDER BY rand()) dtretrieve
LIMIT 5000000;

and

SELECT uri_path,uri_host,uri_query FROM webrequest WHERE year = 2014 AND month = 04 LIMIT 1;
Comment 1 Dan Andreescu 2014-05-16 22:27:08 UTC
when executing on wmf.webrequest instead, I get:

java.io.FileNotFoundException: Path is not a file: /wmf/data/external/webrequest/webrequest_mobile/hourly/2014/05/15/08/08
Comment 2 Oliver Keyes 2014-05-19 18:46:34 UTC
Breaking News: we have succeeded in getting a sensical error out of hive!

"Java.io.IOException(com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space)"

I think that's Hive for "you need more machines, and don't even think about doing anything until Andrew gets back"
Comment 3 Toby Negrin 2014-05-19 23:39:05 UTC
Do we have the ability to up the JVM heap? e.g.

http://stackoverflow.com/questions/18546201/increase-jvm-heap-space-while-runnig-from-hadoop-unix
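
For reference, a minimal sketch of raising the client-side heap before launching the Hive CLI, assuming the standard Hadoop environment variables are honoured on this install; the 2048 MB value and the query file name are illustrative placeholders:

  # Raise the heap used by Hadoop client processes (value in MB) for this shell session.
  export HADOOP_HEAPSIZE=2048

  # Or pass JVM options to client-side tools directly.
  export HADOOP_CLIENT_OPTS="-Xmx2g"

  # Then run the query as usual (the file name is a placeholder).
  hive -f my_query.hql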
Comment 4 Andrew Otto 2014-05-20 07:28:44 UTC
Um, so, yeah, sorry I haven't fully announced this yet, as it was still slightly in transition on Friday...and I was off biking around France on Monday.

The webrequest_* tables will no longer work.  The data has been moved to a new directory structure for the single webrequest table.

If you just want to select webrequest text data, you should add a "where webrequest_source = 'text'" clause to your query, as sketched below.
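
For example, a sketch of Oliver's second test query from the description, rewritten against the unified table with that filter added (column and partition names are taken from the original query):

  SELECT uri_path, uri_host, uri_query
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND year = 2014
    AND month = 04
  LIMIT 1;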
Comment 5 Oliver Keyes 2014-05-20 08:10:28 UTC
Yep. See the second example query ;p. Requests are failing uniformly, table or no table, and tend to feature the JVM heap being exceeded.
Comment 6 Oliver Keyes 2014-05-20 08:10:42 UTC
*webrequest_* table or no table
Comment 7 Andrew Otto 2014-05-20 15:59:32 UTC
Oliver, I'm not sure!  I just ran one of your queries and it took a while, but finished just fine:

SELECT uri_path,uri_host,uri_query FROM webrequest WHERE year = 2014 AND month = 04 LIMIT 1;

2014-05-20 15:55:53,214 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 240151.2 sec
MapReduce Total cumulative CPU time: 2 days 18 hours 42 minutes 31 seconds 200 msec
Ended Job = job_1387838787660_11438
MapReduce Jobs Launched:
Job 0: Map: 26998   Cumulative CPU: 240151.2 sec   HDFS Read: 17445133029 HDFS Write: 3115525 SUCCESS
Total MapReduce CPU Time Spent: 2 days 18 hours 42 minutes 31 seconds 200 msec
OK
uri_path	uri_host	uri_query
//upload.wikimedia.org/wikipedia/commons/thumb/0/07/Sturmpanzer.Saumur.0008gkp7.jpg/300px-Sturmpanzer.Saumur.0008gkp7.jpg	zh.m.wikipedia.org
Time taken: 2447.217 seconds
Comment 8 Oliver Keyes 2014-05-20 16:44:14 UTC
Try the other one with wmf.webrequest ;).
Comment 9 Oliver Keyes 2014-05-20 21:38:09 UTC
Two new queries that are exploding, both with slightly different error reports:

hive (wmf)> SELECT uri_host FROM webrequest WHERE uri_path = '/wiki/Education' AND year = 2014 AND month = 05;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.String.substring(String.java:1913)
	at java.net.URI$Parser.substring(URI.java:2850)
	at java.net.URI$Parser.parse(URI.java:3046)
	at java.net.URI.<init>(URI.java:753)
	at org.apache.hadoop.fs.Path.<init>(Path.java:73)
	at org.apache.hadoop.fs.Path.<init>(Path.java:58)
	at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:209)
	at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:372)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:416)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1427)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1467)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:206)
	at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:69)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:411)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:377)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:387)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:479)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:471)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:366)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1269)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1266)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask

and:

hive (wmf)> SELECT DISTINCT(uri_host) FROM webrequest WHERE uri_path = '/wiki/Education/' AND year = 2014 AND month = 05;        
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 999
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:448)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1526)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1509)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:405)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1427)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1467)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:206)
	at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:69)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:411)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:377)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:387)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:479)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:471)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:366)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1269)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1266)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1266)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
	at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
	at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:47)
Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:238)
	at com.sun.proxy.$Proxy13.getListing(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at com.sun.proxy.$Proxy13.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:441)
	... 32 more
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$HdfsFileStatusProto$Builder.buildPartial(HdfsProtos.java:9398)
	at org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$DirectoryListingProto$Builder.mergeFrom(HdfsProtos.java:11422)
	at org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$DirectoryListingProto$Builder.mergeFrom(HdfsProtos.java:11241)
	at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:275)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetListingResponseProto$Builder.mergeFrom(ClientNamenodeProtocolProtos.java:18775)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetListingResponseProto$Builder.mergeFrom(ClientNamenodeProtocolProtos.java:18629)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:300)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
	at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:162)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:716)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
	at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:153)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:709)
	at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
	at com.sun.proxy.$Proxy13.getListing(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor125.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at com.sun.proxy.$Proxy13.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:441)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1526)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1509)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:405)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1427)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1467)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:206)
	at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:69)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:411)
Job Submission failed with exception 'java.io.IOException(com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
Comment 10 Toby Negrin 2014-05-20 21:40:18 UTC
I think we need to up the memory on the Hadoop nodes themselves, not the client.
Comment 12 Oliver Keyes 2014-05-21 22:11:03 UTC
Heap size increases on the client end don't seem to have helped; it /ran/, for a change, but still cacked out. The disadvantage of cron jobs is that there's no .out file to tell me why: I'll try running it again tomorrow by hand and see if I can get anything useful out of it.
Comment 13 Andrew Otto 2014-05-22 15:41:50 UTC
Oliver, did it error out with the same error message as before?  The JVM Heap one?
Comment 14 Andrew Otto 2014-05-22 15:42:41 UTC
Somehow my reply to Toby's comment didn't make it here...

Toby, you might be right, but from the looks of it, these jobs weren't even properly making it to the cluster.
Comment 15 Oliver Keyes 2014-05-22 15:57:10 UTC
Andrew: dunno. Cron job, see? :p. Rerunning today to see what it does.
Comment 16 christian 2014-05-22 19:22:23 UTC
(In reply to Oliver Keyes from comment #15)
> Andrew: dunno. Cron job, see? :p

You can use cron's MAILTO to automatically get a job's output delivered
to your mailbox (a minimal sketch is below).

If you do not want it in your inbox, standard redirections like ">"
and "2>" also work in cron. That way you can send the stdout and stderr of
the cron job to a file.
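
A minimal crontab sketch of the MAILTO approach; the address, schedule, and
query file are hypothetical placeholders, not the actual job:

  # Mail the output (stdout and stderr) of the jobs below to this address.
  MAILTO=okeyes@example.org

  # Hypothetical job line; with MAILTO set, whatever it prints arrives by mail.
  0 3 * * * hive -f /home/ironholds/queries/sample.hql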
Comment 17 Andrew Otto 2014-05-22 19:58:20 UTC
Oliver, I'm running it right now too!
Comment 18 Oliver Keyes 2014-05-23 19:43:47 UTC
So, it's still running. Mapping got to 100%, reducing has been at 4% for the last 40 minutes - which I think is an argument for 'we need more machines' as much as anything else.

We definitely need a simple patch to up the heap size on the server side, for two reasons. First: it's wasted effort to require every researcher/analyst/etc/etc to up the heap size before every query, which is what you have to do (it doesn't hold between shell sessions, obviously). Second: I'm getting OOM errors connecting to analytics1027 directly through R, which suggests it's a problem at the server end. I'd appreciate fixes sooner rather than later because almost all of my work is blocked on this, including the pageviews stuff, high-priority stuff for C-levels, and requests from the Zero and Global Development teams.
Comment 19 Toby Negrin 2014-05-23 19:56:35 UTC
Andrew -- 

Can you please dig in here? I believe it's a hadoop setting on the data nodes and we'll have to solve this sooner or later.

thanks,

-Toby
Comment 20 Andrew Otto 2014-05-23 23:36:35 UTC
Geez, sometimes my mail client doesn't check often enough, I just got this.

Yes, I can look into it!  But it'll have to wait till next week... :/

That query Oliver is running generates > 80,000 mappers.  Hive has some fancy ways to do sampling of data, but they don't work on external tables.  If we get that sorted out, these types of queries should be more feasible.  We need to get the data refining (aka ETL) phase up and going for that first.

Yes, there are almost certainly tweaks we can do to make Hadoop more efficient for things like this, but I have yet to be convinced that there is actually a memory problem on the datanodes themselves.  All of the OOMs that we've seen were client side.  We brainstormed for a few minutes about this in standup today.

Re: R OOMing connecting to analytics1027, I'd need to check, but that also sounds like weird client side stuff.  analytics1027 is not a datanode.  You're connecting to Hive there with R just like you do with the Hive CLI.
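
As an aside on the sampling point: until proper table sampling works here, one
way to approximate the random sample from the bug description without the
single-reducer ORDER BY rand() step is to filter on rand() directly. A sketch
under that assumption (the 0.001 fraction is illustrative, and it yields an
approximate rather than exact random sample):

  SELECT dt, ip AS IP
  FROM wmf.webrequest
  WHERE year = 2014
    AND month = 05
    AND webrequest_source = 'text'
    AND content_type RLIKE('text/html')
    AND ip NOT RLIKE(':')
    AND rand() < 0.001   -- keep roughly 0.1% of matching rows; no global sort needed
  LIMIT 5000000;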
Comment 21 Oliver Keyes 2014-05-23 23:45:25 UTC
Hrm. Anything come of the brainstorm? It seems weird for the client-side stuff to be causing the problems. Re the query: actually mapping is the fast bit. I got through mapping at 11am. I've been reducing since, and that seems to be the incredibly slow bit.
Comment 22 Andrew Otto 2014-05-24 04:51:55 UTC
Not really, we just discussed a bit. I hadn't looked into it because I was waiting for your query to fail!

I don't think that client side stuff is causing problems for your currently running query.  There was client side stuff that was OOMing for weird reasons I don't understand.  I tried various versions of your query working on different sizes of partitions.  Smaller sets of partitions don't OOM, larger ones do.  The OOM happens before the application is given to Hadoop, and it doesn't happen if HADOOP_HEAPSIZE is increased on the client.  Hence why I think the OOM errors we saw are somehow client related, although I'm not entirely sure how.

I suspect that maybe hive issues some metadata queries to do its query planning, and maybe when there are lots of partitions it gets bogged down somewhere?  Not really sure.
Comment 23 Oliver Keyes 2014-05-24 05:11:58 UTC
I think failing is what it may be doing. So, I set two queries to run, one small (hunting out a single count(*)) and one large (hunting out 5m rows) on cron jobs. The large one died, and it's not a syntax problem, because I deliberately used identical syntax for each cron entry.
Comment 24 Toby Negrin 2014-05-25 05:51:53 UTC
Thanks Andrew -- the 80K mappers seem busted. If this is because Hive doesn't understand how to properly plan a query against the table Oliver is accessing, that is probably the root cause.

Should we split out the client issues into a separate bug?
Comment 25 Andrew Otto 2014-05-26 14:15:21 UTC
Oliver, did you job finish?  What's the application ID?
Comment 26 Oliver Keyes 2014-05-26 19:02:49 UTC
"The last application ID to come from Ironholds" :(. The disadvantage of crons is that it doesn't tell me. The disadvantage of hive is that for big queries I need crons.
Comment 27 Andrew Otto 2014-05-26 19:43:38 UTC
You should probably be redirecting your stdout from crons anyway, eh?

At the end of your cron command, add

  > /home/ironholds/blabla/whatever/path/whatever_file.log 2>&1

That will redirect both stdout and stderr to that file.
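
Put together, a hypothetical crontab entry using that redirection might look
like this (the schedule and paths are placeholders):

  0 3 * * * hive -f /home/ironholds/queries/sample.hql > /home/ironholds/logs/sample.log 2>&1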
Comment 28 Oliver Keyes 2014-05-26 19:48:02 UTC
Neat; the last bit I was missing :). Let's see what happens...
Comment 29 Toby Negrin 2014-05-27 19:38:58 UTC
Hi Andrew -- 

Oliver and I were talking and we believe that the select * query works fine on the old partitions. Could there be a difference in the configurations between the old and new partitions?

thanks,

-Toby
Comment 30 Oliver Keyes 2014-05-27 19:43:15 UTC
(Clarity: that's the random sampling query in my first post. Also, 'worked' - I don't think we have the old partitions any more).
Comment 31 Andrew Otto 2014-05-27 19:49:14 UTC
Not sure what you mean by old and new partitions.  Do you mean the single table vs the old 4 tables?

There is a difference, yes, in that you query much more data by default with the webrequest table.  For example, bits is very large.  If you are pretty sure you don't want bits data, add a "where webrequest_source != 'bits'" to the query (see the sketch at the end of this comment).  That will cut the data size down a lot.

I'm googling for ways to make these large queries run and am learning things, but am not yet sure.  I'm also looking for errors in the logs to find out why they died.

See also:
http://mail-archives.apache.org/mod_mbox/hive-user/201212.mbox/%3C20121214155449.0F0F.13FE4A9A@gmo.jp%3E

Also, since we were talking about HADOOP_HEAPSIZE and Hive CLI earlier, this is the documentation on HADOOP_HEAPSIZE for Hive CLI:

  # Larger heap size may be required when running queries over large number of files or partitions. 
  # By default hive shell scripts use a heap size of 256 (MB).  Larger heap size would also be 
  # appropriate for hive server (hwi etc).


So it seems the Hive CLI itself needs a larger heap size when running over larger datasets, as we were assuming.  I'm still not sure why that would be.  I suppose it looks at the data before submitting the job to Hadoop?
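
Picking up the webrequest_source suggestion from earlier in this comment, here
is a sketch of the sampling query from the bug description with bits excluded,
assuming it now runs against the unified wmf.webrequest table:

  SELECT *
    FROM (
      SELECT dt, ip AS IP FROM wmf.webrequest
      WHERE year = 2014
      AND month = 05
      AND webrequest_source != 'bits'
      AND content_type RLIKE('text/html')
      AND ip NOT RLIKE(':')
      ORDER BY rand()) dtretrieve
  LIMIT 5000000;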
Comment 32 Oliver Keyes 2014-05-27 19:52:47 UTC
Hrm. Okay, I've tried relaunching with bits excluded to see what explodes.
Comment 33 Oliver Keyes 2014-06-03 17:33:23 UTC
Newest error message, on the random sample query mentioned above:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
	at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:379)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:58)
	at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:45)
	at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:297)
	at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:287)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:360)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:295)
	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:154)

Container killed by the ApplicationMaster.
Comment 34 Andrew Otto 2014-06-04 15:28:40 UTC
Oliver, could you give me the application id for that job?  I'd like to look at the full output.
Comment 35 Oliver Keyes 2014-06-04 16:24:38 UTC
Created attachment 15565 [details]
Stderr

Application ID is application_1387838787660_16216 - also attaching stderr in its entirety.
Comment 36 Oliver Keyes 2014-06-09 06:57:18 UTC
The same darn thing we've discussed before - the query idling at ~33 percent, getting to 40, and dropping on the reduce portion - just happened a /second/ time on the same request. Is one of the machines broken or something?
Comment 37 Oliver Keyes 2014-06-09 07:26:48 UTC
(application_1387838787660_17009 if you want to check out the hive-side internal logs. I should just start running with -v)
Comment 38 Oliver Keyes 2014-06-27 17:49:12 UTC
Created attachment 15764 [details]
Errors

After running for 33 days, the Legendarily Broken Query /actually spat out an error message/. Attached.
