Last modified: 2014-09-03 09:16:13 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72203, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 70203 - Hive is broken on stat1002


Summary:	Hive is broken on stat1002

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	Refinery (Other open bugs)
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	Nobody - You can work on this!

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2014-08-30 07:14 UTC by Oliver Keyes
Modified:	2014-09-03 09:16 UTC (History)
CC List:	8 users (show)

See Also:	70330
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---


Description Oliver Keyes 2014-08-30 07:14:49 UTC

ironholds@stat1002:~$ hive
Unable to determine Hadoop version information.
'hadoop version' returned:
No default-logstash-fields.properties resource present, using defaults
Hadoop 2.3.0-cdh5.0.2
Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8
Compiled by jenkins on 2014-06-09T16:20Z
Compiled with protoc 2.5.0
From source with checksum 75596fe27f833e512f27fbdaaa7b0ab
This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar

Comment 1 christian 2014-08-30 07:30:14 UTC

(just wanted to file the same bug :-) )

The breakage happened around 2014-08-30 ~00:49 [1].

Around that time bc8e34859268b6943f1e2c9621bd01bdc6676371 got merged,
which turns gelf logging on.

(We saw gelf logging cause the exact same problems 4
days ago [2], which was worked around by
turning gelf logging off (see
82cab341b6070d95437b00f005280fed3289dcac)).

-------------------------------------

The immediate work-around is to create an empty default-logstash-fields.properties in the current directory:

  touch default-logstash-fields.properties

Then hive starts again without issues, and queries etc. work as well.
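
A minimal sketch of the work-around above, for anyone who would rather not drop the file into their home directory. The scratch directory and the `workdir` variable are assumptions for illustration, not part of the original report; the only step taken from the report is the `touch` of `default-logstash-fields.properties` in the directory from which hive is launched.

```shell
# Hypothetical scratch directory; any directory you launch hive from would do.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Create the empty properties file the work-around relies on.
touch default-logstash-fields.properties

# Confirm the file is present and empty before starting hive from this directory.
if [ -f default-logstash-fields.properties ] && [ ! -s default-logstash-fields.properties ]; then
  echo "work-around file in place: $workdir/default-logstash-fields.properties"
fi
```

Since the work-around depends on the current directory, hive would then need to be started from that same directory.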

-------------------------------------

[1] I had a couple of jobs running during the night.
At 00:47:19 the last successful one started.
At 00:49:00 the first failing job started.

[2] See
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140826.txt
starting at 20:49:30, and
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140826.txt
starting at 20:55:17

Comment 2 Toby Negrin 2014-08-30 19:16:56 UTC

Opsen -- can we please consider some sort of sanity check post cluster maintenance? I'm also wondering if the data quality scripts also broke.

Thanks for grabbing Christian.

Comment 3 Jeff Gage 2014-08-30 21:59:43 UTC

Sorry folks. I did sanity check with 'hdfs', but because that output is just a warning I didn't think it would cause problems. I'll also test with 'hive' in the future. Did a lot of research into upstream defaults before making this change, was surprised at the outcome. I'll disable gelf again for now.

I discovered this ticket via Google search results while troubleshooting :P
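
The kind of post-maintenance sanity check discussed in this comment could be sketched roughly as follows. This is an illustration only, not the check that was actually run: the `check` helper is a hypothetical name, the example commands are assumptions, and the sketch assumes a broken client exits with a non-zero status rather than only printing a warning.

```shell
# check: run a command quietly and report whether it exited successfully.
# (Hypothetical helper, not an existing tool.)
check() {
  if "$@" >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "FAILED: $*"
    return 1
  fi
}

# Run every client that users rely on, not just one of them.
# (Commented out so the sketch can be read without a cluster at hand.)
# check hdfs dfs -ls /
# check hive -e 'SHOW DATABASES;'
```

Checking each client separately matters here because, as described above, the `hdfs` client only printed a warning while `hive` refused to start.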

Comment 4 christian 2014-08-30 22:05:09 UTC

(Adding jgage to CC)

(In reply to Toby Negrin from comment #2)
> I'm also wondering if the data quality scripts also broke.

Even if ... our setup allows us to re-check partitions easily without getting
Icinga confused. So we're safe and prepared for that.

However, the hive breakage is limited to non-cluster machines
like stat1002.
The monitoring however runs from within the cluster. So the monitoring
is working:

  +---------------------+--------+--------+--------+--------+
  | Date                |  bits  |  text  | mobile | upload |
  +---------------------+--------+--------+--------+--------+
  [...]
  | 2014-08-30T00:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T01:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T02:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.
  | 2014-08-30T04:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T05:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation
  | 2014-08-30T07:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T08:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T09:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T10:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T11:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T12:xx:xx |    .   |    .   |    .   |    .   |    
  [...]


Statuses:

  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)

> Thanks for grabbing Christian.

I didn't grab the issue -- I just provided a work-around :-)
There is not much I can do there. Only ops people can merge
to the operations/puppet repo. And since there is a workaround that
makes hive work again on stat1002, I think we can safely wait for
a proper fix next week.

Let's not forget: Hive is not yet a production service ;-)

Comment 5 Toby Negrin 2014-09-01 03:10:10 UTC

Christian -- 

I ran a hive query and redirected output to file -- thus I thought hive was running :(

Totally agree -- Hive is not a production service and there is no expectation of off-hour support.

Gage --

We can cc you on all tickets if you want. We are pretty bugzilla focused here. Let's discuss Tuesday.

thanks all

-Toby

Comment 6 christian 2014-09-03 09:00:38 UTC

Works for me again (Hence closing). Thanks!

Comment 7 christian 2014-09-03 09:16:13 UTC

Just to keep bugs connected:

(In reply to christian from comment #4)
> The monitoring however runs from within the cluster. So the monitoring
> is working:
> 
>   +---------------------+--------+--------+--------+--------+
>   | Date                |  bits  |  text  | mobile | upload |
>   +---------------------+--------+--------+--------+--------+
>   [...]
[...]
>   | 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.

This monitoring alert is tracked in bug 70330

[...]
>   | 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation

This monitoring alert is tracked in bug 70331
