Last modified: 2014-09-03 09:16:13 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T72203, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 70203 - Hive is broken on stat1002
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: Refinery
Version: unspecified
Hardware: All All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Reported: 2014-08-30 07:14 UTC by Oliver Keyes
Modified: 2014-09-03 09:16 UTC (History)

Description Oliver Keyes 2014-08-30 07:14:49 UTC
ironholds@stat1002:~$ hive
Unable to determine Hadoop version information.
'hadoop version' returned:
No default-logstash-fields.properties resource present, using defaults Hadoop 2.3.0-cdh5.0.2 Subversion git://github.sf.cloudera.com/CDH/cdh.git -r 8e266e052e423af592871e2dfe09d54c03f6a0e8 Compiled by jenkins on 2014-06-09T16:20Z Compiled with protoc 2.5.0 From source with checksum 75596fe27f833e512f27fbdaaa7b0ab This command was run using /usr/lib/hadoop/hadoop-common-2.3.0-cdh5.0.2.jar
Comment 1 christian 2014-08-30 07:30:14 UTC
(just wanted to file the same bug :-) )

The breakage happened around 2014-08-30 ~00:49 [1].

Around that time bc8e34859268b6943f1e2c9621bd01bdc6676371 got merged,
which turns gelf logging on.

(Having gelf logging on caused the exact same problems 4
days ago [2]. That was worked around by
turning gelf logging off, see
82cab341b6070d95437b00f005280fed3289dcac.)
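
One possible reason an extra log line is enough to break startup, as a minimal sketch only: it assumes (this report does not confirm it) that the hive launcher reads the version from the second field of the first line of 'hadoop version' output. The helper name first_line_version and the two sample strings are made up for illustration.

```shell
# Hypothetical helper: mimic a launcher that takes field 2 of line 1.
first_line_version() {
  awk 'NR == 1 { print $2 }'
}

# Output without the logging notice: the version is on the first line.
clean='Hadoop 2.3.0-cdh5.0.2
Subversion git://github.sf.cloudera.com/CDH/cdh.git'

# Output with gelf logging on: the notice pushes the version to line 2.
noisy='No default-logstash-fields.properties resource present, using defaults
Hadoop 2.3.0-cdh5.0.2'

printf '%s\n' "$clean" | first_line_version   # prints 2.3.0-cdh5.0.2
printf '%s\n' "$noisy" | first_line_version   # prints default-logstash-fields.properties
```

Under that assumption, the second result is not a version number, which would match the "Unable to determine Hadoop version information." message.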

-------------------------------------

The immediate work-around is to create an empty default-logstash-fields.properties in the current directory:

  touch default-logstash-fields.properties

Hive then starts without issues again, and queries etc. work.
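
For scripted jobs, the same work-around can be wrapped in a small helper. This is only a sketch: the function name ensure_logstash_fields is hypothetical, and it assumes the empty file has to sit in the directory hive is started from, as described above.

```shell
# Hypothetical helper: create an empty default-logstash-fields.properties
# in the given directory (default: current directory) unless one exists.
# The body runs in a subshell so the variables do not leak to the caller.
ensure_logstash_fields() (
  dir="${1:-.}"
  file="$dir/default-logstash-fields.properties"
  [ -e "$file" ] || touch "$file"
)
```

A job could then call `ensure_logstash_fields "$PWD"` right before starting hive. An existing file is left untouched.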

-------------------------------------

[1] I had a couple of jobs running during the night.
The last successful one started at 00:47:19.
The first failing job started at 00:49:00.

[2] See
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140826.txt
starting at 20:49:30, and
http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140826.txt
starting at 20:55:17
Comment 2 Toby Negrin 2014-08-30 19:16:56 UTC
Opsen -- can we please consider some sort of sanity check post cluster maintenance? I'm also wondering if the data quality scripts also broke.

Thanks for grabbing Christian.
Comment 3 Jeff Gage 2014-08-30 21:59:43 UTC
Sorry folks. I did sanity check with 'hdfs', but because that output is just a warning I didn't think it would cause problems. I'll also test with 'hive' in the future. Did a lot of research into upstream defaults before making this change, was surprised at the outcome. I'll disable gelf again for now.

I discovered this ticket via Google search results while troubleshooting :P
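
A post-maintenance check along these lines could be sketched as below. This is an illustration, not an existing script: the helper name run_checks is hypothetical, and it assumes that a client which fails to start exits with a non-zero status.

```shell
# Hypothetical helper: run each command line given as an argument and
# report whether it exited successfully. Returns non-zero if any failed.
run_checks() {
  failed=0
  for cmd in "$@"; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "ok:   $cmd"
    else
      echo "FAIL: $cmd"
      failed=1
    fi
  done
  return "$failed"
}
```

On a client machine such as stat1002 it might be called as `run_checks "hdfs dfs -ls /" "hive -e 'SHOW DATABASES;'"`, so that a warning which is harmless for 'hdfs' but fatal for 'hive' shows up as a failure.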
Comment 4 christian 2014-08-30 22:05:09 UTC
(Adding jgage to CC)

(In reply to Toby Negrin from comment #2)
> I'm also wondering if the data quality scripts also broke.

Even if ... our setup allows us to re-check partitions easily without getting
Icinga confused. So we're safe and prepared for that.

However, the hive breakage is limited to non-cluster machines.
Like stat1002.
The monitoring however runs from within the cluster. So the monitoring
is working:

  +---------------------+--------+--------+--------+--------+
  | Date                |  bits  |  text  | mobile | upload |
  +---------------------+--------+--------+--------+--------+
  [...]
  | 2014-08-30T00:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T01:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T02:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.
  | 2014-08-30T04:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T05:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation
  | 2014-08-30T07:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T08:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T09:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T10:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T11:xx:xx |    .   |    .   |    .   |    .   |    
  | 2014-08-30T12:xx:xx |    .   |    .   |    .   |    .   |    
  [...]


Statuses:

  . --> Partition is ok
  X --> Partition is not ok (duplicates, missing, or nulls)

> Thanks for grabbing Christian.

I didn't grab the issue -- I just provided a work-around :-)
There is not much I can do there. Only ops people can merge
to the operations/puppet repo. And since there is a workaround that
makes hive work again on stat1002, I think we can safely wait for
a proper fix next week.

Let's not forget: Hive is not yet a production service ;-)
Comment 5 Toby Negrin 2014-09-01 03:10:10 UTC
Christian -- 

I ran a hive query and redirected output to file -- thus I thought hive was running :(

Totally agree -- Hive is not a production service and there is no expectation of off-hour support.

Gage --

We can cc you on all tickets if you want. We are pretty bugzilla focused here. Let's discuss Tuesday.

thanks all

-Toby
Comment 6 christian 2014-09-03 09:00:38 UTC
Works for me again (Hence closing). Thanks!
Comment 7 christian 2014-09-03 09:16:13 UTC
Just to keep bugs connected:

(In reply to christian from comment #4)
> The monitoring however runs from within the cluster. So the monitoring
> is working:
> 
>   +---------------------+--------+--------+--------+--------+
>   | Date                |  bits  |  text  | mobile | upload |
>   +---------------------+--------+--------+--------+--------+
>   [...]
[...]
>   | 2014-08-30T03:xx:xx |    X   |    X   |    X   |    X   |  <-- problematic commit was merged.

This monitoring alert is tracked in bug 70330

[...]
>   | 2014-08-30T06:xx:xx |    .   |    .   |    .   |    X   |  <-- needs investigation

This monitoring alert is tracked in bug 70331
