Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator; this static page is a read-only archive, so logging in is not possible and some links may be broken. See T49141, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 47141 - Reindexing causes nagios alerts
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: lucene-search-2
Version: wmf-deployment
Hardware/OS: All / All
Importance: High normal
Assigned To: Nobody - You can work on this!
Reported: 2013-04-11 21:17 UTC by Rob Lanphier
Modified: 2013-10-29 08:02 UTC
CC: 4 users

Description Rob Lanphier 2013-04-11 21:17:55 UTC
We have chronic problems with the search indexing process causing spurious alerts and (worse) downtime.  As a result, the alerts that we get are kinda useless.  Let's fix that.
Comment 1 Antoine "hashar" Musso (WMF) 2013-04-15 21:41:39 UTC
What kind of errors? I wrote the Nagios plugin that does the Lucene check so I can probably help there.
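
For context, an Icinga/Nagios check of this sort typically runs a trivial query against the lsearchd HTTP interface and maps the response time (or a failure) onto the standard plugin exit codes. The sketch below only illustrates the idea; the host, port, URL shape and thresholds are assumptions, not taken from the actual plugin.

----------------------------------------------------------------------
#!/usr/bin/env python3
# Minimal Nagios-style check sketch; hypothetical, not the actual plugin
# referenced above. Exit codes follow the plugin convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.
import sys
import time
import urllib.request

HOST = "search1015"   # assumption: any lsearchd backend host
PORT = 8123           # assumption: lsearchd HTTP search port
WARN_SECS = 2.0       # assumption: latency thresholds
CRIT_SECS = 5.0

# assumption: URL shape of a trivial lsearchd query
url = f"http://{HOST}:{PORT}/search/enwiki/test?limit=1"

start = time.time()
try:
    urllib.request.urlopen(url, timeout=CRIT_SECS).read()
except Exception as exc:
    print(f"CRITICAL: search query failed or timed out: {exc}")
    sys.exit(2)

elapsed = time.time() - start
if elapsed >= WARN_SECS:
    print(f"WARNING: search answered in {elapsed:.2f}s")
    sys.exit(1)

print(f"OK: search answered in {elapsed:.2f}s")
sys.exit(0)
----------------------------------------------------------------------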
Comment 2 Tim Starling 2013-04-15 23:58:46 UTC
We're not having many Icinga alerts at the moment, but there are probably regular brief overload events that do not result in an alert.

Basically, the problem is that when a new index is swapped in after an rsync, most of it is not in the kernel cache, so many requests are slow. This results in client-side timeouts, and thus a flood of "zero results" pages delivered to users.

The deployment of PoolCounter for search probably helped with this, since it would avoid using up all search workers, so there would still be a couple left to service Icinga requests. Also, the use of PoolCounter avoids exhausting Apache MaxClients. This removes the need to set an aggressively low client side timeout, so part of the solution here may be just to increase the timeout, allowing the cluster to ride out short periods of slowness by queueing.

There is also a "warmup" feature in lucene-search-2 which probably needs to be configured correctly.

Chad is about to commit a patch for logging of client-side timeouts, so that will give us more visibility into the scale of the problem.
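
As a rough illustration of the warmup idea mentioned above: sequentially reading the freshly rsynced index files pulls them into the kernel page cache before the new index is swapped in, so the first live queries are not served from cold disk. This is only a sketch of the technique, not the lucene-search-2 warmup feature itself, and the index path below is an assumption.

----------------------------------------------------------------------
#!/usr/bin/env python3
# Sketch of page-cache warming for a freshly rsynced index directory.
# Illustration of the idea discussed above, not the actual lucene-search-2
# warmup feature; the path is a hypothetical snapshot location.
import os

INDEX_DIR = "/a/search/indexes/enwiki.new"  # hypothetical snapshot location
CHUNK = 8 * 1024 * 1024                     # read in 8 MiB chunks

def warm(path: str) -> int:
    """Read every file under path once so its pages land in the kernel cache."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                while True:
                    chunk = f.read(CHUNK)
                    if not chunk:
                        break
                    total += len(chunk)
    return total

if __name__ == "__main__":
    read_bytes = warm(INDEX_DIR)
    print(f"warmed {read_bytes / 2**30:.1f} GiB from {INDEX_DIR}")
----------------------------------------------------------------------

Running a handful of representative queries against the new searcher before it takes live traffic has a similar effect.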
Comment 3 Munagala Ramanath (Ram) 2013-04-16 00:51:48 UTC
That sounds good. That gives us a way to tell if any fix we implement has made things better, made them worse, or had no effect at all.
Comment 4 Munagala Ramanath (Ram) 2013-04-17 22:11:47 UTC
As I mentioned at the platform meeting this week, we are seeing a number of failures in the log files about not being able to connect to the RMI Registry
on various hosts. The exceptions look like this:

----------------------------------------------------------------------
2013-04-17 00:00:48,681 [pool-1-thread-64] WARN  org.wikimedia.lsearch.interoperability.RMIMessengerClient  - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is:
  java.net.SocketTimeoutException: Read timed out
2013-04-17 00:00:48,914 [pool-1-thread-41] WARN  org.wikimedia.lsearch.interoperability.RMIMessengerClient  - Cannot contact RMI registry for host search1020 : error during JRMP connection establishment; nested exception is:
  java.net.SocketTimeoutException: Read timed out
---------------------------------------------------------------------

While a plausible explanation for client-side timeouts is, as Tim suggests, that searches are having to access disk, failing even to connect to the RMI Registry is indicative of a different issue: the system being so busy that it is not able to accept a connection.

So I wrote a script to do some rudimentary log analysis; the results are appended below for search1015 and search1016. The histogram shows the number of these failures on an hour-by-hour basis. Notice that we see a spike around 0500 which only abates around 3 to 4 hours later. This is consistent with the failures being related to rsync'ing new snapshots since snapshots are generated every day at 0430. However, there is a second spike around 0700 as well that is unexplained.

The numbers are way higher for Apr-15 since I asked Peter to drop the log level from WARN to TRACE over the weekend, resulting in substantially larger log files (~11GB versus 370MB normally); I'm guessing that having to log so much data itself caused additional failures.

All these failures involve attempts to reach search1019 or search1020, though why there are many more failures for the latter than the former (counts shown at the end of each histogram) is another item needing investigation.

Histograms of RMI Registry failures on search1015
Hr.   Apr-16  Apr-15
--------------------
 0 - :    0      217
 1 - :    7      142
 2 - :   12      267
 3 - :    3      333
 4 - :   63      504
 5 - :  599     1782
 6 - :  141      684
 7 - :  314      864
 8 - :   34      245
 9 - :   47       40
10 - :   10       64
11 - :   36       21
12 - :   10       74
13 - :   32      371
14 - :   71      447
15 - :   50      666
16 - :   65      827
17 - :  465      451
18 - :   66      589
19 - :  176        7
20 - :   79       16
21 - :   97       18
22 - :  144       28
23 - :   81        3
Total = 2602    8660
search1020: 2210
search1019: 390

Histograms of RMI Registry failures on search1016
Hr.   Apr-16  Apr-15
--------------------
 0 - :    0      208
 1 - :   10      166
 2 - :   23      338
 3 - :    0      351
 4 - :   71      485
 5 - :  621     1677
 6 - :  131      766
 7 - :  325      951
 8 - :   23      241
 9 - :   40       41
10 - :    9       57
11 - :   28       24
12 - :    6       74
13 - :   28      388
14 - :   65      479
15 - :   40      677
16 - :   65      827
17 - :  441      473
18 - :   61      114
19 - :  177       10
20 - :   94       28
21 - :   96       21
22 - :  180       20
23 - :   89        0
Total = 2623    8416
search1020: 2239
search1019: 382
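
For reference, bucketing of this kind can be done with a short script that matches the "Cannot contact RMI registry" warnings and counts them per hour and per target host. The sketch below assumes log lines shaped like the WARN entries quoted earlier in this comment; the default log path is a placeholder, not the actual script Ram used.

----------------------------------------------------------------------
#!/usr/bin/env python3
# Sketch of the hour-by-hour bucketing behind the histograms above.
# Assumes lsearchd log lines like the WARN entries quoted in this comment;
# the default log path is a placeholder.
import re
import sys
from collections import Counter

# e.g. "2013-04-17 00:00:48,681 [...] WARN ... Cannot contact RMI registry for host search1020 : ..."
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} (\d{2}):.*Cannot contact RMI registry for host (\S+)"
)

def histogram(log_path: str):
    by_hour = Counter()
    by_host = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            m = LINE_RE.match(line)
            if m:
                by_hour[int(m.group(1))] += 1
                by_host[m.group(2)] += 1
    return by_hour, by_host

if __name__ == "__main__":
    by_hour, by_host = histogram(sys.argv[1] if len(sys.argv) > 1 else "lsearchd.log")
    for hour in range(24):
        print(f"{hour:2d} - : {by_hour[hour]:6d}")
    print(f"Total = {sum(by_hour.values())}")
    for host, count in by_host.most_common():
        print(f"{host}: {count}")
----------------------------------------------------------------------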
Comment 5 Munagala Ramanath (Ram) 2013-04-17 23:18:53 UTC
Correction: "second spike around 0700" should read "second spike around 1700", sorry about that.
Comment 6 Andre Klapper 2013-05-30 13:25:19 UTC
(In reply to comment #2 by Tim Starling)
> Chad is about to commit a patch for logging of client-side timeouts, so that
> will give us more visibility into the scale of the problem.

Chad: Has this happened?

(In reply to comment #4 by Ram)
> As I mentioned at the platform meeting this week, we are seeing a number of
> failures in the log files about not being able to connect to the RMI Registry
> on various hosts.

Does anybody know if that's still the case?
Comment 7 Andre Klapper 2013-06-13 19:14:04 UTC
(In reply to comment #2 by Tim Starling)
> Chad is about to commit a patch for logging of client-side timeouts, so that
> will give us more visibility into the scale of the problem.

Chad: Has this happened?

(In reply to comment #4 by Ram)
> As I mentioned at the platform meeting this week, we are seeing a number of
> failures in the log files about not being able to connect to the RMI Registry
> on various hosts.

Does anybody know if that's still the case?
Comment 8 Antoine "hashar" Musso (WMF) 2013-06-24 09:19:54 UTC
Ram is no longer contracting for the WMF. Resetting assignee.
Comment 9 Chad H. 2013-10-29 08:02:24 UTC
Logging stuff for this went out a long time ago. It's far from perfect, but has allowed us to limp along. Along with other fixes, we're no longer waking up every night trying to keep search from exploding.

I'm marking this FIXED, as enough's been done that we don't need to do anything here at this point.

Cirrus/Elasticsearch has all kinds of new infrastructure with its own monitoring and so forth, so nothing here is really relevant anymore.
