Last modified: 2013-07-25 07:04:23 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T50164, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 48164 - JobQueue does not scale for TTMServer
JobQueue does not scale for TTMServer
Status: RESOLVED FIXED
Product: MediaWiki extensions
Classification: Unclassified
Translate (Other open bugs)
master
All All
: High critical (vote)
: ---
Assigned To: Siebrand Mazeland
https://meta.wikimedia.org/wiki/Speci...
: performance
: 48266 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-05-06 17:09 UTC by Niklas Laxström
Modified: 2013-07-25 07:04 UTC (History)
8 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Niklas Laxström 2013-05-06 17:09:47 UTC
So yesterday we got a report that Special:Translations is not working on meta. I looked few ganglia stats but could not figure out anything specific.

Today with help of Max we have a theory:
Many concurrent updates slow solr down to crawling (1 every 2 seconds taking average of almost 20 s to finish). Also search requests take longer than 10 seconds and time out.

The JobQueue seems to actually work against us here. The lack of accessible statistics to examine JobQueue behavior is problematic. 

Max proposed that we store the list of changes and use cronjob to push them in batches with a single thread.

No other solutions have been proposed.

RT-5085 should be related to this.
Comment 1 Nemo 2013-05-06 17:15:58 UTC
(In reply to comment #0)
> RT-5085 should be related to this.

It's the ops ticket about this bug.

> The JobQueue seems to actually work against us here. The lack of accessible
> statistics to examine JobQueue behavior is problematic. 

Statistics exist now: <https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large>

However the good old https://gdash.wikimedia.org/dashboards/jobq/ gives 503 now, and I assume *executed* job items are what matters?
Comment 2 Nemo 2013-05-09 01:06:48 UTC
After updates have been disabled per bug 48266 comment 0 vanadium is back at normal load and since 2013-05-08 17.20 Special:SearchTranslations "works".
Comment 3 Aaron Schulz 2013-05-09 02:34:43 UTC
*** Bug 48266 has been marked as a duplicate of this bug. ***
Comment 4 Aaron Schulz 2013-05-09 02:58:23 UTC
Is the problem *frequent* updates and/or *concurrent* updates?

Anyway, one can simulate cron jobs with de-duplication + delayed jobs in the queue (Wikidata will do something similar soon). So if there was a job inserted for each event requires a TTM update, always posted to metawiki, to check the lists of changes for each wiki and do the updates, and that job had ignoreDuplicates() return true and had a delay set, it would basically act like a cron where any of the job runner servers can service it.
Comment 5 Aaron Schulz 2013-05-29 23:47:32 UTC
Here is some profiling info from trying to use a single runJobs.php process.
* 99% of time is in SolrTTMServer::update (no surprise)
* It can be quite fast for some (hundreds) of jobs but very slow for others
* The slow ones *seem* to involve empty translations


Output dump below:

aaron@mw1001:~$ sudo -u apache php /usr/local/apache/common/multiversion/MWScript.php runJobs.php metawiki --type TTMServerMessageUpdateJob --maxjobs 100 --profiler text
2013-05-29 23:02:14 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 STARTING
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 t=56467 good
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 STARTING         
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 t=13436 good     
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 STARTING         
2013-05-29 23:03:38 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 t=13499 good     
2013-05-29 23:03:38 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal STARTING           
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal t=799 good                                                                                                                                 
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template STARTING                                                                                                                          
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template t=996 good                                                                                                                        
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template STARTING                                                                                                                            
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template t=556 good                                                                                                                          
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific STARTING                                                                                                                     
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific t=649 good                                                                                                                   
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name STARTING                                                                                                                             
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name t=1070 good                                                                                                                          
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count STARTING                                                                                                                        
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count t=898 good                                                                                                                      
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size STARTING                                                                                                                                
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size t=1237 good                                                                                                                             
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name STARTING                                                                                                                             
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name t=414 good                                                                                                                           
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count STARTING                                                                                                                        
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count t=720 good                                                                                                                      
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size STARTING                                                                                                                                
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size t=1497 good                                                                                                                             
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name STARTING                                                                                                                             
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name t=603 good                                                                                                                           
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count STARTING                                                                                                                        
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count t=556 good                                                                                                                      
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size STARTING                                                                                                                                
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size t=831 good                                                                                                                              
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name STARTING                                                                                                                                         
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name t=609 good                                                                                                                                       
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count STARTING                                                                                                                                    
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count t=587 good                                                                                                                                  
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name STARTING                                                                                                                                         
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name t=741 good                                                                                                                                       
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count STARTING                                                                                                                                    
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count t=632 good                                                                                                                                  
2013-05-29 23:03:51 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 STARTING                                                                                                        
2013-05-29 23:08:18 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 t=267215 good
Comment 6 Aaron Schulz 2013-05-29 23:48:52 UTC
Further runs had several slow similar Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates instances.
Comment 7 Niklas Laxström 2013-05-30 06:46:46 UTC
Is there a way to find the slow functions via profiling?
Comment 8 Aaron Schulz 2013-05-30 07:15:00 UTC
(In reply to comment #7)
> Is there a way to find the slow functions via profiling?

You mean methods inside of SolrTTMServer::update? It would help if there profiling calls in there I suppose (for all i/o at least).

Maybe strace might help here too...
Comment 9 Gerrit Notification Bot 2013-06-06 12:33:20 UTC
Related URL: https://gerrit.wikimedia.org/r/67249 (Gerrit Change Ie230f2e40e97fa1fb00f056d4631e64cbcbf9eae)
Comment 10 Gerrit Notification Bot 2013-06-06 12:47:09 UTC
Related URL: https://gerrit.wikimedia.org/r/67250 (Gerrit Change I8a54e5dd00d6e6298e59000b8bdeaf4135d3158a)
Comment 11 Gerrit Notification Bot 2013-06-06 13:19:48 UTC
Related URL: https://gerrit.wikimedia.org/r/67252 (Gerrit Change I61d03705a80f67762091ea3bde8fc2bbcc6ce520)
Comment 12 Siebrand Mazeland 2013-06-06 13:39:09 UTC
I submitted 3 patch sets that I expect will take away all issues. The main problem was a Solr that was not optimised.

Gerrit change #67249 enables autoCommit in Solr, which makes it possible for Solr to batch a bunch of index updates and commit them in one go. All open updates will be committed after at most 5 seconds.

Gerrit change #67250 ensures that Translate no longer requires immediate commits, but demands that Solr commits within 5 seconds using the commitWithin feature.

Gerrit change #67252 increases ramBufferSizeMB from 32 (MB) to 100 (MB). This is the new default since Solr 4.1[1]. This reduces the number of shard merges at the cost of additional memory usage.

One possible easy optimisation remains, which is the mergeFactor. As I wasn't really able to choose between two evils at this point in time, and without access to the Solr admin to check what's actually happen, I think these changes should go live and we should see if any additional optimisation is needed before doing another round of updates.

[1] https://issues.apache.org/jira/browse/SOLR-4074
Comment 13 Nemo 2013-06-11 18:34:17 UTC
Aaron says that TTMServerMessageUpdateJob jobs are running since yesterday and there doesn't seem to be a big backlog; there have never been many, so he also hadn't dropped any such job from the queue. Vanadium seems happy and Special:SearchTranslations works.
Can this bug be closed?
Comment 14 Niklas Laxström 2013-06-12 07:11:40 UTC
I believe this has been fixed now.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links