Last modified: 2013-07-25 07:04:23 UTC
So yesterday we got a report that Special:Translations is not working on meta. I looked at a few ganglia stats but could not figure out anything specific. Today, with Max's help, we have a theory: many concurrent updates slow Solr down to a crawl (one update every 2 seconds, each taking almost 20 s on average to finish). Search requests then take longer than 10 seconds and time out. The JobQueue seems to actually work against us here, and the lack of accessible statistics for examining JobQueue behavior is problematic. Max proposed that we store the list of changes and use a cron job to push them in batches with a single thread. No other solutions have been proposed. RT-5085 should be related to this.
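The batching proposal could look something like the sketch below (hypothetical names; this is not Translate's actual code, just an illustration of the idea): record each change as it happens, then have a single cron-driven thread push the pending changes to Solr in batches instead of committing once per edit.

```python
from collections import OrderedDict

class ChangeBuffer:
    """Illustrative sketch of the proposal: buffer changes, then push
    them to the search backend in batches from a single thread."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        # An OrderedDict de-duplicates repeated updates to the same
        # title while preserving insertion order.
        self.pending = OrderedDict()

    def record(self, title, revision):
        # A later edit to the same title supersedes the earlier one.
        self.pending[title] = revision

    def flush(self, push):
        """Called from cron: drain the buffer and hand the changes to
        the supplied push(batch) callable, one batch at a time."""
        items = list(self.pending.items())
        self.pending.clear()
        for i in range(0, len(items), self.batch_size):
            push(items[i:i + self.batch_size])
        return len(items)
```

With this shape, ten edits to the same message between cron runs cost one index update rather than ten commits.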
(In reply to comment #0) > RT-5085 should be related to this. It's the ops ticket about this bug. > The JobQueue seems to actually work against us here. The lack of accessible > statistics to examine JobQueue behavior is problematic. Statistics exist now: <https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large> However, the good old https://gdash.wikimedia.org/dashboards/jobq/ gives a 503 now, and I assume *executed* job items are what matter?
After updates were disabled per bug 48266 comment 0, vanadium is back at normal load, and since 2013-05-08 17:20 Special:SearchTranslations "works".
*** Bug 48266 has been marked as a duplicate of this bug. ***
Is the problem *frequent* updates and/or *concurrent* updates? Anyway, one can simulate cron jobs with de-duplication + delayed jobs in the queue (Wikidata will do something similar soon). So if a job were inserted for each event that requires a TTM update, always posted to metawiki, to check the lists of changes for each wiki and do the updates, and that job had ignoreDuplicates() return true and had a delay set, it would basically act like a cron job that any of the job runner servers can service.
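The de-duplication + delay semantics described above can be sketched as follows (a toy model, not MediaWiki's actual JobQueue API): a job carries a de-duplication key and a release time, so inserting the same job repeatedly leaves a single delayed job in the queue for any runner to pick up later.

```python
import time

class DelayedDedupQueue:
    """Toy model of a job queue with de-duplication and delayed
    release, mimicking what ignoreDuplicates() plus a delay would
    give: repeated inserts collapse into one pending job."""

    def __init__(self):
        self.jobs = {}  # de-duplication key -> earliest release time

    def push(self, key, delay, now=None):
        now = time.time() if now is None else now
        if key in self.jobs:
            # Duplicate ignored; the existing delayed job stands.
            return False
        self.jobs[key] = now + delay
        return True

    def pop_ready(self, now=None):
        """Return (and remove) the jobs whose delay has elapsed."""
        now = time.time() if now is None else now
        ready = [k for k, t in self.jobs.items() if t <= now]
        for k in ready:
            del self.jobs[k]
        return ready
```

However many TTM-update events fire in a five-minute window, only one "check the change lists and do the updates" job is serviced at the end of it, which is the cron-like behaviour the comment describes.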
Here is some profiling info from trying to use a single runJobs.php process.

* 99% of time is in SolrTTMServer::update (no surprise)
* It can be quite fast for some (hundreds of) jobs but very slow for others
* The slow ones *seem* to involve empty translations

Output dump below:

aaron@mw1001:~$ sudo -u apache php /usr/local/apache/common/multiversion/MWScript.php runJobs.php metawiki --type TTMServerMessageUpdateJob --maxjobs 100 --profiler text
2013-05-29 23:02:14 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 STARTING
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Mobile_projects/Mobile_Gateway/Mobile_homepage_formatting/27 t=56467 good
2013-05-29 23:03:11 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 STARTING
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/1 t=13436 good
2013-05-29 23:03:24 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 STARTING
2013-05-29 23:03:38 TTMServerMessageUpdateJob Translations:Fundraising_2012/Translation/FAQ_short/2 t=13499 good
2013-05-29 23:03:38 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal STARTING
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal t=799 good
2013-05-29 23:03:39 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template STARTING
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-appeal-template t=996 good
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template STARTING
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-template t=556 good
2013-05-29 23:03:40 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific STARTING
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_051120_plaintext_staycurious-lp-form-countryspecific t=649 good
2013-05-29 23:03:41 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-name t=1070 good
2013-05-29 23:03:42 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-cookie-max-count t=898 good
2013-05-29 23:03:43 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size STARTING
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_frYY_autohide-hide-die-size t=1237 good
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-name t=414 good
2013-05-29 23:03:44 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-cookie-max-count t=720 good
2013-05-29 23:03:45 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size STARTING
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_itYY_autohide-hide-die-size t=1497 good
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name STARTING
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-name t=603 good
2013-05-29 23:03:47 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count STARTING
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-cookie-max-count t=556 good
2013-05-29 23:03:48 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size STARTING
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_022000_survey_jaYY_autohide-hide-die-size t=831 good
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name STARTING
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-name t=609 good
2013-05-29 23:03:49 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count STARTING
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2013_dr_en-hide-cookie-max-count t=587 good
2013-05-29 23:03:50 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name STARTING
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-name t=741 good
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count STARTING
2013-05-29 23:03:51 TTMServerMessageUpdateJob CNBanner:B13_text_2009_dr_en-hide-cookie-max-count t=632 good
2013-05-29 23:03:51 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 STARTING
2013-05-29 23:08:18 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/51 t=267215 good
Further runs had several similarly slow Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates instances.
Is there a way to find the slow functions via profiling?
(In reply to comment #7) > Is there a way to find the slow functions via profiling? You mean methods inside of SolrTTMServer::update? It would help if there were profiling calls in there, I suppose (for all I/O at least). Maybe strace might help here too...
Related URL: https://gerrit.wikimedia.org/r/67249 (Gerrit Change Ie230f2e40e97fa1fb00f056d4631e64cbcbf9eae)
Related URL: https://gerrit.wikimedia.org/r/67250 (Gerrit Change I8a54e5dd00d6e6298e59000b8bdeaf4135d3158a)
Related URL: https://gerrit.wikimedia.org/r/67252 (Gerrit Change I61d03705a80f67762091ea3bde8fc2bbcc6ce520)
I submitted 3 patch sets that I expect will resolve all the issues. The main problem was a Solr that was not optimised. Gerrit change #67249 enables autoCommit in Solr, which makes it possible for Solr to batch a bunch of index updates and commit them in one go. All open updates will be committed after at most 5 seconds. Gerrit change #67250 ensures that Translate no longer requires immediate commits, but instead asks Solr to commit within 5 seconds using the commitWithin feature. Gerrit change #67252 increases ramBufferSizeMB from 32 (MB) to 100 (MB). This is the new default since Solr 4.1[1]. This reduces the number of segment merges at the cost of additional memory usage. One possible easy optimisation remains, which is the mergeFactor. As I wasn't really able to choose between two evils at this point in time, and without access to the Solr admin interface to check what's actually happening, I think these changes should go live and we should see if any additional optimisation is needed before doing another round of updates. [1] https://issues.apache.org/jira/browse/SOLR-4074
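For reference, the autoCommit and ramBufferSizeMB changes amount to a solrconfig.xml fragment along these lines (values taken from the description above; the exact element placement in the shipped config may differ):

```xml
<!-- Gerrit change #67249: batch incoming index updates and commit
     them together; anything pending is committed after at most
     5 seconds. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>5000</maxTime>
  </autoCommit>
</updateHandler>

<!-- Gerrit change #67252: raise the indexing RAM buffer from 32 to
     100 MB (the Solr 4.1 default), trading memory for fewer merges. -->
<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>
```

The commitWithin part of Gerrit change #67250 lives on the client side: Translate attaches a commitWithin deadline to its update requests instead of forcing an explicit commit after each one.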
Aaron says that TTMServerMessageUpdateJob jobs have been running since yesterday and there doesn't seem to be a big backlog; there have never been many, so he also hasn't dropped any such jobs from the queue. Vanadium seems happy and Special:SearchTranslations works. Can this bug be closed?
I believe this has been fixed now.