Last modified: 2014-11-18 18:58:21 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T71613, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 69613 - Translation memory (solr) is failing
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: General/Unknown
Version: wmf-deployment
Hardware: All
OS: All
Importance: High normal
Target Milestone: ---
Assigned To: Nik Everett
Depends on:
Blocks:

Reported: 2014-08-15 16:17 UTC by Niklas Laxström
Modified: 2014-11-18 18:58 UTC
CC: 11 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---

Attachments: (none)

Description Niklas Laxström 2014-08-15 16:17:26 UTC
I noticed that I no longer get suggestions for many messages while translating TechNews. Some suggestions, especially for short messages, still work.

It is unknown to me whether there has been a gradual slowdown and we have only now passed the 10s timeout limit, or whether a sudden slowdown has made everything slower.

The ElasticSearch backend for translation memory is not yet ready. Since I do not know the cause, I do not know if there is anything we can do before migrating to ElasticSearch.
Comment 1 Nemo 2014-08-15 20:12:17 UTC
I kept some logs of whether a simple one-word query was failing on Meta, but nothing was logged after 2013-05-08; too bad I didn't keep any more details.

I have no access to icinga, and there is nothing remarkable on https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Miscellaneous+eqiad&h=zinc.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS
Comment 2 Andre Klapper 2014-08-18 11:37:26 UTC
Preliminarily assigning to Nik; feel free to reassign. (Also wondering if this might somehow touch deployment systems territory; probably not.)
Comment 3 Nik Everett 2014-08-18 12:18:43 UTC
Created an RT ticket to get access to the machine running the translation memory: zinc.

Icinga reports all is well with zinc.
Comment 4 Nik Everett 2014-08-18 12:24:26 UTC
Seeing a whole lot of this in the exception log:
2014-08-18 10:10:33 mw1016 mediawikiwiki: [64d4ab30] [no req]   Exception from line 46 of /usr/local/apache/common-local/php-1.24wmf17/extensions/Translate/tag/TranslateRenderJob.php: Oops, this should not happen!

Probably not good.
Comment 5 Nik Everett 2014-08-18 12:25:28 UTC
Saw this:
2014-08-18 10:37:52 mw1053 commonswiki: [e97ae14a] /rpc/RunJobs.php?wiki=commonswiki&type=MessageUpdateJob&maxtime=30&maxmem=300M   Exception from line 1192 of /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php: A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: INSERT IGNORE INTO `translate_groupstats` (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('page-File:The GLAM-Wiki Revolution.webm/srt','ab','188','0','0','0')
Function: MessageGroupStats::forItemInternal
Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.16.29)

#0 /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php(1158): DatabaseBase->reportQueryError()
#1 /usr/local/apache/common-local/php-1.24wmf16/includes/db/Database.php(1956): DatabaseBase->query()
#2 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(384): DatabaseBase->insert()
#3 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(295): MessageGroupStats::forItemInternal()
#4 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageGroupStats.php(114): MessageGroupStats::forGroupInternal()
#5 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/TranslatablePage.php(751): MessageGroupStats::forGroup()
#6 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/PageTranslationHooks.php(159): TranslatablePage->getTranslationPercentages()
#7 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/PageTranslationHooks.php(136): PageTranslationHooks::updateTranslationPage()
#8 /usr/local/apache/common-local/php-1.24wmf16/includes/Hooks.php(206): PageTranslationHooks::onSectionSave()
#9 /usr/local/apache/common-local/php-1.24wmf16/includes/GlobalFunctions.php(4034): Hooks::run()
#10 /usr/local/apache/common-local/php-1.24wmf16/includes/page/WikiPage.php(2047): wfRunHooks()
#11 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/MessageUpdateJob.php(44): WikiPage->doEditContent()
#12 /usr/local/apache/common-local/php-1.24wmf16/includes/jobqueue/JobRunner.php(109): MessageUpdateJob->run()
#13 /usr/local/apache/common-local/rpc/RunJobs.php(42): JobRunner->run()
#14 {main}

Just once so far.
Comment 6 Nik Everett 2014-08-18 12:26:10 UTC
More frequent now - some deadlock.  Great.
Comment 7 Nik Everett 2014-08-18 12:29:05 UTC
And some of this in fatal:
[2014-08-18 11:45:30] Catchable fatal error: Argument 1 passed to MessageGroups::getSharedGroups() must implement interface MessageGroup, null given, called in /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/SpecialPageTranslation.php on line 93 and defined at /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/MessageGroups.php on line 305
Server: mw1091
Method: GET
URL: http://meta.wikimedia.org/w/index.php?title=Special:PageTranslation&do=discourage&target=-redacted-&revision=-1&token=ac62e5547a4bae78516b834da1a7779d%2B%5C
Cookie: -redacted-
Backtrace:
#0 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/MessageGroups.php(305): MessageGroups::getSharedGroups()
#1 /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/tag/SpecialPageTranslation.php(93): MessageGroups::getSharedGroups(NULL)
#2 /usr/local/apache/common-local/php-1.24wmf16/includes/specialpage/SpecialPage.php(382): SpecialPageTranslation->execute(NULL)
#3 /usr/local/apache/common-local/php-1.24wmf16/includes/specialpage/SpecialPageFactory.php(516): SpecialPage->run(NULL)
#4 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(294): SpecialPageFactory::executePath(Object(Title), Object(RequestContext))
#5 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(609): MediaWiki->performRequest()
#6 /usr/local/apache/common-local/php-1.24wmf16/includes/MediaWiki.php(458): MediaWiki->main()
#7 /usr/local/apache/common-local/php-1.24wmf16/index.php(46): MediaWiki->run()
#8 /usr/local/apache/common-local/w/index.php(3): require('/usr/local/apac...')
#9 {main}
Comment 8 Nik Everett 2014-08-18 12:48:22 UTC
Was able to reproduce the problem.  It didn't log anything incriminating.  I'll need access to zinc before I can proceed.
Comment 9 Andrew Otto 2014-08-18 14:46:34 UTC
Access to zinc granted here:

https://gerrit.wikimedia.org/r/#/c/154816/

See also: https://rt.wikimedia.org/Ticket/Display.html?id=8144
Comment 10 Nik Everett 2014-08-18 16:05:32 UTC
Update now that I have access.  Some requests are taking a long, long time.  For example:
manybubbles@zinc:~$ time curl -s 'localhost:8983/solr/select?wt=json&q=_val_%3A%22strdist%28%5C%22%3A%2A+%E2%80%9C%24upload%E2%80%9D+displays+a+special+page+that+allows+logged-in+users+to+upload+images+and+other+files+to+the+wiki.+Uploaded+files+can+be+linked+from+or+embedded+in+wiki+pages.+This+link+is+not+displayed+if+file+uploading+is+not+enabled.%3Cbr+%2F%3E%27%27Uploading+files%2C+viewing+files+on+the+server%2C+including+them+in+wiki+pages+and+managing+the+uploaded+files+is+discussed+in+the+%5B%5BHelp%3AManaging+files%7Cmanaging+files%5D%5D+section+of+this+manual.%27%27%5C%22%2Ccontent%2Cedit%29%22&start=0&rows=25&fl=globalid%2Ccontent%2Cscore&fq=language%3A%22en%22'
...
real	0m22.406s

Rerunning the query doesn't speed it up.

My mad Linux skillz (top) tell me that this query pegs one CPU for about half the time, then shifts to another CPU and pegs it for the other half, then returns.

I believe the slowness is caused by computing the string distance of lots and lots and lots of matches, but I'm not sure. I'm also not sure what changed.
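
For reference, the q parameter above is URL-encoded; decoded, it is a Solr _val_ function query of the form strdist("<full source message>", content, edit), i.e. Solr is asked to score every candidate document by the edit distance between the message being translated and the document's content field. A minimal decoding sketch in plain PHP (only the built-in urldecode() is used; the placeholder stands for the encoded value copied from the curl command above, and nothing here is specific to the production setup):

<?php
// Sketch: decode the URL-encoded q= parameter from the curl command above.
// Replace the placeholder with the real encoded value before running.
$encodedQuery = '<q parameter copied from the curl command above>';
echo urldecode( $encodedQuery ) . "\n";
// Expected shape of the output, an edit-distance function query:
//   _val_:"strdist(\"<full source message text>\", content, edit)"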
Comment 11 Nik Everett 2014-08-18 16:07:37 UTC
So I'm not sure where to go from here - this strdist is exactly what the Elasticsearch translation memory doesn't support.  And I'm pretty sure it's super duper slow.  Anyway - here is an example of the number of matches (not returned results) that can cause the result to be slow:
        "numFound": 81067,
Comment 12 Nemo 2014-08-18 19:58:19 UTC
22 seconds is too high compared to the ~10 second maximum that was (more or less) originally envisioned, but it is not unrealistic.
Comment 13 Andre Klapper 2014-08-19 16:52:00 UTC
(In reply to Nik Everett from comment #11)
> So I'm not sure where to go from here - this strdist is exactly what the
> Elasticsearch translation memory doesn't support.

Any idea if it was "just" a gradual slowdown that got us into this situation?
If not, wondering who else could continue investigating (and how)... :-/
Comment 14 Nik Everett 2014-08-20 12:35:27 UTC
I don't think we have logs that would show whether this was gradual or not.  My recommendation for a quick fix is to raise the timeout.  My recommendation for a permanent fix is to rethink using string distance as a measure of relevance; it's a very slow measure.  Beyond that, I'm not sure.
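
For the quick fix, the client-side timeout lives in the translation memory service entry of the MediaWiki configuration. A hedged sketch of what raising it might look like (the service name, backend class and key names below are assumptions for illustration and should be checked against the live wmf-config and the Translate extension's TTMServer documentation before use):

<?php
// Hypothetical sketch only: raise the translation memory lookup timeout.
// Every key and value below is an assumption to be verified against the
// actual $wgTranslateTranslationServices entry in production.
$wgTranslateTranslationServices['TTMServer'] = array(
	'type'          => 'ttmserver',
	'class'         => 'SolrTTMServer',   // assumed Solr backend class
	'cutoff'        => 0.75,              // assumed existing relevance cutoff
	'timeout-sync'  => 10,                // assumed current value, in seconds
	'timeout-async' => 30,                // raised timeout for asynchronous lookups
);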
Comment 15 Niklas Laxström 2014-08-20 12:59:57 UTC
Given the info provided by Nik above, it looks like gradual slowdown. What worries me is that the slow queries seem to block the MessageUpdateJobs.

Increasing the timeout even more might not be a good idea in that light, or it might not make a difference. We could also try to limit the set of potential matches even further in the code.

We should be very cautious about deploying the same algorithm to ElasticSearch (as per existing plans to stop using Solr). Nik knows better whether it might cause the same problem there, or whether that setup can make the problem smaller.

I do have initial plans to work on translation memory (on my research time), but the ETA for that is more like the end of this year. What I would like to avoid is not being able to run the TM service at all until then.
Comment 16 Nik Everett 2014-08-20 13:30:39 UTC
I'm pretty sure the timeouts we have now are client side only - so the server will keep working anyway.  Given that, I think we should raise them for now and think more about replacing string distance later.  It's important that we think about it soon, though.  Is the feature useful at all if it takes 40 seconds to return?
Comment 17 Niklas Laxström 2014-08-20 14:12:23 UTC
We have preloading that is able to hide lots of delay in the common case of translating messages sequentially.
Comment 18 Nemo 2014-08-22 06:18:21 UTC
It would still be interesting if someone with access could take a look at icinga (which also has history and acknowledgements, right?): the check_solr / check_replicated_solr check is still enabled in role::solr.

Or if you already did, sorry and ignore me:

2014-08-20 22.41 < Nemo_bis> manybubbles: for zinc/ttm, did you manage to look into the icinga alert and see if any meaningful history can be extracted from it?
2014-08-21  1.56 < manybubbles> Nemo_bis: regarding zinc/ttm - I'll check icinga now.  I didn't see anything before but I wasn't being super thorough because I wanted to look at logs on the box

I only found:

$ grep "Solr on zinc" irclogs/FreeNode/#wikimedia-operations.log 
2014-01-12 18.15 <+icinga-wm> PROBLEM - Solr on zinc is CRITICAL: Average request time is 1000.0105 (gt 1000)  
2014-02-21 20.04 <+icinga-wm> RECOVERY - Solr on zinc is OK: All OK
Comment 20 Niklas Laxström 2014-08-31 10:08:37 UTC
It just occurred to me that perhaps the fix for bug 67921 is related to this, as it causes more candidates to exist, thus slowing down the queries.
Comment 21 Niklas Laxström 2014-11-18 18:58:21 UTC
We have switched to ES now. The ES cluster is so much faster that, with the same logic, a query which was previously timing out now takes less than 4 seconds.
