Last modified: 2014-03-11 10:53:36 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and links other than those displaying bug reports and their history might be broken. See T63802, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 61802 - Use a different format for l10n_cache (or document why the current one is the best one)
Status: NEW
Product: MediaWiki
Classification: Unclassified
Component: Database (Other open bugs)
Version: 1.23.0
Hardware: All
OS: All
Importance: Low enhancement
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Depends on:
Blocks:
Reported: 2014-02-22 10:42 UTC by Oliver Keyes
Modified: 2014-03-11 10:53 UTC
CC: 10 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Oliver Keyes 2014-02-22 10:42:20 UTC
Our current l10n_cache model seems to use serialised PHP arrays as the storage mechanism for localisation strings. This makes perfect sense if we assume that all use cases for retrieving the data are centred around PHP, which, for production, they are. Unfortunately it's tremendously frustrating from a research perspective. As an example, let's use namespace names and aliases, which are stored in l10n_cache and accessible via the MediaWiki API.

Namespace names and aliases are a relatively common thing to need to retrieve, at least for me, for things like introducing granularity into our request logs or UA data.

Fortunately for our machines and unfortunately for our researchers, the research and analytics machines are, very deliberately, not connected to the internet directly (with the exception of stat1, which is being decommissioned). Accordingly, the API option is not available to us; if we want to retrieve namespace names, we need to use the l10n_cache table.

Doing this requires us to use a language with a PHP unserialiser available (Python has one, R does not), roll our own if one isn't available, or write something incredibly hacky where we read the data in, de-serialise it and save it in a more usable format /through/, say, PHP or Python. This is an unattractive proposition because it makes for less readable code, which is a concern not only for transparency but also in the situation where the code is 'productionised' by the analytics engineers, for which it needs to be workable in Java.
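To make the hacky option concrete, here is roughly what I mean, as a minimal sketch; it assumes the lc_lang/lc_key/lc_value schema of the l10n_cache table, a plain PHP-serialised array in lc_value, and the phpserialize and pymysql Python packages, with placeholder connection details:

    # Sketch only: pull the English namespace names out of l10n_cache and
    # re-save them as JSON. Assumes lc_value holds a plain PHP-serialised array.
    import json
    import phpserialize
    import pymysql

    conn = pymysql.connect(host="db-host", user="research", db="enwiki")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT lc_value FROM l10n_cache WHERE lc_lang = %s AND lc_key = %s",
            ("en", "namespaceNames"),
        )
        blob = cur.fetchone()[0]

    # phpserialize returns byte strings for PHP strings; decode them.
    raw = phpserialize.loads(blob)
    namespace_names = {int(k): v.decode("utf-8") for k, v in raw.items()}

    with open("namespace_names_en.json", "w") as f:
        json.dump(namespace_names, f, ensure_ascii=False, indent=2)

Every consumer that isn't PHP ends up carrying some variant of this glue code around, which is exactly the readability problem I mean.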

Can we switch away from serialised PHP to, say, JSON objects? If not, why not? Is there documentation of the rationale for using serialised PHP anywhere?
Comment 1 Niklas Laxström 2014-02-22 10:49:31 UTC
We are not using serialized files for the localisation cache. The cache is either in CDB format or in the SQL database.

Are you saying that the strings in the database are serialized?
Comment 2 Oliver Keyes 2014-02-22 11:25:03 UTC
(In reply to Niklas Laxström from comment #1)

> Are you saying that the strings in the database are serialized?

Indeed (or, at least, the arrays appear to be)
Comment 3 Oliver Keyes 2014-02-23 05:11:59 UTC
https://www.mediawiki.org/wiki/Manual:L10n_cache_table would seem to count as "confirmed".
Comment 4 Nemo 2014-02-23 09:18:52 UTC
But we still lack a use case. One server in the world with DB but without API isn't really convincing. The kind of information you need (like content namespace or not) also has nothing to do with l10n, so there is no reason to believe it would last forever in there if it's even available.

On the "why", that's the format used in other tables too, for instance log_params, so your question should be rephrased as: document why and where the DB tables contain serialiased data.
Comment 5 Oliver Keyes 2014-02-23 19:23:31 UTC
(In reply to Nemo from comment #4)
> But we still lack a use case. One server in the world with DB but without
> API isn't really convincing.
> 
It's more than "one server"; it's going to be all our analytics machines, and that's the crucial element. If the scenario was reversed - some strange universe in which we had one production machine and the rest were analytics-based - the argument wouldn't hold any water. Part of this, sure, is that production is 90% of our use cases, but that's only part of it. The other is simply that research and analytics /are/ a use case, and one of increasing importance. We now have 6 researchers, 4 analytics engineers, a director-level position and a Product Manager; arguing that this is not something worth addressing, either by justifying it or solving it, simply because those people can get their work done on only a few machines, ignores the tremendous resources being thrown behind analytics.

A use case was provided with the original post ;p.

>The kind of information you need (like content
> namespace or not) also has nothing to do with l10n, so there is no reason to
> believe it would last forever in there if it's even available.

It's absolutely available; please do me the courtesy of assuming I did basic research and attempted to solve the problem through other means before submitting the bug. If you want to check yourself, look for namespaceNames and namespaceAliases in lc_key.

Sure, namespaceNames are not /directly/ a localisation problem - they're wiki-based as well as language based - but they are, most of the time, language-based.

More importantly, whether they do or do not last forever (which seems a very strange test. /nothing/ we do lasts forever. well, other than the projects. That's kind of why we're here), they're currently stored there. The solution to the problem is for the language engineering team to either (a) explain why serialised PHP is the best plausible way to store this data or (b) change it.  I'm not asking for a solution proof against any possible permutation of future events, because that's impossible, but the fact of the matter is that this table is the source of the issue as it stands.


> On the "why", that's the format used in other tables too, for instance
> log_params, so your question should be rephrased as: document why and where
> the DB tables contain serialised data.

Sure, if documentation was all I was asking for ;p. log_params is less crucial; to my knowledge the only thing we tend to use it for is retrieving the patrol status of a page (that is, whether it was patrolled automatically or manually), and that's something you can extract with a minimal amount of effort because it's not a complex piece of data - it's just the 1 or 0 closest to the end of the string.
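(For the record, the crude extraction I have in mind is roughly the sketch below; it just takes the 0 or 1 nearest the end of log_params, exactly as described above, and makes no attempt to parse whatever the field's real layout is:)

    import re

    def patrol_was_automatic(log_params):
        # Heuristic only: grab the 0/1 closest to the end of log_params.
        if isinstance(log_params, bytes):
            log_params = log_params.decode("utf-8", "replace")
        match = re.search(r"([01])[^01]*$", log_params)
        return bool(int(match.group(1))) if match else None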
Comment 6 Niklas Laxström 2014-02-24 09:23:45 UTC
For a definite answer you need to ask Tim, who wrote the code. My guess is that it just followed the pattern of BagOStuff, which serializes values.

I will undermine your use case a bit though:
1) You should not be accessing a cache directly, it may not even exist.
2) WMF is using CDB files, no DB for l10n cache.
3) Why don't you spin up MediaWiki and ask it to provide the information you need? No network is needed for that.

The language engineering team does not have knowledge of or capacity to address all i18n/l10n related issues [1], but we will do our best to help you and make you depend less on us.

[1] We spend the largest portion of our time on feature development.
Comment 7 Nemo 2014-02-24 10:26:22 UTC
By the way, Toolserver has a special table for namespace names and I'm told Labs recently got such feature too. https://wiki.toolserver.org/view/Toolserver_database
Comment 8 Oliver Keyes 2014-02-24 17:26:49 UTC
(In reply to Niklas Laxström from comment #6)
> For a definite answer you need to ask Tim, who wrote the code. My guess is
> that it just followed the pattern of BagOStuff, which serializes values.
> 
> I will undermine your use case a bit though:
> 1) You should not be accessing a cache directly, it may not even exist.
Sure; if you can point to a better way to automatically get to this data on a machine with no connection to the internet, I'm happy to hear it.
> 2) WMF is using CDB files, no DB for l10n cache.
Then why is the table there? It's a MediaWiki feature we don't use?
> 3) Why don't you spin up MediaWiki and ask it to provide the information
> you need? No network is needed for that.
Sure; that's not an easily replicable solution, though. To use that scenario: every time the table is updated, I'd need to spin up a local MediaWiki instance, query it, retrieve the data, save that to a file, manually transfer the file to [stat1001/stat1002/analytics*/delete-as-applicable]... and so would anyone else looking at doing things with granularity at the namespace level. That kind of approach would be more easily done by just querying the APIs in a loop, retrieving the data as JSON files, and transferring that... the problem being the 'automation' bit.
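(Roughly what that loop looks like, as a sketch; the wiki list and output paths are placeholders, and it relies on the standard siteinfo API module plus the requests package:)

    import json
    import requests

    # Placeholder list; the real job would iterate over every production wiki.
    WIKIS = {
        "enwiki": "https://en.wikipedia.org/w/api.php",
        "dewiki": "https://de.wikipedia.org/w/api.php",
    }

    for dbname, endpoint in WIKIS.items():
        resp = requests.get(endpoint, params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "namespaces|namespacealiases",
            "format": "json",
        })
        resp.raise_for_status()
        with open("namespaces_%s.json" % dbname, "w") as f:
            json.dump(resp.json()["query"], f, ensure_ascii=False, indent=2)

The sticking point isn't writing this; it's that the analytics machines can't reach the APIs, so the files still have to be generated elsewhere and copied across by hand.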
> 
> The language engineering team does not have knowledge of or capacity to
> address all i18n/l10n related issues [1], but we will do our best to help
> you and make you depend less on us.
> 

Thanks :).
Comment 9 Oliver Keyes 2014-02-24 17:28:01 UTC
(In reply to Nemo from comment #7)
> By the way, Toolserver has a special table for namespace names and I'm told
> Labs recently got such feature too.
> https://wiki.toolserver.org/view/Toolserver_database

Huh; interesting. Probably not directly applicable, because there's no direct connection betwixt the analytics machines and the toolserver dbs to my knowledge, but the way they extracted that could be useful.
Comment 10 Nemo 2014-02-24 17:43:59 UTC
(In reply to Oliver Keyes from comment #9)
> Huh; interesting. Probably not directly applicable, because there's no
> direct connection betwixt the analytics machines and the toolserver dbs to
> my knowledge, but the way they extracted that could be useful.

Or you could start using the infrastructure everyone uses instead of reinventing the wheel.
Comment 11 Oliver Keyes 2014-02-24 17:48:53 UTC
(In reply to Nemo from comment #10)
> (In reply to Oliver Keyes from comment #9)
> > Huh; interesting. Probably not directly applicable, because there's no
> > direct connection betwixt the analytics machines and the toolserver dbs to
> > my knowledge, but the way they extracted that could be useful.
> 
> Or you could start using the infrastructure everyone uses instead of
> reinventing the wheel.

You mean, Tool Labs? Sure, I'll just poke Legal and see how they feel about moving our request logs to Labs. Oh, wait ;p.
Comment 12 Niklas Laxström 2014-02-24 18:30:04 UTC
(In reply to Oliver Keyes from comment #8)
> > 2) WMF is using CDB files, no DB for l10n cache.
> Then why is the table there? It's a MediaWiki feature we don't use?

Database tables are created unconditionally. The store can be changed via the $wg* configuration globals, and WMF uses CDB files distributed to the application servers for speed.

Can you tell me a bit more about where you need the namespaces? Are you parsing the wikitext? Doesn't, for example, the pagelinks table have the namespace resolved to the numerical id already?
Comment 13 Oliver Keyes 2014-02-24 18:30:57 UTC
(In reply to Niklas Laxström from comment #12)
> 
> Can you tell me a bit more about where you need the namespaces? Are you
> parsing the wikitext? Doesn't, for example, the pagelinks table have the
> namespace resolved to the numerical id already?

Nope, the RequestLogs - which are unfortunately totally detached from MediaWiki proper.
Comment 14 Niklas Laxström 2014-02-24 22:17:52 UTC
Then I don't currently see any solution other than option 3) that I mentioned. If you don't want to install MediaWiki on the analytics servers, the second best thing would be to automate it with some kind of script, keeping the constraints in mind.

After getting the namespaces, you should be prepared to mirror some of the transformations MediaWiki handles: spaces/pluses/underscores, URL encoding, charset encoding, case insensitivity, Unicode normalizations, redirect pages... Some of these cause an actual redirect, some of them don't.
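(A rough illustration of what mirroring a few of those transformations might look like; this sketch only handles underscores, percent-encoding, Unicode normalisation and a lower-cased namespace lookup. The namespace map is assumed to have been built from siteinfo data, and redirects and the rarer normalisations are deliberately left out:)

    import unicodedata
    import urllib.parse

    def split_namespace(raw_title, namespace_map):
        # namespace_map is assumed to map lower-cased namespace names and
        # aliases to numeric ids, e.g. {"user talk": 3, "benutzer diskussion": 3}.
        title = urllib.parse.unquote(raw_title).replace("_", " ").strip()
        title = unicodedata.normalize("NFC", title)
        if ":" in title:
            prefix, rest = title.split(":", 1)
            ns_id = namespace_map.get(prefix.strip().lower())
            if ns_id is not None:
                return ns_id, rest.strip()
        return 0, title  # no recognised prefix -> main namespace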
