Last modified: 2014-02-11 21:06:07 UTC
Some wikis (e.g. Inclupedia) may wish to import Wikipedia's logging, page, recentchanges, revision, and user data, using the same primary key values for that imported data as Wikipedia, while also having their own data in those tables, starting their own log_id, page_id, rc_id, rev_id, and user_id values at, e.g., 1 quadrillion for future-proofing purposes. This will require changing these primary keys to bigint unsigned, since int unsigned maxes out at about 4.3 billion. These keys are going to hit their maxes on enwiki eventually anyway, although it might take a few more decades. Going to bigint unsigned will use up more storage, though. Another issue is the extensive core use of intval() on these primary keys. One solution is to create a wfBigintval() global function that truncates any decimal portion, as intval() does, and strips any non-numeric characters. However, it would then return a string rather than an int; hopefully no one does any mathematical operations on these values, or it will be necessary to come up with alternative ways of doing those. I can change old_id to bigint unsigned as well, if you think that would be helpful. Should we just migrate all primary keys to bigint unsigned? See also bug 60962, which this will at least partly fix. See also [[mw:Manual:Primary key storage in other fields]].
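For illustration, the widening described above might look something like the following. This is only a sketch: the table and column names are from the core MediaWiki schema, MODIFY must repeat the column's full definition, and every other column that stores one of these IDs would need the same treatment.

```sql
-- Sketch: widen two of the primary keys to 64 bits (repeat per table).
-- MODIFY must restate the full column definition, including AUTO_INCREMENT.
ALTER TABLE revision
  MODIFY rev_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE page
  MODIFY page_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- Columns that store these IDs elsewhere must be widened too,
-- e.g. rev_page holds page_id values:
ALTER TABLE revision
  MODIFY rev_page BIGINT UNSIGNED NOT NULL;
```

On large tables these ALTERs rebuild the table, so they would be slow on something the size of enwiki's revision table.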
The use case I have in mind is that the forthcoming rewritten [[mw:Extension:MirrorTools]] will do some schema changes that ALTER the logging, page, recentchanges, revision, and user tables with AUTO_INCREMENT = 1000000000000000; (i.e. 1 quadrillion). That way, we won't have collisions when the primary keys for imported log events, pages, recent changes, revisions, and users from Wikipedia reach high numbers. The alternative would be to set an auto increment of, say, 3 billion. That would lead to eventual collisions, though.
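Concretely, the MirrorTools schema change described above might be sketched as below, assuming the key columns have already been widened to bigint unsigned (an INT column could not hold a starting value of 10^15):

```sql
-- Sketch: reserve IDs below 10^15 for rows mirrored from Wikipedia;
-- locally created rows start above that.
ALTER TABLE logging       AUTO_INCREMENT = 1000000000000000;
ALTER TABLE page          AUTO_INCREMENT = 1000000000000000;
ALTER TABLE recentchanges AUTO_INCREMENT = 1000000000000000;
ALTER TABLE revision      AUTO_INCREMENT = 1000000000000000;
ALTER TABLE user          AUTO_INCREMENT = 1000000000000000;
```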
Oh yeah, I just realized -- 3 billion won't work either, because that's above the 32-bit intval() max of ~2.1 billion. One way or another, this intval stuff has to change.
Note that PHP's 'int' type is dependent on the underlying architecture's C-language 'int' type. This means that 'int's and intval() are 32-bit on 32-bit architectures, but are 64-bit on most 64-bit architectures. (Windows may be an exception to this as Windows is weird. If you're using a Windows server, you may wish to consider a nice Linux virtual machine.)
(In reply to comment #3)
> Note that PHP's 'int' type is dependent on the underlying architecture's
> C-language 'int' type.
>
> This means that 'int's and intval() are 32-bit on 32-bit architectures, but
> are 64-bit on most 64-bit architectures. (Windows may be an exception to
> this as Windows is weird. If you're using a Windows server, you may wish to
> consider a nice Linux virtual machine.)

Good point. I'm running 32-bit Ubuntu Linux; it looks like I'll need to install the 64-bit flavor. That seems like a better solution than making such extensive changes to the core, which might introduce some undesired effects.
The biggest two of those primary keys are (the others are at least an order of magnitude smaller):

MariaDB [enwiki_p]> select max(rc_id) from recentchanges;
+------------+
| max(rc_id) |
+------------+
|  636232414 |
+------------+
1 row in set (0.04 sec)

MariaDB [enwiki_p]> select max(rev_id) from revision;
+-------------+
| max(rev_id) |
+-------------+
|   594743381 |
+-------------+

So in 13 years, we've managed to get up to an rc_id of 636,232,414, or about 30% of possible values (since rc_id appears not to be unsigned, for some reason). It doesn't seem like this is something we should really be worrying about, and it doesn't seem like fixing it will become any harder in the future than it is now. Thus this seems like a solution in search of a problem. (Although perhaps making the default schema have these fields be unsigned might be a good idea just for cleanliness' sake.)
Yeah, if there weren't a particular use case I had in mind, I would totally say, leave it as is for now. There are a few different options for Inclupedia that I can think of:

(1) Discard the enwiki rev_ids of imported revisions.
(2) Keep those rev_ids in the mirrorbot table, in case they're needed later.
(3) Do what I'm doing now: import revisions with their enwiki rev_ids, and use higher rev_ids for Inclupedia-only revisions.
(4) Add a few fields to the revision table, or add new tables, similar to what you see at [[mw:Extension:MirrorTools#mirror-logging.sql]].

Option 3 seemed the simplest, especially since options 1 and 2 would still have required adding another field to indicate that the revisions were imported (it's pertinent information when people are browsing through). If I go with option 3, then an auto-increment starting point has to be selected, e.g. 3 billion. It might take decades, but there will eventually be a collision. These are wikis whose life spans are projected to be possibly decades or centuries, for all we know. For peace of mind, I'd prefer not to set up the equivalent of a Y2K issue.

The next decision is, (a) change the schema for everyone, or (b) change it for just Inclupedia. I don't really mind changing it just for Inclupedia; part of the point of filing the bug was to figure out which direction people wanted to go in. We can WONTFIX and then I'll just add SQL files to MirrorTools for the necessary changes, especially since intval's behavior can be changed by moving to a 64-bit system. Since I was going to make a schema change anyway, I figured I'd ask: hey, want to just change it in core?
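For reference, option 4 could be as small as a single flag column. The column name below is hypothetical, purely for illustration; the actual proposed fields are in [[mw:Extension:MirrorTools#mirror-logging.sql]]:

```sql
-- Hypothetical sketch of option 4: mark imported revisions with a flag
-- instead of partitioning the rev_id space.
ALTER TABLE revision
  ADD rev_is_mirrored TINYINT UNSIGNED NOT NULL DEFAULT 0;
```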
See also [[m:Solution in search of a problem]]
Thanks for the feedback; I'm WONTFIXing this. That saves me some work, as it's easier to make a schema change by MediaWiki extension than by MediaWiki core.