Last modified: 2014-07-07 18:28:14 UTC
Sometimes database changes are made to core (not likely) or to deployed extensions (much more likely) that are not backwards compatible: new tables, schema changes, and the like. In production, these are handled manually. On the Beta Cluster, we run update.php on a regular basis, so errors appear briefly and then magically disappear, which makes them hard to catch. I'm not (necessarily?) proposing that we stop running update.php regularly. I am trying to figure out a way to catch these types of mistakes before we have outages in production.
Strawman #1:
* We run a set of integration tests "right" before a run of update.php.
* We run the same set "right" after.
* We somehow compare the two runs in a way that isn't too noisy/manual.
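To make the strawman concrete, here is a minimal sketch (my own illustration, not an existing tool) of the comparison step: given pass/fail results from the run before update.php and the run after, report only the tests that newly fail afterwards, which filters out pre-existing (already noisy) failures.

```python
# Sketch of strawman #1's comparison step. Test results are modeled as
# {test_name: passed}; a real harness would parse PHPUnit/browser-test
# output instead of hand-built dicts.

def newly_failing(before: dict, after: dict) -> list:
    """Return names of tests that passed before update.php but fail after."""
    return sorted(
        name for name, passed in after.items()
        if not passed and before.get(name, False)
    )

if __name__ == "__main__":
    before = {"math_render": True, "login": True, "search": False}
    after = {"math_render": False, "login": True, "search": False}
    # Only math_render is a new failure; search was already broken before.
    print(newly_failing(before, after))
```

The point of diffing against the "before" run is exactly the noise problem mentioned above: Beta always has some failing tests, so only the delta is a useful signal.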
(In reply to Greg Grossmeier from comment #0)
> I am trying to figure out a way to catch these types of mistakes before we
> have outages in production.

The errors aren't really present in beta, are they? The problem is that we have a gap in code review/procedure that allows changes requiring database schema changes, massive cache invalidation, or similarly disruptive actions (which I think I've heard called "scap traps" before) to be merged without producing some sort of durable list of the actions required to deploy the code in production.

I've had similar problems everywhere I've worked where the size of the development plus operations team was greater than one (and sometimes even when I was working solo). The most easily automated solution I've seen in practice was used at $DAYJOB-1. We used a tool developed in-house that could compare a canonical schema, which we kept in version control, with the schema of any live database. The tool would emit the DDL changes needed to sync the database with the canonical DDL. For local development and our integration environment, these DDL changes were applied automatically by a script. In our staging and production environments, the DDL alter script was generated as part of the build for the environment, then manually reviewed and applied by a DBA.

The major problem with this approach is scaling it as the deploy cycle accelerates from once per week to once per day/hour/minute.
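The core of such a tool can be sketched in a few lines. This is my own rough illustration (not the actual in-house tool Bryan describes): both schemas are modeled as {table: {column: type}} dicts, and the diff emits CREATE TABLE for missing tables and ALTER TABLE ADD COLUMN for missing columns. A real implementation would read the live side from information_schema and also handle type changes, indexes, and drops.

```python
# Hypothetical schema-diff sketch: compare a canonical schema (from
# version control) against a live database schema and emit the DDL
# needed to bring the live side in sync.

def schema_diff(canonical: dict, live: dict) -> list:
    """Return DDL statements that sync `live` with `canonical`."""
    ddl = []
    for table, columns in canonical.items():
        if table not in live:
            cols = ", ".join(f"{c} {t}" for c, t in columns.items())
            ddl.append(f"CREATE TABLE {table} ({cols});")
            continue
        for col, ctype in columns.items():
            if col not in live[table]:
                ddl.append(f"ALTER TABLE {table} ADD COLUMN {col} {ctype};")
    return ddl

if __name__ == "__main__":
    # The missing `mathoid` table from comment #3; the column shown here
    # is a made-up placeholder, not the extension's real schema.
    canonical = {"mathoid": {"some_column": "varbinary(16)"}}
    print(schema_diff(canonical, {}))
```

In the workflow described above, the output of a diff like this would be applied automatically in dev/integration, but reviewed by a DBA before being applied in staging/production.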
(In reply to Bryan Davis from comment #2)
> (In reply to Greg Grossmeier from comment #0)
> > I am trying to figure out a way to catch these types of mistakes before we
> > have outages in production.
>
> The errors aren't really present in beta are they?

From Physikerwelt in the gerrit change that prompted this:

>> See error on betlabs: A database query error has occurred. This may indicate a bug in the software.
>> Function: MathRenderer::readFromDatabase Error: 1146 Table 'labswiki.mathoid' doesn't exist (10.68.17.94)

There is a window between when a new table dependency is merged and when the table is created on the Beta Cluster (by the next run of update.php) during which errors are logged.
(In reply to Greg Grossmeier from comment #3)
> (In reply to Bryan Davis from comment #2)
> > (In reply to Greg Grossmeier from comment #0)
> > > I am trying to figure out a way to catch these types of mistakes before we
> > > have outages in production.
> >
> > The errors aren't really present in beta are they?
>
> There is a window between when a new table dependency is merged and when the
> table is created on the Beta Cluster (by the next run of update.php) during
> which errors are logged.

See also: https://bugzilla.wikimedia.org/show_bug.cgi?id=67485