Last modified: 2013-11-07 04:16:12 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T58577, the corresponding Phabricator task for complete and up-to-date bug report information.
Bug 56577 - Some revisions seem to have disappeared on it.wikipedia
Some revisions seem to have disappeared on it.wikipedia
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
General/Unknown (Other open bugs)
wmf-deployment
All All
: Highest critical (vote)
: ---
Assigned To: Nobody - You can work on this!
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-11-04 19:30 UTC by Vituzzu@it.wiki
Modified: 2013-11-07 04:16 UTC (History)
11 users (show)

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments

Description Vituzzu@it.wiki 2013-11-04 19:30:22 UTC
I cannot see any rev to undelete/unsuppress here:
https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3AVandra%2FSandbox3
https://it.wikipedia.org/w/index.php?title=Speciale:Ripristina&target=Utente%3ADott_Alessandro_Sartore
no suppressions have been made but just two deletions have
Comment 1 Tim Starling 2013-11-04 23:44:21 UTC
The archive rows appear on the master but not on the slaves:

mysql:root@db1027 [frwiki]> select count(*) from archive;
+----------+
| count(*) |
+----------+
|  4476558 |
+----------+

mysql:root@db1040 [frwiki]> select count(*) from archive;
+----------+
| count(*) |
+----------+
|  4469921 |
+----------+

The difference is 6637 missing archive rows.
Comment 2 Tim Starling 2013-11-04 23:45:16 UTC
Sorry, the above comment should have gone on bug 56589 (which this will probably be duped to anyway)
Comment 3 Tim Starling 2013-11-04 23:49:13 UTC
The same calculation shows 4371 missing archive rows on itwiki.
Comment 4 Sean Pringle 2013-11-05 01:22:55 UTC
We have some MariaDB replication slave drift due to our use of statement-based
binlogs (necessary for a further 6+ months, for operational reasons, before a
switch to row-based).

This is an onoing problem in general, but in this case it's also been
exacerbated by recent schema changes on the `archive` table.

No data is actually missing and the affected slaves are being resynchronized.
Comment 5 Tim Starling 2013-11-05 01:28:19 UTC
SBR is what we've always used, how can it cause "slave drift"? Is there a particular bug with it in MariaDB?

How did you do the ar_id schema change? I remember when we added log_id, it wasn't possible to just do an ALTER TABLE on each slave, because the resulting IDs wouldn't be synchronised. We had to recreate the table and insert all rows into it. Is that a fixed bug, or did you work around that somehow?
Comment 6 Sean Pringle 2013-11-05 02:12:37 UTC
Slave drift is not a particular bug with MariaDB, but a series of possible problems mostly related to statements that are non-deterministic plus a few fringe cases.

http://dev.mysql.com/doc/refman/5.5/en/replication-rbr-safe-unsafe.html

Since we upgraded to MariaDB 5.5 our mysqld error logs have been steadily growing with warnings related to unsafe binary log statements that could cause a Slave to drift out of sync with its master. An example (not specifically related to this bug):

131028  5:58:22 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE

These warning messages didn't appear as frequently in MySQL 5.1 error logs, not because it was safer then, but because they're additional warnings added to MariaDB 5.5 to identify possible problems exposed in bug reports filed *since* 5.1 went GA. They indicate that our slave datasets have been slowly drifting for some time.

I've been investigating and correcting the problem using the pt-table-sync tool [1]. Unfortunately that tool requires a primary or unique key on each table to operate efficiently, which `archive` did not have, hence it may have been further out of sync on slaves than other tables. This is also the reason I wanted to find a way to do the schema change on the master rather than rotating a slave.

In this case archive schema was modified by:

- Delaying a slave as fallback
- Creating a copy or `archive` as `_archive_new`
- Altering `_archive_new` to add ar_id PK
- Altering `_archive_new` to add ar_hash (an md5 hash of the row, for sanity checks)
- Adding INSERT/UPDATE/DELETE triggers to `archive` to keep `_arcive_new` in sync
- Batch inserting data from `archive` to `_archive_new`
- Cross checking the two tables for differences
- Removing ar_hash
- Switching the two tables
- Keeping old table around as `archive_save` for a time

The batch transfer step was done with pt-online-schema-change using the temporary ar_hash as key (normally it would use the table primary key, but of course `archive` didn't have one). 

[1] As there may yet be more unforseen bugs with SBR, the only true fix is to switch to RBR.
Comment 7 Sean Pringle 2013-11-05 02:21:17 UTC
Oops forgot to add:

The batch transfer process used a REPLACE INTO ... SELECT query walking the archive.ar_rev_id index in fixed sizes. This statement in some forms is one that gets listed as unsafe for SBR.

My initial checks on slaves showed it should have been ok there, but the focus was always to ensure master accuracy first and continue to check slave sync regularly while SBR is in effect.

This bug occuring around the time of the schema change isn't likely to be a co-incidence. Hence me placing SBR in the spotlight.
Comment 8 Tim Starling 2013-11-05 03:10:36 UTC
Thanks for that.

If MediaWiki is generating any queries which are actually causing slave drift, you should file bugs for that. We may be able to work around it on the application side.
Comment 9 Vituzzu@it.wiki 2013-11-05 13:24:53 UTC
(In reply to comment #8)
> Thanks for that.
> 
> If MediaWiki is generating any queries which are actually causing slave
> drift,
> you should file bugs for that. We may be able to work around it on the
> application side.

At least for these two pages it seems to work again.

Note You need to log in before you can comment on or make changes to this bug.


Navigation
Links