Last modified: 2014-07-29 13:37:59 UTC
(happened on Wed 9th of July)

* check with gage whether this work already took place
* check that during the switchover, hosts were correctly reporting
  * like only the correct hosts going down during the migration
  * the other hosts were picking up the correct traffic
* check that each host is still reporting the expected number of requests
  * sampled-1000 logs (stat1002 /a/squid/...)
  * mobile-sampled-100 logs (stat1002 /a/squid/...)
    can be done by plotting requests per host per time
  * zero logs (stat1002 /a/squid/...)
  * edit logs (stat1002 /a/squid/...)
* Find out where those files get written, and find a way to cover
  * oxygen,
  * gadolinium (unicast)
  * gadolinium (multicast)
  * erbium
  if they are not covered by the above file
Meeting notes on etherpad: http://etherpad.wikimedia.org/p/6gR8aSREkz
* check with gage whether this work already took place

Checked with gage: ops had not checked the network traffic in depth during the switchover.

* check that during the switchover, hosts were correctly reporting
  * like only the correct hosts going down during the migration
  * the other hosts were picking up the correct traffic

Checked traffic on all hosts (ulsfo, eqiad, and esams) using data gathered by Christian from the sampled logs. Found that only ULSFO hosts had their traffic go down, and only for the expected period. Also found that only EQIAD hosts had their traffic increase abnormally, again only for the expected period. Overall, I believe that no traffic leaked or increased anywhere outside of what was expected. I will attach pictorial proof and spreadsheets.
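The per-host check described above can be sketched roughly as follows. This is only an illustration, not the script that was actually used. It assumes that the sampled logs are tab-separated, with the cache host name in the first field and an ISO-style timestamp in the third field. The field positions, the host names, and the sample lines are hypothetical, and the real files on stat1002 may be laid out differently.

```python
from collections import Counter


def requests_per_host_hour(lines, sampling_factor=1000,
                           host_field=0, time_field=2):
    """Estimate requests per (host, hour) from sampled log lines.

    Assumes tab-separated lines; field positions are assumptions and
    can be overridden. Each sampled line stands for `sampling_factor`
    real requests.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(host_field, time_field):
            continue  # skip malformed or truncated lines
        host = fields[host_field]
        hour = fields[time_field][:13]  # e.g. "2014-07-09T15"
        counts[(host, hour)] += sampling_factor
    return counts


# Made-up sample lines, for illustration only.
sample_lines = [
    "cache-a.example\t101\t2014-07-09T15:10:02\t200\n",
    "cache-a.example\t102\t2014-07-09T15:48:51\t200\n",
    "cache-b.example\t877\t2014-07-09T15:20:13\t304\n",
    "cache-b.example\t878\t2014-07-09T16:05:40\t200\n",
    "broken line without tabs\n",
]

counts = requests_per_host_hour(sample_lines)
for (host, hour), estimate in sorted(counts.items()):
    print(host, hour, estimate)
```

Plotting the resulting counts per host over time should make it visible whether only the expected hosts dropped and which hosts picked up the traffic.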
Created attachment 16018 [details] the daily traffic to ulsfo and eqiad, by host, during the switchover
Created attachment 16019 [details] the daily traffic to ulsfo and eqiad, by datacenter, during the switchover
Created attachment 16020 [details] the hourly traffic to ulsfo and eqiad, by host, during the switchover
Created attachment 16021 [details] the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover
Created attachment 16022 [details] spreadsheet of daily data for ulsfo and eqiad with totals and graph
Created attachment 16023 [details] spreadsheet of hourly data for ulsfo and eqiad with totals and graph
Ok, by examining router interface statistics with LibreNMS, I have confirmed that when traffic from* ULSFO ceased, traffic from EQIAD increased by a similar amount.

* Rather than looking at actual inbound web request traffic, I'm looking at the outbound responses because they should correlate and are much bigger.

I've provided URLs for reference; LibreNMS access may be requested by emailing access-requests@rt.wikimedia.org.

EQIAD:

cr1-eqiad xe 5/3/1 (transit): +700 Mbps
https://librenms.wikimedia.org/graphs/to=1406129520/id=4515/type=port_bits/from=1404228720/

cr1-eqiad xe 4/3/2 (transit): +700 Mbps
https://librenms.wikimedia.org/graphs/to=1405179180/id=6821/type=port_bits/from=1404747180/

cr1-eqiad xe 4/3/1 (peering): +250 Mbps
https://librenms.wikimedia.org/graphs/to=1405154040/id=6820/type=port_bits/from=1404722040/

cr2-eqiad xe 5/3/1 (transit): +1000 Mbps
https://librenms.wikimedia.org/graphs/to=1405159680/id=134/type=port_bits/from=1404727680/

cr2-eqiad xe 5/3/3 (peering): +1000 Mbps
https://librenms.wikimedia.org/graphs/to=1405159800/id=136/type=port_bits/from=1404727800/

ULSFO:

cr1: 0/0/3 (transit): -1800 Mbps
https://librenms.wikimedia.org/graphs/to=1405158120/id=7200/type=port_bits/from=1404726120/

cr2: 0/0/2 (transit): -400 Mbps, maybe
https://librenms.wikimedia.org/graphs/to=1405158600/id=7139/type=port_bits/from=1404726600/

cr2: 0/0/3 (peering): -1100 Mbps
https://librenms.wikimedia.org/graphs/to=1405158480/id=7140/type=port_bits/from=1404726480/

Increase at EQIAD: roughly 3650 Mbps
Decrease at ULSFO: roughly 3300 Mbps

I had to visually estimate the values from the graphs, so this seems like acceptable equivalence.

In addition to the traffic math, I'm not aware of any user reports of service disruption, and our 3rd party monitoring reports 100% availability in all significant categories for that week. Therefore I have high confidence that traffic was successfully rerouted without loss during the migration.
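The traffic math above can be re-checked with a few lines of Python. This is a minimal sketch that only re-adds the visually estimated per-interface values from the comment. The dictionary labels and the relative-gap calculation are illustrative additions, not part of the original analysis.

```python
# Visually estimated changes in Mbps, copied from the comment above.
eqiad_increase_mbps = {
    "cr1-eqiad xe 5/3/1 (transit)": 700,
    "cr1-eqiad xe 4/3/2 (transit)": 700,
    "cr1-eqiad xe 4/3/1 (peering)": 250,
    "cr2-eqiad xe 5/3/1 (transit)": 1000,
    "cr2-eqiad xe 5/3/3 (peering)": 1000,
}
ulsfo_decrease_mbps = {
    "cr1 0/0/3 (transit)": 1800,
    "cr2 0/0/2 (transit)": 400,  # marked "maybe" in the comment
    "cr2 0/0/3 (peering)": 1100,
}


def summarize(increase, decrease):
    """Return total gain, total loss, absolute gap, and gap relative to loss."""
    gained = sum(increase.values())
    lost = sum(decrease.values())
    gap = gained - lost
    return gained, lost, gap, gap / lost


gained, lost, gap, relative_gap = summarize(eqiad_increase_mbps,
                                            ulsfo_decrease_mbps)
print(f"Increase at EQIAD: {gained} Mbps")
print(f"Decrease at ULSFO: {lost} Mbps")
print(f"Gap: {gap} Mbps ({relative_gap:.1%} of the ULSFO decrease)")
```

The totals match the 3650 Mbps and 3300 Mbps quoted above. The gap of 350 Mbps is about a tenth of the ULSFO decrease, which is plausible for values read off graphs by eye.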
Approximate timeline:

2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD
2014-07-09 17:00 UTC: I arrive at ULSFO
2014-07-09 17:30 UTC: ULSFO becomes unreachable
[servers and routers are moved to a new room, everything is plugged back in]
2014-07-09 21:30 UTC: routers back online
2014-07-09 22:45 UTC: Mark restores traffic to ULSFO
2014-07-10 00:30 UTC: I leave ULSFO
Awe-some. Thank you so much Jeff
(In reply to Dan Andreescu from comment #2)
> Also found
> that only EQIAD hosts had their traffic increase abnormally, [...]

I had expected to see amssq47 (esams) being called out, as it picked up traffic just as ULSFO's went down. That's just a coincidence, right?

(In reply to Jeff Gage from comment #9)
> Approximate timeline:
> 2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD

I see that the timeline is labeled as “approximate”, but since we're looking at numbers at hourly granularity ... Looking at the graphs, they take the deep downward dive already ~4-5 hours earlier. This earlier time also nicely matches Mark's rerouting commit [1], which is shown as merged on 2014-07-09 10:40 UTC in gerrit.

> 2014-07-09 22:45 UTC: Mark restores traffic to ULSFO

While that might be right, it is reflected neither by the graphs nor by the puppet repo. Looking at the graphs, they start to rise only ~1-2 hours later. This later time again nicely aligns with the puppet repo. There, Brandon's (not Mark's) rerouting commits [2] are shown as merged between 2014-07-10 00:38 and 2014-07-10 08:37.

[1] https://gerrit.wikimedia.org/r/#/c/144934/
[2] They are a series of commits between
https://gerrit.wikimedia.org/r/#/c/145182/
https://gerrit.wikimedia.org/r/#/c/145221/
Hi,

We should definitely trust the commits over my information source, the Server Admin Log, whose events are manually input:

https://wikitech.wikimedia.org/wiki/SAL

22:42 mark: Enabling PAIX BGP sessions on cr2-ulsfo
22:40 mark: Enabling WMF HQ BGP sessions on cr1-ulsfo
22:38 mark: Enabling TiNet transit links on cr1-ulsfo
22:35 mark: Enabling WMF HQ BGP sessions on cr2-ulsfo
22:34 mark: Enabling NTT and HE transit links on cr2-ulsfo
16:17 mark: ulsfo is now offline
16:16 mark: Shutdown NTT BGP sessions on cr2-ulsfo
16:13 mark: Shutdown TiNet BGP sessions on cr1-ulsfo
16:10 mark: Shutdown IXP BGP sessions on cr2-ulsfo
16:10 mark: Shutdown WMF HQ BGP sessions on cr2-ulsfo
16:09 mark: Shutdown WMF HQ BGP sessions on cr1-ulsfo

From the patch we can see that all traffic directed away from ULSFO was sent to EQIAD. Therefore it does seem like any increased traffic to ESAMS would be coincidental. I'll ask Mark to comment on this.
Yeah, amssq47 had been used as a test server before, not receiving any traffic. Brandon reinstalled and put it back in production around that time, so that would explain it.
(In reply to Mark Bergsma from comment #13)
> [ Situation around amssq47 ]

Thanks for confirming.

------------------

Per host per hour packet loss numbers look good.
(The only host that sticks out there a bit is cp3013, a mobile esams cache. But esams should not have seen changes from the ULSFO move, and that host is not super-stable around packet loss anyway. Nothing concerning, but it is on the border more often than not. As the total volume of messages looks sound, and other parts of this host's log do too, I am assuming it's a coincidence.)

Per host per hour total traffic numbers look wrong at first sight. But during the ULSFO floor move, a semi-final of the 2014 FIFA World Cup took place (Yay, coincidence!). This caused a traffic spike during the ULSFO move, which makes the numbers look really skewed. However, when limiting to various slices of non-soccer traffic, no spike is visible. For each individual non-soccer slice, data looks good.

Per host per hour urls look good.

Per host per hour per status code numbers look good.

Per host per hour referers look good.

From my point of view, log data is overall good.
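For readers who want to repeat the "non-soccer slice" check, here is a rough sketch of the idea. The thread does not say how the slices were actually chosen, so the keyword filter, the record layout, and the sample URLs below are all assumptions made for illustration.

```python
from collections import Counter


def hourly_counts(records, exclude_keywords=()):
    """Count records per hour, skipping URLs that contain any keyword.

    `records` is an iterable of (hour, url) pairs. Keywords must be
    given in lowercase because URLs are lowercased before matching.
    """
    counts = Counter()
    for hour, url in records:
        lowered = url.lower()
        if any(keyword in lowered for keyword in exclude_keywords):
            continue
        counts[hour] += 1
    return counts


# Made-up records, for illustration only.
records = [
    ("2014-07-09T19", "http://en.wikipedia.org/wiki/Main_Page"),
    ("2014-07-09T20", "http://en.wikipedia.org/wiki/Main_Page"),
    ("2014-07-09T20", "http://en.wikipedia.org/wiki/2014_FIFA_World_Cup"),
    ("2014-07-09T20", "http://en.wikipedia.org/wiki/FIFA_World_Cup"),
]

all_traffic = hourly_counts(records)
non_soccer = hourly_counts(records, exclude_keywords=("fifa", "world_cup"))

for hour in sorted(all_traffic):
    print(hour, all_traffic[hour], non_soccer[hour])
```

If the spike disappears in the filtered counts but not in the unfiltered ones, that points to the event rather than the datacenter move as the cause.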