Last modified: 2014-07-29 13:37:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links may be broken. See T70199, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 68199 - ULSFO post-move verification
Status: RESOLVED FIXED
Product: Analytics
Classification: Unclassified
Component: General/Unknown
Version: unspecified
Hardware: All OS: All
Importance: Unprioritized normal
Target Milestone: ---
Assigned To: christian
Whiteboard: u=Kevin c=General/Unknown p=0 s=2014-...
Depends on:
Blocks:
Reported: 2014-07-17 23:00 UTC by Kevin Leduc
Modified: 2014-07-29 13:37 UTC
CC List: 9 users

See Also:
Web browser: ---
Mobile Platform: ---
Assignee Huggle Beta Tester: ---


Attachments
the daily traffic to ulsfo and eqiad, by host, during the switchover (332.16 KB, image/png)
2014-07-23 20:52 UTC, Dan Andreescu
Details
the daily traffic to ulsfo and eqiad, by datacenter, during the switchover (138.58 KB, image/png)
2014-07-23 20:53 UTC, Dan Andreescu
Details
the hourly traffic to ulsfo and eqiad, by host, during the switchover (483.68 KB, image/png)
2014-07-23 20:53 UTC, Dan Andreescu
Details
the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover (276.22 KB, image/png)
2014-07-23 20:54 UTC, Dan Andreescu
Details
spreadsheet of daily data for ulsfo and eqiad with totals and graph (36.06 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-07-23 20:54 UTC, Dan Andreescu
Details
spreadsheet of hourly data for ulsfo and eqiad with totals and graph (58.32 KB, application/vnd.oasis.opendocument.spreadsheet)
2014-07-23 20:55 UTC, Dan Andreescu
Details

Description Kevin Leduc 2014-07-17 23:00:56 UTC
(This happened on Wednesday, July 9th.)
* check with Gage whether this work already took place

* check that during the switchover, hosts were correctly reporting
   * like only the correct hosts going down during the migration
   * the other hosts were picking up the correct traffic

* check that each host is still reporting the expected number of requests
   * sampled-1000 logs (stat1002 /a/squid/...)
   * mobile-sampled-100 logs (stat1002 /a/squid/...)
      can be done by plotting requests per host per time
   * zero logs (stat1002 /a/squid/...)
   * edit logs (stat1002 /a/squid/...)
   * Find out where those files get written, and find a way to cover
      * oxygen,
      * gadolinium (unicast)
      * gadolinium (multicast)
      * erbium
     if they are not covered by the above files
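The per-host check described above (plotting requests per host per time) can be sketched as below. This is an illustrative Python snippet, not the script that was actually used; the sample line layout (hostname first, timestamp in the third field) is an assumption, so the field indices would need adjusting to the real sampled-log format.

```python
from collections import Counter

def requests_per_host_per_hour(lines):
    """Count log lines per (host, hour) bucket.

    Assumes each line starts with the reporting hostname and carries an
    ISO-style timestamp in the third whitespace-separated field -- a
    simplification; check the real sampled-log layout before use.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        host, ts = fields[0], fields[2]
        counts[(host, ts[:13])] += 1  # "2014-07-09T17" -> hourly bucket
    return counts

# Hypothetical sample lines; real hostnames/sequence numbers will differ.
sample = [
    "cp4001.ulsfo.wmnet 101 2014-07-09T17:05:00 ...",
    "cp4001.ulsfo.wmnet 102 2014-07-09T17:40:00 ...",
    "cp1055.eqiad.wmnet 900 2014-07-09T17:10:00 ...",
]
print(requests_per_host_per_hour(sample))
```

Plotting these (host, hour) counts over the switchover window is enough to spot a host that stopped (or failed to stop) reporting.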
Comment 1 nuria 2014-07-23 15:01:19 UTC
Meeting notes on etherpad:
http://etherpad.wikimedia.org/p/6gR8aSREkz
Comment 2 Dan Andreescu 2014-07-23 20:51:29 UTC
* check with gage whether this work already took place

Checked with Gage; ops had not checked the network traffic in depth during the switchover.

* check that during the switchover, hosts were correctly reporting
   * like only the correct hosts going down during the migration
   * the other hosts were picking up the correct traffic

Checked traffic on all hosts (ulsfo, eqiad, and esams) by using data gathered by Christian from the sampled logs.  Found that only ULSFO hosts had their traffic go down, and only for the expected period.  Also found that only EQIAD hosts had their traffic increase abnormally, and again for the expected period.  Overall, I believe that no traffic leaked or increased anywhere outside of what was expected.  I will attach pictorial proof and spreadsheets.
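The per-datacenter comparison can be sketched by rolling per-host hourly counts up by datacenter. A minimal sketch, assuming hostnames follow the `cpNNNN.<datacenter>.wmnet` convention (the sample numbers below are invented for illustration, not taken from the attached spreadsheets):

```python
def by_datacenter(host_hour_counts):
    """Roll per-host hourly request counts up to per-datacenter totals.

    Assumes the datacenter is the second dot-separated component of the
    hostname (e.g. 'cp4001.ulsfo.wmnet') -- an illustrative convention.
    """
    totals = {}
    for (host, hour), n in host_hour_counts.items():
        parts = host.split(".")
        dc = parts[1] if len(parts) > 1 else "unknown"
        totals[(dc, hour)] = totals.get((dc, hour), 0) + n
    return totals

# Invented numbers showing the expected shape of the switchover.
per_host = {
    ("cp4001.ulsfo.wmnet", "2014-07-09T15"): 5000,
    ("cp4001.ulsfo.wmnet", "2014-07-09T17"): 0,      # ULSFO dark during the move
    ("cp1055.eqiad.wmnet", "2014-07-09T15"): 8000,
    ("cp1055.eqiad.wmnet", "2014-07-09T17"): 13000,  # eqiad picks up the load
}
print(by_datacenter(per_host))
```

In the real data, the check is that the ULSFO dip and the EQIAD rise line up in time and roughly cancel out, with all other datacenters flat.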
Comment 3 Dan Andreescu 2014-07-23 20:52:52 UTC
Created attachment 16018 [details]
the daily traffic to ulsfo and eqiad, by host, during the switchover
Comment 4 Dan Andreescu 2014-07-23 20:53:28 UTC
Created attachment 16019 [details]
the daily traffic to ulsfo and eqiad, by datacenter, during the switchover
Comment 5 Dan Andreescu 2014-07-23 20:53:52 UTC
Created attachment 16020 [details]
the hourly traffic to ulsfo and eqiad, by host, during the switchover
Comment 6 Dan Andreescu 2014-07-23 20:54:13 UTC
Created attachment 16021 [details]
the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover
Comment 7 Dan Andreescu 2014-07-23 20:54:51 UTC
Created attachment 16022 [details]
spreadsheet of daily data for ulsfo and eqiad with totals and graph
Comment 8 Dan Andreescu 2014-07-23 20:55:07 UTC
Created attachment 16023 [details]
spreadsheet of hourly data for ulsfo and eqiad with totals and graph
Comment 9 Jeff Gage 2014-07-24 03:45:30 UTC
Ok, by examining router interface statistics with LibreNMS, I have confirmed that when traffic from* ULSFO ceased, traffic from EQIAD increased by a similar amount.

 * Rather than looking at actual inbound web request traffic, I'm looking at the outbound responses because they should correlate and are much bigger.

I've provided URLs for reference; LibreNMS access may be requested by emailing access-requests@rt.wikimedia.org.

EQIAD:
    cr1-eqiad xe 5/3/1 (transit)
        https://librenms.wikimedia.org/graphs/to=1406129520/id=4515/type=port_bits/from=1404228720/
        +700 Mbps
    cr1-eqiad xe 4/3/2 (transit)
        https://librenms.wikimedia.org/graphs/to=1405179180/id=6821/type=port_bits/from=1404747180/
        +700 Mbps
    cr1-eqiad xe 4/3/1 (peering)
        https://librenms.wikimedia.org/graphs/to=1405154040/id=6820/type=port_bits/from=1404722040/
        +250 Mbps
    cr2-eqiad xe 5/3/1 (transit)
        https://librenms.wikimedia.org/graphs/to=1405159680/id=134/type=port_bits/from=1404727680/
        +1000 Mbps
    cr2-eqiad xe 5/3/3 (peering)
        https://librenms.wikimedia.org/graphs/to=1405159800/id=136/type=port_bits/from=1404727800/
        +1000 Mbps
ULSFO:
    cr1: 0/0/3 (transit)
        https://librenms.wikimedia.org/graphs/to=1405158120/id=7200/type=port_bits/from=1404726120/
        -1800 Mbps
    cr2: 0/0/2 (transit)
        https://librenms.wikimedia.org/graphs/to=1405158600/id=7139/type=port_bits/from=1404726600/
        -400 Mbps (maybe)
    cr2: 0/0/3 (peering)
        https://librenms.wikimedia.org/graphs/to=1405158480/id=7140/type=port_bits/from=1404726480/
        -1100 Mbps

Increase at EQIAD: roughly 3650 Mbps
Decrease at ULSFO: roughly 3300 Mbps

I had to visually estimate the values from the graphs, so this seems like acceptable equivalence.
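The arithmetic behind those totals can be checked directly from the per-port estimates listed above (the values are the visual estimates from the graphs, so the ~10% gap is within eyeball error):

```python
# Visually estimated deltas (Mbps) read off the LibreNMS graphs above.
eqiad_increase = [700, 700, 250, 1000, 1000]  # cr1/cr2-eqiad transit + peering
ulsfo_decrease = [1800, 400, 1100]            # cr1/cr2-ulsfo transit + peering

up, down = sum(eqiad_increase), sum(ulsfo_decrease)
print(up, down)                         # 3650 3300
print(round(abs(up - down) / down, 2))  # 0.11 -- roughly a 10% gap
```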

In addition to the traffic math, I'm not aware of any user reports of service disruption, and our 3rd party monitoring reports 100% availability in all significant categories for that week. Therefore I have high confidence that traffic was successfully rerouted without loss during the migration.

Approximate timeline:
2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD
2014-07-09 17:00 UTC: I arrive at ULSFO
2014-07-09 17:30 UTC: ULSFO becomes unreachable
    [servers and routers are moved to a new room, everything is plugged back in]
2014-07-09 21:30 UTC: routers back online
2014-07-09 22:45 UTC: Mark restores traffic to ULSFO
2014-07-10 00:30 UTC: I leave ULSFO
Comment 10 Dan Andreescu 2014-07-24 10:38:03 UTC
Awe-some.  Thank you so much Jeff
Comment 11 christian 2014-07-24 11:22:39 UTC
(In reply to Dan Andreescu from comment #2)
> Also found
> that only EQIAD hosts had their traffic increase abnormally, [...]

I had expected to see amssq47 (esams) being called out, as it picked
up traffic just as ULSFO's went down.

That's just a coincidence. Right?

(In reply to Jeff Gage from comment #9)
> Approximate timeline:
> 2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD

While I see that the timeline is labeled as “approximate”, since
we're looking at numbers at hourly granularity ...

Looking at the graphs, they take the deep downward dive already ~4-5
hours earlier. This earlier time also nicely matches Mark's rerouting
commit [1], which is shown as merged on 2014-07-09 10:40 UTC in
Gerrit.

> 2014-07-09 22:45 UTC: Mark restores traffic to ULSFO

While that might be right, it is neither reflected by graphs, nor the
puppet repo.

Looking at the graphs, they start to rise only ~1-2 hours later. This
later time again nicely aligns with the puppet repo. There, Brandon's
(not Mark's) rerouting commits [2] are shown as merged between
2014-07-10 00:38 and 2014-07-10 08:37.



[1] https://gerrit.wikimedia.org/r/#/c/144934/

[2] They are a series of commits between
  https://gerrit.wikimedia.org/r/#/c/145182/
  https://gerrit.wikimedia.org/r/#/c/145221/
Comment 12 Jeff Gage 2014-07-24 17:05:02 UTC
Hi,

We should definitely trust the commits over my information source, the Server Admin Log, whose events are entered manually: https://wikitech.wikimedia.org/wiki/SAL

22:42 mark: Enabling PAIX BGP sessions on cr2-ulsfo
22:40 mark: Enabling WMF HQ BGP sessions on cr1-ulsfo
22:38 mark: Enabling TiNet transit links on cr1-ulsfo
22:35 mark: Enabling WMF HQ BGP sessions on cr2-ulsfo
22:34 mark: Enabling NTT and HE transit links on cr2-ulsfo 

16:17 mark: ulsfo is now offline
16:16 mark: Shutdown NTT BGP sessions on cr2-ulsfo
16:13 mark: Shutdown TiNet BGP sessions on cr1-ulsfo
16:10 mark: Shutdown IXP BGP sessions on cr2-ulsfo
16:10 mark: Shutdown WMF HQ BGP sessions on cr2-ulsfo
16:09 mark: Shutdown WMF HQ BGP sessions on cr1-ulsfo 

From the patch we can see that all traffic directed away from ULSFO was sent to EQIAD. Therefore it does seem like any increased traffic to ESAMS would be coincidental. I'll ask Mark to comment on this.
Comment 13 Mark Bergsma 2014-07-25 11:10:18 UTC
Yeah, amssq47 had been used as a test server before, not receiving any traffic. Brandon reinstalled and put it back in production around that time, so that would explain it.
Comment 14 christian 2014-07-29 13:37:59 UTC
(In reply to Mark Bergsma from comment #13)
> [ Situation around amssq47 ]

Thanks for confirming.

------------------

Per host per hour packet loss numbers look good.
(The only host that sticks out a bit is cp3013, a mobile esams
cache. But esams should not have seen changes from the ULSFO move, and
that host is not super-stable around packet loss anyway. Nothing
concerning, but it is on the border more often than not. As the total
volume of messages looks sound, and other parts of this host's log do
too, I am assuming it's a coincidence.)


Per host per hour total traffic numbers look wrong at first glance.
But during the ULSFO floor move, a semi-final of the 2014 FIFA World
Cup took place (yay, coincidence!). This caused a traffic spike
during the ULSFO move which makes the numbers look really skewed.
However, when limiting to various slices of non-soccer traffic, no
spike is visible.
For each individual non-soccer slice, the data looks good.

Per host per hour urls look good.

Per host per hour per status code numbers look good.

Per host per hour referers look good.

From my point of view, log data is overall good.
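A per-host packet-loss screen like the one described above (flagging borderline hosts such as cp3013) might be sketched as follows. The 2% cutoff and the loss figures are invented for illustration; they are not the thresholds or values from the actual check.

```python
import statistics

def flag_lossy_hosts(loss_by_host, threshold=2.0):
    """Return hosts whose mean hourly packet-loss percentage exceeds
    `threshold`. The 2.0% default is an illustrative cutoff, not the
    criterion used in the real verification."""
    flagged = {}
    for host, hourly_pct in loss_by_host.items():
        avg = statistics.mean(hourly_pct)
        if avg > threshold:
            flagged[host] = round(avg, 2)
    return flagged

# Hypothetical hourly loss percentages for two caches.
loss = {
    "cp3013.esams.wmnet": [2.5, 3.1, 2.8],  # borderline mobile cache
    "cp1055.eqiad.wmnet": [0.1, 0.2, 0.1],
}
print(flag_lossy_hosts(loss))
```

Running this over the switchover window would surface exactly the kind of on-the-border host called out above, without flagging healthy caches.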
