Last modified: 2014-07-29 13:37:59 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and for historical purposes. It is not possible to log in and except for displaying bug reports and their history, links might be broken. See T70199, the corresponding Phabricator task for complete and up-to-date bug report information.

Bug 68199 - ULSFO post-move verification


Summary:	ULSFO post-move verification

Status:	RESOLVED FIXED

Product:	Analytics
Classification:	Unclassified
Component:	General/Unknown
Version:	unspecified
Hardware:	All All

Importance:	Unprioritized normal
Target Milestone:	---
Assigned To:	christian

URL:
Whiteboard:	u=Kevin c=General/Unknown p=0 s=2014-...
Keywords:

Depends on:
Blocks:

Reported:	2014-07-17 23:00 UTC by Kevin Leduc
Modified:	2014-07-29 13:37 UTC (History)
CC List:	9 users (show)

See Also:
Web browser:	---
Mobile Platform:	---
Assignee Huggle Beta Tester:	---

Attachments
the daily traffic to ulsfo and eqiad, by host, during the switchover (332.16 KB, image/png) 2014-07-23 20:52 UTC, Dan Andreescu	Details
the daily traffic to ulsfo and eqiad, by datacenter, during the switchover (138.58 KB, image/png) 2014-07-23 20:53 UTC, Dan Andreescu	Details
the hourly traffic to ulsfo and eqiad, by host, during the switchover (483.68 KB, image/png) 2014-07-23 20:53 UTC, Dan Andreescu	Details
the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover (276.22 KB, image/png) 2014-07-23 20:54 UTC, Dan Andreescu	Details
spreadsheet of daily data for ulsfo and eqiad with totals and graph (36.06 KB, application/vnd.oasis.opendocument.spreadsheet) 2014-07-23 20:54 UTC, Dan Andreescu	Details
spreadsheet of hourly data for ulsfo and eqiad with totals and graph (58.32 KB, application/vnd.oasis.opendocument.spreadsheet) 2014-07-23 20:55 UTC, Dan Andreescu	Details

Description Kevin Leduc 2014-07-17 23:00:56 UTC

(The move happened on Wednesday, 9 July.)
* check with gage whether this work already took place

* check that during the switchover, hosts were correctly reporting
   * like only the correct hosts going down during the migration
   * the other hosts were picking up the correct traffic

* check that each host is still reporting the expected number of requests
   * sampled-1000 logs (stat1002 /a/squid/...)
   * mobile-sampled-100 logs (stat1002 /a/squid/...)
      can be done by plotting requests per host per time
   * zero logs (stat1002 /a/squid/...)
   * edit logs (stat1002 /a/squid/...)
   * Find out where those files get written, and find a way to cover
      * oxygen,
      * gadolinium (unicast)
      * gadolinium (multicast)
      * erbium
     if they are not covered by the above files
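
The "requests per host per time" check above could be sketched roughly as follows. This is a minimal illustration, not the script that was actually used. It assumes whitespace-separated log lines with the host name in the first field and an ISO 8601 timestamp in the third. The function name, the field positions and the sample host names are assumptions and would need adjusting to the real log format.

```python
from collections import Counter


def requests_per_host_hour(lines, sample_rate=1000, host_field=0, time_field=2):
    """Estimate requests per (host, hour) from sampled log lines.

    The field positions are assumptions about the log layout. Each sampled
    line is scaled by sample_rate to approximate the real request count.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= max(host_field, time_field):
            continue  # skip truncated lines
        hour = fields[time_field][:13]  # keep "YYYY-MM-DDTHH"
        counts[(fields[host_field], hour)] += sample_rate
    return counts


# Hypothetical sample lines; host names and values are made up.
sample = [
    "cache-a.example 101 2014-07-09T15:59:01 rest-of-line",
    "cache-a.example 102 2014-07-09T16:00:07 rest-of-line",
    "cache-b.example 55 2014-07-09T16:30:00 rest-of-line",
    "cache-b.example 56 2014-07-09T16:45:12 rest-of-line",
    "truncated",
]
counts = requests_per_host_hour(sample)
for (host, hour), n in sorted(counts.items()):
    print(host, hour, n)
```

Plotting these counts per host over time should make a host that stops reporting, or one that picks up unexpected traffic, stand out.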

Comment 1 nuria 2014-07-23 15:01:19 UTC

Meeting notes on etherpad:
http://etherpad.wikimedia.org/p/6gR8aSREkz

Comment 2 Dan Andreescu 2014-07-23 20:51:29 UTC

* check with gage whether this work already took place

Checked with gage: ops had not checked the network traffic in depth during the switchover.

* check that during the switchover, hosts were correctly reporting
   * like only the correct hosts going down during the migration
   * the other hosts were picking up the correct traffic

Checked traffic on all hosts (ulsfo, eqiad, and esams) by using data gathered by Christian from the sampled logs.  Found that only ULSFO hosts had their traffic go down, and only for the expected period.  Also found that only EQIAD hosts had their traffic increase abnormally, and again for the expected period.  Overall, I believe that no traffic leaked or increased anywhere outside of what was expected.  I will attach pictorial proof and spreadsheets.

Comment 3 Dan Andreescu 2014-07-23 20:52:52 UTC

Created attachment 16018 [details]
the daily traffic to ulsfo and eqiad, by host, during the switchover

Comment 4 Dan Andreescu 2014-07-23 20:53:28 UTC

Created attachment 16019 [details]
the daily traffic to ulsfo and eqiad, by datacenter, during the switchover

Comment 5 Dan Andreescu 2014-07-23 20:53:52 UTC

Created attachment 16020 [details]
the hourly traffic to ulsfo and eqiad, by host, during the switchover

Comment 6 Dan Andreescu 2014-07-23 20:54:13 UTC

Created attachment 16021 [details]
the hourly traffic to ulsfo and eqiad, by datacenter, during the switchover

Comment 7 Dan Andreescu 2014-07-23 20:54:51 UTC

Created attachment 16022 [details]
spreadsheet of daily data for ulsfo and eqiad with totals and graph

Comment 8 Dan Andreescu 2014-07-23 20:55:07 UTC

Created attachment 16023 [details]
spreadsheet of hourly data for ulsfo and eqiad with totals and graph

Comment 9 Jeff Gage 2014-07-24 03:45:30 UTC

Ok, by examining router interface statistics with LibreNMS, I have confirmed that when traffic from* ULSFO ceased, traffic from EQIAD increased by a similar amount.

 * Rather than looking at actual inbound web request traffic, I'm looking at the outbound responses because they should correlate and are much bigger.

I've provided URLs for reference; LibreNMS access may be requested by emailing access-requests@rt.wikimedia.org.

EQIAD:
    cr1-eqiad xe 5/3/1 (transit)
        https://librenms.wikimedia.org/graphs/to=1406129520/id=4515/type=port_bits/from=1404228720/
        +700 Mbps
    cr1-eqiad xe 4/3/2 (transit)
        https://librenms.wikimedia.org/graphs/to=1405179180/id=6821/type=port_bits/from=1404747180/
        +700 Mbps
    cr1-eqiad xe 4/3/1 (peering)
        https://librenms.wikimedia.org/graphs/to=1405154040/id=6820/type=port_bits/from=1404722040/
        +250 Mbps
    cr2-eqiad xe 5/3/1 (transit)
        https://librenms.wikimedia.org/graphs/to=1405159680/id=134/type=port_bits/from=1404727680/
        +1000 Mbps
    cr2-eqiad xe 5/3/3 (peering)
        https://librenms.wikimedia.org/graphs/to=1405159800/id=136/type=port_bits/from=1404727800/
        +1000 Mbps
ULSFO:
    cr1: 0/0/3 (transit)
        https://librenms.wikimedia.org/graphs/to=1405158120/id=7200/type=port_bits/from=1404726120/
        -1800 Mbps
    cr2: 0/0/2 (transit)
        https://librenms.wikimedia.org/graphs/to=1405158600/id=7139/type=port_bits/from=1404726600/
        -400 Mbps (maybe)
    cr2: 0/0/3 (peering)
        https://librenms.wikimedia.org/graphs/to=1405158480/id=7140/type=port_bits/from=1404726480/
        -1100 Mbps

Increase at EQIAD: roughly 3650 Mbps
Decrease at ULSFO: roughly 3300 Mbps

I had to visually estimate the values from the graphs, so this seems like acceptable equivalence.
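
For reference, the totals above can be re-derived from the per-interface estimates. The following is only a sketch of that arithmetic: the numbers are the visually estimated values listed above, and the variable names are made up for illustration.

```python
# Per-interface estimates in Mbps, copied from the list above.
eqiad_increase_mbps = {
    "cr1-eqiad xe 5/3/1 (transit)": 700,
    "cr1-eqiad xe 4/3/2 (transit)": 700,
    "cr1-eqiad xe 4/3/1 (peering)": 250,
    "cr2-eqiad xe 5/3/1 (transit)": 1000,
    "cr2-eqiad xe 5/3/3 (peering)": 1000,
}
ulsfo_decrease_mbps = {
    "cr1 0/0/3 (transit)": 1800,
    "cr2 0/0/2 (transit)": 400,  # marked "maybe" in the list above
    "cr2 0/0/3 (peering)": 1100,
}

eqiad_total = sum(eqiad_increase_mbps.values())
ulsfo_total = sum(ulsfo_decrease_mbps.values())
gap = eqiad_total - ulsfo_total
relative_gap = gap / eqiad_total  # gap as a fraction of the EQIAD increase

print(f"EQIAD +{eqiad_total} Mbps, ULSFO -{ulsfo_total} Mbps")
print(f"gap {gap} Mbps ({relative_gap:.1%} of the EQIAD increase)")
```

The gap comes to 350 Mbps, a little under 10% of the EQIAD increase, which is within what reading values off graphs by eye can be expected to give.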

In addition to the traffic math, I'm not aware of any user reports of service disruption, and our 3rd party monitoring reports 100% availability in all significant categories for that week. Therefore I have high confidence that traffic was successfully rerouted without loss during the migration.

Approximate timeline:
2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD
2014-07-09 17:00 UTC: I arrive at ULSFO
2014-07-09 17:30 UTC: ULSFO becomes unreachable
    [servers and routers are moved to a new room, everything is plugged back in]
2014-07-09 21:30 UTC: routers back online
2014-07-09 22:45 UTC: Mark restores traffic to ULSFO
2014-07-10 00:30 UTC: I leave ULSFO
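
The graph URLs above carry their time window in the from= and to= path segments. Assuming these are Unix epoch seconds (an assumption, not something stated in this report), a small sketch like the following decodes a window and checks that it covers the approximate timeline. The helper name graph_window is made up for illustration.

```python
import re
from datetime import datetime, timezone


def graph_window(url):
    """Return (start, end) in UTC from the from=/to= segments of a graph URL.

    Assumes both values are Unix epoch seconds.
    """
    parts = dict(re.findall(r"\b(from|to)=(\d+)", url))
    start = datetime.fromtimestamp(int(parts["from"]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(parts["to"]), tz=timezone.utc)
    return start, end


# URL of the ULSFO cr1 0/0/3 (transit) graph listed above.
url = (
    "https://librenms.wikimedia.org/graphs/"
    "to=1405158120/id=7200/type=port_bits/from=1404726120/"
)
start, end = graph_window(url)

# First and last entries of the approximate timeline above.
move_start = datetime(2014, 7, 9, 16, 0, tzinfo=timezone.utc)
move_end = datetime(2014, 7, 10, 0, 30, tzinfo=timezone.utc)
covers_move = start <= move_start and move_end <= end

print(start.isoformat(), end.isoformat(), covers_move)
```

Under that assumption, this graph spans five days around the move, so the whole timeline falls inside the plotted window.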

Comment 10 Dan Andreescu 2014-07-24 10:38:03 UTC

Awe-some.  Thank you so much Jeff

Comment 11 christian 2014-07-24 11:22:39 UTC

(In reply to Dan Andreescu from comment #2)
> Also found
> that only EQIAD hosts had their traffic increase abnormally, [...]

I had expected to see amssq47 (esams) being called out, as it picked
up traffic just as ULSFO's went down.

That's just a coincidence. Right?

(In reply to Jeff Gage from comment #9)
> Approximate timeline:
> 2014-07-09 16:00 UTC: Mark reroutes traffic to EQIAD

While I see that the timeline is labeled as “approximate”, we're
looking at numbers at hourly granularity ...

Looking at the graphs, they already take the deep downward dive ~4-5
hours earlier. This earlier time also nicely matches Mark's rerouting
commit [1], which is shown as merged on 2014-07-09 10:40 UTC in
gerrit.

> 2014-07-09 22:45 UTC: Mark restores traffic to ULSFO

While that might be right, it is reflected neither by the graphs nor
by the puppet repo.

Looking at the graphs, they start to rise only ~1-2 hours later. This
later time again nicely aligns with the puppet repo. There, Brandon's
(not Mark's) rerouting commits [2] are shown as merged
between 2014-07-10 00:38 and 2014-07-10 08:37.



[1] https://gerrit.wikimedia.org/r/#/c/144934/

[2] They are a series of commits between
  https://gerrit.wikimedia.org/r/#/c/145182/
  https://gerrit.wikimedia.org/r/#/c/145221/

Comment 12 Jeff Gage 2014-07-24 17:05:02 UTC

Hi,

We should definitely trust the commits over my information source, the Server Admin Log, whose events are entered manually: https://wikitech.wikimedia.org/wiki/SAL

22:42 mark: Enabling PAIX BGP sessions on cr2-ulsfo
22:40 mark: Enabling WMF HQ BGP sessions on cr1-ulsfo
22:38 mark: Enabling TiNet transit links on cr1-ulsfo
22:35 mark: Enabling WMF HQ BGP sessions on cr2-ulsfo
22:34 mark: Enabling NTT and HE transit links on cr2-ulsfo 

16:17 mark: ulsfo is now offline
16:16 mark: Shutdown NTT BGP sessions on cr2-ulsfo
16:13 mark: Shutdown TiNet BGP sessions on cr1-ulsfo
16:10 mark: Shutdown IXP BGP sessions on cr2-ulsfo
16:10 mark: Shutdown WMF HQ BGP sessions on cr2-ulsfo
16:09 mark: Shutdown WMF HQ BGP sessions on cr1-ulsfo 

From the patch we can see that all traffic directed away from ULSFO was sent to EQIAD. Therefore it does seem like any increased traffic to ESAMS would be coincidental. I'll ask Mark to comment on this.

Comment 13 Mark Bergsma 2014-07-25 11:10:18 UTC

Yeah, amssq47 had been used as a test server before, not receiving any traffic. Brandon reinstalled and put it back in production around that time, so that would explain it.

Comment 14 christian 2014-07-29 13:37:59 UTC

(In reply to Mark Bergsma from comment #13)
> [ Situation around amssq47 ]

Thanks for confirming.

------------------

Per host per hour packet loss numbers look good.
(The only host that sticks out there a bit is cp3013, a mobile esams
cache. But esams should not have seen changes from the ULSFO move, and
that host is not super-stable around packet loss anyway. Nothing
concerning, but it is on the border more often than not. As the total
volume of messages looks sound, and other parts of this host's log do
too, I am assuming it's a coincidence.)


Per host per hour total traffic numbers look wrong at first glance.
But during the ULSFO floor move, a semi-final of the 2014 FIFA World
Cup took place (Yay, coincidence!). This caused a traffic spike
during the ULSFO move, which makes the numbers look really skewed.
However, when limiting to various slices of non-soccer traffic, no
spike is visible.
For each individual non-soccer slice, data looks good.

Per host per hour urls look good.

Per host per hour per status code numbers look good.

Per host per hour referers look good.

From my point of view, log data is overall good.
