This report provides information regarding the downtime and packet loss that occurred on April 13, 2021.
An upstream carrier (Psychz) suffered an outage in rack rows CGXX, which included around 30 cabinets. The outage caused problems in their Layer 2 aggregation and required them to repair or rebuild the relationships between these chassis. Doing so required rebooting route engines, which ultimately affected us.
Swiftnode houses no hardware in rack rows CGXX at Carrier 1, so our racks were not affected by the outage itself, but they were affected by the route engine reboots. Around 8PM EST we began observing a substantial amount of packet loss across our 18.104.22.168/24 subnet, which houses most of our Dallas customers as well as the controllers for our TOR switches. Initially we were told that the outage did not affect our row of racks because we weren't in CGXX. We began trying to address the packet loss on our side by pulling all customers experiencing loss out of upstream diversion. We then scheduled a firmware release for our TOR switches and pushed it live at 4:20AM EST on 4/14/2021.
Neither the firmware upgrade nor the upstream diversion removals resolved the packet loss. We then escalated our ingress loss ticket to the engineers at Psychz. Around 5:00AM EST, we had them withdraw all announcements and then re-announce the IP blocks. This ultimately resolved the issue. The underlying cause is assumed to be the unexpected reboot of the route engines after the CGXX row failure.
As of 5:00AM EST 4/14/2021, this issue has been internally marked as resolved but remains under monitoring. We have observed no loss over these hops for 24 hours and have since closed the escalation. We will begin diverting customers back over the weekend. In the meantime, if you experience any issues, please reach out to our support team at [email protected].
Wednesday, April 14, 2021