Post-Incident Review: Config Plane Outage Sept 2025

Incident Overview

Event: Configuration Plane Disabled

Severity: High (2)

Customer Impact: The Configuration Plane was disabled. CDN traffic was not affected.

Duration: 2 hours, 45 minutes (September 8th, 15:45 UTC - September 8th, 18:30 UTC)

Status: Resolved

Impact Details

We are a bootstrapped infrastructure startup. We have no funding and currently operate on a very tight monthly budget. We rely on Cloud Infrastructure providers' Startup Programs for the credits we need to run certain parts of our infrastructure, including:

  • Configuration Plane Servers
  • Firewall Configuration API
  • Health Monitoring

This incident caused half of our Configuration Plane Servers to be terminated immediately and permanently deleted. CDN traffic was not affected. Failover is normally straightforward, but the incident also terminated other utility servers, including a bastion host that was used to trigger failover processes manually.

Root Cause Analysis

The service degradation resulted from a Cloud Provider promising and then denying Startup Program credits to our account.

  1. Early Support Interaction: Over a month ago, we contacted our Account Manager and Cloud Support to ask about our credit usage and options to upgrade to a higher tier. Support told us that additional credits would be deposited once our current credits were depleted. We were reassured that we would receive the additional credits and were told to reference our case number if there were any issues.

"Please be informed that once the remianing credits gets exhausted, you will receive the next set of credits on your account. Please be assured, if you face any issues or incur charges you can reachout to us. We are available 24/7 at your service"

  2. Credits Ran Out: After receiving a bill for last month's usage, we immediately verified that we had received no additional credits and attempted to contact support. Since it was a weekend, we had to wait until the following Monday, September 8th, to reach them. We were told that we would receive no additional credits and that the original support agent was incorrect.
  3. Terminate Account: We were told that, to avoid being charged for usage after our credits ran out, we must immediately terminate all resources and projects in our account.

As expected, terminating the account deleted a number of servers that are part of our Configuration Plane infrastructure. However, a bastion host used to access our internal network was also deleted unexpectedly. This bastion host held the only copy of the SSH private key required to initiate the Configuration Plane failover process.

Resolution Process

We:

  1. identified the affected resources that would need to be migrated to other cloud vendors and performed migrations manually,
  2. recovered access to perform failover operations,
  3. and failed the Configuration Plane system over to two controllers instead of four.

These actions restored all network services.

Moving Forward

Since this incident occurred, we have greatly increased the resiliency of our Configuration Plane infrastructure. All internal services now have automated failover mechanisms and secondary/backup servers running separately at other cloud providers.
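To illustrate the general idea of failing over to backup servers at other providers, here is a minimal sketch in Python. It is not our production code: the `Controller` type, the `select_active` function, the provider labels, and the injected `is_healthy` check are all hypothetical names chosen for this example. The sketch only shows one simple selection rule: keep healthy primary controllers, and fill the remaining slots with healthy backups.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Controller:
    """Hypothetical description of one Configuration Plane controller."""
    name: str        # illustrative identifier, e.g. "cp-1"
    provider: str    # illustrative cloud provider label
    primary: bool    # True for the normal serving set, False for backups


def select_active(
    controllers: List[Controller],
    is_healthy: Callable[[Controller], bool],
    wanted: int,
) -> List[Controller]:
    """Return up to `wanted` healthy controllers, primaries first.

    `is_healthy` is injected by the caller (assumption: in practice it
    would wrap a real health probe), so this function does no I/O itself.
    """
    healthy = [c for c in controllers if is_healthy(c)]
    # sorted() is stable, so healthy primaries keep their relative order
    # and are placed ahead of healthy backups.
    ordered = sorted(healthy, key=lambda c: not c.primary)
    return ordered[:wanted]
```

With four primaries split across two providers and two backups at a third, losing one provider entirely would leave two primaries and promote both backups, so four controllers stay active. If no backups existed, the same call would return only the two surviving primaries, which mirrors the two-controller state we ran in during this incident.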

Additionally, we've confirmed that the other cloud providers we use have already provided us enough credits to run our infrastructure for over six months.

Support

If you have any ongoing issues, please create a ticket at https://support.skip2.net/