- Published on
Monitoring a Content Delivery Network
- Authors
- Name
- Alex Lee
- @alexjoelee
What makes a CDN special?
CDNs, or Content Delivery Networks, are specialized networks of servers built to deliver web content quickly and securely, reducing load on application servers and protecting them from DDoS attacks. Our network needs to be highly available, accessible via IPv4 and IPv6, and able to respond to requests in less than 50ms from 30 locations around the world.
Finding the Right Tool
When we first started, we used UptimeKuma for all endpoint monitoring, pushing alerts up to ZenDuty for incident management and notifications. UptimeKuma is excellent and we will continue to use it, but it had some limitations working with third-party APIs and ran into performance issues as we scaled up to lots of endpoints.
Introducing... Snitch
Snitch is an internal microservice that we built to monitor and keep track of the health of our network. Snitch sends an HTTP and an HTTPS request to both the IPv4 and IPv6 endpoints of every server in our network every thirty seconds or so (we're still tuning this). If a request fails to receive a status 200 response containing the word "healthy", Snitch automatically reroutes traffic away from that server by changing our DNS records via API.
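To make the check concrete, here is a minimal sketch of the pass/fail logic described above. It is not Snitch's actual code: the `Server` type, the `/health` path, the function names, and the rule that all four probes must pass are assumptions made for illustration, and the DNS update itself is left out because it depends on the DNS provider's API.

```python
import http.client
import re
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # Hypothetical record for one edge server.
    name: str
    ipv4: str
    ipv6: str


def probe_urls(server, path="/health"):
    """Return the four URLs to probe: HTTP and HTTPS over IPv4 and IPv6."""
    hosts = [server.ipv4, f"[{server.ipv6}]"]  # IPv6 literals need brackets in URLs
    return [f"{scheme}://{host}{path}" for host in hosts for scheme in ("http", "https")]


def is_healthy(status, body):
    """A probe passes only on status 200 with the whole word "healthy" in the body."""
    # The word boundary keeps "unhealthy" from counting as a pass.
    return status == 200 and re.search(r"\bhealthy\b", body) is not None


def fetch(url, timeout=5.0):
    """Fetch a URL and return (status, body); network failures become (0, "")."""
    # Note: HTTPS against a bare IP will normally fail certificate checks, so a
    # real probe would also need to present the site's hostname.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return 0, ""


def check_server(server, fetcher=fetch):
    """Assumption: a server counts as healthy only if all four probes pass."""
    return all(is_healthy(*fetcher(url)) for url in probe_urls(server))
```

A scheduler would call `check_server` for every server on each interval and pull the DNS records of any server that returns `False`. Passing the fetcher in as a parameter keeps the decision logic testable without touching the network.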
Extending Snitch
We have several Configuration Management servers around the world whose purpose is to perform near-realtime configuration synchronization with our edge servers. This way, when a customer updates their website, the change is propagated quickly around the world. Sometimes these servers fail. Snitch knows when one misses its healthcheck and automatically reassigns the orders of the failed configuration management server to another one in a different region.
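The failover step could look something like the sketch below. It assumes that "orders" are work items tracked per configuration management server, that each server has a known region, and that the replacement is the least-loaded healthy server outside the failed server's region. The function name, the data shapes, and the least-loaded rule are all illustrative assumptions, not a description of how Snitch is implemented.

```python
def reassign_orders(orders, regions, healthy, failed):
    """Move the failed server's orders to a healthy server in another region.

    orders:  dict of server name -> list of order ids (hypothetical shape)
    regions: dict of server name -> region name
    healthy: set of server names that passed their last healthcheck
    failed:  name of the server that missed its healthcheck
    Returns (target, new_orders) and leaves the input dict unmodified.
    """
    candidates = [
        name
        for name in sorted(healthy)
        if name != failed and regions[name] != regions[failed]
    ]
    if not candidates:
        raise RuntimeError(f"no healthy server outside region {regions[failed]!r}")

    # Assumption: prefer the candidate that currently holds the fewest orders.
    target = min(candidates, key=lambda name: len(orders.get(name, [])))

    new_orders = {name: list(items) for name, items in orders.items()}
    moved = new_orders.pop(failed, [])
    new_orders.setdefault(target, []).extend(moved)
    return target, new_orders
```

For example, with `cm-fra` failing in region `eu`, a healthy `cm-ams` in the same region is skipped and the orders go to whichever healthy server in another region has the shortest list.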
Manual Control
Sometimes you have to enter the danger zone and start performing some manual actions, like taking certain servers offline for upgrades. Snitch makes it even easier to manage our network and traffic, reducing errors and increasing resiliency and control. We've been able to implement a rolling upgrade system to ensure customers won't even see a latency increase during routine updates or maintenance.
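A rolling upgrade of this kind can be sketched as a loop that drains a server, upgrades it, confirms it is healthy, and only then sends traffic back. The callbacks (`drain`, `upgrade`, `is_healthy`, `restore`) are hypothetical stand-ins for whatever Snitch actually does, and stopping at the first failed healthcheck is an assumption about the desired behavior.

```python
def rolling_upgrade(servers, drain, upgrade, is_healthy, restore, batch_size=1):
    """Upgrade servers a batch at a time so most of the network keeps serving.

    Each callback takes a server name. If an upgraded server fails its
    healthcheck, the rollout stops and that batch stays drained for a human
    to inspect. Returns the list of servers that were upgraded and restored.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    completed = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]

        for name in batch:
            drain(name)    # e.g. remove the server's DNS records
        for name in batch:
            upgrade(name)  # run the actual maintenance

        unhealthy = [name for name in batch if not is_healthy(name)]
        if unhealthy:
            raise RuntimeError(f"upgrade failed healthcheck on: {unhealthy}")

        for name in batch:
            restore(name)  # put the server back into rotation
        completed.extend(batch)
    return completed
```

Keeping `batch_size` small relative to the number of servers in a region is what lets the remaining servers absorb the traffic while a batch is offline.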