Ensuring Seamless Connectivity - The Crucial Role of Failover Testing in AWS Direct Connect
👋 Hey there!
Setting up the Direct Connect service is reserved for a select few. Typically, the network dudes handle this intricate task. However, understanding this service is crucial, especially when establishing hybrid cloud connectivity.
What is Direct Connect
Direct Connect, at its core, is like plugging your on-premise network into a physical port in your AWS account—a bit like connecting your home to a power grid. We work our magic to link this to our VPC, and voila! Our users can access our cloud endpoints.
Now, there is a little more to it than that. Direct Connect has a concept of locations. Locations host the AWS physical devices that we need to plug into. Generally, you'll also need some carrier in the middle of all this to get to your on-premise equipment.
The diagram below represents this at a high level, removing the finer details of the network configuration.
Now we have some background, let's get to the point of this post.
Best Practices
When establishing any service on AWS, testing failover to validate the workings of your architecture is key to successful availability in the event of failure.
Failover testing practices like this are covered in the Reliability Pillar of the AWS Well-Architected Framework.
“Everything fails, all the time” is a famous quote from Amazon's Chief Technology Officer Werner Vogels, and he's right of course.
If we take the above scenario, we now have our own on-premise router, a carrier, and AWS physical location equipment to deal with. That stuff is going to fail, so if we actually care about keeping our users happy, we need to duplicate all of it and test that when one path fails, everything still works!
NOTE:- We should really be using multiple Direct Connect locations, on-premise locations, and carriers, but hopefully, you get the point.
But how do we test it? Thankfully, AWS thought of that and created the AWS Direct Connect Resiliency Toolkit, which allows you to perform a Failover Test.
Failover Test
Initially, this test had to be conducted through the CLI. However, in a recent update, AWS has made it possible to initiate the test directly from the console.
Navigate to the Direct Connect Console.
Viewing our Virtual Interfaces, you can see we have two connections, as per the above diagram.
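If you prefer a script to the console, the same view can be pulled with boto3. This is a minimal sketch, not the only way to do it: `summarize_virtual_interfaces` is a helper name I made up, and it takes the Direct Connect client as an argument so it can be exercised without AWS credentials. The only AWS call used is `describe_virtual_interfaces`.

```python
def summarize_virtual_interfaces(client):
    """Return (id, state, [BGP statuses]) for each virtual interface.

    `client` is expected to be a boto3 Direct Connect client, e.g.
    boto3.client("directconnect"). The helper name is hypothetical.
    """
    response = client.describe_virtual_interfaces()
    summary = []
    for vif in response.get("virtualInterfaces", []):
        # Each virtual interface can carry more than one BGP peer.
        bgp_statuses = [
            peer.get("bgpStatus", "unknown") for peer in vif.get("bgpPeers", [])
        ]
        summary.append(
            (vif["virtualInterfaceId"], vif.get("virtualInterfaceState"), bgp_statuses)
        )
    return summary
```

With credentials configured, calling `summarize_virtual_interfaces(boto3.client("directconnect"))` should give you one tuple per virtual interface.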
Select one of the Virtual Interfaces, and in the Actions drop-down, choose Bring down BGP.
As this can cause disruption for our users, we need to select a duration and confirm that we want to proceed.
You can select a value between 1 and 4320 minutes (3 days), which is quite nice if you have some carrier maintenance coming up.
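For anyone who wants to script this step, here is a rough boto3 equivalent. Treat it as a sketch under a few assumptions: `start_failover_test` is a hypothetical wrapper of mine, the 1 to 4320 minute bounds are simply the ones quoted above, and the optional `bgpPeers` parameter of `start_bgp_failover_test` is left out.

```python
MIN_TEST_MINUTES = 1
MAX_TEST_MINUTES = 4320  # 3 days, the upper bound quoted in this post


def start_failover_test(client, virtual_interface_id, duration_minutes):
    """Start a BGP failover test and return the test description.

    `client` is expected to be boto3.client("directconnect").
    The wrapper name and the local range check are illustrative.
    """
    if not MIN_TEST_MINUTES <= duration_minutes <= MAX_TEST_MINUTES:
        raise ValueError(
            f"duration must be between {MIN_TEST_MINUTES} and "
            f"{MAX_TEST_MINUTES} minutes, got {duration_minutes}"
        )
    response = client.start_bgp_failover_test(
        virtualInterfaceId=virtual_interface_id,
        testDurationInMinutes=duration_minutes,
    )
    return response["virtualInterfaceTest"]
```

Checking the duration locally means a typo fails fast before anything touches a live link.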
NOTE:- You have the option to cancel a test once it has been started. This feature is designed to provide you with a safety net in case you need to revert your actions.
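The cancel option has an API counterpart too, `stop_bgp_failover_test`. A minimal sketch, assuming a boto3 Direct Connect client is passed in; `cancel_failover_test` is a hypothetical helper name.

```python
def cancel_failover_test(client, virtual_interface_id):
    """Stop a running BGP failover test and return its reported status.

    `client` is expected to be boto3.client("directconnect").
    """
    response = client.stop_bgp_failover_test(virtualInterfaceId=virtual_interface_id)
    return response["virtualInterfaceTest"]["status"]
```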
After hitting Confirm, you will see the virtual interface in the testing state.
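While one interface is down, the thing we really care about is whether the other path is still up. The sketch below is one way to check that from a script: `surviving_paths` is a made-up helper that uses `describe_virtual_interfaces` and returns every other virtual interface that still reports at least one BGP peer as up. It only looks at BGP status, so it is not proof that traffic is actually flowing.

```python
def surviving_paths(client, vif_under_test):
    """Return IDs of other virtual interfaces with at least one BGP peer up.

    `client` is expected to be boto3.client("directconnect").
    `vif_under_test` is the interface whose BGP session was brought down.
    """
    response = client.describe_virtual_interfaces()
    healthy = []
    for vif in response.get("virtualInterfaces", []):
        if vif["virtualInterfaceId"] == vif_under_test:
            continue  # skip the interface we deliberately took down
        if any(peer.get("bgpStatus") == "up" for peer in vif.get("bgpPeers", [])):
            healthy.append(vif["virtualInterfaceId"])
    return healthy
```

An empty list during a test would be a strong hint that the redundant path is not doing its job.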
Another nice little feature is that you can see a history of all the tests that have taken place.
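The history is also available through the API via `list_virtual_interface_test_history`. Here is a small sketch that follows `nextToken` until all pages are collected; `failover_test_history` is a hypothetical helper name, and the client is again passed in rather than created inside the function.

```python
def failover_test_history(client, virtual_interface_id):
    """Collect every recorded failover test for one virtual interface.

    `client` is expected to be boto3.client("directconnect").
    """
    tests = []
    kwargs = {"virtualInterfaceId": virtual_interface_id}
    while True:
        page = client.list_virtual_interface_test_history(**kwargs)
        tests.extend(page.get("virtualInterfaceTestHistory", []))
        token = page.get("nextToken")
        if not token:
            return tests
        # Ask for the next page on the following loop iteration.
        kwargs["nextToken"] = token
```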
And that is it, easy peasy!
Summary
In this post, we touched on how to test the failover and redundancy of your Direct Connect links. To further enhance this, we could set up a CloudWatch Synthetics canary to test an endpoint located in our on-premise network and validate that the target is reachable while a test is running. At a bare minimum, we would have several tests to run to ensure the network returns to a usable state as soon as possible.
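To give a feel for what such a check might do, here is a very small reachability probe. It is only the core logic, not a full canary: a real canary would wrap something like this in its own runtime and handler, and the host and port you probe are entirely up to you. `is_reachable` is a name I picked for illustration, and it uses nothing beyond the Python standard library.

```python
import socket


def is_reachable(host, port, timeout_seconds=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running this against an on-premise endpoint before, during, and after a failover test gives a simple yes or no answer on whether users would have noticed.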
As mentioned in the AWS Well-Architected Framework above, testing the reliability of all of our services is key to aligning with SLAs and keeping our users happy.
I hope this helps someone else.
Cheers