Like us, you may have received a rather confusing email from AWS titled:
Important information about NAT gateway in your account.
The content looks something like this:
Hello,
We have observed that your Amazon VPC resources are using a shared NAT Gateway across multiple Availability Zones (AZ). To ensure high availability and minimize inter-AZ data transfer costs, we recommend utilizing separate NAT Gateways in each AZ and routing traffic locally within the same AZ.
Each NAT Gateway operates within a designated AZ and is built with redundancy in that zone only. As a result, if the NAT Gateway or AZ experiences failure, resources utilizing that NAT Gateway in other AZ(s) also get impacted. Additionally, routing traffic from one AZ to a NAT Gateway in a different AZ incurs additional inter-AZ data transfer charges. We recommend choosing a maintenance window for architecture changes in your Amazon VPC.
The following is a list of your VPCs and NAT Gateways that are shared across AZ(s), in the format: 'VPC | NAT Gateway':
In this post I'll explain what that email actually means and then show you how to use Overmind to work out whether you need to worry about it.
Explaining the Email
'We have observed that your Amazon VPC resources are using a shared NAT Gateway across multiple Availability Zones.'
Which means
You’ve got a NAT gateway in one availability zone
But the stuff that uses it is in many availability zones
This means that if the availability zone that the NAT gateway is in fails, nothing will be able to talk to the internet (unless it has a public IP), even if it’s not actually in the availability zone that failed.
The recommended solution is to instead use a NAT gateway in each availability zone, meaning that a failure in one AZ won’t affect others. It also means that you won’t be paying for cross-AZ traffic that isn’t required (~$0.02 per GB). However, you will be paying for two new NAT Gateways (~$0.05/hour each).
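To make that trade-off concrete, here's a rough back-of-the-envelope sketch using the approximate prices above (these are illustrative figures, not official quotes; check current AWS pricing for your region, and note NAT gateways also bill per GB processed):

```python
# Rough break-even sketch for the per-AZ NAT gateway trade-off.
# Prices are the approximate figures mentioned above, not official quotes.
CROSS_AZ_PER_GB = 0.02   # ~$0.02 per GB of inter-AZ traffic
NAT_GW_PER_HOUR = 0.05   # ~$0.05 per hour per NAT gateway
HOURS_PER_MONTH = 730

# Monthly cost of running two additional NAT gateways
extra_gateways_monthly = 2 * NAT_GW_PER_HOUR * HOURS_PER_MONTH

# Monthly cross-AZ traffic (GB) at which the extra gateways pay for themselves
break_even_gb = extra_gateways_monthly / CROSS_AZ_PER_GB

print(f"Extra gateways: ${extra_gateways_monthly:.2f}/month")
print(f"Break-even: {break_even_gb:.0f} GB of cross-AZ traffic per month")
```

In other words, at these prices the extra gateways only save money if you're pushing several terabytes of cross-AZ traffic a month; below that, the availability benefit is what you're really paying for.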
Given this complexity, you really need to understand your workloads to determine the best/cheapest/easiest solution for you.
In this post I’ll show you how to do that using Overmind.
Using Overmind
We can start by searching for the NAT gateway from the email. We can see that it's in eu-west-2a, but not much more. What we need to do is work out what depends on the gateway, and therefore what would lose internet access if this gateway, or the whole availability zone, were to fail.
You can do this easily by double-clicking the NAT gateway to show related resources.
We now have some more linked resources, including a network interface, a VPC and some IPs. What we care about is the route table and the subnets associated with it. The route table controls how traffic is routed out of those subnets, meaning anything in them is going to use that NAT gateway.
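If you'd rather verify this from the command line, a small sketch like the following can pull the same answer out of `aws ec2 describe-route-tables` output: it finds the subnets whose default route (0.0.0.0/0) points at a given NAT gateway. The IDs in the sample data are made up for illustration.

```python
def subnets_using_nat_gateway(route_tables: dict, nat_gw_id: str) -> list[str]:
    """Return subnet IDs whose route table sends 0.0.0.0/0 to the given NAT gateway.

    `route_tables` is the parsed JSON from `aws ec2 describe-route-tables`.
    """
    subnets = []
    for table in route_tables["RouteTables"]:
        routes_to_gw = any(
            route.get("NatGatewayId") == nat_gw_id
            and route.get("DestinationCidrBlock") == "0.0.0.0/0"
            for route in table.get("Routes", [])
        )
        if routes_to_gw:
            for assoc in table.get("Associations", []):
                if "SubnetId" in assoc:
                    subnets.append(assoc["SubnetId"])
    return subnets

# Sample data in the shape the AWS CLI returns (IDs are hypothetical):
sample = {
    "RouteTables": [
        {
            "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"}],
            "Associations": [{"SubnetId": "subnet-aaa"}, {"SubnetId": "subnet-bbb"}],
        },
        {
            "Routes": [{"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0def"}],
            "Associations": [{"SubnetId": "subnet-ccc"}],
        },
    ]
}
print(subnets_using_nat_gateway(sample, "nat-0abc"))  # ['subnet-aaa', 'subnet-bbb']
```

Note how the second route table is ignored: its default route goes to an internet gateway, so its subnet doesn't depend on our NAT gateway at all.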
If we expand all these subnets we can get a full picture of everything that is in each of these subnets and therefore might be affected. The next step is to go through each type of item and work out how it would use our NAT Gateway.
Load Balancers
One of the results is an elb-load-balancer, which is in three availability zones (as I’ve shown in the drawing). Since this is a load balancer, its job is to take incoming traffic and split it between some services on the backend. This means it’s not going to be reaching out to the internet, so it won't care if the NAT gateway goes down.
RDS
We can also see two rds-db-subnet-groups: one called dogfood and the other gatewaydb. As these contain databases, they're unlikely to be reaching out to the internet, meaning that our NAT gateway going down is unlikely to affect them.
EKS & EC2
The last thing we haven't looked at is the EKS cluster named dogfood. Being a Kubernetes cluster, it is likely to be talking to the internet so it can pull down Docker images to start up pods. The pods themselves could also be talking to the internet.
We do need to check which AZs this cluster is actually located in. To do this we can expand each of the node groups related to it, then the autoscaling groups related to them, which gives us the ec2-instances that actually run the cluster.
We can see that these instances are both in eu-west-2c, which means it is not actually a highly available cluster; however, the instances will almost certainly use the NAT gateway to access the internet.
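The same placement check can be scripted against `aws ec2 describe-instances` output; a sketch like this groups instance IDs by availability zone so a single-AZ cluster stands out immediately. Again, the instance IDs in the sample are hypothetical.

```python
def instance_azs(reservations: dict) -> dict[str, list[str]]:
    """Group instance IDs by availability zone.

    `reservations` is the parsed JSON from `aws ec2 describe-instances`.
    """
    by_az: dict[str, list[str]] = {}
    for reservation in reservations["Reservations"]:
        for instance in reservation["Instances"]:
            az = instance["Placement"]["AvailabilityZone"]
            by_az.setdefault(az, []).append(instance["InstanceId"])
    return by_az

# Illustrative data matching what we saw above (IDs are made up):
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "Placement": {"AvailabilityZone": "eu-west-2c"}},
            {"InstanceId": "i-0bbb", "Placement": {"AvailabilityZone": "eu-west-2c"}},
        ]}
    ]
}
print(instance_azs(sample))  # {'eu-west-2c': ['i-0aaa', 'i-0bbb']}
```

A result with a single key, as here, is exactly the "not actually highly available" situation described above: every node sits in one zone.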
The problem
We have discovered that there isn’t actually anything related to this NAT gateway that is configured to be highly available. Worse, the two instances that actually use it are located in eu-west-2c, while the gateway is in eu-west-2a, which is definitely a problem. It means that:
- Our instances are going to be talking across availability zones to get to the internet which incurs additional bandwidth costs
- If eu-west-2a goes down, we'll lose internet access and things will stop working
- If eu-west-2c goes down, these instances will go down and things will stop working
We have two points of failure for no reason.
Solution (saving money)
The simplest and cheapest solution for our situation would be to shift our NAT gateway into eu-west-2c, meaning everything would be in the same availability zone. If that zone goes down, everything goes down, but it would've gone down anyway. It will also save us money, because these instances will no longer be talking across availability zones.
The solution will of course be different for everyone and depends entirely on your workload, which is why we need to make our infrastructure easier to navigate, so you can come to these conclusions more quickly rather than wrestling with heaps of tabs in the AWS console.
Find out for yourself… for free
This is just an example of how you can use Overmind to get a better understanding of what you have in your AWS account. We didn’t need to put anything in, or have any previous knowledge; all of this data was discovered automatically.
Overmind is now available to try for free. Get started by signing up and creating an account here.