Azure and HA

Network lessons learned from Azure:

Recently, I was asked to design an entire multi-region infrastructure in Azure. This includes firewalling our sensitive data, load balancing our normal and high-priority traffic, and migrating workloads that don’t play well in the cloud, all while keeping an eye on costs.

Azure is not really set up to behave like your traditional data center. It does some things very well, but if you think about it as a traditional data center, you’re going to have a bad time. You have to completely rethink things like MTTR (Mean Time to Repair), SLAs (Service Level Agreements), uptime, and availability.

The three biggest challenges in moving to this cloud environment were not the technical implementation challenges, like learning PowerShell, ARM templates/JSON, and scale-out and scale-up techniques. The three biggest hurdles for a traditional on-prem data center guy like me were:

1) Network Performance
2) Routing
3) Availability

1) Network Performance – There are no network performance guarantees inside the virtual network (what you would think of as the LAN). If you want more network performance, you generally need a larger VM size, since the limit is set per VM and the first link below notes that adding interfaces does not raise it. Even then you get UP TO the published speeds, with no guarantee. Please let me know if you can find SLAs on network performance, as I’ve searched and not found any. There ARE 10 Gbps circuits with ExpressRoute, which is really cool, but that doesn’t help one VM talking to another VM in the same subnet. The backplane speed of the underlying Hyper-V servers is not published that I’ve seen. I’ve been given the wink and the nod from Microsoft that they are overprovisioned, but there are no guarantees here. Also, you need to validate that whatever applications/workloads you are putting in Azure can use high-performance storage. Otherwise, your Storage Blob may only be able to transfer 100 Mbps of data, which isn’t good for large data transfers.
Here are two links with more details. The first document discusses how networking performance works in Azure. The second link shows the limits for each VM size.
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-machine-network-throughput
https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
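
To put the 100 Mbps figure in perspective, here is a small back-of-the-envelope sketch in Python. It is only arithmetic, not a measurement of Azure: the function name and the 1 TB payload are my own illustrative assumptions, and it ignores protocol overhead, so real transfers would be slower.

```python
def transfer_time_seconds(size_bytes: float, link_mbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_mbps megabits per second.

    Assumes the link is fully used and ignores protocol overhead.
    """
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)


ONE_TB = 10**12  # one decimal terabyte in bytes (illustrative payload size)

for label, mbps in [("100 Mbps", 100), ("1 Gbps", 1_000), ("10 Gbps", 10_000)]:
    hours = transfer_time_seconds(ONE_TB, mbps) / 3600
    print(f"{label:>8}: {hours:6.2f} hours per TB")
```

At 100 Mbps, a single terabyte takes most of a day even in the ideal case, which is why it is worth validating storage and VM throughput before a large migration.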

2) Routing – One of the main paradigm shifts when using Azure is that, by default, all traffic follows the shortest path. This sounds great, right? Well, not really. There are times when you want to traffic-engineer your data; for example, you may want to send traffic through your firewall for segmentation. The only way to accomplish this is with User Defined Routes (route tables). Any time you want traffic to go through a virtual appliance, you must create a static route in Azure. While you can use BGP, it is not well suited to being an IGP, and OSPF, EIGRP, and IS-IS are not supported. Also, any time you have an appliance and set a route on it, you MUST set its next hop to the Azure gateway (the first usable IP address in your subnet). The reason is that Azure is responsible for routing between all subnets, and even if you have infrastructure in separate subnets, Azure will make sure they can reach each other. This is great, because it means there will be few times your VMs cannot talk to each other, but it can be a major shift in thinking for the average data center/network engineer who is used to very granular control over traffic engineering.
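
To make the effect of a User Defined Route concrete, here is a minimal Python sketch of how a route gets picked for a destination. It assumes the selection rules as I understand them from the Azure documentation: the longest matching prefix wins, and when prefixes are identical a user-defined route beats a BGP route, which beats a system route. The `Route` class, the `select_route` function, and all the addresses are hypothetical; this is a model for reasoning, not an Azure API.

```python
import ipaddress
from dataclasses import dataclass

# Lower number wins when two routes share the same prefix (assumed ordering).
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str    # CIDR block, e.g. "10.0.2.0/24"
    next_hop: str  # free-form label for this sketch
    source: str    # "user", "bgp", or "system"


def select_route(routes, destination):
    """Return the matching route with the longest prefix, then best source."""
    dest = ipaddress.ip_address(destination)
    candidates = [r for r in routes if dest in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (
            -ipaddress.ip_network(r.prefix).prefixlen,
            SOURCE_PRIORITY[r.source],
        ),
    )


routes = [
    Route("10.0.0.0/16", "VNet local", "system"),
    Route("0.0.0.0/0", "Internet", "system"),
    # Hypothetical UDR that steers one subnet through a firewall appliance.
    Route("10.0.2.0/24", "firewall 10.0.1.4", "user"),
]

print(select_route(routes, "10.0.2.10").next_hop)  # firewall 10.0.1.4
print(select_route(routes, "10.0.3.10").next_hop)  # VNet local
print(select_route(routes, "8.8.8.8").next_hop)    # Internet
```

Only the subnet covered by the more specific user route is diverted to the firewall. Everything else keeps following the system routes, which is why each flow you want to steer needs its own static entry.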

3) Availability – Azure publishes a set of SLAs for its resources, which you can find here:
https://azure.microsoft.com/en-us/support/legal/sla/

You’ll notice you don’t really see any resources with five 9s (99.999%) of uptime, which is a common standard for high-priority applications. Because of this, you’ll need to really think about which of your applications and services require this level of uptime and plan accordingly. If you have a website that brings in a million dollars an hour in revenue, then of course 99.95% uptime is not enough! That 0.05% of downtime works out to roughly 4.4 hours a year, which could cost you more than four million dollars! The good news is that Azure provides many different ways to increase your availability and uptime. Between load balancing, Traffic Manager, and availability sets/zones, you can still get to 99.999% uptime. In a future blog post, we’ll go more into how to achieve this level of uptime, but the main takeaway is that you should know what SLAs you’ll be getting before you move your application to the cloud.
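
The downtime arithmetic, and the reason redundancy helps, can be checked with a few lines of Python. This is a rough sketch under a big assumption: that the redundant copies fail independently and that whatever sits in front of them (a load balancer or Traffic Manager) never fails. Neither is strictly true in practice. The function names are my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(availability: float, copies: int) -> float:
    """Availability when any one of `copies` independent deployments is enough.

    Assumes failures are independent, which real deployments only approximate.
    """
    return 1.0 - (1.0 - availability) ** copies


single = 0.9995  # a 99.95% SLA
paired = parallel_availability(single, 2)

print(f"single copy: {downtime_hours_per_year(single):.2f} hours down per year")
print(f"two copies:  {downtime_hours_per_year(paired) * 60:.2f} minutes down per year")
print(f"five 9s met with two copies: {paired >= 0.99999}")
```

Under those assumptions, two independent 99.95% deployments clear the five 9s bar on paper. The real number depends on how independent the copies actually are and on the SLA of whatever routes traffic between them.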

Enough of the negatives. These are just gotchas from my road to getting our infrastructure built out. In my next blog post, I’ll explain why all of this doesn’t matter and Azure still rocks 🙂

About Lanny Ballard

Lanny Ballard is a double CCIE in Routing & Switching and Collaboration with 25 years of IT experience. He also likes Cloud, Data Center, and Security technologies. In his free time, he likes board games, movies, and spending time with his family.
