SQL Server Availability Groups are not the solution!
Farmington, United States - May 17, 2024 / SQL Tailor /
That’s right. Availability groups are fairly easy to set up and can add flexibility and recovery capabilities to your SQL Server database environment, but they are not your solution. They can be your disaster without proper planning, evaluation, and implementation. Let’s take a look at some of the components required for availability groups (AGs) and see where they typically fail.
Failover Clustering
For availability groups to work, the Failover Clustering feature must be enabled and a failover cluster created. Failover clustering comes with its own list of requirements, and a few common mistakes come up again and again:
Domain Permission & Connectivity
The Listener object that gets created as part of the AG requires the Cluster Name Object (CNO) to have certain permissions within the domain. Specifically, the CNO needs to be able to create and manage computer objects within the Active Directory organizational unit (OU) it lives in. Without this, it cannot properly manage the cluster resources that handle failover. Your domain admin needs to grant the CNO the “Create Computer objects” permission on the OU where the cluster resides.
Sometimes group policies will revoke this access in some way, and you will see messages in the cluster log indicating issues with it. Do not ignore those messages; they will cause some severe headaches the next time you need to fail over.
In addition to permissions, the connectivity between the clustered computers and the Domain Controller (DC) must be robust, and the DC must be responsive. If either node has trouble communicating with the domain controller, or an object it expects (like a Service Principal Name, or SPN) exists on one DC but not another, those inconsistencies can cause failover or ownership-resolution errors that make the AG appear down even when all of the servers are otherwise healthy. And if the individual nodes of the cluster have trouble reaching each other, the unresponsive node or nodes can be “evicted” from the cluster. Any landlord can tell you that eviction can sometimes be necessary, but it’s never a good experience for anyone involved.
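As a quick, illustrative sanity check (not a substitute for the cluster validation wizard), a short script along these lines can confirm that each node can reach its peers and a domain controller on the ports that matter. The host names below are placeholders, and port 5022 is only the common default for the AG endpoint.

import socket

# Hypothetical node and DC names -- replace with your environment's values.
endpoints = [
    ("node1.contoso.local", 5022),  # AG endpoint (5022 is the common default)
    ("node2.contoso.local", 5022),
    ("dc1.contoso.local", 389),     # LDAP on a domain controller
]

for host, port in endpoints:
    try:
        # Attempt a plain TCP connection with a short timeout.
        with socket.create_connection((host, port), timeout=3):
            print(f"OK   {host}:{port}")
    except OSError as err:
        print(f"FAIL {host}:{port} -> {err}")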
Storage Assignment
At some point during the cluster install, you will be presented with a seemingly innocuous check box – defaulted to the “checked” status – giving you the choice to assign all available storage to the cluster. Resist!!! You must not agree to such tyranny on behalf of the cluster! Assigning the storage to the cluster puts you on the path to a SQL Failover Cluster Instance (FCI), not an AG. With an FCI, the storage is shared among the cluster nodes; with an AG, each node has its own independent storage. Storage that has been assigned to the cluster will not be available for the AG to use, and you’ll end up unwinding it later, or possibly just reinstalling clustering on that particular node.
IP Address Registration and Time to Live
If you’re setting up a cluster that spans multiple subnets, your Listener will need more than one IP address assigned to it. If it does, you then have a choice in how you’d like those IP addresses registered in DNS. If your application uses a modern client driver (recent ADO.NET, ODBC, or JDBC drivers – SQL Server Management Studio qualifies, for example), you can choose to set the connection parameter “MultiSubnetFailover=True” and allow DNS to register all of the IP addresses for the listener. This parameter tells the driver to try each IP address simultaneously and connect to the one that responds. Without it, the application may attempt the wrong IP address first and experience a connection timeout, causing frustrating issues for the end users.
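To make that concrete, here is a minimal sketch of a multi-subnet-aware connection using Python with pyodbc and the Microsoft ODBC Driver for SQL Server. The listener name, database, driver version, and security options are placeholders for whatever your environment actually uses.

import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:aglistener.contoso.local,1433;"   # connect to the AG listener, not a node
    "Database=SalesDB;"
    "MultiSubnetFailover=Yes;"                    # try all registered listener IPs in parallel
    "Encrypt=Yes;TrustServerCertificate=Yes;"     # adjust to match your TLS setup
    "Trusted_Connection=Yes;"
)

with pyodbc.connect(conn_str, timeout=15) as conn:
    row = conn.execute("SELECT @@SERVERNAME;").fetchone()
    print("Connected to:", row[0])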
If your application does not support this parameter, then you need a different approach: configure the listener’s clustered network name so that DNS registers only the currently active IP address (the RegisterAllProvidersIP setting) rather than all of them. If you do this, it’s recommended to adjust the listener’s “Time to Live” (the HostRecordTTL setting) as well. Your end-user computers and your application servers don’t re-fetch the IP address each time they connect – that would be inefficient and put an absurd load on your DNS servers (which are often your DCs). Instead, they cache the values locally, refreshing them when the entry’s TTL timer expires. If you experience a failover and the TTL is left at the default for a clustered network name – 1,200 seconds, or 20 minutes – then clients will either have to wait for the cached entry to expire or flush their DNS cache manually before they get the correct address and can reconnect. A shorter TTL helps prevent this issue.
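If you want to see what TTL clients are actually being handed, a tiny lookup script is enough. The sketch below uses the third-party dnspython package and a placeholder listener name; the TTL itself is configured on the clustered network name resource, not in a script like this.

import dns.resolver   # third-party package: pip install dnspython

# Resolve the listener name and report the addresses and TTL that clients will cache.
answer = dns.resolver.resolve("aglistener.contoso.local", "A")
for record in answer:
    print("IP:", record.address)
print("TTL (seconds):", answer.rrset.ttl)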
Managing the AG using cluster tools
Using Failover Cluster Manager, you can initiate failovers, set dependencies, and do all sorts of other things that will be disastrous for your AG. Unless you’ve explored all options available for managing the AG via SQL Server Management Studio (or the equivalent T-SQL) and you have a lot of clustering experience, you should not adjust parameters or initiate failover of the AG using cluster management tools. The few exceptions where you must use cluster management tools are setting node weights, adjusting lease timeout values, and configuring quorum.
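For a planned failover, the AG-native route is the T-SQL statement the SSMS failover wizard scripts out, run while connected to the synchronized secondary you want to promote. Below is a rough sketch of issuing it from Python with pyodbc; the AG name and server name are placeholders.

import pyodbc

# Connect to the secondary replica that should become the new primary.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=node2.contoso.local;"
    "Trusted_Connection=Yes;Encrypt=Yes;TrustServerCertificate=Yes;",
    autocommit=True,   # don't wrap the failover DDL in a user transaction
)
conn.execute("ALTER AVAILABILITY GROUP [MyAg] FAILOVER;")   # planned failover, no data loss
conn.close()
print("Failover command issued.")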
Virtualization
Anti-affinity rules
Virtualization is a fantastic technology that allows tremendous resource flexibility; however, some care needs to be taken when running a high-availability technology in a virtualized environment. One of the biggest concerns is that two or more nodes should not share the same host machine. If they do, a failure of that host likely means a complete failure of your availability group, because more than one node fails simultaneously. Windows Failover Clustering is fairly robust and intelligent these days, but unless you’ve built an atypically large cluster, losing more than one node in the same instant will cause the entire cluster to disable itself, and when it does, it takes your Listener connection down with it, which is going to cause heartache. Anti-affinity rules define a set of virtual servers that are not allowed to share the same host and will prevent an admin or a dynamic resource scheduler from placing them together. While we’re talking about that…
Dynamic Resource Scheduling
One of the great features that virtualization offers is identifying when one host machine is very busy and another has resources to spare, then rebalancing so the workload is more evenly distributed. You must make sure your AG nodes are excluded from this process. While most hypervisor platforms will insist that the time required to dynamically relocate a guest machine is within the tolerance required by AGs, the reality is that there are often complicating factors, and moving a node can cause an outage. AG nodes should only be moved to different host machines in a planned fashion, with the involvement of a DBA who can pause and drain the cluster node and make sure that the migrating node has no demands on it.
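As a rough sketch of what that planned sequence can look like, the snippet below drives the Windows FailoverClusters PowerShell cmdlets from Python. The node name is a placeholder, it assumes you are running on a cluster node with the module installed and administrative rights, and the migration step in the middle is whatever your hypervisor platform requires.

import subprocess

def ps(command: str) -> None:
    """Run a PowerShell command and raise if it returns a non-zero exit code."""
    subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

node = "node2"                                   # hypothetical node being relocated
ps(f"Suspend-ClusterNode -Name {node} -Drain")   # pause the node and drain its roles
# ... the hypervisor admin migrates the guest to its new host here ...
ps(f"Resume-ClusterNode -Name {node}")           # bring the node back into rotation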
The SQL Servers
Workload capability
When you set up an AG, additional CPU thread resources are required to keep all the nodes in sync. If your workload is such that the server runs hot, or occasionally peaks at CPU over 95%, you’re putting yourself in the danger zone. If the AG nodes can’t communicate with each other, they will lose the ability to synchronize. If they pass certain thresholds, they will begin to get confused about who is supposed to be the boss (aka the primary node), and while they’re figuring it out, your end users will be shouting (or at least thinking) expletives and making assumptions about how closely related your ancestors may have been to lower primates. If your workload runs hot, you have three options: increase your resources, tune the workload, or limit the resources available to the offending applications. Of the three, tuning the workload is typically the least expensive, most effective, and quickest option, unless you have a lot of spare resources and cash lying around.
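If you want a quick read on whether synchronization is keeping up under load, the AG DMVs expose the send and redo queues. Here is a rough sketch of checking them from Python with pyodbc; the server name is a placeholder and the column list is trimmed to a few commonly watched values.

import pyodbc

# A few commonly watched columns from the AG database replica state DMV.
query = """
SELECT  DB_NAME(drs.database_id)        AS database_name,
        drs.synchronization_state_desc,
        drs.synchronization_health_desc,
        drs.log_send_queue_size,        -- KB of log waiting to ship to secondaries
        drs.redo_queue_size             -- KB of log waiting to be redone on a secondary
FROM    sys.dm_hadr_database_replica_states AS drs;
"""

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=aglistener.contoso.local;"
    "Trusted_Connection=Yes;Encrypt=Yes;TrustServerCertificate=Yes;"
)

with pyodbc.connect(conn_str) as conn:
    for row in conn.execute(query):
        print(row.database_name, row.synchronization_state_desc,
              row.synchronization_health_desc,
              row.log_send_queue_size, row.redo_queue_size)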
Humans
“We have met the enemy and he is us” – Walt Kelly. Often we are our own worst enemy. Whether it’s a developer who pushed bad code to production, a sysadmin implementing a new firewall rule, or a DBA rebooting a server at the wrong time, the results can be catastrophic. It’s important to follow proper procedures, including code review and Change Advisory Board review. Even then, mistakes will be made, because, after all, we ARE human.
Minimize the impact of mistakes by following the principle of least privilege, doing regular security audits, and having robust and properly configured monitoring in place.
Not the solution, just one piece of it.
Availability groups are wonderful tools when configured properly. But by themselves, they are not a solution. Get the above details right, and they can be one component of a successful solution implementation.
If you need help making sure your AG is part of a solution and not part of a problem, we’re just a click away: schedule an evaluation meeting with us today via Schedule a Call.

Contact Information:
SQL Tailor
33505 State Street Suite 201A
Farmington, MI 48335
United States
Joe Flemming
(248) 919-8086
https://sqltailor.com
Original Source: https://sqltailor.com/sql-server-availability-groups-are-not-the-solution/