In a current customer project, my job is to redesign and migrate an existing Citrix ADC environment. The responsible network administrator told me that they have been having issues with HA failover for a long time. The symptoms looked like this:
- HA Failover is initiated on NS#1
- NS#2 becomes the active node
- 3-15 minutes later the NetScaler Gateway VIPs and Load Balancing VIPs become unreachable
- A reboot of NS#2 brings back NS#1 as the primary node and the VIPs are accessible again
To investigate the problem and find its root cause, we need to understand what is going on when an HA failover happens. As soon as NS#2 takes over, two GARP (Gratuitous ARP) packets are sent out on all connected Ethernet interfaces. This is necessary because the connected network devices (switches, routers) need to update their ARP tables with the “new” MAC addresses of NS#2. Without a successful GARP, the ARP tables are not updated and still point to the old MAC addresses of NS#1.
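To make the GARP mechanism more concrete, here is a minimal Python sketch that builds the raw bytes of a gratuitous ARP frame. It is an illustration only: the MAC address and VIP are made-up example values, the function names are hypothetical, and the request-style opcode is an assumption, since the exact frame format the NetScaler sends is not covered here. What makes the ARP packet "gratuitous" is that the sender IP and the target IP are the same address and the frame goes to the Ethernet broadcast address.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6  # Ethernet broadcast destination


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_garp_frame(sender_mac: str, vip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `vip`.

    Assumption: opcode 1 (request) with an all-zero target MAC. Some
    implementations send a gratuitous ARP reply (opcode 2) instead.
    """
    smac = mac_to_bytes(sender_mac)
    ip = socket.inet_aton(vip)

    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    eth_header = BROADCAST_MAC + smac + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request)
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC/IP. Sender IP == target IP.
    arp_body = smac + ip + (b"\x00" * 6) + ip

    return eth_header + arp_header + arp_body


# Example values only: the MAC of the new primary node and one of its VIPs.
frame = build_garp_frame("02:00:00:aa:bb:02", "192.0.2.10")
print(len(frame), frame.hex())
```

A switch or router that receives such a frame can update its entry for 192.0.2.10 to the sender MAC. If the network drops or ignores the frame, the old entry stays in place, which matches the symptoms described above.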
So what do you need to take care of when deploying a NetScaler on a Cisco ACI infrastructure? There are two parameters which need to be configured to solve the issue.
1.) GARP-based EP Move Detection Mode
Enable the “GARP based detection” feature under “L3 Configuration” on the related bridge domain.
2.) Endpoint Dataplane Learning
There is a relatively new Citrix KB article which recommends disabling the endpoint dataplane learning feature when using Cisco ACI. The setting can be found under “General” on the bridge domain. If you do not disable it, network traffic could still be delivered to the old primary node (now the secondary node) after a failover.
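If you prefer to apply both bridge domain settings through the APIC REST API instead of the GUI, the request could look roughly like the following Python sketch. Treat it as an assumption-laden illustration: the attribute names `epMoveDetectMode` and `ipLearning` on the `fvBD` class are my understanding of the ACI object model and should be verified against your APIC version, and the host, tenant and bridge domain names are placeholders. The sketch only builds the URL and JSON body; authentication and the actual POST are left out.

```python
import json


def bridge_domain_url(apic_host: str, tenant: str, bd_name: str) -> str:
    """Managed-object URL of a bridge domain (all names are placeholders)."""
    return f"https://{apic_host}/api/mo/uni/tn-{tenant}/BD-{bd_name}.json"


def bridge_domain_payload(bd_name: str) -> dict:
    """JSON body that sets both options on the bridge domain.

    Assumed attribute names (verify on your APIC):
      epMoveDetectMode = "garp" -> GARP-based EP move detection
      ipLearning       = "no"   -> endpoint dataplane learning disabled
    """
    return {
        "fvBD": {
            "attributes": {
                "name": bd_name,
                "epMoveDetectMode": "garp",
                "ipLearning": "no",
            }
        }
    }


# Hypothetical names for illustration only.
url = bridge_domain_url("apic.example.com", "Customer", "BD_NetScaler")
body = json.dumps(bridge_domain_payload("BD_NetScaler"), indent=2)
print(url)
print(body)
```

The printed body can be sent as a POST to the printed URL with an authenticated session. Compare the result in the GUI afterwards to confirm that both options really changed.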
After applying these two changes, the NetScaler failover works flawlessly, and the Load Balancing and NetScaler Gateway vServers remain accessible if the primary node fails.
Happy Failover 🙂