I’m having an issue where, after a Live Migration, VMs intermittently lose network connectivity for up to 90 seconds and then it returns without intervention. It doesn’t seem to matter which hosts are involved in the Live Migration, although moves from node 2 to node 3 seem to have issues less frequently.
4-node Hyper-V cluster running Windows Server 2012 R2 Datacenter on HP DL580 Gen8 servers, fully patched and with the recommended Hyper-V and Failover Clustering hotfixes applied. I’m aware of the known issue of intermittent connectivity with Broadcom 1 Gbps NICs, but believe it is resolved in the drivers we’re using (latest from Broadcom, 17.2.0.0). Additionally, VMQ is confirmed as disabled on all NICs (checked as per the sketch below).
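For anyone wanting to double-check the same thing, this is roughly how I verified the VMQ state on each node (a minimal PowerShell sketch; the wildcard advanced-property check is a belt-and-braces extra, since some Broadcom drivers also expose VMQ as an advanced setting):

```powershell
# Confirm VMQ is disabled on every physical NIC (run on each cluster node).
Get-NetAdapterVmq | Format-Table Name, InterfaceDescription, Enabled -AutoSize

# Some drivers also surface VMQ as an advanced property; check that too.
Get-NetAdapterAdvancedProperty -DisplayName '*VMQ*' -ErrorAction SilentlyContinue |
    Format-Table Name, DisplayName, DisplayValue -AutoSize
```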
Network setup on hosts:
HV_HM – single NIC, used by the host only. Production network subnet.
HV_CSV – 2 NICs teamed using LBFO, set to LACP teaming mode with Dynamic load balancing. Backend network on the 192.168.5.0/24 subnet.
HV_LM – 2 NICs teamed using LBFO, set to LACP teaming mode with Dynamic load balancing. Backend network on the 192.168.6.0/24 subnet.
Backup – single NIC bound to virtual switch BKUP-NET, with the management OS allowed to use it. Backend backup network on the 192.168.1.0/24 subnet.
vEthernet (BKUP-NET) – virtual NIC for backing up the host. Backend backup network on the 192.168.1.0/24 subnet.
VM – 2 NICs teamed using LBFO, set to Switch Independent teaming mode with Dynamic load balancing, bound to virtual switch VM-NET with the management OS not allowed to use it. Production network subnet. (A sketch for dumping this layout follows the list.)
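In case it helps to compare against your own setup, this is roughly how I dump the team and virtual switch layout on each host (a minimal sketch; the team and switch names match the list above, so adjust if yours differ):

```powershell
# List the LBFO teams, their members, and the vSwitch bindings on this host.
Get-NetLbfoTeam       | Format-Table Name, TeamingMode, LoadBalancingAlgorithm, Status -AutoSize
Get-NetLbfoTeamMember | Format-Table Team, Name, OperationalStatus -AutoSize
Get-VMSwitch          | Format-Table Name, NetAdapterInterfaceDescription, AllowManagementOS -AutoSize
```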
Connections to the backend networks are 1 Gbps into Cisco 3750 switches, which are managed by our own networking team; these are stacked. Connections to the Production network are 100 Mbps into Cisco 3750 switches managed by a 3rd party. I don’t know whether the two switches involved there are configured as a stack or not.
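The negotiated speeds are at least easy to sanity-check from the host side, which I have done (a quick sketch; per the wiring above, the backend NICs should report 1 Gbps and the Production-facing NICs 100 Mbps):

```powershell
# Confirm what each physical adapter actually negotiated with its switch port.
Get-NetAdapter | Sort-Object Name |
    Format-Table Name, InterfaceDescription, LinkSpeed, FullDuplex -AutoSize
```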
As for the cabling of the VM team members: on nodes 1 and 2, each team has one member on one Production switch and the other member on the second switch. On nodes 3 and 4, both VM team members are on the same switch.
If I move a VM that has a vNIC on the VM-NET virtual switch and a vNIC on the BKUP-NET virtual switch, with pings running to both of its IP addresses, the BKUP-NET IP responds consistently. The VM-NET IP is the one that drops after Live Migration: sometimes with request time-outs followed by "destination host unreachable", other times going straight to "destination host unreachable".
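To put numbers on the outage window, I run the two pings from a loop with timestamps, something like this (a rough sketch; the two IP addresses are placeholders for the VM's actual VM-NET and BKUP-NET addresses):

```powershell
# Ping both of the VM's addresses in turn, with timestamps, so the length
# of the VM-NET outage can be measured precisely.
$vmNetIp   = '10.0.0.50'      # placeholder: the VM's VM-NET address
$bkupNetIp = '192.168.1.50'   # placeholder: the VM's BKUP-NET address
while ($true) {
    foreach ($ip in $vmNetIp, $bkupNetIp) {
        $ok = Test-Connection -ComputerName $ip -Count 1 -Quiet
        '{0:HH:mm:ss.fff}  {1,-15}  {2}' -f (Get-Date), $ip, $(if ($ok) { 'reply' } else { 'LOST' })
    }
    Start-Sleep -Milliseconds 500
}
```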
If I run arp -a on the host before the move, I can see the MAC address for the VM. When the issue occurs after migration, the host's ARP cache still shows the MAC address while the pings are timing out; then, when the responses change to "destination host unreachable", the VM's IP and MAC address are no longer listed in the cache at all.
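To watch that happen in real time rather than running arp -a by hand, I use a small loop around Get-NetNeighbor (a sketch; $vmIp is a placeholder for the VM's VM-NET address):

```powershell
# Poll the host's neighbor (ARP) cache for the VM's entry once a second,
# mirroring the manual arp -a checks described above.
$vmIp = '10.0.0.50'   # placeholder: the VM's VM-NET address
while ($true) {
    $entry = Get-NetNeighbor -IPAddress $vmIp -ErrorAction SilentlyContinue
    $state = if ($entry) { '{0}  {1}' -f $entry.LinkLayerAddress, $entry.State } else { 'NOT IN CACHE' }
    '{0:HH:mm:ss.fff}  {1}' -f (Get-Date), $state
    Start-Sleep -Seconds 1
}
```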
The only differences I can think of between the failing VM-NET connections and the working BKUP-NET connections are the teaming mode of the NICs on the host, and potentially the port/switch configuration, which for the problematic VM-NET connections is a black box to me.
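For completeness, this is how I compare the teaming configuration across all four nodes in one pass (a sketch; the node names are placeholders for our actual cluster node names):

```powershell
# Pull teaming mode, load-balancing algorithm, and member NICs from every node.
$nodes = 'HVNODE1', 'HVNODE2', 'HVNODE3', 'HVNODE4'   # placeholder node names
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-NetLbfoTeam | Select-Object Name, TeamingMode, LoadBalancingAlgorithm,
        @{ n = 'Members'; e = { (Get-NetLbfoTeamMember -Team $_.Name).Name -join ', ' } }
} | Format-Table PSComputerName, Name, TeamingMode, LoadBalancingAlgorithm, Members -AutoSize
```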
My feeling is that it could be a switch configuration issue, but I’m not sure what settings to ask the 3rd party to check. Has anybody seen a similar issue, or have any thoughts on how to progress, please?