We recently set up a Hyper-V failover cluster using two Dell R610 servers and a direct-attached (SAS) Dell MD3200 storage array. The servers are identical: Xeon E5630 processors, 64 GB of RAM. The hosts run Hyper-V Server 2016 (so we are heavily PowerShell dependent). Each server has two network cards, one integrated and one add-on. Four NICs (two from each card) are teamed for production network traffic, and one NIC on each server's integrated card is dedicated to Live Migration traffic.
The Live Migration subnet is on its own physically separate 1 Gb network hardware (we are not 10 Gb capable). We are running four production VMs: DC, app server, backup server, and file server. We also have some test VMs.
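Since Hyper-V Server 2016 has no local GUI, we do most of our verification from PowerShell. A rough sketch of how we confirm Live Migration is enabled on the hosts and that the cluster can see the dedicated 1 Gb migration network (read-only checks, nothing here changes configuration):

    # Hyper-V level: confirm Live Migration is enabled on each host
    Get-VMHost | Select-Object VirtualMachineMigrationEnabled,
        VirtualMachineMigrationAuthenticationType,
        MaximumVirtualMachineMigrations

    # Cluster level: list cluster networks so we can confirm the dedicated
    # Live Migration subnet is present with the role we expect
    Get-ClusterNetwork | Select-Object Name, Role, Address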
When we first set the system up, Live Migration ran perfectly. We started having issues once we put load on the VMs. The exact issue is that when we attempt Live Migration of the production servers, they fail at 3% with a timeout error message (see
below). Every now and then the DC, app server, and backup server will live migrate, but never the file server (which has a 3 TB VHD). The inconsistency is killing us. The test servers, with no real load, migrate fine.
We have plenty of available hardware resources, and we have even dropped the specs of the VMs to the lowest possible (1 GB RAM and 1 virtual processor), but they still time out. We use Failover Cluster Manager and Hyper-V Manager to administer the cluster; we are not using SCVMM.
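For completeness, we can reproduce the failure from PowerShell rather than the GUI. A minimal sketch, assuming the destination node is named PRGHYPERV2 (a placeholder for our second host):

    # Check current placement and assigned resources of the problem VM
    Get-ClusterGroup -Name "PRGFILESHARE"
    Get-VM -Name "PRGFILESHARE" | Select-Object State, ProcessorCount, MemoryAssigned

    # Attempt the live migration to the other node; this is where we see the
    # failure at 3% with the timeout error on the loaded production VMs
    Move-ClusterVirtualMachineRole -Name "PRGFILESHARE" -Node "PRGHYPERV2" -MigrationType Live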
Error Message:
Live migration of 'Virtual Machine PRGFILESHARE' failed.
Virtual machine migration operation for 'PRGFILESHARE' failed at migration source 'PRGHYPERV1'. (Virtual machine ID F6F0B8CA-D100-4A7C-8115-AC09FC47125A)
Planned virtual machine creation failed for virtual machine 'PRGFILESHARE': This operation returned because the timeout period expired. (0x800705B4).
(Virtual Machine ID F6F0B8CA-D100-4A7C-8115-AC09FC47125A).
Failed to receive data for a Virtual Machine migration: This operation returned because the timeout period expired. (0x800705B4).
RESOLUTION - Delete Checkpoints
It appears that the checkpoints we were taking daily on this server (and others) were causing the live migration issues. We had a script that takes a checkpoint (formerly called a snapshot) every morning between 7 and 8 AM and keeps it for 5 days. During our troubleshooting
we deleted all of the existing checkpoints, took a fresh one, and ran the live migration on the server that wasn't working. It worked perfectly. For servers that were migrating successfully but slowly, this trick increased their live migration
speed dramatically. While we're happy that everything works, we don't understand why the checkpoints were causing the issue. We never tried to migrate while a checkpoint was being created, and the checkpoint files were stored on the Cluster Shared Volume, not on
the host servers. If our understanding of live migration is correct, only the CPU and RAM state are copied over. I have read that a shallow copy of the VM is copied over, but I don't see how a static checkpoint file would factor in.
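For reference, a sketch of the PowerShell we used for the cleanup described above: delete all existing checkpoints, wait for the disk merge to finish, take a single fresh checkpoint, and retry the migration (the destination node name PRGHYPERV2 is again a placeholder):

    # Delete every existing checkpoint; Hyper-V merges the .avhdx differencing
    # disks back into the parent VHD in the background
    Get-VMSnapshot -VMName "PRGFILESHARE" | Remove-VMSnapshot

    # Wait until the VM's disks point back at the parent VHD (no more .avhd/.avhdx)
    while (Get-VMHardDiskDrive -VMName "PRGFILESHARE" |
           Where-Object { $_.Path -like "*.avhd*" }) {
        Start-Sleep -Seconds 30
    }

    # Take one fresh checkpoint, then retry the live migration that was failing
    Checkpoint-VM -Name "PRGFILESHARE" -SnapshotName "Fresh-$(Get-Date -Format yyyy-MM-dd)"
    Move-ClusterVirtualMachineRole -Name "PRGFILESHARE" -Node "PRGHYPERV2" -MigrationType Live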