August 5, 2011
StarWind has implemented a throttled sync option in the recently released version 5.7 of its HA iSCSI SAN software, so you can continue to use the storage while a sync is taking place on less than nuclear-powered systems. Sync speeds have also been hugely improved over 5.6 by adding multiple iSCSI sessions on the sync channel; at least, I’m noticing a big difference on the hardware we’re using. The console now has performance monitoring capability and has had a number of minor interface bugs removed. Overall, it works double-awesome.
By collating information I got from the StarWind forums with lab testing, I have documented a procedure for our engineers to follow with clusters. It covers recovery from a both-SAN-nodes-down situation, which happens when a power outage exceeds the UPS capacity, and it gives the cluster the best chance of recovering gracefully from a power cut that does not exceed the capacity of the UPSs.
The timings I specify below obviously depend on your UPS capacity and other loads. To control the reboot timing and monitoring of the APC UPSs I’ve used APCUPSD, which is awesome, and free! Get it here: http://www.apcupsd.org/ .
The outline process I’ve used, based on our requirement to avoid recreating iSCSI targets in StarWind, is this:
- Set the Hyper-V hosts to respond to the UPSs: migrate the servers, then power down one side of the Hyper-V cluster, once the UPS has eaten 20% of the initial battery power.
- Set the second Hyper-V host to shut down all VMs and power off once 40% of battery power is gone.
- Configure both Hyper-V hosts to power on within 15 minutes of power returning.
- Configure the secondary half of the SAN to shut down when 50% of battery power is gone.
- Configure the primary half of the SAN to shut down when 90% of battery power is gone.
- Configure both SAN nodes to power on within 10 minutes of power returning.
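To sanity-check a schedule like the one above before touching any UPS settings, a rough Python sketch can help. The role names are labels I made up for illustration, and the conversion to "percent remaining" assumes the shutdown trigger is something like apcupsd's BATTERYLEVEL directive, which is expressed as remaining charge; which setting was actually used here is not recorded, so treat that part as an assumption.

```python
# Battery thresholds from the list above, as percent of battery CONSUMED.
# Role names are hypothetical labels for this sketch, not real hostnames.
SHUTDOWN_AT_CONSUMED = {
    "hyperv-host-1": 20,   # migrate VMs away, then power down
    "hyperv-host-2": 40,   # shut down all VMs, then power off
    "san-secondary": 50,
    "san-primary": 90,
}


def remaining_threshold(consumed_pct):
    """Convert 'percent consumed' into 'percent remaining'.

    Assumption: the UPS software triggers on remaining charge.
    """
    if not 0 <= consumed_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    return 100 - consumed_pct


def shutdown_order(thresholds):
    """Return role names in the order they shut down as the battery drains."""
    return sorted(thresholds, key=thresholds.get)


def san_outlives_hosts(thresholds):
    """Check that both SAN nodes go down last, with the primary last of all."""
    order = shutdown_order(thresholds)
    return order[-2:] == ["san-secondary", "san-primary"]


if __name__ == "__main__":
    for role in shutdown_order(SHUTDOWN_AT_CONSUMED):
        consumed = SHUTDOWN_AT_CONSUMED[role]
        print(f"{role}: shut down at {remaining_threshold(consumed)}% remaining")
    print("SAN outlives hosts:", san_outlives_hosts(SHUTDOWN_AT_CONSUMED))
```

The check in `san_outlives_hosts` is the point of the ordering: the Hyper-V hosts stop writing to the disk images well before either SAN node goes down.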
For us this means recovery will be pretty much automatic unless the power cut lasts more than an hour. As long as one side of the SAN remains online for the duration, the other side will only need to do a fast sync, as there will have been no changes to the disk images given that the Hyper-V cluster has been off since near the start of the outage.
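The reasoning here boils down to how far the battery drained before power came back. A minimal sketch, using the 50% and 90% SAN thresholds from the list above; the function name and the three labels it returns are my own invention for illustration:

```python
# SAN shutdown thresholds from the procedure, as percent of battery consumed.
SAN_SECONDARY_OFF = 50
SAN_PRIMARY_OFF = 90


def recovery_mode(peak_consumed_pct):
    """Classify the recovery needed from the deepest battery drain reached.

    Returns one of three hypothetical labels:
      "none"      - both SAN nodes stayed up, nothing to resync
      "fast-sync" - only the secondary went down, primary still holds the data
      "manual"    - both nodes went down, targets need manual recovery
    """
    if not 0 <= peak_consumed_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if peak_consumed_pct >= SAN_PRIMARY_OFF:
        return "manual"
    if peak_consumed_pct >= SAN_SECONDARY_OFF:
        return "fast-sync"
    return "none"


if __name__ == "__main__":
    for drained in (30, 60, 95):
        print(f"{drained}% consumed -> {recovery_mode(drained)}")
```

The fast sync case relies on the Hyper-V hosts having powered off at 20% and 40%, so nothing wrote to the images while the secondary was down.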
In the event of a total outage causing both sides of the SAN to go offline, a manual recovery of the StarWind targets is necessary. Given both sides are the same, we simply delete the targets, recreate them with their existing IQNs, names, settings and existing images, and tell the disks not to sync upon creation. Then boot the cluster. Provided your record keeping is good and the process is followed, you should be back online in as long as it takes you to click through and create the targets.
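Since the manual recovery depends entirely on good records, here is one possible way to keep them, sketched in Python. This does not talk to StarWind at all; it only stores the details you would need to retype in the console. The field names, the example IQN, the image path and the settings key are all hypothetical placeholders.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TargetRecord:
    """What you need on paper to recreate one target by hand."""
    name: str
    iqn: str
    image_path: str
    settings: dict = field(default_factory=dict)


def records_to_json(records):
    """Serialise the records so they can be stored somewhere off the SAN."""
    return json.dumps([asdict(r) for r in records], indent=2)


def records_from_json(text):
    """Rebuild the records from previously saved JSON text."""
    return [TargetRecord(**item) for item in json.loads(text)]


def duplicate_iqns(records):
    """Return any IQN listed more than once, a likely record-keeping error."""
    seen, dupes = set(), set()
    for record in records:
        if record.iqn in seen:
            dupes.add(record.iqn)
        seen.add(record.iqn)
    return sorted(dupes)


if __name__ == "__main__":
    # Placeholder values only; substitute your own targets.
    records = [
        TargetRecord(
            name="csv1",
            iqn="iqn.2011-08.com.example:san1-csv1",
            image_path="D:\\images\\csv1.img",
            settings={"sync_on_create": False},
        ),
    ]
    print(records_to_json(records))
    print("Duplicate IQNs:", duplicate_iqns(records))
```

Keep the saved output somewhere that does not live on the SAN itself, otherwise it goes down with the thing it is meant to help you rebuild.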