ccolonbackslash

Tag Archives: starwind

StarWind iSCSI SAN 5.7 with Hyper-V R2 and recovery from both-san-nodes-down situation.

StarWind have implemented a throttled sync option in the recently released 5.7 version of their HA iSCSI SAN software, so you can continue to use the storage while a sync is taking place on less than nuclear-powered systems. Sync speeds have also been hugely improved over 5.6 by adding multiple iSCSI sessions on the sync channel; at least, I’m noticing a big difference on the hardware we’re using. The console now has performance monitoring and has had a number of minor interface bugs removed. Overall, it works double-awesome.

By collating some information I got from the StarWind forums with my own lab testing, I have documented a procedure for our engineers to follow in order to recover a cluster from a both-SAN-nodes-down situation, which is what you get when a power outage exceeds the UPS capacity. The same procedure also gives the cluster the best chance of recovering gracefully from a power cut that stays within the capacity of the UPSs.

The timings I specify below obviously depend on your UPS capacity and other loads. To control the reboot timing and monitoring of the APC UPSs I’ve used APCUPSD, which is awesome, and free! Get it here: http://www.apcupsd.org/ .

The outline process I’ve used, based on our requirement to avoid recreating iSCSI targets in StarWind, is this:

  1. Set the Hyper-V hosts to respond to the UPSs: migrate servers, then power down one side of the Hyper-V cluster once the UPS has eaten 20% of the initial battery power.
  2. Set the second Hyper-V host to shut down all VMs and power off once 40% of battery power is gone.
  3. Configure both Hyper-V hosts to power on within 15 minutes of power returning.
  4. Configure the secondary half of the SAN to shut down when 50% of battery power is gone.
  5. Configure the primary half of the SAN to shut down when 90% of battery power is gone.
  6. Configure both SAN nodes to power on within 10 minutes of power returning.
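
To make the ordering in the list above easier to check, here is a minimal Python sketch of the schedule as plain data. It is only a model for reasoning about the thresholds: the node names and function names are made up for illustration, and it does not talk to APCUPSD or to any UPS. The percentages are the "battery consumed" figures from the list; if your UPS tooling expresses thresholds as remaining charge instead, the helper at the end does the conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownStep:
    consumed_pct: int  # percent of battery used when this step should fire
    node: str          # hypothetical node name, not from any real config
    action: str


# Thresholds copied from the numbered list; everything else is illustrative.
SCHEDULE = [
    ShutdownStep(20, "hyperv-host-1", "migrate VMs away, then power down"),
    ShutdownStep(40, "hyperv-host-2", "shut down all VMs, then power off"),
    ShutdownStep(50, "san-secondary", "shut down"),
    ShutdownStep(90, "san-primary", "shut down"),
]

# Power-on delays in minutes after mains power returns (steps 3 and 6).
POWER_ON_DELAY_MIN = {"san": 10, "hyperv": 15}


def steps_due(consumed_pct):
    """Return every step whose threshold has been reached, in firing order."""
    ordered = sorted(SCHEDULE, key=lambda step: step.consumed_pct)
    return [step for step in ordered if consumed_pct >= step.consumed_pct]


def remaining_pct(step):
    """Convert 'percent consumed' into 'percent remaining' for one step."""
    return 100 - step.consumed_pct


if __name__ == "__main__":
    for step in steps_due(45):
        print(f"{step.consumed_pct}% used: {step.node}: {step.action}")
```

The point of the ordering is that both Hyper-V hosts are off well before either SAN node goes down, and the SAN nodes are given the shorter power-on delay so the storage is powered back on ahead of the hosts.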

For us this means recovery will be pretty much automatic unless the power cut lasts more than an hour. As long as one side of the SAN remains online for the duration, the other side will only need to do a fast sync, as there will have been no changes to the disk images given the Hyper-V cluster has been off since near the start of the outage.

In the event of a total outage causing both sides of the SAN to go offline, a manual recovery of the StarWind targets is necessary. Given both sides are the same, we simply delete the targets, recreate them with their existing IQNs, names, settings and existing images, and tell the disks not to sync upon creation. Then boot the cluster. Provided your record keeping is good and the process is followed, you should be back online in as long as it takes you to click through and create the targets.
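
Since the manual recovery depends entirely on good record keeping, here is a rough Python sketch of one way to keep those records. Everything in it is an assumption for illustration: the field names, the example IQN and the image path are hypothetical, and it does not use any StarWind API. It simply stores what you would need to retype in the console and flags records with blanks in them.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TargetRecord:
    node: str        # which SAN node this target lives on
    iqn: str         # the existing IQN to reuse when recreating the target
    name: str        # target name as shown in the console
    image_path: str  # path to the existing disk image
    settings: dict = field(default_factory=dict)  # anything else you set


def missing_fields(record):
    """List the required fields that were left empty in a record."""
    data = asdict(record)
    return [key for key, value in data.items() if key != "settings" and not value]


def save_records(records):
    """Serialise records to JSON text that can be printed or filed away."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


def load_records(text):
    """Rebuild the records from JSON text produced by save_records."""
    return [TargetRecord(**item) for item in json.loads(text)]


if __name__ == "__main__":
    # Hypothetical example values, not taken from a real installation.
    record = TargetRecord(
        node="san-primary",
        iqn="iqn.2008-08.com.example:san-primary-target1",
        name="target1",
        image_path=r"D:\images\target1.img",
    )
    print(save_records([record]))
    print(missing_fields(record))
```

Keep a copy somewhere that does not live on the SAN itself, otherwise it is not much use when both nodes are down.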

VM cluster using HA StarWind iSCSI storage and you don’t have NASA-style data centre redundancy? Don’t bother yet. Major problems.

I’ll leave this in place, but should mention that the concerns I have below are no longer valid as of version 5.7. See my follow-up post here: StarWind 5.7/5.6 and recovery from both nodes down.

__________________________________

Seems StarWind consider HA to be slightly more exclusive than their site and marketing blurb let on.

I understand that true HA means never, ever off, but even investment banks have occasional power-downs, just to prove they can start systems up again afterwards. Beware though: if you ever (and I mean EVER) want to contemplate turning your clustered storage off for a period of time due to a building power cut/act of god/whatever, for now, pick another solution.

It works great if one node is up full time, which I suppose is possible if you are NASA, but it’s good practice for all organizations to do an occasional power-off, and every so often, even in London, you have a long power outage or there is building maintenance.

Essentially, the issue is this: if you power down both nodes of a storage cluster gracefully after powering down your Hyper-V/Xen/VMware cluster, you will not be able to get them up again without MANUALLY specifying the most recent copy of the data (a major issue if you get this wrong and are running any DB app), then sitting through a FULL synchronisation. 200 GB took almost 12 hours in my test environment, during which the cluster was inaccessible as the storage was not accepting incoming connections. In production this would mean your supposedly HA environment would be offline until the storage had done a pointless full sync between nodes.
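
To put that full sync figure in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes decimal units (1 GB = 1000 MB) and treats "almost 12 hours" as a flat 12, so the rate it gives is a slight underestimate for my test environment. It says nothing about what your own hardware will manage.

```python
def sync_rate_mb_per_s(size_gb, hours):
    """Average throughput in MB/s for a sync of size_gb taking the given hours."""
    return size_gb * 1000 / (hours * 3600)


def full_sync_hours(size_gb, rate_mb_per_s):
    """Hours a full sync of size_gb would take at a constant rate."""
    return size_gb * 1000 / rate_mb_per_s / 3600


if __name__ == "__main__":
    rate = sync_rate_mb_per_s(200, 12)  # figures from my test environment
    print(f"average rate: {rate:.2f} MB/s")
    # Assumption: the same rate holds for a larger image, which it may not.
    print(f"1 TB at that rate: {full_sync_hours(1000, rate):.0f} hours")
```

That works out to under 5 MB/s on average, so a terabyte of images at the same rate would mean days, not hours, of downtime.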

I checked out the StarWind forum, where they claim this is by design. This is totally ridiculous. There are degrees of HA, and it’s not often a midsize company can afford separate power supply companies at either end of the building, which seems to be where most people lose out. For example, we planned to have redundant hosts, redundant storage units and redundant switches, all on redundant UPSs, but we only have one provider supplying electricity. To totally eliminate the viability of this platform by not implementing a last-write flag on the storage is insane.

Essentially this means a great product is ruined for a large number of its users. A real shame. There is a workaround, outlined in the forum thread linked below, but it’s risky and involves judging for yourself which replica is most current, deleting the targets, recreating them and then recreating ALL iSCSI connections on the cluster. Absolutely crazy. In my test environment this took me almost an hour first time round.

Check this out:

http://www.starwindsoftware.com/forums/starwind-f5/full-sync-requirement-both-nodes-are-powered-off-t2132.html

If anyone else has had their implementation hobbled by this oversight, I’d love to hear from you. I’d also be keen to hear when this is addressed in a workable way by StarWind, as this does not seem to be a feature they shout about in the marketing department.