I’ll leave this in place, but should mention that the concerns i have below are not valid anymore, in version 5.7. See my follow up post here: StarWind 5.7/5.6 and recovery from both nodes down.
Seems starwind consider HA to be slightly more exclusive than their site and marketing blurb let on.
I understand that true HA means, never, ever off, but even investment banks have occasional power-downs, just to prove they can start systems up again afterwards. Beware though, if you ever (and i mean EVER) want to contemplate turning your clustered storage off for a period of time due to a building power cut/act of god/whatever, for now, pick another solution.
It works great if one node is up full time, which i suppose if you are NASA is possible, but its good practice for all organizations to do an occasional poweroff, and every so often you know, even in London, you have a long power outage, or there is building maintenance.
Essentially, the issue is if you power down both nodes of a storage cluster gracefully after powering down your hyper-v/xen/vmware cluster you will not be able to get them up again without MANUALLY specifying the most recent copy of the data (a major issue if you get this wrong and are running any db app) then sitting through a FULL synchronisation, 200gb took almost 12 hours in my test environment during which the cluster was inaccessible as the storage was not accepting incoming connections. In production this would mean your supposedly HA environment would be offline until the storage had done a pointless full sync between nodes.
I checked out the Starwind forum where they claim this is by design, this is totally ridiculous. There are degrees of HA, and it’s not often a midsize company can afford separate power supply companies at either end of the building, which seems to be where most people lose out, for example, we planned to have redundant hosts, redundant storage units, redundant switches all on redundant UPS’s but we only have one provider supplying electricity, to totally eliminate the viability of this platform by not implementing a last write flag on the storage is insane.
Essentially this means a great product is ruined for a large number of it’s users. A real shame. There is a workaround, outlined in this link, but it’s risky and involves judging yourself which replica is most current, deleting the targets, recreating and then recreating ALL iscsi connections on the cluster? absolutely crazy. In my test environment this took me almost an hour first time round.
Check this out:
If anyone else has had their implementation hobbled by this oversight I’d love to hear from you. I’d also be keen to hear when this is addressed in a workable way by Starwind as this does not seem to be a feature they shout about in the marketing department.