Last week in Asia Pacific and Japan we completed our annual presales conference roadshow (PSCR). At this PSCR I delivered a talk to the presales community on the intersection of three great solutions: disaster recovery (DR), downtime avoidance (DA), and high availability (HA). Each of these is easily understood on its own. But their combinations can introduce mind-bending complexity. I used my presentation to untangle some mental knots.
I believe that mixed DR/DA/HA environments are so complex that very few people in the world fully understand them. Scott Lowe might be the only person I have worked with who appears to have a complete grasp of the situation. I know I do not. So, you might find it strange that I was called upon to lead this talk.
Well, I learned a lot from the talk, both from the research prior to its first delivery and from the hallway conversations that followed my sessions. Along the way I documented, learned, and created a few key concepts. Here they are.
Downtime avoidance and disaster avoidance are two different things. EMC uses VPLEX to deliver the key characteristic of both downtime and disaster avoidance: zero downtime migration over distance. The difference between them is defined by the customer’s expectation of a disaster’s radius and the distance between the datacenters. When two datacenters are far enough apart that they do not both fall within the radius of disaster, it is possible to implement a disaster avoidance solution between them. If they both fall within that radius, only downtime avoidance is possible.
Some customers enjoy VPLEX between two relatively close datacenters. They can then migrate workloads to avoid planned outages. But if those datacenters both fall within the same flood plain, they do not have a disaster avoidance solution. Despite this difference of function, there is no technical difference between disaster and downtime avoidance. So, they can both be called “DA”.
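The distinction above comes down to a single comparison, which can be sketched in a few lines of Python. The function name, the kilometre units, and the example figures are hypothetical illustrations, not anything VPLEX exposes. The sketch assumes only that the customer can state an expected disaster radius.

```python
def avoidance_type(site_distance_km: float, disaster_radius_km: float) -> str:
    """Classify what a stretched two-site deployment can protect against.

    Assumption (illustrative only): a disaster centred on one site affects
    every site within disaster_radius_km of it.
    """
    if site_distance_km < 0 or disaster_radius_km < 0:
        raise ValueError("distances must be non-negative")
    if site_distance_km > disaster_radius_km:
        # The second site sits outside the expected radius of disaster.
        return "disaster avoidance"
    # Both sites can be hit by the same event; only planned outages are covered.
    return "downtime avoidance"


if __name__ == "__main__":
    # Hypothetical figures: a 50 km disaster radius.
    print(avoidance_type(site_distance_km=5, disaster_radius_km=50))
    print(avoidance_type(site_distance_km=200, disaster_radius_km=50))
```

The technology is identical in both branches. Only the customer's expectation of the disaster's radius changes the answer.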
The combination of DR, DA, and multi-site HA can produce unexpected effects. Most of the weirdness comes from using old versions of vSphere. Prior to vSphere 5, HA was controlled by five arbitrarily placed primary nodes. When all five found themselves at the same site, that site’s failure would disrupt HA. Prior to vSphere 4.1 there was no way to recommend virtual machine placement at a site using host affinity. I tell customers these days that multi-site HA should not be attempted without vSphere 5.
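To get a feel for how often the five primary nodes could end up at one site, here is a rough Python sketch. It assumes the primaries are chosen uniformly at random from all hosts in the stretched cluster, which is a deliberate simplification of "arbitrarily placed" and not a description of how vSphere actually elects them. The function name and host counts are hypothetical.

```python
from math import comb


def prob_all_primaries_at_one_site(hosts_site_a: int, hosts_site_b: int,
                                   primaries: int = 5) -> float:
    """Probability that every primary node lands at the same site.

    Assumption (not vSphere's real election logic): primaries are picked
    uniformly at random, without replacement, from all hosts in the cluster.
    """
    total_hosts = hosts_site_a + hosts_site_b
    if total_hosts < primaries:
        raise ValueError("cluster has fewer hosts than primary nodes")
    all_placements = comb(total_hosts, primaries)
    # math.comb returns 0 when a site has fewer hosts than primaries.
    single_site_placements = (comb(hosts_site_a, primaries)
                              + comb(hosts_site_b, primaries))
    return single_site_placements / all_placements


if __name__ == "__main__":
    # Hypothetical stretched cluster with eight hosts per site.
    print(f"{prob_all_primaries_at_one_site(8, 8):.2%}")
```

Under this assumption a cluster with eight hosts per site has a 1 in 39 chance (about 2.6%) of concentrating all five primaries at one site. With four or fewer hosts per site the situation cannot arise at all.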
HA is not DR. This simple fact is not universally understood. HA provides a simple, automated restart mechanism to account for relatively small failures. DR requires documented, tested, potentially complex plans to recover from the loss of an entire site. I now tell customers that HA is a technology. DR is a process.
Joint DR/DA solutions are available to EMC’s customers using a combination of VMware Site Recovery Manager, EMC VPLEX, and any number of layer two virtualization/flattening technologies. Earlier this year, with GeoSynchrony 5.1, EMC released a RecoverPoint splitter that allows replication from a VPLEX deployment to a remote site. With VMware SRM, this means that customers can enjoy DA and HA between the two VPLEX sites and DR to the third, protected site.
In truth, SRM is not a prerequisite for a joint DR/DA solution. In Singapore we have a customer that deployed VPLEX to achieve cross-island DA. The customer uses “traditional” methods of scripting to simplify and automate the disaster recovery process. For environments like this, script creation is pretty straightforward. But script maintenance can be costly and risky. I prefer to see customers using SRM for their DR plans.
Stretching L2 networks across distant sites is far too complicated for me (and many IT professionals) to handle without help. The design of complex networks remains an area of IT in which I am woefully inadequate. And in my experience, many of the professionals I engage are not strong in this area, either. If you are considering a DA or cross-site HA scenario, I recommend you buddy up with a network expert. And make sure you understand the demands your configuration will place on the cross-site link. Dedicated links between distant sites are often costly and limited. If you allow DRS or HA to place virtual machines at sites distant from their users or intranet ingress points, you may find an artificial and avoidable bottleneck caused by the limited link.
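A back-of-the-envelope way to check the cross-site link is to add up the traffic of every virtual machine that runs at a different site from its users. The Python sketch below does just that. The `VM` record, the site names, and the traffic figures are all hypothetical, and it ignores storage replication and other traffic that also shares the link.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class VM:
    name: str
    running_site: str    # site where DRS or HA has placed the VM
    user_site: str       # site of its users or intranet ingress point
    traffic_mbps: float  # assumed average user-facing traffic


def cross_site_utilization(vms: Iterable[VM], link_capacity_mbps: float) -> float:
    """Fraction of the inter-site link consumed by VMs running away from their users."""
    if link_capacity_mbps <= 0:
        raise ValueError("link capacity must be positive")
    remote_traffic = sum(vm.traffic_mbps for vm in vms
                         if vm.running_site != vm.user_site)
    return remote_traffic / link_capacity_mbps


if __name__ == "__main__":
    # Hypothetical inventory on a 1 Gbit/s dedicated link.
    inventory = [
        VM("web01", running_site="A", user_site="A", traffic_mbps=100),
        VM("web02", running_site="B", user_site="A", traffic_mbps=300),
        VM("db01", running_site="B", user_site="A", traffic_mbps=400),
    ]
    print(f"{cross_site_utilization(inventory, 1000):.0%} of the link in use")
```

A result approaching or exceeding 100% suggests that placement should be constrained, for example with host affinity, before the link becomes the bottleneck.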