vPivot

Scott Drummonds on Virtualization

Datacenter Optimization


The article I wrote on VMTurbo’s alternative to DRS generated an incredible amount of interest. I received numerous emails, tweets, and a couple of comments. One comment led me to VKernel, one led me to CiRBA, and one slyly hinted at a sham in one partner’s offering.  This investigation has been fun and informative.

Today I am going to share with you some of my thoughts on datacenter optimization and why I think VMware’s partner community is succeeding where VMware is not.

For the sake of argument, let me take a contrary position.  I will propose one statement that I promise you I do not believe. Bear with me and follow this through:

What if I said to you, “you should not trust DRS”?  How would you react?  My knee-jerk reaction to this comment is, “You’re an idiot.  DRS is invaluable in tens of thousands of customer deployments worldwide.”  While both of these statements may be true, I must concede there are several reasons why DRS is inefficient, ineffective, or worthy of suspicion in today’s datacenters:

  • DRS has no competition.  The incredible improvements to the vMotion implementation in vSphere 4.1 came from an honest assessment of competitive options.  Until DRS technology receives the same competitive inspection, we cannot see its warts.
  • DRS is, at its heart, a preventative mechanism for avoiding performance bottlenecks.  But most performance problems today come from storage throughput and latency, and DRS currently provides no mechanism for addressing storage bottlenecks.
  • VMware provides no tools to correct cross-cluster imbalance.  And if you agree with the small cluster size argument, then you will deploy many small clusters, which will drag down your datacenter’s efficiency (see the sketch after this list).
  • VMware has not published the DRS algorithm.  There is therefore no way for customers to prove that vMotion anomalies are coming from DRS corner cases.  For what it’s worth, I think VMware should never expose its DRS algorithm; it is core intellectual property.  Along that vein, I do not think VMware needs to answer questions about the scheduler, but they are kind enough to do so anyway.
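To make the cross-cluster imbalance point concrete, here is a minimal sketch of the kind of check a tool could run, assuming it can already pull average per-cluster utilization out of vCenter.  The cluster names, numbers, and the choice of standard deviation as an imbalance score are my own illustrations, not any vendor’s algorithm.

    from statistics import pstdev

    # Average CPU utilization (0.0-1.0) per cluster, e.g. gathered from vCenter.
    # Names and numbers are illustrative only.
    cluster_cpu_util = {"prod-a": 0.82, "prod-b": 0.45, "test": 0.20}

    def imbalance(utilizations):
        """Population standard deviation of utilization across clusters.
        A higher score suggests capacity is spread unevenly and some
        cross-cluster rebalancing may be worthwhile."""
        return pstdev(utilizations.values())

    mean = sum(cluster_cpu_util.values()) / len(cluster_cpu_util)
    print(f"CPU imbalance score: {imbalance(cluster_cpu_util):.2f}")
    for name, util in sorted(cluster_cpu_util.items(),
                             key=lambda kv: abs(kv[1] - mean), reverse=True):
        print(f"  {name}: {util:.0%} (deviation from mean {util - mean:+.0%})")

A real tool would of course weigh all five performance resources and propose specific migrations, but even a crude score like this makes the imbalance visible in a way VMware’s tools today do not.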

For the record, I absolutely believe in DRS as a core part of the value of today’s virtual infrastructures.  But I recognize the validity of those who question it.  These reasons have compelled the partner community to augment VMware’s capacity management and optimization capabilities with tools that provide better visibility and enhanced automation.  VMTurbo is one such company and CiRBA, who I talked to a couple of weeks ago, is another.

CiRBA CTO Andrew Hillier was kind enough to spend an hour with me last week giving me an overview of his offering.  CiRBA’s net additions to your datacenter include cross-cluster capacity optimization, capacity modeling on hypothetical clusters, capacity estimation across differing instruction sets, and other cool features.  But it was Andrew’s suggestion that DRS-initiated migrations are a symptom of poor capacity management that touched off the reasoning behind the polemic above.

The truth is that incredible differences exist across the heterogeneous hardware in today’s VMware deployments.  These datacenters demand better tools for optimizing capacity across different clusters.  That optimization will eventually beg for automation.  And this problem may be too complex for any one vendor to solve.  But there are some technology pieces we need to see before we can try to solve this puzzle.

So, here is my challenge to VMware, its customers, and the partner community:

Customers, if you want optimal capacity using VMware’s tools of today, you are going to have to design highly standardized, homogeneous datacenters based on predictable hardware.  VCE’s Vblock is one such option in this space.  However, if you want the benefits of pitting multiple hardware vendors against each other, you are going to have to use complex capacity optimization offerings like CiRBA’s.

VMware or its partners, the world needs a virtual environment capacity dashboard. There are five performance resources (I am excluding here non-performance resources like physical space, power, etc.) that drive most capacity decisions: CPU, memory, network, storage capacity, and storage throughput. We need someone to graphically represent each cluster’s use of these resources as a percentage of its possible maximums.  This solution calls for user interface wizardry that is more common in consumer products like iPads than in enterprise software.  Furthermore, this dashboard implies that the maximums are calculated correctly, which is non-trivial for storage throughput.
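To illustrate what I mean, here is a minimal sketch of the calculation behind such a dashboard, assuming per-cluster usage and estimated maximums are already available.  All names and figures are hypothetical; the genuinely hard part is producing trustworthy maximums, especially for storage throughput.

    # Five performance resources per cluster; "used" would come from monitoring
    # data and "maximum" from (the hard part) correctly calculated limits.
    RESOURCES = ("cpu_mhz", "memory_gb", "network_gbps",
                 "storage_capacity_tb", "storage_iops")

    clusters = {
        "prod-a": {
            "used":    {"cpu_mhz": 180_000, "memory_gb": 1_400, "network_gbps": 14,
                        "storage_capacity_tb": 90, "storage_iops": 42_000},
            "maximum": {"cpu_mhz": 240_000, "memory_gb": 2_048, "network_gbps": 40,
                        "storage_capacity_tb": 120, "storage_iops": 60_000},
        },
    }

    def utilization_report(clusters):
        """Return {cluster: {resource: percent of estimated maximum}}."""
        return {name: {r: 100.0 * data["used"][r] / data["maximum"][r]
                       for r in RESOURCES}
                for name, data in clusters.items()}

    for cluster, percentages in utilization_report(clusters).items():
        for resource, pct in percentages.items():
            print(f"{cluster:8s} {resource:20s} {pct:5.1f}% of estimated maximum")

The arithmetic is trivial; the value is in the visualization on top of it and in getting the denominators right.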

VMware and its partners need to solve the problem of managing storage throughput the way they manage CPU or memory.  Storage IO Control uses latency as a sentinel of storage throughput bottlenecks, and presumably VMware will use technology like this in storage DRS some day.  VMware and the storage vendor community are working on APIs that will allow arrays to pass their own assessment of LUN speed back up to vCenter.  Some tools estimate maximums from observed usage patterns.  All of these options (storage DRS, improved APIs, predictive analysis) are either half-baked or non-existent.
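As a rough illustration of the latency-as-sentinel idea, here is a small sketch that flags a datastore as congested when device latency stays above a threshold.  The threshold, window, and sample values are my own assumptions for the example, not Storage IO Control’s actual defaults or algorithm.

    LATENCY_THRESHOLD_MS = 30   # assumed congestion threshold for this sketch
    WINDOW = 4                  # consecutive samples required before flagging

    def is_congested(latency_samples_ms, threshold=LATENCY_THRESHOLD_MS, window=WINDOW):
        """Flag a datastore as saturated when the most recent `window`
        latency samples are all above the threshold."""
        recent = latency_samples_ms[-window:]
        return len(recent) == window and all(s > threshold for s in recent)

    datastore_latency_ms = [12, 18, 35, 41, 38, 44]   # illustrative samples
    if is_congested(datastore_latency_ms):
        print("Datastore looks saturated; throttle lower-priority VMs or "
              "consider moving workloads elsewhere.")

Latency is a symptom rather than a capacity measure, which is exactly why the dashboard above still needs a real maximum for storage throughput.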

There is an incredible amount of work going on at VMware, EMC, and other partners to solve some of these problems of datacenter optimization.  But I think an overarching solution is still an unknown distance over the horizon.  And as the VMware universe grows, the customer demand for such a solution only increases.

One Response

Scott,

Thanks for the comprehensive rendering of this important issue. Indeed, as we discussed, DRS is a key piece of the VMware strategy, yet there are some very critical deficiencies which may significantly reduce virtualization ROI and impact application performance.

I want to touch on several points in your post. One is that DRS-initiated migrations could be a symptom of bad capacity management. This is definitely true. I just want to remind you of the old military maxim, often attributed to Helmuth von Moltke: “No battle plan survives first contact with the enemy.”

You can try to plan offline, but this “optimal” plan can very quickly become far from optimal, given the dynamic nature of workloads in a rapidly changing, interconnected virtualized environment. If one does not have a way to do real-time analysis, real-time corrective action, and optimization, this offline planning process will have to be repeated again and again, putting the performance of the entire infrastructure at risk and reducing automation.

So components like DRS are needed. But they cannot be confined to their small cluster quarters, and one cannot look at only one or two resources at a time; an innocent vMotion to avoid high memory utilization can cause CPU ready queue congestion or high latency on the network card of another host. Once you start looking at more resources, larger clusters, and heterogeneous components (as opposed to standard, uniform building blocks), which is today’s reality, traditional resource scheduling algorithms need an enormous amount of time and resources to come up with accurate suggestions, exactly at the moment when one needs to react in real time to dynamic demand fluctuations in a large, heterogeneous, shared infrastructure, across stacks, clusters, datacenters, and clouds.