vPivot

Scott Drummonds on Virtualization

vSphere 4.0, Hyper-Threading, and Terminal Services


I recently wrote a blog article detailing Hyper-Threading (HT) and its effect on vSphere.  An astute reader pointed out that a recent update to Project VRC’s terminal services analysis suggests disappointment with HT on vSphere.  We spent a lot of time looking at those results to understand why they contradicted the body of performance data, which shows HT offering a 10-30% gain on vSphere.  What we discovered led us to create a vSphere patch that allows users to improve performance in some benchmarking environments.

Among the many results presented by VRC, the configurations that most perplexed us were the two and four virtual machine configurations, each with four vCPUs per virtual machine.  The configuration with two virtual machines looked good and matched our internal numbers.  In this configuration there are a total of eight vCPUs on the host, so each vCPU maps to its own physical core on the Xeon 5500 series processors.  The problem arose when the virtual machine count was increased to four, resulting in 16 total vCPUs.  In this configuration each vCPU is paired with one logical (Hyper-Threaded) processor.  Project VRC showed this configuration supporting no more desktops than the two-VM configuration, which suggests that Hyper-Threading adds no value in this configuration.
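The arithmetic behind the two configurations can be sketched in a few lines of Python. This is only an illustration: the host layout (two sockets, four cores per socket, two hardware threads per core) is an assumption chosen to match the eight physical cores described above, and the function name is made up for this example.

```python
# Assumed host layout (not spelled out in the post): two sockets, four cores
# each, and two hardware threads per core when Hyper-Threading is enabled.
SOCKETS = 2
CORES_PER_SOCKET = 4
THREADS_PER_CORE = 2


def vcpu_ratios(vm_count: int, vcpus_per_vm: int) -> dict:
    """Return total vCPUs and their ratio to physical cores and logical processors."""
    physical_cores = SOCKETS * CORES_PER_SOCKET
    logical_processors = physical_cores * THREADS_PER_CORE
    total_vcpus = vm_count * vcpus_per_vm
    return {
        "total_vcpus": total_vcpus,
        "vcpus_per_core": total_vcpus / physical_cores,
        "vcpus_per_logical_processor": total_vcpus / logical_processors,
    }


if __name__ == "__main__":
    # The two Project VRC configurations discussed above: 2 and 4 four-way VMs.
    for vm_count in (2, 4):
        print(vm_count, "VMs:", vcpu_ratios(vm_count, vcpus_per_vm=4))
```

With two VMs there is one vCPU per physical core. With four VMs there is one vCPU per logical processor, which is the case that depends on Hyper-Threading to deliver any extra throughput.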

It took us some time to understand the reason for these results, but we eventually identified a very specific condition where ESX’s scheduler enforces fairness in scheduling vCPUs at the cost of throughput.  ESX’s scheduler has long been the subject of intensive scrutiny by a large number of VMware engineers to guarantee fair access to the processor for each virtual machine.  It is because of this fairness that VMware’s customers can rely on CPU resource controls.  But when fairness goes too far, throughput may be sub-optimal.

Hyper-Threading presents particular problems to fairness because of the non-linear performance it delivers.  A thread will run at one speed when it has full access to a physical core, at another speed when it is sharing a core, and at a third speed when sharing a core with a different thread.  As a result, ESX’s scheduler will sometimes pause a thread to enforce fairness.  These pauses are more common when Hyper-Threading is present, because the scheduler must account for its lack of uniformity in thread performance.  If the host lacks vCPUs that are ready to run, the result is CPU utilization below saturation, leaving CPU cycles unused.
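A toy model makes the cost of those pauses easier to see. It is not how ESX's scheduler works internally. It simply assumes that a core running one thread delivers 1.0 units of work, that a core running two threads delivers 1.0 plus a Hyper-Threading gain (the 10-30% range quoted earlier), and that a paused vCPU leaves its sibling alone on the core. All names and parameters are hypothetical.

```python
def core_throughput(threads_on_core: int, ht_gain: float) -> float:
    """Relative work per unit time delivered by one physical core (toy model)."""
    if threads_on_core not in (0, 1, 2):
        raise ValueError("a Hyper-Threaded core runs 0, 1, or 2 threads")
    if threads_on_core == 0:
        return 0.0
    if threads_on_core == 1:
        return 1.0
    return 1.0 + ht_gain  # two threads share the core and split the total


def host_throughput(cores: int, shared_fraction: float, ht_gain: float) -> float:
    """Host throughput when each core runs two vCPUs for `shared_fraction`
    of the time and a single vCPU (its sibling paused) for the rest."""
    shared = core_throughput(2, ht_gain)
    dedicated = core_throughput(1, ht_gain)
    return cores * (shared_fraction * shared + (1.0 - shared_fraction) * dedicated)


if __name__ == "__main__":
    # Assumed values: 8 physical cores and a 20% Hyper-Threading gain.
    for fraction in (1.0, 0.5, 0.0):
        print(fraction, host_throughput(cores=8, shared_fraction=fraction, ht_gain=0.2))
```

In this model, the more often a vCPU is paused so that another can have a whole core, the closer the host falls back to the throughput it would have without Hyper-Threading.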

There are three specific conditions that, together, can excite this behavior:

  1. A Xeon 5500 series processor is present with Hyper-Threading enabled,
  2. CPU utilization is near saturation, and
  3. A roughly one-to-one mapping between vCPUs and logical processors.
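As a rough sketch, the three conditions can be written as a single check. The thresholds are assumptions made for illustration only: the post does not define "near saturation" or "roughly one-to-one" numerically, so 90% utilization and a ratio within 0.25 of 1.0 are placeholders, and the `HostSnapshot` type is hypothetical rather than any VMware API.

```python
from dataclasses import dataclass


@dataclass
class HostSnapshot:
    """Hypothetical description of a host; the caller fills in every field."""
    is_xeon_5500: bool
    hyperthreading_enabled: bool
    cpu_utilization: float  # 0.0 (idle) to 1.0 (saturated)
    total_vcpus: int
    logical_processors: int


def meets_three_conditions(
    host: HostSnapshot,
    utilization_threshold: float = 0.9,  # assumed meaning of "near saturation"
    ratio_tolerance: float = 0.25,       # assumed meaning of "roughly one-to-one"
) -> bool:
    """Return True only when all three conditions from the list above hold."""
    ratio = host.total_vcpus / host.logical_processors
    condition_1 = host.is_xeon_5500 and host.hyperthreading_enabled
    condition_2 = host.cpu_utilization >= utilization_threshold
    condition_3 = abs(ratio - 1.0) <= ratio_tolerance
    return condition_1 and condition_2 and condition_3


if __name__ == "__main__":
    # Four busy 4-way VMs on a host with 16 logical processors.
    benchmark_host = HostSnapshot(True, True, 0.97, 16, 16)
    print(meets_three_conditions(benchmark_host))
```

A host with many more vCPUs than logical processors, or one that is far from saturated, fails the check under these assumed thresholds.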

In this scenario, VMware vSphere favors fairness over throughput and sometimes pauses one vCPU to dedicate a whole core to another vCPU, eliminating gains provided by Hyper-Threading.  In cases outside of these three conditions, the performance of VMware vSphere 4 meets the high expectations of VMware’s R&D team and its customers.  Of course, production environments rarely (never?) have a one-to-one ratio of vCPUs to logical processors.  Such a ratio occurs when there are only four 4-way virtual machines on a Xeon 5500 system, for example.

But environments such as Project VRC’s are simplifications of production environments meant to understand the capabilities of virtual platforms.  VMware has provided a patch to Project VRC that will allow them to improve throughput in their environment.  We are going to release this patch and its documentation to the general public within a couple of weeks.  I do not expect that any of VMware’s customers will benefit from the changes it allows, but I will later document the patch and its usage for anyone who cares to experiment.

12 Responses

Really, thanks a lot to you and your team for clearing this up.
I can’t possibly stress enough how valuable this kind of information and open handling of potential doubts in ESX is to people like me.

  •  Hey people,

    When I read the article a couple of days ago I was somewhat confused by the final statements made.
    It all made sense, and the 3 conditions that excite the issue made sense, but then the final payload confused me totally.

    “…/production environments rarely (never?) have a one-to-one ratio of vCPUs to logical processors”

    I do for all demanding workloads such as RDS and SQL. Am I doing something crazy here?

    “I do not expect that any of VMware’s customers will benefit from the changes it allows, but I will later document the patch and its usage for anyone that cares to experiment”

    This statement (within context) suggests that the patch was just a patch designed to improve the results within certain conditions in the Project VRC benchmark, not in “real” production environments.

    My question is therefore: Do we have an issue or do we not have an issue? Please educate me.

    • Kimmo,

      If your production environment meets the three requirements, you may benefit from the patch we are going to release. Using 4-way VMs on any Xeon 5500 processor, meeting requirement three means having only four VMs on the host. It is uncommon for our customers to run so few VMs on a Xeon 5500 system.

      Scott

      • I’m not clear on the requirement and hoping you can help. 4, four-way VMs would be a total of 16 vcpus. So for the specific condition to be met, you would have to be on a host with 4 Xeon 5500s? Do they exist? Or does the one-to-one mapping requirement apply to threads, in which case the 4ea. @ 4 VCPU VMs would meet the requirement on the 2 socket Xeon 5500.

        • I see you said for condition 3 “A roughly one-to-one mapping between vCPUs and logical processors.” Logical Processors clears it up. Thanks! This is good stuff.

  • Thanks Scott

  • Scott,
    Wouldn’t having 4 very busy 4-way VMs on a 5500 with additional idle VMs cause the same issue? Or is this really very much a corner case where the lack of a tiebreaker input from the idle VMs throws off the scheduler?

    • If the vCPU-to-core ratio is greater than one, the scheduler is always able to find work to plug into the available scheduler slots. Even idle VMs fill a small amount of CPU. So, we are not seeing this issue with any large number of VMs.

  • […] few weeks ago VMware acknowledged a bug in ESX that translates into poor performance when it runs Microsoft Terminal Services workload on Intel […]

  • […] Intel Nehalem processors when Hyper-Threading is enabled, the good folk at VMware have rushed out a patch. To quote:'It took us some time to understand the reason for these results, but we eventually […]

  • Has the patch been released to the public yet?

    If so, is there any down-side to installing the patch?

    • Yes. The patch was released and documented in KB article 1020233. The patch actually makes no modification to ESX behavior at all. It simply opens up the possibility of a special scheduler configuration. That special scheduler configuration should be utilized when the condition described by the KB occurs.