vPivot

Scott Drummonds on Virtualization

Optimizing vSphere for Hyper-threading

7 Comments »

VMware’s Jeff Buell has been looking into High Performance Computing (HPC) in support of a new addition to the Office of the CTO.  Jeff just posted an article on VROOM! showing outstanding memory bandwidth in vSphere virtual machines.  No one should be surprised by this: virtual machine memory bandwidth has rarely been a problem.  But Jeff did discuss an advanced configuration parameter that should pique everyone’s curiosity: NUMA.preferHT.

Hyper-threading presents an interesting dilemma to any software running on Nehalem-based processors.  For some multithreaded workloads, an operating system scheduler can spread threads across multiple NUMA nodes or co-locate them on a single node.  Consider the following figure, which depicts a single 8-way virtual machine being scheduled across all eight physical cores of a server.

This figure depicts the eight vCPUs of a single virtual machine being spread across two NUMA nodes' eight cores.

In this case the threads (vCPUs for vSphere) are each given their own physical core.  The benefit is that the vCPUs get unfettered access to their physical cores and the resulting additional computational power.  The drawback is that memory shared by the vCPUs is remote to half of them, and those accesses must go to the other NUMA node.  This means memory-intensive workloads might run slower.

This figure depicts the eight vCPUs of a single virtual machine being consolidated to one NUMA node's four cores.

This second configuration places the same virtual machine’s eight vCPUs on a single NUMA node.  This means physical cores are shared but all memory access is local.  The vCPUs now contend for cycles on shared cores, and although Hyper-threading recovers some of that loss, the result is less computational power than dedicated physical cores would provide.  On the other hand, assuming the virtual machine is sized to fit in a single node, 100% of memory accesses will go to fast, local memory.  This can produce better performance for memory-intensive workloads.
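
To make the trade-off concrete, here is a toy model of the two placements. The latency and Hyper-threading figures are illustrative assumptions I picked for the sketch, not measurements or VMware guidance:

```python
# Toy model of the two vCPU placements described above. All numbers are
# illustrative assumptions, not measurements or VMware guidance.

LOCAL_NS = 70.0       # assumed local memory access latency (ns)
REMOTE_NS = 110.0     # assumed remote (cross-node) access latency (ns)
HT_THROUGHPUT = 1.25  # assumed per-core throughput with two busy HT threads,
                      # relative to one thread per core (1.0 = no benefit)

# Placement 1: 8 vCPUs spread over 8 physical cores on 2 NUMA nodes.
# Memory shared by the vCPUs ends up remote for roughly half of them.
spread_compute = 8 * 1.0
spread_latency = 0.5 * LOCAL_NS + 0.5 * REMOTE_NS

# Placement 2 (NUMA.preferHT=1): 8 vCPUs packed onto the 8 logical
# processors (4 cores) of one node, so every access stays local.
packed_compute = 4 * HT_THROUGHPUT
packed_latency = LOCAL_NS

print(f"spread: {spread_compute:.1f} core-equivalents, "
      f"~{spread_latency:.0f} ns average memory latency")
print(f"packed: {packed_compute:.1f} core-equivalents, "
      f"~{packed_latency:.0f} ns average memory latency")
```

With these made-up numbers the spread placement wins on raw compute while the packed placement wins on average memory latency, which is exactly the choice NUMA.preferHT exposes. Only a measurement of your own workload can tell you which effect dominates.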

vSphere will prefer to spread virtual CPUs across NUMA nodes (option one above) to gain the benefit of more physical cores.  But if you are running an application where memory throughput is more important than processor speed, you should consider testing a change to vSphere’s default behavior. You can do this by setting the ESX 4.1 advanced parameter NUMA.preferHT to 1.  This will configure the scheduler to prefer consolidating threads on the logical processors of a single NUMA node instead of using more physical cores across multiple nodes.
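
In ESX 4.1 the parameter lives in the vSphere Client under the host’s Configuration > Software > Advanced Settings > Numa section. If you would rather script the change, the sketch below uses pyVmomi; treat it as modern tooling rather than the 4.1-era method, and note that the host name, credentials, and the exact key capitalization (Numa.PreferHT) are assumptions to verify against your own build:

```python
# Hedged sketch: read and set the host-wide Numa.PreferHT advanced option
# with pyVmomi. Host name and credentials are placeholders; certificate
# checking is disabled only to keep the lab example short.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="esx01.example.com", user="root", pwd="secret",
                  sslContext=ssl._create_unverified_context())
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = view.view[0]                     # first host in the inventory
    opt_mgr = host.configManager.advancedOption

    # Show the current value, then prefer packing vCPUs onto the logical
    # processors of a single NUMA node. The option takes an integer.
    print(opt_mgr.QueryOptions("Numa.PreferHT")[0].value)
    opt_mgr.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Numa.PreferHT", value=1)])
finally:
    Disconnect(si)
```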

It would be nice if VMware provided definitive guidance on when virtual machines should be configured to prefer more physical cores (the default setting) or local memory access (NUMA.preferHT=1). But that guidance would depend on the application, the CPU, virtual machine size, consolidation ratio, and utilization. The complexity likely means that we will not see an authoritative word on this any time soon. But that does not stop you from experimenting on your own and sharing results. I would love to see the results of any experiments posted here.
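
If you want a quick way to produce such results, the sketch below is a crude guest-side memory-throughput probe: run it once with the default scheduling and once with NUMA.preferHT=1 and compare. It is a rough stand-in for a proper benchmark such as STREAM, and the array size and worker count are assumptions you should tune to your virtual machine:

```python
# Crude STREAM-triad-style probe of aggregate memory throughput inside the
# guest. Array size and worker count are assumptions; tune them to the VM.
import time
import numpy as np
from multiprocessing import Pool

N = 16 * 1024 * 1024     # 16M doubles per array (~128 MB); 8 workers use a few GB
REPS = 10
WORKERS = 8              # one worker per vCPU in an 8-way VM

def triad(_):
    b = np.random.rand(N)
    c = np.random.rand(N)
    t0 = time.time()
    for _ in range(REPS):
        a = b + 3.0 * c              # reads b and c, writes a
    elapsed = time.time() - t0
    bytes_moved = 3 * 8 * N * REPS   # roughly three 8-byte arrays per pass
    return bytes_moved / elapsed / 1e9   # GB/s for this worker

if __name__ == "__main__":
    with Pool(WORKERS) as pool:
        rates = pool.map(triad, range(WORKERS))
    print(f"aggregate ~{sum(rates):.1f} GB/s across {WORKERS} workers")
```

If the aggregate figure rises noticeably with NUMA.preferHT=1, your workload is in the memory-bound camp described above; if it falls or stays flat, the extra physical cores are probably worth more to you.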

7 Responses

[…] VMware performance guru and (now) vSpecialist Scott Drummonds recently posted a great piece on optimizing vSphere for hyper-threading. In his article, Scott discusses the NUMA.preferHT configuration parameter and the potential […]

  • Great article as always, Scott!

    I’m very interested in this as I have an 8-vCPU machine going onto a 2 x 6-core Westmere (Intel X5670) physical server. With ESX 4.1 wide-NUMA support I’d expect this to be split into 4 vCPUs and 4 vCPUs, running on separate NUMA nodes. My worry, however, is that this application is designed to run in memory to reduce latency, so wide-NUMA support may actually become a performance hot spot due to remote memory access across NUMA nodes.

    In theory, with HT enabled and NUMA.preferHT=1 set, those 8 vCPUs would be locked to one NUMA node consisting of 6 physical cores. I personally feel that, given the application’s design, memory speed is actually going to be more important than CPU utilisation (CPU utilisation is quite low in the current physical version of this app). I will be conducting some load testing to measure performance at the application level; if I remember, I will report back my findings.

    Regards

    Craig

    • Craig,

      Don’t be so sure that vSphere will split the vCPUs evenly across nodes. On a six-core processor an 8-way VM could find six of its vCPUs on one NUMA node and the other two distant.

      The wide-VM NUMA support in 4.1 did not actually allow ESX to do anything it could not do before; it just made it more efficient. Now, instead of arbitrarily placing pages on either of the two NUMA nodes, the pages are placed near the cores that use them most.

      In any case, I very much hope to see your results and would love to give you a guest blog post if you produce something you want to share!

      Scott

  • […] the core, it will only schedule one vCPU of a vSMP virtual machine onto one core. Scott Drummond article about numa.preferHT might offer a solution. Setting the advanced parameter numa.preferHT=1 allows […]

  • […] on logical processors on a single NUMA instead of using more physical cores across multiple nodes. (Source) […]

  • […] Drummonds also has nice articles on hyper-threading here and […]

  • Nice article; however, why not use something else (maybe a block object) to represent the CPUs instead of using RAM?