VMware’s Jeff Buell has been looking into High Performance Computing (HPC) in support of a new addition to the Office of the CTO. Jeff just posted an article on VROOM! showing outstanding memory bandwidth in vSphere virtual machines. No one should be surprised by this; virtual machine memory bandwidth has rarely been a problem. But Jeff did discuss an advanced configuration parameter that should pique everyone’s curiosity: NUMA.preferHT.
Hyper-threading presents an interesting dilemma for any software running on Nehalem-based processors. For some multithreaded workloads, an operating system scheduler can spread threads across multiple NUMA nodes or co-locate them on a single node. Consider the following figure, which depicts a single 8-way virtual machine scheduled across all eight physical cores on a server.
In this case the threads (vCPUs for vSphere) are each given their own physical core. The benefit is that the vCPUs get unfettered access to their physical cores and the additional computational power that comes with them. The drawback is that shared memory is remote to half of the vCPUs, whose accesses must traverse the interconnect to the other NUMA node. Memory-intensive workloads can run slower as a result.
This second configuration places the same virtual machine’s eight vCPUs on a single NUMA node. Physical cores are now shared, but all memory access is local. Because the vCPUs share cores through Hyper-threading, each contends for CPU cycles and gets less computational power than it would from a dedicated physical core. On the other hand, assuming the virtual machine is sized to fit within a single node, 100% of memory accesses will go to fast, local memory. This can produce better performance for memory-intensive workloads.
vSphere prefers to spread virtual CPUs across NUMA nodes (option one above) to gain the benefit of more physical cores. But if you are running an application where memory throughput is more important than processor speed, you should consider testing a change to vSphere’s default behavior. You can do this by setting the ESX 4.1 advanced parameter NUMA.preferHT to 1, as in the sketch below. This configures the scheduler to prefer consolidating threads on the logical processors of a single NUMA node instead of using more physical cores across multiple nodes.
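As a rough sketch of how to flip the setting, assuming the usual advanced-option path /Numa/PreferHT (verify it on your own host before relying on it), you can use esxcfg-advcfg from the ESX service console; the same option is also reachable in the vSphere Client under Configuration > Advanced Settings > Numa.

    # Hedged sketch; the /Numa/PreferHT path is an assumption, so confirm it with -g first.
    esxcfg-advcfg -g /Numa/PreferHT    # show the current value (0 by default)
    esxcfg-advcfg -s 1 /Numa/PreferHT  # prefer hyper-threads on a single NUMA node

Existing virtual machines may need a power cycle before the scheduler re-places them under the new policy, so plan the change for a maintenance window.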
It would be nice if VMware provided definitive guidance on when virtual machines should be configured to prefer more physical cores (the default setting) or local memory access (NUMA.preferHT=1). But that guidance would depend on the application, the CPU, virtual machine size, consolidation ratio, and utilization. Given that complexity, we are unlikely to see an authoritative word on this any time soon. But that does not stop you from experimenting on your own and sharing results; a rough test sketch follows. I would love to see any results of experiments posted here.
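For anyone who wants to try it, here is a minimal sketch of an experiment, assuming sysbench is installed in the guest (my suggestion, not the methodology from Jeff’s article): run a memory-throughput test inside the 8-vCPU VM with the default setting, set NUMA.preferHT=1, power-cycle the VM, and run the test again.

    # Hedged sketch using sysbench's memory test; STREAM is another common choice.
    sysbench --test=memory --memory-block-size=1M --memory-total-size=32G --num-threads=8 run

Compare the reported throughput between the two runs: a memory-bound workload should show the local-access benefit, while a compute-bound one may do better with the default.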