Scott Drummonds on Virtualization

Newer Processors and Virtualization Performance


[This is an update, with new content added, to an old article from the performance community.]

Newer processors are much more important to virtualized environments than to their non-virtualized counterparts. Generational improvements have not just increased raw compute power; they have also reduced virtualization overheads. This blog entry describes three key changes that have particularly affected virtual performance.

Hardware Assist Is Faster

In 2008, with the launch of the Opteron 1300, 2300 and 8300 parts, AMD became the first CPU vendor to produce a hardware memory management unit equipped to support virtualization. They called this technology Rapid Virtualization Indexing (RVI). This year Intel did the same with Extended Page Tables (EPT) on its Xeon 5500 line. Both vendors have been providing the ability to virtualize privileged instructions since 2006, with continually improving results. Consider the following graph showing the latency of one key instruction from Intel:


This instruction, VMEXIT, is executed each time the guest exits to the kernel. The graph shows the latency (delay) of completing this instruction, which represents a wait time incurred by the guest. Clearly Intel has made great strides in reducing VMEXIT’s wait time from its Netburst parts (Prescott and Cedar Mill) to its Core architecture (Merom and Penryn) and on to its current generation, Core i7 (Nehalem). AMD processors have shown commensurate gains with AMD-V.
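To make the cost of exit latency concrete, here is a rough back-of-the-envelope sketch. It assumes the overhead is simply the exit rate multiplied by the exit latency, which ignores the time the hypervisor spends handling each exit. The clock speed, exit rate, and cycle counts are hypothetical values chosen for illustration; they are not measurements from the graph.

```python
def exit_overhead_fraction(exits_per_second, exit_latency_cycles, clock_hz):
    """Fraction of one core's cycles spent waiting on guest exits.

    Assumed model: overhead = exit rate x exit latency. This counts only
    the transition itself, not the hypervisor's handling of the exit.
    """
    return exits_per_second * exit_latency_cycles / clock_hz


# Hypothetical inputs for illustration only; not measured values.
CLOCK_HZ = 3.0e9
EXITS_PER_SECOND = 100_000

for label, latency_cycles in [("older part", 2000), ("newer part", 500)]:
    fraction = exit_overhead_fraction(EXITS_PER_SECOND, latency_cycles, CLOCK_HZ)
    print(f"{label}: {fraction:.2%} of cycles spent in exits")
```

The point of the sketch is only that, at a fixed exit rate, this part of the overhead scales linearly with exit latency, so a generational reduction in latency translates directly into cycles returned to the guest.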

In a recent white paper detailing SQL Server on vSphere, the following graph showed the gains derived by using AMD-V in the Opteron 8324 (Shanghai).

Binary translation, AMD-V, and AMD-V plus RVI are measured using SQL Server.

This graph shows the practical value of the great gains that CPU manufacturers have made with virtualization assist. Hardware assist can now be regularly relied upon for great performance.

Pipelines Are Shorter

The longest pipelines in the x86 world were in Intel’s Netburst processors. These processors’ pipelines had twice as many stages as their counterparts at AMD and twice as many as the generation of Intel CPUs that followed. The increased pipeline length would have enabled support for 8 GHz silicon, had it arrived. Instead, silicon switching speeds hit a wall at 4 GHz and Intel (and its customers) were forced to suffer the drawbacks of long pipelines.

Long pipelines are not necessarily a problem for desktop environments, where single-threaded applications used to dominate the market. But in the enterprise, application thread counts were larger. Furthermore, consolidation in virtual environments drove thread counts even higher. With more contexts in the processor, the number of pipeline stalls and flushes increased, and efficiency fell.

Because of the decreased efficiency of consolidated workloads on processors with long pipelines, VMware has often recommended that performance-intensive VMs be run on processors no older than 2-3 years. This excludes Intel’s Netburst parts. VI3 and vSphere will do a fine job of virtualizing your less-demanding applications on any supported processor. But you should use newer parts for the applications of which you have the highest performance expectations.

Caches Are Larger

A cache is highly effective when it fully contains the software’s working set. The addition from the hypervisor of even a small amount of code will change the working set and reduce the cache hit rate. I’ve attempted to illustrate this concept with the following simplified view of the relationship between cache hit rates, application working set, and cache sizes:

Performance drops with small cache systems for even small increases to working set size.

This graph is based on a model that greatly simplifies working sets and the hypervisor’s impact on them. Assuming that ESX increases the working set by 256 KB, this graph shows the decrease in cache hit rate due to the contributions of the hypervisor. Notice that with very small caches and very small application working sets, the cache hit rate suffers greatly due to the addition of even 256 KB of virtualization code. And even up to 2 MB, a 10% decrease in cache hit rate can be seen in some applications. With a 256 KB contribution by the kernel, cache hit rates do not change significantly with cache sizes of 4 MB and beyond.
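The model behind the graph is not spelled out here, so the following is only a minimal sketch of one way to reason about it. It assumes uniformly random access over the working set, so that the hit rate is simply the fraction of the working set that fits in the cache. That formula and the function names are my assumptions for illustration; only the 256 KB figure comes from the text above.

```python
HYPERVISOR_KB = 256  # working-set growth assumed in the article


def hit_rate(cache_kb, working_set_kb):
    """Assumed toy model: uniform access, so hit rate is the cached fraction."""
    return min(1.0, cache_kb / working_set_kb)


def hit_rate_drop(cache_kb, app_working_set_kb, extra_kb=HYPERVISOR_KB):
    """Decrease in hit rate when the hypervisor adds extra_kb to the working set."""
    native = hit_rate(cache_kb, app_working_set_kb)
    virtual = hit_rate(cache_kb, app_working_set_kb + extra_kb)
    return native - virtual


for cache_kb in (512, 1024, 2048, 4096, 8192):
    # Worst case in this model: the application alone exactly fills the cache.
    drop = hit_rate_drop(cache_kb, cache_kb)
    print(f"{cache_kb:>5} KB cache: hit rate drops by {drop:.1%}")
```

Under these assumptions the worst-case drop is about 11% with a 2 MB cache and about 6% with a 4 MB cache, and it is zero whenever the application's working set plus 256 KB still fits. Real caches are sensitive to access patterns, associativity, and replacement policy, so treat this as a way to see the trend rather than a prediction.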

In some cases a 10% improvement in cache hit rate can double application throughput. This means that a doubling of cache size can profoundly affect the performance of virtual applications as compared to native. Given ESX’s small contribution to the working set, you can see why we at VMware recommend that customers run their performance-intensive workloads on CPUs with 4 MB caches or larger.
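A quick way to see how a modest change in hit rate can have such a large effect is the textbook average-access-time formula. The sketch below assumes a fully memory-bound workload whose throughput is inversely proportional to average access time, and reads "10%" as ten percentage points (80% to 90%). The 1-cycle hit cost and 200-cycle miss penalty are hypothetical round numbers, not figures from any particular processor.

```python
def relative_throughput(hit_rate, hit_cycles=1.0, miss_penalty_cycles=200.0):
    """Throughput of a memory-bound workload, taken as 1 / average access time.

    Average access time = hit cost + miss rate x miss penalty.
    The default cycle counts are illustrative assumptions.
    """
    avg_access_cycles = hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles
    return 1.0 / avg_access_cycles


speedup = relative_throughput(0.90) / relative_throughput(0.80)
print(f"80% -> 90% hit rate: {speedup:.2f}x throughput")
```

With these assumed numbers the speedup comes out a little under 2x, because moving from an 80% to a 90% hit rate halves the miss rate and misses dominate the access time. Workloads that are less memory-bound would see a smaller gain.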

4 Responses

Scott, do you have insight on how Xeon 5500 turbo mode is utilized with ESX 3.5 and 4.0? Is it OK to have it enabled on ESX server BIOS?

  • The turbo mode feature will accelerate the clock speed of one core if others are idle. This way that thread’s performance can be increased while the processor is kept within the TDP.

    This functionality happens in the hardware and is transparent to ESX or any OS. Set it or ignore it with any version of ESX–it will not change VMware functionality.

    I surmise that the processor will infrequently use turbo mode in consolidated environments where there are many threads distributed across the cores. But this belief is not backed by data.

  • Any views on HyperTransport 1 vs 3 in a VMware environment? I’ve been unable to find any performance reference and yet HT3 is usually enabled in benchmarks. Knowing that benchmarks do not always reflect real-life workloads, I hope you have some insight into what to choose.

    • We do not benchmark specific features of microprocessors with the exception of Hyper-threading and virtualization-specific features. So, we do not have anything on HyperTransport.

      But, I would expect it to have very little value to our most common benchmark, VMmark. Proper placement of VMs on NUMA nodes would minimize remote memory access and reduce the value of a better internode bus, such as HyperTransport.

      But this is just hand waving. We do not have any data.