Virtual machine sizing is a tricky issue for many VMware administrators. It is important to find the right number of virtual CPUs to maximize application performance and minimize wasted CPU cycles. The optimal number of vCPUs is rarely easy to identify, but I can offer a few suggestions to help get this number right.
ESX must expend CPU cycles to maintain running virtual CPUs whether they are being used by an application or not. This means that host efficiency drops as more vCPUs are put on the server. But applications that scale well with CPUs will deliver greater performance when their virtual machines have been given more CPUs. The administrator must therefore balance the desires of an individual application’s owner with the needs of the entire cluster of applications.
There are several resources that VI administrators can use to inform their decisions in virtual machine sizing. I have listed some of them below.
Bruce Herndon’s Cost-of-SMP Article
Last summer the VMmark team’s Bruce Herndon published an article on the cost of SMP. I summarized his findings in a vPivot article I wrote on the ESX 4 scheduler. There are two key messages that you can take away from these posts to inform your decisions on virtual machine sizing:
- Over-sized virtual machines only hurt system performance when the server’s CPUs are saturated. When utilization is low, unneeded vCPUs only penalize the system’s CPU utilization, not the applications’ performance.
- Unneeded 2-way virtual machines are not very harmful to the environment. But administrators should be very careful with 4-way virtual machines and larger.
Co-stop and Ready Time
Ready time indicates a vCPU waiting for an available core when it has work to perform. Co-scheduling stop time (or co-stop time) indicates a vCPU being paused by the scheduler to allow its sibling vCPUs to catch up. These two counters can help administrators recognize a certain kind of stress due to limited CPU resources.
Ready time is generally a sign that CPU is unavailable. Correcting it usually requires the administrator to reduce work on the host (migrating virtual machines, decreasing vCPU counts, etc.) or to increase CPU capacity (more hosts or faster CPUs). Co-stop time is a sign that the scheduler is allowing vCPUs to develop skew while it runs portions of virtual machines on available cores. Values worth noting for these counters are 10% ready time and 3% co-stop time. There is no guarantee that application performance is suffering if these thresholds are crossed, but a problem may be present.
The important thing about ready time and co-stop time is that they are signs that you are using all of the CPU you have available to you. This could be a Good Thing. But it could also be a surprise to you. When these counters get high it is a good time to start asking yourself if your capacity usage meets your expectations. If not, you should inspect your virtual machines to be sure that the applications are using the vCPUs you have given them. If your guest tools show poor in-guest utilization then decrease those VM sizes. That will free up resources in the cluster for more virtual machines.
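As a rough illustration, the 10% ready time and 3% co-stop thresholds mentioned above can be turned into a simple check. This is only a sketch: the `VmCpuSample` type, its field names, and the `cpu_contention_flags` function are hypothetical, and it assumes you have already collected both counters as percentages from whatever monitoring tool you use. A flag here means "worth a look", not "the application is suffering".

```python
from dataclasses import dataclass

# Thresholds taken from the text above; treat them as starting points.
READY_THRESHOLD_PCT = 10.0
CO_STOP_THRESHOLD_PCT = 3.0


@dataclass
class VmCpuSample:
    """Hypothetical record of one VM's scheduler counters, in percent."""
    name: str
    ready_pct: float
    co_stop_pct: float


def cpu_contention_flags(sample: VmCpuSample) -> list[str]:
    """Return the names of the counters that meet or exceed their threshold."""
    flags = []
    if sample.ready_pct >= READY_THRESHOLD_PCT:
        flags.append("ready")
    if sample.co_stop_pct >= CO_STOP_THRESHOLD_PCT:
        flags.append("co-stop")
    return flags


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    samples = [
        VmCpuSample("vm-a", ready_pct=12.5, co_stop_pct=0.4),
        VmCpuSample("vm-b", ready_pct=2.0, co_stop_pct=0.1),
    ]
    for s in samples:
        print(s.name, cpu_contention_flags(s) or "no flags")
```

A VM that comes back flagged is a candidate for the in-guest utilization check described above before you change its size.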
Application Scalability Information
I wish we lived in a world where every ISV published data showing their applications’ abilities to scale with cores. Unfortunately for us, many software vendors have for years allowed their customers to assume that each doubling of cores would double the performance of the application. VMware has chosen to provide some scalability information so our customers know how well or how poorly applications scale. But every customer of a software company deserves to have the vendor provide guidance on sizing the server. And those vendors deserve the right to put these results out on their own products. Go talk to your ISV to get the information you need to size your virtual machines.
CPU Usage Calculations and CapacityIQ
I am belatedly updating this post with a fourth way of identifying oversized virtual machines: mathematical calculation or CapacityIQ.
When a virtual machine consistently uses only a fraction of its vCPU resources it is possible that the virtual machine can be downsized and still deliver the same application performance. The calculation to determine this is simple: multiply the vCPU count by utilization and round up. Set the virtual machine’s vCPU count to the result of that calculation.
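The calculation above can be sketched in a few lines. The function name `recommended_vcpus` and its parameters are my own for illustration, and it assumes utilization is expressed as a fraction between 0 and 1. Two details are additions not stated in the text: the result is floored at one vCPU, since a virtual machine cannot run with zero, and the product is rounded to six decimal places before rounding up so that floating-point noise does not add a spurious vCPU.

```python
import math


def recommended_vcpus(vcpu_count: int, utilization: float) -> int:
    """Multiply vCPU count by utilization and round up.

    utilization is a fraction (0.25 means 25%). Which figure you feed in
    (long-term average, peak, etc.) is your choice; the text only says
    the VM "consistently" uses a fraction of its vCPUs.
    """
    if vcpu_count < 1:
        raise ValueError("vcpu_count must be at least 1")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    # round() guards against float artifacts before taking the ceiling.
    needed = math.ceil(round(vcpu_count * utilization, 6))
    # Assumption: never recommend fewer than one vCPU.
    return max(1, needed)


if __name__ == "__main__":
    print(recommended_vcpus(4, 0.25))
    print(recommended_vcpus(4, 0.30))
```

For example, a 4-way virtual machine running at 25% utilization works out to one vCPU, while the same machine at 30% works out to two.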
If you own CapacityIQ it will make this calculation for you for every virtual machine in your data center. Here is a screenshot of its recommendations based on virtual machine CPU and memory utilization.