<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pivot Point &#187; cpu</title>
	<atom:link href="http://vpivot.com/tag/cpu/feed/" rel="self" type="application/rss+xml" />
	<link>http://vpivot.com</link>
	<description>Scott Drummonds on Virtualization</description>
	<lastBuildDate>Wed, 08 Sep 2010 08:37:56 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Designing VMs with Performance SLAs</title>
		<link>http://vpivot.com/2010/08/09/designing-vms-with-performance-slas/</link>
		<comments>http://vpivot.com/2010/08/09/designing-vms-with-performance-slas/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 13:56:50 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[benchmarking]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[netioc]]></category>
		<category><![CDATA[sioc]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=614</guid>
		<description><![CDATA[Consolidation amplifies the uncertainty of application performance.  Still, VI administrators need a means of guaranteeing performance SLAs to their applications&#8217; users.  But the best VMware has been able to offer are resource controls, which are at best an indirect mechanism for sustaining application performance.  With the acquisition of B-hive, now AppSpeed, VMware [...]]]></description>
			<content:encoded><![CDATA[<p>Consolidation amplifies the uncertainty of application performance.  Still, VI administrators need a means of guaranteeing performance SLAs to their applications&#8217; users.  But the best VMware has been able to offer are resource controls, which are at best an indirect mechanism for sustaining application performance.  With the acquisition of B-hive, now AppSpeed, VMware moved a step closer to allowing VI administrators to guarantee a performance SLA.  As an application-aware latency measurement tool, AppSpeed may eventually provide feedback to vCenter to guarantee throughput levels.  But it does not today.  So how are VI administrators to guarantee application performance?</p>
<p><span id="more-614"></span>It was during discussions with advanced VMware customers in Melbourne that a solution to this problem occurred to me.  I have reasoned it through and I think it holds water.  I have socialized it with more customers and my colleagues and we think it stands.  So I want to introduce a system for implementing virtual machines with a better assurance of a performance SLA.</p>
<p>The key to this process is that minimum performance can be measured using limits and that performance can be assured using reservations.  You can develop and document virtual machines with performance SLAs using the following procedure:</p>
<ul>
<li>First, as always, define a small number of strictly-sized virtual machines to be used by all applications in your environment.  Often these look something like small VMs of 1 vCPU and 4 GB RAM, medium VMs of 2 vCPUs and 8 GB of RAM, and large VMs of 4 vCPUs and 16 GB of RAM.  Tune these numbers for your environment, as needed.</li>
<li>For any application, benchmark its maximum performance against each of these virtual machine configurations on an unloaded system.  Choose an ISV-supplied benchmark or a well-known third-party tool.  This sets your high-water mark for throughput for each application in its virtual machine.</li>
<li>For each configuration, set a CPU limit at 50% of the available CPU and a memory limit of 50% of the available memory.  Retest the application against this smaller, limited configuration.</li>
<li>During the applications&#8217; deployment, change the limits to reservations.  That is, remove limits and set reservations equal to the limits&#8217; previous values, in this case 50%.</li>
<li>Your application now has a maximum performance defined in bullet two, and a &#8220;guaranteed&#8221; performance measured in bullet three.  This is your application&#8217;s performance SLA.</li>
</ul>
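<p>As a rough sketch of the bookkeeping in the procedure above, the following Python computes the 50% values that are applied first as limits during testing and then as reservations at deployment.  The size classes come from the first bullet; the 2,600 MHz per-core clock speed and every name in the code are assumptions made for illustration, and nothing here calls a VMware API.</p>
<pre>
```python
from dataclasses import dataclass

# Hypothetical per-core clock speed; substitute your hosts' real value.
CORE_MHZ = 2600


@dataclass(frozen=True)
class VMSize:
    name: str
    vcpus: int
    ram_gb: int


def half_allocation(size, fraction=0.5, core_mhz=CORE_MHZ):
    """Return (cpu_mhz, mem_mb) for the given fraction of a VM's resources.

    The same pair serves as the limit during the constrained benchmark
    and as the reservation once the application is deployed.
    """
    cpu_mhz = int(size.vcpus * core_mhz * fraction)
    mem_mb = int(size.ram_gb * 1024 * fraction)
    return cpu_mhz, mem_mb


# Size classes taken from the first bullet of the procedure.
SIZES = [VMSize("small", 1, 4), VMSize("medium", 2, 8), VMSize("large", 4, 16)]

for size in SIZES:
    cpu_mhz, mem_mb = half_allocation(size)
    print(f"{size.name}: {cpu_mhz} MHz CPU, {mem_mb} MB memory")
```
</pre>
<p>Recording the two values once per size class keeps the tested limit and the deployed reservation identical, which is what makes the measured &#8220;guaranteed&#8221; number meaningful.</p>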
<p>The concept is simple: limits can be used to measure the performance of an application in the presence of that degree of contention.  Reservations ensure that those resource amounts are always present.  Here are some notes on this process:</p>
<ul>
<li>This is not a true guarantee since network and storage throughput may drop.  No tool can eliminate this risk entirely but <a href="http://vpivot.com/2010/05/04/storage-io-control/">SIOC</a> and <a href="http://www.vmware.com/resources/techresources/10119">NetIOC</a> can reduce the risk of a network- or storage-induced performance failure.</li>
<li>The memory test is going to be highly dependent on the working set created by your load generation tool.  Your mileage will vary depending on your application owners&#8217; use of the virtual machine.</li>
<li>vCenter will guarantee that the reservations are always available through a process called admission control, which checks the cluster to ensure that enough CPU or memory is available to run the virtual machine immediately and in the event of a server failure.</li>
</ul>
<p>As I said above, this is not a true guarantee of application performance.  But it is as close as we can get until AppSpeed or a replacement evolves into universal application latency measurement that is fed into vCenter.  And this is another in a growing list of reasons  why <a href="http://vpivot.com/2010/03/31/memory-reservations-drive-over-commit/">CPU and memory reservations should be part of all VMware deployments</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/08/09/designing-vms-with-performance-slas/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>How Many Virtual CPUs Per VM?</title>
		<link>http://vpivot.com/2010/04/30/how-many-virtual-cpus-per-vm/</link>
		<comments>http://vpivot.com/2010/04/30/how-many-virtual-cpus-per-vm/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 04:22:42 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[esxtop]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[vcenter]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=403</guid>
		<description><![CDATA[Virtual machine sizing is a tricky issue for many VMware administrators.  It is important to find the right number of virtual CPUs to maximize application performance and minimize wasted CPU cycles.  The optimal number of vCPUs can never be easily identified.  But I can offer a few suggestions to help get this [...]]]></description>
			<content:encoded><![CDATA[<p>Virtual machine sizing is a tricky issue for many VMware administrators.  It is important to find the right number of virtual CPUs to maximize application performance and minimize wasted CPU cycles.  The optimal number of vCPUs can never be easily identified.  But I can offer a few suggestions to help get this number right.</p>
<p><span id="more-403"></span><br />
ESX must expend CPU cycles to maintain running virtual CPUs whether they are being used by an application or not.  This means that host efficiency drops as more vCPUs are put on the server.  But applications that scale well with CPUs will deliver greater performance when their virtual machines have been given more CPUs.  The administrator must therefore balance the desires of an individual application&#8217;s owner with the needs of the entire cluster of applications.</p>
<p>There are several resources that VI administrators can use to inform their decisions in virtual machine sizing.  I have listed some of them below.</p>
<h2>Bruce Herndon&#8217;s Cost-of-SMP Article</h2>
<p>Last summer the VMmark team&#8217;s Bruce Herndon published <a href="http://blogs.vmware.com/performance/2009/06/measuring-the-cost-of-smp-with-mixed-workloads.html">an article on the cost of SMP</a>.  I summarized his findings in <a href="http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/">a vPivot article I wrote on the ESX 4 scheduler</a>.  There are two key messages that you can take away from these posts to inform your decisions on virtual machine sizing:</p>
<ul>
<li>Over-sized virtual machines only hurt system performance when the server&#8217;s CPUs are saturated.  When utilization is low, unneeded vCPUs only penalize the system&#8217;s CPU utilization, not the applications&#8217; performance.</li>
<li>Unneeded 2-way virtual machines are not very harmful to the environment.  But administrators should be very careful with 4-way virtual machines and larger.</li>
</ul>
<h2>Co-stop and Ready Time</h2>
<p>Ready time indicates a vCPU waiting for an available core when it has work to perform.  Co-scheduling stop time (or co-stop time) indicates a vCPU being paused by the scheduler to allow its sibling vCPUs to catch up.  These two counters can help administrators recognize a certain kind of stress due to limited CPU resources.</p>
<p>Ready time is generally a sign of the unavailability of CPU.  Correction usually requires the administrator to reduce work on the host (migrating virtual machines, decreasing vCPU count, etc.) or to increase CPU capacity (more hosts or faster CPUs).  Co-stop time is a sign that the scheduler is allowing vCPUs to develop skew while it runs portions of virtual machines on available cores.  Values worth investigating for these counters are 10% ready time and 3% co-stop time.  There is no guarantee that application performance is suffering if these thresholds are crossed, but a problem may be present.</p>
<p>The important thing about ready time and co-stop time is that they are signs that you are using all of the CPU you have available to you.  This could be a Good Thing.  But it could also be a surprise to you.  When these counters get high it is a good time to start asking yourself if your capacity usage meets your expectations.  If not, you should inspect your virtual machines to be sure that the applications are using the vCPUs you have given them.  If your guest tools show poor in-guest utilization then decrease those VM sizes.  That will free up resources in the cluster for more virtual machines.</p>
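<p>A minimal sketch of how the two thresholds mentioned above might be applied to collected counter values.  The sample data and function name are hypothetical, and how you export ready and co-stop percentages from your own tooling is left open.</p>
<pre>
```python
# Thresholds quoted in the text above; treat them as prompts to investigate,
# not as proof of a problem.
READY_THRESHOLD_PCT = 10.0
COSTOP_THRESHOLD_PCT = 3.0


def cpu_pressure_flags(ready_pct, costop_pct):
    """Return the names of the counters that meet or exceed their threshold."""
    flags = []
    if ready_pct >= READY_THRESHOLD_PCT:
        flags.append("ready")
    if costop_pct >= COSTOP_THRESHOLD_PCT:
        flags.append("co-stop")
    return flags


# Hypothetical per-VM samples: (name, ready %, co-stop %).
samples = [("vm-a", 2.5, 0.1), ("vm-b", 12.0, 0.4), ("vm-c", 11.0, 4.2)]

for name, ready, costop in samples:
    flags = cpu_pressure_flags(ready, costop)
    print(name, "ok" if not flags else "check: " + ", ".join(flags))
```
</pre>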
<h2>Application Scalability Information</h2>
<p>I wish we lived in a world where every ISV published data showing their applications&#8217; abilities to scale with cores.  Unfortunately for us, many software vendors have for years allowed their customers to assume that each doubling of cores would double the performance of the application.  VMware has chosen to provide some scalability information so our customers know <a href="http://www.vmware.com/pdf/Perf_ESX40_Oracle-eval.pdf">how well</a> or <a href="http://www.vmware.com/files/pdf/consolidating_webapps_vi3_wp.pdf">how poorly</a> applications scale.  But every customer of a software company deserves to have the vendor provide guidance on sizing the server.  And those vendors deserve the right to put these results out on their own products.  Go talk to your ISV to get the information you need to size your virtual machines.</p>
<h2>CPU Usage Calculations and CapacityIQ</h2>
<p>I am belatedly updating this post with a fourth way of identifying oversized virtual machines: mathematical calculation or CapacityIQ.</p>
<p>When a virtual machine consistently uses only a fraction of its vCPU resources it is possible that the virtual machine can be downsized and still deliver the same application performance.  The calculation to determine this is simple: multiply the vCPU count by utilization and round up.  Set the virtual machine&#8217;s vCPU count to the result of that calculation.</p>
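<p>The calculation is small enough to sketch in a few lines of Python.  The function name is made up for this example, and which utilization figure to feed in (a long-term average, a peak, a percentile) is a judgment call that the rule itself does not settle.</p>
<pre>
```python
import math


def suggested_vcpus(vcpus, utilization):
    """Multiply the vCPU count by utilization and round up.

    utilization is assumed to be a fraction between 0.0 and 1.0.
    The result is clamped to at least 1 because a VM needs one vCPU.
    """
    return max(1, math.ceil(vcpus * utilization))


# A 4-way VM that consistently runs at 30% needs 1.2 vCPUs, so round up to 2.
print(suggested_vcpus(4, 0.30))
# An 8-way VM at 40% needs 3.2 vCPUs, so round up to 4.
print(suggested_vcpus(8, 0.40))
```
</pre>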
<p>If you own CapacityIQ it will make this calculation for you for every virtual machine in your data center.  Here is a screenshot of its recommendations based on virtual machine CPU and memory utilization.  Click for a clearer picture.</p>
<div id="attachment_512" class="wp-caption alignnone" style="width: 310px"><a href="http://vpivot.com/wp-content/uploads/2010/04/capiq_vm_size_recs.png"><img src="http://vpivot.com/wp-content/uploads/2010/04/capiq_vm_size_recs-300x102.png" alt="" title="Capacity IQ Recommending VM Resize" width="300" class="size-medium wp-image-512" /></a><p class="wp-caption-text">CapacityIQ monitors CPU and memory utilization to recommend VM downsizing.</p></div>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/04/30/how-many-virtual-cpus-per-vm/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Processor Utilization Calculations</title>
		<link>http://vpivot.com/2010/04/09/processor-utilization-calculations/</link>
		<comments>http://vpivot.com/2010/04/09/processor-utilization-calculations/#comments</comments>
		<pubDate>Fri, 09 Apr 2010 20:10:44 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[esxtop]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=387</guid>
		<description><![CDATA[A little Friday esxtop trivia for the performance massive: did you ever notice your Hyper-Threaded systems have three rows showing CPU utilization in the CPU panel header?  They are labeled &#8220;PCPU USED(%)&#8221;, &#8220;PCPU UTIL(%)&#8221;, and &#8220;CORE UTIL(%)&#8221;.  Here is a screen shot to jog your memory:

This capture shows utilization of each physical and logical processor [...]]]></description>
			<content:encoded><![CDATA[<p>A little Friday esxtop trivia for the performance massive: did you ever notice your Hyper-Threaded systems have three rows showing CPU utilization in the CPU panel header?  They are labeled &#8220;PCPU USED(%)&#8221;, &#8220;PCPU UTIL(%)&#8221;, and &#8220;CORE UTIL(%)&#8221;.  Here is a screen shot to jog your memory:</p>
<p><span id="more-387"></span></p>
<div id="attachment_388" class="wp-caption alignnone" style="width: 568px"><a href="http://vpivot.files.wordpress.com/2010/04/esxtop.png"><img class="size-full wp-image-388" title="esxtop Screen Shot" src="http://vpivot.files.wordpress.com/2010/04/esxtop.png" alt="esxtop Screen Shot" width="558" height="343" /></a><p class="wp-caption-text">esxtop shows three processor utilization rows.  What do they mean?</p></div>
<p>This capture shows utilization of each physical and logical processor core in the system.  The first row, PCPU USED, provides the percent of each physical core used by the logical core, multiplied  by turbo mode, a processor feature that temporarily increases the core&#8217;s internal clock frequency.  This means two threads running at full, turbo speed might produce a number like 55% for each entry.  It also means that one thread can drive its logical CPU to 100% only if the logical core&#8217;s sibling is unused.  The second row is a straightforward calculation of the utilization of each logical core and the third row similarly shows utilization of physical cores.</p>
<p>Where things get really confusing is when these results are combined into three system-wide, aggregate utilization numbers, as seen in esxtop&#8217;s batch printout.  The three utilization types above generate different utilization numbers.  Unfortunately esxtop&#8217;s batch mode labels these counters slightly differently.  But this table includes both names:</p>
<table id="newspaper-a">
<tbody>
<tr>
<th>esxtop Interactive</th>
<th>esxtop Batch Output (and <a href="http://vpivot.com/2009/10/21/esxtop-analysis-with-esxplot/">esxplot</a>)</th>
<th>Description</th>
<th>Single Core Example</th>
</tr>
<tr>
<td>PCPU USED(%)</td>
<td>% Processor Time</td>
<td>The average of each hardware thread&#8217;s use of the physical core multiplied by turbo mode.</td>
<td>One thread running fully: 108%, two threads running fully: 50%</td>
</tr>
<tr>
<td>PCPU UTIL(%)</td>
<td>% Util Time</td>
<td>Percent utilization of logical cores.</td>
<td>One thread running fully: 50%, two threads running fully: 100%</td>
</tr>
<tr>
<td>CORE UTIL(%)</td>
<td>% Core Util Time</td>
<td>Utilization of the physical core.</td>
<td>One thread running fully: 100%, two threads running fully: 100%</td>
</tr>
</tbody>
</table>
<p>The &#8220;Single Core Example&#8221; column provides an example calculation based on threads running as fast as possible on a single Hyper-Threaded physical core.  There are some interesting observations on these calculations:</p>
<ul>
<li>Even a great number of threads running full bore on an HT system will not produce a PCPU USED(%) number much over 50%.</li>
<li>Two running threads will produce a lower PCPU USED(%) than one running thread.  This is because each thread&#8217;s utilization is calculated against the physical core.  With two running, each is averaging 50% of the core.  But a single thread that is not sharing the core can drive this to 100%.  In both cases the actual number could be a little higher if turbo mode is on.</li>
<li>You must have at least two threads&#8211;one on each logical core&#8211;to drive PCPU UTIL(%) to 100%.</li>
<li>CORE UTIL(%) can be driven to 100% with only one thread per physical core.</li>
</ul>
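<p>The two simpler counters can be reproduced for a single Hyper-Threaded core with a short sketch.  This is a toy model written for illustration and not esxtop&#8217;s implementation; the function names are invented, and PCPU USED(%) is deliberately left out because it depends on turbo-mode scaling that this model does not capture.</p>
<pre>
```python
def pcpu_util(logical_busy):
    """Percent utilization averaged over the core's logical CPUs."""
    return 100.0 * sum(logical_busy) / len(logical_busy)


def core_util(logical_busy):
    """Percent utilization of the physical core: busy if any thread runs."""
    return 100.0 if any(logical_busy) else 0.0


# One Hyper-Threaded core has two logical CPUs; True means a thread is
# running as fast as possible on that logical CPU.
one_thread = [True, False]
two_threads = [True, True]

for label, state in (("one thread", one_thread), ("two threads", two_threads)):
    print(f"{label}: PCPU UTIL {pcpu_util(state):.0f}%, CORE UTIL {core_util(state):.0f}%")
```
</pre>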
<p>Look for this content to be rolled into our <a href="http://communities.vmware.com/docs/DOC-9279">esxtop documentation</a> soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/04/09/processor-utilization-calculations/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hyper-V&#039;s Lack of Memory Over-commit</title>
		<link>http://vpivot.com/2010/04/01/hyper-vs-lack-of-memory-over-commit/</link>
		<comments>http://vpivot.com/2010/04/01/hyper-vs-lack-of-memory-over-commit/#comments</comments>
		<pubDate>Thu, 01 Apr 2010 17:52:38 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[hyper-v]]></category>
		<category><![CDATA[memory]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=371</guid>
		<description><![CDATA[I find it interesting that one day after I wrote about memory over-commitment in vSphere, Greg Shields wrote about the lack of memory over-commitment in Hyper-V.  In today&#8217;s short blog entry, I want provide one paragraph that Greg&#8217;s article currently lacks:
While memory over-subscription is a critical feature for production environments, balancing the demands of heterogenous [...]]]></description>
			<content:encoded><![CDATA[<p>I find it interesting that one day after I wrote about <a href="http://vpivot.com/2010/03/31/memory-reservations-drive-over-commit/">memory over-commitment in vSphere</a>, Greg Shields wrote about <a href="http://virtualizationreview.com/articles/2010/04/01/hypervs-missing-feature.aspx">the lack of memory over-commitment in Hyper-V</a>.  In today&#8217;s short blog entry, I want to provide one paragraph that Greg&#8217;s article currently lacks:</p>
<blockquote><p>While memory over-subscription is a critical feature for production environments, balancing the demands of heterogeneous applications in a resource-starved environment is difficult.  Without guidance from administrators on the relative importance of the virtual machines running these applications, a hypervisor will be forced to make arbitrary decisions in assigning limited resources.  Effective use of over-commitment requires a sound resource control system.  The only product on the market that does this well is VMware vSphere.</p></blockquote>
<p>Greg&#8217;s article and mine discussed only memory over-commitment, but the same rules apply to CPU over-commitment, too.  Microsoft will realize how important resource controls are somewhere between year two and five of their product&#8217;s life.  I can only imagine where vSphere will be by then.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/04/01/hyper-vs-lack-of-memory-over-commit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hyper-Threading on vSphere</title>
		<link>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/</link>
		<comments>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/#comments</comments>
		<pubDate>Sat, 06 Mar 2010 18:05:38 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[hyper-threading]]></category>
		<category><![CDATA[intel]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmmark]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=328</guid>
		<description><![CDATA[I continue to receive many questions from our customers on the expected performance gains of the new version of Hyper-Threading in Intel&#8217;s Core i7 processors.  The answer requires a little bit of discussion on Hyper-Threading, a little bit on ESX, and comes with some performance data.  If you are still interested, read on.
On [...]]]></description>
			<content:encoded><![CDATA[<p>I continue to receive many questions from our customers on the expected performance gains of the new version of Hyper-Threading in Intel&#8217;s Core i7 processors.  The answer requires a little bit of discussion on Hyper-Threading, a little bit on ESX, and comes with some performance data.  If you are still interested, read on.</p>
<p><span id="more-328"></span>On VI3, many of VMware&#8217;s customers disabled Hyper-Threading on their older, Netburst architecture Intel processors.  Intel has vaguely described the new Hyper-Threading as more efficient than the previous generation and I believe this to be due to a shorter pipeline and an improved ability to context switch pipeline stage data.  Long pipelines&#8211;such as the Netburst era Xeons of model numbers x1xx and x2xx&#8211;are more likely to suffer bubbles during context switches and are therefore penalized versus shorter pipeline products, such as the Core i7.  Furthermore, by pushing and restoring pipeline stage data during a hardware context switch, the new HT can reduce pipeline bubbles.</p>
<p>But the gains vSphere users experience as a result of the new Hyper-Threading also come from changes in ESX.  ESX&#8217;s scheduler must make decisions as to when to co-locate two worlds on a physical core to take advantage of Hyper-Threading.  In some conditions the scheduler will perform this co-location and in others it will allow a world to run on the core by itself.  The decision to execute worlds concurrently instead of serially on a physical core can be informally called the scheduler&#8217;s <em>trust</em> of Hyper-Threading.  The vSphere scheduler <em>trusts</em> Hyper-Threading more than the VI3 scheduler did.  This amplifies the effect of HT.</p>
<p>I am now going to bore you with a disclaimer before I give you any data showing the effect of Hyper-Threading.  The value of HT will vary from workload to workload and the ultimate authority on HT&#8217;s value is the end-user.  The following numbers are the result of informal analysis at VMware and should only be used as a guide in your own analysis.  Please do not make purchasing decisions on this information, which is devoid of the detail we would normally commit to a white paper.</p>
<table id="newspaper-a">
<tbody>
<tr>
<th>Workload</th>
<th>Observed Throughput Gain Due to HT</th>
</tr>
<tr>
<td>VMmark</td>
<td>24%</td>
</tr>
<tr>
<td>SPECjbb</td>
<td>10%</td>
</tr>
<tr>
<td>Consolidated SQL</td>
<td>19%</td>
</tr>
</tbody>
</table>
<p>In addition to the gains we informally cite here, I can say that we have not yet seen a workload where the new Hyper-Threading slows down consolidated performance.  As far as we can tell, the new Hyper-Threading should be left enabled in 100% of virtualized environments.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Four Things You Should Know About ESX 4&#039;s Scheduler</title>
		<link>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/</link>
		<comments>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 06:00:18 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmmark]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=11</guid>
		<description><![CDATA[[This is the last re-post of old community content.  But this content is important enough to be worth a re-post.]
I spend a great deal of time answering customers&#8217; questions about the scheduler.  Never have so many questions been asked about such an abstruse component for which so little user influence is possible.  But [...]]]></description>
			<content:encoded><![CDATA[<p><em>[This is the last <a href="http://communities.vmware.com/blogs/drummonds/2009/08/21/four-things-you-should-know-about-esx-4s-scheduler">re-post of old community content</a>.  But this content is important enough to be worth a re-post.]</em></p>
<p>I spend a great deal of time answering customers&#8217; questions about the scheduler.  Never have so many questions been asked about such an abstruse component for which so little user influence is possible.  But CPU scheduling is central to system performance, so VMware strives to provide as much information on the subject as possible.  In this blog entry, I want to point out a few nuggets of information on the CPU scheduler.  These four bullets answer 95% of the questions I get asked.</p>
<p><span id="more-11"></span></p>
<h2>Item 1: ESX 4&#8217;s Scheduler Better Uses Caches Across Sockets</h2>
<p>On UMA systems at low load levels, virtual machine performance improves when each virtual CPU (vCPU) is placed on its own socket.  This is because providing each vCPU its own socket also gives it the entire cache on that CPU.  On page 18 of a <a class="jive-link-external" href="http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf">recent paper on the scheduler written by Seongbeom Kim</a>, a graph highlights the case where vCPU spreading improves performance.</p>
<p><img class="jive-image-thumbnail jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6674/Picture+2.png" alt="Picture 2.png" width="620" /></p>
<p>The X-axis represents different combinations of VM and vCPU counts.  SPECjbb is memory intensive and shows great gains with increases in CPU cache.  The few cases that show dramatic benefit due to the ESX 4.0 scheduler are benefiting from the distribution of vCPUs across sockets.  Very large gains are possible in this somewhat uncommon case.</p>
<h2>Item 2: Overuse of SMP Only Slows Consolidated Environments At Saturation</h2>
<p>For years customers have asked me how many vCPUs they should give to their VMs.  The best guidance, &#8220;as few as possible&#8221;, seems too vague to satisfy.  It remains the only correct answer, unfortunately.  But <a class="jive-link-external" href="http://blogs.vmware.com/performance/2009/06/measuring-the-cost-of-smp-with-mixed-workloads.html">a recent experiment performed by Bruce Herndon&#8217;s team</a> sheds some light on this VM sizing question.</p>
<p>In this experiment we ran VMmark against VMs that were configured outside of VMmark specifications.  In one case some of the virtual machines were given too few vCPUs and in another they were given too many.  Because VMmark&#8217;s workload is fixed, increasing the VMs&#8217; sizes does not increase the work performed by the VMs.  In other words, the system&#8217;s score does not depend on the VMs&#8217; vCPU count.  Until CPU saturation, that is.</p>
<p><img class="jive-image-thumbnail jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6675/Picture+3.png" alt="Picture 3.png" width="620" /></p>
<p>Notice that the scores are similar between the undersized, right-sized, and over-sized VMs.  Up until tile 10 (60 VMs) they are nearly identical.  There is a slight difference in processor utilization that begins to impact throughput (score) as the system runs out of CPU.  At that point the additional vCPUs waste cycles, which degrades system performance.  Two points I will call out from this work:</p>
<ul>
<li>Sloppy VI admins who provide too many vCPUs need not worry about performance when their servers are under low load.  But performance will suffer when CPU utilization spikes.</li>
<li>The penalty of over-sizing VMs gets worse as VMs get larger.  Using a 2-way VM is not that bad, but unneeded use of 4-way VMs when one or two processors suffice can cost up to 15% of your system throughput.  I presume that unnecessarily assigning eight vCPUs would be criminal.</li>
</ul>
<h2>Item 3: ESX Has Not Strictly Co-scheduled Since ESX 2.5</h2>
<p>I have documented ESX&#8217;s relaxation of co-scheduling previously (<a class="jive-link-wiki" href="http://communities.vmware.com/docs/DOC-4960">Co-scheduling SMP VMs in VMware ESX Server</a>).  But this statement cannot be repeated too frequently: ESX has not strictly co-scheduled virtual machines since version 2.5.  This means that ESX can place vCPUs from SMP VMs individually.  It is not necessary to wait for physical cores to be available for every vCPU before starting the VM.  However, as Item 2 pointed out, this does not give you free license to over-size your VMs.  Be frugal with your SMP VMs and assign vCPUs only when you need them.</p>
<h2>Item 4: The Cell Construct Has Been Eliminated in ESX 4.0</h2>
<p>In the performance best practices deck that I give at conferences I talk about the benefits of creating small virtual machines over large ones.  In versions of ESX up to ESX 3.5, the scheduler used a construct called a cell that would contain and lock CPU cores.  The vCPUs from a single VM could never span a cell.  With ESX 3.x&#8217;s cell size of four this meant that VMs never spanned multiple four-core sockets.  Consider this figure:</p>
<p><img class="jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png" alt="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png" /></p>
<p>What this figure shows is that a 4-way VM on ESX 3.5 can only be placed in two locations on this hypothetical two-socket configuration.  There are 12 combinations for a 2-way VM and eight for a uniprocessor VM.  The scheduler has more opportunities to optimize VM placement when you provide it with smaller VMs.</p>
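<p>The placement counts above follow from simple combinatorics, which this sketch reproduces.  It assumes the hypothetical host described in the text (two cells of four cores each) and that every vCPU of a VM must land on a distinct core within a single cell; the function name is invented for the example.</p>
<pre>
```python
from math import comb  # available in Python 3.8 and later

CELL_SIZE = 4  # cores per cell in ESX 3.x, per the text
CELLS = 2      # the hypothetical two-socket host, one cell per socket


def placements(vcpus, cell_size=CELL_SIZE, cells=CELLS):
    """Count the ways to place a VM's vCPUs on distinct cores in one cell."""
    return cells * comb(cell_size, vcpus)


for vcpus in (4, 2, 1):
    print(f"{vcpus}-vCPU VM: {placements(vcpus)} placements")
```
</pre>
<p>The counts come out to 2, 12, and 8 for 4-way, 2-way, and uniprocessor VMs respectively, matching the figure.</p>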
<p>In ESX 4 we have eliminated the cell lock so VMs can span multiple sockets, as item one states.  Continue to think of this placement problem as a challenge to the scheduler that you can alleviate.  By choosing multiple, smaller VMs you free the scheduler to pursue opportunities to optimize performance in consolidated environments.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Newer Processors and Virtualization Performance</title>
		<link>http://vpivot.com/2009/09/16/newer-processors-and-virtualization-performance/</link>
		<comments>http://vpivot.com/2009/09/16/newer-processors-and-virtualization-performance/#comments</comments>
		<pubDate>Wed, 16 Sep 2009 20:08:33 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[amd]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[ept]]></category>
		<category><![CDATA[intel]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[monitor]]></category>
		<category><![CDATA[rvi]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[vmkernel]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=18</guid>
		<description><![CDATA[[New content has been added to this update of an old article from the performance community.]
Newer processors are much more important to virtualized environments than to their non-virtualized counterparts. Generational improvements have not just increased raw compute power; they have also reduced virtualization overheads.  This blog entry will describe three key changes [...]]]></description>
			<content:encoded><![CDATA[<p><em>[New content has been added to this update of an <a href="http://communities.vmware.com/blogs/drummonds/2009/06/02/newer-processors-and-virtualization-performance">old article from the performance community</a>.]</em></p>
<p>Newer processors are much more important to virtualized environments than to their non-virtualized counterparts. Generational improvements have not just increased raw compute power; they have also reduced virtualization overheads.  This blog entry will describe three key changes that have particularly affected virtual performance.</p>
<h2><span id="more-18"></span>Hardware Assist Is Faster</h2>
<p>In 2008, with the launch of the Opteron 1300, 2300 and 8300 parts, AMD became the first CPU vendor to produce a hardware memory management unit equipped to support virtualization.  They called this technology Rapid Virtualization Indexing (RVI).  This year Intel did the same with Extended Page Tables (EPT) on its Xeon 5500 line.  Both vendors have been providing the ability to virtualize privileged instructions since 2006, with continually improving results.  Consider the following graph showing the latency of one key instruction from Intel:</p>
<p><img class="jive-image-thumbnail jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-3171-5926/vmexit_latencies.png" alt="vmexit_latencies.png" width="620" /></p>
<p>This instruction, VMEXIT, is called each time the guest exits to the kernel.  The graph shows the latency (delay) in completing this instruction, which represents a wait time incurred by the guest.  Clearly, Intel has made great strides in reducing VMEXIT&#8217;s wait time from its NetBurst parts (Prescott and Cedar Mill) to its Core architecture (Merom and Penryn) and on to its current generation, Core i7 (Nehalem).  AMD processors have shown commensurate gains with AMD-V.</p>
<p>In a recent <a href="http://www.vmware.com/files/pdf/perf_vsphere_sql_scalability.pdf">white paper detailing SQL Server on vSphere</a>, the following graph showed the gains derived by using AMD-V in the Opteron 8324 (Shanghai).</p>
<div id="attachment_33" class="wp-caption alignnone" style="width: 609px"><img class="size-full wp-image-33" title="Monitor Mode and SQL Server Performance" src="http://vpivot.files.wordpress.com/2009/06/picture-3.png" alt="Binary translation, AMD-V, and AMD-V plus RVI are measured using SQL Server." width="599" height="343" /><p class="wp-caption-text">Binary translation, AMD-V, and AMD-V plus RVI are measured using SQL Server.</p></div>
<p>This graph shows the practical value of the great gains that CPU manufacturers have made with virtualization assist.  Hardware assist can now be regularly relied upon for great performance.</p>
<h2>Pipelines Are Shorter</h2>
<p>The longest pipelines in the x86 world were in Intel&#8217;s NetBurst processors.  These processors&#8217; pipelines had twice as many stages as their counterparts at AMD and twice as many as the generation of Intel CPUs that followed.  The increased pipeline length would have enabled support for 8 GHz silicon, had it arrived.  Instead, silicon switching speeds hit a wall at 4 GHz and Intel (and its customers) were forced to suffer the drawbacks of long pipelines.</p>
<p>Large pipelines are not necessarily a problem for desktop environments, where single threaded applications used to dominate the market.  But in the enterprise, application thread counts were larger.  Furthermore, consolidation in virtual environments drove thread counts even higher.  With more contexts in the processor, the number of pipeline stalls and flushes increased, and efficiency fell.</p>
<p>Because of the decreased efficiency of consolidated workloads on processors with long pipelines, VMware has often recommended that performance-intensive VMs be run on processors no older than 2-3 years.  This excludes Intel&#8217;s NetBurst parts.  VI3 and vSphere will do a fine job of virtualizing your less-demanding applications on any supported processor.  But you should use newer parts for the applications for which you hold your highest performance expectations.</p>
<h2>Caches Are Larger</h2>
<p>A cache is highly effective when it fully contains the software&#8217;s working set.  The addition of even a small amount of code from the hypervisor will change the working set and reduce the cache hit rate.  I&#8217;ve attempted to illustrate this concept with the following simplified view of the relationship between cache hit rates, application working set, and cache sizes:</p>
<div id="attachment_34" class="wp-caption alignnone" style="width: 610px"><img class="size-full wp-image-34" title="Cache Size, Working Set, and Performance" src="http://vpivot.files.wordpress.com/2009/06/cache_size_perf.png" alt="Performance drops with small cache systems for even small increases to working set size." width="600" height="400" /><p class="wp-caption-text">Performance drops with small cache systems for even small increases to working set size.</p></div>
<p>This graph is based on a model that greatly simplifies working sets and the hypervisor&#8217;s impact on them.  Assuming that ESX increases the working set by 256 KB, this graph shows the decrease in cache hit rate due to the contributions of the hypervisor.  Notice that with very small caches and very small application working sets, the cache hit rate suffers greatly due to the addition of even 256 KB of virtualization code.  And even up to 2 MB, a 10% decrease in cache hit rate can be seen in some applications.  With a 256 KB contribution by the kernel, cache hit rates do not change significantly with cache sizes of 4 MB and beyond.</p>
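<p>To make the shape of that curve concrete, here is a minimal sketch of one possible toy model.  It is an assumption for illustration, not necessarily the model behind the graph: accesses are spread uniformly over the working set, so the hit rate is simply the fraction of the working set that fits in the cache.  The function names are hypothetical; the 256 KB hypervisor contribution comes from the text.</p>

```python
HYPERVISOR_KB = 256  # working-set growth assumed in the text

def hit_rate(cache_kb, working_set_kb):
    """Toy model: uniform accesses, so hit rate is the cached fraction of the working set."""
    return min(1.0, cache_kb / working_set_kb)

def hit_rate_drop(cache_kb, app_working_set_kb):
    """Hit rate lost when the hypervisor adds HYPERVISOR_KB to the working set."""
    native = hit_rate(cache_kb, app_working_set_kb)
    virtual = hit_rate(cache_kb, app_working_set_kb + HYPERVISOR_KB)
    return native - virtual

for cache_kb in (512, 1024, 2048, 4096, 8192):
    # Worst case for this model: the application exactly fills the cache.
    print(cache_kb, round(hit_rate_drop(cache_kb, cache_kb), 3))
```

<p>Under these assumptions the worst-case loss is about a third of the hit rate with a 512 KB cache, roughly 11 points at 2 MB, and about 6 points at 4 MB, and it keeps shrinking as the cache grows.  An application whose working set leaves at least 256 KB of the cache free loses nothing at all in this model.</p>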
<p>In some cases a 10% improvement in cache hit rate can double application throughput.  This means that a doubling of cache size can profoundly affect the performance of virtual applications as compared to native.  Given ESX&#8217;s small contribution to the working set, you can see why we at VMware recommend that customers run their performance-intensive workloads on CPUs with 4 MB caches or larger.</p>
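<p>The claim that ten points of hit rate can double throughput follows from the standard average memory access time formula.  The sketch below uses hypothetical costs (a 1-cycle hit and a 200-cycle miss penalty) and assumes a workload bound entirely by memory access time, so treat the numbers as illustrative only.</p>

```python
def avg_access_cycles(hit_rate, hit_cycles=1.0, miss_penalty_cycles=200.0):
    """Average memory access time: every access pays the hit cost, misses also pay the penalty."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

def speedup(old_hit_rate, new_hit_rate, **costs):
    """Throughput ratio for a hypothetical workload bound entirely by memory access time."""
    return avg_access_cycles(old_hit_rate, **costs) / avg_access_cycles(new_hit_rate, **costs)

# Moving from an 80% to a 90% hit rate halves the number of misses.
print(round(speedup(0.80, 0.90), 2))
```

<p>With these assumed costs the ratio comes out just under 2x, because halving the miss rate nearly halves the average access time.  Real workloads overlap memory stalls with other work, so the gain is usually smaller.</p>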
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2009/09/16/newer-processors-and-virtualization-performance/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
