Scott Drummonds on Virtualization

PVSCSI and Low IO Workloads


Scott Sauer recently asked me a tough question on Twitter.  My roaming best practices talk includes the phrase “do not use PVSCSI for low-IO workloads”.  When Scott saw a VMware KB echoing my recommendation, he asked the obvious question: “Why?”  It took me a couple of days to get a sufficient answer.

One technique for improving storage driver efficiency is interrupt coalescing.  Coalescing can be thought of as buffering: multiple events are queued for simultaneous processing.  For coalescing to improve efficiency, interrupts must stream in fast enough to create large batches.  Otherwise the timeout window passes with no additional interrupts arriving, and the single interrupt is handled as normal but only after a useless delay.
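To make the timeout behavior concrete, here is a toy simulation of a coalescing buffer. The batch size and timeout values are invented for illustration; real drivers tune these internally.

```python
# Toy model of interrupt coalescing -- illustrative numbers, not driver code.
# Interrupts are buffered until either BATCH_SIZE of them arrive or
# TIMEOUT_US elapses.  At low arrival rates the window expires with a single
# interrupt in the buffer, so it is delivered late for no batching gain.

BATCH_SIZE = 4        # deliver once this many interrupts are buffered
TIMEOUT_US = 100      # ...or once this much time has passed

def coalesce(arrival_times_us):
    """Return (delivery_time, batch) pairs for a stream of interrupts."""
    deliveries, batch, window_start = [], [], None
    for t in arrival_times_us:
        # Flush a stale window before admitting the new interrupt.
        if batch and t - window_start >= TIMEOUT_US:
            deliveries.append((window_start + TIMEOUT_US, batch))
            batch, window_start = [], None
        if not batch:
            window_start = t
        batch.append(t)
        if len(batch) == BATCH_SIZE:
            deliveries.append((t, batch))
            batch, window_start = [], None
    if batch:  # leftover interrupts wait out the full timeout
        deliveries.append((window_start + TIMEOUT_US, batch))
    return deliveries

# High rate: an interrupt every 10 us -> one full batch, negligible delay.
print(coalesce([0, 10, 20, 30]))    # one batch of 4, delivered at t=30
# Low rate: an interrupt every 500 us -> each waits out the timeout alone.
print(coalesce([0, 500, 1000]))     # three batches of 1, each 100 us late
```

The low-rate case is exactly the pathology described above: nothing is saved, and every IO completion is delayed by the full timeout window.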

An intelligent storage driver will therefore coalesce at high IO but not low IO.  In the years we have spent optimizing ESX’s LSI Logic virtual storage adapter, we have fine-tuned the coalescing behavior to give fantastic performance on all workloads.  This is done by tracking two key storage counters:

  • Outstanding IOs (OIOs): Represents the virtual machine’s demand for IO.
  • IOs per second (IOPS):  Represents the storage system’s supply of IO.

The robust LSI Logic driver increases coalescing as OIOs and IOPS increase.  No coalescing is used with few OIOs or low throughput.  This produces efficient IO at large throughput and low latency IO when throughput is small.

Currently the PVSCSI driver coalesces based on OIOs only, and not throughput.  This means that when the virtual machine is requesting a lot of IO but the storage is not delivering, the PVSCSI driver is coalescing interrupts.  But without the storage supplying a steady stream of IOs there are no interrupts to coalesce.  The result is a slightly increased latency with little or no efficiency gain for PVSCSI in low throughput environments.
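The difference between the two policies can be sketched in a few lines. The thresholds below are illustrative guesses, not VMware's actual tuning parameters:

```python
# Sketch contrasting the two coalescing policies described above.
# Threshold values are hypothetical, chosen only to illustrate the logic.

OIO_THRESHOLD = 4       # "enough demand" to justify batching
IOPS_THRESHOLD = 2000   # "enough supply" for batches to actually fill

def lsi_style_coalesce(oio, iops):
    # LSI Logic behavior: require both demand (OIOs) and supply (IOPS)
    # before delaying interrupts.
    return oio > OIO_THRESHOLD and iops > IOPS_THRESHOLD

def pvscsi_style_coalesce(oio, iops):
    # Original PVSCSI behavior: demand alone triggers coalescing, even
    # when the array cannot supply enough interrupts to batch.
    return oio > OIO_THRESHOLD

# Demanding VM on slow storage: high OIOs, low IOPS.
oio, iops = 32, 300
print(lsi_style_coalesce(oio, iops))     # False - no pointless delay
print(pvscsi_style_coalesce(oio, iops))  # True  - latency added for nothing
```

The second policy is the one that bites in low-throughput environments: the driver keeps waiting for companion interrupts that slow storage will never deliver in time.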

LSI Logic is so efficient at low throughput levels that there is no need for a special device driver to improve efficiency.  The CPU utilization difference between LSI and PVSCSI at hundreds of IOPS is insignificant.  But at massive amounts of IO, where 10-50K IOPS are streaming over the virtual SCSI bus, PVSCSI can save a large number of CPU cycles.  Because of that, our first implementation of PVSCSI was built on the assumption that customers would only use the technology when they were backing their virtual machines with world-class storage.

But VMware’s marketing engine (me, really) started telling everyone about PVSCSI without the right caveat (“only for massive IO systems!”).  So everyone started using it as a general solution.  This meant that in one condition, slow storage (low IOPS) paired with a demanding virtual machine (high OIOs), PVSCSI has been inefficiently coalescing IOs, resulting in performance slightly worse than LSI Logic.

But now VMware’s customers want PVSCSI as a general solution and not just for high IO workloads.  As a result we are including advanced coalescing behavior in PVSCSI for future versions of ESX.  More on that when the release vehicle is set.

PVSCSI In A Nutshell

If you plodded through the above technical explanation of interrupt coalescing and PVSCSI I applaud you. If you just want a summary of what to do, here it is:

  • For existing products, only use PVSCSI against VMDKs that are backed by fast (greater than 2,000 IOPS) storage.
  • If you have installed PVSCSI in low IO environments, do not worry about reconfiguring to LSI Logic.  The net loss of performance is very small.  And clearly these low IO virtual machines are not running your performance-critical applications.
  • For future products*, PVSCSI will be as efficient as LSI Logic for all environments.

(*) Specific product versions not yet announced.

Update: February 16

The simple, almost austere KB on this rare occurrence raised more questions than it answered.  You may notice that the KB has been updated with text from this blog since the blog’s original publication.  A white paper on PVSCSI that had been in the works for quite some time was also released, along with the VROOM! article we often pair with such white papers.

27 Responses

Hi Scott, knowing that interrupt coalescing also exists for paravirtualized network cards, does the new VMXNET3 have the same ‘problem’? I mean that the driver was designed for 10GbE infrastructure, with techniques like LRO (Large Receive Offload) to achieve higher throughput, but is it also the preferred card for gigabit networks?

  • […] A technical explanation can be found at the Pivot Point blog […]

  • […] 04/02/2010: Scott explains why this driver should ONLY be used for VMs with disk I/O […]

  • Thanks for the spotlight Scott !

  • […] read on Scott Lowe’s that in a future release, the PVSCSI adapter will be good for both low-io and high-io virtual […]

  • Thanks again Scott, BTW do you know if they are working on supporting SCSI3? We would love to use pvSCSI for a SQL cluster but can’t as Windows 2008 requires SCSI3 and pvSCSI uses SCSI2.

    • No, I don’t know. I continue to run down a longer answer on this but it seems that the array implementations of SCSI3 are no better (from a perf perspective) than SCSI2. So, we are not doing much work with it. Still poking around, though.

  • If I am in a situation where pvscsi is “slightly worse” than the other bus types, it means that I am in an environment with only a small demand for storage performance, right? So why would I care about losing a handful of ms for my disk access?

    • Exactly! That is my point in the summary. If your environment matches the case where pvscsi is slightly worse, by definition you have an application where performance is not important.

      • Careful there. There are plenty of “performance-sensitive” apps out there which are latency bound. In those situations, it is critical to have a technique that is adaptive. Of course, in vSphere 4.1, we fixed this issue and made the pvscsi interrupt coalescing algorithm match the version we originally implemented in our virtualized LSI controller, so this particular issue should be behind us now. For those curious, the actual technique is super cool and we published a very approachable paper on said topic. Search for virtual interrupt coalescing under my name on Google.

  • […] can read about it at pivot point. Filed under: I.T, VMWare Comment […]

  • […] I didn’t really understand this and the knowledge base article is lacking any detail on the rationale behind the statement. I reached out to VMware performance engineer Scott Drummonds to see if he had anything he could publish to help clarify the KB article. Scott was nice enough to research this and posted his findings here. […]

  • Hi, any updates on when PVSCSI will work optimally for low IOPS? Any new status on SCSI3 support for PVSCSI?

    • PVSCSI coalescing has been fixed in ESX 4.1. VMware has checked in the fix to the code branch that will produce ESX 4.0 U3. I am not the right person to comment on PVSCSI support for SCSI3.

  • […] In a post on his vPivot blog, former VMware performance guru Scott Drummonds echoed the sentiments and stated that this issue would be resolved in a future release of vSphere.  He noted that the PVSCSI driver was only slightly slower than the LSI driver in certain scenarios but acknowledged that it could result in slightly less performance with no efficiency gains. […]

  • […] first it’s not supported by VMware on the boot disk as described above, and secondly it should not be used in a low throughput environment. This entry was posted in howto and tagged 10.04, lucid, pvscsi, ubuntu. Bookmark the permalink. […]

  • Scott,
    Great work as always. So it’s pretty obvious to use pvscsi with high I/O workloads. Our client is planning on running SLES11 SP1 with Oracle 11g. He is planning on using one disk for boot and “/” and two more disks using Oracle ASM for load balancing within Oracle across both data disks. He asked a good question: when adding the third disk, should it be added to a separate controller from the second disk? Would this have any performance gains, or is this overcomplicating things for nominal benefits?

    • I spent some time digging into the benefits of distributing VMDKs across multiple virtual HBAs. VMware never quantified any gains from this practice but we all agreed that there should be some value. Most likely the gains would come from an increased queue in the guest. But to realize these gains you would have to be distributing the VMDKs across multiple VMFS volumes, too. And to measure a difference you would have to be driving a large number of IOs, certainly around 50K IOPS.

      • Bumping an old one here, but this is one of the better discussions out there regarding pvscsi. FYI, I’ve defaulted all my new VMs to pvscsi (mostly 2008 R2 guests on 4.1U1 hosts). It has worked great, but recently we’ve had three bluescreens when growing data disks on-the-fly. VMware’s recommendation (from http://kb.vmware.com/kb/1010398) is that the boot disk should be on a pvscsi controller by itself, and the subsequent data disks should be on at least one other one (SCSI IDs of 1: or higher).

  • Thanks Scott for the fast response. I’ll make sure to share this with the client.

  • So with the added latency in low IO situations resolved, is it worth using this as a default adapter (for OSes that support it) for the increased CPU efficiency? It’s the default device for RHEL 6 already, but I’m thinking specifically 2008 R2. We’d need to ensure that the driver for this (from the VMware Tools) is loaded into the boot image as well I assume.

    • You can and should always use the PVSCSI driver now. This article was written a year and a half ago. Its content pertains to vSphere 4 with no updates applied. These issues were corrected through updates to vSphere 4 that were also folded into subsequent versions of the product.