Scott Drummonds on Virtualization

Storage Consolidation (or: How Many VMDKs Per Volume?)


Part of the performance best practices talk I co-presented at VMworld in San Francisco and Copenhagen focused on answering the question, “How many virtual machines can be placed on a single VMFS volume?”  There are a lot of theories as to the best answer.  It will not surprise you to learn that no single consolidation ratio works in every environment.  Your workloads will influence the maximum consolidation ratio.  But we know enough about how ESX virtualizes storage to provide guidance on the right storage consolidation ratios.

First, a little background on ESX’s storage queues.  There are two relevant queues in ESX.  First is the device queue, which has one instantiation at each HBA for each LUN.  Second is the kernel queue, which handles “overflowed” IOs that are waiting to be placed in a full device queue.

For Fibre Channel HBAs, the device queue’s default length is 32 commands; it is much larger for iSCSI.  No HBA, and thus no device queue, exists for NFS.  A 32-command queue can keep 32 commands open at a time, so doubling the queue length allows the queue to keep twice as many IOs outstanding against the volume.  For the rest of this article I will discuss queues in terms of the 32-element Fibre Channel queue.
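To make the overflow behavior concrete, here is a toy Python sketch of the two queues. This is an illustration of the idea only, not VMkernel code; the depth values are just the defaults discussed above.

```python
# Toy model of the two ESX storage queues: commands occupy the per-LUN
# device queue until it is full, and the overflow waits in the kernel queue.

DEVICE_QUEUE_DEPTH = 32  # default for Fibre Channel HBAs

def queue_commands(num_ios, depth=DEVICE_QUEUE_DEPTH):
    """Split simultaneous commands into (open at device, waiting in kernel)."""
    in_device_queue = min(num_ios, depth)
    in_kernel_queue = num_ios - in_device_queue
    return in_device_queue, in_kernel_queue

print(queue_commands(40))            # (32, 8): 8 commands overflow
print(queue_commands(40, depth=64))  # (40, 0): a longer queue absorbs them
```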

Because one device queue is instantiated at each HBA for each LUN, a storage reconfiguration at an array can change the number of queues at an ESX host.  Increasing the number of queues increases the total number of IOs that the host can open against the array.  I demonstrated this in my VMworld presentation with the following figure.

Figure: one VMFS volume means one queue shared by both VMs; two VMFS volumes mean two queues, so the pair of VMs can have up to 64 commands open at one time.

This figure shows the simple difference between two virtual machines sharing a single VMFS volume and two that each get their own.  In the first configuration, only 32 commands can be opened from the host and that single queue is shared between the virtual machines.  In the second configuration, the host can open up 64 total commands and each virtual machine can open up to 32.
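The arithmetic behind the figure is simple enough to sketch in a few lines (the function name is mine; the queue depth is the Fibre Channel default from above):

```python
# One device queue per LUN: the number of VMFS volumes sets the ceiling on
# how many commands the host can have open against the array at once.

QUEUE_DEPTH = 32  # default Fibre Channel device queue length

def host_open_commands(num_volumes, depth=QUEUE_DEPTH):
    """Maximum simultaneous commands across all of the host's volumes."""
    return num_volumes * depth

print(host_open_commands(1))  # 32: both VMs share one queue
print(host_open_commands(2))  # 64: each VM can open up to 32 on its own queue
```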

Your first reaction to this might be, “Wow! I should put every VMDK on a VMFS volume of its own!  Then imagine the total throughput that the host could drive!!”  My first response to this is to stop using so many exclamation points.  Nobody likes an overenthusiastic writer.  But second, you should consider that more is not always better.  In fact, I can think of several reasons why you should not reconfigure storage to multiply the number of queues:

  1. Allowing a host to open many commands simultaneously may be good for the individual virtual machines but is likely to be dangerous for the shared infrastructure.  This could result in short but extremely intense microbursts of IO that could present challenges to your fabric or storage processors.
  2. The device driver (and the HBA) can only open a fixed number of commands, depending on the device’s implementation, so these slots have to be used sparingly.
  3. The configuration that results in more queues necessarily requires more VMFS volumes which results in a greater administration cost.

In addition to reconfiguring storage to increase the number of device queues, you always have the option of increasing the length of ESX’s device queues.  This is documented on page 71 of the Fibre Channel SAN Configuration Guide.  But I will caution you against reconfiguring storage queues, too.  This requires manual changes at every host, produces longer queues that more quickly eat into the fixed number of commands each HBA can support, and increases the possible IO intensity of every virtual machine on the host.

And if these detailed explanations are insufficient to explain why storage queue manipulation is unproductive, or even counterproductive, towards your goal of optimizing your infrastructure, let me point out that VMware has years of experience at consolidating storage and chose 32 commands per queue as the right number for most environments.  Trust their experience on this one.

Of course I would be remiss if I did not mention that there are rare times when a storage reconfiguration may help performance.  Redistributing virtual machines across different VMFS volumes or increasing queue depths can correct some issues.  You can identify the occasions where such a change may help by looking for large kernel latencies.

As I mentioned above, commands that are waiting for access to a full device queue reside in the kernel queue until a device queue slot becomes available.  On the whole, commands should only spend a fraction of a millisecond in the kernel queue on their way to the device queue.  A kernel queuing time of over one millisecond and certainly over two milliseconds suggests the virtual machines are not having their IO needs served fast enough.

You can see kernel queuing times in the kernel latency statistic reported in esxtop (counter: KAVG) and vCenter (counter: Kernel Latency).  When these latencies consistently average any whole number of milliseconds it’s time to investigate storage.  But know that slow storage itself can result in high kernel queuing times.  So, before you go manipulating queues or reconfiguring your storage layout, make sure your storage is serving IOs in periods deemed acceptable by the storage teams (usually 5-10 ms).
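That decision process can be sketched in a few lines of Python.  The function name and the exact cutoffs are mine, not any official VMware rule; they just encode the rules of thumb above.

```python
# Rough triage of esxtop latencies: DAVG is device latency, KAVG the time an
# IO spends in the VMkernel (mostly kernel-queue wait). Thresholds follow the
# rules of thumb in the article, not official guidance.

def triage(davg_ms, kavg_ms):
    if davg_ms > 10:
        return "storage is slow; fix the array/fabric before touching queues"
    if kavg_ms >= 1:
        return "sustained kernel queuing; consider redistributing VMs or deeper queues"
    return "healthy"

print(triage(davg_ms=6, kavg_ms=0.2))   # healthy
print(triage(davg_ms=6, kavg_ms=2.5))   # sustained kernel queuing; ...
print(triage(davg_ms=15, kavg_ms=2.5))  # storage is slow; fix that first
```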

This is kind of a long article by vPivot standards, I know.  But cut me some slack.  Chad Sakac bangs out footnotes and parenthetical digressions that are longer than this entry.  This content has already been covered in my VMworld presentations, so if you have access to those recordings go listen to Kaushik and me present it there.  But for those of you who were unable to attend, I wanted to present this important guidance for your consideration.

18 Responses

Hey Scott.

Great post; all VMware admins, and anyone involved in planning a VMware implementation, should be aware of this. I have not seen your VMworld presentation; did it cover the behavior of SIOC? The way VMware has implemented SIOC, dynamically controlling the device queue based on a latency threshold, is very interesting.

I have heard many suggest not to change the queue, and then there are many KB articles and PDFs that say you should, to address latency, throughput or SCSI locking. It really does boil down to the workload of the VMs.

Are you going to follow up with considerations from VMware’s Scalable Storage Performance PDF ? (http://www.vmware.com/resources/techresources/1059). I know it is old, but it still applies for many (link bandwidth, SCSI locking, latency).

I like the way VMware is actively pushing on the storage front, they are tacking big design and operational considerations one by one; VAAI and locking, SIOC and latency/queuing/QOS.


    • I’m probably not going to update the content with data from that paper. That paper was created in response to concerns that there are VMFS scalability problems. It does not delve into queuing but serves an important role in supporting VMFS deployment. I would love for VMware to update their documentation with more on queues. But they are unlikely to do so. That’s why I wrote this blog.

  • Nice succinct post. It goes without saying that before making any changes to queue depth settings, one should consult the best practice guidelines of the storage system and HBA vendors.

    Also, particularly in a mixed SAN infrastructure with physical and virtual hosts, I would suggest noting the fan-out ratio of the storage controllers and the SAN fabric to your servers. It is possible for over-subscription of the SAN fabric and/or storage controllers to impact performance across the entire SAN, including your ESX hosts.

  • Scott,

    You say cranking up the queue of the HBA doubles the number of IO requests a host makes.
    I don’t agree. 🙂
    There’s a slight difference in what you’re telling, and what we see IRL.

    It’s a queue: what you’re effectively doing is giving the HBA more room for outstanding requests. So in case of a pile-up at storage for a short timespan, your HBA queues them, and not the OS. Take this example: standard 32, no queued IOs seen on the HBA, so your storage is keeping up. Now we set the HBA’s queue at 128, which is the spec IBM gives for SVC/QLogic/ESX. According to your post, we should see 8 times more (2 HBAs per ESX host) IOPS from an ESX host coming in at the front-end storage. But of course we don’t see that. Because there was no queue with the setting at 32, there won’t be a queue with 128 set. The host isn’t going to ask for more IOPS simply because you upped the HBA setting.
    The way you described this mechanism, we automatically should see more IO requested from the ESX server that had its setting changed, which just isn’t happening.

    Now why does IBM in this example say set it at 128 then, if it doesn’t make any difference? They just want to make sure an HBA is not the bottleneck. Effectively they say; just throw the iops

    • I get your point and agree with you. If the application is not demanding more than 32 commands at a time, then increasing queue size will have no change. Furthermore, increasing queue length does not technically double the IOs. The IOs remain the same in number. But they are processed in a shorter, more intense burst.
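In Little’s-law terms, the effective concurrency is the lesser of the workload’s outstanding demand and the queue depth, so a deeper queue changes nothing until demand exceeds it. A rough closed-loop sketch, with purely illustrative numbers (a real device’s service time also rises as its queue fills):

```python
# Closed-loop approximation (Little's law): IOPS ~= concurrency / latency,
# with concurrency capped by both workload demand and device queue depth.
# Illustrative only: assumes service time stays constant as the queue deepens.

def approx_iops(demand, queue_depth, service_ms):
    concurrency = min(demand, queue_depth)
    return concurrency / (service_ms / 1000.0)

# A workload keeping only 8 IOs in flight gains nothing from a deeper queue:
print(approx_iops(demand=8, queue_depth=32, service_ms=2))    # 4000.0
print(approx_iops(demand=8, queue_depth=128, service_ms=2))   # 4000.0
# Only a demand-heavy workload sees the difference:
print(approx_iops(demand=100, queue_depth=32, service_ms=2))  # 16000.0
print(approx_iops(demand=100, queue_depth=128, service_ms=2)) # 50000.0
```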

    • It really depends on the workload, OS, and HBA driver. I have seen some Oracle workloads, especially OLTP-type workloads, where in Solaris I have seen the best performance and response time using queues of 16 or 32, but have seen that same workload in a Windows environment using Oracle where the queue is set at 128. Bottom line: if you can see how well the IO requests are being serviced, you can monitor the response times (as one of several things you can do), and if response time increases it may be a sign that the IOs are waiting too long to be serviced. There are many factors that go into the magic queue length. Many folks spend time characterizing this for their own workload.

  • Woops, thick fingers and iPad 🙂

    Effectively they say; just throw the iops at us, we queue in the front end storage if we got problems keeping up. If you leave it at 32, and there is a burst which makes the HBA queue up for a very short time, it’s delaying your host. At 128, it isn’t, and storage takes care of your burst. But setting it higher because you want to make your host do more iops?

  • What about NFS volumes? Your post focuses on FC SAN and LUN-based VMFS. Just curious, since your post mentioned volumes, which I thought would cover both LUN-based and NFS-based storage.

    • I did not describe NFS stores because I am less familiar with the processing of NFS-bound IOs in the VMkernel. But my understanding is that there is no device queue for NFS IOs: they are converted to TCP packets when handed to the VMkernel. The only queue that exists is in the guest’s SCSI driver.

  • So NFS volumes are inherently better than FC? This seems to be more evidence for it. We’re doing NFS on NetApp and that seems to work really well. I’m less familiar with FC and this queueing issue. NFS doesn’t have a similar bottleneck? I guess what I’m asking in short, why would anyone use FC if NFS doesn’t have this constraint? What’s the downside of NFS?

    • I would caution you against drawing conclusions about the superiority of any protocol from this queuing discussion. Remember, even without ESX kernel or device queues your NFS-based virtual machines are still limited by the guest queue, which defaults to something like 64 commands. So, you are still dealing with queues with any protocol.

      Fibre Channel generally has nominally better latency and is (currently) a more scalable solution. The primary problems with NFS today are (1) VMware’s lagging support for it in their feature set (see: latency statistics, SIOC, etc.) and (2) the single connection limitation that sets a maximum throughput at a single link’s speed.

      These things aside, it’s fair to say that the complexity of Fibre Channel makes NFS a fantastically attractive option. Once VMware upgrades their NFS implementation and commits to providing full support for NFS with all subsequent features, we may all find ourselves sticking to NFS for all virtual machines.

      • According to NetApp, one of the benefits of NFS was its superior scalability over FC. Perhaps that’s just their implementation? We built our environment around their recommendations, so I guess I’m just trying to understand more about it. Their spiel, as I understand it, is that FC is faster for a few VMs but as you increase VM density, NFS maintains its performance curve better than FC. Do you disagree?

        • We (at VMware) saw NetApp claiming scalability limitations in VMFS about a year and a half ago. We always refuted those claims and could never reproduce the limits claimed. We politely asked NetApp to help us prove their results or stop disseminating them. For the most part, they stopped making the claim of poor VMFS scalability.

          NFS is great. It’s awesome. Everyone should love it. But no one should believe that the reason why NFS is better than VMFS is due to poor VMFS scalability. It’s not true.

          All of VMware’s hero numbers were done on Fibre Channel because it scales better on extreme workloads. Because VMware’s implementation of NFS support is limited to a single session, its maximum throughput is limited to a single link’s speed. This is not the case with FC. But note that I said “on extreme workloads”. On 99% of applications NFS can provide every ounce of performance you need.

          Also, we saw nominally better latency with FC. But nothing I would base a purchasing decision on.

  • p.s. and I do agree, the reduced complexity makes it very attractive. I might stick with it even if we could show it was slower just because it’s so easy to implement.

  • Thanks for the feedback. I’m new to this blog, and just now clicked your About box and see you’re EMC. Didn’t mean to start any vendor wars here. But I do have only one vendor’s perspective right now, so it’s good to hear other voices. We run mainly light workloads but do have one heavy hitter: Mimosa NearPoint, an email archiving solution, that thrashes storage hard all day long. We (and Mimosa) were very hesitant to do this over NFS (they wanted FC-connected RDMs), but we did it on NFS and it hums along very nicely. We’re about 60% virtualized now, so I’m just curious if we’re going to be able to keep scaling on NFS.

    • I am an EMC employee. But I was at VMware for nearly four years. And my total experience on this issue is as a VMware employee, not an EMC one. VMware works equally well with all protocols. And because of that we (when I was at VMware) vocally disagreed with NetApp’s claims that VMFS did not scale. To date, I have never seen results of a credible experiment supporting that claim.

      You have to realize that EMC does not come down on any side of this argument. EMC’s Unified arrays support both NFS and FC and customers should choose what works well in their environment.

  • Great post, but regarding your comment above, “…VMware’s implementation of NFS support is limited to a single session…”: this makes NFS sound limited when logically it is at least on par with FC. With multiple VMkernel ports and a proper NFS datastore design one can use (static, it’s true) link aggregation with multiple targets and have true load balancing with multiple active paths. With FC there is still one path to a LUN. Using multiple LUNs complicates things on both sides. Adding the queues into the mix does not create a nicer picture. With NFS these are non-issues. All you need is basic EtherChannel skills. VMware did a pretty good job, in my opinion, in keeping the NFS configuration easy and flexible since the beginning in 3.x.

    • No, FC is definitely superior to NFS with respect to multi-pathing. You cannot establish two paths from a single VM to a single NFS datastore. You can do this easily with FC.

      I am not trying to start a protocol war here. NFS is awesome and people should be using it for almost everything. But it has some limits that we are waiting for VMware to improve upon.