Scott Drummonds on Virtualization

vSphere Is Not the Performance Problem, Your Storage Is


[This is an update to one of my favorite articles, which details my on-site investigation of SQL Server performance problems.]

Back in July I had the privilege of riding along with VMware’s Professional Services Organization as they piloted a possible performance offering. We are considering two possible services: one for performance troubleshooting and another for infrastructure optimization. During this trip we piloted the troubleshooting service, focusing on the customer’s disappointing experience with SQL Server’s performance on vSphere.

If you have read my blog entries (SQL Server Performance Problems Not Due to VMware) or heard me speak, you know that SQL performance is a major focus of my work. SQL Server is the most common source of performance discontent among our customers, yet 100% of the problems I have diagnosed were not due to vSphere. When this customer described the problem, I knew this SQL Server issue was stereotypical of my many engagements:

“We virtualized our environment nearly a year ago and and quickly determined that virtualization was not right for our SQL Servers. Performance dropped by 75% and we know this is VMware’s fault because we virtualized on much newer hardware on the exact same SAN. We have since moved the SQL instance back to native.”

Most professionals in the industry stop here, incorrectly files this problem as a deficiency of virtualization, and move on with their deployments. But I know that vSphere’s abilities with SQL Server are phenomenal, so I expect to make every user happy with their virtual SQL deployment. I start by challenging the assumptions and trust nothing that I have not seen for myself. Here are my first steps on the hunt for the source of the problem:

  1. Instrument the SQL instance that has been moved back to native to profile its resource utilization. Do this by running Perfmon to collect stats on the database’s memory, CPU, and disk usage.
  2. Audit the infrastructure and document the SAN configuration. Primarily I will need RAID group and LUN configuration and an itemized list of VMDKs on each VMFS volume.
  3. Use esxtop and vscsiStats to measure resource utilization of important VMs under peak production load.

There are about a dozen other things that I could do here, but my experience in these issues is that I can find 90% of all performance problems with just these three steps. Let me start by showing you the two RAID groups that were most important to the environment. I have greatly simplified the process of estimating these groups’ performance, but the rough estimate will serve for this example:

RAID Group Configuration Performance Estimate
A RAID5 using 4 15K disks 4 x 200 = 800 IOPS
B RAID5 using 7 10K disks 7 x 150 = 1050 IOPS

We found two SQL instances in their environment that were generating significant IO: one that had been moved back to native and one that remained in a virtual machine. By using Perfmon for the native instance and vscsiStats the virtual one, we documented the following demands during a one-hour window:

SQL Instance Peak IOPS Average IOPS
X (physical) 1800 850
Y (virtual) 1000 400

In the customer’s first implementation of the virtual infrastructure, both SQL Servers, X and Y, were placed on RAID group A. But in the native configuration SQL Server X was placed on RAID group B. This meant that the storage bandwidth of the physical configuration was approximately 1850 IOPS. In the virtual configuration the two databases shared a single 800 IOPS RAID volume.

It does not take a rocket scientist to realize that users are going to complain when a critical SQL Server instances goes from 1050 IOPS to 400. And this was not news to the VI admin on-site, either. What we found as we investigated further was that virtual disks requested by the application owners were used in unexpected and undocumented ways and frequently demanded more throughput than originally estimated. In fact, through vscsiStats analysis, my contact and I were able to identify an “unused” VMDK with moderate sequential IO that we immediately recognized as log traffic. Inspection of the application’s configuration confirmed this.

Despite the explosion of VMware into the data center we remain the new kid on the block. As soon as performance suffers the first reaction is to blame the new kid. But next time you see a performance problem in your production environment, I urge you to look at the issue as a consolidation challenge, and not a virtualization problem. Follow the best practices you have been using for years and you can correct this problem without needing to call me and my colleagues to town.

Of course, if you want to fly us out for to help you correct a specific problem or optimize your design, I promise we will make it worth your while.

10 Responses

Itzik Reich (EMC) here.
it’s amazing how it happens time after time..i think it’s down to the fact that the VI admin understand “Capacity”, and thestorage admins understand “IOPS”.
now, if only they could speak to each other..

  • Ohhh how true this is , I’m in the process of fixing an issue as we speak about this same problem.

    Itzik is right Vmware Admin’s see capacity and thrown as much as they can on a data-store.

    Storage guy looks at his monitoring tools and says’s WTF.

  • […] Organization (PSO) as they develop a performance troubleshooting service, I see repeats of a theme I wrote about a couple of months ago:  storage is causing most performance problems, not VMware.  In fact, I have yet to see a […]

  • […] with a focus on VMware View/VDI and is also worth a few minutes of your time.  Also check out http://vpivot.com/2009/09/18/storage-is-the-problem/ for a rubber-meets-the-road post from Scott Drummonds on the importance of storage performance […]

  • […] This tools helps to identify performance issues specifically with your storage.  Most performance experts will tell you that storage issues lead to more performance problems than any other single […]

  • Great article, and I’d point out one additional detail. If your physical servers have multiple HBAs with true active-active multipathing set up (such that you can get more than one HBA’s worth of bandwidth to a single LUN), then you need to understand how pathing works in virtualization. You won’t get that true active/active multipathing for a single LUN unless you also install that same multipathing solution at the VMware host level, not at the guest.

  • Thanks for all your documentation relating to Vmware performance and SQL. I’ve read everything you’ve written or referenced relating to SQL and CPU performance or troubleshooting after running into a SQL performance/scalability problem in recent weeks which remains unresolved. We were benchmarking an important system and SQL performance get increasing worse when increasing vCPU’s beyond 4.

    At this point, it’s cheaper for my organization to change our virtual DB servers to native servers. I’m running out of time and the organization is quickly coming to the conclusion that Vmware can’t handle large SQL workloads spite clear documentation to the contrary from yourself and others. I still look at this as challenge to be resolved, but my organization sees it as a problem. (actually my family sees it as problem also because I’m putting in extra time and effort till I get to the bottom of it)

    You’re thoughts or recomendation would be very much appreciated… So here’s the short of my problem as previusly summarized to VMware support:

    “First, our results and conclusions have not changed after additional
    testing during the past 24 hours. We’ve confirmed our results using an
    Active/Passive SQL cluster between a virtual node and native (physical)
    node. Both capacity and performance are significantly reduced when using
    a virtual db compared to the native db.

    Last night’s benchmark demonstrate our previous findings that we get 20%
    more transactions when using 50% less virtual CPU cores. Results
    include: 4 vCPU cores @ 572 k/hr vs. 8 vCPU cores @ 471 k/hr.
    Furthermore, avg response times when using a virtual DB server compared
    to a native DB server are nearly twice as long in virtual. Results:
    4vCPU virtual db user experience simulation @ 16.030 seconds (avg) vs.
    4pCPU native user experience simulation @ 7.737 seconds (avg).

    Once again it was confirmed the native OS installation doesn’t suffer
    from the same throughput problem and it yields a significantly higher
    quantity of transactions across the board.

    The native OS consistently increases in transactions with each CPU
    including CPU core quantities above 4, unlike the virtual DB. The
    throughput increases with each additional physical CPU core in the
    native host but throughput decreases with each virtual CPU core beyond 4-5.
    Results: 4 pCPU cores @ 727 k/hr, 6 pCPU cores @ 925 k /hr, 8 pCPU @
    1,065 k/hr. vs. 4 vCPU cores @ 572 k/hr, 6 vCPU cores @ 500-600 k /hr, 8 vCPU @ 471 k/hr.

    If all 8 CPU cores are used, the native DB achieves more than twice the transactional throughput of the virtual DB with 8 vCPU cores. Results: 8 pCPU cores @ 1,022 k/hr vs. 8 vCPU cores @ 471 k/hr.”


    • I have seen a negative scaling scenario with SQL Server in only one other case. The SQL DB was accessing an Oracle DB on another host for some of the data that it needed. Setting the SQL Server Processor Affinity manually greatly improved performance. On the Processors page of the properites settings for SQL Server, unselect Automatically Set Processor affinity and instead manually select all available processors.

      If you database is accessing an Oracle DB as part of the workload, I would expect this to have a fair chance of working.

      What was the performance scaling like up to 4vCPU? Is this an Intel Xeon 5500 series (Nehalem) based system?