Scott Drummonds on Virtualization

Storage Performance Analysis: SingB Case Study


Prepare to get deep into storage.

In the past few weeks I had the pleasure of digging deep into a customer's performance problem.  Ultimately we identified some interesting issues in the environment that we traced back to an overloaded array.  Like most performance problems, the complaints started at the application layer and then shifted to vSphere.  Like many configurations, it was difficult to pinpoint why the storage was slow.  But EMC account teams pride themselves on customer responsiveness. We assembled a small team to help out. I was amazed and grateful that midtier specialists from Australia, Malaysia, and India all pitched in on the analysis!

If you are a VMware administrator you may choose to leave the nuts and bolts of storage management to your storage teams.  While this article talks about those nuts and bolts, I ask you to read on.  A little knowledge about how your array works will make you an awesome VMware administrator.  It will help you work with your storage administrators to get the most out of your array.  When your array is at its best, so are your virtual machines.

The analysis you see below is the product of tools EMC can run against your EMC storage in a very short time.  The data collection took 24 hours in this case, but the figures I will show were auto-assembled in minutes.  This is one of the many cool things an EMC technical consultant or one of our partners can do for you.

This account is all true, save the pseudonym I am using for the customer: SingB.  All customers are great customers, but I think we have a particularly good relationship with this one.  I have been working with this customer’s IT department since my first week in Singapore.  They are forthcoming with their needs and problems.  And I am always happy to spend a few minutes with them.

The analysis started with the creation of a NAR file.  The process to create a NAR file is well documented throughout the internet.  We collected 24 hours' worth of data from SingB's CX4-480.  Confirming what the VMware and storage teams already knew, the first sign of trouble from the NAR file analysis is a 24-hour summary of array latency.

You can immediately see that the 95th percentile latency is just below 50 ms.  A backup job they run after 21:00 drives the latency quite high, but as that slowdown happens at off-peak hours, they can live with it.  There are three other times in the day–right in the middle of the work day–when latency touches or barely passes the 50 ms threshold.  During those periods end users are screaming.
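If you want to reproduce a percentile figure like this from your own monitoring exports, the arithmetic is simple.  Here is a minimal Python sketch using the nearest-rank method; the sample values are invented for illustration (a real NAR file supplies thousands of per-interval readings):

```python
# Sketch: computing a 95th-percentile latency from raw samples.
# The latency values below are invented for illustration.

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value >= pct% of samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [4, 6, 5, 9, 12, 48, 47, 8, 7, 50,
                11, 6, 5, 9, 46, 8, 7, 10, 49, 6]
p95 = percentile(latencies_ms, 95)
print(f"95th percentile latency: {p95} ms")
```

Note that different tools interpolate percentiles differently, which is one reason two reports over the same data can disagree slightly.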

The most visually exciting figure auto-generated from the NAR file is the heat map.

For those VMware admins who find themselves in the same position I was in five years ago, the parts of a storage array may be foreign to you.  So, here's a five-second summary of what happens in the array: commands enter the array through the SP and are sometimes serviced by the DRAM cache.  When the DRAM cache misses, a sequence of reads (and possibly writes) is initiated against the backend disks over the bus.  All of the components involved in this flow–SP, DRAM, connecting bus, and disks–are depicted in the above heat map. There you go.  Storage is simple, right?

Several observations from the heat map:

  • The SP utilization is generally high, moving between the green 50% and the red 100%.  As I said to our friends at SingB this week, they are getting their money’s worth out of this array.
  • The DRAM cache is exceptionally busy.  That is generally a good thing.  But high cache usage in write-intensive environments often causes forced flushes.  A forced flush is when the cache fills up from writes and IOs are temporarily halted while more space is made.
  • Bus 1, attached to SP A, is too often at 100% utilization.  This will limit the SP’s effectiveness.
  • The disks, labeled by the pool they have been added to, are not exceptionally busy.  But on the whole the pool called “LOC” shows the highest utilization.
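The forced-flush behavior in the second bullet can be sketched as a toy model.  The capacity, watermark, and arrival numbers below are all invented; real arrays destage based on high/low watermarks, idle time, and much more:

```python
# Toy write cache: when dirty pages sit at or above the high watermark,
# incoming writes stall while the array destages pages to disk.
CAPACITY, HIGH_WATERMARK, DESTAGE_PER_TICK = 100, 80, 5

dirty, stalls = 0, 0
for tick in range(50):
    incoming = 8                        # write IOs arriving this tick
    if dirty >= HIGH_WATERMARK:         # forced flush: halt new writes
        stalls += incoming              # ...until space is made
    else:
        dirty = min(CAPACITY, dirty + incoming)
    dirty = max(0, dirty - DESTAGE_PER_TICK)  # background destage

print(f"write IOs stalled by forced flushes: {stalls}")
```

Because ingest (8 pages/tick) outruns destage (5 pages/tick), the cache eventually pins at the watermark and writes periodically stall–exactly the latency spikes end users feel.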

This customer originally deployed this array for a development environment.  They followed what I would call “generic best practices”.  They wanted to suit the most varied set of workloads with a reasonable cost.  Some of the decisions they made included:

  • RAID 5 everywhere.  I would say the “average” workload looks a lot like an OLTP workload but with slightly larger blocks: mostly reads with a block size just above 8 KB or 16 KB.  RAID 5 is usually good for this.
  • Pools for the majority of the applications but a few RAID groups for applications that are being manually managed.
  • FAST Cache for the VMware environment, which we would expect to match the “average” workload I describe above.  No FAST Cache for their database environment, which has frequent table scans and backups.  Those activities are mainly sequential and do not realize as much benefit from a large cache.

Later in our report I found something interesting.  You performance sleuths out there should take a look at this summary and see if you can identify where our assumptions above have been contradicted.

The thing that jumped off the screen at me is the write-heavy ratio for pool “vmw-1”.  It is 83% write!

If you are a storage administrator, you may have already realized why the SP utilization is so high.  If you are not an admin, you will first need to understand how RAID 5 works to know what is driving up SP utilization.

RAID 5 protects data using a parity system.  For each set of data blocks in a stripe there is one more block with the data's parity information.  A read from any block in the stripe requires reading from all the blocks so the parity can be recalculated and checked.  A write to any block in the stripe requires that all the data be read first, the parity checked and then recalculated before the stripe is written back.  It is because of this that writes are much more expensive with parity protection like RAID 5 than with mirroring like RAID 1.
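For the curious, the parity itself is just an XOR across the stripe's data blocks.  Many parity arrays also optimize the small write with a read-modify-write: the new parity can be derived from the old data and old parity alone, at the cost of four backend IOs per frontend write.  A minimal Python sketch, with invented block values:

```python
# Sketch: RAID 5 parity is the XOR of the data blocks in a stripe.
# A small write can update parity via read-modify-write: read the old
# data and old parity, compute, write new data and new parity (4 IOs).

def parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

stripe = [0b1010, 0b0110, 0b1100]   # data blocks (invented values)
p_old = parity(stripe)              # the stored parity block

# Replace block 1: new parity = old parity ^ old data ^ new data
new_data = 0b0001
p_new = p_old ^ stripe[1] ^ new_data
stripe[1] = new_data

assert p_new == parity(stripe)      # matches a full recomputation
```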

In fact, our helpful analysis tool gave us a precise calculation of the effect of RAID choices with this workload. The following table estimates the backend IOPS (those internal to the array) as a result of choosing RAID 5 versus RAID 10 protection.

As you can see, at this particular read/write ratio the RAID 5 choice generates roughly twice as many IOs at the backend as RAID 10 would when configured to the same capacity.  But RAID 10, using both mirroring and striping, produces less usable space than RAID 5.  Balancing performance efficiency against capacity efficiency is one of the many difficult decisions storage administrators need to make.
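You can approximate the table's math with the standard write-penalty model: reads cost one backend IO under either scheme, a RAID 10 write costs two (one per mirror copy), and a RAID 5 small write costs four.  The 6,500 IOPS and 83% write figures below simply echo numbers from this story; the EMC tool's estimates account for far more detail:

```python
# Sketch: frontend-to-backend IOPS using the standard write-penalty
# model. Penalties: RAID 5 small write = 4, RAID 10 write = 2.
# The frontend load and write ratio are taken from this story.

def backend_iops(frontend_iops, write_ratio, write_penalty):
    reads = frontend_iops * (1 - write_ratio)
    writes = frontend_iops * write_ratio
    return reads + writes * write_penalty

front, wr = 6500, 0.83
raid5 = backend_iops(front, wr, 4)
raid10 = backend_iops(front, wr, 2)
print(f"RAID 5 backend:  {raid5:,.0f} IOPS")
print(f"RAID 10 backend: {raid10:,.0f} IOPS")
print(f"ratio: {raid5 / raid10:.2f}x")
```

At an 83% write ratio this model gives about 22,700 backend IOPS for RAID 5 versus about 11,900 for RAID 10: the "roughly twice" from the table.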

(As a side note, the above table also shows the tremendous value of SSDs in throughput-intensive environments.  But the SSD versus HDD decision does not help in this particular situation.  The array is starved for SP cycles.  Disks that respond faster do not change the fundamental number of front-end and back-end IOs that the SP must process.)

As I mentioned earlier, we were working on this analysis with our friends at VMware.  VMware set up vCenter Operations and produced some phenomenal summaries of what vSphere was generating against the array.  With that analysis we provided a simple plan of action for the customer:

  1. Storage vMotion a couple of offending write-heavy workloads to another array to provide immediate relief.
  2. Over the next few weeks, perform some light storage redesign to exchange one RAID 5 pool for a RAID 10 pool.  Place the write-heavy VMs there and the subsequent SP load caused by those applications will go down by more than 50%.
  3. Over the coming months, upgrade to vSphere 5 (SingB is currently on an older version of vSphere), consider buying vCenter Operations, and spend some more time optimizing the storage layout to reduce SP utilization and eliminate bus contention.

So, what can you take away from this lesson?

  • If you are a VMware administrator, become your storage administrators’ best friend.  Educate yourself a little on the different ways arrays actually store data.  Know how your VMs’ profiles will benefit from or be penalized by being mapped to different volumes.  Consider using vSphere’s vStorage APIs for Storage Awareness (VASA) to automate the mapping. A recent EMC paper on VASA on Symmetrix contains a great description of what VASA does and how it can help.
  • If you are a storage admin, make sure your VMware admins are using their storage vCenter plugin, such as EMC’s free VSI.  That is the beginning of their education in storage and the little extra help they need to see that all VMFS volumes are not the same.

15 Responses

Great post!
Isn’t there a RAID5 write optimization mechanism available on the CX4-480 series that can outperform RAID10 in certain situations?

    • Yes – you are referring to the Full Stripe Write operation.

      The CLARiiON or VNX in its normal cached operation optimises disk I/O to perform full-stripe writes whenever possible. A full-stripe write is sometimes referred to as a Modified RAID-3 (MR3) write.

      If the incoming I/O is large enough to fill the RAID-5 stripe and it is aligned to the disk/block boundaries, or the storage processor has accumulated enough write I/Os in write cache, then a full-stripe write is performed.

      Also, depending on the data locality, it is possible to coalesce smaller I/O writes into fewer larger writes. It is possible to sequence individual write requests in cache until an entire stripe is full, before writing it to disk. This makes for more efficient use of drives and the back-end bus.

  • Thanks Scott for the analysis. Will try my best to attend the meeting on Friday afternoon.

  • Good analysis and explanation for VMware and Storage gurus. Well done Scott!

  • Scott,
    Is this a very simplified example or is a CX4-480 really falling over at ~6,500 total frontend IOPS?

    • The example is a bit simplified. One of the things I discovered in this process is that the notion of “average array IOPS” lacks a clear definition. In three separate tools we got averages of ~7K, ~10K, and ~13.5K. You can derive these numbers based on when the sample occurs, whether averages are done across time slices or across LUNs, whether each sample is an average or a peak, etc.

      Also remember that the number you’re seeing is a 24-hour average. The problems were occurring during a small portion of the day. I would estimate that this array was at maximum throughput somewhere between 15k and 20k IOPS. But with a proper design I think we can get its maximum above 30k. That will depend on a lot of other factors like IO size and sequentiality, obviously.

  • Great write up! Very easy to understand for just about anyone. For any customers reading, I’d like to emphasize the point “never hesitate to escalate”. As a former EMC Customer and EMC’er myself, I have seen first hand the willingness to go above and beyond when it comes to helping customers. If you’re having a problem, your vendor(s) need to know about it! If you’re working with a good business partner, they’ll step up and take advantage of tools they have as is shown in this post.

  • We can also understand from this story why using NetApp RAID-DP aggregates is so much better and easier, you don’t have to make these decisions (RAID5 or RAID10? should I put all my disks together or not?, etc.)! 🙂

    • Mihai,

      There certainly is a place for “easy” in enterprise storage. In fact, “easy” was the initial priority in this deployment. SingB used EMC’s storage pools to meet this requirement. They simply piled workloads in the volume until no more performance remained. That is operationally equivalent to the NetApp aggregates you mention.

      However, as I described in my article, a storage configuration designed for the average workload (such as SingB’s) will show unexpected results when applications have profiles that are distinctly not average. In that situation the customer can either dig in and optimize, or throw more hardware at the problem.

      In this case the customer wanted to tweak and understand. That is why I provided this detailed observation for the customer’s edification. But, if simplicity had been the requirement, I would have recommended continued use of EMC’s storage pools.


      • Yes, but the beauty of NetApp is that you get both simplicity AND high performance (RAID-DP has the same performance as RAID10 – check standard industry benchmarks such as SPC-1 or SPECsfs – and it automatically balances between all the disks in the aggregate; also if you want higher performance and your working set is not too big, you just plug in some FlashCache cards and automatically all your volumes speed up).

        P.S. I don’t work for NetApp, I am just a happy user
        P.P.S. NetApp is not the ideal storage company, they have their own set of problems (you need expensive controllers to get very good performance, you should perform regular reallocation scans so that data doesn’t get too fragmented, etc.)

        • RAID-DP gets the “same” perf as RAID 10 IF and _only_ IF:
          a) you statically pre-allocate the whole WAFL cluster by random-writing the SAN disks at initialization
          b) you throw enough (read: 2x as much) hardware at the problem – including spindles and SP power

          All that said, power of WAFL is in the flexibility. The same place its weak block performance is at.

          • Ah, and c) You do not use the array so that WAFL does not get fragmented …

            Overall, all sols have their strengths and weaknesses. Preaching any as superior is not the way to go.

  • […] really enjoyed Scott Drummonds’ recent storage performance analysis post. He goes pretty deep into some storage concepts and provides real-world, relevant information and […]

  • Hi Scott,

    you wrote:
    # A read from any block in the stripe requires reading from all the blocks
    # so the parity can be recalculated and checked.
    # A write to any block in the stripe requires that all the data be read first,
    # the parity checked and then recalculated before the stripe is written back.

    Does CX4 (and/or VNX) really check parity while READING blocks?
    Isn’t there a background verify process called “Sniffer” to do so?
    If your statement is true, it fundamentally changes “the math” I use to calculate backend IOPS vs. frontend.
    Would you pls. clarify ?

    • A background operation could help proactively identify disks a storage admin should replace. But if a parity check is not done concurrent with each read then the array could return bad data from a failed/corrupt disk. So, yes, enterprise storage should do parity checks for every read.