vPivot

Scott Drummonds on Virtualization

Performance Troubleshooting Made Simple


For years I have struggled to give VMware’s customers a framework for diagnosing performance problems. People want a simple system for troubleshooting the unknown sources of a poorly performing application. The best attempt at documenting such a flow is Hal Rosenberg’s guide to vSphere performance troubleshooting. Elegant as it is, Hal’s document remains complex for the novice VI administrator. And it is precisely because that document is so complex that performance people keep their job security. 🙂 But in an effort to make my own job less necessary, I will try to generalize the troubleshooting flow and add some clarity to the process.

The first tool in the VI administrator’s toolbox should always be vCenter. Through the vSphere Client you can use vCenter’s performance counters to confirm a problem with any of the four core resources: storage, CPU, memory, and network. vCenter’s 20-second sample window limits its ability to rule a resource out, because a three-second spike in any resource will be smoothed away over that window. But when vCenter shows a sustained resource bottleneck, that bottleneck is almost certainly the cause of the performance problem.
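As a minimal sketch of pulling those counters programmatically, the snippet below assumes the open-source pyVmomi SDK and reads a VM’s real-time (20-second) CPU ready samples; the vCenter address, credentials, and VM name are placeholders.

    # Sketch: read a VM's real-time (20-second) cpu.ready samples via pyVmomi.
    # Hypothetical address, credentials, and VM name; adjust for your environment.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator",
                      pwd="secret", sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    perf = content.perfManager

    # Find the counter ID for cpu.ready.summation.
    ready_id = next(c.key for c in perf.perfCounter
                    if c.groupInfo.key == "cpu" and c.nameInfo.key == "ready"
                    and c.rollupType == "summation")

    # Find the VM by name with a simple container-view search.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "my-vm")

    spec = vim.PerformanceManager.QuerySpec(
        entity=vm, intervalId=20, maxSample=15,    # the last ~5 minutes
        metricId=[vim.PerformanceManager.MetricId(counterId=ready_id, instance="")])
    for ms in perf.QueryPerf(querySpec=[spec])[0].value[0].value:
        print("ready ms in 20s sample:", ms)       # 1,000 ms here = 5% ready
    Disconnect(si)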

If vCenter fails to confirm an obvious performance problem, the administrator must next turn to more precise, but more time- and knowledge-intensive, tools such as esxtop and vscsiStats. esxtop demands more skill and time than vCenter but provides finer resolution and deeper visibility into the system. vscsiStats is the most time-intensive of the three and has limitations on ESXi hosts, but it can uncover a world of storage detail invisible to both esxtop and vCenter.
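For reference, typical invocations look like the following; the world group ID is a placeholder you would read from the output of vscsiStats -l, and on ESXi you would use resxtop from the remote CLI in place of esxtop.

    # esxtop in batch mode: 20-second samples, 90 iterations (30 minutes), to CSV.
    esxtop -b -d 20 -n 90 > esxtop-batch.csv

    # vscsiStats: list world group IDs, collect for one VM, print a latency
    # histogram, then stop collection.
    vscsiStats -l
    vscsiStats -s -w 12345
    vscsiStats -p latency -w 12345
    vscsiStats -x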

I estimate each tool’s chance of identifying a random performance problem as follows:

  • vCenter: used in 90% of performance problems
  • esxtop: used in 9% of problems
  • vscsiStats: used in 0.9% of problems

The remaining 0.1% of the time is when you engage your account team or your local VMware performance expert.

Even within each tool’s usage there is a hierarchy of investigation: storage, CPU, memory, and network. My troubleshooting experience informs this ordering: storage causes the most problems, then CPU, then memory, and lastly (and rarely) network. After each resource has been inspected in vCenter, the inspection should be repeated in esxtop. In-guest tools can serve as a third step for memory, CPU, and network, but for storage, vscsiStats should always be consulted if the performance problem persists. The sketch below encodes this flow.
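As a minimal sketch of that flow (the check callable and the tool labels are hypothetical stand-ins, not any VMware API), it amounts to a pair of nested loops: tools ordered from easy to precise, resources ordered from most to least likely culprit.

    # Sketch of the troubleshooting flow described above. The check(tool, resource)
    # callable is a hypothetical stand-in for "inspect this resource with this tool".
    TOOLS = ["vCenter", "esxtop", "third-level"]          # easy first, precise later
    RESOURCES = ["storage", "cpu", "memory", "network"]   # most to least likely

    def diagnose(check):
        for tool in TOOLS:
            for resource in RESOURCES:
                # The third-level tool differs by resource: vscsiStats for
                # storage, in-guest tools for CPU, memory, and network.
                if tool == "third-level":
                    tool_name = "vscsiStats" if resource == "storage" else "guest tools"
                else:
                    tool_name = tool
                if check(tool_name, resource):
                    return "%s bottleneck confirmed by %s" % (resource, tool_name)
        return "escalate to your account team or a VMware performance expert"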

VMware’s growing array of performance management tools will change this flow somewhat. AppSpeed, for instance, can make very educated guesses about resource bottlenecks based on inside knowledge of the application’s execution. Hyperic can provide in-guest process visibility, and Ionix ADM will map application interdependencies to focus the investigation. But I will abstain from offering best practices for these tools until I have used them more. In all cases, however, the fundamental relationship of “easy first, precise later” remains.

VMware continues to work towards integrating all of these tools into a single view within the vSphere client. I expect that integration will improve the success rate of the performance layman in troubleshooting these problems. But I am sure that even into the distant future performance people will find their jobs secure.

5 Responses

Any idea when vscsiStats will be supported in some form on ESXi?

Great post. I’ll add that I often start (and end) with esxtop. I normally only use vCenter for the obvious stuff, like “Hey, why is this VM so slow? Oh, the CPU has been pegged since this morning…”

    • My suggestion of the best metrics to check:

      1. Storage: Driver MilliSec/Command (aka DAVG): milliseconds of latency per I/O command introduced by the storage array. High values are bad. Check your storage configuration.

      2. CPU: %Ready for a vCPU: the percentage of time the vCPU wanted to run but was denied a pCPU. (If you get ready time in ms, divide by the number of ms in the sample period, e.g. 1,000 ms out of the last 20 seconds in vCenter = 1,000/20,000 = 5%; see the sketch after this list.) High values are bad. Check for too many vCPUs per pCPU or simply too much demand.

      3. Memory: MB/s swapped in from the .vswp file (the host-level swapfile). Note that it doesn’t matter how much is IN the .vswp file (already swapped out), only how quickly it is being read back into memory. Non-trivial values are bad. Add more pRAM or reduce the demands on vRAM.

      4. Network: % dropped packets. But this is much less likely to be the problem than storage, CPU, or memory.

      I like those metrics because, once you decide what your thresholds are, each is unambiguously bad. %Utilization, for example, is much more ambiguous. But it’s never good to have slow storage, or wanted data that has to be read from disk instead of memory, or time spent waiting to get a pCPU.
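      (To make the %Ready arithmetic in item 2 concrete, here is a throwaway helper; it is hypothetical, not part of any VMware tool.)

          # Convert a summed ready-time counter (ms) into %Ready for one sample window.
          def ready_percent(ready_ms, sample_ms=20000):
              # vCenter real-time samples cover 20,000 ms.
              return 100.0 * ready_ms / sample_ms

          print(ready_percent(1000))   # 1,000 ms ready in a 20 s sample -> 5.0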

    • vscsiStats is already available on ESXi: http://vpivot.com/2009/10/21/vscsistats-for-esxi. I am not sure that our performance tools get “support,” per se. But they work and can be used on ESXi.

      Scott
