Tuesday, October 5, 2010

ESXTOP - a Swiss Army knife for ESX performance analysis

ESXTOP is an excellent tool for analysing VM performance on an ESX server. Based on my analysis, I am posting information about some of the ESXTOP parameters that help identify which resource is degrading the performance of the VMs.

History behind this post:


Some time ago, developers working for one of my clients started complaining about performance issues they were facing with the desktop VMs allotted to them. At that time we weren't equipped with any product that could do performance analysis apart from the VIC performance charts, and the charts weren't giving me enough confidence to prove the cause of the performance bottlenecks. During this period I had to play around with the ESXTOP command, and the performance analysis it gave was so accurate that we could pinpoint exactly which resources (CPU and memory) were degrading the VMs' performance and take the necessary steps.

Virtual Infrastructure:
--> VirtualCenter 2.5
--> ESX 3.5


Analysis:

After running the ESXTOP command on the ESX server, switch to the resource view you want to check (CPU, memory, disk or network) and type 'f' to add/remove the fields (parameters) mentioned below.
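As a quick reference, this is roughly how I drive it from the service console (the single-key commands are all listed on esxtop's built-in help screen, so treat this as a sketch rather than the full list):

esxtop
(then press: c = CPU, m = memory, d = disk, n = network, f = add/remove fields, q = quit)

For longer troubleshooting sessions there is also a batch mode that writes every sample to a CSV file, which you can later open in Windows perfmon or Excel:

esxtop -b -d 5 -n 60 > esxtop-capture.csv
(batch mode: one sample every 5 seconds, 60 samples; the file name esxtop-capture.csv is just my own choice, and I refer to it again in the sections below)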
--> For CPU related:


The most important counter to check is %RDY (CPU ready time). %RDY should stay around 5%, but if it crosses 10% you had better press the panic button and start finding out what is eating the resources. Usually the cause is low availability of CPU resources on the host, which can be overcome by migrating some VMs from this host to another and then checking %RDY again.


%USED displays the percentage of physical CPU cycles the VM is actually consuming. This should be compared with %RDY: watch out for a VM whose %USED sits near 1% while its %RDY climbs, because that VM is spending its time waiting for a CPU rather than doing work; ideally %RDY stays within its expected values.

A high %RDY can also be brought down by reducing the VM count on that host and by giving the affected VMs a CPU reservation (a quick way to trend %RDY from a batch capture is sketched at the end of this CPU section).


Sometimes we also need to look at the PCPU values, which show the utilization of the physical processors. The utilization values should be roughly the same across all processors.

If they are not the same, that is -

> If the PCPU values are all very high, it means the host itself is heavily utilized - migrate some VMs off this host and check whether the PCPU values have come down (high host utilization will definitely reduce VM performance). The values should ideally sit somewhere between 30-65%.

> If the PCPU values differ widely, for example one processor showing 100% utilization while another shows 10%, it means some VMs are consuming the whole CPU allocated to them. This may be eased by allocating multiple vCPUs to those VMs, but this kind of situation is quite rare.
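If you would rather trend %RDY (or %USED) over time than stare at the live screen, the batch capture mentioned earlier can be sliced from the service console. A rough sketch - it assumes the capture is in esxtop-capture.csv and that the ready-time column label contains the text "% Ready"; check your own header first:

head -1 esxtop-capture.csv | tr ',' '\n' | grep -n "% Ready"
cut -d ',' -f 1,137 esxtop-capture.csv

The first command prints the column numbers of all ready-time columns; the 137 in the second command is only a placeholder for the number grep gives you for the VM you care about (field 1 is the sample timestamp).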

--> For RAM related:

From the VM's perspective, RAM utilization depends mostly on the applications running inside the VMs, but sometimes the host has trouble providing RAM to the VMs, which can be checked with the parameters below.

Below are the expansions of the RAM-related abbreviations shown for the ESX host.
PMEM – physical memory on the host;
VMKMEM – VMkernel memory;
COSMEM – ESX server service console memory;
PSHARE – ESX server page-sharing statistics;
SWAP – ESX server swap space;
MEMCTL – memory balloon driver;
MEMSZ – total memory allocated to the VM;


The memory that is actively in use by the guest operating system and its applications is reported in the touched (TCHD) and active counters (mainly %ACTV, %ACTVS, %ACTVF).
TCHD (which shouldn't cross roughly 60% of MEMSZ) and the %ACTV counters should be as low as possible: the higher they are, the more ESX memory the VMs try to consume, and if that memory is not available the VMs start falling back on their swap files (.vswp), which lowers performance.


The MCTL parameter shows whether the balloon driver is active; it will be 'Y' if it is.
MCTLSZ gives the amount of memory the balloon driver has reclaimed from a specific guest operating system. This memory is reclaimed from less RAM-intensive VMs, and the ESX server uses ballooning before resorting to the last option - the VM's swap file (.vswp). This value should be near 0.

Check the swap write (SWW/s) and swap read (SWR/s) counters. These need to be near 0; if they show a significant MB/s, the VM is not getting enough physical RAM from the host (the batch-capture sketch at the end of this section shows one way to pull these columns out over time).

The memory overhead required to run each virtual machine is displayed by the OVHD counter. It depends on the VM's memory size, number of vCPUs, and whether the guest OS is 32- or 64-bit. It is good to keep these values as low as possible.
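The same batch-capture trick works for the memory counters; I just grep the header for the balloon and swap columns instead. The label strings below are guesses to adjust against your own header, and the column numbers in the cut command are placeholders:

head -1 esxtop-capture.csv | tr ',' '\n' | grep -ni -e "memctl" -e "swap"
cut -d ',' -f 1,201,202 esxtop-capture.csv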

--> For Disk related:

QUED - the number of queued commands that the host will process after the commands in the ACTV column have finished. This should be as low as possible.

LOAD - the ratio of the number of commands that are active or queued to the total number of commands that can be active or queued at one time. This should be near 0.0.

%USD - the percentage of the queue depth used by VMkernel active commands. The threshold should be around 0-10%.

GAVG/cmd - the total latency seen from the virtual machine down to the array; it is the sum of the device/hardware latency (DAVG) and the kernel latency (KAVG). A GAVG of around 0-25 ms is fine (this varies between local storage and SAN). A quick worked example follows this list.

ABRTS/s - aborted commands per second; this should be near 0.
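A quick sanity check I use on the latency numbers: GAVG/cmd should be roughly DAVG/cmd plus KAVG/cmd. For example, a LUN showing a DAVG of 20 ms and a KAVG of 2 ms gives a GAVG of about 22 ms, which is still inside the 0-25 range above; but if KAVG alone climbs to several milliseconds, the commands are queuing inside the VMkernel (often a queue depth problem) rather than waiting on the array.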

--> For Network related:

The things to check are the transmitted data rate (MbTX/s) and the received data rate (MbRX/s). These values will be high when network usage is high.
Apart from those values, when usage reaches saturation level you may observe some packets being dropped while sending and receiving. Dropped packets can be checked using %DRPTX (transmit) and %DRPRX (receive); their values should be 0 or very close to it.
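These drop counters can also be pulled out of a batch capture by grepping the header the same way as in the CPU section (the label string is a guess, so verify it against your own header):

head -1 esxtop-capture.csv | tr ',' '\n' | grep -ni "drop"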

For more info on the other parameters and on how to use ESXTOP, check the links below -
http://www.yellow-bricks.com/esxtop/
http://communities.vmware.com/docs/DOC-5240

Happy Virtualizing....

Note: These are my views... :-)

