Over at the VMware performance blog, VROOM!, a new published report appeared showing off the company's latest storage scalability experiments. The hardware setup included 64 blades running VMware ESX Server 3.5, connected to a storage array via 2Gbps Fibre Channel links. All hosts shared a single VMFS volume, and virtual machines running IOmeter generated a heavy I/O load to that one volume. The queue depth for the Fibre Channel HBA was set to 32 on each ESX Server host.
ESX Server enables multiple hosts to reliably share the same physical storage through its highly optimized storage stack and the VMFS file system. There are many benefits to a shared storage infrastructure, such as consolidation and live migration, but people commonly wonder about performance. While it is always desirable to squeeze the most performance out of the storage system, care should be taken not to severely over-commit the available resources, which can lead to performance degradation. Specifically, the primary factors that affect the shared storage performance of an ESX Server cluster are as follows:
1. The number of outstanding SCSI commands going to a shared LUN
SCSI allows multiple commands to be active on a link, and SCSI drivers support a configurable parameter called “queue depth” to control this. The maximum supported value is most commonly 256. For an I/O group (ESX Server(s) – LUN), it is important that the number of active SCSI commands does not exceed this value, otherwise the commands will get queued. Excessive queuing leads to increased latencies and potentially a drop in throughput. The number of commands queued per ESX Server host can be derived using the esxtop command.
2. SCSI reservations
VMFS is a clustered file system and uses SCSI reservations to implement on-disk locks. Administrative operations, such as creating/deleting a virtual disk, extending a VMFS volume, or creating/deleting snapshots, result in metadata updates to the file system using locks, and hence result in SCSI reservations. A reservation causes the LUN to be available exclusively to a single ESX Server host for a brief period of time. It is therefore preferable that administrators perform the above-mentioned operations during off-peak hours, especially if there will be many of them.
3. Storage device capabilities
The capabilities of the storage array play a role in how well performance scales with multiple ESX Servers. The capabilities include the maximum LUN queue depth, the cache size, the number of sequential streams, and other vendor-specific enhancements. Our results have shown that most modern Fibre Channel storage arrays have enough capacity to provide good performance in an ESX Server cluster.
If you are using iSCSI or NFS, this comparison paper nicely outlines how ESX Server can efficiently make use of the full Ethernet link speed for most block sizes.
Read the entire VROOM! post, here.