Virtualization Technology News and Information
Virtualization and Storage: Five Best Practices to Ensure Success
Today many companies are asking themselves “the big question” about server virtualization: Are we ready for full production?

Even though server virtualization is a mainstream technology today and is used in a growing number of ways, there is still some trepidation about implementing it for mission-critical production applications. Many IT organizations use virtualization extensively for development and test applications, or for second-tier requirements, and have experienced tremendous benefits. Now, there is growing interest in taking virtualization technologies like VMware to the next level, as key infrastructure elements in more business-sensitive applications. In a 2007 survey by The Info Pro, 83% of Fortune 1000 IT managers surveyed responded “Yes” to the question, “Do you consider virtualization to be the next enterprise IT server platform?” The real question companies should be asking is not whether they will use virtualization for mainstream applications, but what they should do to ensure its success.

Server Virtualization and Storage
In today’s production data center, networked storage is at the center of everything, and it is one of the critical elements to consider when rolling out applications on VMware or any other platform. In development and test environments, VMware applications typically run on local disk or direct-attached RAID storage. In production data centers, virtual machines (VMs) need to work with enterprise-class SAN or NAS — an infrastructure that is shared across a range of applications and workloads.

Storage is ultimately about input/output (I/O), and it is important to make sure that the I/O workloads required by the server domain can be handled by every element of the storage domain, including the host bus adapter (HBA), the storage fabric, and the storage array. One of the main causes of poor performance in both virtualized and traditional server environments is a mismatch between the front end and the back end. Frequently, contention for shared storage resources such as RAID groups causes I/O bottlenecks that result in queuing backlogs and poor end-to-end response time. Ultimately, this impacts the business application and the end user.

In a virtual world, this scenario is more complex. While it may be possible to run more virtual machines on a given server, it is also possible to go too far and overload the storage layer, with negative unintended consequences. One of the most important things to understand is that enterprise storage is increasingly virtualized along with servers, since it, too, is a pooled, shared resource. A key to ensuring VMware application success in these environments is making sure that front-end workloads are matched to back-end storage capabilities, and monitoring these relationships consistently.

Best Practices
Here are five best practices for ensuring successful VMware projects on enterprise-class storage:
  • Establish a “cross-domain” management orientation.
  • Use “infrastructure response time” as a key metric.
  • When there are VMware performance issues that are difficult to diagnose, look for contention and contention-based latency in the storage layer.
  • Strive for “best fit” of workloads to storage resources.
  • Work toward infrastructure performance service-level agreements (SLAs).

Establish a “Cross-Domain” Management Orientation
Virtualized servers and networked storage are both pooled resources, and in the data center they do not operate independently. Optimal performance only occurs when the server and storage domains interoperate in a load-balanced way. Unfortunately, most of the management software in today’s data center was designed to manage a single element — there are separate tools for managing servers, storage, networks, and so on. Today, new “cross-domain” management tools are emerging that can provide visibility across the data center and help find the root cause when things go wrong. When deploying virtual infrastructure, it is important to deploy management software that can monitor and manage across the entire infrastructure.

Another cross-domain challenge is people. The typical IT management team is still organized functionally, around the technology domains. There are server teams, network teams, storage teams, and so forth. These groups of people each have their own management software tools as discussed earlier, and as a result, they sometimes understand the environment very differently. In some companies they don’t even like each other very much. When things go wrong, such as when there is an application performance problem, it’s not uncommon for the different groups to play the “blame game.” This sort of organizational model is a legacy of the “one-application-per-server” model, but in a virtual resource world where storage is a shared utility, this silo-oriented approach doesn’t work. In addition to ensuring cross-domain management tools are implemented, it is important to have a cross-domain IT management team that can provide an end-to-end service level and respond productively when there are availability or performance problems.

Use “Infrastructure Response Time” as a Key Metric
Getting people to work together is easier said than done, especially when they have expertise in different areas. That’s where metrics come in. If there are a few key metrics that each of the different IT teams can monitor and understand in the same way, people tend to be more aligned. “Infrastructure Response Time” (IRT) is one of these metrics. In the typical VMware case, IRT measures response time from the VM through the server and HBA all the way to the spinning disk, and back. When IRT increases for a given application, it is a good indicator that something has changed. This may be the result of increased workload, but it also may mean that something is wrong. If something is wrong, IRT can be used to find out what the problem is. Because response time is a function not only of raw throughput but also of queuing or “wait time” at the component level, drilling down into IRT can help identify where the I/O bottlenecks are.
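The drill-down described above can be sketched in a few lines: sum the per-component latencies along the I/O path to get an end-to-end IRT, then find the component contributing the most. The component names and numbers below are hypothetical examples, not output from any real tool:

```python
# Sketch: decompose end-to-end Infrastructure Response Time (IRT) into
# per-component latencies and find the dominant contributor.
# Component names and latencies are hypothetical examples.

def worst_component(latencies_ms: dict) -> tuple:
    """Return (component, latency) with the largest share of IRT."""
    name = max(latencies_ms, key=latencies_ms.get)
    return name, latencies_ms[name]

irt_sample = {          # one I/O path, milliseconds
    "vm_guest":   0.4,
    "hba":        0.3,
    "fabric":     0.5,
    "array_port": 0.8,
    "raid_group": 11.0,  # queuing backlog shows up here
}

total_irt = sum(irt_sample.values())
bottleneck, worst = worst_component(irt_sample)
print(f"IRT = {total_irt:.1f} ms; bottleneck: {bottleneck} ({worst:.1f} ms)")
```

In this made-up sample, the HBA and fabric are healthy and nearly all of the 13 ms IRT is wait time at one RAID group, which is exactly the pattern the next section describes.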

Look for Contention and Contention-Based Latency in the Storage Layer
Companies experiencing persistent performance problems in virtualized environments often find the root cause to be contention between virtual servers and a shared storage resource. An increase in the IRT metric flags these issues, and the culprit is very often a RAID group or HBA that is being overwhelmed by I/O workloads. VMware administrators with new production implementations often find themselves “guilty until proven innocent” when there are performance problems. In the absence of effective tools and practices for root cause analysis, their virtual servers are likely to get blamed. Frequently, after implementing a cross-domain solution, VMware administrators discover the real causes to be I/O workload mismatches, such as too many VMs writing to a RAID 5 group. Once these sources of contention are found and workloads rebalanced, many performance problems can be solved or avoided altogether.
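The "too many VMs writing to a RAID 5 group" case can be made concrete with a back-of-the-envelope check: a random write to RAID 5 costs roughly four back-end I/Os (read data, read parity, write data, write parity), so aggregate VM demand should be translated into back-end IOPS before comparing it to what the group can sustain. All numbers below are illustrative assumptions:

```python
# Sketch: flag a shared RAID 5 group whose aggregate VM workload exceeds
# its sustainable back-end IOPS. A random write to RAID 5 costs roughly
# 4 back-end I/Os. All capacities and workloads here are hypothetical.

RAID5_WRITE_PENALTY = 4

def backend_iops(read_iops: float, write_iops: float) -> float:
    """Translate front-end reads/writes into back-end disk I/Os."""
    return read_iops + write_iops * RAID5_WRITE_PENALTY

def overloaded(vm_workloads, group_capacity_iops: float) -> bool:
    """True if combined VM demand exceeds the group's capability."""
    demand = sum(backend_iops(r, w) for r, w in vm_workloads)
    return demand > group_capacity_iops

# Three VMs sharing one RAID 5 group rated for ~1200 back-end IOPS:
vms = [(200, 100), (150, 80), (100, 120)]  # (reads/s, writes/s) per VM
print("overloaded" if overloaded(vms, 1200) else "ok")
```

The write penalty is why a workload that looks modest from the server side — 450 reads/s and 300 writes/s here — can still swamp the back end, and why rebalancing or moving write-heavy VMs to RAID 1/0 often resolves the problem.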

Strive for “Best Fit” of Workloads to Storage Resources
Not all storage is created equal. The RAID 5 case mentioned above is just one example. In enterprise storage, workloads can be affected by many factors, including disk type (e.g., Fibre Channel, SATA), protocol (e.g., iSCSI, NAS), protection level (e.g., RAID 1/0, RAID 5), the number of disks per group, striping, and so forth. When setting up new VM applications on enterprise storage, virtualization project owners should work with a qualified storage administrator to identify and provision the right class of storage for the given workload. Many performance problems are created when storage is provisioned for a new application based only on available capacity and not on workload.
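The "best fit" idea can be sketched as a simple decision rule mapping a workload profile to a storage class. The tiers and thresholds below are hypothetical rules of thumb — real provisioning should be done with a storage administrator against the actual array's offerings:

```python
# Sketch: map a workload profile to a storage class.
# Tiers and thresholds are hypothetical rules of thumb, not a
# substitute for working with a qualified storage administrator.

def recommend_tier(random_pct: float, write_pct: float) -> str:
    """Crude workload-to-storage matching by access pattern."""
    if random_pct > 60 and write_pct > 40:
        return "FC disks, RAID 1/0"   # random-write heavy (e.g. OLTP)
    if random_pct > 60:
        return "FC disks, RAID 5"     # random-read heavy
    return "SATA disks, RAID 5"       # sequential or archival

print(recommend_tier(random_pct=80, write_pct=60))
```

Even a crude rule like this captures the key point: provisioning by access pattern rather than by available capacity alone avoids the most common self-inflicted performance problems.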

Work Toward Infrastructure Performance SLAs
Typically, the only metric in service-level agreements between application owners and storage administrators is capacity. That is, there is a commitment that a specific maximum storage capacity will be available for the application. If there are any other elements in the SLA, they typically have to do with availability — how frequently the data will be backed up, or the maximum amount of system downtime that can be expected per year. Companies rarely use performance SLAs, likely for many of the reasons discussed previously. Many companies do not measure performance in a consistent way, especially with newer virtualization technologies. Few have good cross-domain coordination or an agreed set of metrics for monitoring and managing performance expectations. SLAs cannot happen without good metrics that people buy into. If the right metrics are put in place, with tools and processes to help measure and manage them, then performance SLAs can be established relatively easily.
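Once a metric like IRT is measured consistently, a performance SLA becomes a simple check against an agreed threshold. The sketch below tests measured IRT samples against a hypothetical 95th-percentile ceiling; the samples and the limit are invented for illustration:

```python
# Sketch: check measured IRT samples against a hypothetical performance
# SLA expressed as a 95th-percentile response-time ceiling.
import math

def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def meets_sla(irt_samples_ms, p95_limit_ms: float) -> bool:
    return percentile(irt_samples_ms, 95) <= p95_limit_ms

samples = [4, 5, 5, 6, 6, 7, 7, 8, 9, 30]  # one queuing spike
print("SLA met" if meets_sla(samples, 20) else "SLA violated")
```

A percentile-based target is deliberately chosen here over an average: a single queuing spike like the 30 ms sample above barely moves the mean, but it is exactly what end users feel and what a performance SLA should catch.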

As server and storage virtualization move into production mode, these best practices are necessary to ensure successful deployments that optimize performance and utilization. Once properly implemented, virtualization has the potential to make IT a service-oriented utility – an idea that has been talked about for many years but rarely achieved – a reality.

Read the original article at Virtual Strategy Magazine.

Published Tuesday, August 28, 2007 10:23 PM by David Marshall