AMD Best Practices Series. A Contributed Series by AMD
VMware vSphere Performance Considerations for NUMA is written by Ruben Soto, Field Application Engineer at AMD
One of the questions I get asked frequently is, “Do I have
to do anything unique or special to VMware® vSphere™ to
optimize performance for the AMD Opteron™ processor?” The brief
answer is no, not normally. “Out of the box”, vSphere has the capability to lay
out a multi-VM structure that delivers excellent performance across the vast
majority of workloads. VMware’s engineering teams do a superb job of
collaborating with AMD’s design teams to assess what needs to be done to vSphere’s
kernel to account for new features implemented in newer processors.
You’ll notice I stated “vast majority”, implying there are
unusual cases where the “out of the box” experience is not optimal. As vSphere
deployments have matured and progressed, clients are achieving a level of
comfort with the technology where bolder, more complex VMs are being
implemented. For example, vSphere 4.1 increases the maximum size of a vSMP to eight (8) virtual CPUs (vCPUs). Complex workloads, such as SQL Server
and SAP, are now feasible to virtualize. The AMD Opteron 6000 Series platform’s multi-die
design has introduced the notion of ‘intra-processor’ and ‘inter-processor’
NUMA, which has a major influence on how to craft vSMPs in a manner that works
with, rather than against, this topology. Another vSphere capability that is being
exploited more often is the idea of “overcommitting” resources, e.g. provisioning more
vCPUs than there are physical cores. How, when, and for which workloads overcommitment
pays off requires a thorough understanding before committing to it.
As one of AMD’s Field Application Engineers, I was recently
called in to help a customer who was experiencing a performance issue.
The customer was running an AMD Opteron™ 6000 Series-based server with 12 cores
per processor but, unfortunately, was seeing poorer performance, reported as
excessive latency, than both older AMD Opteron processor-based servers and a
current Intel-based server. After a little research we found a couple of
interesting data points:
1. The VMs were being migrated from node to node, causing the VM and its data to become separated (remote) for periods of time.
2. As a result, the vSMPs were being split across dies as part of VM rebalancing, resulting in sub-optimal cache utilization and data sharing across the vSMP’s vCPUs that need to communicate.
Understanding how to begin an analysis of AMD Opteron
processor performance anomalies requires an understanding of (a) how vSphere
initially ‘carves up’ and lays out a large vSMP VM in a multi-socket server,
(b) the role of the vSphere NUMA scheduler (NUMAsched), and (c) how
vSphere attempts to maintain proper workload distribution to optimize
performance.
VM Initial Placement
vSphere utilizes the same BIOS-provided ACPI structures that any
bare-metal OS would use to get a “lay of the land” picture of the underlying
host server, i.e. the SRAT and SLIT tables. It’s from these structures that vSphere
creates a ‘mapping’ of the host server, defining NUMA node boundaries. The
vSphere NUMAsched uses a round-robin technique for vCPU placement but also
makes every attempt to keep process and data co-resident on the same NUMA node.
This is a desirable condition, as it reduces remote memory access latency and
promotes cache sharing. This is an example of VMware’s software engineering
strength and collaboration with AMD.
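To make that concrete, here is a minimal Python sketch of the “home node” idea: the host is modeled as a set of NUMA nodes (the kind of map vSphere derives from the SRAT/SLIT tables), and a vSMP is given a single node that can hold both its vCPUs and its memory. This is my own illustration, not VMware code, and the node sizes and the VM definition are assumptions chosen to resemble an AMD Opteron 6100-based host.

```python
# Illustrative sketch only -- not VMware code. It models the idea of a
# "home node": keep a vSMP's vCPUs and memory on one NUMA node when it fits.

from dataclasses import dataclass

@dataclass
class NumaNode:
    node_id: int
    cores: int        # physical cores in this node (one Opteron 6100 die = 6)
    memory_gb: int    # memory attached to this node's controller

@dataclass
class VM:
    name: str
    vcpus: int
    memory_gb: int

def pick_home_node(nodes, vm):
    """Return a node that can hold both the VM's vCPUs and its memory,
    i.e. keep process and data co-resident, or None if no node fits."""
    for node in nodes:
        if vm.vcpus <= node.cores and vm.memory_gb <= node.memory_gb:
            return node
    return None

# Hypothetical 2-socket Opteron 6100 host seen as four 6-core NUMA nodes.
host = [NumaNode(i, cores=6, memory_gb=16) for i in range(4)]

vm = VM("sql01", vcpus=4, memory_gb=12)
home = pick_home_node(host, vm)
print(f"{vm.name}: home node {home.node_id if home else 'none (Wide VM)'}")
```

In reality NUMAsched also round-robins initial placements and weighs node load; the sketch captures only the co-residency preference.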
But it’s also here where the first potential issue may
arise. As I stated earlier, vSphere 4.1 allows a vSMP VM to be up to eight (8)
vCPUs. The internal structure of the AMD Opteron 6000 Series platform is two
(2) 6-core dies per package, each with its own memory controller. In this scenario, when
creating a vSMP with more vCPUs than a single die/NUMA node has cores, vSphere will abandon its
NUMA-boundary-biased algorithm in favor of equally distributing all vCPUs
across all available NUMA nodes. This is referred to as a “Wide VM”. The
illustration below captures this. The exact number of cores per die is irrelevant to
this scenario; what matters is that the vSMP is larger than a single NUMA node.

(Illustration courtesy of Frank Denneman)
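A small, self-contained sketch of this decision as described above, assuming 6-core nodes; it models the split across all available nodes described in this article, not VMware’s actual internals:

```python
# Illustration only: deciding between a single home node and a "Wide VM",
# following the behavior described above.

def initial_placement(vcpus, cores_per_node, num_nodes):
    """Fit the vSMP on one node if possible; otherwise deal the vCPUs
    out round-robin across all available NUMA nodes (a Wide VM)."""
    if vcpus <= cores_per_node:
        return "fits on a single NUMA node; memory stays local"
    split = {node: [] for node in range(num_nodes)}
    for vcpu in range(vcpus):
        split[vcpu % num_nodes].append(vcpu)
    return split

# An 8-vCPU vSMP on 6-core Opteron 6100 dies cannot fit on one node:
print(initial_placement(8, cores_per_node=6, num_nodes=4))
# -> {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(initial_placement(4, cores_per_node=6, num_nodes=4))
# -> fits on a single NUMA node; memory stays local
```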
Potential side effects include:
- NUMAsched will place data at any position it deems appropriate, with no consideration of process placement, potentially causing performance issues due to remote memory accesses.
- NUMAsched optimizations are disabled since there is no possibility of consolidating the vSMP onto one NUMA node.
- The situation is more acute for Hyper-Threaded processors since HT is not taken into consideration during initial placement, i.e. the logical cores are treated as if they don't exist.
vSphere Workload Rebalancing
While vSphere will attempt to keep process and data within the same package, this is not guaranteed. Another potential issue introduced with larger vSMPs is that the VM guest OS expects all of its vCPUs to execute in close synchronicity. If the skew in execution becomes too large, the guest OS may incur an unrecoverable fault. VMware revamped the NUMAsched in vSphere 4.1 to account for this condition and to further optimize workload distribution via a "relaxed co-scheduling" algorithm.
Every two (2) seconds, vSphere takes an accounting of resource utilization, vSMP synchronicity, and VM Reservation and Entitlement status, and takes any actions it deems necessary to "rebalance" workloads. During this rebalancing, it is possible that a running process will be moved to a NUMA node separate from its data, causing a performance problem due to remote memory access. vSphere will eventually rebalance the data portion as well, but there will be a performance penalty in the meantime.
Again, this situation is more acute on Hyper-Threaded processors, as vSphere enumerates physical and logical cores in sequential fashion. In a rebalance, it is possible that two independent VMs may be co-located on the same physical core, causing an overcommitted CPU condition.
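The cadence and its side effect can be pictured with another simplified sketch (again my own illustration, with an assumed "cpu_ready_ms" pressure metric and threshold, not VMware's implementation): each pass may pick a new home node for a client, while its memory stays behind until page migration catches up.

```python
import time

REBALANCE_INTERVAL_S = 2      # the accounting interval described above

def rebalance_pass(clients, nodes):
    """One sketch pass: inspect each client's pressure (an assumed
    cpu_ready_ms metric) and, if it is high, pick the least-loaded
    node as its new home node. Its memory is NOT moved in the same step."""
    for client in clients:
        if client["cpu_ready_ms"] > 50:              # assumed threshold
            client["home_node"] = min(nodes, key=lambda n: n["load"])["id"]
            # client["memory_node"] stays put; page migration comes later,
            # so accesses are remote until then.

def scheduler_loop(clients, nodes, passes=1):
    for _ in range(passes):
        rebalance_pass(clients, nodes)
        time.sleep(REBALANCE_INTERVAL_S)

nodes = [{"id": 0, "load": 0.9}, {"id": 1, "load": 0.2}]
clients = [{"name": "vm1", "cpu_ready_ms": 120, "home_node": 0, "memory_node": 0}]
scheduler_loop(clients, nodes)
print(clients)   # vm1's home node is now 1, but its memory is still on node 0
```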
Important vSMP Deployment Considerations
In the limited scenarios I described above, the following practices will help address these issues (see the sketch after this list):
- Craft vSMPs so they do not exceed the physical core count of the die/NUMA Node.
- "Encourage" VSphere to keep all vCPU siblings in a vSMP on the same NUMA Node:
- "sched.cpu.vsmpConsolidate=true"
- For Benchmarking, perhaps Rebalancing is not necessary:
- Disable via "Numa.RebalanceEnable=0"
- Disable excessive Page Migrations via /Numa/PageMigEnable="0"
- CPU/Memory Affinity, an option for managing VM placement, is a capability to be used with extreme caution. There are many pitfalls, especially if used in a DRS cluster:
- VMotions are disallowed
- The affected VM is treated as a NON-NUMA client and gets excluded from NUMA scheduling
- Affinity does not equal ‘isolation'. The VMkernel Scheduler will still attempt to schedule other VMs on that core.
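To tie these recommendations together, here is a small illustrative helper. The setting names are the ones quoted in the list above, the 6-cores-per-node figure is an assumption for an Opteron 6100 die, and this is a checklist aid rather than an official VMware or AMD tool; always confirm the settings against the vSphere documentation for your release.

```python
# Illustrative checklist helper -- not an official VMware or AMD tool.

CORES_PER_NUMA_NODE = 6   # one AMD Opteron 6100 die; adjust for your host

# Per-VM advanced setting mentioned above (applied in the VM's configuration).
VM_SETTINGS = {"sched.cpu.vsmpConsolidate": "true"}

# Host-level advanced settings mentioned above for benchmarking scenarios.
HOST_BENCHMARK_SETTINGS = {
    "Numa.RebalanceEnable": "0",   # disable NUMA rebalancing
    "Numa.PageMigEnable": "0",     # disable page migrations
}

def check_vsmp(vcpus, cores_per_node=CORES_PER_NUMA_NODE):
    """Warn when a planned vSMP would spill over a single NUMA node."""
    if vcpus > cores_per_node:
        print(f"{vcpus} vCPUs > {cores_per_node} cores per node: "
              "this will be a Wide VM; consider sizing it down.")
    else:
        print(f"{vcpus} vCPUs fit within one NUMA node.")

check_vsmp(8)   # -> Wide VM warning on a 6-core-per-node host
check_vsmp(6)   # -> fits
```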
Resources
http://www.vmware.com/files/pdf/techpaper/VMW_vSphere41_cpu_schedule_ESX.pdf
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf
http://www.vmware.com/pdf/vSphere4/r41/vsp_41_resource_mgmt.pdf
The AMD Cloud Computing Blog can be found here.
###
Ruben Soto is a Field Application Engineer at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.