Virtualization Technology News and Information
AMD Best Practices Series: VMware vSphere Performance Considerations for NUMA

AMD Best Practices Series. A Contributed Series by AMD


VMware vSphere Performance Considerations for NUMA is written by Ruben Soto, Field Application Engineer at AMD

One of the questions I get asked frequently is, “Do I have to do anything unique or special to VMware® vSphere™ to optimize performance for the AMD Opteron™ processor?” The brief answer is no, not normally. “Out of the box,” vSphere can lay out a multi-VM structure that delivers excellent performance across the vast majority of workloads. VMware’s engineering teams do a superb job of collaborating with AMD’s design teams to assess what needs to be done to vSphere’s kernel to account for new features implemented in newer processors.

You’ll notice I stated “vast majority”, implying there are unusual cases where the “out of the box” experience is not optimal. As vSphere deployments have matured, clients have become comfortable enough with the technology to implement bolder, more complex VMs. For example, vSphere 4.1 increases the maximum size of a vSMP VM to eight (8) virtual CPUs (vCPUs). Complex workloads, such as SQL Server and SAP, are now feasible. The multi-die design of the AMD Opteron 6000 Series platform has introduced the notion of ‘intra-processor’ and ‘inter-processor’ NUMA. This has a major influence on how to craft vSMPs so that they do not conflict with the processor's topology. Another vSphere capability that is being exploited more often is “overcommitting” resources, e.g. configuring more vCPUs than physical cores. How, when, and for which workloads to use this capability requires a thorough understanding before committing to it.

As one of AMD’s Field Application Engineers, I was recently called in to help a customer that was experiencing a performance issue. The customer was running an AMD Opteron™ 6000 Series-based server with 12 cores per processor but was seeing poorer performance, reported as excessive latency, than on both older AMD Opteron processor-based servers and a current Intel-based server. After a little research we found a couple of interesting data points:

1. The VMs were being migrated from node to node, causing the VM and its data to become separated (remote) for periods of time

2. As a result, the vSMPs were being split across dies as part of VM rebalancing, resulting in non-optimal cache utilization and data sharing among the vCPUs of a vSMP that need to communicate

Beginning an analysis of AMD Opteron processor performance anomalies requires an understanding of (a) how vSphere initially ‘carves up’ and lays out a large vSMP VM in a multi-socket server, (b) the role of the vSphere NUMAsched(uler), and (c) how vSphere attempts to maintain proper workload distribution to optimize performance.

VM Initial Placement

vSphere utilizes the same BIOS structures that any bare-metal OS would use to get a “lay of the land” picture of the underlying host server, i.e. the ACPI SRAT and SLIT tables. It’s from these structures that vSphere creates a ‘mapping’ of the host server, defining NUMA node boundaries. The vSphere NUMAsched uses a round-robin technique for vCPU placement but also makes every attempt to keep process and data co-resident on the same NUMA node. This is a desirable condition, as it reduces remote memory access latency and promotes cache sharing. This is an example of VMware’s software engineering strength and collaboration with AMD.
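The round-robin placement described above can be pictured with a toy model. This is only a sketch, not vSphere's actual scheduler: the function name `assign_home_nodes`, the VM names, and the two-node host are hypothetical, and the model simply assumes that a VM's vCPUs and memory both live on the node it is assigned to.

```python
from itertools import cycle


def assign_home_nodes(vm_names, numa_nodes):
    """Give each VM a home NUMA node, cycling through the nodes in order.

    Toy model (an assumption, not vSphere code): a VM's vCPUs and its
    memory are both taken to reside on the assigned home node.
    """
    if not numa_nodes:
        raise ValueError("host must expose at least one NUMA node")
    node_cycle = cycle(numa_nodes)
    return {vm: next(node_cycle) for vm in vm_names}


# Hypothetical host with two NUMA nodes and three VMs powered on in order.
placement = assign_home_nodes(["vm-a", "vm-b", "vm-c"], [0, 1])
print(placement)
```

With two nodes, the third VM wraps around to the first node again, which is the property that spreads load evenly while each VM stays local to its own memory.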

But it’s also here where the first potential issue may arise. As I stated earlier, vSphere 4.1 allows a vSMP VM to be up to eight (8) vCPUs. Internally, each 12-core AMD Opteron 6000 Series processor consists of two (2) 6-core dies, each with its own memory controller, so an 8-vCPU vSMP cannot fit within a single 6-core NUMA node. In this scenario, when creating a vSMP up to the maximum vCPU level, vSphere will abandon its NUMA-boundary-biased algorithm in favor of equally distributing all vCPUs across all available NUMA nodes. This is referred to as a “Wide VM”. The illustration below captures this. The number of cores per die is irrelevant for this scenario.


Illustration courtesy of Frank Denneman


Potential side effects include:

  • NUMAsched will place data at any position it deems appropriate, with no consideration of process placement, potentially causing performance issues due to remote memory accesses.
  • NUMAsched optimizations are disabled since there is no possibility of consolidating the vSMP onto one NUMA node.
  • The situation is more acute for HyperThreaded processors since HT is not taken into consideration during initial placement, i.e. the logical cores are treated as if they don't exist.
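To make the “Wide VM” behavior concrete, here is a minimal sketch of the decision described above: a vSMP that fits within one NUMA node stays on it, while a larger one has its vCPUs spread equally over all nodes. The function `split_vcpus` and the four-node host (two sockets with two dies each) are assumptions for illustration only.

```python
def split_vcpus(num_vcpus, cores_per_node, num_nodes):
    """Return how many vCPUs land on each NUMA node in this toy model."""
    if num_vcpus > cores_per_node * num_nodes:
        raise ValueError("more vCPUs than physical cores on the host")
    if num_vcpus <= cores_per_node:
        # The vSMP fits inside one NUMA node: keep all siblings together.
        return [num_vcpus] + [0] * (num_nodes - 1)
    # "Wide VM": distribute the vCPUs as evenly as possible over all nodes.
    base, extra = divmod(num_vcpus, num_nodes)
    return [base + (1 if i < extra else 0) for i in range(num_nodes)]


# Hypothetical two-socket host: 2 dies per socket, 6 cores per die.
print(split_vcpus(4, cores_per_node=6, num_nodes=4))  # fits on one node
print(split_vcpus(8, cores_per_node=6, num_nodes=4))  # becomes a Wide VM
```

In this model the 8-vCPU VM ends up with two vCPUs on every node, so no single node can hold both the vCPUs and all of the memory they touch.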

vSphere Workload Rebalancing

While vSphere will attempt to keep process and data within the same package, this is not guaranteed. Another potential issue introduced with larger vSMPs is that the VM guest OS expects all the vCPUs to execute in close synchronicity. If the time difference in execution becomes too large, the guest OS may incur an unrecoverable fault. vSphere revamped the NUMAsched in version 4.1 to account for this condition and to further optimize workload distribution via a "relaxed co-scheduling" algorithm.

Every two (2) seconds, vSphere takes an accounting of resource utilization, vSMP synchronicity, and VM Reservations and Entitlements status, and takes any actions it deems necessary to "rebalance" workloads. During this rebalancing, a running process may be moved to a NUMA Node separate from its data, causing a performance problem due to remote memory access. vSphere will eventually rebalance the data portion as well, but there will be a performance issue in the meantime.
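The cost of that interim period can be estimated with a simple weighted average: if a fraction f of memory accesses is local, the effective latency is f × local latency + (1 − f) × remote latency. The sketch below applies that formula; the latency figures of 100 ns and 150 ns are purely hypothetical placeholders, not measurements of any AMD or Intel platform.

```python
def effective_latency_ns(local_fraction, local_ns, remote_ns):
    """Average memory latency when only part of the accesses are local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only to illustrate the arithmetic.
LOCAL_NS, REMOTE_NS = 100.0, 150.0

for fraction in (1.0, 0.5, 0.0):
    latency = effective_latency_ns(fraction, LOCAL_NS, REMOTE_NS)
    print(f"{fraction:.0%} local accesses -> {latency:.0f} ns average")
```

Right after a vCPU is moved away from its data, the local fraction is close to zero, and it climbs back as pages are migrated to the new node.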

Again, this situation is more acute in HyperThreaded processors as vSphere enumerates physical and logical cores in sequential fashion. In a rebalance, it's possible that two independent VMs may be collocated on the same physical core, causing an overcommitted CPU condition. 
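A small sketch shows why sequential enumeration matters. It assumes, for illustration only, that sibling logical CPUs receive adjacent indices (logical CPUs 0 and 1 on physical core 0, 2 and 3 on core 1, and so on); the helper names are hypothetical.

```python
def physical_core(logical_cpu, threads_per_core=2):
    """Map a logical CPU index to its physical core under the assumed
    sequential enumeration (siblings have adjacent indices)."""
    return logical_cpu // threads_per_core


def share_physical_core(cpu_a, cpu_b, threads_per_core=2):
    """True when two logical CPUs are threads of the same physical core."""
    return physical_core(cpu_a, threads_per_core) == physical_core(
        cpu_b, threads_per_core
    )


# Two independent single-vCPU VMs placed on "different" logical CPUs.
print(share_physical_core(0, 1))  # True: both land on physical core 0
print(share_physical_core(1, 2))  # False: cores 0 and 1
```

Two VMs that look separated by logical CPU number can therefore still compete for the execution resources of one physical core.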

Important vSMP Deployment Considerations

In the limited scenarios I described above, the following practices will help address these issues:

  1. Craft vSMPs so they do not exceed the physical core count of the die/NUMA Node.
  2. "Encourage" vSphere to keep all vCPU siblings in a vSMP on the same NUMA Node:
    1. "sched.cpu.vsmpConsolidate=true"
  3. For benchmarking, rebalancing may not be necessary:
    1. Disable rebalancing via "Numa.RebalanceEnable=0"
    2. Disable excessive page migrations via "Numa.PageMigEnable=0"
  4. CPU/Memory Affinity, an option for managing VM placement, is a capability to be used with extreme caution. There are many pitfalls, especially if used in a DRS cluster:
    1. vMotion migrations are disallowed
    2. The affected VM is treated as a non-NUMA client and gets excluded from NUMA scheduling
    3. Affinity does not equal ‘isolation’. The VMkernel Scheduler will still attempt to schedule other VMs on that core.
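The first recommendation is easy to check against an inventory. The following is a minimal sketch, assuming you already have a mapping of VM names to vCPU counts; the function `oversized_vsmps` and the example VM names are hypothetical and no vSphere API is called.

```python
def oversized_vsmps(vcpus_by_vm, cores_per_numa_node):
    """Return the VMs whose vCPU count exceeds one NUMA node's core count."""
    return sorted(
        name
        for name, vcpus in vcpus_by_vm.items()
        if vcpus > cores_per_numa_node
    )


# Hypothetical inventory on a host with 6 cores per die/NUMA node.
inventory = {"sql01": 8, "sap01": 6, "web01": 2}
print(oversized_vsmps(inventory, cores_per_numa_node=6))
```

Any VM the check reports is a candidate for being resized to fit within one die, or for closer monitoring if the workload really needs the extra vCPUs.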


The AMD Cloud Computing Blog can be found here.


Ruben Soto is a Field Application Engineer at AMD.  His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only.  Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

Published Friday, May 20, 2011 5:00 AM by David Marshall