A Contributed Article by Leo Reiter, Chief Technology Officer at Nimbix
High Performance Computing workloads are not web applications. That is why it matters that your cloud is designed to run HPC workloads, rather than generic web services. Today we'll learn how an HPC Cloud is architected, and why...
Jobs versus Instances
It would be silly to dive into architecture without examining HPC workloads a bit more, and how they differ from other applications:
- HPC jobs process data (oftentimes Big Data) and return results. In other types of clouds, Instances run when launched and listen for requests. Instances are later shut down, after some time, when they are no longer needed.
- HPC jobs "shut down" as soon as they finish. In a pay-per-use model, the end user need not worry about managing the infrastructure in order to save money, as the infrastructure charges the end user only for the processing cycles their jobs consume.
- Instances tend to be virtual machines with entire operating system stacks on them. HPC jobs run best on bare metal, where they can take full advantage of the high performance hardware underneath them without having to waste cycles dealing with an abstraction layer in a hypervisor. HPC jobs also spend far less time "starting" than instances do, again, due to their non-virtualized nature. In a pay-per-use model, this means less money spent on non-productive computing overhead.
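To make the billing difference concrete, here is a minimal sketch in Python. The function names, the hourly rate, and the boot and idle times are all hypothetical, chosen only for illustration; they are not figures from any particular cloud.

```python
def job_cost(compute_seconds, rate_per_hour):
    """Pay-per-use job: bill only the seconds the job spends computing."""
    return compute_seconds / 3600 * rate_per_hour


def instance_cost(boot_seconds, compute_seconds, idle_seconds, rate_per_hour):
    """Instance model: bill the whole time the instance is up,
    including boot time and any idle time before the user shuts it down."""
    billed_seconds = boot_seconds + compute_seconds + idle_seconds
    return billed_seconds / 3600 * rate_per_hour


# Hypothetical numbers: one hour of real work at a rate of 2.0 per hour,
# plus 5 minutes of boot and 25 minutes of forgotten idle time on the instance.
print(job_cost(3600, 2.0))
print(instance_cost(300, 3600, 1500, 2.0))
```

Under these assumed numbers, the same hour of work costs half again as much on the instance, purely because of time that produced no results.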
HPC Cloud Job Scheduling
All clouds have "Cloud Controllers", no matter what type of work they do. At a high level, Cloud Controllers put resources to work, and often feature load balancing and metering capabilities. An HPC Cloud uses a Job Scheduler to assign work when requested. The scheduler places work in queues for future execution on appropriate resources. If resources are available right away, jobs run right away.
Queuing versus Oversubscription
When resources are not available, the cloud is busy, and the Cloud Controller has a decision to make. An HPC Cloud Controller will queue the work for later execution. This is also known as "batch queuing". Since the job has all the parameters and data it needs, there is no need for the user to "watch" it run. In fact, some jobs take hours or days to run, even if resources are immediately available. The end user submits the request, and is later notified with the results. The HPC Cloud Controller runs the job as soon as resources become available for it, without the user having to "resubmit" or even care.
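Batch queuing can be sketched in a few lines of Python. This is a deliberately simplified first-in, first-out scheduler, not how any production job scheduler is implemented; the `Job` and `BatchScheduler` names are hypothetical, and real schedulers add priorities, backfill, and time limits.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int  # number of whole nodes the job requests


class BatchScheduler:
    """Toy FIFO batch scheduler: run a job now if it fits, else queue it."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = deque()   # jobs waiting for resources
        self.running = {}      # job name -> Job

    def submit(self, job):
        # The user submits once; there is never a need to resubmit.
        self.queue.append(job)
        self._dispatch()

    def finish(self, name):
        # A finished job releases its nodes, which may unblock queued work.
        job = self.running.pop(name)
        self.free_nodes += job.nodes
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO: start jobs from the head of the queue while they fit.
        while self.queue and self.queue[0].nodes <= self.free_nodes:
            job = self.queue.popleft()
            self.free_nodes -= job.nodes
            self.running[job.name] = job


scheduler = BatchScheduler(total_nodes=4)
scheduler.submit(Job("a", nodes=3))  # fits, starts immediately
scheduler.submit(Job("b", nodes=2))  # does not fit, waits in the queue
scheduler.finish("a")                # frees nodes, "b" starts automatically
```

The point of the sketch is the last line: the queued job starts as a side effect of the first one finishing, with no action from the user.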
In other clouds, Instances suffer a much less desirable fate when resources are not available: oversubscription. This is in fact one way "web services" clouds make money - by putting more instances to work than their hardware can handle. When instances are virtualized, the end user has no visibility into how busy their resources actually are, other than drastically reduced performance. This is because overloaded hypervisors have to "time slice" between instances, since there is not enough hardware to run them all in real time. Depending on the SLA, the cloud may even reject the instance altogether, asking the user to try again later!
An HPC Cloud, on the other hand, ensures deterministic, real-time performance for all work submitted, even if some of the jobs may queue until resources are available. A well designed HPC Cloud will alert operators of resource shortfalls ahead of time so they can anticipate and expand accordingly.
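The trade-off between queuing and oversubscription can be illustrated with a small idealized model. The sketch below assumes every job needs the whole machine, that time slicing is perfectly fair, and that switching between instances costs nothing; real hypervisors add overhead on top of this. Both function names are hypothetical.

```python
def queued_finish_times(durations):
    """Run jobs one after another at full speed; return each finish time."""
    finish, clock = [], 0.0
    for duration in durations:
        clock += duration
        finish.append(clock)
    return finish


def time_sliced_finish_times(durations):
    """Share the machine equally among all unfinished jobs
    (idealized time slicing with zero switching overhead)."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    finish = [0.0] * len(durations)
    clock, previous = 0.0, 0.0
    for position, index in enumerate(order):
        active = len(durations) - position  # jobs still sharing the machine
        clock += (durations[index] - previous) * active
        finish[index] = clock
        previous = durations[index]
    return finish


# Two jobs that each need one hour of the full machine.
print(queued_finish_times([1.0, 1.0]))
print(time_sliced_finish_times([1.0, 1.0]))
```

In this model the queued jobs finish after one and two hours, each running at full speed, while the time-sliced jobs both finish after two hours, each running at half speed the whole time. Nobody finishes sooner under oversubscription, and every job's runtime now depends on what else happens to be running.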
Scalability and Elasticity
Elasticity is a key element of Cloud Computing as a way to scale applications for large scale processing. An HPC Cloud supports jobs that span many physical nodes, without requiring that the job itself configure the infrastructure underneath. The Nimbix Cloud, for example, supports both distributed and parallel HPC application models, leveraging 56 Gbps FDR InfiniBand technology. At this speed, applications can pass up to 137 million messages per second between parallel runs! Compare that to around 1 million messages per second, at most, on commodity web services clouds. Since parallel HPC applications may run millions of data processing iterations during a job, they must be able to communicate quickly to finish faster. Since you are paying for compute cycles, this makes a big difference to your bottom line. The faster the job runs, the less it costs, all other things being equal.
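A rough back-of-the-envelope calculation shows why message rate matters. The sketch below uses the two peak rates quoted above and assumes, purely for illustration, that message rate is the only limit on communication; real applications overlap communication with computation and rarely sustain peak rates. The message count is hypothetical.

```python
def messaging_seconds(total_messages, messages_per_second):
    """Lower bound on time spent passing messages at a given peak rate."""
    return total_messages / messages_per_second


# Hypothetical job that exchanges 137 million messages in total.
total = 137_000_000
print(messaging_seconds(total, 137_000_000))  # at the quoted InfiniBand peak rate
print(messaging_seconds(total, 1_000_000))    # at the quoted commodity cloud rate
```

Under these assumptions the same traffic takes about one second on the faster interconnect and over two minutes on the slower one, and in a pay-per-use model that difference is billed.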
Commodity web service clouds also require end users to configure parallel or distributed communication themselves, since they don't typically offer a workload manager that orchestrates this automatically. That means more time spent configuring, less time spent doing productive work - and the end user pays the cloud provider regardless.
API and Portal
Most clouds offer end users self-service through both a web portal and an API. An HPC Cloud offers a "processing API", where other clouds offer a "machine API". A processing API allows end users to submit jobs, parameters, and data. A machine API requires end users to start and stop instances, so they can later install applications inside of them to do work. Obviously a processing API is key for an HPC Cloud since users shouldn't be expected to configure their own infrastructure before they can even do work.
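The difference between the two API styles can be sketched as follows. Everything here is hypothetical: the field names, the application name, and the data location are made up for illustration and do not describe the API of Nimbix or of any other provider.

```python
import json


def processing_api_request(application, parameters, data_uri):
    """Processing API: a single request that describes the work itself."""
    return json.dumps(
        {"application": application, "parameters": parameters, "data": data_uri},
        sort_keys=True,
    )


def machine_api_workflow(application):
    """Machine API: the user manages infrastructure before and after the work."""
    return [
        "start instance",
        f"install {application} inside the instance",
        "copy data in and run the work",
        "stop instance",
    ]


# One call submits the job, its parameters, and its data.
print(processing_api_request("solver", {"iterations": 1000}, "storage://inputs/case1"))

# The machine API needs several steps, and only one of them is actual work.
for step in machine_api_workflow("solver"):
    print(step)
```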
While the API allows programmatic orchestration of cloud resources for automation, the web portal gives end users a convenient way to submit work for processing. In HPC Cloud terms, this means kicking off complex jobs with just a few touches on your tablet or smartphone, as opposed to "spinning up" virtual servers, logging into them, and typing Linux commands.
Traditional HPC clusters offer neither APIs nor portals, instead requiring end users to write and submit "batch scripts". This is by no means "self-service" in the spirit of the NIST Cloud Computing Definition. Imagine writing a batch script on your smartphone, or even having to learn how to write batch scripts just to get work done. A real HPC Cloud requires both an API and a portal!
Summary
Not all clouds are created equal. Sure, they all have Cloud Controllers, which assign work to various types of resources. They all bill users for cycles consumed, and allow self-service. They all offer elastic scalability, through an API, a web portal, or both. An HPC Cloud optimizes all this for data processing jobs, not "web services". As not all workloads are created equal either, why would you try to run your HPC applications on a "web services" cloud?