Virtualization Technology News and Information
How to Prevent Costly Outages with Better Capacity Planning and Management

Article written by John Miecielica, Director of Product Management for TeamQuest

"Our website is temporarily out of service...." The dreaded 504 error. No one on either side of the screen wants to see this message. Consumers have zero patience for dysfunctional websites, and IT knows that time equals money. An outage might make your competitors happy, but your boss...not so much. Gone are the simpler days when saying "the website is down" was equivalent to "I forgot to bring my business cards." Now it's more like "it's a busy Saturday afternoon downtown and no one showed up to work, so the shop is locked up." Or, even more likely, "It's our busiest online shopping day of the year and we can't sell anything because our web infrastructure just took an unexpected nosedive, and now all our annoyed customers are complaining about it on Twitter." And it's not the momentary stutters we hear about; it's the multi-hour wipeouts that cause real inconvenience for end users, cost real money, and put IT jobs and sanity at risk.

Recently we were treated to a day of spectacular outages, courtesy of the NYSE, United Airlines, and the Wall Street Journal. The media's moniker for the events of July 8, the "Great Glitch," makes it sound a lot cuter than it was in reality. Initially, many feared a coordinated cyber attack by China or hacktivists. Grounded United Airlines passengers were dismayed by a lack of communication from the airline, and investors worldwide were concerned about the rare shutdown of the New York Stock Exchange. When they looked for news about the outage, they were greeted by a 504 message on the Journal's site. None of the outages lasted more than a few hours, and as updates began to trickle out it became clear that the fault lay not with cyber criminals but with basic configuration errors. The Journal's website was most likely overwhelmed by the sudden surge of traffic from curious traders and investors, a cautionary example of how our hyper-connectedness can lead to cascading failures.

In actuality, the culprit was a combination of inadequate testing and faulty capacity management and failover planning. As one insightful blogger framed it, "The big problem we face isn't coordinated cyber-terrorism, it's that software sucks. Software sucks for many reasons, all of which go deep, are entangled, and expensive to fix...This is a major headache, and a real worry as software eats more and more of the world. We are building skyscraper favelas in code, in earthquake zones." Ms. Tufekci's colorful warning may be a bit hyperbolic, but her concerns are valid. Nearly everything we do relies on information technology. Ultimately, the infrastructure IT is tasked with running is the primary interface with customers, and an organization's reputation depends on the reliability of those interfaces.

The July pile-up was indeed alarming, but such outages are not uncommon. In the last year, Southwest Airlines, Best Buy, Apple, Target, Adobe, Dropbox, Sony, Verizon, Comcast, and many more have experienced costly (and mostly preventable) outages that lasted several hours to several days, costing millions of dollars in lost revenue and reputational damage. These are the headline-makers, but less noteworthy outages happen all the time. A Ponemon survey from 2013 pegged the average cost of an outage at almost $700,000.  In addition to financial and reputational damage, outages invite regulatory hassles and fines as well as the increased scrutiny of supply chain partners and other stakeholders.

What can be done to better prevent and manage website and service outages? In a nutshell, enterprises need to do better testing, resource management, and incident response planning. Capacity planning and management are central to optimizing IT infrastructure and providing the necessary visibility across the enterprise and into the future. It is possible to efficiently manage the performance of complex IT environments with automation and predictive capabilities.  In the era of data center virtualization and burgeoning cloud technology, there are multiple layers (storage to applications), multiple vendors, and multiple models (on-premise vs. public cloud) to assess, federate, and manage.

In these highly dynamic environments, it is challenging to anticipate and troubleshoot problems. The ubiquity of smartphones, social media, and online commerce has intensified the velocity with which volumes of data are amassed, processed, and routed in all directions. There aren't enough skilled technicians available, budgets are tight, and all the project deadlines were yesterday. Customers and end users are tech savvy, empowered, mobile, and more demanding than ever. Internal teams are bypassing enterprise IT and acquiring technology services directly, creating shadow IT scenarios that undermine cost, capacity, and security strategies. Capacity planning and management helps IT regain control by providing accurate visibility across the enterprise and into future capacity needs. IT has an unfortunate reputation as the Department of NO. It's easier to say "Yes!" when you can quickly and accurately substantiate the associated costs and consequences. With a longer and more complete view of the IT estate, businesses can make decisions with more confidence and build plans driven by accurate intelligence and strategic priorities.

In the face of such complexity, it is imperative that IT bring its efforts into alignment with business objectives. Many outages could be prevented by improved collaboration between business and IT. Those responsible for capacity planning and provisioning must work closely with business stakeholders to build accurate models of peak traffic during promotions, client onboarding, new service roll-outs, and so on. Putting tools and solutions in place to automate the collection, correlation, and analysis of machine, sensor, and log data enables the enterprise to move from a reactive stance (putting out fires) to a proactive one (optimizing and innovating). With these solutions, organizations can mature and do more, better, faster. When it comes time to make big decisions, such as moving into the cloud, risks and costs can be properly assessed only if IT has the ability to plan, model, and test for dynamic environments.
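As a toy illustration of this kind of peak-traffic modeling, the sketch below sizes capacity for an upcoming promotion from historical demand. The request volumes, per-server capacity, and growth/headroom factors are all hypothetical placeholders agreed between IT and the business, not figures from any real deployment:

```python
# Hypothetical sketch: sizing for a promotion peak from historical demand.
# All names and numbers (requests_per_hour, CAPACITY_PER_SERVER, etc.) are
# illustrative assumptions, not values from the article.

# Hourly request counts observed during previous promotional events
requests_per_hour = [12_000, 18_500, 22_000, 31_000, 27_500, 19_000]

CAPACITY_PER_SERVER = 5_000   # requests/hour one server sustains at target latency
GROWTH_FACTOR = 1.25          # uplift the business expects for the next promotion
HEADROOM = 1.30               # 30% safety margin agreed with stakeholders

projected_peak = max(requests_per_hour) * GROWTH_FACTOR
required_capacity = projected_peak * HEADROOM
servers_needed = int(-(-required_capacity // CAPACITY_PER_SERVER))  # ceiling division

print(f"Projected peak: {projected_peak:,.0f} req/hr")
print(f"Servers to provision: {servers_needed}")
```

The point of even a sketch this simple is that the growth and headroom numbers come from the business side, while the per-server capacity comes from IT's own measurements; neither party can size the event alone.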

Most enterprises are now moving into the cloud in some way. They are shifting from experimental DevOps cloud adoption to increased use of cloud platforms for central IT functions. The hybrid cloud model is especially promising for enterprises that want to retain critical compute resources on premise but need to leverage the public cloud as a cost-effective failover mechanism.

Capacity planning for hybrid environments requires solutions that can federate the two views, analyzing performance across private cloud, public cloud, and non-cloud computing resources with a single toolset. Service performance and end-user experience should be central to these analyses, which draw on performance data from multiple services, applications, and more. Starting with meaningful metrics aligned to business outcomes is essential; it is more important to look at latency and response time than machine utilization, for example. To optimize service quality, you need to dig deeper, breaking the service into its components and identifying the specific capacity requirements of each. Capacity planning solutions that include predictive capabilities enable IT to be notified in advance when customers will experience degraded response time, identify the IT component responsible, and fix it before an embarrassing outage occurs.

Capacity planning also enables enterprises to predict the cost of cloud services, avoiding bill shock and informing budget planning. It's easier to decide when to move to a different cloud provider (due to poor SLA compliance or mounting costs) if you are already doing capacity management of your cloud virtual instances. Likewise, it helps IT know when to supplement in-house resources with cloud infrastructure in order to ensure SLA compliance. Cost is always perceived as the key driver behind cloud adoption, but resilience and availability matter just as much. After all, downtime is expensive on many levels (lost revenue, lost reputation, remediation, etc.). In virtualized and hybrid environments, cost optimization requires solid insight into the future, insight garnered through historical data, intelligent forecasts, modeling, simulations, and testing.
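One simple way to sketch such a cost forecast is to extrapolate each instance type's usage by its average month-over-month growth and price it at an assumed hourly rate. All figures below are illustrative placeholders, not actual provider pricing:

```python
# Hypothetical sketch: projecting next month's cloud bill from a usage trend.
# Instance types, hours, and $/hour rates are invented for illustration.

# Observed instance-hours per month for the last three months, by instance type
usage_history = {
    "web":   [1400, 1550, 1720],
    "batch": [600, 640, 690],
}

hourly_rate = {"web": 0.096, "batch": 0.192}  # assumed $/instance-hour

def project_next(history):
    """Extrapolate one month ahead using average month-over-month growth."""
    growths = [b / a for a, b in zip(history, history[1:])]
    avg_growth = sum(growths) / len(growths)
    return history[-1] * avg_growth

projected_cost = sum(
    project_next(hours) * hourly_rate[itype]
    for itype, hours in usage_history.items()
)
print(f"Projected next-month spend: ${projected_cost:,.2f}")
```

Even this crude extrapolation makes the trend visible early enough to question it; a steadily compounding growth rate is exactly the kind of signal that separates budget planning from bill shock.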

As we speed on down the superhighway, demand grows for more sophisticated and integrated services, and the infrastructure that supports them becomes more complex. Services (and internal business teams) are no longer constrained by finite infrastructure. Capacity planning and management has shifted from deciding how to operate your services on a finite set of infrastructure resources to deciding how to optimally place them among a near-endless range of sourcing alternatives (private/public/hybrid cloud, on-premise arrays, MSPs, cloud bursting, etc.).

In these complex technology ecosystems, there's no chance for perfection: incidents will happen. Mother Nature, cyber criminals, human error, and subpar software are always in play. Incidents may originate outside your enterprise, with root causes beyond your control. Intelligent, data-driven capacity planning brings significant and lasting resilience to enterprise infrastructure, and therefore to the business as a whole. It is this resilience, born of rigorous testing and proactive preparation, that empowers your organization to respond effectively and quickly. Knowing what you have and how you can use it creates security, sustainability, and competitive strength across the entire enterprise.


About the Author

John Miecielica, Director of Product Management for TeamQuest

John Miecielica is the Director of Product Management for TeamQuest Corporation. Prior to joining TeamQuest, John spent 19 years with M&I Data Services/Metavante/FIS, where he built and managed many of the disciplines around open systems (server administration, storage administration, monitoring, and project management). He also served as VP of Capacity Management for Metavante/FIS for 10 years before joining the product management team at TeamQuest.

John holds a bachelor's degree in mathematics and computer science from Providence College and a master's degree in computer science from Binghamton University.

Published Thursday, July 23, 2015 6:32 AM by David Marshall