Everything you know about scaling your architecture is wrong: A conversation with Fastly Founder Artur Bergman on Resilience, Recovery and the Future of Systems Engineering

Whether you maintain platforms that help fellow developers do their jobs or build applications that consumers and businesses rely on, you're responsible for owning your availability and reliability. You're also responsible for scaling your systems as demand grows. These can be daunting engineering challenges, and there's no shortage of opinions on how to scale successfully. However, few of these opinions come from real planet-scale operating experience.

Fastly founder and CTO Artur Bergman spent decades building and operating some of the largest sites and applications on the internet before starting Fastly. His experience building and scaling an extensive global network, one that handles massive traffic volumes, must always be up for customers, and must remain easy for engineering teams to manage and troubleshoot, has resulted in several battle-tested strategies for better resiliency and reliability.

In this interview, Bergman shares some hard-learned lessons from the past, his perspective on how and when to scale, and his outlook on what skills will be needed in systems engineering.

You've been in the trenches of large-scale system operations. Can you share a war story about a major incident that changed your perspective on system operations?

As a global infrastructure platform, our system's stability and reliability are essential for millions of enterprises, startups, and open-source projects. For an incident that changed my perspective, I don't need to look too far: in June of 2021, Fastly had a major, public, global outage that impacted every single customer, and it was our speed of recovery that fell short. More specifically, we failed to detect what was happening with a specific piece of customer code quickly enough to prevent customer impact. However, once we did detect it, we took action, handled the incident with utmost transparency, and recovered very quickly.

I've always thought speed of recovery is important and we had certainly done things that made recovery faster, but I think we didn't necessarily realize how vulnerable we were to this scenario. That's why I focus so heavily on speed of recovery in conversations like this one when it comes to resiliency.

Fundamentally, we still have the same values in operating Fastly. We innovate in the open, allowing customers to use the world's largest instant programmable platform. This incident only reinforced that it's our job to make it safe.

When designing for resilience, how do you prevent cascading failures in highly distributed systems?

We do a lot of things at Fastly to avoid cascading failures; it's a big concern for a globally distributed network like ours. We use progressive deployments to introduce changes slowly, and we have automated safety mechanisms built into our deployment process that roll back or disable changes that correspond with elevated error rates on key system-dependent metrics. What that means in practice is that we don't roll everything out everywhere at once, and if we detect an error, we safely disable the code and focus on fast recovery.
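
To make that idea concrete, here is a minimal sketch of a progressive rollout with an automatic rollback trigger. The stage sizes, thresholds, and helper callables (deploy_to, error_rate, roll_back) are illustrative placeholders supplied by the caller, not Fastly's actual tooling.

    import time
    from typing import Callable

    def progressive_deploy(
        change_id: str,
        deploy_to: Callable[[str, float], None],  # push the change to a fraction of the fleet
        error_rate: Callable[[str], float],       # current error rate on key system metrics
        roll_back: Callable[[str], None],         # disable the change everywhere
        stages: tuple = (0.01, 0.05, 0.25, 1.0),  # fraction of the fleet per stage (assumed values)
        threshold: float = 0.02,                  # abort if the key metric exceeds this (assumed)
        window_secs: int = 300,
        poll_secs: int = 10,
    ) -> bool:
        """Roll a change out in stages, aborting on elevated error rates."""
        for fraction in stages:
            deploy_to(change_id, fraction)
            deadline = time.monotonic() + window_secs
            while time.monotonic() < deadline:
                if error_rate(change_id) > threshold:
                    # Automated safety mechanism: disable the change and
                    # prioritize fast recovery over debugging in place.
                    roll_back(change_id)
                    return False
                time.sleep(poll_secs)
        return True  # the change is live everywhere

The point of the sketch is the shape of the process: each stage only widens the blast radius after the metrics stay healthy for a full observation window, and a single bad reading triggers an automatic, global rollback.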

Resiliency doesn't stop with automated tooling. After every major incident we go through a thorough retro process with a mindset of further improving these systems' design in a way that challenges the initial assumptions of their architecture. From these exercises we've significantly improved our stress testing, our overload planning and testing, our processes for partial rollouts, and the triggers for fast rollbacks. There's continuous work on faster recovery and resilience engineering.

Teams commonly underinvest in recovery: they focus too much on trying to prevent errors rather than building systems that can be fixed quickly. It's an important mindset shift, even though it can be more fun for engineers to try to prevent failures than to plan for recovery.

What are the biggest mistakes companies make when scaling their systems beyond their initial architecture?

Overcomplicating things is one of the most common pitfalls. A telltale sign that you've overcomplicated your systems is that you've stopped delivering value for your stakeholders. Companies also tend to overcorrect by trying to fix everything at once instead of making incremental improvements, and they find themselves spending too much time chasing the "perfection of scaling."

It's fine to plan for a future stage so that you don't accidentally make a decision that's very difficult to correct later; you do need an idea of where you want to go. But you should take small steps to get there. I get concerned when I hear about complete rewrites.

If you're a startup, the priority should be delivering value, not scaling prematurely. If your startup is successful, it will eventually face scaling challenges. But if you put too much focus on scaling early on, it can take away from resources better spent on finding product-market fit.

The same principle applies to established companies launching new products. Too often, teams invest heavily in scalability too early, either delaying the product's time to market or building scale for a product that ultimately doesn't take off. That wastes precious engineering resources. The faster you can deliver something that people can use and that you can learn from, the better, and then you get the luxury of iterating and scaling when you're successful.

What's the most interesting or unexpected failure mode you've encountered, and what did you learn from it?

One of the strangest bugs we ever tracked was a kernel crash that we couldn't figure out. We operate a vast fleet with different kernel versions, so we're extremely familiar with operating the kernel under variable conditions. But this one had us stumped: it turned out that this specific crash state was geography-dependent. Whatever kernel version we put in LA or Amsterdam would crash much faster than any kernel in Chicago. The crash condition depended on a user's distance from the data center, where distance is measured in milliseconds, and we didn't consider that possibility for most of the time we spent diagnosing the issue.

The lesson was to step back and look at the whole picture. We were getting narrowly focused on the kernel version as the potential cause and weren't considering the greater state of the infrastructure. That's easy to say in retrospect; it took us three months to figure it out.

AI is increasingly shaping software and infrastructure. How do you see it changing the way we design and operate internet-scale systems?

I think AI technologies can be immensely useful, and we have just scratched the surface of where they can help. One area I'm interested in is using AI to detect failures and help operators identify where they should focus to recover; it's something we're actively trialing.
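
As a toy illustration of that idea, and not Fastly's actual tooling, a detector could compare each service's current error rate against its own recent history and surface the most anomalous services for an operator (or an AI assistant) to look at first. The data layout and z-score threshold below are assumptions made for the sketch.

    import statistics

    def rank_suspects(
        history: dict,           # service name -> list of recent error-rate samples
        latest: dict,            # service name -> most recent error rate
        z_threshold: float = 3.0,  # assumed cutoff for "worth looking at"
    ) -> list:
        """Return (service, z-score) pairs, most anomalous first."""
        suspects = []
        for service, samples in history.items():
            if len(samples) < 2:
                continue  # not enough history to judge
            mean = statistics.fmean(samples)
            spread = statistics.stdev(samples) or 1e-9  # avoid divide-by-zero
            z = (latest.get(service, mean) - mean) / spread
            if z > z_threshold:
                suspects.append((service, z))
        return sorted(suspects, key=lambda pair: pair[1], reverse=True)

A ranked shortlist like this is the kind of "where should I focus" signal an operator, or a summarization model sitting on top of it, could use to shorten time to recovery.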

It's possible that one day AI will completely replace people, but if so, that day is a very long way off. AI can definitely help people become more effective and efficient, especially in recovery, where time matters so much. Having these tools helps us digest information, summarize it, and pattern match. I think there's far more that can be done in the space of AI and systems operations that we just don't know about yet.

There's also a weird resistance from people around using these tools. Ultimately, it's just newness; it takes time to develop best practices and figure out how to use new tools to their full extent. This long-term mindset informs how we see ourselves in the ecosystem with our AI Accelerator. With recent innovations like DeepSeek's cost breakthroughs, we already see a "Moore's Law" effect around running AI models, lowering the barrier for new entrants. However, the interaction experience for working with those models has consistently been high-latency, which hurts the end-user experience. We see ourselves as the platform engineering tool for improving all AI experiences, regardless of the model or vendor you're using.

What's one principle that every system architect should live by?

Fight complexity, fight overbuilding.

And what skills do you think will be most valuable for the next generation of engineers building high-scale systems?

When I think about hiring, I look for curiosity, intelligence, and a willingness to adapt to reality. I also highly value a bias for action. It's less about specific skill sets and more about mindset. I'd like to see someone who's willing to learn how to use AI correctly.

Overall, empathy for your users is essential, whether they're fellow engineers working on your system or end users. Great technology alone doesn't guarantee success.

##

To learn more about Kubernetes and the cloud native ecosystem, join us at KubeCon + CloudNativeCon EU, in London, England, on April 1-4, 2025.

About Artur Bergman


Artur Bergman is Founder and Chief Technology Officer at Fastly. He served as CEO from Fastly's founding in March 2011 until February 2020 following the company's IPO in 2019. Before founding Fastly, Artur was the CTO of Fandom (previously Wikia, Inc.), a global community knowledge-sharing platform. Prior to Fandom, he held engineering management roles at SixApart, a social networking service, and Fotango, Ltd., a subsidiary of Canon Europe.

Published Wednesday, March 05, 2025 7:31 AM by David Marshall