Whether you maintain platforms that help
fellow developers do their jobs or build applications that consumers and
businesses rely on, you're responsible for owning your availability and
reliability. You're also responsible for scaling your systems as demand grows.
These can be daunting engineering challenges, and there's no shortage of
opinions on how to scale successfully. However, few of these opinions come from
real planet-scale operating experience.
Fastly founder and CTO Artur Bergman spent
decades building and operating some of the largest sites and applications on the
internet before starting Fastly. His experience building and scaling an
extensive global network, one that must handle massive traffic volumes, stay
always up for customers, and remain easy for engineering teams to manage and
troubleshoot, has yielded several battle-tested strategies for better
resiliency and reliability.
In this interview, Bergman
shares some hard-learned lessons from the past, his perspective on how and when
to scale, and his outlook on what skills will be needed in systems engineering.
You've
been in the trenches of large-scale system operations. Can you share a war
story about a major incident that changed your perspective on system
operations?
We're a global infrastructure platform, so our
systems' stability and reliability are essential for millions of enterprises,
startups, and open-source projects. For a mind-changing incident, I don't need
to look too far: in June of 2021, Fastly had a major, public, global outage
that impacted every single customer, one in which our speed of recovery fell short. More specifically, we failed to
detect what was happening with a specific piece of customer code quickly enough
to prevent customer impact. However, once we did detect it, we took action,
handled the incident with utmost transparency, and recovered very quickly.
I've always thought speed of recovery is
important, and we had certainly done things that made recovery faster, but we
didn't necessarily realize how vulnerable we were to this scenario.
That's why I focus so heavily on speed of recovery in conversations like this
one when it comes to resiliency.
Fundamentally, we still have the same
values in operating Fastly. We innovate in the open, allowing customers to use
the world's largest instant programmable platform. This incident only
reinforced that it's our job to make it safe.
When
designing for resilience, how do you prevent cascading failures in highly
distributed systems?
We do a lot of things at Fastly to avoid
cascading failures; it's a big concern for a globally distributed network. We
use progressive deployments to introduce changes slowly, and we have automated
safety mechanisms built into our deployment process that roll back or disable
changes correlated with elevated error rates on key system-dependent metrics.
What that means in practice is: we don't roll everything out everywhere at
once, and if we detect an error, we safely disable the code and focus on fast
recovery.
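As a rough illustration of that pattern, here is a minimal sketch of a staged rollout with an automated rollback trigger keyed to an error-rate metric. It assumes a hypothetical deployment system; the stage type, the threshold, and the `deploy`/`errorRate`/`rollback` hooks are illustrative and do not correspond to Fastly's actual tooling.

```go
// Package rollout is a hypothetical sketch of a progressive deployment with an
// automated rollback trigger keyed to an error-rate metric. None of these
// names correspond to Fastly's real deployment tooling.
package rollout

import (
	"fmt"
	"time"
)

// Stage is one slice of the fleet that receives a change before the next.
type Stage struct {
	Name    string
	Percent int // share of the fleet exposed to the new code
}

const (
	errorRateThreshold = 0.01             // disable the change if >1% of requests fail
	soakTime           = 10 * time.Minute // observation window per stage
)

// ProgressiveDeploy pushes a change stage by stage, checking a key
// error-rate metric after each step. If the metric is elevated, it disables
// the change everywhere it has reached and reports the failure.
func ProgressiveDeploy(
	stages []Stage,
	deploy func(Stage) error,
	errorRate func(Stage) float64,
	rollback func(Stage) error,
) error {
	for i, s := range stages {
		if err := deploy(s); err != nil {
			return fmt.Errorf("deploy to %s: %w", s.Name, err)
		}
		time.Sleep(soakTime) // let the change soak before widening the blast radius

		if rate := errorRate(s); rate > errorRateThreshold {
			// Elevated errors: disable the change in every stage it reached.
			for j := i; j >= 0; j-- {
				if err := rollback(stages[j]); err != nil {
					return fmt.Errorf("rollback of %s failed: %w", stages[j].Name, err)
				}
			}
			return fmt.Errorf("rolled back: error rate %.2f%% in %s exceeded threshold",
				rate*100, s.Name)
		}
	}
	return nil
}
```

A real deployment pipeline would watch many signals continuously during each soak window rather than sampling a single metric once per stage, but the shape of the safety mechanism is the same: widen the blast radius slowly, and make the "disable and recover" path automatic.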
Resiliency doesn't stop with automated tooling. After
every major incident we go through a thorough retro process aimed at improving
these systems' design in ways that challenge the initial
assumptions of their architecture. From these exercises we've significantly
improved our stress testing, our overload planning and testing, our processes
for partial rollouts, and the triggers for fast rollbacks. There's continuous work on faster recovery and resilience
engineering.
Teams commonly underinvest in recovery: they spend too little time
figuring out how to recover from errors and too much on trying to prevent
them, rather than building systems that can be fixed quickly. It's an
important mindset shift, even though it's often more fun for engineers to
try to prevent failures than it is to plan for recovery.
What
are the biggest mistakes companies make when scaling their systems beyond their
initial architecture?
Overcomplicating things is one of the
most common pitfalls. A telltale sign that you've overcomplicated your systems
is that you've stopped delivering value for your stakeholders. Companies also
tend to overcorrect by trying to fix everything at once instead of making
incremental improvements, and they find themselves spending too much time on
the "perfection of scaling".
It's fine to plan for a future stage so
that you don't accidentally make a decision that's very difficult to correct
later; you do need an idea of where you want to go. But you should take small
steps to get there. I get concerned when I hear about complete rewrites.
If you're a startup, the priority should
be delivering value, not scaling prematurely. If your startup is successful, it
will eventually face scaling challenges. But if you put too much focus on
scaling early on, it can take away from resources better spent on finding
product-market fit.
The same principle applies to established
companies launching new products. Too often, teams invest heavily in
scalability too early, either delaying the product's time to market or pouring
work into a product that ultimately doesn't take off. Both waste precious
engineering resources: the faster you can deliver something that people can
use and that you can learn from, the better. Then you get the luxury of
iterating and scaling when you're successful.
What's the most
interesting or unexpected failure mode you've encountered, and what did you
learn from it?
One of the strangest bugs
we ever tracked was a kernel crash that we couldn't figure out. We operate a
vast fleet with different kernel versions, so we're extremely familiar with
operating the kernel in variable conditions. But this one had us stumped: it
turned out that this specific crash state was geography-dependent. Whatever
kernel version we put in LA or Amsterdam would crash much faster than
any kernel in Chicago. The crash condition depended on a user's distance from
the data center, where distance is measured in milliseconds, and we didn't
consider that possibility for most of the time we spent diagnosing the
issue.
The learning was to step back and look at the whole picture. We
had become narrowly focused on the kernel version as the potential cause and
weren't considering the broader state of the infrastructure. That's easy
to say in retrospect; it still took us three months to figure it out.
AI
is increasingly shaping software and infrastructure. How do you see it changing
the way we design and operate internet-scale systems?
I think AI technologies can be immensely useful, and
we have just scratched the surface of where they can help. An area I'm
interested in is using AI to detect failures and help operators identify where
they should focus to recover; it's something we're actively trialing.
It's possible that one day
AI will completely replace people, but if that's the case, it's a very long
time in the future. But AI can definitely help people become more effective and
efficient, especially in recovery, where time matters so much. Having these
tools helps us digest the information, summarize it, and pattern match. I think
there's way more that can be done in the space of AI and systems operations
that we just don't know about yet.
There's also a weird resistance from people around using these
tools. Ultimately, it's just newness: it takes time to develop best practices
and figure out how to use new tools to their full extent. This long-term
mindset informs how we see ourselves in the ecosystem with our AI Accelerator.
With recent innovations like DeepSeek's cost breakthroughs, we already see a
"Moore's Law" effect around running AI models, lowering the barrier for new
entrants. However, the interaction experience for working with those models has
consistently been high-latency, which hurts the end-user experience. We see
ourselves as the platform engineering tool for improving all AI experiences,
regardless of the model or vendor you're using.
What's
one principle that every system architect should live by?
Fight complexity, fight overbuilding.
And what skills do you think will be most valuable
for the next generation of engineers building high-scale systems?
When I think about hiring, I look for curiosity, intelligence, and a
willingness to adapt to reality. I also highly value a bias for action. It's
less about specific skill sets and more about mindset. I would like to see
someone who's willing to learn how to use AI correctly.
Overall, empathy for your users is
essential, whether they're fellow engineers working on your system or end users.
Great technology alone doesn't guarantee success.
To learn more about Kubernetes and the
cloud native ecosystem, join us at KubeCon + CloudNativeCon EU, in London, England, on April 1-4, 2025.
About
Artur Bergman
Artur Bergman is Founder and Chief
Technology Officer at Fastly. He served as CEO from Fastly's founding in March
2011 until February 2020 following the company's IPO in 2019. Before founding
Fastly, Artur was the CTO of Fandom (previously Wikia, Inc.), a global
community knowledge-sharing platform. Prior to Fandom, he held engineering
management roles at SixApart, a social networking service, and Fotango, Ltd., a
subsidiary of Canon Europe.