Alex Hidalgo, whose SRE book "Implementing Service Level Objectives" was published
by O'Reilly this month, has joined Nobl9. VMblog spoke with Alex about his book
and his career move.
VMblog: Alex, your SLO book was
just published by O'Reilly. Tell us about the content of the book
and the audience you had in mind when you wrote it.
Alex Hidalgo: Service level objectives (SLOs) are often
associated with site reliability engineering (SRE), and more specifically SRE
at Google. And while I certainly expect the book might be picked up by many
SREs at many companies, I actually wrote it for anyone who works in the tech
industry in any capacity. In order for SLO-based approaches to work to their
fullest potential, you need everyone on board: operations, development,
product, business -- I'd even argue things like your QA and test teams,
your security operations, etc. One of the major benefits of an SLO-based
approach is better communications across teams and organizations due to having
a shared language to use when talking about reliability.
While not every chapter in the book will be
relevant to every possible audience, I hope that at least portions of the book
will be useful to anyone working with computer systems.
VMblog: You've had an interesting career path. How did you get into SRE, and how did
you become an expert?
Hidalgo: I ended up as an SRE almost by accident. Right
after the 2008 recession I had just moved to New York City and I needed any job
I could get. I hadn't worked with computers professionally in about 7 years at
that point. Due to some bad early experiences in the industry, I just thought
it wasn't for me. But due to the aforementioned need for a job, I applied at a
small IT firm and got hired right away. From there I realized I did actually
love working with computers for a living and that it was entirely separate
circumstances that had left a bad taste in my mouth.
From there I realized that I was wasting some
of my Linux, networking, and programming expertise and joined Admeld as a
technical operation engineer. Three months later, I was at Google due to
acquisition and suddenly I was a site reliability engineer. It was a pretty
wild ride!
As to your second point, I'm humbled to be
referred to as an expert. I certainly don't sell myself short, but it's still
pretty wild to hear that sometimes. I think I ended up in the position I've
found myself in primarily because I care about people first. That's truly what
this is all about. You want reliable services because that's better for all of
the people involved. It's better for the users of a service because they get
what they want, and it's better for the engineers responsible for running the
service because good reliability means fewer pages and fewer emergencies.
That's why I wrote the book. I've been incredibly blessed to have gotten to
learn from some brilliant people and then to have moved on to teaching others.
In a way the book is just the story of how I came to see that SLO-based
approaches can make people happier and what you actually need to know to do
this correctly, and it comes from wanting to share what I've learned on my
journey with others.
VMblog: Obviously your experience at Google was formative, and Google is widely
respected as the pioneer of SRE, but many organizations question whether SRE is
applicable for smaller enterprises. What are your thoughts on that?
Hidalgo: I think it's entirely fair to question if you
need a true team of SREs. I think it's the philosophies and the concepts that
matter -- anyone can adopt best practices around thinking about and achieving
reliability. I've frequently dubbed coworkers of mine who aren't strictly SREs
"honorary SREs," because they've internalized the lessons. I don't think
everyone needs SRE, but I do think everyone
needs to learn how to think about reliability in the right way. After all, it's
all only a model. The map is not the territory, and not every territory needs
guides.
VMblog: What is it about SLOs specifically that can improve the way organizations
approach SRE?
Hidalgo: SLOs are in many ways the absolute foundation
of SRE. As originally formulated, the idea was that developers and operations
were at odds with each other. Developers want to ship features, and operations
want to keep things stable by not changing too much. I personally
think this is a bit of an outdated model in 2020, but that's how things started.
And the way you solved this problem was via SLOs. If you were missing your
reliability targets too often, the operations team was allowed to say: "No new
features! We need to fix things!"; but, if you were exceeding your reliability
targets regularly, the developers were allowed to ship features as quickly as
they wanted! Both sides are now happy!
I think SRE as a discipline has far outgrown
this simplistic and original example. It turns out that this kind of approach
is useful for so many things beyond just maintaining a balance between feature
velocity and reliability. The heart of service level objectives is: you can't
be perfect, no one needs you to be perfect anyway, and everyone is happier once
they realize that. Don't let the great be the enemy of the good, in a sense.
And that applies to everything, not just site reliability engineering.
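The original arrangement described in this answer, where missed reliability targets pause feature work and met targets let features ship, can be sketched roughly as follows. This is a minimal illustration, not anything taken from the book or from Google's practice: the function names, the 99.9% target, and the request counts are all assumptions made up for the example.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the fraction of requests that should succeed, e.g. 0.999.
    All names and numbers here are hypothetical, for illustration only.
    """
    # A target of exactly 100% leaves no room for failure at all,
    # which is the "you can't be perfect" point made in the interview.
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so no budget has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(slo_target, total_requests, failed_requests):
    """Apply the simple rule: ship while budget remains, otherwise fix things."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return "ship features" if remaining > 0 else "fix reliability first"


if __name__ == "__main__":
    # Assumed example: a 99.9% target over one million requests.
    print(release_decision(0.999, 1_000_000, 400))
    print(release_decision(0.999, 1_000_000, 1_500))
```

With the assumed 99.9% target, one million requests allow roughly 1,000 failures, so 400 failures leaves budget to spend on new features while 1,500 does not.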
VMblog: What are the biggest barriers that organizations need to overcome to make SLOs
work for them?
Hidalgo: There are three that I see as the most common.
The first is that it's not always easy to
convince leadership that you shouldn't be trying to be 100% reliable. Leaders
-- especially founders, I've noticed -- want to shoot for the moon and not miss
it. They want to believe that their product and their team can do something
that's simply impossible. It's not always easy to convince people in those
positions that it's actually okay to fail sometimes, especially when you're
trying to push this message from the bottom up.
The second is that people just
don't realize how much you actually need in order to implement SLOs correctly.
They'll read a chapter or two in one of the first two Google SRE books and find
themselves armed with a bunch of definitions. But that's not nearly enough. You
need education initiatives and workshops and cultural buy-in and often also
custom tooling or new telemetry. People don't often realize how much goes into
this when they try to "do SLOs" and they end up disappointed with the outcome
because of this.
Finally, I think way too many organizations
view SLOs as "a thing you do," often as an OKR for a quarter or something like
that. But that's not at all how it works! SLO-based approaches are exactly
that: they're approaches. They're a model for how to measure and talk about the
reliability of your services, not a thing you can ever "finish." It's a
different path, not a destination. Well-done SLO-based approaches to
reliability are closer to using something like Agile to plan sprints than
anything else.
VMblog: You've now joined Nobl9, a startup focused on building reliable software, and
you've described this as a "too-good-to-pass-up" opportunity. What is Nobl9 building and why do you
find it so compelling?
Hidalgo: I mentioned earlier that too many
organizations don't really understand that SLO-based approaches are an entirely
different way of doing things and not just something you can mark off of a
checklist. Not many people understand that these concepts are actually
applicable to all businesses and not
just tech companies with web services. I hope to one day formally bring these
ideas and approaches to industries such as logistics, public transportation,
and even the service industry! Nobl9 understands all of this. They've truly
internalized the most important aspects of service level objectives into the
very DNA of their product in a way I haven't seen in any other product
offering. I've seen well-implemented service level objectives make humans
happier and I think Nobl9 can make a lot of humans happier by helping them do
this correctly.
VMblog: SRE is near the top of the list of hottest IT jobs right now. What do you see
happening in this field in the next 5 years?
Hidalgo: I think we're at a very exciting crossroads in
the tech industry. Tech in general has historically been pretty bad about
feeling the need to invent everything itself even though for almost every
situation there is prior art! Reliability engineering has been around since the
1940s -- in some ways you could say it's been around for millennia as people
strove to build things that didn't fail too often. Statistics is centuries old.
Resilience engineering isn't quite as ancient, but you're seeing the industry
pick up on those studies and that knowledge as well. I think in the last few years
you've seen people opening their eyes to the fact that tech can learn from
those that came before us, and I see that nowhere more clearly than within SRE.
I don't know exactly what SRE will look like in 5 years, but I'm pretty certain
it'll look a lot more like other engineering disciplines -- in a very good way.