Virtualization Technology News and Information
VMblog Expert Interview: Alex Hidalgo Talks SRE and SLOs


Alex Hidalgo, whose SRE book "Implementing Service Level Objectives" was published by O'Reilly this month, has joined Nobl9.  VMblog spoke with Alex about his book and his career move.

VMblog:  Alex, your SLO book was just published by O'Reilly.  Tell us about the content of the book and the audience you had in mind when you wrote it.

Alex Hidalgo:  Service level objectives (SLOs) are often associated with site reliability engineering (SRE), and more specifically SRE at Google. And while I certainly expect the book might be picked up by many SREs at many companies, I actually wrote it for anyone who works in the tech industry in any capacity. In order for SLO-based approaches to work to their fullest potential, you need everyone on board: operations, development, product, business -- I'd even argue things like your QA and test teams, your security operations, etc. One of the major benefits of an SLO-based approach is better communication across teams and organizations due to having a shared language to use when talking about reliability.

While not every chapter in the book will be relevant to every possible audience, I hope that at least portions of the book will be useful to anyone working with computer systems.

VMblog:  You've had an interesting career path.  How did you get into SRE, and how did you become an expert?

Hidalgo:  I ended up as an SRE almost by accident. Right after the 2008 recession I had just moved to New York City and I needed any job I could get. I hadn't worked with computers professionally in about 7 years at that point. Due to some bad early experiences in the industry, I just thought it wasn't for me. But due to the aforementioned need for a job, I applied at a small IT firm and got hired right away. From there I realized I did actually love working with computers for a living and that it was entirely separate circumstances that had left a bad taste in my mouth.

From there I realized that I was wasting some of my Linux, networking, and programming expertise and joined Admeld as a technical operations engineer. Three months later, I was at Google due to an acquisition and suddenly I was a site reliability engineer. It was a pretty wild ride!

As to your second point, I'm humbled to be referred to as an expert. I certainly don't sell myself short, but it's still pretty wild to hear that sometimes. I think I ended up in the position I've found myself in primarily because I care about people first. That's truly what this is all about. You want reliable services because that's better for all of the people involved. It's better for the users of a service because they get what they want, and it's better for the engineers responsible for running the service because good reliability means fewer pages and fewer emergencies. That's why I wrote the book. I've been incredibly blessed to have gotten to learn from some brilliant people and then to have moved on to teaching others. In a way the book is just the story of how I came to see that SLO-based approaches can make people happier, of what you actually need to know to do this correctly, and of my wanting to share what I've learned on my journey with others.

VMblog:  Obviously your experience at Google was formative, and Google is widely respected as the pioneer of SRE, but many organizations question whether SRE is applicable for smaller enterprises.  What are your thoughts on that?

Hidalgo:  I think it's entirely fair to question if you need a true team of SREs. I think it's the philosophies and the concepts that matter -- anyone can adopt best practices around thinking about and achieving reliability. I've frequently dubbed coworkers of mine who aren't strictly SREs "honorary SREs," because they've internalized the lessons. I don't think everyone needs SRE, but I do think everyone needs to learn how to think about reliability in the right way. After all, it's all only a model. The map is not the territory, and not every territory needs guides.

VMblog:  What is it about SLOs specifically that can improve the way organizations approach SRE?

Hidalgo:  SLOs are in many ways the absolute foundation of SRE. As originally formulated, the idea was that developers and operations were at odds with each other. Developers want to ship features, and operations want to keep things stable by not changing too much. I personally think this is a bit of an outdated model in 2020, but that's how things started. And the way you solved this problem was via SLOs. If you were missing your reliability targets too often, the operations team was allowed to say: "No new features! We need to fix things!"; but if you were exceeding your reliability targets regularly, the developers were allowed to ship features as quickly as they wanted! Both sides are now happy!
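
The trade-off described above is commonly expressed in SRE literature as an "error budget": the share of requests that may fail before the reliability target is missed. The Python sketch below is only an illustration of that idea and does not come from the book or from any Nobl9 product. The names `SLOWindow`, `error_budget_remaining`, and `release_decision`, and the simple "ship" or "freeze" rule, are all hypothetical, and the sketch assumes reliability is measured as the ratio of successful requests to total requests over one window.

```python
from dataclasses import dataclass


@dataclass
class SLOWindow:
    """Request counts over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SLOWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exactly
    used up, and a negative value means the target has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic in the window, so no budget was spent
    # Failures the target tolerates, e.g. 0.1% of requests for a 99.9% SLO.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def release_decision(window: SLOWindow, slo_target: float) -> str:
    """Assumed policy: ship while budget remains, otherwise freeze features."""
    if error_budget_remaining(window, slo_target) > 0:
        return "ship"
    return "freeze"


if __name__ == "__main__":
    healthy = SLOWindow(total_requests=1_000_000, failed_requests=500)
    unhealthy = SLOWindow(total_requests=1_000_000, failed_requests=1_500)
    print(release_decision(healthy, slo_target=0.999))
    print(release_decision(unhealthy, slo_target=0.999))
```

Real policies are rarely this binary. Teams often use the remaining budget to start a conversation about priorities, not as an automatic gate.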

I think SRE as a discipline has far outgrown this simplistic and original example. It turns out that this kind of approach is useful for so many things beyond just maintaining a balance between feature velocity and reliability. The heart of service level objectives is: you can't be perfect, no one needs you to be perfect anyway, and everyone is happier once they realize that. Don't let the great be the enemy of the good, in a sense. And that applies to everything, not just site reliability engineering.

VMblog:  What are the biggest barriers that organizations need to overcome to make SLOs work for them?

Hidalgo:  There are three that I see as the most common.

The first is that it's not always easy to convince leadership that you shouldn't be trying to be 100% reliable. Leaders -- especially founders I've noticed -- want to shoot for the moon and not miss it. They want to believe that their product and their team can do something that's simply impossible. It's not always easy to convince people in those positions that it's actually okay to fail sometimes, especially when you're trying to push this message from the bottom up.

Another primary example is that people just don't realize how much you actually need in order to implement SLOs correctly. They'll read a chapter or two in one of the first two Google SRE books and find themselves armed with a bunch of definitions. But that's not nearly enough. You need education initiatives and workshops and cultural buy-in and often also custom tooling or new telemetry. People don't often realize how much goes into this when they try to "do SLOs" and they end up disappointed with the outcome because of this.

Finally, I think way too many organizations view SLOs as "a thing you do," often as an OKR for a quarter or something like that. But that's not at all how it works! SLO-based approaches are exactly that: they're approaches. They're a model for how to measure and talk about the reliability of your services, not a thing you can ever "finish." It's a different path, not a destination. Well-done SLO-based approaches to reliability are closer to using something like Agile to plan sprints than anything else.

VMblog:  You've now joined Nobl9, a startup focused on building reliable software, and you've described this as a "too-good-to-pass-up" opportunity.  What is Nobl9 building and why do you find it so compelling?

Hidalgo:  I mentioned earlier that too many organizations don't really understand that SLO-based approaches are an entirely different way of doing things and not just something you can mark off of a checklist. Not many people understand that these concepts are actually applicable to all businesses and not just tech companies with web services. I hope to one day formally bring these ideas and approaches to industries such as logistics, public transportation, and even the service industry! Nobl9 understands all of this. They've truly internalized the most important aspects of service level objectives into the very DNA of their product in a way I haven't seen in any other product offering. I've seen well-implemented service level objectives make humans happier and I think Nobl9 can make a lot of humans happier by helping them do this correctly.

VMblog:  SRE is near the top of the list of hottest IT jobs right now.  What do you see happening in this field in the next 5 years?

Hidalgo:  I think we're at a very exciting crossroads in the tech industry. Tech in general has historically been pretty bad about feeling the need to invent everything itself even though for almost every situation there is prior art! Reliability engineering has been around since the 1940s -- in some ways you could say it's been around for millennia as people strove to build things that didn't fail too often. Statistics is centuries old. Resilience engineering isn't quite as ancient, but you're seeing the industry pick up on those studies and that knowledge as well. I think the last few years you've seen people opening their eyes to the fact that tech can learn from those that came before us, and I see that nowhere more clearly than within SRE. I don't know exactly what SRE will look like in 5 years, but I'm pretty certain it'll look a lot more like other engineering disciplines -- in a very good way.

Published Friday, September 25, 2020 7:32 AM by David Marshall