
Rookout is a SaaS company
that helps businesses get the data they need in real time, in order to make better
decisions. VMblog recently connected with its CEO, Shahar Fogel, to discuss a number of important topics. Rookout advocates for the importance of "Understandability" in
software, and you'll want to hear how Fogel explains the differences between Observability and
Understandability. You'll also want to learn more about the company's latest offering, Agile Flame Graphs, which collect the most useful data across applications, such as CPU consumption and latency between microservices, and then visualize it in an easily accessible manner.
VMblog: What does it mean
to have a "modern" debugger or a "production" debugger? Why
are the old ways of debugging applications not suitable for cloud-native
environments, and what other tools are tackling this problem?
Shahar Fogel: The "old ways" of debugging are about either reproducing the
issue locally and debugging step-by-step, or about adding log lines and hoping that
the issue will happen again. In cloud-native and distributed environments, and
in production environments, these methods are not effective and are sometimes outright
impossible to use.
Complex environments are hard or impossible to reproduce
locally, and debugging step-by-step means stopping your app or pod, which is
something you can't do in production. And adding log lines means waiting
anywhere from hours to days, even for the most agile and
automation-driven organizations.
In addition, the old way of debugging does not really
allow you to debug a distributed environment with thousands of instances. Engineers
don't know where the issue is triggered, and lack visibility into the exact
point and instance or server where the issue occurred.
Most tools tackling this problem speak of observability
and mention distributed tracing, which is a fancier way of adding log lines
(and still requires adding code and waiting). Some tools tackle the problem in
a way similar to Rookout's and use terminology such as logpoints,
tracepoints or snappoints (we prefer the term "non-breaking breakpoints").
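To make the idea of a non-breaking breakpoint concrete, here is a minimal sketch in Python. It is not Rookout's implementation, and the names `collect_snapshots` and `apply_discount` are hypothetical. It assumes the standard `sys.settrace` hook, and simply copies the local variables whenever a chosen line is about to run, without pausing the program.

```python
import sys


def collect_snapshots(target_name, target_line, func, *args, **kwargs):
    """Run func and copy the local variables each time line `target_line`
    of the function named `target_name` is about to execute."""
    snapshots = []

    def tracer(frame, event, arg):
        if frame.f_code.co_name != target_name:
            return None  # skip line-level tracing for every other function
        if event == "line" and frame.f_lineno == target_line:
            snapshots.append(dict(frame.f_locals))  # copy, never pause
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, snapshots


def apply_discount(price, percent):
    discount = price * percent / 100
    total = price - discount
    return total


# The "breakpoint" sits on the `return total` line, three lines below `def`.
target_line = apply_discount.__code__.co_firstlineno + 3
result, snapshots = collect_snapshots(
    "apply_discount", target_line, apply_discount, 200, 10
)
print(result, snapshots)
```

The function still returns normally, and the snapshot holds the values a developer would otherwise inspect by stopping at that line. A production-grade tool would need a much lower-overhead mechanism than line tracing; this only illustrates the concept.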
VMblog: Rookout claims
that it's not enough simply to observe applications, and that developers need
instant access to data in order to better understand
them. Can you talk about the differences between Observability and
Understandability?
Fogel: Exactly right. We've seen Observability take off as a
category, given the importance of being able to observe the health of your
systems across distributed environments. But it's still very challenging for
developers to actually go in and get the data they need to make better
decisions, and that's what we are calling Understandability: the ability to
quickly understand an issue and get to its root cause, without the long and
cumbersome process that exists today.
VMblog: You have a
great-looking new product called Agile
Flame Graphs. Can you tell the readers a little more about it and how it
helps developers better understand their applications?
Fogel: Flame graphs are the type of graph you expect to see when
tackling the complex challenge of profiling an application, in an attempt to
find performance bottlenecks and identify the areas in your code that cause
high latency, or the "hot spots" that get hit so frequently that they just get
stuck, and cause a poor user experience and even crashes.
The purpose of the flame graph is to show you a
color-coded breakdown of the time spent in different areas of your code.
Traditional tools for profiling applications are too resource-intensive to use
broadly in production without a significant negative impact on application
performance. They also come with a lot of noise, and can only really be
operated effectively by operations experts. We wanted to simplify the
traditional flame graph to just the most necessary information, so that it
can be visualized easily without creating a ton of overhead, while still giving
engineers the most precise information and metrics related to the performance
of their application and codebase.
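For readers unfamiliar with how a flame graph is built, the sketch below shows the general technique in Python; it does not describe how Agile Flame Graphs work internally. It assumes a sampling profiler has already recorded call stacks, collapses them into the widely used "folded" form (`root;child;leaf` plus a count), and then totals the samples under each frame, which is what determines the width of each box. The sample data and function names are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples (root first) into 'root;child;leaf' -> count."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded):
    """Total samples under every stack prefix.

    Each prefix corresponds to one box in the flame graph, and the
    total is proportional to the time spent in that frame and its callees.
    """
    widths = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        for depth in range(1, len(frames) + 1):
            widths[";".join(frames[:depth])] += count
    return widths


# Hypothetical samples, as a sampling profiler might record them.
samples = [
    ("handle_request", "load_user", "query_db"),
    ("handle_request", "load_user", "query_db"),
    ("handle_request", "load_user", "query_db"),
    ("handle_request", "render_page"),
    ("handle_request",),
]

folded = fold_stacks(samples)
widths = frame_widths(folded)
for path, count in sorted(widths.items()):
    print(f"{path}: {count / len(samples):.0%} of samples")
```

In this made-up data, `query_db` appears in three of the five samples, so its box would span 60% of the graph's width and stand out as the likely hot spot.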
VMblog: When I think of
flame graphs, it's usually in an IT Ops context. What made you bring an agile
version into the debugging workflow, and do you think in general we are asking
developers to care about too many things other than just writing code?
Fogel: That's a good question. It reminds me of DevSecOps, where we ask whether it is
reasonable for developers to now also care about security and become experts in
that as well. And the truth is that these issues come back to the developer
one way or another, whether it's after a customer has been impacted and you get a
JIRA ticket, or you deal with it earlier in a more productive, shift-left context.
I think developers are very interested in having better visibility into
the performance of their part of the application, but if it's too cumbersome they
won't use it. That's why we built Agile Flame Graphs.
VMblog: To drive home the
point about the importance of resilience shifting left, I've heard you talk
about bugs like they are "mini outages" for customers -- and that's a
really interesting way to put it. Can you expand on what you mean by that?
Fogel: The conversations around reliability and resilience tend to be focused
on the SRE / Ops side of the house. But the truth is that software bugs are a
main cause of outages and customer issues. Even if the system in general is up,
the customer's experience of running into a bug or even just the inability to
use a specific feature can prevent them from doing what they want to do. Being
able to resolve these issues faster for customers is a big part of resilience
from my point of view.
##
Shahar Fogel, CEO of Rookout, has spent the last two decades leading data-driven businesses, products and R&D teams, from early-stage start-ups to government organizations. Shahar is passionate about software architecture and observability, and has worked as a cybersecurity team lead, product manager, and VC investor. He is a Cambridge University MBA alumnus.