Logs are the core of the human-machine
interface for software developers and operators. Historically, they were very
much like cave paintings: our first attempt to express and
understand how our software was working.
For several decades, logs were an island of
calm in a rapidly changing technological ecosystem. Logs remained the same even
as software services became web-based and grew in scale. We added context to
make them easier to search through, moved them to a structured format, and over
the past decade or two, started to aggregate and index them for ease of use.
And yet, at some point, that wasn't enough.
Thus, the three pillars of Observability were born: Logs, Metrics, and Traces.
Why do we need Metrics?
One of the most common questions we ask
ourselves while monitoring a web server is, "How many requests for that URL
did we get over the last minute?" To answer this question using logs, we must
collect logs from all servers, parse individual lines, filter for the relevant
URLs, and count the results.
Whether we build a dedicated pipeline for this
metric or calculate it by querying a fully indexed logs database, it's a long
and arduous process for both humans and machines, and it's unlikely to give us
results in real time.
Think of metrics as a way to efficiently
aggregate multiple log events of the same kind at the source application. By
counting (or applying other aggregations, such as summing) each event, you
can efficiently get a real-time view of the behavior of your application as a whole.
A much more efficient way to get high-quality
data is to create a counter inside the application and export it to the
Observability stack, which will aggregate it and produce the relevant reports.
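The idea can be sketched in a few lines of Python. This is a toy illustration of an in-process counter, not a real metrics library such as Prometheus; `RequestCounter` and its methods are invented for this example:

```python
from collections import Counter
import threading

class RequestCounter:
    """Minimal in-process counter, aggregated at the source application."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def inc(self, url):
        # Count each request as it happens: no log parsing needed later.
        with self._lock:
            self._counts[url] += 1

    def snapshot(self):
        # Export the current values to the Observability stack (e.g., on scrape).
        with self._lock:
            return dict(self._counts)

counter = RequestCounter()
for url in ["/home", "/home", "/login"]:
    counter.inc(url)
print(counter.snapshot())  # → {'/home': 2, '/login': 1}
```

Answering "how many requests over the last minute?" then becomes a cheap in-memory read instead of a log-parsing pipeline.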
So where does Distributed Tracing
come in?
Modern web applications are running on a much
grander scale than ever before. We shifted our engineering paradigms and have
adopted new architectural patterns, such as microservices and reactive
programming.
Unfortunately, this has fundamentally broken
the unwritten promise of logs: that we can tell the story by connecting the
dots one log line at a time. One can no longer assume that two consecutive log
lines are part of the same request, or even use process and thread IDs to build
the timeline.
Distributed Tracing is a way to generate the
timeline of individual requests and other processing tasks. This way, we can
easily keep track of each step within the flow, even as it crosses service and
functional boundaries.
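Stripped to its essence, distributed tracing works by generating an ID at the edge of the system and forwarding it with every downstream call. The Python sketch below is illustrative only; the header names and helpers are invented, and real systems use standards such as W3C Trace Context and libraries such as OpenTelemetry:

```python
import uuid

def start_trace():
    # Generated once, when the request first enters the system.
    return uuid.uuid4().hex

def outgoing_headers(trace_id, parent_span_id=None):
    # Each hop forwards the same trace ID (plus its own span ID), so the
    # backend can stitch every log line and span into a single timeline.
    span_id = uuid.uuid4().hex[:16]
    headers = {"x-trace-id": trace_id, "x-span-id": span_id}
    if parent_span_id:
        headers["x-parent-span-id"] = parent_span_id
    return headers

# Service A receives a request and calls Service B:
trace_id = start_trace()
hop1 = outgoing_headers(trace_id)
# Service B calls Service C, forwarding the same trace ID:
hop2 = outgoing_headers(hop1["x-trace-id"], hop1["x-span-id"])
assert hop1["x-trace-id"] == hop2["x-trace-id"]
```

Because the trace ID travels with the request, consecutive log lines no longer need to come from the same process, or thread, to be correlated.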
What's still missing?
With the addition of Metrics and Distributed Tracing, the
three pillars of Observability significantly improved the operational paradigm
of modern cloud-native applications.
Metrics allow us to bind log lines vertically
and see how the system behaves over many requests. Tracing allows us to bind
log lines horizontally and know how the system behaves through the lifespan of
a single request. Both tools are super valuable for understanding the system as
a whole and excite SREs and architects across the globe.
And yet, for most software organizations, the
software developer is the most common engineering role: those poor souls who
spend most of their time writing and debugging code.
We shift responsibility left and want
engineers to own their code across the whole software development lifecycle,
all the way to production. They don't care about the number of requests or how
requests cross service boundaries. What they want to know is how the code
behaves.
What does it take to understand
the code?
The incredible power of modern code is that
the whole is worth far more than the sum of its parts. Each variable is an
abstraction, combining code and data to provide enormous power in just a few
characters of text. The layers stack on top of each other.
The code in question might be your code, or it
might be first, second, and third-party packages and services, many of which
are open-source. The data comes from various configurations, databases, caches,
user settings, user inputs, feature flags, and more. Add to that the current
state of the application, which often brings its own set of caveats, especially
for long-running processes.
Squeezing that invaluable context into a
single log line is no picnic. When stringifying primitive values into a log
line, you lose some of the finer points, such as type information. When
stringifying complex objects, the challenge is even greater.
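A contrived Python example of the problem: once values are stringified into a log line, the type distinction disappears, and complex objects fare even worse under their default representation.

```python
user_id_int = 1001      # an integer ID
user_id_str = "1001"    # the same ID arriving as a string, e.g. from a query param

# Both stringify identically: the type information is gone from the log line.
line_a = f"processing user {user_id_int}"
line_b = f"processing user {user_id_str}"
assert line_a == line_b

# Complex objects are worse: the default repr hides every attribute.
class Order:
    def __init__(self, total):
        self.total = total

print(f"handling {Order(9.99)}")  # e.g. "handling <__main__.Order object at 0x...>"
```

A better `__repr__` helps, but only for the fields its author chose to include, at the verbosity they chose.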
Will you take a lean approach and miss out on
invaluable information? Or will you capture more deeply and hurt the
application's performance? Chances are, you won't bother in the first place,
and will pray that whoever built the library provided a decent stringification
flow that doesn't do too badly on either front.
Even worse, the current line is only a tiny
part of the application state. What about the stack trace, the request context,
or other valuable information?
What's better than logs?
Snapshots.
Snapshots are the fourth pillar of
Observability, and they meet that need. By capturing most of the relevant
application state, you get a clear, detailed, high-fidelity image of what's
happening. To paraphrase: a Snapshot is worth a thousand log lines.
Snapshots provide everything you need to know.
Variables are captured with full fidelity, maintaining type information and
exact representation. Objects are captured by individual attributes, and
collections are appropriately enumerated. The stack trace and other global
variables are readily available.
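As a rough sketch of the kind of state a snapshot captures, here is a toy Python version built on the interpreter's own frame introspection. `take_snapshot` is invented for illustration; production tools do this far more efficiently and safely:

```python
import sys
import traceback

def take_snapshot():
    """Capture the caller's local variables (with their real types) and stack."""
    frame = sys._getframe(1)  # the caller's frame
    return {
        # Full-fidelity values: type information is preserved, not stringified.
        "locals": {name: (type(value).__name__, value)
                   for name, value in frame.f_locals.items()},
        "stack": traceback.format_stack(frame),
    }

def handle_request(user_id, retries):
    cart = {"items": 3, "total": 49.9}
    return take_snapshot()

snap = handle_request(1001, retries=2)
print(snap["locals"]["user_id"])  # ('int', 1001) -- not just the string "1001"
```

Note how `user_id` comes back as an `int` and `cart` as a structured `dict`, with the stack trace alongside: exactly the context a log line flattens away.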
As is often the case with software
engineering, Snapshots are not a new concept. Operating systems such as Linux
and Windows have had snapshot tools (core dumps) for years, used to analyze
kernel and application crashes. Error monitoring tools such as Sentry or Bugsnag
utilize (limited) snapshotting capabilities focused on errors. For more recent
examples, developer Observability platforms such as Rookout are heavily focused
on Snapshots.
How do we use Snapshots?
To meet the needs of modern development, we
need to put snapshots within easy reach of every developer. We need to give them
the ability to decide ahead of time which obscure edge cases to snapshot for
ease of reproduction and fixing. We must allow them to snapshot unexpected
events in real time to understand and remediate them. We should also build
monitoring tools that intelligently identify and snapshot interesting events for
easy analysis. Lastly, we must build automation engines that correlate data from
other sources and automatically collect snapshots.
Snapshots are the key to unlocking peak
efficiency and effectiveness for engineering organizations in these turbulent
times. Even more important is the potential impact on engineering culture. By
empowering engineers to witness how their code runs in production, we promote a
true shift-left culture and create day-to-day ownership of their code across the
software development lifecycle.
After all, developers deserve a pillar too.
##
To learn more about the transformative nature of cloud native applications and open source software, join us at KubeCon + CloudNativeCon Europe 2023, hosted by the Cloud Native Computing Foundation, which takes place from April 18-21.
ABOUT THE AUTHOR
Liran Haimovitch Co-Founder and CTO,
Rookout
Liran is Co-Founder and CTO of Rookout.
He's an award-winning cyber security practitioner and writer. As an advocate of
modern software methodologies like agile, lean and DevOps, Liran's passion is
to understand how software actually works.