Linkerd, based on a
platform originally developed at Twitter to defeat the notorious Fail Whale, is
celebrating its one-year anniversary as an open source project. I recently spoke with Oliver Gould, CTO and co-founder at
Buoyant, who was on the original Twitter core engineering team tasked with
"fixing" the site and banishing the whale.
VMblog: For those who don't know, can you explain to readers what Linkerd is?
Gould: Linkerd is an open source project I started a year ago with my
co-founder at Buoyant. We had both worked at Twitter at a time when the
Fail Whale was a huge problem for the site. We left Twitter in 2015, after
those problems had been fixed, with the idea that the software we had
helped create there could help the rest of the world. The internal tool
at Twitter was called Finagle. Linkerd extends the ideas behind Finagle
and makes them useful to everyone, not just to Twitter's infrastructure.
VMblog: And what are you announcing today?
Gould: We launched the Linkerd project with the mission to make the same kind
of reliability we had at Twitter available to everyone through
something we call a "service mesh." We're off to a fast start. So far
Linkerd has served over 100 billion production requests at companies
around the globe. Today we're celebrating the project's one-year
anniversary!
VMblog: What specifically does Linkerd do?
Gould: Linkerd solves the problem of the insane amount of
complexity involved in running microservices at scale by automating the hard
and complex parts of the communications layer. It makes it easy for
operators to run microservices with automated load balancing, service
discovery, and run-time resilience. It protects against a whole class of
reliability issues inherent in distributed systems design. Think of Linkerd as
the network layer for modern cloud applications. Or, as we call it, the new
TCP stack.
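To make that concrete, here is a minimal sketch of what a Linkerd
configuration can look like, following the project's YAML format. The file
layout is real, but the specific values (the "disco" directory, port 4140)
are illustrative and not taken from the interview:

    # linkerd.yaml (illustrative): one HTTP router that discovers services
    # via simple file-based service discovery and load-balances across
    # whatever instances it finds.
    namers:
    - kind: io.l5d.fs     # file-based service discovery, good for local testing
      rootDir: disco      # each file in disco/ lists host:port pairs for a service

    routers:
    - protocol: http
      dtab: |
        /svc => /#/io.l5d.fs;   # resolve logical service names through the namer
      servers:
      - port: 4140              # applications send outbound HTTP through this port

With a config like this, an application only needs to send its outbound HTTP
traffic through port 4140; routing, discovery, and balancing happen in Linkerd.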
VMblog: And what do you mean
by a service mesh?
Gould: A service mesh is a software layer that decouples the
communication between services, or microservices, from the applications
themselves, so it can be managed and controlled independently. It adds
reliability and stability without application developers having to write
code. At Twitter we showed that this approach is a crucial part of ensuring
uptime at massive web-scale operations when they're built as microservices.
In traditional applications,
this logic is coded directly into the application itself. We think it's better
for scaling, uptime and security if you abstract that logic to the underlying
communications layer. The monitoring/visibility, tracing, load balancing and
service discovery should not be hard-coded into each application. That's too
hard, too brittle, and too complex to scale unless your company has an army of
infrastructure engineers to support this kind of architecture. Few do.
We compare the idea to the TCP stack. Just as applications shouldn't be writing their own
TCP stack, they also shouldn't be managing the critical communications logic
that underlies the reliability of their system.
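To see what that decoupling looks like from an application's point of view,
here is a minimal Python sketch. It assumes a local Linkerd proxy listening
on port 4140 and a hypothetical service named "users"; none of these names
come from the interview:

    # Sketch (hypothetical service name and port): the application makes a
    # plain HTTP request; load balancing, service discovery, and resilience
    # are handled by the local service-mesh proxy, not by this code.
    import requests

    resp = requests.get(
        "http://users/profile/42",                  # logical service name, not a host
        proxies={"http": "http://localhost:4140"},  # route via the Linkerd proxy
    )
    print(resp.status_code)

The application code stays a plain HTTP call, which is exactly the point:
the communication logic lives in the mesh, where operators can manage it
independently of the application.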