What is Telco Reliability?
What does reliability mean to telcos? When the emergency systems responsible for a 911 call go offline, people's lives are at stake. That scenario is, in one sense, the driving force behind the kind of telecommunications systems people have come to expect. Reliability in a telco environment is a far more critical, and in some ways more complex, requirement than it is for traditional IT infrastructure.
When
we dig deeper, we can look at the reliability [1][2][3] of a
specific protocol, TCP:
Reliability:
One of the key functions of the widely used Transport Control Protocol (TCP) is
to provide reliable data transfer over an unreliable network. ... Reliable data
transfer can be achieved by maintaining sequence numbers of packets and
triggering retransmission on packet loss.
--
Serpanos, Dimitrios, Wolf, Tilman
In this way, reliability is about the correctness of data (regardless of the methods used to ensure that the data is correct). One method for measuring reliability is mean time between failures (MTBF [4]).
Telco Availability
Another way to describe what we've come to expect from the emergency system story is in terms of availability and serviceability. How long a system can stay functional, regardless of corruption of data (e.g. a successful phone call that has background noise), is the availability of the system. How easily a system can be repaired when it goes down is the serviceability [5] of the system. The availability of a system is measured as the percentage of time it is up, e.g. 99.99% uptime. The serviceability of a system can be measured by mean time to recover (MTTR [6]).
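As a rough rule of thumb (assuming steady-state operation and the standard definitions above), the three measures are related by:

Availability = MTBF / (MTBF + MTTR)

For example, a component that fails on average once every 1,000 hours (MTBF) and takes 0.1 hours to recover (MTTR) is available 1,000 / 1,000.1 of the time, or roughly 99.99%. Shrinking MTTR therefore improves availability just as surely as stretching MTBF does.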
Cloud Native Availability
Cloud native systems [7] try to address reliability (MTBF) by giving their subcomponents higher availability through better serviceability (lower MTTR) and redundancy. For example, in the cloud native world, ten redundant subcomponents of which seven are available and three have failed will produce a top-level component that is more reliable (higher MTBF) than a single component that "never fails."
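To make the redundancy argument concrete, here is a minimal Python sketch. The numbers are assumptions for illustration (each subcomponent is taken to be independently available 99% of the time, and the top-level component is taken to need any 7 of its 10 subcomponents), not figures from a real system:

from math import comb

def system_availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent subcomponents,
    each available with probability p, are up at the same time."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: 10 subcomponents, each only 99% available,
# with any 7 of them sufficient to serve traffic.
print(f"{system_availability(10, 7, 0.99):.6f}")  # ~0.999998, versus 0.99 for a single subcomponent

Even though every individual part is far from perfect, the assembly of redundant, quickly replaceable parts ends up more available than a single "never fails" component could realistically be.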
How do cloud native systems focus on MTTR? They do so by focusing on pipelines (the ability to easily recreate the whole system) that deploy immutable infrastructure using declarative configuration (saying what a system should do, not how it should do it), combined with the reconciler pattern (constantly repairing the system based on its declared specification). With this method, cloud native systems prioritize availability of the overall system over reliability of its subcomponents.
Two Ways to Resilience
We
have seen two descriptions, telco and cloud native, of resilient systems. If we
take a deeper look at how these systems are implemented, maybe we can find
common ground.
K8s Enterprise Way
Cloud native infrastructure uses the reconciler pattern. With the reconciler pattern, the desired state of the application (e.g. there should be three identical nginx servers) is declared as configuration, and the actual state is discovered by querying the system. An orchestrator repeatedly queries the actual state of the system, and when there is a difference between the actual state and the desired state, it reconciles them.
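To make the loop concrete, here is a toy Python sketch of the reconciler pattern. It is not Kubernetes, just the shape of the loop; the names desired_state, observe and repair are hypothetical stand-ins:

import time

# Declarative spec: what the system should look like,
# e.g. "there should be three identical nginx servers".
desired_state = {"nginx": 3}

# Toy stand-in for the real world that an orchestrator would query.
actual_state = {"nginx": 1}

def observe() -> dict:
    """Query the (toy) system for its current state."""
    return dict(actual_state)

def repair(name: str, want: int, have: int) -> None:
    """Converge the toy system toward the declared state."""
    actual_state[name] = want
    print(f"reconciled {name}: {have} -> {want}")

def reconcile_forever(poll_seconds: float = 1.0) -> None:
    """The reconciler loop: observe, diff against the spec, repair, repeat."""
    while True:
        current = observe()
        for name, want in desired_state.items():
            have = current.get(name, 0)
            if have != want:
                repair(name, want, have)
        time.sleep(poll_seconds)

if __name__ == "__main__":
    reconcile_forever()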
Kubernetes is an orchestrator and an implementation of the reconciler pattern. The configuration for an application, such as how many instances to run (pods, containers) and how those instances are monitored (liveness and readiness probes), is managed by Kubernetes.
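For a small, read-only taste of this in practice, the official kubernetes Python client can be used to compare what a Deployment declares with what is actually running. This sketch assumes a cluster reachable through the local kubeconfig and a Deployment named nginx in the default namespace; both are assumptions for illustration:

# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()   # authenticate using the local kubeconfig
apps = client.AppsV1Api()

# Desired state lives in .spec, observed state in .status.
deploy = apps.read_namespaced_deployment(name="nginx", namespace="default")
want = deploy.spec.replicas
have = deploy.status.ready_replicas or 0

print(f"declared replicas: {want}, ready replicas: {have}")
# Whenever these differ, the Kubernetes controllers keep working to close the gap.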
The Erlang Way
Erlang has been used to build telecommunications systems that are fault tolerant and highly available from the ground up [8]. The Erlang method for building resilient systems is similar to the K8s enterprise way, in that there are lower layers that get restarted by higher layers:
In erlang you build robust
systems by layering. Using processes,
you create a tree in which the leaves consist of the application layer that
handles the operational tasks while the interior nodes monitor the leaves and
other nodes below them ... [9]
Instead of an orchestrator (K8s) that watches over pods, Erlang has a hierarchy of supervisors [10]:
In a well designed system,
application programmers will not have to worry about error-handling code. If a worker crashes, the exit signal is sent
to its supervisor, which isolates it from the higher levels of the system ...
Based on a set of ... parameters ... the supervisor will decide whether the
worker should be restarted. [11]
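The supervision idea is easy to mimic outside Erlang as well. Below is a minimal Python sketch of a one-worker supervisor that restarts its child on abnormal exit instead of handling the error inside the worker; the crash probability and restart limit are made-up illustration values, not anything from OTP:

import multiprocessing as mp
import random
import time

def worker() -> None:
    """A worker that occasionally crashes, in the "let it fail" spirit."""
    time.sleep(0.1)
    if random.random() < 0.5:
        raise RuntimeError("simulated fault")  # crash rather than handle the error locally

def supervise(max_restarts: int = 5) -> None:
    """Restart the worker on abnormal exit, up to a restart limit
    (a crude analogue of an Erlang one_for_one supervisor with restart intensity)."""
    restarts = 0
    while restarts <= max_restarts:
        child = mp.Process(target=worker)
        child.start()
        child.join()
        if child.exitcode == 0:   # normal exit: nothing to repair
            return
        restarts += 1             # abnormal exit: the supervisor decides to restart
        print(f"worker crashed (exit {child.exitcode}); restart {restarts}/{max_restarts}")
    print("restart limit exceeded; escalate to the supervisor above this one")

if __name__ == "__main__":
    supervise()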
In an Erlang system, reliability is accomplished by achieving fault tolerance through replication [12]. In both traditional telco systems and cloud native enterprise systems, workers that restart in order to heal themselves are a common theme. Repairability, specifically of subcomponents, is an important feature of cloud native systems. The Erlang world has the mantra "Let it fail." This speaks to the repairability and mean time to recover (MTTR) of subcomponents. Combining reliability, or mean time between failures (MTBF), with repairability makes it possible to focus on MTTR, which will produce the type of highly available and reliable systems that telco cloud native systems need to be.
Telco Within Cloud Native
Like other domains that have found value in microservices architecture and containerization, telco is slowly but steadily embracing Cloud Native [13]. One important reason for this is the need for telco applications to run on different infrastructures, as most service providers have bespoke configurations. This has resulted in the creation of CNFs (Cloud-Native Network Functions) that are built from the ground up on standard cloud-native principles: containerized apps subjected to CI/CD flows, test beds maintained via IaC (Infrastructure-as-Code) constructs, and everything deployed and operated on Kubernetes.
The
CNF-WG is a body within the CNCF (Cloud
Native Computing Foundation) that is involved in defining the process around
evaluating the cloud nativeness of networking applications, aka CNFs. One of the (now) standard practices
furthering reliability in the cloud-native world is Chaos Testing or Chaos
Engineering. It can be defined as:
...the discipline of experimenting on
a system in order to build confidence in the system's capability to withstand
turbulent conditions in production.
In
layman's terms, chaos
testing is like vaccination. We inject harm to build immunity from outages.
Subjecting
the telco services on Kubernetes to chaos on a regular basis (with the right
hypothesis and validation thereof) is a very useful exercise in unearthing
failure points and improving resilience. The CNF-TestSuite
initiative employs such resilience scenarios executed on the vendor-neutral CNF
testbed.
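To give a flavor of what such a resilience scenario looks like, here is a heavily simplified pod-delete experiment written with the official kubernetes Python client. In practice the CNF-TestSuite drives purpose-built tooling such as LitmusChaos rather than hand-rolled scripts, and the Deployment name, namespace, label selector and timings below are assumptions for illustration only:

# pip install kubernetes
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

NAMESPACE, DEPLOYMENT, SELECTOR = "default", "nginx", "app=nginx"  # hypothetical target

def ready_replicas() -> int:
    status = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE).status
    return status.ready_replicas or 0

# Steady-state hypothesis: all declared replicas are ready before any fault is injected.
baseline = ready_replicas()
assert baseline > 0, "no ready replicas; nothing to experiment on"

# Fault injection: delete one pod backing the service.
victim = core.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items[0]
core.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
print(f"deleted pod {victim.metadata.name}")

# Validation: the reconciler should restore the steady state within a deadline.
deadline = time.time() + 120
while time.time() < deadline and ready_replicas() < baseline:
    time.sleep(5)

assert ready_replicas() >= baseline, "hypothesis violated: replicas did not recover"
print("steady state restored; hypothesis holds")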
In
the subsequent sections, we will take a deeper look at Chaos testing, what
benefits can be expected from it, and how it plays out within the Cloud-Native
model.
Antifragility and Chaos Testing
Antifragility is when a system gets stronger, rather than weaker, under stress. An example of this is the human immune system: certain types of exposure to illness can actually make the immune system stronger.
Some things benefit from shocks; they
thrive and grow when exposed to volatility, randomness, disorder, and stressors
and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the
phenomenon, there is no word for the exact opposite of fragile. Let us call it
antifragile. Antifragility is beyond resilience or robustness. The resilient
resists shocks and stays the same; the antifragile gets better -- Nassim Taleb [14]
Cloud native architectures are a way to achieve resilience by enforcing self-healing configuration; in the best examples, no human intervention is needed.
In chaos testing, individual subcomponents are removed or corrupted and the system is checked for correctness. This is done on a subset of production components to ensure the proper level of reliability and availability is baked in at all levels of the architecture. If we add chaos testing to a cloud native architecture, the result should be an organization that develops processes with resilience in mind from the start.
Chaos Engineering in the Cloud Native World
Originally brought about as the result of a "shift-right" philosophy that advocated testing in production environments to verify system behaviour upon controlled failures, chaos engineering has evolved to become more ubiquitous in terms of where and how it is employed. The "Chaos First" principle involves testing resilience via failures at each stage of product development and in environments that are not production, i.e., in QA and staging environments. This evolution has been accelerated by the paradigm shift in development and deployment practices caused by the emergence of cloud-native architecture, which has microservices and containerization at its core. With increased dependencies and an inherently more complex hosting environment (read: Kubernetes), points of failure increase even further, making an even stronger case for the practice of chaos testing. While the reconciliation model of cloud-native/Kubernetes provides greater opportunities for self-healing, it still needs to be coupled with the right domain intelligence (in the form of custom controllers, the right deployment architecture, etc.) to provide resilience.
Add to this the dynamic nature of a Kubernetes-based deployment environment, in which different connected microservices undergo independent upgrades, and there is a real need for continuous chaos or resilience testing. This makes chaos tests an integral part of the DevOps (CI/CD) flow.
Conclusion
Borrowing from the lessons learned when applying chaos testing to cloud native environments, we should use declarative chaos specifications to test telecommunications infrastructure in tandem with its development and deployment. The CI/CD tradition of "pull the pain forward," with a focus on MTTR, will produce the type of highly available and reliable systems that cloud native telecommunications systems need to be.
Continue the Conversation
If
you are interested in this topic and would like to help define cloud native
best practices for telcos, you're welcome to join the Cloud Native Network
Functions Working Group (CNF WG) slack channel, participate in CNF WG calls on Mondays at 16:00 UTC, and
attend the CNF WG KubeCon NA session virtually on Friday, October
15th.
For more details about the CNF WG and LitmusChaos, visit:
https://github.com/cncf/cnf-wg/
https://github.com/litmuschaos/litmus
ABOUT THE AUTHORS
W. Watson - Principal Consultant, Vulk Coop
W.
Watson has been professionally developing software for over 25 years. He has
spent numerous years studying game theory and other business expertise in
pursuit of the perfect organizational structure for software co-operatives. He
also founded the Austin Software Cooperatives meetup group and Vulk Coop as an
alternative way to work on software as a group. He has a diverse background
that includes service in the Marine Corps as a computer programmer, and
software development in numerous industries including defense, medical,
education, and insurance. He has spent the last couple of years developing
complementary cloud native systems such as the cncf.ci dashboard. He currently
works on the Cloud Native Network Function (CNF) Test Suite, the CNF Testbed, and the Cloud
native networking principles
initiatives. Recent speaking experiences include ONS NA, KubeCon NA 2019, and
Open Source Summit 2020.
Karthik
S - Co-founder, ChaosNative & Maintainer, LitmusChaos
Karthik
S is the chief architect of the CNCF chaos engineering project LitmusChaos. He
has over 12 years of experience, mostly spent in improving the resiliency of
software systems, especially in the areas of data storage & SaaS solutions.
He is passionate about all things Kubernetes and DevOps. A regular at CNCF
meetups, his recent speaking experiences include Gitlab Commit 2019, OSConf
2020, Kubecon EU 2021, CDCon 2021 & PerconaLive 2021.
[1] 4.
Transport layer: Transport protocols establish end-to-end communication between
end systems over the network defined by a layer 3 protocol. Often, transport
layer protocols provide reliability, which refers to complete and correct data
transfer between end systems. Reliability can be achieved through mechanisms
for end-to-end error detection, retransmissions, and flow control. Serpanos,
Dimitrios, Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann
Series in Computer Architecture and Design) (p. 12). Elsevier Science. Kindle
Edition.
[2] Reliability:
One of the key functions of the widely used Transport Control Protocol (TCP) is
to provide reliable data transfer over an unreliable network. As discussed
briefly in the Appendix, reliable data transfer can be achieved by maintaining
sequence numbers of packets and triggering retransmission on packet loss.
Serpanos, Dimitrios, Wolf, Tilman. Architecture of Network Systems (The Morgan
Kaufmann Series in Computer Architecture and Design) (p. 142). Elsevier
Science. Kindle Edition.
[3] Transmission
Control Protocol: The transmission control protocol operates in
connection-oriented mode. Data transmissions between end systems require a
connection setup step. Once the connection is established, TCP provides a
stream abstraction that provides reliable, in-order delivery of data. To
implement this type of stream data transfer, TCP uses reliability, flow
control, and congestion control. TCP is widely used in the Internet, as
reliable data transfers are imperative for many applications. Serpanos,
Dimitrios, Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann
Series in Computer Architecture and Design) (p. 143). Elsevier Science. Kindle
Edition.
[4] Reliability can be defined as the probability that a system will produce correct outputs up to some given time t. Reliability is enhanced by
features that help to avoid, detect and repair hardware faults. A reliable
system does not silently continue and deliver results that include uncorrected
corrupted data. Instead, it detects and, if possible, corrects the corruption,
for example: by retrying an operation for transient (soft) or intermittent errors, or else, for
uncorrectable errors, isolating the fault and reporting it to higher-level
recovery mechanisms (which may failover to redundant replacement hardware,
etc.), or else by halting the affected program or the entire system and
reporting the corruption. Reliability can be characterized in terms of mean time between
failures (MTBF) -- https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability
[5] Serviceability
or maintainability is the simplicity and speed with which a system can be
repaired or maintained; if the time to repair a failed system increases, then
availability will decrease. --
https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability
[6] Availability
means the probability that a system is operational at a given time, i.e. the
amount of time a device is actually operating as the percentage of total time
it should be operating. High-availability systems may report availability in
terms of minutes or hours of downtime per year. Availability features allow the
system to stay operational even when faults do occur. A highly available system
would disable the malfunctioning portion and continue operating at a reduced
capacity. In contrast, a less capable system might crash and become totally
nonoperational. Availability is typically given as a percentage of the time a
system is expected to be available, e.g., 99.999 percent ("five nines")--
https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability
[7] https://www.cncf.io/online-programs/what-is-cloud-native-and-why-does-it-exist/
[8] British Telecom issued a press release claiming nine-nines
availability during a six-month trial of an AXD301 ATM switch network that
carried all of its long-distance-traffic calls. Cesarini, Francesco, Vinoski,
Steve. Designing for Scalability with Erlang/OTP: Implement Robust,
Fault-Tolerant Systems (p. 404). O'Reilly Media. Kindle Edition.
[9] In
erlang you build robust systems by layering.
Using processes, you create a tree in which the leaves consist of the
application layer that handles the operational tasks while the interior nodes
monitor the leaves and other nodes below them ... Cesarini, Francesco, Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 216). O'Reilly Media. Kindle Edition.
[10] ... you should never allow processes that are not part of a supervision tree ... Cesarini, Francesco, Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 216). O'Reilly Media. Kindle Edition.
[11] In
a well designed system, application programmers will not have to worry about
error-handling code. If a worker
crashes, the exit signal is sent to its supervisor, which isolates it from the
higher levels of the system ... Based on a set of ... parameters ... the
supervisor will decide whether the worker should be restarted. Cesarini, Francesco, Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (pp. 216-217). O'Reilly Media. Kindle Edition.
[12] ...
There are a number of advantages in building a distributed system ... It will
provide performance that can scale with demand ... Replication also provides
fault tolerance ... This fault tolerance allows the system to be more robust
and reliable. Cesarini, Francesco, Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 348). O'Reilly Media. Kindle Edition.
[13] https://vmblog.com/archive/2020/09/08/a-break-from-the-past-why-cnfs-must-move-beyond-the-nfv-mentality.aspx#.YT_B19NKhhE
[14] "Some
things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk,
and uncertainty. Yet, in spite of the ubiquity of the
phenomenon, there is no word for the exact opposite of fragile. Let us call it
antifragile. Antifragility is beyond resilience or robustness. The resilient
resists shocks and stays the same; the antifragile gets better" Taleb, Nassim
Nicholas. Antifragile (Incerto) (p. 3). Random House Publishing Group. Kindle
Edition.