Cloud Native Chaos and Telcos - Enforcing Reliability and Availability for Telcos


What is Telco Reliability?

What does reliability mean to telcos?  When the emergency systems responsible for a 911 call go offline, people's lives are at stake.  In one sense, that scenario is the driving force behind the kind of telecommunications systems people have come to expect. Reliability in a telco environment is far more critical, and in some ways a more complex requirement, than it is in traditional IT infrastructure.

[Figure: telco infrastructure complexity and scale]

When we dig deeper, we can look at the reliability [1][2][3] of a specific protocol, TCP:

Reliability: One of the key functions of the widely used Transport Control Protocol (TCP) is to provide reliable data transfer over an unreliable network. ... Reliable data transfer can be achieved by maintaining sequence numbers of packets and triggering retransmission on packet loss.

-- Dimitrios Serpanos and Tilman Wolf

In this way, reliability is about the correctness of data (regardless of the methods used to ensure that data is correct).  One method for measuring reliability is mean time between failures (MTBF) [4].

Telco Availability

Another description of what we've come to expect from the emergency system story is known as availability and serviceability.  How long a system can stay functional, regardless of corruption of data (e.g. a successful phone call that has background noise), is the availability of the system.  How easily a system can be repaired when it goes down is the serviceability [5] of the system.  The availability of a system is measured by its percentage of uptime, e.g. 99.99% [6], while the serviceability of a system can be measured by mean time to recover (MTTR).
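
To make these measures concrete, here is a minimal sketch (in Python, using made-up failure data) of how MTBF, MTTR, and availability relate to one another:

```python
# A minimal sketch with hypothetical numbers: how MTBF, MTTR, and
# availability relate for a single component.

# Hypothetical observation window: 30 days of operation, in hours.
total_hours = 30 * 24

# Hypothetical outages observed in that window (hours of downtime each).
outage_hours = [0.5, 1.0, 0.25]

downtime = sum(outage_hours)
uptime = total_hours - downtime
failures = len(outage_hours)

mtbf = uptime / failures        # mean time between failures
mttr = downtime / failures      # mean time to recover
availability = mtbf / (mtbf + mttr)

print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h")
print(f"Availability: {availability:.5%}")  # roughly 99.757% for these numbers
```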

Cloud Native Availability

Cloud native systems [7] try to address reliability (MTBF) by giving subcomponents higher availability through better serviceability (lower MTTR) and through redundancy. For example, ten redundant subcomponents of which seven are available and three have failed will produce a top-level component that is more reliable (higher MTBF) than a single component that "never fails" in the cloud native world.
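
As a rough illustration of why redundancy helps, the sketch below (assuming hypothetical, independently failing subcomponents) computes the probability that at least seven of ten replicas are up at any given moment:

```python
from math import comb

# Hypothetical availability of each individual subcomponent.
p = 0.99          # each replica is up 99% of the time
n, k = 10, 7      # 10 replicas; the system works if at least 7 are up

# Probability that at least k of n independent replicas are available
# (a binomial tail), assuming failures are independent.
system_availability = sum(
    comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1)
)

print(f"Single replica availability: {p:.2%}")
print(f"System availability (at least {k} of {n} up): {system_availability:.10%}")
```

Under these idealized assumptions, the redundant group is available far more often than any single replica, which is the intuition behind trading subcomponent reliability for system-level availability.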

How do cloud native systems reduce MTTR? They focus on pipelines (the ability to easily recreate the whole system) that deploy immutable infrastructure using declarative configuration (saying what a system should do, not how it should do it), combined with the reconciler pattern (constantly repairing the system based on its declared specification).  With this method, cloud native systems prioritize availability of the overall system over reliability of the subcomponents.

Two Ways to Resilience

We have seen two descriptions, telco and cloud native, of resilient systems. If we take a deeper look at how these systems are implemented, maybe we can find common ground.  

K8s Enterprise Way

Cloud native infrastructure uses the reconciler pattern.  With the reconciler pattern, the desired state of the application (e.g. there should be three identical nginx servers) is declared as configuration, and the actual state is the result of querying the system.  An orchestrator repeatedly queries the actual state of the system and, whenever it differs from the desired state, reconciles the two.
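
Here is a minimal, hypothetical sketch of that loop in Python (not Kubernetes' actual implementation): the desired state is declared, the actual state is observed, and any difference is repaired:

```python
import time

# Declared (desired) state: "there should be three identical nginx servers".
desired = {"nginx": 3}

# A toy stand-in for the real system; an actual orchestrator would query
# and mutate real infrastructure instead of a dictionary.
actual = {"nginx": 1}

def reconcile(desired_state, actual_state):
    """Compare desired and actual state and repair the difference."""
    for name, want in desired_state.items():
        have = actual_state.get(name, 0)
        if have < want:
            print(f"{name}: {have} running, want {want} -> starting {want - have}")
            actual_state[name] = want   # stand-in for starting instances
        elif have > want:
            print(f"{name}: {have} running, want {want} -> stopping {have - want}")
            actual_state[name] = want   # stand-in for stopping instances

# The control loop: observe, compare, reconcile, repeat.
for _ in range(3):
    reconcile(desired, actual)
    time.sleep(1)
```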

Kubernetes is an orchestrator and an implementation of the reconciler pattern.  The configuration for an application, such as how many instances to run (pods, containers) and how those instances are monitored (liveness and readiness probes), is managed by Kubernetes.

The Erlang Way

Erlang has been used to build telecommunications systems that are fault tolerant and highly available from the ground up [8].  The Erlang method for building resilient systems is similar to the K8s enterprise way, in that there are lower layers that get restarted by higher layers:

In erlang you build robust systems by layering.  Using processes, you create a tree in which the leaves consist of the application layer that handles the operational tasks while the interior nodes monitor the leaves and other nodes below them ... [9]

Instead of an orchestrator (K8s) that watches over pods, Erlang has a hierarchy of supervisors [10]:

In a well designed system, application programmers will not have to worry about error-handling code.  If a worker crashes, the exit signal is sent to its supervisor, which isolates it from the higher levels of the system ... Based on a set of ... parameters ... the supervisor will decide whether the worker should be restarted. [11]

In an Erlang system, reliability is accomplished by achieving fault tolerance through replication [12].  In both traditional telco systems and cloud native enterprise systems, workers that restart in order to heal themselves are a common theme.  Repairability, specifically of subcomponents, is an important feature of cloud native systems.  The Erlang world has the mantra "let it crash", which speaks to the repairability and mean time to recovery (MTTR) of subcomponents.  Combining reliability, or mean time between failures (MTBF), with repairability makes it possible to focus on MTTR, which will produce the type of highly available and reliable systems that telco cloud native systems need to be.
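
Erlang/OTP itself is beyond the scope of this article, but the supervision idea can be sketched in Python as a supervisor loop that restarts a crashing worker (a rough analogy, not OTP):

```python
import random
import time

def worker():
    """A worker that occasionally crashes ("let it crash")."""
    if random.random() < 0.3:
        raise RuntimeError("worker crashed")
    print("worker handled a request")

def supervisor(child, requests=20, max_restarts=5):
    """Restart the child when it fails, up to a restart limit, instead of
    burying error-handling code inside the worker itself."""
    restarts = 0
    for _ in range(requests):
        try:
            child()
        except Exception as err:
            restarts += 1
            print(f"supervisor: {err}; restart {restarts}/{max_restarts}")
            if restarts > max_restarts:
                print("supervisor: restart limit reached, escalating")
                return
            time.sleep(0.1)  # brief backoff before restarting the worker

supervisor(worker)
```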

Telco Within Cloud Native

Like other domains that have found value in microservices architecture and containerization, telco is slowly but steadily embracing Cloud Native [13]. One important reason for this is the need for telco applications to run on different infrastructures, as most service providers have bespoke configurations. This has resulted in the creation of CNFs (Cloud-Native Network Functions), which are built from the ground up on standard cloud-native principles: containerized applications subjected to CI/CD flows, test beds maintained via IaC (Infrastructure-as-Code) constructs, and everything deployed and operated on Kubernetes.

The CNF-WG is a body within the CNCF (Cloud Native Computing Foundation) that is involved in defining the process around evaluating the cloud nativeness of networking applications, aka CNFs.  One of the (now) standard practices furthering reliability in the cloud-native world is Chaos Testing or Chaos Engineering. It can be defined as: 

...the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.

In layman's terms, chaos testing is like vaccination. We inject harm to build immunity from outages.

Subjecting the telco services on Kubernetes to chaos on a regular basis (with the right hypothesis and validation thereof) is a very useful exercise in unearthing failure points and improving resilience. The CNF-TestSuite initiative employs such resilience scenarios executed on the vendor-neutral CNF testbed.

In the subsequent sections, we will take a deeper look at Chaos testing, what benefits can be expected from it, and how it plays out within the Cloud-Native model.

Antifragility and Chaos Testing

Antifragility is when a system gets stronger, rather than weaker, under stress.  An example of this would be the immune system in humans. Certain types of exposure to illness actually can make the immune system stronger.  

Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better -- Nassim Taleb [14]

Cloud native architectures are a way to achieve resilience by enforcing self-healing configuration, and the best examples of such systems heal without human intervention.

When chaos testing, individual subcomponents are removed or corrupted and the system is checked for correctness.  This is done on a subset of production components to ensure the proper level of reliability and availability is baked in at all levels of architecture.  If we add chaos testing to a cloud native architecture, the result should be an organization that develops processes that have resilience in mind from the start.
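
As a conceptual sketch (using hypothetical helper functions, not the CNF-TestSuite or LitmusChaos APIs), a chaos experiment pairs an injected failure with a hypothesis about what the system should still do:

```python
import random

# Hypothetical stand-ins for real tooling: in practice these would call the
# Kubernetes API (to delete a pod) and probe a service health endpoint.
def list_pods(service):
    return [f"{service}-{i}" for i in range(3)]

def delete_pod(pod):
    print(f"chaos: deleting {pod}")

def service_is_healthy(service):
    # A real check would validate the service's SLOs (latency, error rate).
    return True

def chaos_experiment(service):
    """Verify steady state, inject a failure, then re-verify the hypothesis."""
    assert service_is_healthy(service), "steady state not met, aborting"
    victim = random.choice(list_pods(service))
    delete_pod(victim)  # inject harm ("vaccination")
    assert service_is_healthy(service), "hypothesis failed: service degraded"
    print(f"{service} stayed healthy while {victim} was being killed")

chaos_experiment("emergency-call-routing")
```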

[Figure: chaos testing flow]

Chaos Engineering in the Cloud Native World

Originally brought about by a "shift-right" philosophy that advocated testing in production environments to verify system behaviour upon controlled failures, chaos engineering has evolved to become more ubiquitous in terms of where and how it is employed. The "Chaos First" principle involves testing resilience via failures at each stage of product development and in environments that are not production, i.e., in QA and staging environments. This evolution has been accelerated by the paradigm shift in development and deployment practices brought about by the emergence of cloud-native architecture, which has microservices and containerization at its core. With increased dependencies and an inherently more complex hosting environment (read: Kubernetes), the points of failure increase even further, making a stronger case for the practice of chaos testing. While the reconciliation model of cloud-native/Kubernetes provides greater opportunities for self-healing, it still needs to be coupled with the right domain intelligence (in the form of custom controllers, the right deployment architecture, etc.) to provide resilience.

[Figure: cloud native chaos engineering]

Add to this the dynamic nature of a Kubernetes-based deployment environment, in which different connected microservices undergo independent upgrades, and there is a real need for continuous chaos or resilience testing. This makes chaos tests an integral part of the DevOps (CI/CD) flow.
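
In a pipeline, the same kind of experiment can act as a gate: a failed hypothesis fails the build. A minimal, hypothetical sketch (the chaos_experiment.py script name is an assumption, standing in for whatever chaos tooling the pipeline invokes):

```python
import subprocess
import sys

# Hypothetical pipeline step: run the chaos experiment against the staging
# environment after each deploy and block promotion if the hypothesis fails.
result = subprocess.run([sys.executable, "chaos_experiment.py"])

if result.returncode != 0:
    print("chaos gate failed: blocking promotion to production")
    sys.exit(1)

print("chaos gate passed: promoting build")
```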

Conclusion

Borrowing from the lessons learned when applying chaos testing to cloud native environments, we should use declarative chaos specifications to test telecommunication infrastructure in tandem with its development and deployment.  The CI/CD tradition of "pull the pain forward" with a focus on MTTR will produce the type of highly available and reliable systems that cloud native telecommunication systems will need to be.

Continue the Conversation

If you are interested in this topic and would like to help define cloud native best practices for telcos, you're welcome to join the Cloud Native Network Functions Working Group (CNF WG) slack channel, participate in CNF WG calls on Mondays at 16:00 UTC, and attend the CNF WG KubeCon NA session virtually on Friday, October 15th.

For more details about the CNF WG and LitmusChaos, visit:

https://github.com/cncf/cnf-wg/

https://github.com/litmuschaos/litmus

##

ABOUT THE AUTHORS

W. Watson - Principal Consultant, Vulk Coop


W. Watson has been professionally developing software for over 25 years. He has spent numerous years studying game theory and other business expertise in pursuit of the perfect organizational structure for software co-operatives. He also founded the Austin Software Cooperatives meetup group and Vulk Coop as an alternative way to work on software as a group. He has a diverse background that includes service in the Marine Corps as a computer programmer, and software development in numerous industries including defense, medical, education, and insurance. He has spent the last couple of years developing complementary cloud native systems such as the cncf.ci dashboard. He currently works on the Cloud Native Network Function (CNF) Test Suite, the CNF Testbed, and the Cloud native networking principles initiatives. Recent speaking experiences include ONS NA, KubeCon NA 2019, and Open Source Summit 2020.

 

Karthik S - Co-founder, ChaosNative & Maintainer, LitmusChaos


Karthik S is the chief architect of the CNCF chaos engineering project LitmusChaos. He has over 12 years of experience, mostly spent in improving the resiliency of software systems, especially in the areas of data storage & SaaS solutions. He is passionate about all things Kubernetes and DevOps. A regular at CNCF meetups, his recent speaking experiences include Gitlab Commit 2019, OSConf 2020, Kubecon EU 2021, CDCon 2021 & PerconaLive 2021.



[1] 4. Transport layer: Transport protocols establish end-to-end communication between end systems over the network defined by a layer 3 protocol. Often, transport layer protocols provide reliability, which refers to complete and correct data transfer between end systems. Reliability can be achieved through mechanisms for end-to-end error detection, retransmissions, and flow control. Serpanos, Dimitrios; Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 12). Elsevier Science. Kindle Edition.

[2] Reliability: One of the key functions of the widely used Transport Control Protocol (TCP) is to provide reliable data transfer over an unreliable network. As discussed briefly in the Appendix, reliable data transfer can be achieved by maintaining sequence numbers of packets and triggering retransmission on packet loss. Serpanos, Dimitrios; Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 142). Elsevier Science. Kindle Edition.

[3] Transmission Control Protocol: The transmission control protocol operates in connection-oriented mode. Data transmissions between end systems require a connection setup step. Once the connection is established, TCP provides a stream abstraction that provides reliable, in-order delivery of data. To implement this type of stream data transfer, TCP uses reliability, flow control, and congestion control. TCP is widely used in the Internet, as reliable data transfers are imperative for many applications. Serpanos, Dimitrios; Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 143). Elsevier Science. Kindle Edition.

[4] Reliability can be defined as the probability that a system will produce correct outputs up to some given time t. Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it detects and, if possible, corrects the corruption, for example: by retrying an operation for transient (soft) or intermittent errors, or else, for uncorrectable errors, isolating the fault and reporting it to higher-level recovery mechanisms (which may failover to redundant replacement hardware, etc.), or else by halting the affected program or the entire system and reporting the corruption. Reliability can be characterized in terms of mean time between failures (MTBF). -- https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability

[5] Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. -- https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability

[6] Availability means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent ("five nines"). -- https://en.wikipedia.org/wiki/Reliability,_availability_and_serviceability

[7] https://www.cncf.io/online-programs/what-is-cloud-native-and-why-does-it-exist/

[8] British Telecom issued a press release claiming nine-nines availability during a six-month trial of an AXD301 ATM switch network that carried all of its long-distance-traffic calls. Cesarini, Francesco; Vinoski, Steve. Designing for Scalability with Erlang/OTP: Implement Robust, Fault-Tolerant Systems (p. 404). O'Reilly Media. Kindle Edition.

[9] In erlang you build robust systems by layering. Using processes, you create a tree in which the leaves consist of the application layer that handles the operational tasks while the interior nodes monitor the leaves and other nodes below them ... Cesarini, Francesco; Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 216). O'Reilly Media. Kindle Edition.

[10] ... you should never allow processes that are not part of a supervision tree. Cesarini, Francesco; Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 216). O'Reilly Media. Kindle Edition.

[11] In a well designed system, application programmers will not have to worry about error-handling code. If a worker crashes, the exit signal is sent to its supervisor, which isolates it from the higher levels of the system ... Based on a set of ... parameters ... the supervisor will decide whether the worker should be restarted. Cesarini, Francesco; Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (pp. 216-217). O'Reilly Media. Kindle Edition.

[12] ... There are a number of advantages in building a distributed system ... It will provide performance that can scale with demand ... Replication also provides fault tolerance ... This fault tolerance allows the system to be more robust and reliable. Cesarini, Francesco; Thompson, Simon. Erlang Programming: A Concurrent Approach to Software Development (p. 348). O'Reilly Media. Kindle Edition.

[13] https://vmblog.com/archive/2020/09/08/a-break-from-the-past-why-cnfs-must-move-beyond-the-nfv-mentality.aspx#.YT_B19NKhhE 

[14]  "Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better" Taleb, Nassim Nicholas. Antifragile (Incerto) (p. 3). Random House Publishing Group. Kindle Edition.

Published Wednesday, September 15, 2021 7:34 AM by David Marshall