
Stateful CNFs

Towards the implementation of stateful cloud native network functions using the cloud native ecosystem 

Stateful CNFs lie at the intersection of cloud native [1], telecommunications [2], and data intensive systems [3]. Stateful applications have matured to the point of being first-class citizens in the traditionally stateless architecture of Kubernetes.[4] Although there are many shared non-functional requirements between the architectures of these three domains, when deploying a stateful network application the requirements shift in the direction of lower latency. High levels of consistency, availability, and partition tolerance have traditionally been thought of as impossible to achieve simultaneously, but there are new techniques that may help to serve low latency networking requirements.

Stateful CNF Non-Functional Requirements

The shared non-functional requirements of a stateful CNF are availability, resilience, scalability, and interoperability.

Availability

The lower bound for availability in a cloud native system is a system that has no planned maintenance windows. This is implemented using phoenix deploys [5]: when a new deployment is pushed, each routable endpoint with open connections is allowed to continue processing until its current task is finished, while new requests are routed to the newly deployed containers, components, VMs, and so on. If a system needs planned maintenance windows, we can safely assume that the architecture does not provide even this minimal availability guarantee. A high level of availability, measured by the 'nines' nomenclature, would be something comparable to five nines [6].
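The drain-and-replace behavior described above can be illustrated with a minimal sketch: a service that, on receiving SIGTERM from the orchestrator, stops accepting new requests but lets in-flight requests finish. The port and handler here are illustrative assumptions, not a specific platform's API.

```python
# Minimal sketch of connection draining during a phoenix-style deploy:
# on SIGTERM, stop accepting new requests but let in-flight ones finish.
import signal
import threading
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")

server = ThreadingHTTPServer(("0.0.0.0", 8080), Handler)

def drain(signum, frame):
    # shutdown() stops the accept loop; threads serving already-open
    # connections run to completion before the process exits.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, drain)
server.serve_forever()  # returns once drain() has been triggered
```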

Resilience

Cloud native, telco, and data intensive systems achieve resilience at the higher layers through the management of less reliable lower-level components [7][8][9], using techniques such as load balancing, corrective protocols [10], and replication. Data intensive applications also have strong provisions for keeping the system in a correct state while allowing for safe retries, by using ACID consistency and isolation levels.

Scalability

Scalability can be described as a system's method for coping with growth. The cloud native scalability strategy strongly encourages scaling out over scaling up. Telecommunications has a parallel in the Clos data center architecture [11]. Data intensive applications use replication and partitioning to scale out data while maintaining performance.

Latency and Guarantees

We often describe a system as 'real time' when it has low latency requirements. A rigorous description of a real-time system will describe real-time requests as having some kind of timing guarantee. A system's real-time requests also have a relationship to what is considered a failure within that system. With those ideas in mind, a hard real-time system is one that considers any missed deadline a system failure. A firm real-time system allows deadlines to be missed: the system can tolerate some non-zero number of tasks that fail to complete before a deadline, so long as this happens rarely. A soft real-time system can handle many missed deadlines, because the requests still have some value if they are completed after the deadline. Most systems casually described as real time are actually firm or soft real-time systems.

When describing soft real-time systems it is common to refer to a system's response times or latency in terms of the average time measured. This can be misleading, because we then overlook the slower response times of a subset of the tasks. When we use percentiles, we can reason about both the typical time (the 50th percentile, or median) and the slower times (the 99th percentile).
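A short sketch of the point, with made-up latency samples: the mean hides the tail that percentiles expose.

```python
# Summarizing measured response times by percentile rather than by mean.
# The latency samples below are illustrative, not real measurements.
import statistics

latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 250]  # one slow outlier

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(f"mean={mean:.1f} ms  p50={p50:.1f} ms  p99={p99:.1f} ms")
# The mean (~29 ms) overstates the typical request (p50 = 4.5 ms) and
# understates the tail (p99 > 200 ms), which is what matters for guarantees.
```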

Low Latency

What does it mean to be low latency [12] with respect to a data intensive system? Data can be structured and stored in many ways, but the correctness of the data with respect to read and write skew has a direct effect on how low the latency can get. Some data structures allow for faster writes and reads, but generally speaking, if you require ACID consistency and isolation (or more specifically, serializability), you'll take a performance hit on latency. Still, some of the fastest, highly available, strongly consistent (serializable isolation) databases can write with 10 ms latency [13][14].

Interoperability

Interoperability in cloud native systems comes in many forms, but it is exemplified by API endpoints implemented as microservices that are deployed and orchestrated as coarse-grained binaries. This allows different organizations to have separate rates of change in their deployments while continuing to depend on one another [15][16]. Telecommunications has a strong relationship with interoperability via the OSI model layers [17], which successfully allowed for innovation by separating network concerns into layers, but it also has a troubled history with some protocols and implementations, such as SNMP [18]. Within data intensive applications, the structure of the data and its clients drives forward and backward compatibility, which influences how the system performs its own upgrades [19].

Determining Functional Requirements of a Stateful CNF

When determining the functional requirements and building the use cases and user stories for any system that has non-trivial persistent data, one must ask specific questions about the data requirements. These questions can tease out the tradeoffs that will need to be made between data correctness and latency. In order to reason about these questions, it is necessary to have an unambiguous shared vocabulary about correctness that maps to the technologies that address each requirement.

How safe must your system be?

Safety is about stopping something bad from happening (or at least noting that something bad happened and not returning a bad result).  Safety keeps the cows from getting out of the barn.  Liveness (something good happens, eventually) measures how the cows get back into the barn if they get out. Historically, systems have concentrated on safety, or reliability, to make their systems resilient.  Cloud native systems opt for liveness, or anti-fragility[20], where possible. 

Can you tolerate write skew?

Write skew happens when a transaction reads some data, that data is then changed by another process, and the transaction writes data based on a decision made from the original, now-stale read. The isolation level that permits this (read committed) is the default setting in many databases [21], so just turning on transactions in an ACID-compliant RDBMS may not address this problem.
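A hedged sketch of the interleaving, using the classic on-call example: it assumes a PostgreSQL database named oncall with a doctors(name, on_call) table seeded with two on-call doctors, and the default read committed isolation level. Neither commit fails, yet the invariant "at least one doctor on call" is broken.

```python
# Two concurrent transactions make a decision from the same stale read and
# then write to different rows, so no write conflict is ever detected.
import psycopg2

conn_a = psycopg2.connect(dbname="oncall")  # transaction A (Alice's client)
conn_b = psycopg2.connect(dbname="oncall")  # transaction B (Bob's client)
cur_a, cur_b = conn_a.cursor(), conn_b.cursor()

cur_a.execute("SELECT count(*) FROM doctors WHERE on_call")  # A sees 2 on call
cur_b.execute("SELECT count(*) FROM doctors WHERE on_call")  # B also sees 2

# Both conclude it is safe to go off call and update *different* rows.
cur_a.execute("UPDATE doctors SET on_call = false WHERE name = 'alice'")
cur_b.execute("UPDATE doctors SET on_call = false WHERE name = 'bob'")

conn_a.commit()
conn_b.commit()  # both commits succeed; nobody is left on call
```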

How do you address write skew?

If there is write skew, can you tolerate a solution where the last write wins or the write is merged [22][23][24] with another write?

Are there latency sensitive parts of the system where the updates can be applied in any order, such as incrementing a counter? [25]
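For example, an increment is commutative, so replicas or concurrent clients can apply the same updates in any order and still converge. A minimal sketch, assuming a local Redis server and an illustrative key name:

```python
# Order-independent updates: an atomic counter increment.
import redis

r = redis.Redis(host="localhost", port=6379)

# INCR is commutative: three increments produce the same total no matter
# which client (or replica) applies them first.
for _ in range(3):
    r.incr("cnf:events:processed")

print(r.get("cnf:events:processed"))  # b'3' on a fresh key
```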

Are there other parts of the system where the data is causally ordered, that is, where the premise of an insert must be true in order for the insert to occur, but where the ordering does not have to be temporal? If this is the case, some performance increases may be possible. [26]

Do you require strong consistency?

There are some problems that require strong consistency, such as enforcing uniqueness (e.g. usernames) or preventing double spending. Do you have these kinds of challenges?
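Uniqueness is the simplest of these to delegate to the database: a unique constraint makes a single authority arbitrate concurrent registrations. A minimal sketch, assuming a PostgreSQL database named accounts:

```python
# Enforcing username uniqueness with a primary key constraint.
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect(dbname="accounts")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (username text PRIMARY KEY)")
conn.commit()

def register(username: str) -> bool:
    try:
        cur.execute("INSERT INTO users (username) VALUES (%s)", (username,))
        conn.commit()
        return True
    except errors.UniqueViolation:
        conn.rollback()  # a concurrent registration won; pick another name
        return False

print(register("wavell"))  # True the first time
print(register("wavell"))  # False: the constraint is enforced atomically
```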

Is your data write heavy?

Is 90% of the activity against your dataset writes? Is your data read heavy instead? Or is the activity split evenly between writes and reads?

How fast should your responses be?

Some of the questions of correctness have historically been glossed over because the latency requirements were not very challenging. In a data intensive system, the latency requirements can be down to 1 ms or below. This, combined with resilience requirements, pushes us to find nuanced answers to problems that we historically solved with a simple DBMS. That being said, do you have 1 ms response time requirements? What latency requirements do you have at the 99th percentile? Do you have any hard real-time guarantees?

Will the users be distributed geographically?  

If so, there are techniques that move the data closer to the user to reduce latency.

How secure must your system be? 

Can you tolerate trusting a compromised node?

Data Profiles of Interest

When reasoning about the functional requirement questions about data intensive applications, it can be useful to group your system into a profile based on correctness, similar to the intuitions found in the CAP theorem[27] but with the nuances previously mentioned.

High Availability, Low Correctness, Low Latency

Within telecommunications, there is a high standard for reliability. The idea is that because a life threatening situation occurs when 911 systems go down, telecommunication lines must be as reliable (highly available) as possible. Even so, the analog lines that carry 911 calls are lossy (low correctness). An intrusion detection system (IDS) such as Snort that monitors a network doesn't need transactions (low correctness) but still needs to be highly available and preferably to have zero impact on the network (low latency integration).

High Availability, Low Correctness, High Latency

The search implemented in a system is often an entry point to the system and fundamental to other components, so it needs to be robust (high availability). The indexing for search does not, however, require transactions (low correctness), and the search response times can be more than 10 ms while retaining usefulness.

High availability, High Correctness, High Latency

Traditional online transaction processing (OLTP) systems use highly available (high availability) relational DBMSs (high correctness) to serve data requests. These requests were not expected to complete in 1 ms or less (high latency). Leader election and cluster configuration require linearizability (high correctness) and fall into this category as well.

Low Availability, High Correctness, Low latency

There are various problems that require more correctness than last write wins, but less correctness than linearizability and serializable isolation. Leaderboards, some of which issue rewards, require a running tally of increments (high correctness). The writes to the board could very well be data intensive (low latency), but the board can still be useful with much less than five nines of availability.

High Availability, High Correctness, Low Latency

The holy grail of profiles is high availability, high correctness, and low latency. For the weaker forms of correctness (read committed isolation, data structures that tolerate automatic conflict resolution and merging, data whose updates can be applied in a partial order), this profile may become possible. Given serializable snapshot isolation and linearizability, both of which are required for the highest level of correctness for some problems (e.g. preventing double spending) in the distributed environments that are implied by high availability, this profile approaches the impossible. That is to say, a 50th percentile latency of 1 ms will not be achieved together with high availability (one of the higher 'nines', such as five nines) and these high correctness levels. Hard real-time guarantees of 1 ms are even less plausible.

Collaboration tools have nuanced, partial ordering (high correctness) requirements. They are expected to be online at all times (high availability) and extremely responsive (low latency), sometimes down to the keystroke. These tools can fit into this category. Fundamental networking components that many other components depend on, such as charging [28] and session management, fall into this category as well.

Stateful Implementation

Stateful implementation in a cloud native solution can be separated into the storage solution and the database solution.  Both of these need to solve problems of leader election behind the scenes, but database solutions address more fine grained correctness guarantees.

Cloud Native Data Storage Implementation

Cloud native data storage generally, and Kubernetes data storage specifically, focuses on techniques that scale out with more nodes instead of scaling up with faster nodes. This is not to say that scaling up does not have its place, but software that takes advantage of node elasticity is a strong expectation in the cloud native community. Along with elasticity, tradeoffs that favor resilience and availability are also preferred.

Infrastructure and applications that are stateless or reduce state as much as possible lend themselves well to cloud native environments.   Even so, statefulness is unavoidable in most useful applications.  With regard to storage at the fundamental level (disk state), cloud native solutions have arisen that take advantage of scale out strategies.  A proper use of elastic volumes is one such solution.  Elastic volume block-level storage uses replication to distribute data over multiple disks, which provides resilience if any node goes down. 

Kubernetes storage solutions have two types of provisioning, static and dynamic. Static provisioning requires the operator to set up the persistent volume manually. Dynamic provisioning requires a driver specific to the storage type, but if the driver is available, Kubernetes handles the setup of volumes. The Rook and Ceph projects provide a way to implement cloud native storage in the Kubernetes environment, using the Paxos consensus algorithm to agree on cluster state.
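A minimal sketch of dynamic provisioning, using the official Kubernetes Python client: the claim below requests storage from a storage class assumed to be named rook-ceph-block (the Rook/Ceph block storage class), so no operator-created PersistentVolume is needed.

```python
# Requesting a dynamically provisioned volume via a PersistentVolumeClaim.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod
core = client.CoreV1Api()

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "cnf-state"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "rook-ceph-block",  # assumed storage class name
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)
```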

Cloud Native Database Implementation

Kubernetes Statefulness

Kubernetes StatefulSets work with elastic volumes to provide a resilient, scale-out solution for the storage involved in a database cluster.

Cloud Native Database Latency

In data intensive systems, sensitive latency requirements are addressed with replication, partitioning, and careful attention to hard drive access. A data intensive system that also has at least minimal availability must, by its nature, be distributed.

Replication Implementation

Replication is the most common way to introduce availability in databases. There are three types of replication: single-leader, multi-leader, and leaderless. An important note about replication: only single-leader replication has the potential for distributed strong consistency (linearizability for leader election with serializable isolation) [29], while multi-leader (not linearizable) [30][31] and leaderless (not linearizable) [32] replication do not. MySQL and PostgreSQL support single-leader replication out of the box, and there are projects such as Tungsten Replicator for MySQL and BDR for PostgreSQL that help with multi-leader replication. Leaderless replication is available in Amazon Dynamo, Riak, Cassandra, and Voldemort. Consensus-based storage, which is similar to single-leader replication, can provide strong distributed consistency (linearizability), but it is designed for coarse-grained data such as cluster configuration and so is not very suitable for fine-grained data. The best-known fault-tolerant consensus algorithms are Paxos (used in Ceph), Raft (used in etcd), and Zab (used in ZooKeeper).

Partitioning Implementation

Partitioning data is useful for reducing latency because it brings the data closer to the user of that data. Some projects, such as Vitess and CockroachDB, do this automatically.

In-memory databases reduce latency by avoiding the cost of serializing data and writing it to a hard drive. They can still provide durability by writing to disk using append-only files (log data structures), which is significantly faster than a regular write. On faster networks, they may use replication to guarantee that copies of the data will survive a crash. VoltDB and MemSQL are in-memory databases with a relational model, while Memcached is an in-memory key-value store. Redis and Couchbase are in-memory NoSQL options with weak durability, while RAMCloud is similar but with strong durability.
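As a concrete example of this durability trade-off, Redis exposes an append-only file whose fsync policy bounds how much data a crash can lose. A minimal sketch, assuming a local Redis instance that permits runtime CONFIG SET:

```python
# Turning on Redis append-only-file durability with a once-per-second fsync.
import redis

r = redis.Redis(host="localhost", port=6379)
r.config_set("appendonly", "yes")        # log every write to the AOF
r.config_set("appendfsync", "everysec")  # bounded data loss, low write latency
print(r.config_get("appendonly"))        # shows the new setting
```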

If a query needs to read large amounts of data, especially scans over a range of data, column-oriented storage may be the solution. Some column-oriented data solutions are Cassandra, HBase, Druid, Hypertable, and MariaDB.

Correctness Implementation

Traditionally, DBMSs have offered ACID compliance, and yet some configurations of these DBMSs are not suitable for minimally available use with high levels of safety. This makes it important to have a nuanced view of what correctness means in relation to our data requirements.

Isolation Implementation

One of the problems when thinking about transactions and ACID compliance is the ambiguity[33] of the terms consistency and isolation. 

Weak isolation levels [34] such as read committed, snapshot isolation, and repeatable read [35] have serious issues [36] that have caused the loss of money and data corruption. These isolation levels have contributed to the ambiguity of the terms consistency and isolation.

Serializability

When it comes to correctness, serializability (e.g. serializable snapshot isolation) is the gold standard. This is different from snapshot isolation [37], which is often the default on many DBMSs. There are many examples of data that require serializable isolation, such as scheduling, gaming, uniqueness, and double spending. [38] The approach of executing transactions serially in a single thread on one node is implemented in VoltDB/H-Store, Redis, and Datomic. Serializability using two-phase locking is used by the serializable isolation level in MySQL (InnoDB). Both PostgreSQL and FoundationDB have a serializable snapshot isolation configuration. CockroachDB uses a technique called parallel commits to achieve serializability.
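Returning to the earlier on-call sketch, serializable isolation turns the silent write skew into an explicit conflict: PostgreSQL's SSI aborts one of the transactions with a serialization failure, and the application retries. A hedged sketch using the same assumed database and table:

```python
# Retrying a serializable transaction when SSI detects a conflict.
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

conn = psycopg2.connect(dbname="oncall")
conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)

def go_off_call(name: str, retries: int = 5) -> bool:
    for _ in range(retries):
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT count(*) FROM doctors WHERE on_call")
                (on_call,) = cur.fetchone()
                if on_call < 2:
                    conn.rollback()
                    return False  # would leave nobody on call
                cur.execute(
                    "UPDATE doctors SET on_call = false WHERE name = %s", (name,)
                )
            conn.commit()
            return True
        except errors.SerializationFailure:
            conn.rollback()  # SSI aborted this transaction; safe to retry
    return False
```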

Linearizability

In the CAP theorem, consistency is defined as linearizability. Linearizability [39] guarantees that the system treats data as having exactly one copy, regardless of the implementation. Once a write happens to that data, all reads will see what was written. Consensus algorithms get several nodes to agree on something, which is important for leader election and atomic commits. Consensus algorithms need to be linearizable in order to avoid the split brain problem [40]. etcd is one solution that uses the Raft algorithm to handle configuration data, while ZooKeeper uses the Zab algorithm and is good for locking and leader election. Both of these systems are linearizable.
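A minimal sketch of the kind of coarse-grained coordination these stores are built for, using the python-etcd3 client against a local etcd; the lock name and TTL are illustrative:

```python
# A cluster-wide lock backed by etcd's linearizable, Raft-replicated store.
import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)

# Only one holder at a time across the cluster; the lease TTL bounds how long
# a crashed holder can block others, avoiding a stuck or split-brained leader.
with etcd.lock("cnf-scheduler-leader", ttl=10):
    print("acting as leader: doing coordinated work")
# Releasing the lock (or letting the lease expire) lets another node take over.
```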

Merging Data

Automatic conflict resolution works well for data that can be safely merged, with the classic example being a shopping cart. Some technologies for implementing automatic conflict resolution are CRDTs (Riak 2.0), mergeable persistent data structures (similar to Git three-way merges), and operational transformation (implemented in Etherpad).
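A minimal, illustrative sketch of merge-based resolution: two replicas of a cart diverge, and because the merge is a set union, the replicas converge to the same cart no matter which change arrives first. (A real CRDT such as an OR-set also handles removals correctly; this grow-only version only shows the idea.)

```python
# Merging divergent replicas of a shopping cart by set union (grow-only).
replica_a = {"toaster", "kettle"}   # item added on replica A
replica_b = {"kettle", "teapot"}    # item added concurrently on replica B

merged_ab = replica_a | replica_b   # A's view after receiving B's changes
merged_ba = replica_b | replica_a   # B's view after receiving A's changes
assert merged_ab == merged_ba == {"toaster", "kettle", "teapot"}
```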

Conclusion

In surveying the requirements of stateful CNFs, we have recognized availability, resilience, scalability, and interoperability as necessary non-functional requirements for the architecture. Cloud native, telecommunications, and data intensive applications place lower bounds on these requirements for one another. For a solution to be cloud native, it must achieve a minimal level of availability coupled with a scale-out approach to resilience, which will influence the solutions for the other domains. Telecommunications has strong reliability and low latency requirements, oftentimes approaching 1 ms, where higher latency translates into monetary loss. Both cloud native and telecommunications are venturing into more nontrivial uses of data, which is bounded by correctness. Given the limits of true serializable transactions, even when we move beyond the limits of the CAP theorem, correctness bounds the lower levels of latency that we can achieve. The combination of linearizable consensus for leader election, single-leader replication, partitioning, and serializable snapshot isolation seems to be the most correct, lowest latency, and most flexible option that spans all of the requirements of the three domains. This achieves single digit millisecond latency at the 50th percentile for smaller writes. We can get to the 1 ms range, but not without relaxing some of the correctness, which is appropriate for mergeable or partially ordered datasets.


ABOUT THE AUTHOR

W. Watson - Principal Consultant, Vulk Coop

 

W. Watson has been professionally developing software for 25 years. He has spent numerous years studying game theory and other business expertise in pursuit of the perfect organizational structure for software co-operatives. He also founded the Austin Software Cooperatives meetup group (https://www.meetup.com/Austin-Software-Co-operatives/) and Vulk Coop (https://www.vulk.coop) as an alternative way to work on software as a group. He has a diverse background that includes service in the Marine Corps as a computer programmer, and software development in numerous industries including defense, medical, education, and insurance. He has spent the last couple of years developing complementary cloud native systems such as the cncf.ci dashboard. He currently works on the Cloud Native Network Function (CNF) Test Suite (https://github.com/cncf/cnf-testsuite), the CNF Testbed (https://github.com/cncf/cnf-testbed), and the cloud native networking principles (https://networking.cloud-native-principles.org/) initiatives. Recent speaking experiences include ONS NA, KubeCon NA 2019, and Open Source Summit 2020.



[1] https://networking.cloud-native-principles.org/cloud-native-principles

[2] https://networking.cloud-native-principles.org/cloud-native-networking-preamble

[3] We call an application data-intensive if data is its primary challenge-the quantity of data, the complexity of data, or the speed at which it is changing-as opposed to compute-intensive, where CPU cycles are the bottleneck. Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[4] https://www.infoq.com/articles/kubernetes-stateful-applications/

[5] Phoenix replacement is the natural progression from blue-green using dynamic infrastructure. Rather than keeping an idle instance around between changes, a new instance can be created each time a change is needed. As with blue-green, the change is tested on the new instance before putting it into use. The previous instance can be kept up for a short time, until the new instance has been proven in use. But then the previous instance is destroyed. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5694-5697). O'Reilly Media. Kindle Edition.

[6] https://en.wikipedia.org/wiki/High_availability#Percentage_calculation

[7] https://vmblog.com/archive/2021/09/15/cloud-native-chaos-and-telcos-enforcing-reliability-and-availability-for-telcos.aspx

[ 8] "intuitively it may seem like a system can only be as reliable as its least reliable component (its weakest link). This is not the case: in fact, it is an old idea in computing to construct a more reliable system from a less reliable underlying base" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[9] "When operators buy large spine switches in an effort to reduce the number of boxes they have, they also automatically increase the blast radius of a failure. This is often not factored into the thinking of the operator. With a spine switch of 512 ports, you can build a two-tier Clos topology of 512 racks. But now the effect of a failure is much larger. If instead you broke this up into a three-tier Clos network with eight pods of 64 racks each, you get a far more reliable network." Dutt, Dinesh G.. Cloud Native Data Center Networking (Kindle Locations 5269-5272). O'Reilly Media. Kindle Edition.

[10] Often, transport layer protocols provide reliability, which refers to complete and correct data transfer between end systems. Reliability can be achieved through mechanisms for end-to-end error detection, retransmissions, and flow control. Serpanos, Dimitrios,Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 12). Elsevier Science. Kindle Edition.

[11] "The use of the Clos topology encourages a scale-out model in which the core is kept from having to perform complex tasks and carry a lot of state. It is far more scalable to increase scale by pushing complex functionality to the leaves, including into the servers, where the state is distributed. Because the state is distributed among many boxes, the failure of one box has a much smaller blast radius than a single large box that carries a lot of state." Dutt, Dinesh G.. Cloud Native Data Center Networking (Kindle Locations 5266-5269). O'Reilly Media. Kindle Edition.

[12] "Clients connect to Redis using a TCP/IP connection or a Unix domain connection. The typical latency of a 1 Gbit/s network is about 200 us, while the latency with a Unix domain socket can be as low as 30 us. It actually depends on your network and system hardware. On top of the communication itself, the system adds some more latency (due to thread scheduling, CPU caches, NUMA placement, etc ...). System induced latencies are significantly higher on a virtualized environment than on a physical machine." https://redis.io/docs/reference/optimization/latency/

[13] "As a globally-distributed database, Spanner provides several interesting features. First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance)." https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf

[14] https://www.cockroachlabs.com/docs/stable/performance.html#performance-limitations

[15] Containerized services works by packaging applications and services in lightweight containers (as popularized by Docker). This reduces coupling between server configuration and the things that run on the servers. So host servers tend to be very simple, with a lower rate of change. One of the other change management models still needs to be applied to these hosts, but their implementation becomes much simpler and easier to maintain. Most effort and attention goes into packaging, testing, distributing, and orchestrating the services and applications, but this follows something similar to the immutable infrastructure model, which again is simpler than managing the configuration of full-blown virtual machines and servers. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1617-1621). O'Reilly Media. Kindle Edition.

[16] "The value of a containerization system is that it provides a standard format for container images and tools for building, distributing, and running those images. Before Docker, teams could isolate running processes using the same operating system features, but Docker and similar tools make the process much simpler." Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1631-1633). O'Reilly Media. Kindle Edition.

[17] An important property of the OSI Reference Model is that it enables standardization of the protocols used in the protocol stacks, leading to the specification of interfaces between layers. Furthermore, an important feature of the model is the distinction it makes between specification (layers) and implementation (protocols), thus leading to openness and flexibility. Openness is the ability to develop new protocols for a particular layer and independently of other layers as network technologies evolve. Openness enables competition, leading to low-cost products. Flexibility is the ability to combine different protocols in stacks, enabling the interchange of protocols in stacks as necessary. Serpanos, Dimitrios,Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 14). Elsevier Science. Kindle Edition.

[18] "SNMP implementations vary across platform vendors. In some cases, SNMP is an added feature, and is not taken seriously enough to be an element of the core design. Some major equipment vendors tend to over-extend their proprietary command line interface (CLI) centric configuration and control systems." https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol#Implementation_issues

[19] "When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously: With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability. With client-side applications you're at the mercy of the user, who may not install the update for some time." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[20] "Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better" Taleb, Nassim Nicholas. Antifragile (Incerto) (p. 3). Random House Publishing Group. Kindle Edition.

[21] Read committed is a very popular isolation level. It is the default setting in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many other databases [ 8]. Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[22] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: "A Comprehensive Study of Convergent and Commutative Replicated Data Types," INRIA Research Report no. 7506, January 2011.

[23] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: "Mergeable Persistent Data Structures," at 26es Journées Francophones des Langages Applicatifs (JFLA), January 2015.

[24] Chengzheng Sun and Clarence Ellis: "Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements," at ACM Conference on Computer Supported Cooperative Work (CSCW), November 1998.

Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[25] "Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[26] "... systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[27] Seth Gilbert and Nancy Lynch: "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, volume 33, number 2, pages 51-59, June 2002. doi:10.1145/564585.564601

[28] "The Service Provider needs to be able to take action in real-time (ultra low latency: less than a millisecond end-to-end) for whether to allow the device to use a network resource or not" https://github.com/cncf/cnf-wg/tree/main/use-case/0003-UC-stateful-cnf

[29] "Single-leader replication (potentially linearizable) In a system with single-leader replication (see "Leaders and Followers"), the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs [10]. Using the leader for reads relies on the assumption that you know for sure who the leader is. As discussed in "The Truth Is Defined by the Majority", it is quite possible for a node to think that it is the leader, when in fact it is not-and if the delusional leader continues to serve requests, it is likely to violate linearizability [20]. With asynchronous replication, failover may even lose committed writes (see "Handling Node Outages"), which violates both durability and linearizability." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[30] "Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved (indicated as "conflict resolution" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[31] "The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required. For example, consider a wiki page that is simultaneously being edited by two users, as shown in Figure 5-7. User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user's change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected [33]. This problem does not occur in a single-leader database." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[32] "Leaderless replication (probably not linearizable) For systems with leaderless replication (Dynamo-style; see "Leaderless Replication"), people sometimes claim that you can obtain "strong consistency" by requiring quorum reads and writes (w + r > n). Depending on the exact configuration of the quorums, and depending on how you define strong consistency, this is not quite true. "Last write wins" conflict resolution methods based on time-of-day clocks (e.g., in Cassandra; see "Relying on Synchronized Clocks") are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. Sloppy quorums ("Sloppy Quorums and Hinted Handoff") also ruin any chance of linearizability. Even with strict quorums, nonlinearizable behavior is possible, as demonstrated in the next section." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[33] "... see, there is a lot of ambiguity around the meaning of isolation [ 8]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be "ACID compliant," it's unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[34] Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: "HAT, not CAP: Towards Highly Available Transactions," at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.

[35] "Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable, and in PostgreSQL and MySQL it is called repeatable read [23]. The reason for this naming confusion is that the SQL standard doesn't have the concept of snapshot isolation, because the standard is based on System R's 1975 definition of isolation levels [2] and snapshot isolation hadn't yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL and MySQL call their snapshot isolation level repeatable read because it meets the requirements of the standard, and so they can claim standards compliance." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[36] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: "Automating the Detection of Snapshot Isolation Anomalies," at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[37] Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable, and in PostgreSQL and MySQL it is called repeatable read [23]. The reason for this naming confusion is that the SQL standard doesn't have the concept of snapshot isolation, because the standard is based on System R's 1975 definition of isolation levels [2] and snapshot isolation hadn't yet been invented then. Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[38] "Preventing double-spending: A service that allows users to spend money or points needs to check that a user doesn't spend more than they have. You might implement this by inserting a tentative spending item into a user's account, listing all the items in the account, and checking that the sum is positive [44]. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[39] Maurice P. Herlihy and Jeannette M. Wing: "Linearizability: A Correctness Condition for Concurrent Objects," ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, number 3, pages 463-492, July 1990. doi:10.1145/78969.78972

[40] Mike Burrows: "The Chubby Lock Service for Loosely-Coupled Distributed Systems," at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.
