Stateful CNFs


"Database" by christophe.benoit74 is licensed under CC BY 2.0.


Towards the implementation of stateful cloud native network functions using the cloud native ecosystem 

Stateful CNFs lie at the intersection of cloud native [1], telecommunications [2], and data intensive systems [3].  Stateful applications have matured to the point of being first class citizens in the traditionally stateless architecture of Kubernetes.[4]  Although these three domains share many non-functional requirements, deploying a stateful network application shifts the requirements in the direction of lower latency.  Simultaneously achieving high levels of consistency, availability, and partition tolerance has traditionally been thought impossible, but there are new techniques that may help to serve low latency networking requirements.

Stateful CNF Non-Functional Requirements

The shared non-functional requirements of a stateful CNF are availability, resilience, scalability, and interoperability.


The lower bound for availability in a cloud native system is a system that has no planned maintenance windows.  This is implemented using phoenix deploys [5]: when a new deployment is pushed, each routable endpoint that has connections is allowed to continue processing until its current task is finished, while new requests are routed to the newly deployed containers, components, VMs, etc.  If a system needs planned maintenance windows, we can safely assume that the architecture does not provide even this minimal availability guarantee.  A high level of availability, measured by the ‘nines' nomenclature, would be something comparable to 5 nines [6].


Cloud native, telco, and data intensive systems achieve resilience at the higher layers through management of robust lower level components [7][8][9] such as load balancing, corrective protocols [10], and replication.  Data intensive applications also have strong provisions for keeping the system in a correct state while allowing for safe retries by using ACID consistency and isolation levels.


Scalability can be described as a system's method for coping with growth.  The cloud native scalability strategy strongly encourages scaling out versus scaling up.  Telecommunications has a close parallel in the Clos data center architecture.[11]  Data intensive applications use replication and partitioning to scale out data while maintaining performance.

Latency and Guarantees

We often describe a system as ‘real time' when it has low latency requirements.  A rigorous description of a real time system specifies some kind of timing guarantee for real time requests, along with a definition of what counts as a failure when that guarantee is missed.  With those ideas in mind, a hard real time system is one that considers any missed deadline a system failure.  A firm real time system allows deadlines to be missed, as the system can handle some non-zero number of tasks that fail to complete before a deadline, so long as this happens rarely.  A soft real time system can handle many missed deadlines, as the results still have some value if they arrive after the deadline.  Most systems casually described as real time are firm or soft real time systems.

When describing soft real time systems, it is common to refer to a system's response times or latency in terms of the measured average.  This can be misleading because we then overlook the slower response times of a subset of the tasks.  When we use percentiles, we can reason about both the typical time (the 50th percentile, or median) and the slower times (the 99th percentile).
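As a minimal sketch with made-up latency samples, percentiles surface the tail behavior that an average hides:

```python
import math

def percentile(samples, pct):
    """Return the pct-th percentile (nearest-rank method) of a list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds; two slow outliers.
latencies_ms = [12, 9, 11, 10, 250, 10, 13, 9, 11, 300]

mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# The mean (63.5 ms) suggests a mediocre system; the median (11 ms) shows the
# typical request is fast, and the p99 (300 ms) exposes the slow tail.
print(f"mean={mean} ms  p50={p50} ms  p99={p99} ms")
```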

Low Latency

What does it mean to be low latency [12] with respect to a data intensive system?  Data can be structured and stored in many ways, but the correctness of the data with respect to read and write skew has a direct effect on how low the latency can get.  Some data structures allow for faster writes and reads, but generally speaking, if you require ACID consistency and isolation (or more specifically, serializability), you'll take a performance hit on latency.  Still, some of the fastest highly available, strongly consistent (serializable isolation) databases can write with 10 ms latency[13][14].


Interoperability in cloud native systems comes in many forms, but is exemplified by API endpoints implemented as microservices that are deployed and orchestrated as coarse grained binaries.  This allows different organizations to have separate rates of change in their deployments while continuing to depend on one another[15][16].  Telecommunications has a strong relationship with interoperability via the OSI model layers[17], which successfully allowed for innovation by separating network concerns into layers, but it also has a troubled history with some protocols and implementations, such as SNMP[18].  Within data intensive applications, the structure of the data and its clients drives forward and backward compatibility, which influences how the system performs its own upgrades[19].

Determining Functional Requirements of a Stateful CNF

When determining the functional requirements and building use cases and user stories for any system that has non-trivial persistent data, one must ask specific questions about the data requirements.  These questions can tease out the tradeoffs that will need to be made between data correctness and latency.  In order to reason about these questions, it is necessary to have an unambiguous shared vocabulary about correctness that maps to the technologies that address each requirement.

How safe must your system be?

Safety is about stopping something bad from happening (or at least noting that something bad happened and not returning a bad result).  Safety keeps the cows from getting out of the barn.  Liveness (something good happens, eventually) measures how the cows get back into the barn if they do get out.  Historically, systems have concentrated on safety, or reliability, to achieve resilience.  Cloud native systems opt for liveness, or anti-fragility[20], where possible.

Can you tolerate write skew?

Write skew happens when a transaction reads some data, that data is then changed by another process, and the transaction writes data based on a decision made from the original, now-stale read.  The isolation level that permits this (read committed) has been the default setting in many databases,[21] so just turning on transactions in an ACID compliant RDBMS may not address this problem.
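A minimal sketch of the anomaly, using Kleppmann's on-call doctors example with hypothetical data: each "transaction" checks the invariant (at least one doctor stays on call) against the snapshot it read, and both writes go through even though together they violate it:

```python
# Shared state: two doctors are currently on call.
on_call = {"alice": True, "bob": True}

def request_leave(doctor, snapshot):
    """One 'transaction': decide from a snapshot, then write to shared state."""
    if sum(snapshot.values()) >= 2:   # invariant check against the stale snapshot
        on_call[doctor] = False       # write based on a now-invalid premise

# Both transactions read before either one writes.
snapshot_a = dict(on_call)
snapshot_b = dict(on_call)
request_leave("alice", snapshot_a)
request_leave("bob", snapshot_b)

# Each transaction alone was valid, but together they leave nobody on call.
print(sum(on_call.values()))  # 0
```

A serializable database would abort one of the two transactions instead of letting both commit.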

How do you address write skew?

If there is write skew, can you tolerate a solution where the last write wins or the write is merged[22][23][24] with another write?

Are there latency sensitive parts of the system where the updates can be applied in any order, such as incrementation?[25]

Are there other parts of the system where the data is causally ordered, that is, where the premise of an insert must be true in order for the insert to occur, but the ordering need not be temporal?  If this is the case, some performance increases may be possible.[26]

Do you require strong consistency?

There are some problems that require strong consistency, such as enforcing uniqueness (e.g. usernames) or preventing a double spending problem.  Do you have these kinds of challenges?

Is your data write heavy?

Is 90% of the activity in your dataset writes?  Is your data read heavy?  Or is your data activity split 50/50 between writes and reads?

How fast should your responses be?

Some of the questions of correctness have historically been glossed over because the latency requirements were not very challenging.  In a data intensive system, the latency requirements can be down to 1 ms or below.  This, combined with resilience requirements, pushes us to find nuanced answers to problems that we historically solved with a simple DBMS.  That being said, do you have 1 ms latency response time requirements?  What latency requirements do you have for the 99th percentile?  Do you have any hard real time guarantees?

Will the users be distributed geographically?  

If so, there are techniques that move the data closer to the user to reduce latency.

How secure must your system be? 

Can you tolerate trusting a compromised node?

Data Profiles of Interest

When reasoning about the functional requirement questions about data intensive applications, it can be useful to group your system into a profile based on correctness, similar to the intuitions found in the CAP theorem[27] but with the nuances previously mentioned.

High Availability, Low Correctness, Low Latency

Within telecommunications, there is a high standard for reliability.  The idea is that since a life threatening situation occurs when the 911 systems go down, telecommunication lines must be as reliable (highly available) as possible.  Even so, the analog lines that carry 911 calls are lossy (low correctness).  An intrusion detection system (IDS) such as Snort that monitors a network doesn't need transactions (low correctness) but still needs to be highly available and preferably to have zero impact on the network (low latency integration).

High Availability, Low Correctness, High Latency

The search implemented in a system is often an entry point to the system and fundamental to other components, so it needs to be robust (high availability).  The indexing for search does not, however, require transactions (low correctness) and the search response times can be more than 10 ms while retaining usefulness.

High Availability, High Correctness, High Latency

Traditional online transaction processing (OLTP) systems use highly available (HA) relational DBMSs (high correctness) to serve data requests.  These requests were not expected to complete in 1 ms or less (high latency).  Leadership election and cluster configuration require linearizability (high correctness) and fall under this category as well.

Low Availability, High Correctness, Low Latency

There are various problems that require more correctness than last write wins, but less correctness than linearizability and serializable isolation.  Leaderboards, some of which issue rewards, require a running tally of increments (high correctness).  The writes to the board could very well be latency sensitive (low latency), but the board can still be useful with much less than 5 nines of availability.

High Availability, High Correctness, Low Latency

The holy grail of profiles is high availability, high correctness, and low latency.  For the weaker forms of correctness (read committed isolation, data structures that tolerate automatic conflict resolution/merging, data that can be applied in a partial order), this profile may be achievable.  Given serializable snapshot isolation and linearizability, both of which are required for the highest level of correctness for some problems (e.g. preventing double spending) in the distributed environments that high availability implies, this profile approaches the impossible.  That is to say, a 50th percentile 1 ms latency will not be achieved together with high availability (one of the higher ‘nines', such as 5 nines) and these high correctness levels.  Hard real time guarantees of 1 ms are even less plausible.

Collaboration tools have nuanced, partial ordering (high correctness) requirements.  They are expected to be online at all times (high availability) and extremely responsive (low latency), sometimes down to the keystroke.  These tools fit into this category.  Traditional fundamental networking components that many other components depend on, such as charging[28] and session management, fall into this category as well.

Stateful Implementation

Stateful implementation in a cloud native solution can be separated into the storage solution and the database solution.  Both of these need to solve problems of leader election behind the scenes, but database solutions address more fine grained correctness guarantees.

Cloud Native Data Storage Implementation

Cloud native data storage generally, and Kubernetes data storage specifically, focuses on techniques that scale out with more nodes instead of scaling up with faster nodes.  This is not to say that scaling up does not have its place, but software that takes advantage of node elasticity is a strong expectation in the cloud native community.  Along with elasticity, tradeoffs that favor resilience and availability are also preferred.

Infrastructure and applications that are stateless or reduce state as much as possible lend themselves well to cloud native environments.   Even so, statefulness is unavoidable in most useful applications.  With regard to storage at the fundamental level (disk state), cloud native solutions have arisen that take advantage of scale out strategies.  A proper use of elastic volumes is one such solution.  Elastic volume block-level storage uses replication to distribute data over multiple disks, which provides resilience if any node goes down. 

Kubernetes storage solutions have two types of provisioning: static and dynamic.  Static provisioning requires the operator to set up the persistent volume manually.  Dynamic provisioning requires a driver specific to the storage type, but if the driver is available, Kubernetes handles the setup of volumes.  The Rook and Ceph projects provide a way to implement cloud native storage in the Kubernetes environment, using the Paxos consensus algorithm to keep the cluster state consistent.

Cloud Native Database Implementation

Kubernetes Statefulness

Kubernetes StatefulSets work with elastic volumes to provide a resilient, scale out solution for the storage involved with a database cluster.

Cloud Native Database Latency

In data intensive systems with at least minimal availability, sensitive latency requirements are addressed with replication, partitioning, and careful attention to hard drive access.  A system with at least minimal availability must, by nature, be distributed.

Replication Implementation

Replication is the most common way to introduce availability in databases.  The replication types are single-leader, multi-leader, and leaderless replication.  An important note about replication: only single-leader replication has the potential for distributed strong consistency (linearizability for leader election with serializable isolation)[29], while multi-leader (not linearizable)[30][31] and leaderless (not linearizable)[32] replication do not.  MySQL and PostgreSQL support single-leader replication out of the box, and there are projects such as Tungsten Replicator for MySQL and BDR for PostgreSQL that help with multi-leader replication.  Leaderless replication is available in Amazon Dynamo, Riak, Cassandra, and Voldemort.  While consensus based storage, which is similar to single-leader replication, can provide strong distributed consistency (linearizability), it is designed for coarse grained data such as cluster configuration, and so is not very suitable for fine grained data.  The best-known fault-tolerant consensus algorithms are Paxos (used in Ceph), Raft (used in etcd), and Zab (used in ZooKeeper).
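For the leaderless style, read and write quorums are configured so that every read set overlaps every write set, i.e. w + r > n.  A tiny sketch of that overlap condition (which, as footnote [32] notes, still does not by itself guarantee linearizability):

```python
def quorums_overlap(n, w, r):
    """With n replicas, a write quorum of w and a read quorum of r are
    guaranteed to share at least one replica exactly when w + r > n,
    so a quorum read sees at least one copy of the latest quorum write."""
    return w + r > n

# Typical Dynamo-style configuration: 3 replicas, write to 2, read from 2.
print(quorums_overlap(3, 2, 2))  # True: every read overlaps every write

# Writing and reading from a single replica each gives no overlap guarantee.
print(quorums_overlap(3, 1, 1))  # False: a read may miss the latest write
```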

Partitioning Implementation

Partitioning data is useful for reducing latency because it brings the data closer to the user of that data.  Some projects, such as Vitess and CockroachDB, do this automatically.

In-memory databases reduce latency by avoiding the cost of serializing data when writing to a hard drive.  They can also provide durability by writing to disk using append-only files in log data structures, which is significantly faster than a regular write.  On faster networks, they may use replication to guarantee that copies of the data will survive a crash.  VoltDB and MemSQL are in-memory databases with a relational model, while Memcached is an in-memory key-value store.  Redis and Couchbase are in-memory NoSQL options with weak durability, while RAMCloud is similar but with strong durability.
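A toy sketch of the append-only technique (the `TinyKV` name and format are made up for illustration, not any real database's log): every write is appended sequentially to a log before updating memory, and recovery simply replays the log:

```python
import json
import os
import tempfile

class TinyKV:
    """In-memory key-value store with an append-only file (AOF) for durability."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):            # recover state by replaying the log
            with open(path) as f:
                for line in f:
                    op = json.loads(line)
                    self.data[op["k"]] = op["v"]

    def set(self, key, value):
        # Sequential append is much cheaper than a random in-place update.
        with open(self.path, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())            # the durability point
        self.data[key] = value              # reads are served from memory

    def get(self, key):
        return self.data.get(key)

path = os.path.join(tempfile.mkdtemp(), "store.aof")
db = TinyKV(path)
db.set("session:42", "active")

recovered = TinyKV(path)                    # simulate a crash and restart
print(recovered.get("session:42"))          # active
```

Real systems add log compaction so the AOF does not grow without bound; this sketch omits that.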

If queries read large amounts of data, especially over a range, column oriented storage may be the solution.  Some column oriented data solutions are Cassandra, HBase, Druid, Hypertable, and MariaDB.

Correctness Implementation

Traditionally, DBMSs have offered ACID compliance, and yet some configurations of these DBMSs are not suitable for minimally available use with high levels of safety.  This makes it important for us to have a nuanced view of what correctness means in relation to our data requirements.

Isolation Implementation

One of the problems when thinking about transactions and ACID compliance is the ambiguity[33] of the terms consistency and isolation. 

Weak isolation levels[34] such as read committed, snapshot isolation, and repeatable read[35] have serious issues[36] that have caused monetary loss and data corruption.  These isolation levels have contributed to the ambiguity of the terms consistency and isolation.


When it comes to correctness, serializability (e.g. serializable snapshot isolation) is the gold standard.  This is different from snapshot isolation[37], which is often the default on many DBMSs.  There are many examples of data that require serializable isolation, such as scheduling, gaming, uniqueness, and double spending.[38]  The approach of executing transactions serially in a single thread on one node is implemented in VoltDB/H-Store, Redis, and Datomic.  Serializability using two-phase locking is used by the serializable isolation level in MySQL (InnoDB).  Both PostgreSQL and FoundationDB have a serializable snapshot isolation configuration.  CockroachDB uses a technique called parallel commits to achieve serializability.
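A minimal sketch of the serial execution approach (a toy illustration, not any particular database's implementation): all transactions funnel through one worker thread, so they run one at a time and the schedule is trivially serializable:

```python
import queue
import threading

# Shared database state: transfers must never overdraw account "a".
accounts = {"a": 100, "b": 0}
txn_queue = queue.Queue()

def worker():
    """Single thread that applies transactions one at a time."""
    while True:
        txn = txn_queue.get()
        if txn is None:                # sentinel: shut down
            break
        txn(accounts)                  # no concurrent access to shared state
        txn_queue.task_done()

def transfer(amount):
    def txn(state):
        # Read and write happen with no interleaving, so the invariant holds.
        if state["a"] >= amount:
            state["a"] -= amount
            state["b"] += amount
    return txn

t = threading.Thread(target=worker)
t.start()
for _ in range(3):
    txn_queue.put(transfer(50))        # only two of the three transfers can succeed
txn_queue.put(None)
t.join()

print(accounts)  # {'a': 0, 'b': 100}
```

The tradeoff is throughput: every transaction must be short, since one slow transaction stalls all others behind it.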


In the CAP theorem, consistency is defined as linearizability.  Linearizability[39] guarantees that the system treats data as if there were exactly one copy, regardless of the implementation.  Once a write happens to that data, all reads will see what was written.  Consensus algorithms get several nodes to agree on something, which is important for leadership election and atomic commits.  Consensus algorithms need to be linearizable in order to avoid the split brain problem[40].  Etcd is one solution that uses the Raft algorithm to handle configuration data, while ZooKeeper uses the Zab algorithm for locking and is good for leader election.  Both of these algorithms are linearizable.

Merging Data

Automatic conflict resolution works well for data that can be safely merged, with the classic example being a shopping cart.  Some technologies for implementing automatic conflict resolution are CRDTs (Riak 2.0), mergeable persistent data structures (similar to Git three-way merges), and operational transformation (implemented within Etherpad).
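A minimal sketch of one such structure, a grow-only counter (G-counter) CRDT: each replica increments only its own entry, and merging takes the element-wise maximum, so concurrent increments are never lost and the merge order does not matter:

```python
# A G-counter is a map from replica id to that replica's local increment count.

def increment(counter, replica_id, n=1):
    """Return a new counter with this replica's slot incremented by n."""
    counter = dict(counter)
    counter[replica_id] = counter.get(replica_id, 0) + n
    return counter

def merge(a, b):
    """Element-wise max: commutative, associative, and idempotent."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}

def value(counter):
    """The counter's value is the sum over all replicas' slots."""
    return sum(counter.values())

r1 = increment({}, "node1")        # node1 records one increment
r2 = increment({}, "node2", 2)     # node2 concurrently records two

assert merge(r1, r2) == merge(r2, r1)   # replicas converge in any merge order
print(value(merge(r1, r2)))             # 3
```

Deletions require a more elaborate structure (e.g. a PN-counter or OR-set), which is why shopping carts are the textbook CRDT example rather than arbitrary data.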


In surveying the requirements of stateful CNFs, we have recognized availability, resilience, scalability, and interoperability as necessary non-functional requirements for the architecture.  Cloud native, telecommunications, and data intensive applications place lower bounds on these requirements for one another.  For a solution to be cloud native, it must achieve a minimal level of availability coupled with a scale out approach to resilience, which will influence the solutions for the other domains.  Telecommunications has strong reliability and low latency requirements, oftentimes approaching 1 ms, where higher latency translates into monetary loss.  Both cloud native and telecommunications are venturing into more nontrivial uses of data, which is bounded by correctness.  Given the limits of true serializable transactions, even when we move beyond the limits of the CAP theorem, correctness bounds the lower levels of latency that we can achieve.  The combination of linearizable consensus for leader election, single-leader replication, partitioning, and serializable snapshot isolation seems to be the most correct, lowest latency, and most flexible option that spans all of the requirements of the three domains.  This achieves single digit millisecond latency at the 50th percentile for smaller writes.  We can get to the 1 ms range, but not without relaxing some of the correctness, which is appropriate for mergeable or partially ordered datasets.



W. Watson - Principal Consultant, Vulk Coop


W. Watson has been professionally developing software for 25 years. He has spent numerous years studying game theory and other business expertise in pursuit of the perfect organizational structure for software co-operatives. He also founded the Austin Software Cooperatives meetup group and Vulk Coop as an alternative way to work on software as a group. He has a diverse background that includes service in the Marine Corps as a computer programmer, and software development in numerous industries including defense, medical, education, and insurance. He has spent the last couple of years developing complementary cloud native systems such as the dashboard. He currently works on the Cloud Native Network Function (CNF) Test Suite, the CNF Testbed, and the cloud native networking principles initiatives. Recent speaking experiences include ONS NA, KubeCon NA 2019, and Open Source Summit 2020.



[3] We call an application data-intensive if data is its primary challenge - the quantity of data, the complexity of data, or the speed at which it is changing - as opposed to compute-intensive, where CPU cycles are the bottleneck. Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media. Kindle Edition.


[5] Phoenix replacement is the natural progression from blue-green using dynamic infrastructure. Rather than keeping an idle instance around between changes, a new instance can be created each time a change is needed. As with blue-green, the change is tested on the new instance before putting it into use. The previous instance can be kept up for a short time, until the new instance has been proven in use. But then the previous instance is destroyed. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 5694-5697). O'Reilly Media. Kindle Edition.

[6]


[8] "intuitively it may seem like a system can only be as reliable as its least reliable component (its weakest link). This is not the case: in fact, it is an old idea in computing to construct a more reliable system from a less reliable underlying base" Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media. Kindle Edition.

[9] "When operators buy large spine switches in an effort to reduce the number of boxes they have, they also automatically increase the blast radius of a failure. This is often not factored into the thinking of the operator. With a spine switch of 512 ports, you can build a two-tier Clos topology of 512 racks. But now the effect of a failure is much larger. If instead you broke this up into a three-tier Clos network with eight pods of 64 racks each, you get a far more reliable network." Dutt, Dinesh G.. Cloud Native Data Center Networking (Kindle Locations 5269-5272). O'Reilly Media. Kindle Edition.

[10] Often, transport layer protocols provide reliability, which refers to complete and correct data transfer between end systems. Reliability can be achieved through mechanisms for end-to-end error detection, retransmissions, and flow control. Serpanos, Dimitrios,Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 12). Elsevier Science. Kindle Edition.

[11] "The use of the Clos topology encourages a scale-out model in which the core is kept from having to perform complex tasks and carry a lot of state. It is far more scalable to increase scale by pushing complex functionality to the leaves, including into the servers, where the state is distributed. Because the state is distributed among many boxes, the failure of one box has a much smaller blast radius than a single large box that carries a lot of state." Dutt, Dinesh G.. Cloud Native Data Center Networking (Kindle Locations 5266-5269). O'Reilly Media. Kindle Edition.

[12] "Clients connect to Redis using a TCP/IP connection or a Unix domain connection. The typical latency of a 1 Gbit/s network is about 200 us, while the latency with a Unix domain socket can be as low as 30 us. It actually depends on your network and system hardware. On top of the communication itself, the system adds some more latency (due to thread scheduling, CPU caches, NUMA placement, etc ...). System induced latencies are significantly higher on a virtualized environment than on a physical machine."

[13] "As a globally-distributed database, Spanner provides several interesting features. First, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance)."


[15] Containerized services works by packaging applications and services in lightweight containers (as popularized by Docker). This reduces coupling between server configuration and the things that run on the servers. So host servers tend to be very simple, with a lower rate of change. One of the other change management models still needs to be applied to these hosts, but their implementation becomes much simpler and easier to maintain. Most effort and attention goes into packaging, testing, distributing, and orchestrating the services and applications, but this follows something similar to the immutable infrastructure model, which again is simpler than managing the configuration of full-blown virtual machines and servers. Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1617-1621). O'Reilly Media. Kindle Edition.

[16] "The value of a containerization system is that it provides a standard format for container images and tools for building, distributing, and running those images. Before Docker, teams could isolate running processes using the same operating system features, but Docker and similar tools make the process much simpler." Morris, Kief. Infrastructure as Code: Managing Servers in the Cloud (Kindle Locations 1631-1633). O'Reilly Media. Kindle Edition.

[17] An important property of the OSI Reference Model is that it enables standardization of the protocols used in the protocol stacks, leading to the specification of interfaces between layers. Furthermore, an important feature of the model is the distinction it makes between specification (layers) and implementation (protocols), thus leading to openness and flexibility. Openness is the ability to develop new protocols for a particular layer and independently of other layers as network technologies evolve. Openness enables competition, leading to low-cost products. Flexibility is the ability to combine different protocols in stacks, enabling the interchange of protocols in stacks as necessary. Serpanos, Dimitrios,Wolf, Tilman. Architecture of Network Systems (The Morgan Kaufmann Series in Computer Architecture and Design) (p. 14). Elsevier Science. Kindle Edition.

[18] "SNMP implementations vary across platform vendors. In some cases, SNMP is an added feature, and is not taken seriously enough to be an element of the core design. Some major equipment vendors tend to over-extend their proprietary command line interface (CLI) centric configuration and control systems."

[19] "When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the application code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously: With server-side applications you may want to perform a rolling upgrade (also known as a staged rollout), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability. With client-side applications you're at the mercy of the user, who may not install the update for some time." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[20] "Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better" Taleb, Nassim Nicholas. Antifragile (Incerto) (p. 3). Random House Publishing Group. Kindle Edition.

[21] Read committed is a very popular isolation level. It is the default setting in Oracle 11g, PostgreSQL, SQL Server 2012, MemSQL, and many other databases [ 8]. Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[22] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: "A Comprehensive Study of Convergent and Commutative Replicated Data Types," INRIA Research Report no. 7506, January 2011.

[23] Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: "Mergeable Persistent Data Structures," at 26es Journées Francophones des Langages Applicatifs (JFLA), January 2015.

[24] Chengzheng Sun and Clarence Ellis: "Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements," at ACM Conference on Computer Supported Cooperative Work (CSCW), November 1998.

Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[25] "Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[26] "... systems that appear to require linearizability in fact only really require causal consistency, which can be implemented more efficiently. Based on this observation, researchers are exploring new kinds of databases that preserve causality, with performance and availability characteristics that are similar to those of eventually consistent systems" Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[27] Seth Gilbert and Nancy Lynch: "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services," ACM SIGACT News, volume 33, number 2, pages 51-59, June 2002. doi:10.1145/564585.564601

[28] "The Service Provider needs to be able to take action in real-time (ultra low latency: less than a millisecond end-to-end) for whether to allow the device to use a network resource or not"

[29] "Single-leader replication (potentially linearizable) In a system with single-leader replication (see "Leaders and Followers"), the leader has the primary copy of the data that is used for writes, and the followers maintain backup copies of the data on other nodes. If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs [10]. Using the leader for reads relies on the assumption that you know for sure who the leader is. As discussed in "The Truth Is Defined by the Majority", it is quite possible for a node to think that it is the leader, when in fact it is not-and if the delusional leader continues to serve requests, it is likely to violate linearizability [20]. With asynchronous replication, failover may even lose committed writes (see "Handling Node Outages"), which violates both durability and linearizability." Kleppmann, Martin. Designing Data-Intensive Applications . O'Reilly Media. Kindle Edition.

[30] "Although multi-leader replication has advantages, it also has a big downside: the same data may be concurrently modified in two different datacenters, and those write conflicts must be resolved (indicated as "conflict resolution") ..." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[31] "The biggest problem with multi-leader replication is that write conflicts can occur, which means that conflict resolution is required. For example, consider a wiki page that is simultaneously being edited by two users, as shown in Figure 5-7. User 1 changes the title of the page from A to B, and user 2 changes the title from A to C at the same time. Each user's change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected [33]. This problem does not occur in a single-leader database." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[32] "Leaderless replication (probably not linearizable) For systems with leaderless replication (Dynamo-style; see "Leaderless Replication"), people sometimes claim that you can obtain "strong consistency" by requiring quorum reads and writes (w + r > n). Depending on the exact configuration of the quorums, and depending on how you define strong consistency, this is not quite true. "Last write wins" conflict resolution methods based on time-of-day clocks (e.g., in Cassandra; see "Relying on Synchronized Clocks") are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew. Sloppy quorums ("Sloppy Quorums and Hinted Handoff") also ruin any chance of linearizability. Even with strict quorums, nonlinearizable behavior is possible, as demonstrated in the next section." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[33] "... see, there is a lot of ambiguity around the meaning of isolation [8]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be "ACID compliant," it's unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[34] Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: "HAT, not CAP: Towards Highly Available Transactions," at 14th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2013.

[35] "Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable, and in PostgreSQL and MySQL it is called repeatable read [23]. The reason for this naming confusion is that the SQL standard doesn't have the concept of snapshot isolation, because the standard is based on System R's 1975 definition of isolation levels [2] and snapshot isolation hadn't yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL and MySQL call their snapshot isolation level repeatable read because it meets the requirements of the standard, and so they can claim standards compliance." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[36] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: "Automating the Detection of Snapshot Isolation Anomalies," at 33rd International Conference on Very Large Data Bases (VLDB), September 2007.

[37] "Snapshot isolation is a useful isolation level, especially for read-only transactions. However, many databases that implement it call it by different names. In Oracle it is called serializable, and in PostgreSQL and MySQL it is called repeatable read [23]. The reason for this naming confusion is that the SQL standard doesn't have the concept of snapshot isolation, because the standard is based on System R's 1975 definition of isolation levels [2] and snapshot isolation hadn't yet been invented then." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[38] "Preventing double-spending: A service that allows users to spend money or points needs to check that a user doesn't spend more than they have. You might implement this by inserting a tentative spending item into a user's account, listing all the items in the account, and checking that the sum is positive [44]. With write skew, it could happen that two spending items are inserted concurrently that together cause the balance to go negative, but that neither transaction notices the other." Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, Kindle Edition.

[39] Maurice P. Herlihy and Jeannette M. Wing: "Linearizability: A Correctness Condition for Concurrent Objects," ACM Transactions on Programming Languages and Systems (TOPLAS), volume 12, number 3, pages 463-492, July 1990. doi:10.1145/78969.78972

[40] Mike Burrows: "The Chubby Lock Service for Loosely-Coupled Distributed Systems," at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

Published Monday, May 16, 2022 7:33 AM by David Marshall