Time and again, big-ticket events cause sites and platforms to crash, whether it's major UK retailers like John Lewis and GAME failing to put the right infrastructure in place to scale fast enough to handle the influx of visitors wanting a PS5, or South Korean broadband networks straining under the popularity of the Netflix show Squid Game.
The Glastonbury Festival ticket site crashed
almost immediately after launching in November, with revellers already fighting
for the few available tickets. Similarly, global ticket sales website Ticketmaster crashed due to high demand for
Taylor Swift concert tickets, and more recently it crashed yet again as
partygoers attempted to buy Eurovision tickets.
Didn't we fix all of this when Covid-19 rapidly accelerated digital transformation across every industry? There's a big difference between migrating to the cloud and making your cloud work for less.
Always online: stretching the bandwidth
Where once the internet was limited to sharing text information, today it's used for everything from video streaming and gaming to data gathering and complex calculations. To complicate matters, the number of people using the internet is also increasing.
There are 8 billion people on Earth right now, and 6.92 billion of them are smartphone users. That means 86% of the world has access to the internet at all times from their pockets, constantly messaging, shopping, streaming, gaming, reading and exploring. And it isn't only consumers who are online more; so are businesses.
Global Covid-19 lockdowns accelerated the world's digital transformation journey. In fact, according to McKinsey, companies digitised many activities 20 to 25 times faster during Covid-19, as they had to adjust processes to allow employees to work remotely where possible. But can the internet handle this rapid growth in active users? The short answer is yes, but that's not what many are experiencing.
Traffic is growing exponentially and the amount of data produced by each individual is increasing rapidly, yet many companies lack the right infrastructure to scale online at pace.
Kubernetes: not quite the golden solution
The ongoing shift to the cloud led to the rise of Kubernetes (or K8s), which describes itself as an open-source system for automating deployment, scaling, and management of containerised applications. Dubbed by some as the cloud's operating system, K8s has underpinned the world's move onto the web, allowing companies to scale online - but it has also created a new, complex digital world of metrics.
This is one of the key issues with K8s: by default, it generates a huge amount of metrics, and that amount grows continuously. Many businesses can't decide which metrics are important today, which may be important in the future, and which will probably never be important. Too afraid to stop monitoring data that could one day, in a hundred years, be important, businesses instead attempt to monitor it all. To quantify how big an issue this is, consider K8s version 1.24.0: every node exports between 2,000 and 3,000 series, without counting application metrics. As the number of K8s nodes and running containers increases, so too does the number of metrics, very quickly resulting in millions of series. Considering only 25% of K8s metrics are ever used, storing and analysing these huge volumes of unused data is a waste of time and resources.
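To get a feel for those numbers in your own cluster, a minimal sketch along the following lines can help - it assumes metrics are scraped by Prometheus (or a Prometheus-compatible system such as VictoriaMetrics) reachable at localhost:9090, so adjust the URL for your own setup.

```python
# Minimal sketch: count active time series per node via the
# Prometheus-compatible query API. The endpoint URL is an assumption
# (default Prometheus port); change it to match your environment.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"

# Count every active series, grouped by the instance that exports it.
QUERY = 'count by (instance) ({__name__=~".+"})'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

results = resp.json()["data"]["result"]
for item in sorted(results, key=lambda r: -float(r["value"][1])):
    instance = item["metric"].get("instance", "<unknown>")
    series_count = int(float(item["value"][1]))
    print(f"{instance}: {series_count} series")
```

Multiply the per-node counts by the number of nodes and containers, and it becomes clear how quickly the total climbs into the millions.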
A solution to this problem could be a universal monitoring standard, but no such thing exists. Instead, a number of standards end up in use across a single company, as each person or team prefers a specific model. This results in a chaotic collection of data with no structure, no uniformity and no compatibility - and even worse, it means you end up with even more data.
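As a purely illustrative sketch (the metric names below are just examples of common naming conventions, not anything prescribed by K8s), this is what that fragmentation can look like in practice: the same CPU measurement reported three different ways, none of which can be queried or aggregated together.

```python
# Purely illustrative: the "same" measurement reported under three
# different conventions by three different teams. All three are stored,
# but none can be aggregated with the others without a translation layer.
samples = [
    # Prometheus-style exposition
    {"name": "node_cpu_seconds_total",
     "labels": {"mode": "user", "instance": "web-1"}, "value": 12345.6},
    # StatsD/Graphite-style dotted path
    {"name": "servers.web-1.cpu.user", "labels": {}, "value": 12345.6},
    # Home-grown naming from a third team
    {"name": "CpuUserTime", "labels": {"host": "web-1"}, "value": 12345.6},
]

# A query written against one convention only ever sees a third of the
# picture; the remaining series still exist, cost storage, and answer
# no questions.
```

Each convention produces its own set of series, so the fragmentation doesn't just make querying harder - it directly multiplies the volume of data being stored.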
Drowning in data: the crux of the problem
Those in the industry are still struggling to solve the K8s multi-metric issue, so far to no avail. But while solutions to reduce the number of metrics created continue to be sought, there are changes businesses can make to reduce the amount of RAM and disk space needed to maintain high-cardinality series - such as when hundreds of thousands of people attempt to buy Glastonbury tickets.
In fact, looking into that example further, consider the amount of time series data created by a single person purchasing tickets. The network allows the user to log into the website, requiring a database check and confirmation; it supports the ticket purchase, letting a certain number of applicants through based on system availability, then verifying payment details with a bank and updating internal systems on ticket availability; and attendee onboarding is activated, including regular email newsletters that need to be sent at specific pre-agreed points in time ahead of the live event. The amount of time series data that needs to be collected, monitored, analysed and processed for this one single transaction is huge - now consider hundreds of thousands of people attempting to do the same thing. Approaches like optimising data structures and using intelligent algorithms to compress data are effective ways for companies to reduce the energy - and therefore cost - required for data processing and storage.
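To make that concrete, here is a minimal, purely illustrative sketch of one such optimisation: delta-of-delta encoding of time series timestamps. Scrape timestamps usually arrive at a near-constant interval, so most encoded values collapse to zero, which takes far less space than storing full 64-bit timestamps.

```python
# Minimal sketch of delta-of-delta encoding for time series timestamps.
# Not a production codec - just an illustration of why regular intervals
# compress so well.
from typing import List

def encode(timestamps: List[int]) -> List[int]:
    """Store the first timestamp, then the change in delta between samples."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        out.append(delta - prev_delta)  # usually 0 for regular scrapes
        prev_delta = delta
    return out

def decode(encoded: List[int]) -> List[int]:
    """Reverse the encoding to recover the original timestamps."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# Samples scraped every 15 seconds, with a little jitter on the last one:
ts = [1700000000, 1700000015, 1700000030, 1700000045, 1700000061]
enc = encode(ts)
print(enc)                 # [1700000000, 15, 0, 0, 1]
assert decode(enc) == ts   # round-trips losslessly
```

Production time series databases apply the same idea at the bit level, alongside value compression, which is a large part of how they keep the RAM and disk footprint of high-cardinality data manageable.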
Data is essential but expensive. When online websites and applications are hit with peak traffic events, the volume of data and metrics created jumps considerably, often overloading a system and causing it to crash. However, it's not feasible for a business to cover the cost of high bandwidth all the time when it likely won't be used. This is why it's so important to be able to scale up or down quickly and efficiently. If your current approach to data monitoring was not born with scalability in mind, you will not be able to maintain uptime during high-traffic periods.
##
To learn more about the transformative nature of cloud native applications and open source software, join us at KubeCon + CloudNativeCon Europe 2023, hosted by the Cloud Native Computing Foundation, which takes place from April 18-21.
ABOUT THE AUTHOR
Roman Khavronenko, Co-Founder, VictoriaMetrics

Roman is a software engineer with experience in distributed systems, monitoring and high-performance services. The idea to create VictoriaMetrics took shape when Roman and Aliaksandr were working at the same company. Prior to joining the VictoriaMetrics team, Roman worked as an engineer at Cloudflare in London. He lives in Austria with his wife and son.