Instaclustr, the leading provider of fully managed
solutions for scalable open source technologies, today announced it has
successfully created an anomaly detection application capable of processing and
vetting real-time events at a uniquely massive scale - 19 billion events per
day - by leveraging open source Apache Cassandra and Apache Kafka together
with Kubernetes container orchestration.
Instaclustr completed this as an example of the scalability achievable with its
Managed Platform and has published both the detailed design information and
the source code.
Anomaly detection is the identification of unusual events within an event
stream - often indicating fraudulent activity, security threats or, more generally, a
deviation from the expected norm. Because recognizing such anomalies is
integral to the integrity and security of critical business and/or customer data,
anomaly detection applications are widely deployed across numerous industries
and use cases, including financial fraud detection, IT security intrusion and
threat detection, website user analytics and digital ad fraud, IoT systems and
beyond. Anomaly detection applications typically compare inspected streaming
data with historical event patterns, raising alerts if the incoming data matches
previously recognized anomalies or deviates significantly from normal
behavior. These detection systems utilize a stack of solutions that often
include machine learning, statistical analysis, and algorithm optimization, and
that leverage data-layer technologies to ingest, process, analyze, disseminate,
and store streaming data.
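
As an illustration of the comparison described above, the following is a minimal sketch of checking a new event against the recent history for the same key. It is not Instaclustr's detection logic, which this announcement does not describe: the z-score test, the names `is_anomaly`, `HISTORY_SIZE`, `MIN_HISTORY` and `THRESHOLD`, and their values are all assumptions chosen for the example.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

HISTORY_SIZE = 50   # hypothetical: number of past values kept per key
MIN_HISTORY = 10    # hypothetical: values required before judging an event
THRESHOLD = 3.0     # hypothetical: z-score above which an event is flagged

# Recent values per key; older values fall off once HISTORY_SIZE is reached.
history = defaultdict(lambda: deque(maxlen=HISTORY_SIZE))


def is_anomaly(key, value):
    """Compare a new value with the key's recent history, then record it."""
    past = history[key]
    flagged = False
    if len(past) >= MIN_HISTORY:
        mu = mean(past)
        sigma = pstdev(past)
        if sigma > 0:
            # Flag values far from the historical mean.
            flagged = abs(value - mu) / sigma > THRESHOLD
        else:
            # All past values were identical, so any change stands out.
            flagged = value != mu
    past.append(value)
    return flagged
```

In a real deployment the history would be read from a database rather than held in process memory, and the statistical test would be chosen to suit the data.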
However, there are significant challenges in designing an architecture capable
of detecting anomalies in high-scale environments where the volume of daily
events reaches into the millions or billions. In these scenarios, data-layer
technologies must overcome substantial computational, performance and
scalability requirements in order to cope with the massive scale of events.
To showcase just how powerful the open source data-layer technologies
Instaclustr delivers through its fully managed platform can be for processing
massive real-time event streams, its engineering team built a streaming data
pipeline application able to overcome the hurdles of mass-scale anomaly
detection. To do so, Instaclustr teamed the NoSQL Cassandra database and the
Kafka streaming platform with application code hosted in Kubernetes to create
an architecture with the scalability, performance and cost-effectiveness
required for the solution to be viable in real-world scenarios.
Cassandra and Kafka are not just performant and scalable; they are also
naturally complementary technologies. Kafka supports fast, scalable ingestion
of streaming data, and uses a store-and-forward design that provides a buffer
preventing Cassandra from being overwhelmed by large data spikes. Cassandra
then serves as a linearly scalable, write-optimized database ideal for storing
high-velocity streaming data. In the successful experiment, Instaclustr
combined Kafka, Cassandra and the anomaly detection application in a Lambda
architecture, with Kafka as the speed layer and Cassandra as the batch and
serving layer. Instaclustr's solution also utilized Kubernetes on AWS EKS in
order to automate the provisioning, deployment, and scaling of the application.
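
The buffering role described above can be pictured with a toy model. This sketch uses no Kafka or Cassandra client: a `deque` stands in for the Kafka buffer, a list stands in for the Cassandra table, and the class name `BufferedPipeline` and its `max_writes_per_tick` parameter are hypothetical. It only shows how a buffer lets ingestion absorb a spike while the store is written at a bounded rate.

```python
from collections import deque


class BufferedPipeline:
    """Toy stand-in for a buffered ingest path; not a real Kafka/Cassandra client."""

    def __init__(self, max_writes_per_tick):
        self.buffer = deque()   # events accepted but not yet stored
        self.store = []         # events written to the stand-in database
        self.max_writes_per_tick = max_writes_per_tick

    def ingest(self, events):
        """Accept any number of events immediately (the 'store' step)."""
        self.buffer.extend(events)

    def tick(self):
        """Write at most max_writes_per_tick buffered events (the 'forward' step)."""
        written = 0
        while self.buffer and written < self.max_writes_per_tick:
            self.store.append(self.buffer.popleft())
            written += 1
        return written


pipeline = BufferedPipeline(max_writes_per_tick=3)
pipeline.ingest(range(10))   # a spike of 10 events arrives at once
pipeline.tick()              # only 3 are written; 7 wait in the buffer
```

Repeated calls to `tick()` drain the remaining events in arrival order, so the spike is spread over several write cycles instead of hitting the store all at once.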
Proceeding with an incremental development approach, Instaclustr carefully
monitored, debugged, tuned and retuned specific functions within the pipeline
to optimize its capabilities. The result: an anomaly detection application able
to process 19 billion real-time events per day and detect anomalies in those
events.
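
To put the daily figure in per-second terms, the short calculation below converts 19 billion events per day into an average rate. It assumes events are spread evenly across the day; the announcement does not state peak rates, so this is only an average.

```python
EVENTS_PER_DAY = 19_000_000_000   # figure stated in the announcement
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events per second on average")
# prints "219,907 events per second on average"
```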
"Our anomaly detection solution showcases how critical applications can scale -
colossally - using expertly-optimized Kafka and Cassandra in their fully open
source form," said Ben Slater, Chief Product Officer, Instaclustr. "We welcome
enterprises across industries interested in knowing how Kafka and Cassandra can
be leveraged to meet the data scale requirements within their own applications
to get in touch, whether you're building a real-time anomaly detection application
or any other solution."
"Apache Cassandra and Apache Kafka each hold a well-earned reputation for their
ability to deliver high data performance in mass-scale use cases, as is
thoroughly demonstrated by Instaclustr's new anomaly detection data pipeline,"
said James Curtis, Senior Analyst, Data, AI, and Analytics at 451 Research.
"Through this successful experiment, Instaclustr again showcases the vast
potential of these open source technologies, which organizations can take full
advantage of through Instaclustr's managed platform."