The cloud computing industry is facing a critical challenge - how to
efficiently allocate resources and optimize costs at scale. As
organizations increasingly rely on cloud platforms like Databricks to
power their data and AI initiatives, managing the complexity and costs
of cloud infrastructure has become a major pain point. Enter
Sync
Computing, an innovative startup that is pioneering a new paradigm
called "declarative computing" to tackle this problem head-on.
Founded by MIT and UC Berkeley alums with backgrounds in
high-performance computing, Sync Computing has developed a machine
learning-powered platform that aims to revolutionize how organizations
manage their cloud resources. In a recent briefing at the 58th IT Press
Tour in Boston, Sync Computing CEO and co-founder Jeff Chou shared
insights into the company's technology and vision.
The Resource Allocation Problem
Chou outlined what he calls the "resource allocation problem" in cloud computing:
"Anytime you want to spin up any resources, even on-prem or on the
cloud, the old way of doing it was always, you have your code, you have
your data and then you always have to specify the compute resources. You
have to say, I want this many nodes of this instance type, this
storage, this network, etc. And you specify it and then you run it and
then you get that on the cloud and then the output is always these
things - these kind of business metrics that everyone sees, that job costs you a hundred dollars, that took one hour to run, your
latency was 300 milliseconds, etc. And this is literally how the entire
world works today."
This traditional approach leads to three major business problems:
- High compute costs
- Inability to tune infrastructure at scale
- Difficulty meeting SLA deadlines
As Chou explained, "Compute costs are just incredibly high. Cloud
costs as everyone knows is a huge problem and everyone wants to be more
efficient. Two, you can't tune at scale. These are kind of more
application specific settings and compute things are very complicated
and companies might have thousands or tens of thousands of pipelines
running. And no amount of employees can manage and optimize the
compute at that scale. And the third bucket is more on a performance
side which is like you can't hit SLA deadlines."
Introducing Declarative Computing
To address these challenges, Sync Computing is proposing a new
paradigm called "declarative computing." The key idea is to flip the
traditional model on its head:
"Why can't we flip it around and say, instead of the outputs being
the cost and runtime, why can't the input be what we want the business
performance to be?" Chou asked. "For example, here's the code. And
instead of specifying low-level compute resources, I want to specify
high-level business goals because that's all I really care about."
With declarative computing, users simply specify their desired
business outcomes - such as minimized cost, a maximum runtime, or a
target latency. An intelligent system then determines the optimal
infrastructure configuration to meet those goals.
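The inversion Chou describes can be sketched in a few lines of Python. The class and field names below are purely illustrative, not Sync Computing's actual interface; the point is only the contrast between pinning resources and stating goals:

```python
from dataclasses import dataclass

@dataclass
class ImperativeSpec:
    # Traditional approach: the user pins low-level compute resources.
    instance_type: str
    num_nodes: int
    storage_gb: int

@dataclass
class DeclarativeSpec:
    # Declarative approach: the user states business outcomes and lets
    # an optimizer (not shown) choose the resources.
    max_cost_usd: float          # e.g. "this job may cost at most $5"
    max_runtime_minutes: float   # e.g. "it must finish within 60 minutes"

# Old way: "I want this many nodes of this instance type, this storage..."
old_way = ImperativeSpec(instance_type="m5.xlarge", num_nodes=8, storage_gb=500)

# New way: "here's the code; here are the business goals I care about."
new_way = DeclarativeSpec(max_cost_usd=5.0, max_runtime_minutes=60.0)
```

In the old model the cost and runtime are outputs discovered after the run; in the new model they become the inputs, and the node counts and instance types become the system's problem.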
Closing the Feedback Loop with Machine Learning
The secret sauce behind Sync Computing's approach is a closed-loop
feedback system powered by machine learning. As Chou described:
"What's actually missing in computing at all in the entire industry
is a feedback loop. There is no feedback loop going from 'hey this job
ran' and how did it do and should we try to improve things? There is
nothing today. The entire cloud industry is completely static and the
resources are fixed."
Sync Computing's platform continuously monitors workloads as they
run, collecting data and using machine learning models to understand the
relationship between infrastructure choices and performance outcomes.
This allows the system to automatically optimize and tune infrastructure
over time.
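A toy version of such a feedback loop can be sketched as follows. The cost model and the explore-then-exploit policy here are synthetic stand-ins, not Sync Computing's proprietary ML models; a real system would learn from production telemetry rather than a hard-coded table:

```python
import random

random.seed(0)
CANDIDATE_NODE_COUNTS = [2, 4, 8, 16]

def run_job(num_nodes):
    # Synthetic stand-in for a real cloud run: cost bottoms out near
    # 4 nodes, with some run-to-run noise.
    base = {2: 6.0, 4: 3.0, 8: 4.5, 16: 9.0}[num_nodes]
    return base + random.uniform(-0.3, 0.3)

def mean_cost(costs):
    return sum(costs) / len(costs)

history = {}  # config -> observed costs: the feedback the loop closes on
for step in range(20):
    if step < len(CANDIDATE_NODE_COUNTS):
        # Learning phase: try each candidate configuration once.
        config = CANDIDATE_NODE_COUNTS[step]
    else:
        # Optimization phase: exploit the cheapest configuration seen so far.
        config = min(history, key=lambda c: mean_cost(history[c]))
    history.setdefault(config, []).append(run_job(config))

best = min(history, key=lambda c: mean_cost(history[c]))
print(f"Converged on {best} nodes")  # settles on the 4-node configuration
```

The static status quo Chou criticizes corresponds to stopping after the first run and never revisiting the choice; the loop above is what "closing the loop" adds.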
"The key concept that we're doing is we have this closed-loop
feedback that goes between the customer environment and then the Sync
environment, and we're actually closing this loop and actually managing
customers' infrastructure," Chou explained.
Gradient: Automated Infrastructure Optimization for Databricks
Sync Computing's flagship product, called Gradient, applies this
declarative computing approach specifically to Databricks environments.
Gradient integrates with customers' Databricks jobs and automatically
optimizes infrastructure settings to reduce costs and improve
performance.
Chou demonstrated how Gradient works:
"This is our product page. And this is a, for example, a job, a
Databricks job. Let's say it runs once an hour. The top graph is the
total cost of your job. This is both the Databricks cost and your cloud
cost. So we kind of put it all together for users which people really
like because that's very hard to do. The bottom graph is the runtime."
He showed how Gradient goes through an initial learning phase for
each job, then switches to an optimization phase where it dramatically
reduces costs:
"Once it kind of figures it out, we have these proprietary
ML models behind the scenes where we say, okay, I got it. And
then it switches to green so that it can start optimizing. And it might go back and forth between green and gray because you
might want to explore more search opportunities, but eventually,
it'll figure it out and then drop costs tremendously."
In one example, Gradient reduced the cost of a job from $8.31 to
$0.90 - an 89% cost savings - while maintaining similar runtime
performance.
Beyond Cost Optimization
While cost savings are a major benefit, Chou emphasized that
Gradient's capabilities go beyond just reducing spend. The platform can
also help organizations hit specific performance targets and SLAs.
He demonstrated how users can set a target runtime for a job, and
Gradient will automatically reconfigure the infrastructure to meet that
goal:
"All the user has to do is come in here to our settings and change
the SLA from zero to five, click save, and now the
algorithm says, alright, instead of optimizing for cost, you're going to go back to that declarative computing concept. Your
goal is to cut down runtime."
This allows organizations to make intelligent tradeoffs between cost and performance based on their business needs.
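That tradeoff can be framed as a small constrained optimization: minimize cost subject to the runtime staying under the SLA. The sketch below uses a made-up configuration table to illustrate the idea; none of the numbers come from Gradient:

```python
configs = [
    # (num_nodes, cost_usd, runtime_minutes) -- illustrative values only
    (2, 2.0, 120.0),
    (4, 3.0, 55.0),
    (8, 5.5, 30.0),
    (16, 10.0, 18.0),
]

def cheapest_within_sla(configs, sla_minutes):
    """Pick the lowest-cost configuration whose runtime meets the SLA."""
    feasible = [c for c in configs if c[2] <= sla_minutes]
    if not feasible:
        return None  # no configuration can meet this SLA
    return min(feasible, key=lambda c: c[1])

# No SLA pressure: take the cheapest configuration overall.
print(cheapest_within_sla(configs, sla_minutes=float("inf")))  # (2, 2.0, 120.0)

# Tight SLA: pay more to finish in time.
print(cheapest_within_sla(configs, sla_minutes=30.0))          # (8, 5.5, 30.0)
```

Tightening the SLA shifts the answer from the 2-node to the 8-node configuration: the same declarative goal-setting, with the constraint rather than the objective doing the work.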
Key Differentiators and Competitive Landscape
When asked how Sync Computing's approach differs from traditional job
schedulers or auto-scaling, Chou highlighted several key
differentiators:
- Custom ML models for each workload: "If we're monitoring a
thousand Databricks jobs, for example, there are a thousand different
models. Each one custom tuned for each job."
- Intelligence beyond simple rules: "Auto scaling is literally a
one-line if statement on what are the rules to add and remove
workers. And what we're trying to do is next level, which is much
more intelligent, measurement- and data-driven analysis."
- Focus on batch workloads: "One kind of technical requirement of
ours is it's batch workloads. Meaning it runs, it
finishes, it ends. As opposed to streaming for example where it's
just always on all the time."
Chou noted that while Databricks recently launched a serverless
offering that aims to address some of the same challenges, it is still
very new and optimized primarily for performance rather than cost. He
believes Sync Computing's more flexible, ML-driven approach provides
additional value.
Looking Ahead: Expansion Beyond Databricks
While Sync Computing is currently focused on optimizing Databricks
environments, Chou sees significant opportunity to expand to other
platforms and use cases:
"Databricks is just strategic for us. We wanted to be very
specific. We get requests all the time. Snowflake is probably one of our
top requests. But then other kind of more general computing like on AWS
- containers, Kubernetes, Lambda functions, ECS, EKS, these kind of
compute resources that are used all the time in batch workloads."
He indicated that the company plans to gradually expand its offerings
each quarter based on customer demand and market opportunity.
The Road Ahead
As cloud adoption continues to accelerate and organizations grapple
with rising infrastructure costs, solutions like Sync Computing's
declarative computing approach are likely to become increasingly
critical. By leveraging machine learning to automate infrastructure
optimization, Sync Computing aims to give engineers and data scientists
more time to focus on building products and deriving insights rather
than managing compute resources.
While the company is still in its early stages, with its Gradient
product only becoming commercially available earlier this year, Chou
believes they are just scratching the surface of what's possible:
"We're really just getting started on what we
can do. And then we want to generalize this because there are some
shades of this you can apply to Snowflake for example, Kubernetes
containers, etc. But the challenge is building a really good model and
that's kind of what we focused on."
As organizations increasingly seek ways to tame cloud costs without
sacrificing performance, Sync Computing's innovative approach to
infrastructure optimization positions them as a company to watch in the
evolving cloud computing landscape.