UC Berkeley's AMPLab
has become the center of gravity for innovation in big data analytics. There is
even a stack named for it, the Berkeley Data Analytics Stack (BDAS). Everyone loves
the acronym.
The latest big thing from AMPLab BDAS is Alluxio,
a memory-centric virtual distributed storage system. I recently spoke to
Haoyuan Li, the creator of Alluxio, founding committer of Spark and CEO of Alluxio Inc., about the latest developments in his project.
VMblog: First of all, can you explain what problem you are trying to
solve?
Haoyuan Li: As datasets continue to grow, storage has
increasingly become the critical bottleneck for enterprises leveraging Big Data
frameworks like Apache Spark, Apache MapReduce, and Apache Flink. We saw this when
I was working on systems like Spark at UC Berkeley's AMPLab, where I was a PhD
candidate co-advised by two of the world's leading computer scientists, Prof. Ion
Stoica and Prof. Scott Shenker.
The frameworks themselves are driving much of the
exciting innovation in Big Data, but the complexity of the underlying storage
systems was slowing the pace at which data assets could be leveraged by these
frameworks. Traditional storage architectures are inadequate for distributed
computing and the size of today's datasets.
VMblog: How did you achieve a breakthrough and create
Alluxio?
Haoyuan Li: The hard work started almost four years ago.
The aha moment was realizing how to give Big Data frameworks and
applications access to all the data in memory for high-speed computations while
also enabling true failover without replication. That was the first big step.
The next came with our innovations around a unification platform abstraction
that allowed users to access any persistent data stored anywhere in any major
storage system or file system through an API. The version 1.0 release we announced yesterday is all about this capability.
(You can see the news release here.)
VMblog: How does Alluxio help developers and operators at the
same time?
Haoyuan Li: It makes life much better for both! We
removed the natural tension between the two groups. Developers no longer have
to worry about the underlying storage systems or wait for ops. They can work
with Big Data frameworks and write distributed apps and just point to our API.
Alluxio takes care of all the hard work automatically. For operators, it means
they keep the systems they trust for storing their data. And they are not
holding developers back with any obstacles in the storage infrastructure.
VMblog: Do you have any examples of Alluxio in use?
Haoyuan Li: Yes, we do. Some of the information on our
technology refers to our previous name, Tachyon, just to warn you. The
community changed the project name to Alluxio to avoid trademark issues and to
protect the project. Anyway, your readers can check out how a big bank uses
Alluxio, even with some code samples. Here is how Barclays saw a big improvement
in performance and how they also used our data locality capabilities
to comply with banking laws around confidential customer information. We just
posted on our company web site a great case study on Baidu, the giant Asian search engine. They run 2-petabyte
analytics jobs in memory using Alluxio.
##
Thank you to Haoyuan Li, CEO of Alluxio, for taking time out to speak with VMblog and answer a few questions.