
There's a data explosion taking place right now, and it's only going to accelerate. Machine
learning will be needed to automatically discover data and detect usage patterns across
high-scale data lakes and other sources. Now is the time for data lakes to
start proving their business value - not just storing massive quantities of
data. To dive deeper into this topic, I reached out to Arvin Hsu, Senior
Director of Data Science and Machine Learning at GoodData.
VMblog: What is the current state of
enterprise data as it pertains to machine learning?
Arvin Hsu: As
more, faster, and less structured data pours in from an estimated 50 billion
IoT-connected devices by 2020 - adding to an already unprecedented amount of
data - today's enterprises are facing the new challenge of making that data
actionable and capable of creating meaningful change.
Data
lakes promised a path to make this happen, yet enterprises are abandoning them
left and right. Fundamentally, this has been a reflection of the failed promise
of extracting meaning from big data. The challenge of extracting concrete
business value from terabytes of structured, unstructured, IoT, audio/visual
data and more now lies in the hands of data scientists and machine learning
(ML) engineers.
To meet this challenge, data scientists will need more advanced cataloging tools to access data from many different sources, visualization and discovery tools to help them understand the data at hand, and automated, ML-driven meaning-extraction systems. The signals that will
help businesses make better decisions, create better customer value, and
optimize their workflows reside within the big data storage centers of these
enterprises. It will be up to the AI teams to write the ML algorithms that can
detect those signals in order to effect meaningful change.
VMblog: When will enterprises start to
fully utilize big data?
Hsu: Early
adopters like Amazon, Google, and Uber are already fully utilizing their data.
Other enterprises are all playing catch up - building out their data
engineering pipelines, signal detection ML algorithms, and the ML
operationalization systems required to turn insights into action. I predict
that organizations will start reaping the benefits of their data stores in
2019 and beyond, as major enterprises continue to adopt cloud computing,
scalable ML architecture, and streamlined production systems. These benefits will take the form of better customer personalization, more efficient and streamlined workflows, and optimization improvements across a myriad of business processes.
VMblog: What role does the continued emergence of cloud compute and ML play in this?
Hsu: As
more siloed enterprise data sources get migrated to the cloud, data access and
data democratization will increase, allowing data intelligence gurus to get a
more holistic view of customers, products, and business processes. Similarly,
the shift towards cloud services for big data pipelines, real-time and
streaming, and compute allows for easier integration of everything that's
needed to create high impact data products for the business.
Cloud
compute platforms allow much easier, faster provisioning for big data
processing initiatives, and incremental, serverless billing models make it much
cheaper to implement both temporary sandboxes as well as production compute
systems. Cloud-based data pipelines and compute models create an ease of
distribution for end-users, whether they are internal business units or
embedded into a customer-facing application. GoodData has built an end-to-end intelligence platform that leverages all of these benefits, ingesting enterprise data into the cloud and delivering analytics and intelligence to end-users, embedded directly into their applications.
VMblog: How can data scientists make
data actionable and capable of creating meaningful change?
Hsu: Data
scientists need to start with understanding business processes, use cases, and
pain points. This allows them to tie their data discovery and model building to
concrete business value - usually decisions or actions that a business takes.
Throughout the data discovery and model building process, data scientists need to keep thinking about the impact the models they build will have, such as the business cost of Type I errors (false positives) and Type II errors (false negatives). The best models don't maximize mathematical accuracy or prediction performance - they maximize business impact.
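To make that point concrete, here is a minimal sketch in Python (assuming scikit-learn, with purely hypothetical cost figures) of choosing a classification threshold by total business cost rather than by raw accuracy:

```python
# A minimal sketch of cost-sensitive threshold selection: rather than
# maximizing accuracy, pick the decision threshold that minimizes the
# combined business cost of Type I errors (false positives) and
# Type II errors (false negatives). Cost figures are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

COST_FP = 5.0    # hypothetical cost of acting on a false alarm
COST_FN = 50.0   # hypothetical cost of missing a real event

# Synthetic, imbalanced stand-in for real enterprise data
X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_val)[:, 1]

def business_cost(threshold):
    preds = probs >= threshold
    fp = np.sum(preds & (y_val == 0))   # Type I errors
    fn = np.sum(~preds & (y_val == 1))  # Type II errors
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.01, 0.99, 99)
best = min(thresholds, key=business_cost)
print(f"cost-optimal threshold: {best:.2f} "
      f"(default 0.50 cost: {business_cost(0.5):.0f}, "
      f"optimized cost: {business_cost(best):.0f})")
```

Because missing a real event is assumed to cost ten times more than a false alarm, the cost-optimal threshold lands well below the default 0.5, trading some accuracy for better business outcomes.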
VMblog: What kind of business use
cases can be made possible by deeper text analytics, more free-form text, and
enhanced sentiment analysis?
Hsu: Enterprises
have a wealth of untapped information in unstructured text fields. Comprising
everything from qualitative problem descriptions to customer satisfaction
reports, unstructured text can provide amazing insight into not only customer
satisfaction, behavior and opinion, but also business processes, user feedback,
order processing and more. Using new deep learning models to extract meaning from unstructured text fields, and feeding those results as inputs into downstream learning models, will enable enterprises to unlock significant value from all of those untapped resources.
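As a minimal sketch of that idea, assuming the Hugging Face transformers library and its default sentiment-analysis checkpoint, the snippet below scores free-form comments with a pretrained deep learning model and converts them into signed numeric features that a downstream model could consume:

```python
# A minimal sketch of turning free-form text into structured inputs:
# a pretrained sentiment model scores each comment, and the scores
# become numeric features for downstream learning models.
# Assumes the Hugging Face `transformers` library and its default
# sentiment-analysis checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

comments = [
    "The order arrived two weeks late and support never replied.",
    "Setup was painless and the dashboard is genuinely useful.",
]

for comment, result in zip(comments, classifier(comments)):
    # result looks like {'label': 'NEGATIVE', 'score': 0.999}
    signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
    print(f"{signed:+.3f}  {comment}")
```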
VMblog: GoodData is unique in that its analytics and data pipeline are built end-to-end in the cloud. How does
this help companies eliminate the pain points of getting meaningful insights
from their data?
Hsu: GoodData's
end-to-end platform provides a seamless and efficient value creation process
from data ingestion all the way to embedded recommendations and other
analytics. The typical friction points of multi-source data integration, ML productionization, and endless BI dashboards are all obviated. GoodData focuses on changing business
workflows by embedding action-oriented analytics at the point-of-work. This
creates fundamental changes in the actions and decisions that businesses make
and creates a direct, seamless link from data ingestion to business value.
VMblog: How will cloud-based AI
services help manage the unprecedented volume and diversity of enterprise data
in 2019 and beyond?
Hsu: AI
helps us comb through massive amounts of data, separating the signal from the
noise. As the AI industry matures to handle more big data use cases, and
technologies develop to deal with these issues, more big data stores that have
yet to be tapped into will yield nuggets of value as AI algorithms "automatically mine" them. These advanced mining algorithms will be more capable of detecting significant signals that impact business-critical KPIs. This holds true not only for unstructured text, but also for IoT data, audio/video/image data, data stored in ER databases, and more. All of this, of course,
is tied to the continued availability of more powerful and more affordable
compute resources. From Google's TPUs to AWS's P3 GPU clusters, the compute
necessary for big data AI and deep learning continues to become more
affordable.
VMblog: As major players such as AWS,
GCP, and Azure continue to introduce greater cloud compute and deep learning
resources, what do you see as the direction of these technologies, and what
does that mean for the future of AI?
Hsu: All
the major cloud vendors are competing against each other to offer the most
innovative, across-the-board solutions for big data and AI. The competition
fuels innovation and the creation of better serverless offerings to use ML to
extract business value. Enterprises will only benefit from integrating
cloud-based architecture solutions into their roadmap, whether they choose to
work with a single vendor or a multi-cloud solution. We will also continue to
see specialization and diversification among the cloud vendors. Microsoft has a
technological lead with its R-language offerings, while Google has a lead with
its TensorFlow-based offerings. As technology and innovation evolve, we will
continue to see the cloud vendors differentiate and specialize into different
areas of ML and AI.
##