Anaconda, Inc. announced the results of its State of Data Science
survey, revealing key trends in data science and machine learning within the Anaconda
community. The survey, which ran from March 22 to April 30, 2018, resulted in
4,218 responses with a 100% survey completion rate. Students were the largest
group of respondents (26%), followed by data scientists (16%), academics (15%) and
software developers (15%).
"The shift from managing big data to making data actionable is more important than
ever in the enterprise," said Krishnan Subramanian, Chief Research Analyst,
Rishidot Research. "Anaconda is easy to use and its users are experiencing
clear value in their machine learning platform for cloud native especially as
they transition to new technologies like containers."
The State of Data Science
The Anaconda State of Data Science is strong. With 2 to 2.5 million downloads
per month from January to March 2018, Anaconda is easily the most popular Python
distribution, with a growing R following.
Key findings of the survey include:
- The application of cloud-native technologies such as Docker containers and
Kubernetes to data science is growing at the expense of traditional Big Data
(Hadoop/Spark).
- Google Cloud Platform's data services outrank those of Amazon Web Services
and Microsoft Azure. Although Google Cloud is the third largest cloud provider,
its focus on data services is paying off with the Anaconda community.
- Anaconda is gaining popularity with software developers (15%), in addition to
data scientists (16%) and academics (16%).
- Matplotlib continues to enjoy its first-mover advantage in visualization,
sweeping the category, but it is a highly crowded space with many strong
competitors, both open source and commercial. Plotly, Tableau, Microsoft Power BI
and Tibco Spotfire are all strong commercial competitors to Matplotlib and to
other open source projects like ggplot, Bokeh, D3 and Altair.
- It matters a lot that Anaconda is free, but not so much that it is open
source. Free was ranked the most important attribute, while open source
licensing was second to last.
"The Anaconda Distribution is the data science
community's de-facto platform for data processing, visualization and machine
learning/AI. The survey shows that data science is undergoing a shift away from
traditional big data (Hadoop/Spark) towards cloud-native technologies such as
Docker containers, Kubernetes and API-driven applications," said Mathew Lodge,
SVP Products and Marketing, Anaconda Inc. "We're also pleased to see more
software developers using the Anaconda platform as machine learning is becoming
pervasive and will be integrated with every application."
Data Scientists Dropping Big Data and Looking at Containers and Cloud
Traditional Hadoop-style "big data" performed relatively weakly against the
other options. That is notable given that this is a data-centric audience and
that Hadoop has dominated on-premises (non-cloud) data infrastructure for the
past 10 years and spawned two tech IPOs (Hortonworks and Cloudera). From this,
one could conclude that what was "big data" in 2005, when Hadoop began, now
easily fits into a single server's memory, and that there are many alternatives
to building a Hadoop data lake. Additionally, container use in production is
growing. Docker makes a strong showing at 19%, beating out Hadoop/Spark at 15%,
followed by Kubernetes at 5.8%. These results suggest that modern cloud-native
architectures like Docker and Kubernetes are rising, again at the expense of
traditional Hadoop "big data" and Apache Mesos (0.85%).
Additional findings of interest
include:
- NoSQL databases came in at 14%, right behind the cloud services,
demonstrating their value for storing and processing semi-structured data.
- Dask, an open source technology for parallelizing single-host algorithms and
machine learning across multiple CPU cores or multiple servers, came in at 3% of
responses.