Today,
Amazon Web Services, Inc. (AWS), an Amazon.com company,
launched AWS Glue, a fully managed extract, transform, and load (ETL)
service that makes it easy for customers to prepare and load their data
into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon
Relational Database Service (Amazon RDS), and databases running on
Amazon Elastic Compute Cloud (Amazon EC2) for query and analysis.
Customers can create and run an ETL job with a few clicks in the AWS
Management Console. Customers simply point AWS Glue at their data stored
on AWS, and AWS Glue discovers the associated metadata (e.g., table
definitions) and classifies it, generates ETL scripts for data
transformation, and loads the transformed data into a destination data
store, provisioning the infrastructure needed to complete the job. With
AWS Glue, data can be available for analysis in minutes, and because AWS
Glue is serverless, customers only pay for the compute resources they
consume while executing data preparation and loading jobs. To learn more
about AWS Glue, visit https://aws.amazon.com/glue.
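To make the metadata discovery step concrete, the following is a minimal Python sketch of how a table definition could be inferred from a handful of sample records. It is an illustration only and not AWS Glue's implementation: the function names, the type labels ("long", "double", "string", "boolean"), and the sample records are all assumptions made for this example.

```python
from typing import Any, Dict, List


def infer_type(value: Any) -> str:
    """Map a Python value to a hypothetical column type label."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Scan sample records and settle on one type per field."""
    schema: Dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            if value is None:
                continue  # a missing value tells us nothing about the type
            seen = infer_type(value)
            previous = schema.get(name)
            if previous is None or previous == seen:
                schema[name] = seen
            elif {previous, seen} == {"long", "double"}:
                schema[name] = "double"  # widen integers to floating point
            else:
                schema[name] = "string"  # fall back on any other conflict
    return schema


# Hypothetical sample data standing in for records read from a data store.
sample_records = [
    {"id": 1, "price": 9, "title": "chair"},
    {"id": 2, "price": 12.5, "title": "desk", "in_stock": True},
]

table_definition = infer_schema(sample_records)
print(table_definition)
```

In this sketch a field that appears with both integer and floating-point values is widened to "double", and any other type conflict falls back to "string". A production crawler would need to handle nested structures, file formats, and partitions, which are out of scope here.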
Data
integration - extracting data from various sources, normalizing it, and
loading it into data stores - often represents as much as 75 percent of
the time required to implement an analytics project. Customers can
spend months hand-coding and editing ETL scripts, which frequently
become more complex and error-prone as data volumes grow and new data
sources are added. In addition, running ETL jobs requires dedicated hardware
that often sits idle between jobs. AWS Glue significantly speeds the ETL
phase of analytics projects by eliminating all of the undifferentiated
heavy lifting involved in creating, managing, and modifying ETL jobs.
After
crawling a customer's selected data sources, AWS Glue identifies data
formats and schemas to build a unified Data Catalog that provides a
central view of customers' selected data. This makes it easy for
customers to search and manage all of their data across various data
stores without having to manually move it. When a customer identifies a
data source (e.g., a database table) and target (e.g., a data warehouse)
from the Data Catalog, AWS Glue matches the schemas and generates data
transformation code that is customizable, reusable, portable, and
sharable. Developers can schedule any number of ETL jobs, and AWS Glue
manages the rest - automatically spinning compute resources up or down
depending on customer ETL workloads. By streamlining the process of
creating ETL jobs, AWS Glue allows customers to build scalable and
reliable data preparation platforms spanning thousands of jobs, with
built-in dependency resolution, scheduling, resource management, and
monitoring.
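As a rough illustration of what matching a source schema to a target schema can involve, the Python sketch below pairs columns by case-insensitive name and produces a list of column mappings that transformation code could be generated from. This is a simplified assumption about the idea, not a description of how AWS Glue matches schemas; the function name, the four-field mapping tuples, and both example schemas are hypothetical.

```python
from typing import Dict, List, Tuple

# (source column, source type, target column, target type)
Mapping = Tuple[str, str, str, str]


def match_schemas(
    source: Dict[str, str], target: Dict[str, str]
) -> Tuple[List[Mapping], List[str]]:
    """Pair source and target columns by case-insensitive name.

    Returns the column mappings plus the source columns with no target.
    """
    target_by_name = {
        name.lower(): (name, col_type) for name, col_type in target.items()
    }
    mappings: List[Mapping] = []
    unmatched: List[str] = []
    for name, col_type in source.items():
        hit = target_by_name.get(name.lower())
        if hit is None:
            unmatched.append(name)
        else:
            mappings.append((name, col_type, hit[0], hit[1]))
    return mappings, unmatched


# Hypothetical schemas: a source table and a data warehouse target.
source_schema = {
    "id": "long",
    "price": "long",
    "title": "string",
    "in_stock": "boolean",
}
target_schema = {"ID": "long", "Price": "double", "Title": "string"}

mappings, unmatched = match_schemas(source_schema, target_schema)
for source_col, source_type, target_col, target_type in mappings:
    cast_note = "" if source_type == target_type else " (cast needed)"
    print(f"{source_col}:{source_type} -> {target_col}:{target_type}{cast_note}")
print("unmatched source columns:", unmatched)
```

Here the "price" column is flagged as needing a cast from "long" to "double", and "in_stock" is reported as unmatched because the target has no such column. Keeping the mapping as plain data is one way to make the generated transformation easy to review and customize.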
"AWS's
scalable, reliable cloud storage, combined with our broad range of
analytics services make it easier than ever for customers to collect,
store, analyze, and share data," said Raju Gulabani, Vice President,
Databases, Analytics, and AI, Amazon Web Services. "While it's amazing
to see how much analytics are being run on AWS today, many have told us
that there is one piece of the equation that is still way too hard -
cleaning and preparing huge volumes of data for analysis. We developed
AWS Glue to eliminate much of the undifferentiated heavy lifting
involved with ETL. By cataloging all of a customer's data and automating
the ETL process, AWS Glue not only takes a lot of the hassle out of
analytics. It also makes it possible for customers to store their data
in as many sources as they want, and very quickly start analyzing all of
it with whatever AWS service they choose."
NewsCorp
is a global provider of news and business information, delivering
content to a few hundred million consumers every day in over 50
countries. "At NewsCorp, we are building a world-class digital platform
on AWS to distribute content to our external customers and to facilitate
data-driven decision making across all our businesses. We merge data
from a variety of sources and load it to our Amazon S3-based data lake
on a continuous basis," said Simon Smith, Chief Data Officer at
NewsCorp. "AWS Glue is unparalleled in its ability to infer, classify,
and transform data. With AWS Glue our data scientists and analysts can
always have access to the latest data available in our data lake. AWS
Glue Data Catalog automatically detects the availability of new data,
infers its metadata and makes it readily available in Amazon Athena so
we can start querying that data. Our AWS Glue ETL jobs seamlessly
convert raw data in a variety of data formats to an Amazon Athena
optimized Parquet data format. And the best part is that AWS Glue is
serverless. We do not have to provision or manage any resources to
prepare data for analytics."
21st
Century Fox is home to a global portfolio of media companies that reach
more than 1.8 billion homes in 50 languages every day. "As part of our
overall data strategy, we are building a petabyte-scale data lake on
Amazon S3 so that our executives can have access to any data asset
through a unified data platform. We bring in data from a variety of
sources, ranging from our ERP systems to clickstream and mobile
analytics, process it, and make it available in a queryable form," said
John Herbert, Global CIO, 21st Century Fox. "We are always interested in
trying out new products that will reduce the administrative overhead of
managing our data lake. We are impressed by AWS Glue's ability to
automatically discover new data, extract the associated metadata and
make it available through a central Data Catalog so we can instantly
start querying this data. We are looking forward to making AWS Glue a
component of our data lake."
myTomorrows
is an online platform that provides information and access to treatment
options in the form of Clinical Trials and Early Access Programs. "We
ingest clinical trial data, medical vocabularies and scientific
publications that vary in formats, schema and quality from a variety of
data sources, to provide insights to our customers," said Robert-Jan
Sips, Chief Technology Officer, myTomorrows. "AWS Glue's automatic
schema discovery and code generation features are truly a game changer
for a small, fast-growing organization like ours. AWS Glue makes it
extremely easy and cost effective to onboard new datasets, and its
serverless offering makes it a breeze to test and run our ETL jobs. Our
developers love that they can simply connect their notebooks to AWS Glue
and get going without any ramp up time."
The
OLX Group operates a network of online trading platforms in over 40
countries, with over 300 million monthly users worldwide. "We collect
clickstream data across billions of monthly visits and page views for
all our online marketplaces into a central data lake on Amazon S3. We
are constantly looking for products that will make our data ingest
pipeline robust, reliable, and automated," said Jakub Orlowski, Data
Engineering Manager, OLX. "We jumped at the first opportunity to start
using AWS Glue and loved its ease of use, flexibility, and zero
administrative overhead. AWS Glue automatically converts raw JSON data
from our data lake into Parquet data format and makes it available for
search and querying through a central Data Catalog. We can use our
Zeppelin notebooks to edit the AWS Glue generated ETL code and once we
are done, AWS Glue runs everything on a serverless Spark platform. AWS
Glue will allow us to push our data innovation and democratization
efforts to the next level and bring data producers and consumers closer
than ever before."
OST,
an APN Partner with expertise in building enterprise cloud solutions
for connected products, is working with Herman Miller, one of the
world's largest manufacturers of office furniture, to bring IoT and Big
Data to the workplace. "We are partnering on an IoT platform and
analytics solution with Herman Miller to collect real-time data from
sensor-enabled furniture, catalog it in a data lake, then run machine
learning algorithms. Office employees benefit from instant ergonomic
adjustments, and employers can measure the effectiveness of their space
for optimal real estate use," said Alex Jantz, Senior Architect, OST.
"AWS Glue helps cut our DevOps time in half. We start with an
auto-generated script, then customize it with Zeppelin notebooks as
needed. AWS Glue has completely redefined the way we think of ETL. We
just focus on the custom code and AWS Glue takes care of the rest."
Customers
can get started with AWS Glue from the AWS Management Console. AWS Glue is
available in the US East (N. Virginia) Region and will expand to
additional Regions in the coming months.