Virtualization Technology News and Information
5 Emerging Open Source Big Data Projects that will Revolutionize Your Business

Article Written by Daniel Kulp, VP of Open Source Development and Application Integration at Talend

Twenty years ago, the open source movement was formally born, setting in motion what would become the most significant trend in software development since. The Open Source Initiative, a non-profit organization that advocates for open source development and non-proprietary software, pegs the date of inception at February 3, 1998.

Since then, open source software (OSS) has disrupted the status quo in groundbreaking ways while becoming mainstream in the process. According to Ovum, open source is now the default option across several big data categories, ranging from storage, analytics and applications to machine learning. In the latest survey from Black Duck Software and North Bridge, 90 percent of respondents reported they rely on open source "for improved efficiency, innovation and interoperability," most commonly because of "freedom from vendor lock-in, competitive features and technical capabilities, ability to customize, and overall quality."

If you're an IT leader at an organization of any size, you should be planning how to incorporate OSS into your infrastructure -- or, if you've already started, thinking about the next project. OSS can enable extreme agility and lightning-fast responses to customers, business needs and market challenges, but with thousands of successful open source projects underway, it can be hard to know which ones deserve your attention.

With that in mind, here are five projects we recommend you look into and keep an eye on, given the potential impact they may have on your IT infrastructure and overall business:

1.     Apache Beam is a unified programming model whose name combines the terms for the two big data processing modes, batch and streaming, because it covers both in a single model: Beam = Batch + strEAM. Under the Beam model, you design a data pipeline only once and choose from multiple processing frameworks later. You don't need to redesign the pipeline every time you want a different processing engine, so your team can pick the right engine for each use case.
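As a minimal sketch of that write-once idea (assuming the `apache_beam` Python package is installed; the step names and the `to_upper` function are illustrative, not from the article), the pipeline below is defined once and could later target the Direct, Spark, or Flink runner without being redesigned:

```python
def to_upper(word):
    # Pure transform logic -- the same function works on any Beam runner.
    return word.upper()

try:
    import apache_beam as beam

    # Swapping the processing engine is a pipeline-option change,
    # not a redesign of this code.
    with beam.Pipeline() as pipeline:
        (pipeline
         | "Create" >> beam.Create(["batch", "stream"])
         | "Upper" >> beam.Map(to_upper)
         | "Print" >> beam.Map(print))
except ImportError:
    # apache_beam not installed; the transform logic still runs on its own.
    print([to_upper(w) for w in ["batch", "stream"]])
```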

2.     Apache CarbonData is an indexed columnar data format for incredibly fast analytics on big data platforms such as Hadoop and Spark. This new kind of file format solves the problem of serving different kinds of analytical queries over the same data. With Apache CarbonData, the data format is unified, so you can work from a single copy of the data and use only the computing power you need, making your queries run much faster.

3.     Apache Spark is one of the most widely used Apache projects and a popular choice for incredibly fast big data processing (cluster computing), with built-in capabilities for real-time data streaming, SQL, machine learning and graph processing. Spark is optimized to run in memory and enables interactive streaming analytics. Unlike batch-only systems, it lets you analyze vast amounts of historical data alongside live data to make real-time decisions, such as fraud detection, predictive analytics, sentiment analysis and next-best offer.
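As an illustrative sketch of the fraud-detection use case (PySpark and a local JVM are assumed to be available; the threshold, sample rows, and session name are hypothetical), the same rule can score a historical batch or a live stream:

```python
def is_suspicious(amount, threshold=10000.0):
    # Shared rule: the same predicate can score historical batches
    # and a live stream of transactions.
    return amount > threshold

rows = [(1, 12500.0), (2, 40.0), (3, 15000.0)]

try:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fraud-sketch").getOrCreate()
    df = spark.createDataFrame(rows, ["id", "amount"])
    df.filter(df.amount > 10000.0).show()  # in-memory, interactive query
    spark.stop()
except Exception:
    # PySpark (or a local JVM) is unavailable; apply the rule in plain Python.
    print([r[0] for r in rows if is_suspicious(r[1])])
```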

4.     Docker and Kubernetes are container and automated container management technologies that speed deployments of applications. Using technologies like containers makes your architecture extremely flexible and more portable. Your DevOps process will benefit from increased efficiencies in continuous deployment.

5.     TensorFlow is an extremely popular open source library for machine intelligence that enables far more advanced analytics at scale. TensorFlow is designed for large-scale distributed training and inference, but it's also flexible enough to support experimentation with new machine learning models and system-level optimizations. Before TensorFlow, no single library captured the breadth and depth of machine learning with such huge potential. TensorFlow's code is very readable and well documented, and its community is expected to keep growing more vibrant.
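A tiny sketch of what training looks like (assuming TensorFlow 2.x is installed; the toy data, learning rate, and single gradient step are illustrative only). The same code pattern scales to clusters via TensorFlow's distribution strategies:

```python
def mse(preds, targets):
    # Mean squared error -- the loss the gradient step below minimizes.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

try:
    import tensorflow as tf

    # Fit y = w * x with one gradient-descent step.
    w = tf.Variable(0.0)
    xs = tf.constant([1.0, 2.0, 3.0])
    ys = tf.constant([2.0, 4.0, 6.0])
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * xs - ys) ** 2)
    w.assign_sub(0.1 * tape.gradient(loss, w))
    print("w after one step:", float(w))
except ImportError:
    # TensorFlow not installed; report the starting loss instead.
    print("loss at w=0:", mse([0.0] * 3, [2.0, 4.0, 6.0]))
```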

Not all open source projects are created equal, and not just any open source project will propel your company to the head of the pack. Every company must still develop its own strategy and choose the open source projects that best fuel its desired business outcomes. It's important to join the open source communities relevant to your projects and interests, to educate yourself, your team and management about the different benefits. OSS is so valuable in large part because you can leverage the collective minds of the community instead of reinventing the wheel.

At the end of the day, change has always been the only constant in human existence and business. But change in technology is happening faster now than at any other time in history. By staying open-minded, attuned to open source and aware of the many ways to use data and analytics, you'll be well prepared for whatever pops up next on the horizon.


About the Author


Daniel Kulp is an ASF member and committer of Apache CXF, Apache Aries, Apache Maven, Apache WebServices, Apache ServiceMix and Apache Camel.

Daniel attended Northeastern University in Boston where he received degrees in Chemical Engineering and Computer Science. As the VP of Open Source Development for the Application Integration Division at Talend, Dan gets to practice his passion for coding open source at work, and still has time to dedicate to his loving family.
Published Wednesday, June 27, 2018 7:31 AM by David Marshall