As a platform, Hadoop is a natural fit for big data ETL, but it lacks modern tools and certain optimizations. Syncsort steps into the fray with a GUI-based solution to speed along big data ETL on Hadoop.
Published: May 22, 2013
Sometimes vendors build technology and then go looking for a problem for it to solve. Sometimes vendors build technology that directly addresses a well-defined and prevalent problem. Syncsort’s new DMX-h ETL Edition and DMX-h Sort Edition definitely fall into the latter category. While the “Sort Edition” is the productization of Syncsort’s previously announced sort contribution to the Hadoop community, with impressive performance results, I am quite taken with the “ETL Edition.” Here’s why:
Based on the evidence I have gathered talking with customers and in-the-weeds big data consultants, claims that Hadoop, and some non-Hadoop big data solutions, eliminate the need for ETL are patently false. The only successful example I have seen of achieving big data analytics with poorly understood and/or dirty data took place with an advanced graph analytics solution and, truth be told, the initial graph iterations focused on refining, combining, and understanding the underlying data: an ETL-ish exercise. It appears, and this should come as no surprise, that the most common choke point in big data projects involves data prep and data understanding.
Nothing solves data prep and understanding challenges like ETL. ETL forces the data analyst to dig into the details of all the raw data and conceptualize what a perfect data set for analytics would look like, an exercise that also helps the data analyst determine the analytical possibilities. ETL then offers the business analyst or user what look like near-perfect data sources. Let’s be honest: what good are all the cool visualization tools to the basic user if their data sources aren’t spot on, or very nearly so?
It should therefore come as no surprise that ETL has so far proven to be one of the most popular applications of Hadoop and, if anything, ESG sees Hadoop-based ETL continuing to grow its fan base. Well-established ETL solutions do well with structured data from a few sources, but Hadoop’s ability to recombine large numbers of data sources of varying data structures relatively quickly makes it a natural for, dare I use the term, big data ETL. The difficulty with Hadoop, as always, is the development tools or, more specifically, the lack of them. Whereas established ETL solutions offer graphical, drag-and-drop tools, with Hadoop you must code.
Syncsort DMX-h ETL Edition will help Hadoopists take a big data step forward in terms of ETL ease of development and performance. First, on the client side, “ETL Edition” offers a graphical development environment with in-process test and debug capabilities. The GUI environment includes what Syncsort calls “use-case accelerators,” which are pre-built templates for common ETL/MapReduce use cases such as mainframe access, complex joins, and web log aggregation. Second, the product decreases Hadoop run times for ETL by adding “per node scalability” that complements Hadoop’s horizontal scalability. Finally, it includes best-in-class mainframe connectivity, along with connectors for the full breadth of other data sources, from flat files to XML to databases, including Hive, as well as popular MPP analytical warehouses such as Greenplum, Vertica, and Teradata.
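To give a sense of the hand-coding that such templates are meant to replace, here is a minimal Python sketch of web log aggregation written in the map/reduce style. It is plain Python that runs in a single process, not the Hadoop API and not anything from Syncsort's product. The function names (`map_log_line`, `reduce_counts`, `aggregate`), the sample data, and the assumed log layout (the request appears as the first double-quoted field, as in common web server access logs) are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_log_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical helper): emit (requested_path, 1) for one log line."""
    # Assumption: the request is the first double-quoted field,
    # e.g. "GET /index.html HTTP/1.1".
    parts = line.split('"')
    if len(parts) < 3:
        return  # skip lines without a quoted request field
    request = parts[1].split()
    if len(request) >= 2:
        yield request[1], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step (hypothetical helper): sum the counts emitted for each path."""
    totals: Dict[str, int] = defaultdict(int)
    for path, count in pairs:
        totals[path] += count
    return dict(totals)


def aggregate(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map step over every line, then reduce the emitted pairs."""
    mapped = (pair for line in lines for pair in map_log_line(line))
    return reduce_counts(mapped)


# Made-up sample data, for illustration only.
SAMPLE_LINES: List[str] = [
    '10.0.0.1 - - [22/May/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [22/May/2013:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.3 - - [22/May/2013:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
    'a malformed line with no request field',
]

if __name__ == "__main__":
    for path, hits in sorted(aggregate(SAMPLE_LINES).items()):
        print(path, hits)
```

Even this toy version has to deal with parsing and malformed records by hand. On a real cluster, the same logic would also need job configuration, input and output formats, and testing against the framework, which is the kind of effort a graphical template aims to remove.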
Data integration continues to enjoy a renaissance, owing not only to all kinds of fresh data sources but also to new theaters of deployment, like clouds and Hadoop. As clouds and Hadoop continue to increase their reach, the integration vendors that best identify and solve the most ubiquitous challenges in those deployment domains should reap the rewards of being customer sensitive. Both editions of Syncsort DMX-h, Sort and ETL, meet that criterion of pinpointing a technical solution to well-known problems, problems that more and more enterprises will need help solving as they adopt or further adopt Hadoop.
*All views and opinions expressed in ESG blog posts are intended to be those of the post's author and do not necessarily reflect the views of Enterprise Strategy Group, Inc., or its clients. ESG bloggers do not and will not engage in any form of paid-for blogging.