The wild west Spark Summit showdown

With the NBA finals, NHL finals, presidential primaries, and the end of the school year, it's shaping up to be a raucous, contentious, high-stakes month of June. Yet the biggest battle may be over the future of Apache Spark, at high noon in the O.K. Corral of Spark Summit at the Hilton Union Square.

First of all, there is a lot to like about Spark as an adaptable, multi-purpose engine for analytics. ESG research identified the main reasons the big data market is shifting so rapidly toward adoption of Spark, with 16% of respondents already in production and an even bigger wave of 47% very interested in moving forward soon. The survey seems to reflect the mood and priorities of the market, and I've had a number of direct conversations reinforcing these themes.


So who's driving Spark and where will it go next? This is a matter of some debate and great importance. A year ago, IBM declared intent to have 3,500 developers for Spark-related offerings. And even in the last week, IBM still claims to have made the largest investment in Spark, much of it centered around their Spark Technology Center. The major Hadoop distributions also lay their claims to Spark.

Cloudera makes the case that Spark is the natural heir to MapReduce and is certainly embracing the technology in its One Platform Initiative, with a lot of resources to help the transition. MapR boasts of having been first to offer full Spark support, for over two years and eight releases now. Hortonworks and HPE have teamed up to do more around specific workloads with Spark. Amazon Web Services showcases the success of its customers over the year since it introduced Spark on EMR. Many, many others are laying out the logic and integration points of their products and services alongside Spark.

Given that Apache Spark is open source and nominally publicly owned technology, one interesting measure of activity is how much a particular group is currently contributing to its development. This Apache site shows a stark picture there: Databricks is dominating with 17 of 43 committers, followed by 6 at UC Berkeley, where Spark was born in Cal's AMPLab.

The remainder are mostly in ones and twos across 16 other groups. So while many vendors may indeed be developing around Spark, they aren't necessarily giving their work back to the community, instead using it for proprietary advantage. There's certainly nothing wrong with this approach, but it does show a tendency toward specialization, customization, and differentiation. Databricks, too, reserves a ton of additional enterprise operational functionality for its own cloud platform.

What all this does show is that the future of Spark is going to be both shiny and messy, as the market grows quickly with many innovations, both public and private. Spark Summit West will be a great opportunity to see the creativity on display. Hope to see you there, and please check out our Spark 360 panel hosted by Ben Lorica of O'Reilly.

Topics: Data Platforms, Analytics, & AI