Over the last couple of days, I've enjoyed meeting with business and technology leaders around Apache Spark. Hosted in Boston this time, the Spark Summits in general have become great forums to learn about what's new, what's working, and what's coming up next. Here are my quick takes if you weren't able to attend (and missed the thunder snowstorm, WHAT?!?). Highlights included:
- Matei Zaharia of Databricks detailed improvements to Apache Spark in three words: performance, democratization, and production. Project Tungsten is tackling the bottlenecks of CPU and memory, which remain stubbornly expensive, at least relative to the gains in storage and networking. Python and R are becoming the languages of choice for use with Spark. Not least, Spark is becoming the universal engine to consolidate and simplify a range of use cases for continuous, streaming applications.
- Ion Stoica shared how RISELab is taking over where AMPLab left off at UC Berkeley, with a consortium of heavies investing in next-generation innovation around "real-time, intelligent, and secure systems." There was an emphasis on immediate decisions and actions, which I imagine means both more innovation around Spark and opportunities to spin out the next bigger things in big data. If you haven't worked out the differences between Spark Streaming and Flink, you'd better hurry, because Drizzle was also introduced and shown to beat both in terms of sheer speed.
- On day two, Arsalan Tavakoli-Shiraji of Databricks shared the secrets to success with Spark. After observing the relative progress of dozens (maybe hundreds) of companies, he found that the real overlooked magic seemed to be about empowering people. Too many have focused on the data infrastructure itself, rather than on how it feeds applications that knowledge workers can really leverage. I didn't quite agree that virtual applications should be called the third generation, as that felt dismissive of data warehouses and data lakes, but chronologically it's accurate and there are clearly a lot of new possibilities.
- Perhaps my favorite speaker was Ziya Ma of Intel, who shared her views on the emergence of machine learning and AI, driven by improved ease of use, better efficiency, and lower cost. ESG is actively researching the adoption trends for enterprises (let me know if you want to get in on that action). This space will continue to accelerate in both power and friendliness. The Intel BigDL platform seems like a sound effort, even as there are a lot of alternatives now fighting for mindshare among data scientists. Tools like Caffe, Torch, TensorFlow, OpenAI, Theano, MXNet, and others will all be striving to attract adherents.
- Not least, Arun Murthy of Hortonworks did a fine job of articulating the importance of security, privacy, governance, and even ethics in how big data is handled. I can't emphasize enough how much this topic needs to be addressed up front, not as a bolt-on patch later.