Just as big data emerged to disrupt traditional databases and data warehouses, machine learning will be the next big wave of advancement in data management. Why, you ask? There is a simple, one-word answer: economics. They say any innovation has to be 10x better, faster, or cheaper to overcome the inertia of a traditional approach in IT. Apache Hadoop made it at least 10x less expensive to house data by distributing it across commodity hardware using open source software. Of course, there were (and still are) some rough edges and hidden costs, but this was compelling enough to gain significant market traction against legacy hardware and software. Interestingly, the core utility of Hadoop distributions has morphed toward serving mostly as a storage layer, with an ecosystem of other tools building analytics value on top of it.
That's because the next challenge has been in processing that data for understanding. MapReduce turns out to be hard for many to leverage when building analytics applications: it is both unfamiliar as a programming model and limited in its versatility. Hence, Apache Spark is rapidly becoming the multi-purpose analytics engine for Hadoop storage, offering efficient SQL queries, "real-time" streaming, and graph capabilities. And developers can use their language of choice for much of this functionality, be it Python, Java, or Scala. Most of the Hadoop distribution vendors and big data cloud service providers have therefore repositioned to include Spark as a central part of their offerings, with varying degrees of success. Companies like Cloudera, IBM, MapR, and Hortonworks are in the first category: Hadoop software providers refocusing to include Spark. Meanwhile, AWS, Microsoft, Google, IBM, and Oracle all offer their own multi-purpose big data clouds. Databricks is notable for its laser focus on delivering Spark as a service on AWS, Qubole offers a multi-cloud, multi-distribution platform, and Hortonworks recently announced its own big data cloud platform on AWS to match HDInsight on Azure. Independent of how they got there, all of these vendors have delivered better economics than the traditional on-premises, proprietary approaches.
Now, the next problem: as data has become so inexpensive to capture and house, extracting meaning from it at scale has become the harder task. The good news is that machine learning actually works better the more data you throw at it. The math behind machine learning isn't new; it's the economics that have changed. This is spawning a new round of startups and reinvention to deliver machine learning capabilities. Machine learning libraries are now available in all Hadoop and Spark distributions and services, but it's critical to avoid repeating the problems of MapReduce, so machine learning must be made easy (or easy enough) to use. There already aren't enough data scientists, and the problem is magnified if you want to hire the subset of them who can do machine learning well.
So we'll see a variety of approaches to simplifying things. Some vendors will hide the machine learning behind their big data platforms (as IBM does with DataFirst, Data Science Experience, Watson Analytics, etc.), some will build common use cases into ready-to-call APIs (as Google does with Vision, Natural Language, Translation, etc.), and some will merely bundle the tools and let you do your own thing. There will also be many specialists like H2O.ai and DataRobot, though I expect a majority of these will be acquired quickly by bigger players seeking technology and talent. What is quite clear is that this will be a major focus for many companies in the big data space. Get the popcorn.