Many people consider "Hadoop" and "big data" to be synonyms, and certainly there is significant overlap between them. Yet in my mind, Hadoop is an ecosystem of data management tools, and big data is an approach to understanding complex situations. So the one can be used for the other, but it surely isn't the only way to go.
Similarly, many folks consider Hadoop to be basically HDFS for storage and YARN/MapReduce for jobs, but these aren't necessarily the only options either. For example, MapR replaces HDFS with its own file system to achieve different capabilities like replication. On top of this, there is a plethora of extensions, available as open source or from specific distribution vendors. And now a lot of people are exploring Spark as a new analytics engine.
Spark has some notable advantages over MapReduce:
- More familiar for developers who already know Java, Scala, or Python
- Flexibility to handle SQL queries, streaming, machine learning, and graph analytics
- Improved performance for analytics on disk or even in-memory
- Adaptability to more data sources, including NoSQL databases and the Hive data warehouse
Whether or not all of these options are required, Spark is gaining momentum. Tellingly, during the Hadoop Summit last week, one attendee asked whether Spark might not be a better place to begin.
Databricks, started by the same people who invented Spark at UC Berkeley's AMPLab, must be rather proud. The company has built up "Spark-as-a-service" in the public clouds, removing the infrastructure bottleneck to make the technology more readily accessible to anyone interested. Its free trial is an easy entry point, and has reportedly already had 3,500 punters.
Now IBM is going deep into Spark, with investments in products, expertise, alliances (including Databricks), cloud offerings, and of course, open source contributions. This is significant, as it amounts to a corporate-level answer to the question above, and a de facto endorsement: "you should start with Spark" (and with IBM, no doubt). This is going to spur more development, more capabilities, and more maturity for the project, leading to more market adoption and growth. If the big boys are getting in the game, you can bet the big data market is beginning to shift.