Any Talking Heads fans reading this blog? Take any French classes in high school? No? Nevermind then.
I get asked a lot about SQL on Hadoop, and I know what you're thinking: "this guy must have the coolest friends and go to all the best parties." And you're right, I do. Lenny Kravitz by a rooftop pool in Vegas. Fitz and the Tantrums. Duran Duran. The Astoria Middle School Marching Band on Loyalty Day. (10-year-old daughter with a shiny new flute...)
What were we talking about? Oh right — how do you find the important and relevant information you need in a very random data lake? You use SQL on Hadoop. It's way easier for the average user than trying to code up the same functionality in MapReduce, at least if you are a DBA, analyst, or BI developer already familiar with SQL. Which, according to our latest study, more than half of you are. If we take the respondents (shown below) as representative of those engaged in big data and analytics initiatives, then you're probably going down this road now.
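To make the "way easier" claim concrete, here's a hedged sketch. The table and column names (`clickstream`, `user_id`) are made up for illustration, and the MapReduce side is simulated in pure Python rather than as a real Hadoop job, but it shows why an analyst who knows SQL would rather write one GROUP BY than three functions:

```python
from collections import defaultdict

# In a SQL-on-Hadoop engine (HiveQL shown; table name is hypothetical),
# counting events per user is a single declarative statement:
hive_query = """
SELECT user_id, COUNT(*) AS events
FROM clickstream
GROUP BY user_id
"""

# The equivalent hand-rolled MapReduce logic. A real Hadoop job would
# implement Mapper and Reducer classes; this just mimics the three stages.

def map_phase(records):
    # Mapper: emit a (key, 1) pair for every input record.
    for rec in records:
        yield rec["user_id"], 1

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the values for each key.
    return {key: sum(values) for key, values in grouped.items()}

clickstream = [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]
counts = reduce_phase(shuffle(map_phase(clickstream)))
print(counts)  # {'a': 2, 'b': 1}
```

Same answer either way; the difference is that the SQL version is one thought, while the MapReduce version is a small program you have to design, debug, and maintain.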
Then your next question is "what's the best way to do this?" No shortage of options out there today. Seems like every vendor looking to play in the big data market is pushing their own flavor of SQL on Hadoop. All open source, but each closely tied to a sponsoring vendor. With Apache Hive as a base, there's Cloudera's Impala, MapR-backed Drill, Hortonworks's Stinger, Pivotal's HAWQ, etc... lots of native, open-source options on the market. Each has its pros and cons, but the decision here is usually dictated by your overall preference for Hadoop distribution.
There are also extensions from traditional database vendors looking to connect to Hadoop environments, such as Microsoft's PolyBase, Oracle's Big Data SQL, Teradata's commercially supported Presto, SAP HANA Vora, etc. And again, the decision may be largely predetermined for you by your database of choice.
All the same, then? Not exactly. We asked the audiences above what they looked for in selecting the right solution for their needs. Speed, scale, reliability, security, flexibility, standards... hmm, this list is starting to sound a lot like the traditional IT operational requirements for BI and analytics tools. What I recommend is that you rank your own needs and then use that ranking as your "buyer's guide" to choose the best fit for your particular environment. I don't have a star endorsement here. There isn't a single answer.
As a footnote, I will say that some startups are simplifying this by offering more direct access for BI on Hadoop, like Arcadia Data (all-in-one) and AtScale (via your preferred BI tool). It's often wiser to start with the user needs and work backwards, rather than let your data platform push you to a default.
"Réalisant mon espoir, je me lance vers la gloire, OK." ("Realizing my hope, I launch myself toward glory, OK.")