Extending the Value of Hadoop and Spark with a Cloud-based Managed Service

With data volumes growing and businesses working to become more data-driven, many organizations have turned to Apache Hadoop and Apache Spark as their big data and analytics framework for storing and processing data. While these solutions are quite powerful, managing a constantly evolving infrastructure that must keep meeting the demands of a modern business can quickly become a logistical and administrative nightmare. Given that organizations plan to adopt cloud technology, and that data analytics is one of the top use cases for existing IaaS and PaaS users, the idea of leveraging a Hadoop- and Spark-based managed service in the cloud is quite appealing.

Last year we conducted a rigorous economic validation of Google BigQuery, in which we modeled organizations of different sizes and estimated what their costs would be over a three-year period if they were to use on-premises Hadoop, another cloud service, or BigQuery. We recently completed another economic validation, this time focused on Google Cloud Dataproc.

For the Cloud Dataproc TCO analysis, we compared an on-premises Hadoop and Spark environment against hosting the same infrastructure in Cloud Dataproc. In the process, we gained insights comparing Cloud Dataproc and Amazon EMR from customers who have experience using both cloud environments. The results highlighted a 57% cost savings when leveraging Google Cloud Dataproc compared to an on-premises environment, and a 32% cost savings compared to Amazon EMR. Additionally, we found dramatic benefits in customers' ability to pull strategic information from data stored within Cloud Dataproc (when compared with the other two environments). While cost is always a driving factor, every customer ESG interviewed as part of our research identified business and revenue benefits that far outweighed the cost benefits alone.

For more details on the findings of our Google Cloud Dataproc TCO analysis, check out the full report. Also, if you’re interested in hearing about the results, as well as what organizations should consider when evaluating the long-term economic impact of moving data analytics to the cloud, check out my session from Google Cloud Next 18, where I spoke with Saptarshi Mukherjee, GCP Product Marketing Lead of Data Analytics & IoT. I should note that while the show itself was filled with announcements across the entire GCP organization, new features were introduced for both BigQuery and Cloud Dataproc that can lead to further savings. Additionally, the show floor was packed with strategic partners, sponsors, and some pretty cool technology. One that comes to mind on the subject of cost savings is superQuery. This Chrome add-on takes the SQL queries you’re already writing in BigQuery and optimizes them (it rewrites your SQL so that it yields the same results but processes as little data as possible) to minimize your BigQuery on-demand costs.
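To make the link between data processed and on-demand cost concrete, here is a minimal sketch of the arithmetic. It assumes only that on-demand billing scales linearly with the bytes a query processes. The function names, the rate, and the byte counts are all hypothetical placeholders for illustration, not figures from our analysis or from Google's price list.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_processed: int, price_per_tib: float) -> float:
    """Estimate a query's cost when billing is proportional to bytes processed.

    price_per_tib is an assumption supplied by the caller; check the
    current price list rather than relying on the placeholder used below.
    """
    if bytes_processed < 0 or price_per_tib < 0:
        raise ValueError("bytes_processed and price_per_tib must be non-negative")
    return bytes_processed / TIB * price_per_tib


def savings_fraction(original_bytes: int, optimized_bytes: int) -> float:
    """Fraction of cost saved when a rewritten query processes fewer bytes."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 1 - optimized_bytes / original_bytes


# Hypothetical inputs: a placeholder rate and made-up scan sizes.
HYPOTHETICAL_RATE = 5.0       # currency units per TiB, illustrative only
original_scan = 4 * TIB       # query as first written
optimized_scan = 1 * TIB      # same results, less data processed

before = on_demand_cost(original_scan, HYPOTHETICAL_RATE)
after = on_demand_cost(optimized_scan, HYPOTHETICAL_RATE)
print(f"before: {before:.2f}, after: {after:.2f}")
print(f"saved: {savings_fraction(original_scan, optimized_scan):.0%}")
```

Under that linear-billing assumption, the savings depend only on how much less data the rewritten query processes, so the rate itself drops out of the percentage.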
