The Lost and Found of Big Data Innovation
I went to the Big Data Innovation conference a couple of weeks ago looking for answers to four questions I had posted in a previous blog. Did I find them? Some answers were found, some remain lost. Here is a synopsis:
- In terms of "How much Hadoop?" the answer was: Less than expected. The focus at this event was analytics, not infrastructure, and at the end of the day businesses and their proxies in Big Data, i.e. data scientists and analysts, have little to no allegiance to Hadoop. They want results and insights, and whatever solution helps them reach those insights wins, whether it is Hadoop-based, Hadoop-connected, or neither.
- ROI Metrics remain lost, not found. Despite one panel on the subject, little was discussed about ROI by practitioners. Big Data remains mainly a faith-based investment, but I remain convinced that formally addressing ROI of Big Data will gain popularity, just not this month.
- Despite the lack of metrics (wouldn't you want to use analytics to prove the ROI of analytics?), the appreciation that Big Data must reflect a balance between tech, science and business was clearly found. In fact, nearly all the many data scientists I saw present emphasized business impact above all else.
- The subject of appliances, Cloud, and NoSQL in the context of Big Data fell mainly in the lost category. Again, this summit had more to do with results and analytics, less with infrastructure. Actian and Lenovo, however, co-unveiled the Vectorwise Data Mart Appliance. Actian, renamed from Ingres (yes, that Ingres!) last year, acquired in 2010 the technology spun out of MonetDB, originally called the X100 project and now called Vectorwise. Vectorwise, which relies on vector processing rather than scalar processing, is known as a blazing fast analytics DB engine that supports SQL.
Ah, but there were three answers to questions I did not think to pose before going to the Big Data Innovation summit. Finding what you didn't know you lost is the beauty of Big Data after all, isn't it? Here are my three "Aha" moments:
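For readers who have not run into the vector versus scalar distinction, here is a minimal sketch of the general idea in Python. It is not Vectorwise's implementation: the function names, the price/quantity columns, and the batch size are all assumptions made up for illustration. The scalar version makes one decision per row, while the vector version runs each operator over a whole batch of column values at a time.

```python
from typing import Iterator, List


def scalar_sum_filtered(prices: List[float], qtys: List[int], min_qty: int) -> float:
    """Tuple-at-a-time: filter, multiply, and add one row per loop step."""
    total = 0.0
    for price, qty in zip(prices, qtys):
        if qty >= min_qty:
            total += price * qty
    return total


def batches(n: int, size: int) -> Iterator[range]:
    """Yield consecutive ranges of row positions, at most `size` rows each."""
    for start in range(0, n, size):
        yield range(start, min(start + size, n))


def vector_sum_filtered(prices: List[float], qtys: List[int], min_qty: int,
                        batch_size: int = 1024) -> float:
    """Vector-at-a-time: each operator consumes a whole batch per call."""
    total = 0.0
    for rows in batches(len(prices), batch_size):
        # Operator 1: selection produces the qualifying row positions.
        selected = [i for i in rows if qtys[i] >= min_qty]
        # Operator 2: multiplication over the selected positions only.
        products = [prices[i] * qtys[i] for i in selected]
        # Operator 3: aggregation over the batch.
        total += sum(products)
    return total


# Made-up sample columns, purely for illustration.
prices = [10.0, 20.5, 3.0, 8.0]
qtys = [1, 4, 0, 2]
print(scalar_sum_filtered(prices, qtys, 2))                 # 98.0
print(vector_sum_filtered(prices, qtys, 2, batch_size=2))   # 98.0
```

Both functions return the same answer. In pure Python the batch version will not be faster; the sketch only shows the shape of the execution model. The usual argument is that a compiled engine working on batches of column values can make better use of CPU caches and SIMD instructions than one that interprets a row at a time.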
- Analytics Service Bus: NYSE Euronext offered one of the most compelling solutions I have witnessed to date for Big Data. First consider their situation: They run several of the world's most important securities exchanges and generate about 5TB of fresh data daily, with transaction/message counts in the billions. Daily peaks may double daily norms, and they operate in a zero-downtime, highly regulated environment. Their data scientists are located globally, spanning time zones and languages. They already use several analytics solutions (Netezza and Greenplum, for example) for certain businesses, and have a highly heterogeneous technical environment, partially due to M&A, partially due to choice.
Despite all the data, point analytics solutions and infrastructures, they now offer their data scientists and other analytics users a single, easy-to-use resource that abstracts away all the tricky details of their complex environment, yet enhances the data available for analytics. The result is their metadata-driven NYSE Euronext Big Data "On Demand" resource, which in effect is a Big Data middleware platform. I will refer to it as an ASB - Analytics Service Bus.
Behind the scenes, it includes data migration, information management, workload management, security, and transformation built-ins for Big Data purposes, yielding a "flat file farm" primed for analytics. It also includes sophisticated filtering tools that reduce the petabyte-scale flat file source into tidier result set nibbles for their data analysts. The solution services over 1,000 requests per month for historical loads and aims to use their MPP investments to the fullest extent possible in order to maximize ROI. At the same time it remains vendor agnostic and thus open to new MPP environments, which includes experimenting with Hadoop/HDFS. Finally, they plan to go to market with this solution set.
We will see what the commercial offering looks like, and how many enterprises could actually take advantage of such a sophisticated environment. But it is the closest thing I have seen to a private cloud platform for delivering "Big Data as a Service," and acts, in essence, as a service bus for analytics. Informatica, IBM, Oracle, and SAP should take note.
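To make the "metadata-driven" filtering idea a little more concrete, here is a toy sketch in Python. Nothing in it comes from NYSE Euronext: the catalog fields, file names, symbols, and values are all hypothetical, and an in-memory dictionary stands in for the flat file farm. The point is only the two-step shape: prune whole files using catalog metadata first, then filter rows inside the files that survive.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class FileEntry:
    """Catalog metadata for one flat file (all fields are hypothetical)."""
    name: str
    market: str
    trade_date: date


# In-memory stand-in for the flat file farm; contents are made-up samples.
FILES: Dict[str, str] = {
    "eq_2012-09-10.csv": "symbol,price,qty\nAAA,10.0,100\nBBB,20.0,500\n",
    "eq_2012-09-11.csv": "symbol,price,qty\nAAA,11.0,300\nBBB,21.0,50\n",
    "fx_2012-09-11.csv": "symbol,price,qty\nCCC,1.5,1000\n",
}

CATALOG: List[FileEntry] = [
    FileEntry("eq_2012-09-10.csv", "equities", date(2012, 9, 10)),
    FileEntry("eq_2012-09-11.csv", "equities", date(2012, 9, 11)),
    FileEntry("fx_2012-09-11.csv", "fx", date(2012, 9, 11)),
]


def request(market: str, start: date, end: date, symbol: str) -> Iterator[dict]:
    """Serve one analyst request as a small result set."""
    for entry in CATALOG:
        # Step 1: prune on metadata alone, without opening the file.
        if entry.market != market or not (start <= entry.trade_date <= end):
            continue
        # Step 2: filter rows inside the files that survived pruning.
        for row in csv.DictReader(io.StringIO(FILES[entry.name])):
            if row["symbol"] == symbol:
                yield {"date": entry.trade_date.isoformat(), **row}


for row in request("equities", date(2012, 9, 10), date(2012, 9, 11), "AAA"):
    print(row)
```

The analyst asks for a market, a date range, and a symbol, and never needs to know which files or systems hold the data. A real platform would of course add the security, workload management, and transformation layers described above, none of which this sketch attempts.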
- LinkedIn is a Data Company, not a Social Media Company; Maybe Your Company Could be a Data Company: While you might guess LinkedIn has enough data to chew on through its roughly 175m professional profiles, in fact they have all kinds of other data sources such as job posts, queries, searches, group activities, and third party sources. Like NYSE Euronext, LinkedIn wields a long line of analytics solutions and tools, such as Teradata Aster, Oracle, Hadoop, R, Tableau, and others.
But what struck me is that, just as NYSE Euronext viewed Big Data provisioning as a holistic, cross-enterprise challenge, LinkedIn views Big Data as a holistic set of value-based deliverables for all aspects of the business, and for LinkedIn users as well. The goal of the LinkedIn data team is to ensure that LinkedIn is data driven at all levels, from executive to marketing to product to design to business ops to engineering to users like us. Their commitment to "democratize the data" underscores the idea that we should not confine Big Data to a discrete project, but rather treat it as an opportunity to make the entire organization, and our related external value chains, smarter. Check out data.linkedin.com for a fun peek.
- It's the Index, Stupid! Those who have paid attention to Big Data history know that before Hadoop was Hadoop it was a search project, born out of the brains that brought us Google and Yahoo search. The notion of search in the context of Big Data analytics remains healthy, albeit somewhat in the background relative to the rest of the Hadoop hype. For example, LucidWorks, which offers cloud-based commercial wrapping around the Apache Lucene/Solr project, treats search as an essential ingredient for Big Data. Specifically, the LucidWorks Big Data platform augments the Big Data analytics/Hadoop paradigm with search, so that analytics users can not just use search as a UI for analytics, but also index all content and dig through it for answers.
What I learned at the summit about the power of indexing and search in the context of analytics helped me appreciate Chiliad, which offers a different, more networked approach to empowering search for analytics: They deploy intelligent peer nodes for indexing and search at the data sources. The nodes then communicate and cooperate, resulting in a highly distributed search-based analytics solution that moves little data and cuts storage and implementation costs considerably. Chiliad's technique has served the government intelligence and healthcare sectors well for several years. The idea here is, why move distributed data for search/analytics purposes if you don't have to? Shun the "shuns" - migration, replication, integration, transformation - unless you really must.
Finally, a Big Data Thanks! to The Innovation Enterprise Ltd., which put on such a concentrated, multi-faceted event in Boston. It was fun and informative, just like Big Data.
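The "index where the data lives" idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not Chiliad's or LucidWorks' design: the PeerNode class, the federated_search function, the node names, and the sample documents are all hypothetical. Each node builds a small inverted index over its own documents, and a query travels to the nodes while only matching document ids travel back.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class PeerNode:
    """A data source that indexes and searches its own documents locally."""

    def __init__(self, name: str, documents: Dict[str, str]) -> None:
        self.name = name
        # Inverted index: term -> ids of local documents containing it.
        self._index: Dict[str, Set[str]] = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                self._index[term].add(doc_id)

    def search(self, query: str) -> List[Tuple[str, str]]:
        """Return (node, doc_id) pairs for documents matching every term."""
        terms = query.lower().split()
        if not terms:
            return []
        # .get avoids adding empty entries to the index for unknown terms.
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return [(self.name, doc_id) for doc_id in sorted(hits)]


def federated_search(nodes: List[PeerNode], query: str) -> List[Tuple[str, str]]:
    """Send the query to every node and merge the small result lists."""
    results: List[Tuple[str, str]] = []
    for node in nodes:
        results.extend(node.search(query))
    return results


# Made-up sample sources, purely for illustration.
nodes = [
    PeerNode("site_a", {"r1": "patient cardiac report", "r2": "billing summary"}),
    PeerNode("site_b", {"r9": "cardiac surgery report"}),
]
print(federated_search(nodes, "cardiac report"))
```

The documents themselves never leave their nodes; only the query and the matching ids cross the network. That is the appeal of shunning the "shuns": no migration, replication, or integration step is needed before the first question can be asked.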