Most organizations have an existing BI/analytics strategy. Many of these strategies consist of a mix of overlapping BI/analytics tools with a focus on batch processing—analyzing piles of data that have already been collected, stored, transformed, merged, and cleansed. Now that the Internet of Things is more than a buzzword, early adopters are starting to re-evaluate their BI/data analytics approaches. Their existing solutions are already overwhelmed with massive quantities of data, and IoT will only add to the complexity. Whether it’s web applications tracking user clicks, sensors collecting weather data, or simply machine log data from within a single IT infrastructure, massive amounts of data are being generated every second, and organizations are looking for any way possible to harness that data as soon as it’s collected. As such, ESG research shows that 45% of organizations are planning to deploy a new BI/analytics solution over the next 24 months. And what is the top requirement driving the evaluation process? The move to a more real-time analytics approach (Figure 1). The same research points to two further drivers: current BI/analytics solutions do not meet existing requirements and needs (26%), and new applications are generating new data types that need different analytical tools (27%).
As organizations look to gain real-time insight from these massively growing data sets, they run into additional complexities. Geographic distribution of organizations and devices has forced organizations to either analyze a subset of data from a specific location or delay analysis until the data has been merged to a central location. To avoid incomplete analysis or short-sighted insights, many organizations are forced to wait for all the data before making business decisions, imposing data pipeline processing delays that not only hurt the current state of the business, but can also squander what might otherwise be a timely competitive advantage.
With MapR already having a global presence in the BI/data analytics/big data market, it is no surprise that capitalizing on the shift to real-time is a priority. With an existing platform built for efficiency, speed, and reliability, the natural progression for MapR was to add a service to handle the real-time needs of organizations. MapR Streams has been integrated into MapR’s existing technology to provide a publish-subscribe, real-time service, turning MapR’s data platform into a fully converged solution to handle batch and real-time processing/analytics. ESG Lab validated the latest MapR Streams technology with a focus on performance and reliability.
The Solution: MapR Converged Data Platform with MapR Streams
The MapR Converged Data Platform is a collection of open-source engines, tools, and applications that leverage the purpose-built MapR data platform to deliver a scalable, reliable, and secure infrastructure for global data-driven applications. A number of open-source engines, such as Spark and Storm, leverage the data made available by these services to handle its processing. The distributed MapR architecture offers high levels of availability and reliability with no single point of failure, while working in conjunction with the comprehensive projects in the Hadoop ecosystem to deliver enterprise-grade features and functions on low-cost commodity hardware.
Three core services work together to enable MapR to truly be a converged platform that supports all workloads on a single cluster. MapR-FS serves as the underlying file system of the Converged Data Platform, providing distributed, enterprise-grade storage that is high performing, scalable, and reliable. MapR-DB delivers a flexible, scalable NoSQL database for real-time and operational analytics in Hadoop. Lastly, MapR Streams is a publish-subscribe event streaming system that enables real-time processing of globally produced data in a reliable, scalable fashion.
MapR Streams Architecture
MapR Streams is a publish-subscribe system that takes data from producers and delivers it to consumers. Producers and consumers are decoupled across topics, which are managed and stored in a MapR cluster. Producers publish their data to a relevant topic, and consumers subscribe to that topic. A topic is simply a collection of messages organized around an event and managed by MapR Streams. Using weather as an example (see Figure 2), the producers would be various sensors spread across the globe, the topics would be information like temperature, wind, and pressure, and the consumers would be the applications used to process, analyze, and share the data.
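The decoupling described above can be sketched with a toy in-memory model. All names here are illustrative, not MapR APIs; real applications talk to MapR Streams through Kafka-compatible client libraries:

```python
class Broker:
    """Minimal in-memory stand-in for a cluster that manages topics.
    (Hypothetical sketch; not the MapR Streams API.)"""

    def __init__(self):
        self.topics = {}  # topic name -> append-only list of messages

    def publish(self, topic, message):
        """Producers append messages to a named topic."""
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, offset=0):
        """Consumers read from a topic at their own offset, fully
        decoupled from whoever produced the messages."""
        return self.topics.get(topic, [])[offset:]

# Weather example: sensors are producers, analytics apps are consumers.
broker = Broker()
broker.publish("temperature", {"sensor": "paris-01", "celsius": 18.2})
broker.publish("temperature", {"sensor": "tokyo-07", "celsius": 22.5})
readings = broker.consume("temperature")
```

Because consumers track their own offsets, any number of applications can read the same temperature topic independently without coordinating with the sensors that produced it.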
Individual topics can be partitioned across servers in a MapR cluster to not only address throughput and scalability performance requirements, but also alleviate resource contention. The partitions handle a group of sequentially ordered, appended messages delivered in parallel. Continuing with the weather example, the topic for temperature could have two partitions, and each partition can have a set number of nodes assigned to it. MapR Streams leverages these partitions to handle load balancing from the producers and enables consumers to read from the topics in parallel.
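The round-robin spreading of messages across a topic's partitions can be illustrated with a short sketch. The partition count and message names are made up; the point is that messages are spread for parallelism while each partition preserves append order:

```python
import itertools

def assign_round_robin(messages, num_partitions):
    """Spread messages across partitions in round-robin order; within a
    partition, messages keep the order in which they were appended."""
    partitions = [[] for _ in range(num_partitions)]
    cycle = itertools.cycle(range(num_partitions))
    for msg in messages:
        partitions[next(cycle)].append(msg)
    return partitions

# A temperature topic split into two partitions, as in the example above.
parts = assign_round_robin(["m0", "m1", "m2", "m3", "m4"], 2)
# parts[0] == ["m0", "m2", "m4"]; parts[1] == ["m1", "m3"]
```

Two consumers could now read parts[0] and parts[1] in parallel, which is how partitioning raises aggregate read throughput.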
In many cases, topics work hand in hand and can therefore be managed together as a stream, which can be asynchronously replicated between globally dispersed MapR clusters, with producers and consumers existing anywhere. MapR Streams replication can provide access to the real-time data of all subscribed topics across all MapR clusters around the globe. When leveraging MapR Streams replication, topic and stream data is backed up for easy failover of producers and consumers in the case of a failure, greatly reducing the risk of data loss.
With no open-source tools available to test performance, MapR created a multi-threaded shell application called Rubix to simulate event-streaming workloads. Currently, this customized tool works with Kafka and MapR Streams, and MapR is looking to release a more generalized version of the tool to help with testing and validating the performance of messaging frameworks.
Within the tool, each thread behaves as an application writing out messages, leveraging client APIs to push the messages into a MapR cluster. Testing can be done by leveraging a Rubix writer and/or a Rubix listener. The writer represents the producer and the listener serves as the consumer. There are four test scenarios simulated in Rubix:
- Producer Test – The cluster is flooded with messages and message acknowledgement consistency is monitored.
- Consumer Test – Already written messages are simply read off the cluster.
- Tango Test – A set of producers and consumers is run simultaneously with the goal of measuring the time it takes a message to go from production to consumption.
- Slacker Test – A set of producers starts generating messages. After a predetermined amount of time, a set of consumers comes online and the time it takes for the consumers to catch up to the producers is measured.
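The slacker test's catch-up time follows simple arithmetic: a consumer that reads faster than producers write closes the backlog at the rate of the difference between the two. A sketch with made-up rates (the figures below are illustrative, not ESG Lab measurements):

```python
def catch_up_time(backlog_msgs, produce_rate, consume_rate):
    """Slacker-test arithmetic: a consumer starting behind a backlog
    catches up only if it reads faster than producers write;
    time = backlog / (consume_rate - produce_rate)."""
    if consume_rate <= produce_rate:
        return float("inf")  # the consumer never catches up
    return backlog_msgs / (consume_rate - produce_rate)

# Hypothetical numbers: 1M-message backlog, producers writing 5M msg/s,
# consumers reading 7M msg/s -> catch-up in half a second.
t = catch_up_time(1_000_000, 5_000_000, 7_000_000)
```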
Throughput and message delivery time are the key performance metrics measured throughout testing. It should be noted that the message delivery time metric is based on timestamps: every Xth message generated by the producer is given a timestamp. The consumer decodes the timestamp, compares it with the current time, and determines the time it takes a message to traverse the entire pipeline. Customized Rubix parameters are leveraged to simulate applications that emit messages in bursts, specify partitions, adjust timestamp frequency, add checksums for data verification, etc.
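The timestamp mechanism can be sketched as follows. The stamping interval and field names are assumptions for illustration, not Rubix's actual message format:

```python
import time

TIMESTAMP_EVERY = 1000  # stamp every 1000th message (the "X" above; value is illustrative)

def produce(seq_no, payload):
    """Producer side: attach a wall-clock timestamp to every Nth message."""
    stamp = time.time() if seq_no % TIMESTAMP_EVERY == 0 else None
    return {"seq": seq_no, "ts": stamp, "payload": payload}

def measure_delivery(message):
    """Consumer side: decode the embedded timestamp and compare it with
    the current time to get pipeline traversal time in seconds."""
    if message["ts"] is None:
        return None  # unstamped message; not used for the latency metric
    return time.time() - message["ts"]

stamped = produce(0, b"reading")
latency = measure_delivery(stamped)
```

Sampling only every Xth message keeps the timestamp overhead from distorting the throughput being measured.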
The test bed leveraged for performance testing is shown in Figure 3. The key area of focus was the MapR Streams cluster. The cluster consisted of five server nodes running MapR 5.1 and CentOS 6.2 with jumbo frames enabled over a 10Gbps network. Each of the server nodes contained 32 cores, 128 GB of RAM, and twelve 1TB HDDs. Each test used a total of 10 clients running at one time, with all clients either consuming or producing, or in the case of the Tango and Slacker tests, both simultaneously. A control node was used to manage the producers, consumers, and server nodes from a Rubix standpoint. The Rubix workload simulated a message size of 200 bytes across ten topics with 30 partitions for each topic. Messages were issued one after the other to the server nodes in a round-robin fashion across the 300 total partitions from ten simulated clients for a total of 180 seconds. All tests used full synchronous replication, and compression was enabled (by default) on the platform.
ESG Lab audited results of running each Rubix test scenario on the five-node cluster. The goal of MapR Stream performance testing was to simulate a number of messages coming into and out of a MapR cluster and measure overall throughput performance. In other words, ESG Lab was interested in the total number of messages that the MapR cluster could consistently handle.
All scenarios were tested with replication factors of both 1 and 3. With a replication factor of one, a message is acknowledged as soon as it is written to a single node, whereas a replication factor of three does not acknowledge receipt of a message until replication is completed across three nodes, obviously impacting throughput and the total number of received messages per second. The only scenario where performance improved with a replication factor of 3 was the consumer test case, because messages can be read in parallel across nodes. Figure 4 highlights the results of the producer test, which are shown with two metrics: throughput and ingested messages/sec. It should be noted that results can also be viewed in the Rubix interface. Also included are the results of the consumer, tango, and slacker tests, which not only showed Rubix’s versatility in simulating other workloads, but also the sustainable levels of performance delivered by MapR Streams across the entire data flow.
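The acknowledgment semantics can be modeled in a few lines. This is a deliberate simplification of synchronous replication, not MapR's actual implementation; the names are invented for illustration:

```python
def publish_with_replication(nodes, message, replication_factor):
    """Write the message to `replication_factor` nodes before acknowledging.
    With RF=1 only the receiving node stores it; with RF=3 the ack is
    withheld until three nodes hold a copy, adding latency per message."""
    targets = nodes[:replication_factor]
    for node in targets:
        node.append(message)  # synchronous copy to each replica
    return {"acked": True, "replicas": len(targets)}

cluster = [[], [], [], [], []]  # five empty nodes, as in the test bed
ack = publish_with_replication(cluster, "m0", replication_factor=3)
```

The extra copies also explain why consumer-side performance improved at RF=3: with three nodes holding each message, reads can be spread across replicas in parallel.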
Why This Matters
Gaining a competitive advantage has never been more important, and organizations are turning to real-time analytics for help. Combined with batch processing, real-time processing can truly reinvigorate organizations looking to be more data-driven. This is especially true as the Internet of Things continues to be a focal point in big data. With more “things” generating and collecting data, an infrastructure that can not only support the data growth, but also reliably handle its processing at the speed at which it is generated is becoming a necessity.
ESG Lab validated that MapR Streams delivers scalable event processing performance on commodity hardware. Leveraging the in-house Rubix benchmark, four test scenarios that represented common occurrences in event-processing infrastructures were simulated: purely ingesting messages, purely consuming messages, simultaneously running both producers and consumers with timestamps, and delaying the consumption of messages to measure the time it takes to catch up to the flow of data. ESG Lab was particularly impressed with the ability of a five-node cluster to handle the ingestion of nearly 7 million messages/sec. Further, when consuming those messages after they were generated, over 18 million messages were consumed in the correct order without any data loss at rates as high as 3.5 GB/s.
The MapR architecture requires fewer resources and improves reliability compared with traditional HDFS. All commodity hardware resources in the MapR Hadoop cluster are utilized. Because data is spread across every node in the cluster, if a MapR node fails, the rest of the cluster resyncs the dead node’s data. With this architecture, inefficiencies related to idle spare drives are eliminated, since spare capacity throughout the rest of the cluster is available to absorb a lost node’s data.
Jepsen is a tool that tests the consistency and availability of distributed databases by simulating network partitions. With some pre-configuration, Jepsen will recognize a cluster, communicate with each node via SSH, start daemons, and perform actions on the nodes based on testing profiles. During each failure injection, a stream of data is sent to the cluster (producer) and once the stream has completed, it is validated through the Jepsen framework to ensure that no data has been lost (consumer). Jepsen was run on a client server outside of a MapR cluster and utilized the Kafka 0.9 API to communicate with MapR Streams.
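The no-loss, in-order validation Jepsen performs after each failure injection can be sketched as a simple subsequence check. This illustrates the idea only; it is not Jepsen's actual implementation:

```python
def validate_stream(sent, received):
    """Jepsen-style post-failure check: every message sent during the
    failure window must appear in `received`, in the order it was
    produced (no loss, no reordering; duplicates are tolerated)."""
    it = iter(received)
    for msg in sent:
        for got in it:
            if got == msg:
                break
        else:
            return False  # a sent message never arrived: data loss
    return True

# validate_stream(["m0", "m1", "m2"], ["m0", "m1", "m2"])  -> all delivered in order
# validate_stream(["m0", "m1", "m2"], ["m0", "m2"])        -> fails: m1 was lost
```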
The Jepsen tool consists of a standard suite of prepackaged tests for Kafka and its in-sync replica (ISR) functionality. Additionally, the tool can be customized to address vendor-specific features and functions. Along with running the standard tests, a series of unique, MapR-specific failures (i.e., shutting down meta-data services or shutting down a file server) were created and customized with a goal of understanding the behavior of the cluster during failure scenarios.
A four-node MapR cluster was leveraged for Jepsen testing (see Figure 5). The small cluster served as an ideal reliability test case, as a smaller node count presents a greater possibility of data loss. Two of the nodes ran the container location database (CLDB) service, a key MapR service that manages and maintains the location of MapR containers across the cluster. When two instances run simultaneously on the same cluster, they operate in a master-slave configuration, with the slave on active standby for high availability, allowing failovers to occur within seconds. While on standby, the slave also handles read traffic to alleviate the load on the master CLDB service. It should be noted that the CLDB integrates directly into MapR-FS, which in many cases enables faster lookup times than traditional namenode functionality.
MapR-FS was run on all four nodes. Along with the CLDB service, the MapR-FS service is managed and monitored by the warden, which is a service in and of itself. MapR Platform Services consists of not only the MapR file system, but also MapR-DB and MapR Streams. Lastly, Zookeeper was run on all nodes to allow for the sharing of configuration and state data in case of a needed failover.
One of the failure scenarios consisted of failing the master CLDB service. A stream was started, and as soon as a message was sent, a node was selected as the primary writer of that data stream. In this particular test case, the node serving as the CLDB master was selected. As messages were sent, the warden service used heartbeat monitoring to ensure the master CLDB node was online. Next, Jepsen forced the CLDB service to crash. The warden service detected the failure and restarted the CLDB service within seconds. Upon test completion, Jepsen verified that all messages were received in the correct sequential order and that no data was lost during the brief disruption in service.
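The warden's heartbeat monitoring can be modeled as a toy check loop. The timeout value, class, and method names below are invented for illustration and do not reflect the warden's actual configuration:

```python
HEARTBEAT_TIMEOUT_S = 5.0  # illustrative only; actual warden timeouts differ

class WardenModel:
    """Toy model of heartbeat monitoring: if a service misses its
    heartbeat window, the monitor restarts it."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_beat = {}   # service -> time of last heartbeat
        self.restarts = {}    # service -> restart count

    def heartbeat(self, service, now):
        self.last_beat[service] = now

    def check(self, service, now):
        """Restart any service whose last heartbeat is too old."""
        if now - self.last_beat.get(service, now) > self.timeout_s:
            self.restarts[service] = self.restarts.get(service, 0) + 1
            self.last_beat[service] = now  # restarted service beats again
            return "restarted"
        return "healthy"

w = WardenModel()
w.heartbeat("cldb", now=0.0)
status_early = w.check("cldb", now=2.0)   # within the window
status_late = w.check("cldb", now=10.0)   # missed heartbeat, service restarted
```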
Additional simulated failure scenarios along with their resulting behaviors are shown in Table 1.
Why This Matters
As more tools and processes are put in place to handle the volume and velocity of data, the potential of impacting the continuous flow of data due to a failure increases. It is difficult enough to collect and move the massive data sets in a timely fashion to handle both batch and real-time processing requirements. If a failure were to occur anywhere in the data pipeline, the impact to the business could be catastrophic. Simply put, the fewer moving parts and more reliable the infrastructure, the easier it is for the customer to achieve success.
ESG Lab validated that MapR Streams inherits the existing benefits of the overall MapR infrastructure to deliver a highly available, reliable infrastructure to handle the continuous flow of real-time data. Through the use of replication within a Hadoop cluster, high levels of uptime are achieved by eliminating single points of failure. This not only ensures that the flow of data is uninterrupted, but also prevents data loss across the entire data pipeline, from data collection at the producer level, to analytics at the consumer level.
The Bigger Truth
Data analysis is a necessity for organizations looking to gain an edge. Whether it is social, transactional, sensor, or log data, a majority of this data is event-based. Batch processing has worked well to eventually get a grip on all of it, but the need to augment batch processing with real-time functionality is driving the next phase of data analytics. A number of solutions are available today that, when cobbled together, can fulfill this need, but complexities related to cost, support, and interoperability, to name a few, remain. These makeshift solutions have created a somewhat inefficient data pipeline that relies heavily on a number of pieces working harmoniously together; if one of those pieces fails, the whole pipeline falls apart. Organizations are looking for a solution that delivers always-on support for all data being generated and efficiently transports that data to its proper destination for processing and analysis.
As part of the MapR Converged Data Platform, MapR Streams provides a simplified architecture to enable big data stream processing on Hadoop. Data producers and consumers can connect in real-time to enable the development of powerful applications at a global scale with security, reliability, and massive scalability as a core foundation. MapR delivers a big data storage and processing infrastructure that manages data from initial generation and ingestion to consumption and real-time insight through industry-proven MapR services, open-source engines, and commercial applications.
MapR started with Hadoop and a purpose-built data platform called MapR-FS for enterprise storage. Along the way, a NoSQL database called MapR-DB was added, as well as Apache Drill for interactive SQL support, all of which now work together to enable support for a large number of tools and applications focused on processing big data. This includes popular open-source tools like Spark and Storm, which not only leverage the data and platform services of MapR to enable real-time data processing, but inherit the core pillars of MapR’s architecture such as reliability and performance to meet the always-on, real-time demands of data-driven businesses. Naturally, a publish-subscribe event streaming service was required to create a truly comprehensive platform of fully integrated data services. MapR Streams provides that global event stream processing system that turns MapR into a highly performing, ultra-reliable, enterprise-grade converged data platform to meet the demands of batch and real-time processing on a global scale.
ESG Lab Reports
The goal of ESG Lab reports is to educate IT professionals about data center technology products for companies of all types and sizes. ESG Lab reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objective is to go over some of the more valuable features/functions of products, show how they can be used to solve real customer problems, and identify any areas needing improvement. ESG Lab’s expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments. This ESG Lab report was sponsored by MapR.
Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, January 2015.