Co-Author(s): Tony Palmer, Senior Validation Analyst
This ESG Technical Validation Report documents testing of Pure Storage’s Purity ActiveCluster synchronous replication, with a focus on ease of deployment and management; stretched-cluster business continuity that delivers recovery point objectives (RPOs) and recovery time objectives (RTOs) of zero; and active/active asynchronous replication for global disaster recovery.
Organizations do everything they can to maintain business continuity and recover immediately from outages, as these significantly impact their competitiveness and profitability. The cost of downtime is enormous; depending on the industry, organizations lose hundreds of thousands to millions of dollars for every hour of downtime. When ESG surveyed organizations about the impacts that can result from downtime and data loss, the most cited impacts included loss of customer confidence, direct loss of revenue, and missed business opportunities (see Figure 1).1
One way to help achieve business continuity objectives is to use stretched clusters of storage. These provide high availability and failover between data centers within metro distances, enabling two sites to store the same data. But they are typically complex and expensive, requiring costly professional services, expensive software licenses, and specialized storage management skills. They may also demand weeks of IT staff time spent poring over thousands of pages of technical manuals. Similarly, replication to a remote site for disaster recovery (DR) can be complex and costly. Many organizations today are looking for simple storage solutions that virtualization, database, and general IT administrators can handle, and replication has not met that requirement to date.
ActiveCluster is a feature of the Purity FA Operating Environment that enables simple synchronous and asynchronous replication between Pure FlashArrays in racks, data centers, campuses, and metro regions. It supports synchronous round-trip latency of less than 11 ms; asynchronous replication sites can be anywhere in the world with no latency restrictions. Customers on versions of Purity FA prior to 5.0 gain access to ActiveCluster with a simple, non-disruptive upgrade; there are no licenses or additional hardware to purchase and deploy. It is extremely simple to set up and manage, requiring little prior storage knowledge.
ActiveCluster synchronous replication (available with Purity version 5.0 or greater) is a multi-site, active/active, bidirectional solution that delivers transparent failover and supports RTOs and RPOs of zero. ActiveCluster synchronous replication provides business continuity and non-disruptive data migration. ActiveCluster active/active asynchronous replication (available with Purity version 5.2 or greater) provides an offsite copy for operational recovery, such as after a disaster. Both support Pure Protection Groups, which control replication scheduling, retention, and volume consistency. ActiveCluster consists of four core components, which are represented in Figure 2:
- Active/Active Synchronous Clustering. Using synchronous replication, both arrays keep an up-to-date data copy. Hosts that are attached to either or both arrays access a single, consistent data copy that is readable and writable on both arrays simultaneously. This creates a true active/active deployment in which workloads can continue uninterrupted during link, path, array, and host failures. Hosts simply see multiple paths to the data. At each site, the optimized path connects the hosts to the local arrays; hosts are also connected to the remote arrays, but these paths are referred to as “non-optimized” because they add latency. Should the optimized path fail, ActiveCluster automatically shifts to the active path on the remote array.
- Stretched Pods. These are management containers that hold storage objects (such as volumes) that are collected into groups and replicated together, or “stretched,” between the two sites. Volumes can be created and resized while stretched without having to take down replication configurations and restart them.
- Pure1 Cloud Mediator. This cloud-based component is used during an outage to determine which synchronous array continues data services in order to avoid the “split brain” problem in which active/active arrays have live data copies that are out of sync. Mediator instances are hosted in multiple availability zones and are protected by load balancers and high availability. No additional management component is needed. An on-premises mediator is supported for organizations with high security levels that cannot be cloud-connected.
- Active/Active Asynchronous Clustering. Data can be asynchronously replicated to a third site, fully integrated with ActiveCluster stretched pods for simplicity and automation. Data is replicated from both active/active synchronous arrays, but with efficiency and intelligence. Asynchronous replication is orchestrated by the target so that only unique data is copied to the third site; in addition, data is copied from the synchronous array offering the best performance.
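The optimized and non-optimized path behavior described in the list above can be illustrated with a short sketch. This is a simplified model for illustration, not Pure's code or any host multipath driver's logic; the `Path` class and the `pick_path` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    array: str        # array the path terminates on (hypothetical label)
    optimized: bool   # True for the local, lower-latency path
    alive: bool       # False once the host has marked the path failed


def pick_path(paths: Sequence[Path]) -> Optional[Path]:
    """Prefer a live optimized path; fall back to a live non-optimized one."""
    live = [p for p in paths if p.alive]
    # Sorting on `not optimized` puts optimized paths first (False < True).
    live.sort(key=lambda p: not p.optimized)
    return live[0] if live else None
```

With one live path to each array, `pick_path` returns the optimized local path; if that path is marked failed, it returns the non-optimized path to the remote array, which mirrors the shift the report describes.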
Purity ActiveCluster features include:
- Synchronous replication. All writes are immediately copied to the other array; before being acknowledged to the host, data is written to the local and remote arrays, and protected in NVRAM.
- Symmetric active/active arrays. Reads and writes can be executed on the same volumes on both arrays, with optional site awareness.
- Transparent failover. Synchronously replicated arrays support automatic, non-disruptive failover, and automatic resync and recovery.
- Failover preference. Administrators have the option to configure pods to fail over particular volumes to preferred arrays when possible. In the non-uniform configuration, in which hosts are connected to just one array, applications can be placed in a pod with a failover preference aligned with the FlashArray they are connected to.
- Active/active asynchronous replication. ActiveCluster supports policy-based, automatic asynchronous replication to a DR site from both synchronous arrays, so RPOs are never in jeopardy. It is intelligent and efficient, as the target array orchestrates data movement to only pull unique data, and data requests are directed to the best performing array.
- Simple management. Data management tasks such as storage provisioning, host connections, and creation of snapshots and clones can be performed from local and remote sites.
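The synchronous replication rule in the feature list (a write is acknowledged to the host only after it is persisted on both arrays) can be sketched as follows. This is a toy model under stated assumptions: the `Array` class, its `nvram` list, and `synchronous_write` are hypothetical, and what happens after a failed remote write (mediation, resync) is deliberately left out.

```python
from typing import List


class Array:
    """Toy stand-in for one array's persistent write buffer (NVRAM)."""

    def __init__(self, name: str, reachable: bool = True) -> None:
        self.name = name
        self.reachable = reachable
        self.nvram: List[bytes] = []

    def persist(self, data: bytes) -> bool:
        # An unreachable array cannot persist anything.
        if not self.reachable:
            return False
        self.nvram.append(data)
        return True


def synchronous_write(local: Array, remote: Array, data: bytes) -> bool:
    """Return True (acknowledge the host) only if both arrays persisted the write."""
    local_ok = local.persist(data)
    remote_ok = remote.persist(data)
    return local_ok and remote_ok
```

Because the acknowledgment waits for both copies, an acknowledged write exists on both arrays, which is what makes an RPO of zero possible in this model.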
ESG Technical Validation
ESG performed evaluation and testing of Purity ActiveCluster on FlashArrays in Mountain View, CA. Testing was designed to demonstrate how ActiveCluster synchronous replication enables business continuity during various failure scenarios, and how active/active asynchronous replication can be configured easily and without downtime to enable operational recovery.
The first test bed leveraged two Pure FlashArrays (model FA//M50R2) with 21 TB of storage running the Purity FA 5.0 Operating Environment, and two Supermicro servers with Intel Xeon E5 processors. Hosts and storage were located in the same data center. The array named AC-m50r2-A will be referred to as Array A, and AC-m50r2-B as Array B. To test failover and business continuity, a virtualized OLTP workload was run on a 384GB SQL Server database with 100 test users. This workload is designed to emulate typical OLTP tasks for managing, selling, and distributing products, and combines a mix of OLTP transaction types executed simultaneously. The arrays were configured with a stretched pod and synchronous mirroring.
ESG began by evaluating the ease of installation. While many synchronous replication solutions require professional services, costly hardware and software licenses, and weeks of planning and preparation, ActiveCluster installation is as simple as upgrading the Purity Operating Environment, at no cost, and completing a few simple steps.
We began with version 5.0 of the Purity Operating Environment installed on the two arrays. Array A contained two volumes, named SQL and Vol1. First, from the Storage/Array tab, we connected the two arrays by requesting the connection key for Array B and copying that key. We selected Connect Array on Array A, pasted in the key, selected Sync Replication, and entered the IP address of Array B’s management interface. Array B then appeared in the Connected Arrays screen, showing it was configured for sync replication, and showing the management and replication addresses.
Next, with the workload still running, we non-disruptively moved the SQL volume into a pod, and then stretched that pod to the other array. We clicked on pod:SQL from the Pods tab. Array A showed in the Arrays screen; we clicked the plus sign (+), and from the Add Array dropdown, selected Array B, and clicked Add. Immediately, Array B was added to the Arrays screen, and resyncing began.
Why This Matters
Typically, synchronous replication solutions are difficult, time-consuming, and expensive to install, costing tens to hundreds of thousands of dollars. Many organizations cannot afford the costs of additional hardware, software licenses, professional services, and disruption required to set them up; organizations that can afford to invest in these solutions often use them for only their most critical data.
ESG was impressed with how easy it was to set up ActiveCluster. It is included with Pure’s Evergreen storage subscription, accessible with a simple operating system update. The process of connecting the arrays and configuring synchronous replication took minutes with just a few clicks—no downtime, no professional services, and no interruption to productivity. Administrators could easily set it up with no specialized storage skills.
Next, ESG tested several failure scenarios. Testing began with the same two Pure FlashArrays, and with the SQL database in an active/active, stretched pod between the arrays, so that writes were synchronously replicated and both arrays supported read/write. The Pure Dashboard showed the running workload via charts of latency, IOPS, and throughput; this enabled ESG to view I/O status during each failure.
Local Array HA Controller Failure
The first test demonstrated what happens during the failure of one of the redundant controllers in a single Pure array. Choosing Array B, we failed the primary controller that the workload was using; the secondary controller took over. The result was a brief pause in I/O as the controller failed over, but host I/O continued to both arrays while maintaining RPO zero.
Figure 4 shows the workload activities.2 This view shows Array B, with a purple line showing mirrored writes; as the failover from the primary to secondary controller occurred, there was a brief dip as I/O paused, and IOPS, bandwidth, and latency dropped to zero. This can be seen in the graph and also in the latency, IOPS, and bandwidth dialog boxes that all read “0.” Once the controller failed over, operations resumed, including mirrored writes. RPO zero was never in jeopardy.
The amount of time for the I/O pause depends on the failure and on the hosts. A local HA failover typically pauses for 8-12 seconds yet the arrays remain in sync, while an ActiveCluster failover usually pauses for 16-20 seconds. Host multi-path software will also impact the I/O pause time, with many hosts pausing automatically for 30 seconds.
Replication Link Failure
Next, we failed the four replication interfaces on Array B. This failure invokes the Pure1 Cloud Mediator. Failover is automatic as long as at least one array is in contact with the Mediator. In order to keep data from being out of sync, the array that reaches the Mediator first claims the right to continue serving I/O (unless a failover preference has been configured); in this case, Array B won that race, and Array A showed as “offline” (right).
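The Mediator race described above is essentially a first-claim-wins arbitration, which the following sketch models. It is an illustration only, not the Pure1 Cloud Mediator's protocol: the `Mediator` class and its `claim` and `clear` methods are hypothetical, and the optional failover preference is not modeled.

```python
from typing import Dict


class Mediator:
    """Toy first-claim-wins arbiter; one winner per pod per outage."""

    def __init__(self) -> None:
        self._winner: Dict[str, str] = {}

    def claim(self, pod: str, array: str) -> bool:
        # setdefault records the first claimant and ignores later ones.
        return self._winner.setdefault(pod, array) == array

    def clear(self, pod: str) -> None:
        # Assumed to run once the pod has resynced, so a later outage races afresh.
        self._winner.pop(pod, None)
```

In the test above, Array B's claim arrived first, so in this model `claim("pod:SQL", "Array B")` returns True and the later claim from Array A returns False; the losing array stops serving the pod, which avoids split brain.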
Figure 5 shows that Array A is sitting idle, with no workload, while Array B is continuing to service I/O as if it were local. Note that on Array B, the reads are indicated with a blue line, and writes with an orange line; they were not being mirrored (purple line) because Array A was unavailable. Once we restarted the replication links, the pod immediately began to resync, applying the changes that occurred while Array A was unavailable. Once the resync was complete, synchronous replication restarted.
Single Array Failure
Next, we viewed how ActiveCluster enables automatic failover in the case of a single array failure. To simulate that failure, we shut down Purity services on both controllers on Array B. Within seconds, ActiveCluster failed over the I/O load automatically to Array A; no administrative intervention was required. Alerts notified us about the failed connection between the arrays, and the surviving array recognized storage path failures, but read and write traffic continued virtually uninterrupted on Array A because of the stretched pod. Once we re-enabled the primary controller on Array B, the stretched pod immediately began to resync, and the mirrored writes resumed.
Active/Active Asynchronous Replication
Active/active asynchronous replication starts with a baseline snapshot, after which only incremental, unique data is transferred to the target in a compressed format. The snapshot is created at the source arrays, but data movement is orchestrated for efficiency by the target. Content requests from the target are broken up and sent to alternating source arrays to speed data movement, load balancing by task completion: Whichever array completes a request first gets the next request.
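Load balancing by task completion can be sketched with a small simulation. This is a hedged illustration rather than Pure's scheduler: the `distribute` function is hypothetical, and it assumes each source array needs a constant time per request.

```python
import heapq
from typing import Dict, List, Tuple


def distribute(requests: int, service_time: Dict[str, float]) -> Dict[str, int]:
    """Hand out requests so that whichever source finishes first gets the next one.

    service_time maps a source name to the (assumed constant) time per request.
    Returns how many requests each source ended up serving.
    """
    served = {name: 0 for name in service_time}
    # Each heap entry is (time the source becomes free, source name).
    free_at: List[Tuple[float, str]] = [(0.0, name) for name in sorted(service_time)]
    heapq.heapify(free_at)
    for _ in range(requests):
        now, name = heapq.heappop(free_at)  # earliest-free source
        served[name] += 1
        heapq.heappush(free_at, (now + service_time[name], name))
    return served
```

If one source takes 1.0 time units per request and the other takes 2.0, nine requests split six to three in this model, so the faster array naturally carries more of the transfer without any static weighting.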
This test bed was a campus-type environment with two Pure M20 FlashArrays (tmefa04 and tmefa05) running Purity 5.2, two ESXi servers (each with a single VM), and the IOmeter workload generator running 16KiB blocks, 70% read/30% write.
First, we set up ActiveCluster on the two arrays, created a pod, added volumes connected to the ESXi hosts to the pod, and then stretched the pod to the peer array. Setting up active/active asynchronous replication was simple and fast. We connected a third array (tmefa02) to represent the remote DR site by simply typing in the domain name of the third array and copying its connection key to both of the ActiveCluster arrays.
Next, we created a Protection Group within the pod, added the volumes to replicate together, set the asynchronous replication target to tmefa02, and defined a replication schedule. Our schedule was set for every five minutes, keeping all snapshots for one day and retaining four snapshots for an additional day. This setup took about three minutes and was completed while applications were running, and we watched the replication complete on the target array. The left side of Figure 6 shows both synchronous and asynchronous replication configured; the right side shows the async replication.
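To give a sense of scale for this schedule, the arithmetic below estimates how many snapshots the target might hold at steady state. It is a rough sketch with a hypothetical function name, and it assumes that "four snapshots for an additional day" means four snapshots per day during the extended retention period.

```python
def max_retained_snapshots(interval_min: int, keep_all_days: int,
                           per_day_after: int, extra_days: int) -> int:
    """Estimate the steady-state snapshot count for one Protection Group."""
    per_day = (24 * 60) // interval_min  # snapshots taken per day
    return per_day * keep_all_days + per_day_after * extra_days


# Schedule from the test: every 5 minutes, keep all for 1 day,
# then keep 4 per day for 1 additional day.
print(max_retained_snapshots(5, 1, 4, 1))  # 288 + 4 = 292
```

Under that assumption, a five-minute interval produces 288 snapshots per day, so the target would hold at most about 292 snapshots for the group.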
Why This Matters
Data growth and the rapid proliferation of virtualized applications are increasing the cost and complexity of storing, securing, and protecting business-critical information assets. An “always available” storage solution with automatic failover and simple user tools that make it easy to deploy and centrally manage a complex, multi-site storage deployment can reduce the time and cost required to deliver continuous business productivity and disaster recovery.
ESG validated that with ActiveCluster, volumes that are stretched between local and remote arrays can read and write on both arrays, creating a true active/active deployment. Should an array fail, there is no actual storage “failover”; the workload simply continues to run on the other clustered array. The host loses some storage paths, but I/O continues. We validated how quickly and easily ActiveCluster enabled business continuity during failures of a controller, replication link, array, and host. It all happened automatically, with no administrator input, and with no more than a few seconds of pause in I/O. We also validated the ease and speed of setting up active/active, intelligent, and efficient asynchronous replication for operational recovery.
ActiveCluster can be used for business continuity, disaster recovery, and data migration. IT administrators can stretch a pod to another array (on the same site or at a different site), where it enjoys local access because the data is synchronized on the secondary array; then, by un-stretching the pod, the data is effectively migrated.
Customer Interview – Global Security Firm
ESG spoke with Solutions Architect Brian Cummings about global security firm ESET’s ActiveCluster deployment between data centers that are 12 miles apart. The company uses Pure Storage as well as VMware vSphere, vCloud Director, NSX, and Horizon for virtual desktops. It has more than 1,500 production VMs, and leverages VMware vMotion, HA, and Fault Tolerance. This company has a software-defined mindset and a real need for consistent business continuity.
Application SLAs range from 15 to 60 minutes; any outage demands attention from IT and application owners. The solutions architect explained that before ActiveCluster, downtime required manual intervention. To get services back up and running, IT had to restore from snapshots or from secondary disk or tape, which consumed precious production time and meant losing data, with recovery points of 24 hours. And with one data center in a co-location provider, the business was dependent on the provider to get back online.
ActiveCluster solves all that with automatic recovery. Before deploying it, this solutions architect tested every failure scenario, from a VM failure to host outage, power outage, and full site failure, demonstrating a seamless failover with no intervention. “In addition,” he commented, “ActiveCluster lets us seamlessly move VMware workloads between data centers with vMotion with no downtime. We can move 1,500 VMs in under 15 minutes, saving us time and money.”
This customer was impressed with the ease of installation. The organization had used a different synchronous replication solution with other enterprise storage. Implementing that solution required professional services and hours of planning, maintenance windows, and multiple upgrades. With ActiveCluster, installation took a total of 30 minutes—20 minutes for the Purity upgrade, and then 5-10 minutes to connect the arrays and create pods. He also noted that while ActiveCluster came at no incremental cost (it is included in Pure’s Evergreen Storage subscription), other synchronous replication solutions would cost hundreds of thousands of dollars. “We continue to reap the benefits of Pure’s Evergreen Storage architecture, which regularly gives us new capabilities with no disruption during upgrades,” he commented.
Customer Interview – Industrial Construction Company
ESG spoke with Alexander Letan, head of IT infrastructure at Kremsmüller, a European industrial construction corporation that maintains refineries and large plants. The company runs two data centers that serve more than 35 sites worldwide, and has been using Pure Storage FlashArrays for almost three years for its 300 VMs (SQL databases, Citrix XenApp, and XenDesktop) and VMware HA for high availability. The company had previous experience with synchronous replication solutions, but finds more value and success running Purity ActiveCluster.
The ActiveCluster installation was quick and easy, and enabled the company to reclaim staff time for other initiatives. This engineer mentioned that previous synchronous replication offerings had proven to be cumbersome and time-consuming, with maintenance activities sapping production time. In his experience, installing the solution, provisioning storage, and scripting for failover took weeks, and required specialized training to manage. In comparison, the ActiveCluster installation was completed in half a day, and was so simple that any IT administrator could manage it with virtually no training.
ActiveCluster has proven to be stress-free and reliable, easing the pressure on IT. Before implementing ActiveCluster, the company tested it extensively, and it always failed over as expected; they could not produce an outage of any kind. IT staff tested site, array, controller, network, and other types of failures without any erroneous behavior, despite testing it with workloads running at 10x the typical production load. This is essential, since the business requires that employees have 24/7 access to blueprints, files, emails, and other applications worldwide.
The engineer mentioned that his colleagues spend very little time planning ActiveCluster updates thanks to its simplicity, and confessed they sleep better knowing that the system is in place. Upgrades are run during production time and have no impact on performance. They have found that preparing for an ActiveCluster code upgrade takes hours instead of days and can all be done remotely. By eliminating outages and spending less time on administration, the company expects savings of up to $500,000 with ActiveCluster.
The Bigger Truth
The longer you can keep business up and operational, the better the chances of success. When asked how much downtime their organizations could handle before failing over to a secondary site, ESG research respondents reported that 14% of high-priority workloads could suffer no downtime at all (requiring immediate failover) and 71% could be down less than an hour.3 This drives IT on a constant search for solutions that help keep interdependent groups of data sets available and protected in order to avoid planned and unplanned downtime. Synchronous replication, which continually creates an up-to-date data copy at another location, goes a long way toward those goals, keeping data available in the face of many failure scenarios. Asynchronous replication provides instantaneous recovery of multiple volumes from snapshots on the target array. This enables a very low RTO, allowing customers to achieve faster recovery of multiple applications in case of disasters.
Replication has been around for two decades now. But it is expensive, and has always been complex, bordering on painful, to set up, maintain, and test. As a result, it has been used for only the most critical applications, and by only large companies with deep pockets. With FlashArrays running Purity 5.2 and higher, ActiveCluster provides both synchronous and asynchronous replication capabilities at no additional charge, while some other solutions require the purchase of additional software and hardware such as third site mediator servers or gateway products. ActiveCluster is simple and fast to deploy and use. Customers with earlier versions of Purity can upgrade to version 5.x easily and non-disruptively. ActiveCluster extends the opportunity for replication to any application, and to organizations lacking the resources—financial or staffing—that active/active multi-site replication had excluded in the past.
In addition, ActiveCluster gives organizations the opportunity to actually test their replication and failover, which is something that, in ESG’s experience, is lacking in the vast majority of organizations. With ActiveCluster, workloads run on either site, in true active/active fashion—you can test this by simply running the application from the other site, proving that business will continue despite most failures. Many organizations set up business continuity and disaster recovery solutions, but never test them because of complexity and disruption to production operations. And, sadly, many find the problems and configuration issues only during a disaster, when they cannot access their data. Purity ActiveCluster takes the cost and complexity barriers out of business continuity, making it simple, cost-effective, and provable for any organization—and relieving the minds of IT administrators.
ESG validated that Purity ActiveCluster simplifies metro region data protection, making failover transparent and effortless. Our setup was easy and fast, and the test scenarios proved how ActiveCluster delivers I/O without interruption during various failure scenarios. We validated the ease of setting up intelligent, efficient active/active asynchronous replication, eliminating the traditional complexity of manual setup. Both replication solutions are configured without downtime, expensive professional services, or weeks of administrative effort.
Pure Storage strives for simplicity, and in ESG’s view the company is highly successful. ActiveCluster makes maximizing uptime simple and cost-effective for any application and organization. Given the competitive advantage that maximizing uptime delivers, ActiveCluster is well worth a look.
1. Source: ESG Master Survey Results, Real-world SLAs and Availability Requirements, May 2018.
2. Note that it takes 30 seconds for the dashboard to draw the graph lines, so time stamps will not match wall clock time.
3. Source: ESG Master Survey Results, Real-world SLAs and Availability Requirements, May 2018.
ESG Validation Reports
The goal of ESG Validation reports is to educate IT professionals about information technology solutions for companies of all types and sizes. ESG Validation reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objectives are to explore some of the more valuable features and functions of IT solutions, show how they can be used to solve real customer problems, and identify any areas needing improvement. The ESG Validation Team’s expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments.