ESG Validation

ESG Lab Validation: Pure Storage Purity ActiveCluster: Synchronous Replication with Automatic Failover

Co-Author(s): Tony Palmer


Introduction

This ESG Lab Validation report documents testing of Pure Storage’s Purity ActiveCluster synchronous replication, with a focus on ease of deployment and management, and stretched-cluster business continuity that delivers recovery point objectives (RPOs) and recovery time objectives (RTOs) of zero.

Background

Organizations do everything they can to maintain business continuity, as it significantly impacts their competitiveness and profitability. The cost of downtime is enormous; depending on the industry, organizations lose hundreds of thousands to millions of dollars for every hour of downtime through lost productivity and revenue, missed opportunities, and damage to reputation and customer relationships. When ESG surveyed organizations about their downtime tolerance for primary production servers or systems, 51% reported that they could tolerate less than an hour of downtime for high-priority applications, and 29% could tolerate less than 15 minutes.1 When all production applications were included, 15% could endure no downtime, and a total of 52% could tolerate less than one hour (see Figure 1).

One way to help achieve business continuity objectives is to use stretched storage clusters, which provide high availability and failover between data centers within metro distances, enabling two sites to store the same data. But stretched clusters are typically complex and expensive, usually requiring costly professional services, expensive software licenses, and specialized storage management skills, along with weeks of IT staff time spent poring over thousands of pages of technical manuals. Many organizations today are looking for simple storage solutions that virtualization and general IT administrators can handle, and until now, synchronous replication has not fit that description.

Purity ActiveCluster

ActiveCluster is a feature of the Purity FA Operating Environment 5.0 or greater that enables synchronous replication between Pure FlashArrays in racks, data centers, campuses, and metro regions. It supports replication links with round-trip latency of less than 5 ms, and data can also be replicated asynchronously to a third site anywhere in the world for additional disaster recovery protection. Customers on versions of Purity FA prior to 5.0 gain access to ActiveCluster with a simple, non-disruptive upgrade; there are no licenses or additional hardware to purchase and deploy. It is extremely simple to set up and manage, requiring little prior storage knowledge.

ActiveCluster is a multi-site, active/active, bidirectional replication solution that delivers transparent failover and supports RTO and RPO of zero. ActiveCluster synchronous replication provides business continuity and non-disruptive data migration. It consists of three core components, which are represented in Figure 2:

  • Active/Active Clustered Arrays. Using synchronous replication, both arrays keep an up-to-date data copy. Hosts that are attached to either or both arrays access a single, consistent data copy that is readable and writable on both arrays simultaneously. This creates a true active/active deployment in which workloads can continue uninterrupted during link, path, array, and host failures.
  • Stretched Pods. These are management containers that hold storage objects (such as volumes) that are collected into groups and synchronously replicated together, or “stretched” between the two sites.
  • Pure1 Cloud Mediator. This cloud-based component is used during an outage to determine which array continues data services in order to avoid the “split brain” problem in which active/active arrays have live data copies that are out of sync. This relieves organizations from the cost and burden of deploying a third site. Mediator instances are hosted in multiple availability zones and are protected by load balancers and high availability. No additional management component is needed. An on-premises mediator is supported for organizations with high security levels that cannot be cloud-connected.
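The Mediator's role in avoiding split brain can be pictured as a single-winner lock that arrays race to claim when they lose contact with each other. The sketch below is an illustrative model only; the `Mediator` class and `on_replication_link_failure` function are hypothetical, not Pure's implementation:

```python
# Illustrative model (not Pure's code) of mediator-based split-brain
# avoidance: when the replication link fails, each array that can still
# reach the mediator races to claim a lock, and only the winner keeps
# serving I/O for the stretched pod.

class Mediator:
    """Toy stand-in for the Pure1 Cloud Mediator: a single-winner lock."""
    def __init__(self):
        self._winner = None

    def claim(self, array_name):
        # The first claimant wins; any later claimant is told to stand down.
        if self._winner is None:
            self._winner = array_name
        return self._winner == array_name

def on_replication_link_failure(mediator, claim_order):
    """Return the arrays that continue serving I/O.

    `claim_order` lists the arrays that can still contact the mediator,
    in the order their claims arrive; arrays cut off from the mediator
    never appear here and simply pause I/O.
    """
    return [name for name in claim_order if mediator.claim(name)]

# Array B's claim arrives first, so it continues serving while Array A
# pauses -- one array always wins, so the copies can never diverge.
assert on_replication_link_failure(Mediator(), ["B", "A"]) == ["B"]
```

Because the lock has at most one winner, at most one array continues writing, which is exactly the guarantee needed to rule out two out-of-sync live copies.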

Hosts simply see multiple paths to the data. At each site, the optimized path connects the hosts to the local array; hosts are also connected to the remote array, but those paths are marked non-optimized because traversing the inter-site link adds latency. Should the optimized path fail, ActiveCluster automatically shifts I/O to the non-optimized path on the remote array.
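The host-side behavior just described resembles ALUA-style path selection: prefer a healthy optimized path, fall back to a non-optimized one. The following is a hypothetical sketch of that logic (the dictionaries and the `choose_path` function are illustrative, not Pure's or any host multipath driver's actual code):

```python
# Hypothetical sketch of optimized/non-optimized path selection.

def choose_path(paths):
    """Prefer a healthy optimized (local) path; otherwise fall back to a
    healthy non-optimized (remote) path, which adds link latency."""
    for state in ("optimized", "non-optimized"):
        for p in paths:
            if p["state"] == state and p["healthy"]:
                return p
    return None  # all paths down

paths = [
    {"array": "local",  "state": "optimized",     "healthy": True},
    {"array": "remote", "state": "non-optimized", "healthy": True},
]
assert choose_path(paths)["array"] == "local"   # normal case: stay local
paths[0]["healthy"] = False                     # the optimized path fails...
assert choose_path(paths)["array"] == "remote"  # ...I/O shifts to the remote array
```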

Purity ActiveCluster features include:

  • Synchronous replication. All writes are immediately copied to the other array; before being acknowledged to the host, data is written to the local and remote arrays, and protected in NVRAM.
  • Symmetric active/active arrays. Reads and writes can be executed on the same volumes on both arrays, with optional site awareness.
  • Transparent failover. Synchronously replicated arrays support automatic, non-disruptive failover, and automatic resync and recovery.
  • Integrated asynchronous replication. ActiveCluster uses asynchronous replication to create baseline copies, to resynchronize after a failure, and to replicate data to a third site for disaster recovery.
  • Simple management. Data management tasks such as storage provisioning, host connections, and creation of snapshots and clones can be performed from local and remote sites.
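The synchronous write path in the first bullet can be sketched as follows. This is an assumed model for illustration only; the `Array` class, `sync_write` function, and the list standing in for NVRAM are all hypothetical, not Pure's code:

```python
# Assumed model of the synchronous write path: a host write is
# acknowledged only after the data is persisted on both the local and
# the remote array, which is what makes RPO zero possible.

class Array:
    def __init__(self, name):
        self.name = name
        self.nvram = []          # stand-in for battery-backed NVRAM

    def persist(self, data):
        self.nvram.append(data)  # durable once staged in NVRAM
        return True

def sync_write(local, remote, data):
    """Return True (ack to the host) only if both copies are persisted."""
    ok_local = local.persist(data)
    ok_remote = remote.persist(data)   # mirrored write over the replication link
    return ok_local and ok_remote      # the ack implies both sides hold the data

a, b = Array("A"), Array("B")
assert sync_write(a, b, "block-42")    # the host sees the ack...
assert a.nvram == b.nvram              # ...and both arrays hold identical data
```

The key property is that the acknowledgment is withheld until both persists succeed, so any write the host believes is complete already exists on both arrays.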

ESG Lab Validation

ESG Lab performed evaluation and testing of Purity ActiveCluster on a pair of FlashArrays in Mountain View, CA. Testing was designed to demonstrate how ActiveCluster enables business continuity during various failure scenarios, using industry standard tools and methodologies.

The test bed leveraged two Pure FlashArray//M50 R2 arrays with 21 TB of storage running the Purity FA 5.0 Operating Environment, and two Supermicro servers with Intel Xeon E5 processors. Hosts and storage were located in the same data center. The array named AC-m50r2-A will be referred to as Array A, and AC-m50r2-B as Array B. To test failover and business continuity, a virtualized OLTP workload was run against a 384 GB SQL Server database with 100 test users. This workload is designed to emulate typical OLTP tasks for managing, selling, and distributing products, and combines a mix of OLTP transaction types executed simultaneously. The arrays were configured with a stretched pod and synchronous mirroring.

Getting Started

ESG Lab began by evaluating the ease of installation. While many synchronous replication solutions require professional services, costly hardware and software licenses, and weeks of planning and preparation, ActiveCluster installation is as simple as upgrading the Purity Operating Environment, at no cost, and completing a few simple steps.

We began with the latest version of the Purity Operating Environment installed on the two arrays. Array A contained two volumes, named SQL and Vol1. First, from the Storage/Array tab, we connected the two arrays by requesting the connection key for Array B and copying that key. We selected Connect Array on Array A, pasted in the key, selected Sync Replication, and entered the IP address of Array B’s management interface. Array B then appeared in the Connected Arrays screen, showing it was configured for sync replication, and showing the management and replication addresses.

Next, we simply, non-disruptively moved the SQL volume, with the workload running, into a pod, and then stretched that pod to the other array. We clicked on pod:SQL from the Pods tab. Array A showed in the Arrays screen; we clicked the plus sign (+), and from the Add Array dropdown, selected Array B, and clicked Add. Immediately, Array B was added to the Arrays screen, and resyncing began.
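The move-then-stretch workflow above can be summarized as a small state model: a pod starts local to one array, adding a second array triggers a resync, and only then is the pod active/active on both. The sketch below is hypothetical; the `Pod` class and its methods are illustrative, not Pure's API or CLI:

```python
# Hypothetical model of stretching a pod to a second array.

class Pod:
    """A pod starts on one array; stretching adds a second array and
    resyncs the data before the pod is active/active on both."""
    def __init__(self, name, volumes, home_array):
        self.name = name
        self.volumes = list(volumes)
        self.arrays = [home_array]   # initially local to one array
        self.in_sync = True

    def stretch(self, target_array):
        self.arrays.append(target_array)
        self.in_sync = False         # baseline copy in flight
        # ...the async engine copies existing data to the new array,
        # then synchronous replication takes over...
        self.in_sync = True

pod = Pod("SQL", volumes=["SQL"], home_array="Array A")
pod.stretch("Array B")               # mirrors the steps performed in the GUI
assert pod.arrays == ["Array A", "Array B"] and pod.in_sync
```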

Why This Matters

Typically, synchronous replication solutions are difficult, time-consuming, and expensive to install, costing tens to hundreds of thousands of dollars. Many organizations cannot afford the costs of additional hardware, software licenses, professional services, and disruption required to set them up; organizations that can afford to invest in these solutions often use them for only their most critical data.

ESG Lab was impressed with how easy it was to set up ActiveCluster. It is included with Pure’s Evergreen storage subscription, accessible with a simple operating system update. The process of connecting the arrays and configuring synchronous replication took minutes with just a few clicks—no downtime, no professional services, and no interruption to productivity. Administrators could easily set it up with no specialized storage skills.


Failure Testing

Next, ESG Lab tested several failure scenarios. Testing began with the same two Pure FlashArrays, and with the SQL database in an active/active, stretched pod between the arrays, so that writes were synchronously replicated and both arrays supported read/write. The Pure Dashboard showed the running workload via charts of latency, IOPS, and throughput; this enabled ESG Lab to view I/O status during each failure.

Local Array HA Controller Failure

The first test demonstrated what happens during the failure of one of the redundant controllers in a single Pure array. Choosing Array B, we failed the primary controller that the workload was using; the secondary controller took over. The result was a brief pause in I/O as the controller failed over, but host I/O continued to both arrays while maintaining RPO zero.

Figure 4 shows the workload activities.2 This view shows Array B, with a purple line showing mirrored writes; as the failover from the primary to secondary controller occurred, there was a brief dip as I/O paused, and IOPS, bandwidth, and latency dropped to zero. This can be seen in the graph and also in the latency, IOPS, and bandwidth dialog boxes that all read “0”. Once the controller failed over, operations resumed, including mirrored writes. RPO zero was never in jeopardy.

The length of the I/O pause depends on the failure and on the hosts. A local HA failover typically pauses I/O for 8-12 seconds while the arrays remain in sync, whereas an ActiveCluster failover usually pauses for 16-20 seconds. Host multipath software also affects the pause time; many hosts pause automatically for 30 seconds.

Replication Link Failure

Next, we failed the four replication interfaces on Array B. This failure invokes the Pure1 Cloud Mediator; failover is automatic as long as at least one array is in contact with the Mediator. To keep the data copies from diverging, the array whose claim reaches the Mediator first wins the right to continue serving I/O. In this case, Array B won that race, and Array A showed as offline.

Figure 5 shows that Array A is sitting idle, with no workload, while Array B is continuing to service I/O as if it were local. Note that on Array B, the reads are indicated with a blue line, and writes with an orange line; they were not being mirrored (purple line) because Array A was unavailable. Once we restarted the replication links, the pod immediately began to resync. This activates the asynchronous replication engine to apply the changes that occurred while Array A was unavailable.

Once the resync was complete, synchronous replication restarted; Figure 6 shows the return of the purple line for mirrored writes.
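The recovery sequence observed here—async catch-up first, synchronous mirroring after—can be sketched as a simple drain-then-switch routine. This is an assumed model; `recover` and its arguments are illustrative names, not Pure's implementation:

```python
# Assumed model of recovery after a replication-link outage: the
# surviving array logs changes while its peer is unreachable; on
# reconnect, the async engine replays the log, and once the log is
# empty, synchronous mirroring resumes.

def recover(pending_changes, apply_change):
    """Replay the change log accumulated during the outage, then
    return to the 'sync' state so new writes are mirrored inline."""
    while pending_changes:                  # state: resyncing
        apply_change(pending_changes.pop(0))
    return "sync"                           # state: in sync again

replayed = []
assert recover(["w1", "w2", "w3"], replayed.append) == "sync"
assert replayed == ["w1", "w2", "w3"]       # changes applied in order
```

The design point is that catch-up is asynchronous (so production I/O is not stalled by the backlog), and only a fully drained log permits the return to synchronous mirroring.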

Single Array Failure

Next, we viewed how ActiveCluster enables automatic failover in the case of a single array failure. To simulate that failure, we shut down Purity services on both controllers on Array B. Within seconds, ActiveCluster failed over the I/O load automatically to Array A; no administrative intervention was required. Figure 7 shows Array A switching immediately to reads and writes (blue and orange lines) from mirrored writes (purple lines). Alerts notified us about the failed connection between the arrays, and the surviving array recognized storage path failures, but read and write traffic continued virtually uninterrupted on Array A because of the stretched pod. Once we re-enabled the primary controller on Array B, the stretched pod immediately began to resync, and the mirrored writes resumed.

Single Host Failure

The test bed was configured with VMware HA so that in the event of a server failure, VMs could be restarted on other servers in the cluster. VMware HA (or other host cluster software) and ActiveCluster work together to provide production failover for VMs and their data.

We tested a single host failure by disconnecting power to Host B, which took down the VM on that host, and the workload on Host B ceased. Note that Purity ActiveCluster cannot do anything for the server itself; that recovery depends on the host cluster software, in this case VMware HA.3 Once the VM was restarted on Host A, the SQL workload resumed without further intervention. Next, we repowered Host B, and ActiveCluster synchronous replication restarted immediately.

Why This Matters

Data growth and the rapid proliferation of virtualized applications are increasing the cost and complexity of storing, securing, and protecting business-critical information assets. An “always available” storage solution with automatic failover and simple user tools that make it easy to deploy and centrally manage a complex, multi-site storage deployment can reduce the time and cost required to deliver continuous business productivity.

ESG Lab validated that with ActiveCluster, volumes that are stretched between local and remote arrays can read and write on both arrays, creating a true active/active deployment. Should an array fail, there is no actual storage “failover”; the workload simply continues to run on the other clustered array. The host loses some storage paths, but I/O continues. We validated how quickly and easily ActiveCluster enabled business continuity during failures of a controller, replication link, array, and host. It all happened automatically, with no administrator input, and with no more than a few seconds of pause in I/O.

ActiveCluster can be used for business continuity and data protection, and for data migration. IT administrators can stretch a pod to another array (on the same site or at a different site), where it enjoys local access because the data is synchronized on the secondary array; then, by un-stretching the pod, the data is effectively migrated.


Customer Interview – Global Security Firm

ESG Lab spoke with Brian Cummings, a Solutions Architect at global security firm ESET, about the company’s ActiveCluster deployment between data centers that are 12 miles apart. The company uses Pure Storage as well as VMware vSphere, vCloud Director, NSX, and Horizon for virtual desktops. It has more than 1,500 production VMs, and leverages VMware vMotion, HA, and Fault Tolerance. This company has a software-defined mindset and a real need for consistent business continuity.

Application SLAs range from 15-60 minutes; any outage demands attention from IT and application owners. The Solutions Architect explained that before ActiveCluster, downtime required manual intervention. To get services back up and running, IT had to restore from snapshots or from secondary disk or tape, costing precious production time and lost data, with recovery points of 24 hours. And with one data center in a co-location provider, the business was dependent on the provider to get back online.

ActiveCluster solves all that with automatic recovery. Before deploying it, this Solutions Architect tested every failure scenario, from a VM failure to host outage, power outage, and full site failure, demonstrating a seamless failover with no intervention. “In addition,” he commented, “ActiveCluster lets us seamlessly move VMware workloads between data centers with vMotion with no downtime. We can move 1,500 VMs in under 15 minutes, saving us time and money.”

This customer was impressed with the ease of installation. The organization had used a different synchronous replication solution with other enterprise storage. Implementing that solution required professional services and hours of planning, maintenance windows, and multiple upgrades. With ActiveCluster, installation took a total of 30 minutes—20 minutes for the Purity upgrade, and then 5-10 minutes to connect the arrays and create pods. He also noted that while ActiveCluster came at no incremental cost (it is included in Pure’s Evergreen Storage subscription), other synchronous replication solutions would cost hundreds of thousands of dollars. “We continue to reap the benefits of Pure’s Evergreen Storage architecture, which regularly gives us new capabilities with no disruption during upgrades,” he commented.

Customer Interview – Industrial Construction Company

ESG Lab spoke with Alexander Letan, head of IT infrastructure at Kremsmüller, a European industrial construction corporation that maintains refineries and large plants. The company runs two data centers that serve more than 35 sites worldwide, and has been using Pure Storage FlashArrays for almost three years for its 300 VMs (SQL databases, Citrix XenApp, and XenDesktop) and VMware HA for high availability. The company had previous experience with synchronous replication solutions, but finds more value and success running Purity ActiveCluster.

The ActiveCluster installation was quick and easy, and enabled the company to reclaim staff time for other initiatives. This engineer mentioned that previous synchronous replication offerings had proven to be cumbersome and time-consuming, with maintenance activities sapping production time. In his experience, installing the solution, provisioning storage, and scripting for failover took weeks, and required specialized training to manage it. In comparison, the ActiveCluster installation was completed in half a day, and was so simple that any IT administrator could manage it with virtually no training.

ActiveCluster has proven to be stress-free and reliable, easing the pressure on IT. Before implementing ActiveCluster, the company tested it extensively, and it always failed over as expected; they could not produce an outage of any kind. IT staff tested site, array, controller, network, and other types of failures without any erroneous behavior, despite testing it with workloads running at 10X the typical production load. This is essential, since the business requires that employees have 24/7 access to blueprints, files, emails, and other applications worldwide.

The engineer mentioned that his colleagues spend very little time planning ActiveCluster updates thanks to its simplicity, and confessed that they sleep better knowing the system is in place. Upgrades run during production hours with no impact on performance. They have found that preparing for an ActiveCluster code upgrade takes hours instead of days, and can be done entirely remotely. By eliminating outages and spending less time on administration, the company expects savings of up to $500,000 with ActiveCluster.

The Bigger Truth

The longer you can keep business up and operational, the better the chances of success. This drives IT on a constant search for solutions that help keep interdependent groups of data sets available and protected in order to avoid planned and unplanned downtime. Synchronous replication—continually creating an up-to-date data copy at another location—goes a long way toward those goals, keeping data available in the face of many failure scenarios.

Synchronous replication has been around for two decades now. But it is expensive, and it has always been complex, bordering on painful, to set up, maintain, and test. As a result, it has been used only for the most critical applications, and only by large companies with deep pockets. With ActiveCluster, Pure provides synchronous replication at no additional charge, while some other solutions require the purchase of additional software and hardware such as third-site mediator servers or gateway products. ActiveCluster is available in Purity FA 5.0 or greater and is remarkably simple and fast to deploy and use. Customers with earlier versions of Purity FA can upgrade easily and non-disruptively. ActiveCluster extends the opportunity for synchronous replication to any application, and to organizations that don’t have the resources—financial or staffing—that synchronous replication has required in the past.

In addition, ActiveCluster gives organizations the opportunity to actually test their replication and failover, which is something that, in ESG’s experience, is lacking in the vast majority of organizations. With ActiveCluster, workloads run on either site, in true active/active fashion—you can test this by simply running the application from the other site, proving that business will continue despite most failures. Many organizations set up business continuity and disaster recovery solutions, but never test them because of complexity and disruption to production operations. And, sadly, many find the problems and configuration issues only during a disaster, when they cannot access their data. Purity ActiveCluster takes the cost and complexity barriers out of business continuity, making it simple, cost-effective, and provable for any organization—and relieving the minds of IT administrators.

ESG Lab validated that Purity ActiveCluster simplifies metro region data protection, making failover transparent and effortless. Our setup was easy and fast, and the test scenarios proved how ActiveCluster delivers I/O without interruption during various failure scenarios.

In ESG’s view, ActiveCluster makes maximizing uptime simple and cost effective for any application and organization. Given the competitive advantage that maximizing uptime delivers, ActiveCluster is well worth a look.



1. Source: ESG Research Report, The Evolving Business Continuity and Disaster Recovery Landscape, February 2016.
2. Note that it takes 30 seconds for the dashboard to draw the graph lines, so time stamps will not match wall clock time.
3. VMware calls this combination of VMware HA and synchronous SAN replication “vSphere Metro Storage Cluster.” Pure Storage Purity ActiveCluster certification with VMware can be found here.

ESG Validation Reports

The goal of ESG Validation reports is to educate IT professionals about information technology solutions for companies of all types and sizes. ESG Validation reports are not meant to replace the evaluation process that should be conducted before making purchasing decisions, but rather to provide insight into these emerging technologies. Our objectives are to explore some of the more valuable features and functions of IT solutions, show how they can be used to solve real customer problems, and identify any areas needing improvement. The ESG Validation Team’s expert third-party perspective is based on our own hands-on testing as well as on interviews with customers who use these products in production environments.

Topics: Storage