Dealing with Availability and Recoverability requirements in a VCDX design

07. January 2020 | Study, VCDX

I’m going to take a big risk here by trying to address the topics of Availability and Recoverability in a VCDX design. This is arguably both the most important and the most challenging topic of any VCDX design. I would say VCDX candidates get grilled the most on meeting SLAs, RPOs and RTOs during the mocks and the actual defense. My own panels also challenged me quite hard during both my defenses (DCV and NV). In this blog post, I will explain the different architectural concepts surrounding Availability and Recoverability in a VCDX design and share some practical tips for dealing with these topics in your own design.

AMPRS Design Qualities

Looking at the VCDX blueprint, it becomes clear that a VCDX design needs to include non-functional requirements or design qualities such as: 

  • Availability
  • Manageability
  • Performance
  • Recoverability
  • Security

These design qualities are pretty much always non-functional requirements because they state HOW the system needs to behave (how fast, how secure, how recoverable, and so on). Functional requirements are all about WHAT a system should do (deliver self-service capabilities to the business, enable a consistent hybrid cloud infrastructure, deliver a single pane of glass for management, and so on).

So, the design qualities I’m addressing in this blogpost are Availability and Recoverability in a VCDX design. Both are closely related to each other and are quite often mixed up, but there is a clear distinction!

  • Availability: how you are ensuring the system stays up and running
  • Recoverability: how you are ensuring the system can recover from failure

Recoverability: RPO and RTO

Every design includes, or should include :-), RPO and RTO requirements. The Recovery Point Objective (RPO) dictates how much data loss you can tolerate, expressed as a period of time (most often in hours). The Recovery Time Objective (RTO) describes how much time you allow for a failed system to be fully up and running again. In a visual representation, this looks something like this:

Source: https://blogs.vmware.com/virtualblocks/2018/09/30/more-srm-and-vsphere-replication-faqs/
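
To make the two definitions concrete, here is a minimal sketch that checks a single incident against its RPO and RTO. The function name, the timestamps and the targets are all made up for illustration; they do not come from a real SLA.

```python
from datetime import datetime, timedelta

def check_recovery(last_recovery_point, failure, service_restored, rpo, rto):
    """Compare one incident against its RPO and RTO targets."""
    # Everything written after the last good copy is lost.
    data_loss = failure - last_recovery_point
    # The service is unavailable from the failure until it is restored.
    downtime = service_restored - failure
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }

# Hypothetical incident: nightly backup at 02:00, failure at 09:30,
# service back at 12:00, with assumed targets of RPO 24 hours and RTO 2 hours.
result = check_recovery(
    last_recovery_point=datetime(2020, 1, 7, 2, 0),
    failure=datetime(2020, 1, 7, 9, 30),
    service_restored=datetime(2020, 1, 7, 12, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=2),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this made-up incident the 7.5 hours of lost data are well within the RPO, but the 2.5 hours of downtime exceed the RTO. The two objectives are measured independently and you can meet one while missing the other.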

RPO and RTO and the Service Level Agreement (SLA)

In an ideal world, RPO and RTO times are specified by the consumer of your platform (i.e. “the business”) and formally documented in a Service Level Agreement (SLA).

In smaller organizations, especially ones that are not formally regulated, you are often met with blank stares when you ask the business for RPO and RTO times for your SDDC or cloud solution. There is no SLA in place whatsoever. As an architect, you need to explain these KPIs and come to an agreement with the business. You also need to document the agreed values in your conceptual design and ensure they get formally signed off.

In larger enterprises, there often is a risk analysis process in place, called a Business Impact Analysis (BIA). A BIA quantifies the business impact of all kinds of risks (including IT risks) in terms of Confidentiality, Integrity and Availability, often using a big questionnaire in an Excel sheet. The outcome of the BIA often is a Confidentiality, Integrity and Availability (CIA) rating, for example C1 I1 A3. Different SLA levels often map to these CIA ratings. An A3 rating could map to an SLA with an RPO of 0 and an RTO of 15 minutes, for example. An A1 rating could map to an SLA with an RPO of 24 hours and an RTO of 48 hours, and so on. This obviously varies per organization.
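
As a rough sketch, such a mapping is nothing more than a lookup table. The one below only contains the two example ratings mentioned above; the table name, the function name and the way the rating string is parsed are assumptions for illustration, and a real organization would define its own levels.

```python
from datetime import timedelta

# Hypothetical mapping of the availability part of a CIA rating to SLA targets.
SLA_BY_AVAILABILITY_RATING = {
    "A3": {"rpo": timedelta(0), "rto": timedelta(minutes=15)},
    "A1": {"rpo": timedelta(hours=24), "rto": timedelta(hours=48)},
}

def sla_for(cia_rating):
    """Return the RPO/RTO targets for a rating string such as 'C1 I1 A3'."""
    # Pick the availability component, which is the part starting with 'A'.
    availability = next(p for p in cia_rating.split() if p.startswith("A"))
    return SLA_BY_AVAILABILITY_RATING[availability]

targets = sla_for("C1 I1 A3")
print(targets["rpo"], targets["rto"])  # 0:00:00 0:15:00
```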

If these kinds of processes are in place, or if there is at least an SLA in place, you are in luck as an architect. If not, you will need to work with the business to discover which RPO and RTO times you can both agree on. You have a responsibility as an architect here! I have seen many situations in which an architect simply asks for RPO and RTO times in a design session, gets no useful response and leaves it at that. In my opinion you need to explain the concepts and do discovery via Q&A and by walking through the different scenarios. It will bite you (or the business) at some point if you don’t.

Everything comes at a cost

A pitfall I have seen many times is not including the budget owner for the project in these discussions. Every project has a budget constraint. Period. If you are discussing RPO and RTO requirements, you need to keep in mind that there always is a financial tradeoff. Up to what cost can you justify mitigating a risk? Every organization wants a solution with near-zero RPO and RTO and five nines of uptime … but this can turn out to be extremely expensive. If the business can tolerate a downtime of 4 hours, there is no point in spending excessive amounts of money on a near-zero downtime solution. A good design is also a cost-effective and responsible design!

Tips when dealing with RPO and RTO times

  • You first need to ask yourself how you are going to measure the RPO and RTO times and to which object(s) they apply. Are you guaranteeing an RPO and RTO on a VM, on the application, on the SDDC, on the cluster, etc.? What is your “unit of measure” and is it within your span of control? If you are not dealing with the application itself, how can you guarantee the uptime of the application as a sysadmin? If other people have administrative access to the operating system, how can you take responsibility for the uptime of the OS?
  • Can you guarantee the RPO in all scenarios? Running a stretched cluster with synchronous data replication guarantees an RPO of 0 if something in the physical layer fails (a host or site for example), but if there is a massive ransomware attack on your system, your stretched cluster will do nothing to protect your data. The encrypted data is replicated just as faithfully, so your data will be gone and you will likely need to resort to your last backup. You always need to be aware of what you are protecting against with a certain measure.
  • You also need to align the RPO and RTO times. Let’s look at the ransomware example again and imagine you need to restore 4 PB of data from your backup solution. Is the throughput of your recovery fast enough to finish the restore within the RTO timeframe? If you promised an RTO of 1 hour, you are not going to meet your SLA.
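
The restore example in the last tip is easy to check with some quick arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a purely hypothetical sustained restore throughput of 10 GB/s; plug in the numbers of your own backup solution.

```python
PB = 10**15  # decimal petabyte, an assumption for this sketch
GB = 10**9   # decimal gigabyte

def required_throughput_gb_per_s(data_bytes, rto_seconds):
    """Sustained throughput (GB/s) needed to restore all data within the RTO."""
    return data_bytes / rto_seconds / GB

def restore_hours(data_bytes, throughput_bytes_per_s):
    """Hours needed to restore all data at a given sustained throughput."""
    return data_bytes / throughput_bytes_per_s / 3600

# 4 PB within an RTO of 1 hour
needed = required_throughput_gb_per_s(4 * PB, rto_seconds=3600)
# 4 PB at an assumed 10 GB/s
hours = restore_hours(4 * PB, throughput_bytes_per_s=10 * GB)

print(f"needed: {needed:.0f} GB/s")   # needed: 1111 GB/s
print(f"restore: {hours:.0f} hours")  # restore: 111 hours
```

Even at the assumed 10 GB/s the restore takes more than four days. This kind of back-of-the-envelope calculation shows quickly whether an RTO is realistic for a full restore.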

(Mock) panelists often try to find these kinds of gaps in your design. They look at your RPO and RTO times, number of VMs, amount of data, do some quick calculations in certain failure scenarios and boom! They drop a huge bomb by saying they don’t believe you can guarantee your SLA :-).

Availability: Uptime SLA %

Availability is often reported as an uptime percentage or “number of nines”. It is extremely important to align the measured time period with your business. 99.9% per year is something entirely different from 99.9% per month:

Availability %                | Downtime per year | Downtime per month | Downtime per week | Downtime per day
99% (“two nines”)             | 3.65 days         | 7.31 hours         | 1.68 hours        | 14.40 minutes
99.5% (“two nines five”)      | 1.83 days         | 3.65 hours         | 50.40 minutes     | 7.20 minutes
99.8%                         | 17.53 hours       | 87.66 minutes      | 20.16 minutes     | 2.88 minutes
99.9% (“three nines”)         | 8.77 hours        | 43.83 minutes      | 10.08 minutes     | 1.44 minutes
99.95% (“three nines five”)   | 4.38 hours        | 21.92 minutes      | 5.04 minutes      | 43.20 seconds
99.99% (“four nines”)         | 52.60 minutes     | 4.38 minutes       | 1.01 minutes      | 8.64 seconds

Source: https://en.wikipedia.org/wiki/High_availability
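
If you want to reproduce these numbers for your own SLA percentages, the calculation is simple. This is a minimal sketch that assumes a year of 365.25 days and a month of one twelfth of that, which matches the figures in the table; the names are my own.

```python
from datetime import timedelta

# Measurement periods; a month is assumed to be 1/12 of a 365.25-day year.
PERIODS = {
    "year": timedelta(days=365.25),
    "month": timedelta(days=365.25 / 12),
    "week": timedelta(days=7),
    "day": timedelta(days=1),
}

def downtime_budget(availability_pct, period):
    """Allowed downtime for an availability percentage over a period."""
    return PERIODS[period] * (1 - availability_pct / 100)

per_year = downtime_budget(99.9, "year")
per_month = downtime_budget(99.9, "month")

print(f"{per_year.total_seconds() / 3600:.2f} hours per year")   # 8.77 hours per year
print(f"{per_month.total_seconds() / 60:.2f} minutes per month")  # 43.83 minutes per month
```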

99.9% per year means your system could be down for 8.77 consecutive hours and you would still be performing within the agreed SLA. I once came across a company that outsourced their IT. The outsourcing company measured the number of nines per year while the internal IT organization was reporting a number of nines per month to the business. This misalignment caused some serious discussion when the system was down one day for almost 8 hours. According to the outsourcing company, they were still performing within their SLA, but according to the customer only 4 hours of downtime were allowed.

Aligning your number of nines with the RTO time

It’s also very important to align the uptime % with your RTO. Imagine you have an uptime of 99.5% per month. This comes down to 3.65 hours of allowed downtime per month. Let’s say you have an RTO of 1 hour. This means you can tolerate a maximum of 3 outages each month that each last the full hour. Is that reasonable / feasible? If an entire site goes down, can you recover all your workloads within that timeframe? This is also a quick calculation (mock) panelists often make in their effort to find a loophole in your design.
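
The calculation above can be sketched in a few lines. The function name is hypothetical and the month length (one twelfth of a 365.25-day year) is an assumption; the worst case is that every outage takes the full RTO to recover from.

```python
import math
from datetime import timedelta

def max_full_outages(availability_pct, period, rto):
    """How many outages lasting the full RTO fit into the downtime budget."""
    budget = period * (1 - availability_pct / 100)
    return math.floor(budget / rto)

month = timedelta(days=365.25 / 12)
outages = max_full_outages(99.5, month, rto=timedelta(hours=1))
print(outages)  # 3
```

With a 4-hour RTO the same SLA leaves room for zero full-length outages per month, which is a clear sign that the uptime % and the RTO are not aligned.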

Not all workloads are equal

Availability SLAs often come with different tiers like Gold, Silver and Bronze or Tier I, II and III. This allows you to differentiate in your solution and have different measures in place. Especially within an SDDC it is perfectly possible to deliver different SLAs to workloads running within that SDDC. For example, temporary test systems typically don’t need PFTT and SFTT redundancy, so why replicate the data multiple times across your vSAN cluster? Systems that have data replication built in, like databases or MS Exchange, might not need redundancy at the storage level. Modern applications running stateless across sites behind a GSLB might not need a site recovery failover solution at the infrastructure level. These are all factors you need to weigh in your design when you are putting measures in place to meet availability and recoverability requirements.

Down the rabbit hole(?)

This is such a broad topic. I tried to deliver a structured and concise overview of Availability and Recoverability in a VCDX design, but it is quite easy to get lost down the rabbit hole. Each topic uncovers a new area and before you know it I’m 10 pages in and still going. I hope this was helpful and I’m curious about your feedback. If there are any other topics you want me to dive into, please drop a note in the comments or let me know on Twitter.

PS. I’m also going to record an ITQ Lightboard Video with Johan van Amersfoort (VCDX 238) on this topic in a couple of weeks. Keep an eye on our YouTube channel!

