Service-interrupting events can happen at any time. Your network could have an outage, your latest application push might introduce a critical bug, or you might someday have to contend with a natural disaster. When things go awry, it's important to have a robust, targeted, and well-tested disaster recovery (DR) plan.

With a well-designed, well-tested DR plan in place, you can make sure that if catastrophe hits, the impact on your business's bottom line will be minimal. No matter what your DR needs look like, Google Cloud has a robust, flexible, and cost-effective selection of products and features that you can use to build or augment the solution that is right for you.

RTO and RPO

DR is a subset of business continuity planning. DR planning begins with a business impact analysis that defines two key metrics:

  • A recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).

  • A recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost from your application due to a major incident. This metric varies based on the ways that the data is used. For example, user data that's frequently modified could have an RPO of just a few minutes. In contrast, less critical, infrequently modified data could have an RPO of several hours. (This metric describes only the length of time; it doesn't address the amount or quality of the data that's lost.) A sketch of how both targets can be checked follows this list.
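
As a minimal illustration (with hypothetical targets and timestamps, not values this guide prescribes), RTO and RPO can be written down explicitly and every recovery exercise checked against them:

  from datetime import datetime, timedelta

  # Hypothetical targets agreed on in the business impact analysis.
  RTO = timedelta(hours=4)     # maximum acceptable downtime
  RPO = timedelta(minutes=15)  # maximum acceptable window of lost data

  def meets_objectives(outage_start: datetime,
                       service_restored: datetime,
                       last_good_backup: datetime) -> bool:
      """Return True if a recovery exercise met both objectives."""
      downtime = service_restored - outage_start            # compare against RTO
      data_loss_window = outage_start - last_good_backup    # compare against RPO
      return downtime <= RTO and data_loss_window <= RPO

  # A 3-hour outage with a backup taken 10 minutes before the incident
  # meets a 4-hour RTO and a 15-minute RPO.
  print(meets_objectives(
      outage_start=datetime(2024, 6, 1, 9, 0),
      service_restored=datetime(2024, 6, 1, 12, 0),
      last_good_backup=datetime(2024, 6, 1, 8, 50),
  ))  # True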

Typically, the smaller your RTO and RPO values are (that is, the faster your application must recover from an interruption), the more your application costs to run.

Because smaller RTO and RPO values often mean greater complexity, the associated administrative overhead grows in the same way. A high-availability application might require you to manage distribution between two physically separated data centers, manage replication, and more.

SLA & SLO

RTO and RPO values typically roll up into another metric: the service level objective (SLO), which is a key measurable element of an SLA. SLAs and SLOs are often conflated. An SLA is the entire agreement that specifies what service is to be provided, how it is supported, times, locations, costs, performance, penalties, and responsibilities of the parties involved. SLOs are specific, measurable characteristics of the SLA, such as availability, throughput, frequency, response time, or quality. An SLA can contain many SLOs. RTOs and RPOs are measurable and should be considered SLOs.

DR Planning

High availability (HA) doesn't entirely overlap with DR, but it's often necessary to take HA into account when you're thinking about RTO and RPO values. HA helps to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period. When you run production workloads on Google Cloud, you might use a globally distributed system so that if something goes wrong in one region, the application continues to provide service even if it's less widely available. In essence, that application invokes its DR plan.

Traditional DR planning requires you to account for a number of requirements, including the following:

  • Capacity: securing enough resources to scale as needed.

  • Security: providing physical security to protect assets.

  • Network infrastructure: including software components such as firewalls and load balancers.

  • Support: making available skilled technicians to perform maintenance and to address issues.

  • Bandwidth: planning suitable bandwidth for peak load.

  • Facilities: ensuring physical infrastructure, including equipment and power.

Google Cloud offers several features that are relevant to DR planning, including the following:

  • A global network. Google has one of the largest and most advanced computer networks in the world. The Google backbone network uses advanced software-defined networking and edge-caching services to deliver fast, consistent, and scalable performance.

  • Redundancy. Multiple points of presence (PoPs) across the globe mean strong redundancy. Your data is mirrored automatically across storage devices in multiple locations.

  • Scalability. Google Cloud is designed to scale like other Google products (for example, search and Gmail), even when you experience a huge traffic spike. Managed services such as App Engine, Compute Engine autoscalers, and Datastore give you automatic scaling that enables your application to grow and shrink as needed.

  • Security. The Google security model is built on over 15 years of experience with helping to keep customers safe on Google applications like Gmail and Google Workspace. In addition, the site reliability engineering teams at Google help ensure high availability and prevent abuse of platform resources.

  • Compliance. Google undergoes regular independent third-party audits to verify that Google Cloud is in alignment with security, privacy, and compliance regulations and best practices. Google Cloud complies with certifications such as ISO 27001, SOC 2/3, and PCI DSS 3.0.

DR patterns

DR patterns are considered to be cold, warm, or hot. These patterns indicate how readily the system can recover when something goes wrong. An analogy helps: think of what happens when a car gets a flat tire on a road trip.

  • Cold: You have no spare tire, so you must call someone to come to you with a new tire and replace it. Your trip stops until help arrives to make the repair.

  • Warm: You have a spare tire and a replacement kit, so you can get back on the road using what you have in your car. However, you must stop your journey to repair the problem.

  • Hot: You have run-flat tires. You might need to slow down a little, but there is no immediate impact on your journey. Your tires run well enough that you can continue (although you must eventually address the issue).

Recommendations

  • Create a detailed DR plan

    • Design according to your recovery goals:
      Combine your application and data recovery techniques and look at the bigger picture: consider your RTO and RPO values and which DR pattern you can adopt to meet those values.
      Examples:

      • In the case of historical compliance-oriented data, you probably don't need speedy access to the data, so a large RTO value and a cold DR pattern are appropriate.

      • Your email notification system, which typically isn't business critical, is probably a candidate for a warm pattern.

      • If your online service experiences an interruption, you'll want to be able to recover both the data and the customer-facing part of the application as quickly as possible. In that case, a hot pattern would be more appropriate.

    • Design for end-to-end recovery:
      Make sure your DR plan addresses the full recovery process, from backup to restore to cleanup.

    • Make your tasks specific:
      Make each task in your DR plan consist of one or more concrete, unambiguous commands or actions.
      Example:

      • "Run the restore script" is too general. In contrast -> "Open Bash and run /home/example/restore.sh" is precise and concrete

  • Implement control measures

    • Add controls to prevent disasters from occurring and to detect issues before they escalate into disasters.
      Example:

      • Add a monitor that sends an alert when a data-destructive flow, such as a deletion pipeline, exhibits unexpected spikes or other unusual activity. This monitor could also terminate the pipeline processes if a certain deletion threshold is reached, preventing a catastrophic situation. A sketch of such an alert follows this item.
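
        A minimal sketch of such an alert policy, using the google-cloud-monitoring client library; the project ID, the log-based metric named deletion_count, the threshold, and the alignment period are assumptions for illustration:

        from google.cloud import monitoring_v3
        from google.protobuf import duration_pb2

        project_id = "my-dr-project"  # hypothetical project ID

        # Fire when the (assumed) log-based metric counting deletions exceeds
        # 1,000 deletions in a 5-minute window.
        condition = monitoring_v3.AlertPolicy.Condition(
            display_name="Deletion rate above threshold",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type="logging.googleapis.com/user/deletion_count" '
                    'AND resource.type="global"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=1000,
                duration=duration_pb2.Duration(seconds=0),
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period=duration_pb2.Duration(seconds=300),
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    )
                ],
            ),
        )

        policy = monitoring_v3.AlertPolicy(
            display_name="Data-destructive flow spike",
            combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
            conditions=[condition],
        )

        client = monitoring_v3.AlertPolicyServiceClient()
        created = client.create_alert_policy(
            name=f"projects/{project_id}", alert_policy=policy
        )
        print(f"Created alert policy: {created.name}")

        Automatically terminating the pipeline would be a separate control, for example a handler subscribed to the alert's notification channel.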

  • Prepare the software

    • Verify that you can install your software

      • Make sure that your application software can be installed from source or from a preconfigured image.

      • Make sure that needed Compute Engine resources are available in the recovery environment. This might require preallocating instances or reserving them, as sketched below.
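
        A minimal sketch of reserving capacity ahead of time with the google-cloud-compute client library; the project, zone, machine type, and VM count are hypothetical:

        from google.cloud import compute_v1

        project_id = "my-dr-project"   # hypothetical values throughout
        zone = "us-central1-a"

        reservation = compute_v1.Reservation(
            name="dr-recovery-capacity",
            specific_reservation=compute_v1.AllocationSpecificSKUReservation(
                count=4,  # number of VMs to hold in reserve for recovery
                instance_properties=compute_v1.AllocationSpecificSKUAllocationReservedInstanceProperties(
                    machine_type="n2-standard-4",
                ),
            ),
        )

        client = compute_v1.ReservationsClient()
        operation = client.insert(
            project=project_id, zone=zone, reservation_resource=reservation
        )
        operation.result()  # wait until the reservation is created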

    • Design continuous deployment for recovery

      • Consider where in your recovered environment you will deploy artifacts.

      • Plan where you want to host your continuous deployment (CD) environment and artifacts; they need to be available and operational in the event of a disaster.

  • Implement security and compliance controls

    • Configure security the same for the DR and production environments

      • Make sure that your network controls provide the same separation and blocking that the source production environment uses.

      • Make sure to use service accounts as part of the firewall rules, as sketched after this list.

      • Make sure that you grant users the same access to the DR environment that they have in the source production environment.
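
        For the firewall rules above, a minimal sketch with the google-cloud-compute client library; the project, network, and service account names are hypothetical:

        from google.cloud import compute_v1

        project_id = "my-dr-project"  # hypothetical values throughout
        network = f"projects/{project_id}/global/networks/dr-vpc"

        firewall = compute_v1.Firewall(
            name="allow-frontend-to-backend",
            network=network,
            direction="INGRESS",
            allowed=[compute_v1.Allowed(I_p_protocol="tcp", ports=["443"])],
            # Identify traffic by service account instead of by IP range, so the
            # DR network reproduces the same separation as production.
            source_service_accounts=[f"frontend@{project_id}.iam.gserviceaccount.com"],
            target_service_accounts=[f"backend@{project_id}.iam.gserviceaccount.com"],
        )

        client = compute_v1.FirewallsClient()
        client.insert(project=project_id, firewall_resource=firewall).result()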

    • Verify your DR security

      • After you've configured permissions for the DR environment, make sure that you test everything.

      • Verify that the access that you grant users confers the same permissions that the users are granted in the source production environment; one way to compare project-level IAM bindings is sketched below.
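
        A minimal sketch with the google-cloud-resource-manager client library, assuming the source production environment also runs on Google Cloud; the project IDs are hypothetical, and bindings granted at the folder or organization level need the same comparison:

        from google.cloud import resourcemanager_v3

        client = resourcemanager_v3.ProjectsClient()

        def project_bindings(project_id: str) -> set[tuple[str, str]]:
            """Return a project's IAM bindings as (role, member) pairs."""
            policy = client.get_iam_policy(resource=f"projects/{project_id}")
            return {(b.role, m) for b in policy.bindings for m in b.members}

        prod = project_bindings("example-prod")  # hypothetical project IDs
        dr = project_bindings("example-dr")

        for role, member in sorted(prod - dr):
            print(f"Missing in DR: {member} lacks {role}")
        for role, member in sorted(dr - prod):
            print(f"Extra in DR:   {member} has {role}")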

    • Make sure users can log in to the DR environment

      • Make sure that you have granted appropriate access rights to users, developers, operators, data scientists, security administrators, network administrators, and any other roles in your organization.

    • Make sure that the DR environment meets compliance requirements

      • Verify that access to your DR environment is restricted to only those who need access.

      • Make sure that while your DR environment is in service, any logs that you collect are backfilled into the log archive of your production environment.

      • Make sure that as part of your DR environment, you can export audit logs that are collected through Cloud Logging to your main log sink archive, as sketched below.
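
        A minimal sketch with the google-cloud-logging client library; the DR project ID, sink name, filter, and destination bucket are assumptions for illustration:

        from google.cloud import logging

        client = logging.Client(project="my-dr-project")  # hypothetical DR project

        sink = client.sink(
            "dr-audit-export",
            # Route the DR project's audit logs to the central archive bucket.
            filter_='logName:"cloudaudit.googleapis.com"',
            destination="storage.googleapis.com/central-audit-archive",
        )

        if not sink.exists():
            sink.create()
        # After creating the sink, grant its writer identity write access to the
        # destination bucket so that exported entries can be delivered.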

    • Use Cloud Storage as part of your daily backup routines

      • Use Cloud Storage to store backups, and make sure that the buckets that contain your backups have appropriate permissions applied to them, as sketched below.
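
        A minimal sketch with the google-cloud-storage client library; the project, bucket, group, and file names are hypothetical:

        from google.cloud import storage

        client = storage.Client(project="my-dr-project")  # hypothetical project

        # Create the backup bucket, then manage access only through IAM
        # (uniform bucket-level access) rather than per-object ACLs.
        bucket = client.create_bucket("example-dr-backups", location="us-central1")
        bucket.iam_configuration.uniform_bucket_level_access_enabled = True
        bucket.patch()

        # Grant only the backup operators group permission to write backup objects.
        policy = bucket.get_iam_policy(requested_policy_version=3)
        policy.bindings.append(
            {
                "role": "roles/storage.objectCreator",
                "members": {"group:backup-operators@example.com"},
            }
        )
        bucket.set_iam_policy(policy)

        # Upload a nightly database dump as part of the daily backup routine.
        blob = bucket.blob("nightly/db-backup.dump")
        blob.upload_from_filename("/backups/db-backup.dump")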

    • Manage secrets properly

      • Manage application-level secrets and keys by using a key and secret management service hosted on Google Cloud.
        You can use Cloud KMS or a third-party solution like HashiCorp Vault with a Google Cloud backend such as Spanner or Cloud Storage. A Cloud KMS sketch follows.
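
        As a minimal sketch with the google-cloud-kms client library (the project, location, key ring, and key names are hypothetical), application-level secrets can be encrypted before they're stored with your backups and decrypted during recovery:

        from google.cloud import kms

        client = kms.KeyManagementServiceClient()

        # Hypothetical key resource used to protect application secrets.
        key_name = client.crypto_key_path(
            "my-dr-project", "us-central1", "dr-keyring", "app-secrets"
        )

        # Encrypt a secret before storing it alongside the backups.
        secret = b"db-connection-string"
        encrypted = client.encrypt(request={"name": key_name, "plaintext": secret})

        # During recovery, decrypt it again with the same key.
        decrypted = client.decrypt(
            request={"name": key_name, "ciphertext": encrypted.ciphertext}
        )
        assert decrypted.plaintext == secret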

    • Treat recovered data like production data

      • Make sure that the security controls that you apply to your production data also apply to your recovered data: the same permissions, encryption, and audit requirements should all apply.

      • Make sure that your recovery process is auditable: after a disaster recovery, make sure that you can show who had access to the backup data and who performed the recovery. One way to query for that access is sketched below.
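
        A minimal sketch with the google-cloud-logging client library; the project, bucket name, and time window are hypothetical, and Data Access audit logs for Cloud Storage must be enabled for these entries to exist:

        from google.cloud import logging

        client = logging.Client(project="my-dr-project")  # hypothetical project

        # Data Access audit log entries for the backup bucket; each entry's
        # payload includes the caller's principalEmail and the method invoked.
        audit_filter = (
            'logName="projects/my-dr-project/logs/'
            'cloudaudit.googleapis.com%2Fdata_access" '
            'AND resource.type="gcs_bucket" '
            'AND resource.labels.bucket_name="example-dr-backups" '
            'AND timestamp>="2024-06-01T00:00:00Z"'
        )

        for entry in client.list_entries(filter_=audit_filter):
            print(entry.timestamp, entry.payload)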

  • Make sure your DR plan works

    • Maintain more than one data recovery path

      • In the event of a disaster, your connection method to Google Cloud might become unavailable. Implement an alternative means of access to Google Cloud to help ensure that you can transfer data to Google Cloud. Regularly test that the backup path is operational.

    • Test your plan regularly

      • Automate infrastructure provisioning with Deployment Manager.

      • Monitor and debug your tests with Cloud Logging and Cloud Monitoring.