There tends to be a misconception that just because an application is running in the cloud, you instantly get business continuity, but that couldn’t be further from the truth.
As with any new technology, the fact is that when you are hosting an application in the cloud, you need to be sure that your team is asking the right questions and investigating the business continuity capabilities during the move.
But what does it all mean, anyway?
Business continuity is described by TechTarget as “the processes and procedures an organization puts in place to ensure that essential functions can continue during and after a disaster. Business continuance planning seeks to prevent interruption of mission-critical services, and to reestablish full functioning as swiftly and smoothly as possible.”
In other words, there is no fixed component that makes up business continuity. Rather, it is more of a continuum.
The business continuity continuum spans from having a backup disk stored in the data center with your application all the way to having the application always on and spread across multiple geographies, so that if one area has a major outage or worse, the application just keeps on running. You can provide many different levels of continuity and protection for your application and, depending on your business and application, you may choose all or some levels of protection.
Backup and Recovery
Backup and recovery isn’t as black and white as it seems. You can back up your systems in multiple ways, and you can store the backups on different types of media and in different places. Additionally, the database may be backed up very differently from the application.
Let’s take a high-level view of backup and what you need to look for when evaluating solutions.
- On-site: Storing the backup in the same place as your servers are running. This provides quick recovery times, but puts the backup at risk if there were an issue with the building itself.
- Off-site: Storing the backup in a different location than the servers.
- Tape: The information about your applications and databases, including the data, is stored on tapes. This is a somewhat older technology and can be slower to recover from, but it is generally lower cost and great for archiving.
- Disk: The information is stored on a disk. The type of disk can be determined based on your requirements for speed and recovery times. Additionally, disk can be much less expensive if you have a lot of data since you can have many disks living in a single pod.
When considering off-site backup and disaster recovery sites, this is a good rule of thumb for the distance between your production and secondary sites:
- Hurricanes: 105 miles distance
- Volcanoes: More than 70 miles distance
- Floods: More than 40 miles distance
- Power Grid Failure: More than 20 miles distance, or shorter if generators are in place
- Tornadoes: Around 10 miles distance
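The rule of thumb above can be turned into a quick check. This is a minimal sketch, assuming you know the latitude and longitude of both sites and which threats apply to your region; the function names and the haversine distance calculation are illustrative choices, not part of any vendor tool.

```python
import math

# Rule-of-thumb minimum separations in miles, taken from the list above.
# The power grid figure can be shorter if generators are in place.
MIN_DISTANCE_MILES = {
    "hurricane": 105,
    "volcano": 70,
    "flood": 40,
    "power_grid": 20,
    "tornado": 10,
}

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def required_distance(threats):
    """The largest minimum separation among the threats that apply."""
    return max(MIN_DISTANCE_MILES[t] for t in threats)


def site_is_far_enough(primary, secondary, threats):
    """primary and secondary are (latitude, longitude) tuples."""
    distance = haversine_miles(*primary, *secondary)
    return distance >= required_distance(threats)
```

For example, `site_is_far_enough((40.7128, -74.0060), (39.9526, -75.1652), ["flood", "tornado"])` checks a New York to Philadelphia pairing against the flood and tornado guidelines. The strictest threat in the list drives the answer, which is why the function takes the maximum.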
Backup is not a disaster recovery solution, however. Backup is used to restore data or applications to servers in the event of deletion or corruption, or to produce data during a legal event. Disaster recovery will be covered in more depth later in the article, but think of it this way: backup is a single point in time, while disaster recovery is a synchronizing process.
So if you synchronize a corruption or a bad upgrade, both sites end up corrupt or failing. A backup, by contrast, has multiple points in time and is taken on a scheduled basis, so you can recover to different points and roll back to a state prior to the corruption.
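To make the point-in-time idea concrete, here is a minimal sketch of choosing a restore point. It assumes you have a list of backup timestamps and an estimate of when the corruption happened; `latest_clean_backup` is a hypothetical helper, not a feature of any particular backup product.

```python
from datetime import datetime


def latest_clean_backup(backup_times, corruption_time):
    """Return the most recent backup taken strictly before the corruption,
    or None if every backup was taken after it."""
    candidates = [t for t in backup_times if t < corruption_time]
    return max(candidates) if candidates else None


# Nightly backups at 02:00 on three consecutive days (made-up dates).
backups = [
    datetime(2024, 3, 1, 2, 0),
    datetime(2024, 3, 2, 2, 0),
    datetime(2024, 3, 3, 2, 0),
]

# Corruption occurred on the afternoon of March 2, so the March 3 backup
# already contains it and the March 2 backup is the one to restore.
restore_point = latest_clean_backup(backups, datetime(2024, 3, 2, 15, 0))
```

A synchronized disaster recovery site has no equivalent of this choice, because it only holds the current state.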
Single Site High Availability
High availability is pretty much what it seems: having a system or component that is continuously operational for a desirably long length of time. Availability is measured relative to "100 percent operational" or "never failing." Generally, when measuring high availability of a data center, you look at what is called the "five 9s" (99.999 percent) availability, although many providers today will guarantee 100 percent.
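It helps to translate an availability percentage into the downtime it actually allows. The sketch below is simple arithmetic under the assumption of a 365-day year; at five 9s, it works out to roughly 5.26 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% availability allows about {minutes:.2f} minutes per year")
```

Each additional 9 cuts the allowed downtime by a factor of ten, which is a useful way to compare vendor guarantees against how long you can really afford to be down.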
More specifically, single site high availability means having everything in a single building or location. The servers or application may be available across multiple data centers but within a single location. This can be very important when it comes to hardware failures or even cable issues.
Some public cloud vendors offer single site high availability right out of the gate, but with others, you need to be sure to architect for high availability. High availability in a public cloud scenario generally requires that the vendor keeps the storage separated from the compute. This way, if a cloud server were to fail, your only downtime would be until another server is booted up using your information stored elsewhere.
If the storage is part of the server, then a backup would need to be used, costing your business significant time and possibly money.
In a private cloud scenario, you need to be sure that you are building for high availability. This means that you have multiple servers available to handle the load, so if one goes down, you can flip to a second, and that, again, the data is either not stored locally or is replicated across the servers.
Load balancing can be quite valuable in this scenario as well. With proper load balancing, you can send some or all traffic to one of your servers, and if that server is seen as not available, the traffic will be routed to another available server, maintaining 100 percent availability. In scenarios where your public cloud vendor requires multiple cloud instances or virtual machines for high availability, the load balancer would be used as well.
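The failover behavior described here can be sketched in a few lines. This is a toy, in-memory illustration, assuming a simple round-robin policy and that some external health check calls `mark_down` and `mark_up`; the `LoadBalancer` class and its method names are hypothetical and do not reflect any specific product.

```python
class LoadBalancer:
    """Round-robin load balancer that skips servers marked unavailable."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        """Called by a health check when a server stops responding."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Called by a health check when a known server recovers."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = LoadBalancer(["app-1", "app-2"])
lb.mark_down("app-2")   # simulate a failed health check
target = lb.route()     # traffic goes to app-1 only
```

Note that the balancer itself becomes a single point of failure unless it is also made redundant, which is something to ask a vendor about.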
Disaster Recovery
Disaster recovery provides a secondary site where you have servers and storage available to “flip” to in the event of a significant outage or disaster. You need to be conscious of the fact that the disaster recovery site is not there for a short blip in service, but for true failures occurring over a longer period of time, mainly because failing back isn’t trivial, but also because the failover itself requires technical intervention and time. If anyone says they have one-click failover, make sure you start reading the fine print.
Depending on where your production environment exists, you may have several choices as to where to put your disaster recovery site. If production is in-house, you may select another owned data center, co-locate equipment in a 3rd party data center, or leverage a vendor that provides Disaster Recovery as a Service (DRaaS).
If your production environment resides with a 3rd party cloud vendor, for example, you don’t have to be locked into that vendor for your disaster recovery site. You actually have the option to leverage another vendor if you feel that you don’t want to put all of your eggs in that one basket. What you must do, however, is ensure that your production vendor’s disaster recovery offering fits within your distance guidelines for the separation between data centers.
Always On
A system that is always on sounds like a myth, but it is absolutely possible and achievable. Are there scenarios, even with always on, in which you could have an issue and go down? Yes, but they become highly unlikely.
Still, always on is expensive for a reason, and requires changes to your infrastructure, database and application architectures.
When building out an always on architecture, you are designing the application and database to live in multiple places, again using at least your minimum distance requirements. The application and databases are replicated across geographies in near real-time, or the application and database know how to handle data that is not in sync. You will generally have multiple read databases for the application to use, and a write database that posts to the read ones.
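The write database and read databases arrangement can be illustrated with a toy, in-memory model. This is a sketch under heavy assumptions: plain dictionaries stand in for databases, `replicate` stands in for whatever replication mechanism your database provides, and the `ReplicatedStore` name is made up for this example.

```python
class ReplicatedStore:
    """One write database that posts its changes to several read databases."""

    def __init__(self, replica_count=2):
        # replica_count is assumed to be at least 1.
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._pending = []       # writes not yet applied to the replicas
        self._next_replica = 0   # round-robin pointer for reads

    def write(self, key, value):
        """All writes go to the primary and are queued for replication."""
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        """Apply queued writes to every replica, in order."""
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()

    def read(self, key):
        """Reads are spread across the replicas and may return stale data."""
        replica = self.replicas[self._next_replica]
        self._next_replica = (self._next_replica + 1) % len(self.replicas)
        return replica.get(key)


store = ReplicatedStore()
store.write("order-42", "paid")
before = store.read("order-42")   # None: the replicas have not caught up yet
store.replicate()
after = store.read("order-42")    # "paid" once replication has run
```

The gap between `before` and `after` is the out-of-sync window that the application has to tolerate when replication is only near real-time.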
So, what makes it always on? Load balancing is done at the internet level (simplifying a little here) and traffic is sent to one or more locations before it ever gets to your servers. If a location is seen as not available, then the traffic is sent to a location that is available.
Once the unavailable location comes back online, traffic can be routed there again, providing a system that is always on. Of course, your application and database need to be able to handle all of this, which isn’t simple.
There are other aspects that I would consider part of business continuity, such as security, compliance and anything else that may impact your ability to successfully do business, and I will cover these in future articles. The items outlined in this article focus on the traditional sense of keeping applications up and running.
To ensure that you are able to do business and that your systems can handle change, it is important to have a business continuity plan. The plan must look at how long you can be without each application, how well your architecture can support the plan and what your budget is for this “insurance” plan.