It’s been nearly a month since my initial post, so I wanted to take some time to highlight the importance of backups.

I bring this up because I recently had to restore my router to a known good state due to a misconfiguration that disrupted my home network. My router had locked up, and despite multiple attempts to restart and reconnect, nothing worked. Ultimately, I had to boot it into factory mode and restore from a backup. Fortunately, this downtime happened late at night and lasted only about 15 minutes.

Now that the introduction is out of the way, let’s dive into why backups are essential and how we should be managing them properly. Broadly, we can break the reasons for backups into two categories: accidental and intentional.

As previously mentioned, accidental issues are one of the main reasons to maintain reliable backups. Even the most skilled administrators, despite their best intentions, can make mistakes that negatively impact critical business operations.

Beyond accidental errors, there’s also the risk of intentional harm caused by malicious insiders or external attackers. Ransomware is a prime example, as it can encrypt valuable data and demand payment for its release. A disgruntled employee deliberately ignoring an alarm is another good example.

To protect against these risks, backups should be stored in a secure location, preferably offsite, and should be immutable, meaning they cannot be modified or deleted by anyone. Backups should follow a predetermined schedule, which should be established based on a Business Impact Analysis (BIA). This assessment helps identify the company’s most critical processes and assets, ensuring that backup frequency aligns with business needs.

For example, a system used for general marketing might not be severely affected if its backup is a week old. However, a system handling financial transactions requires frequent backups. Consider a company like Amazon, where an estimated 12 million transactions occur daily. A single hour of downtime could result in the loss of approximately 500,000 transactions.
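To make that mapping concrete, here’s a minimal sketch of how BIA criticality tiers might translate into backup intervals. The tier names and intervals are purely illustrative assumptions, not a standard:

```python
# Hypothetical sketch: mapping BIA criticality tiers to backup intervals.
# Tier names and intervals are illustrative examples only.
from datetime import timedelta

BACKUP_INTERVALS = {
    "critical": timedelta(minutes=15),  # e.g. financial transaction systems
    "high":     timedelta(hours=4),
    "moderate": timedelta(days=1),
    "low":      timedelta(weeks=1),     # e.g. general marketing content
}

def backup_interval(tier: str) -> timedelta:
    """Return the backup interval assigned to a BIA criticality tier."""
    return BACKUP_INTERVALS[tier]

print(backup_interval("critical"))  # 0:15:00
```

In practice the intervals would come out of the BIA itself rather than a hardcoded table, but the idea is the same: the more critical the process, the shorter the window between backups.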

To determine the appropriate backup frequency and recovery times, organizations rely on key metrics:

  • Recovery Point Objective (RPO): The maximum amount of data, measured in time, a company can afford to lose without severely impacting operations.
  • Service Delivery Objective (SDO): The minimum level of service that must be maintained during an outage.
  • Recovery Time Objective (RTO): The targeted duration for restoring operations after an outage.
  • Maximum Tolerable Downtime (MTD): The longest amount of time a business can withstand an outage before experiencing irreversible damage.

In our previous Amazon example, leadership could determine that the business can afford to lose up to 50,000 transactions, which equates to approximately 6 minutes of downtime. This means Amazon’s RPO should be no more than 6 minutes to ensure minimal data loss.
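The arithmetic behind that 6-minute figure is easy to check. A quick sketch, using the illustrative numbers from the example above (not real Amazon data):

```python
# Back-of-the-envelope RPO calculation using the example's figures.
DAILY_TRANSACTIONS = 12_000_000  # estimated transactions per day
MAX_TOLERABLE_LOSS = 50_000      # transactions leadership can accept losing

# Transactions per minute, assuming an even rate across the day.
per_minute = DAILY_TRANSACTIONS / (24 * 60)

# Maximum data-loss window, i.e. the RPO in minutes.
rpo_minutes = MAX_TOLERABLE_LOSS / per_minute

print(f"{per_minute:.0f} transactions/min, RPO = {rpo_minutes:.0f} minutes")
# 8333 transactions/min, RPO = 6 minutes
```

Real transaction volume isn’t evenly distributed across the day, so an actual RPO would be set against peak rates, but the even-rate assumption keeps the example simple.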

Meanwhile, to meet their SDO, they might initiate other processes, such as diverting traffic to a failover facility that can handle a limited number of transactions but cannot sustain the normal volume.

The RTO is defined by the business as how quickly services must be restored to avoid extended disruption. The RTO should be set so that recovery completes before reaching the MTD.

Finally, the Maximum Tolerable Downtime (MTD) would be a longer duration, representing the absolute limit Amazon can sustain before the outage causes severe business consequences. If the affected business process remains down beyond the MTD, the company would face significant operational and financial risks.

Backups are essential, and while this post covered key concepts related to one aspect of them, there are many other factors to consider. These include encrypting backups using strong encryption protocols, securing decryption keys, storing backups off-site, and utilizing the grandfather-father-son method. Perhaps I’ll dive into those topics another time.

Well that’s it for now, thanks for stopping by!