How Google keeps Cloud Storage 99.999999999% durable
Syah Ismail, 2021-07-14
For any cloud storage solution, one of the most fundamental aspects is durability, which is how well the data is protected from loss or corruption. Google Cloud Storage has been designed for at least 99.999999999% annual durability, or 11 nines. That means that even with one billion objects, you would likely go a hundred years without losing a single one.
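The "hundred years" figure can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope model: it assumes every object is lost independently with an annual probability of 1e-11 (our reading of "11 nines"), which real systems only approximate. The function names are made up for illustration.

```python
from fractions import Fraction

# Assumption: every object is lost independently with the same annual
# probability; "11 nines" is read as a loss probability of 1e-11 per year.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_losses_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years per single expected object loss."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    billion = 10**9
    print(float(expected_losses_per_year(billion)))  # 0.01
    print(float(years_per_expected_loss(billion)))   # 100.0
```

With one billion objects, the model expects one hundredth of an object lost per year, or one loss per century on average.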
In this post, we’ll explore the ways Google protects Cloud Storage data. At the same time, data protection is ultimately a shared responsibility: the most common cause of data loss is accidental deletion by a user or storage administrator. So, alongside Google’s own safeguards, this post covers best practices to help protect your data against risks like natural disasters and user errors.
Physical durability
Most people think about durability in the context of protecting against network, server and storage hardware failures. For Google, software is ultimately the best way to protect against hardware failures. This allows Google to attain higher reliability at an attractive cost, instead of depending on exotic hardware solutions. Hardware will fail all the time but that doesn’t mean durability has to suffer.
To store an object in Cloud Storage, Google breaks it up into a number of ‘data chunks’, which are placed on different servers with different power sources. Google also creates a number of ‘code chunks’ for redundancy. In the event of a hardware failure, Google uses the surviving data and code chunks to reconstruct the entire object. This technique is called erasure coding. In addition, Google stores several copies of the metadata needed to find and read the object, so that if one or more metadata servers fail, the object remains accessible.
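To make the idea of data chunks and code chunks concrete, here is a minimal sketch of the simplest possible erasure code: a single XOR parity chunk that can rebuild any one lost chunk. This is not the encoding Cloud Storage uses (the post does not specify it, and production systems use stronger codes that survive several simultaneous losses); the `encode` and `reconstruct` names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, data_chunk_count: int) -> List[bytes]:
    """Split data into equal-size data chunks and append one XOR code chunk."""
    chunk_size = -(-len(data) // data_chunk_count)  # ceiling division
    padded = data.ljust(chunk_size * data_chunk_count, b"\0")
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size]
              for i in range(data_chunk_count)]
    chunks.append(reduce(xor_bytes, chunks))  # the single code chunk
    return chunks


def reconstruct(chunks: List[Optional[bytes]], original_length: int) -> bytes:
    """Rebuild the object when at most one chunk (data or code) is missing."""
    chunks = list(chunks)  # work on a copy, leave the caller's list alone
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) > 1:
        raise ValueError("single-parity coding tolerates only one lost chunk")
    if missing:
        survivors = [chunk for chunk in chunks if chunk is not None]
        # XOR of all surviving chunks equals the missing chunk.
        chunks[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the code chunk and the zero padding added by encode().
    return b"".join(chunks[:-1])[:original_length]


if __name__ == "__main__":
    payload = b"durable object payload"
    stored = encode(payload, data_chunk_count=4)
    stored[2] = None  # simulate losing the server that held chunk 2
    print(reconstruct(stored, len(payload)) == payload)
```

Each chunk would live on a different server, so losing any single server still leaves enough information to rebuild the object.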
The key here is that Google always stores data redundantly across multiple availability zones before a write is acknowledged as successful. The encodings Google uses provide sufficient redundancy to support a target of more than 11 nines of durability against hardware failure. Once data is stored, Google regularly verifies checksums to guard data at rest against certain types of data errors. In the case of a checksum mismatch, data is automatically repaired using the redundancy present in its encodings.
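The verify-and-repair loop described above can be sketched in a few lines. This is an illustration only: it assumes plain full copies as the redundancy (rather than erasure-coded chunks) and uses the standard library's CRC-32 as a stand-in checksum. The `scrub` function is a hypothetical name, not a Cloud Storage API.

```python
import zlib
from typing import List


def scrub(replicas: List[bytearray], expected_crc: int) -> List[int]:
    """Verify every replica and overwrite corrupted ones from a healthy copy.

    Returns the indexes of the replicas that were repaired.
    """
    healthy = [r for r in replicas if zlib.crc32(r) == expected_crc]
    if not healthy:
        raise RuntimeError("no replica matches the stored checksum")
    repaired = []
    for index, replica in enumerate(replicas):
        if zlib.crc32(replica) != expected_crc:
            replica[:] = healthy[0]  # repair in place from a good copy
            repaired.append(index)
    return repaired


if __name__ == "__main__":
    payload = b"object bytes at rest"
    checksum = zlib.crc32(payload)  # recorded when the object was written
    replicas = [bytearray(payload) for _ in range(3)]
    replicas[1][0] ^= 0xFF  # simulate silent corruption on one replica
    print(scrub(replicas, checksum))
```

The important property is that corruption is found by a routine background check, while healthy redundancy still exists, rather than when a reader eventually asks for the data.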
Best practice: use dual-region or multi-region locations
These layers of protection against physical durability risks are well and good but they may not protect against substantial physical destruction of a region such as acts of war, an asteroid hit, or other large-scale disasters.
Cloud Storage’s 11 nines durability target applies to a single region. To go further and protect against disasters that could wipe out an entire region, consider storing your most important data in dual-region or multi-region buckets. These buckets automatically ensure redundancy of your data across geographic regions. Using these buckets requires no additional configuration or API changes to your applications, while providing added durability against very rare but potentially catastrophic events. As an added benefit, these location types also come with significantly higher availability SLAs because Google can transparently serve your objects from more than one location if a region is temporarily inaccessible.
Durability in transit
Another class of durability risks concerns corruption of data in transit. This could be data transferred across networks within the Cloud Storage service itself, or data being uploaded to or downloaded from Cloud Storage.
To protect against this source of corruption, data in transit within Cloud Storage is designed to be always checksum-protected, without exception. In the case of a checksum-validation error, the request is automatically retried or an error is returned, depending on the circumstances.
Best practice: use checksums for uploads and downloads
While Google Cloud checksums all Cloud Storage objects that travel within Google’s services, achieving end-to-end protection requires your help: it is recommended that you provide checksums when you upload your data to Cloud Storage and validate these checksums on the client when you download an object.
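As a sketch of the client side of this practice, the helpers below compute and verify an MD5 checksum in the base64 form that Cloud Storage uses for an object's `md5Hash` metadata field. Cloud Storage also supports CRC32C, but that algorithm is not in the Python standard library, so MD5 is used here. The function names are hypothetical, and the actual upload and download calls are deliberately left out.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Return the base64-encoded MD5 digest of data."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def verify_download(data: bytes, expected_md5_base64: str) -> None:
    """Raise ValueError if downloaded bytes do not match the expected checksum."""
    actual = md5_base64(data)
    if actual != expected_md5_base64:
        raise ValueError(
            f"checksum mismatch: expected {expected_md5_base64}, got {actual}"
        )


if __name__ == "__main__":
    payload = b"hello world"
    checksum = md5_base64(payload)     # send this along with the upload
    verify_download(payload, checksum) # run this on the bytes you download
    print("download verified")
```

Computing the checksum before the bytes leave your machine, and checking it again after they come back, covers the parts of the path that Google cannot see.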
Human-induced durability risks
Arguably the biggest risk of data loss comes from human error, and that includes errors in software: bugs are potentially the single biggest risk to data durability. To avoid durability loss from software bugs, Google takes steps to avoid introducing data-corrupting or data-erasing bugs in the first place. Google then maintains safeguards to detect these types of bugs quickly, with the aim of catching them before durability degradation turns into durability loss.
To catch bugs upfront, Google only releases a new version of Cloud Storage to production after it passes a large set of integration tests. These include exercising a variety of edge-case failure scenarios such as an availability zone going down and comparing the behaviours of data encoding and placement APIs to previous versions to screen for regressions.
Once a new software release is approved, Google rolls out upgrades in stages by availability zone, starting with a very limited initial area of impact and slowly ramping up until it is in widespread use. This allows Google to catch issues before they have a large impact and while there are still additional copies of data (or a sufficient number of erasure code chunks) from which to recover if needed. These software rollouts are monitored closely with plans in place for quick rollbacks if necessary.
Best practice: turn on object versioning
One of the most common sources of data loss is accidental deletion of data by a storage administrator or end-user. When you turn on object versioning, Cloud Storage preserves deleted objects in case you need to restore them at a later time. By configuring Object Lifecycle Management policies, you can limit how long you keep versioned objects before they are permanently deleted in order to better control your storage costs.
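To illustrate the lifecycle side of this, the sketch below builds a lifecycle configuration as a Python dictionary and prints it as JSON. It assumes the JSON shape and the condition names `daysSinceNoncurrentTime`, `numNewerVersions` and `isLive` used by Object Lifecycle Management; check the current documentation before applying it to a real bucket. The helper name and the chosen thresholds are hypothetical.

```python
import json
from typing import Any, Dict


def noncurrent_cleanup_policy(max_noncurrent_days: int,
                              max_newer_versions: int) -> Dict[str, Any]:
    """Build a lifecycle configuration that prunes old noncurrent versions.

    A noncurrent version is deleted once it has been noncurrent for
    max_noncurrent_days, or once at least max_newer_versions newer
    versions of the same object exist, whichever happens first.
    """
    if max_noncurrent_days < 0 or max_newer_versions < 1:
        raise ValueError("thresholds must be non-negative and at least 1")
    return {
        "lifecycle": {
            "rule": [
                {
                    "action": {"type": "Delete"},
                    "condition": {"daysSinceNoncurrentTime": max_noncurrent_days},
                },
                {
                    "action": {"type": "Delete"},
                    "condition": {
                        "isLive": False,
                        "numNewerVersions": max_newer_versions,
                    },
                },
            ]
        }
    }


if __name__ == "__main__":
    print(json.dumps(noncurrent_cleanup_policy(30, 3), indent=2))
```

Neither rule touches live objects, so the policy only limits how long the safety net of old versions is kept.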
Best practice: back up your data
Cloud Storage’s 11-nines durability target does not obviate the need to back up your data. For example, consider what a malicious hacker might do if they obtained access to your Cloud Storage account. Depending on your goals, a backup may be a second data copy in another region or cloud, on-premises or even physically isolated with an air gap on tape or disk.
Best practice: use data access retention policies and audit logs
For long-term data retention, use the Cloud Storage Bucket Lock feature to set data retention policies and ensure data is locked for specific periods of time. Doing so prevents accidental modification or deletion and, when combined with data access audit logging, can help satisfy regulatory and compliance requirements such as those of FINRA, the SEC and the CFTC, as well as certain health care industry retention regulations.
Best practice: use role-based access control policies
You can limit the blast radius of malicious hackers and accidental deletions by ensuring that IAM data access control policies follow the principles of separation of duties and least privilege. For example, separate those with the ability to create buckets from those who can delete projects.
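A separation-of-duties rule like this one can be checked mechanically. The sketch below scans a policy in the usual IAM shape (a list of bindings, each with a role and its members) and reports anyone who can both create buckets and delete projects. The two role sets are illustrative assumptions and are not exhaustive; the function names are hypothetical.

```python
from typing import Any, Dict, Set

# Assumption: illustrative, non-exhaustive role sets. Adjust to your setup.
BUCKET_CREATOR_ROLES = {"roles/owner", "roles/storage.admin"}
PROJECT_DELETER_ROLES = {"roles/owner", "roles/resourcemanager.projectDeleter"}


def members_with_roles(policy: Dict[str, Any], roles: Set[str]) -> Set[str]:
    """Collect every member bound to at least one of the given roles."""
    return {
        member
        for binding in policy.get("bindings", [])
        if binding["role"] in roles
        for member in binding["members"]
    }


def separation_of_duties_violations(policy: Dict[str, Any]) -> Set[str]:
    """Members who can both create buckets and delete projects."""
    creators = members_with_roles(policy, BUCKET_CREATOR_ROLES)
    deleters = members_with_roles(policy, PROJECT_DELETER_ROLES)
    return creators & deleters


if __name__ == "__main__":
    example_policy = {
        "bindings": [
            {"role": "roles/storage.admin",
             "members": ["user:alice@example.com", "user:bob@example.com"]},
            {"role": "roles/resourcemanager.projectDeleter",
             "members": ["user:bob@example.com"]},
        ]
    }
    print(separation_of_duties_violations(example_policy))
```

Running a check like this on a schedule turns the principle into something you can alert on, instead of something that quietly erodes as roles are granted.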
Encryption keys and durability
All Cloud Storage data is designed to always be encrypted at rest and in transit within the cloud. Since objects are unreadable without their encryption keys, the loss of encryption keys is a significant risk to durability. With Cloud Storage, you have three choices for key management:
- trust Google to manage the encryption keys for you,
- use Customer-Managed Encryption Keys (CMEK) with Cloud KMS, or
- use Customer-Supplied Encryption Keys (CSEK) with an external key server.
Google takes steps similar to those described earlier (including erasure coding and consistency checking) to protect the durability of the encryption keys under its control.
Best practice: safeguard your encryption keys
By choosing either CMEK or CSEK, you take direct control of managing your own keys. It is vital in these cases that you protect your keys in a manner that also provides at least 11 nines of durability. For CSEK, this means maintaining off-site backups of your keys so that you have a path to recovery even if your keys are lost or corrupted in some way. If such precautions are not taken, the durability of the encryption keys will determine the durability of the data.
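A little arithmetic shows why the weakest link dominates. The sketch below assumes that object loss and key loss are independent events and uses a made-up key loss probability of one in a million per year; both are assumptions for illustration, not measured figures.

```python
import math
from fractions import Fraction


def combined_annual_loss(object_loss: Fraction, key_loss: Fraction) -> Fraction:
    """Probability of losing the object, its key, or both within a year.

    Assumes the two failure events are independent.
    """
    return 1 - (1 - object_loss) * (1 - key_loss)


def nines(loss_probability: Fraction) -> float:
    """Express a loss probability as a number of nines of durability."""
    return -math.log10(float(loss_probability))


if __name__ == "__main__":
    object_loss = Fraction(1, 10**11)  # the 11 nines storage target
    weak_key_loss = Fraction(1, 10**6) # hypothetical poorly protected key
    print(nines(combined_annual_loss(object_loss, weak_key_loss)))
    print(nines(combined_annual_loss(object_loss, object_loss)))
```

With the hypothetical weak key, effective durability falls to roughly six nines regardless of how well the object itself is stored. Even a key protected to 11 nines leaves the combination slightly below 11 nines, which is why key durability should be at least as good as the storage target.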
In practice, the numerous techniques outlined here have allowed Cloud Storage to exceed 11 nines of annual durability to date. Add to that the best practices shared here and you’ll help to ensure that your data is here when you need it, whether that be later today or decades in the future.