Google has revealed more information on what happened when it was forced to shut down one of its London data centers on the UK’s hottest day of the year so far.
The failure of zone “europe-west2-a” last month occurred, according to Google, because the data center could not maintain a safe operating temperature after multiple redundant cooling systems failed simultaneously amid “extraordinarily high” outside temperatures.
The failure impacted numerous Google services, including Google Compute Engine, Persistent Disk (PD), and Google Cloud Storage, causing instance terminations, service degradation, and networking issues.
What actually happened?
Google engineers powered down the data center that hosted a portion of the impacted zone, “europe-west2-a”, while the cooling system was repaired.
The total impact on cloud services was estimated at 18 hours and 23 minutes.
This is fairly disturbing news, particularly given Google's claim that its regional services are “designed to survive the failure of a single zone”.
Google attributed the wider regional impact to an inadvertent change to traffic routing for internal services, which steered traffic away from all three zones in the “europe-west2” region rather than just the impacted “europe-west2-a” zone.
The routing incident stopped customers from being able to access data from regional storage services, including GCS and BigQuery, across multiple zones.
Will this happen again?
News like this is understandably worrying for anyone concerned about global warming, as the UK may well see even hotter days in the future.
Google has made several commitments aimed at stopping this type of failure from affecting its cloud hosting again.
These included repairing and re-testing its failover automation in an attempt to ensure stronger resilience in its failover protocols during large-scale events such as this one.
The cloud giant is also committed to investigating and developing “more advanced methods” to progressively decrease the thermal load within a single data center space, reducing the probability that a full shutdown is required.
In addition, Google says it will examine its procedures, tooling, and automated recovery systems for gaps, and will audit cooling system equipment and standards across the data centers that house Google Cloud globally.