All services are down due to Google Cloud Platform's major outage
Resolved
Jun 14 at 08:45am KST
[Google Cloud Platform] Incident Report
Summary
Google Cloud, Google Workspace and Google Security Operations products experienced increased 503 errors in external API requests, impacting customers.
We deeply apologize for the impact this outage has had. Google Cloud customers and their users entrust their businesses to Google, and we will do better. We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust in our systems. We are committed to making improvements to help avoid outages like this moving forward.
What happened?
Google and Google Cloud APIs are served through our Google API management and control planes. Distributed regionally, these management and control planes are responsible for ensuring that each incoming API request is authorized and that the appropriate policy and checks (like quota) are applied before it reaches its endpoint. The core binary that is part of this policy check system is known as Service Control. Service Control is a regional service with a regional datastore from which it reads quota and policy information. This datastore metadata is replicated almost instantly across the globe to manage quota policies for Google Cloud and our customers.
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region-by-region rollout, but the code path that failed was never exercised during the rollout because it required a policy change to trigger it. As a safety precaution, this code change came with a red button to turn off that particular policy-serving path. The issue with this change was that it had neither appropriate error handling nor feature flag protection. Without the appropriate error handling, a null pointer caused the binary to crash. Feature flags are used to gradually enable a feature region by region per project, starting with internal projects, so that we can catch issues early. If this change had been flag protected, the issue would have been caught in staging.
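To illustrate the class of defect described above, here is a minimal, hypothetical sketch (in Go, not Google's actual code) of a quota check that is both feature-flag guarded and tolerant of blank policy fields; all type, flag, and function names are illustrative assumptions.

```go
// Hypothetical sketch only: shows how a feature flag plus explicit handling
// of blank policy fields can keep malformed data from crashing the server.
package policycheck

import (
	"errors"
	"fmt"
)

// Limits is an illustrative stand-in for the quota limits carried by a
// policy record read from the regional datastore.
type Limits struct {
	PerMethod map[string]int64
}

// QuotaPolicy models a policy row; Limits may be nil when the row contains
// blank fields, which is the condition described in the incident.
type QuotaPolicy struct {
	Name   string
	Limits *Limits
}

// quotaPolicyChecksEnabled stands in for a per-project, per-region feature
// flag: the new code path stays off by default and is enabled gradually.
var quotaPolicyChecksEnabled = false

// CheckQuota validates a request against the policy. Instead of dereferencing
// p.Limits unconditionally (the null-pointer scenario), malformed data is
// surfaced as an error the caller can log and treat as non-fatal.
func CheckQuota(p *QuotaPolicy, method string, cost int64) error {
	if !quotaPolicyChecksEnabled {
		return nil // feature flag off: skip the new policy-serving path
	}
	if p == nil || p.Limits == nil || p.Limits.PerMethod == nil {
		return errors.New("quota policy has blank limit fields")
	}
	limit, ok := p.Limits.PerMethod[method]
	if !ok {
		return nil // no limit configured for this method
	}
	if cost > limit {
		return fmt.Errorf("quota exceeded for %s: cost %d > limit %d", method, cost, limit)
	}
	return nil
}
```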
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control then exercised quota checks against the policies in each regional datastore. This pulled in the blank fields for this policy change and exercised the code path that hit the null pointer, causing the binaries to go into a crash loop. Because each region has its own deployment, this occurred globally.
Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first.
Within some of our larger regions, such as us-central1, as Service Control tasks restarted, they created a herd effect on the underlying infrastructure they depend on (i.e. the Spanner tables), overloading it. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 min to fully resolve in us-central1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load. At that point, Service Control and API serving had fully recovered across all regions. Corresponding Google and Google Cloud products started recovering, with some taking longer depending on their architecture.
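For illustration, the following is a minimal sketch of randomized exponential backoff ("full jitter"), the generic client-side pattern that spreads restarting tasks out instead of letting them retry in lockstep against an overloaded datastore. It is not Service Control's actual implementation; all names are assumptions, and base and maxBackoff are assumed to be positive.

```go
// Hypothetical sketch of randomized exponential backoff with full jitter.
package backoff

import (
	"context"
	"math/rand"
	"time"
)

// RetryWithJitter retries op until it succeeds, ctx is cancelled, or
// maxAttempts is exhausted. Before the next attempt it sleeps for a random
// duration in [0, min(maxBackoff, base*2^attempt)), so a fleet of restarting
// tasks spreads its retries out rather than hitting the backend at once.
func RetryWithJitter(ctx context.Context, op func() error, base, maxBackoff time.Duration, maxAttempts int) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << uint(attempt)
		if backoff <= 0 || backoff > maxBackoff { // also guards against overflow
			backoff = maxBackoff
		}
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```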
What is our immediate path forward?
Immediately upon recovery, we froze all changes to the Service Control stack and manual policy pushes until we can completely remediate the system.
How did we communicate?
We posted our first incident report to Cloud Service Health ~1 hour after the start of the crashes, because the Cloud Service Health infrastructure was itself down as a result of this outage. For some customers, the monitoring infrastructure they run on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact on their business and/or infrastructure. We will address this going forward.
What’s our approach moving forward?
Beyond freezing the system as mentioned above, we will prioritize and safely complete the following:
- We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests (see the sketch after this list).
- We will audit all systems that consume globally replicated data. Regardless of the business need for near-instantaneous global consistency of the data (e.g. quota management settings are global), data needs to be propagated incrementally, with sufficient time to validate it and detect issues.
- We will require all changes to critical binaries to be feature flag protected and disabled by default.
- We will improve our static analysis and testing practices to correctly handle errors and, if need be, fail open.
- We will audit and ensure our systems employ randomized exponential backoff.
- We will improve our external communications, both automated and human, so our customers get the information they need as soon as possible to react to issues, manage their systems, and help their customers.
- We will ensure our monitoring and communication infrastructure remains operational and able to serve customers even when Google Cloud and our primary monitoring products are down, so that business continuity is maintained.
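As referenced in the first item above, here is a minimal, hypothetical sketch of what "failing open" can look like for a non-critical check: infrastructure failures inside the check degrade quota enforcement rather than API serving. The structure and names are assumptions, not Service Control's actual design.

```go
// Hypothetical sketch of a fail-open wrapper around a quota check.
package failopen

import (
	"errors"
	"log"
)

// ErrCheckUnavailable is returned by a check whose own dependency (for
// example, the policy datastore) is unreachable or returned malformed data.
var ErrCheckUnavailable = errors.New("policy check unavailable")

// Authorize rejects a request only when the quota check itself ran and
// reported a real violation. If the check's infrastructure is broken, the
// failure is logged and the request is allowed through (fail open).
func Authorize(checkQuota func() error) bool {
	err := checkQuota()
	if err == nil {
		return true
	}
	if errors.Is(err, ErrCheckUnavailable) {
		log.Printf("quota check unavailable, failing open: %v", err)
		return true
	}
	return false // genuine quota violation: reject the request
}
```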
Updated
Jun 13 at 03:34pm KST
[Google Cloud Platform] Mini Incident Report
We are deeply sorry for the impact this service disruption/outage caused to all of our users and their customers. Businesses large and small trust Google Cloud with their workloads, and we will do better. In the coming days, we will publish a full incident report with the root cause, a detailed timeline, and the robust remediation steps we will be taking. Given the size and impact of this incident, we would like to provide some information below.
Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.
(All Times US/Pacific)
Incident Start: 12 June, 2025 10:49
All regions except us-central1 mitigated: 12 June, 2025 12:48
Incident End: 12 June, 2025 13:49
Duration: 3 hours
Regions/Zones: Global
Description:
Multiple Google Cloud and Google Workspace products experienced increased 503 errors in external API requests, impacting customers.
From our initial analysis, the issue occurred due to an invalid automated quota update to our API management system, which was distributed globally, causing external API requests to be rejected. To recover, we bypassed the offending quota check, which allowed recovery in most regions within 2 hours. However, the quota policy database in us-central1 became overloaded, resulting in a much longer recovery in that region. Several products had moderate residual impact (e.g. backlogs) for up to an hour after the primary issue was mitigated, with a small number recovering after that.
Google will complete a full Incident Report in the following days that will provide a detailed root cause.
Customer Impact:
Customers experienced intermittent issues accessing the APIs and user interfaces of the impacted services. Existing streaming and IaaS resources were not impacted.
Additional details:
This incident should not have happened, and we will take the following measures to prevent future recurrence:
* Prevent our API management platform from failing due to invalid or corrupt data.
* Prevent metadata from propagating globally without appropriate protection, testing and monitoring in place.
* Improve system error handling and comprehensive testing for handling of invalid data.
Affected Services and Features:
Google Cloud Products:
* Identity and Access Management
* Cloud Build
* Cloud Key Management Service
* Google Cloud Storage
* Cloud Monitoring
* Google Cloud Dataproc
* Cloud Security Command Center
* Artifact Registry
* Cloud Workflows
* Cloud Healthcare
* Resource Manager API
* Dataproc Metastore
* Cloud Run
* VMware Engine
* Dataplex
* Migrate to Virtual Machines
* Google BigQuery
* Contact Center AI Platform
* Google Cloud Deploy
* Media CDN
* Colab Enterprise
* Vertex Gemini API
* Cloud Data Fusion
* Cloud Asset Inventory
* Datastream
* Integration Connectors
* Apigee
* Google Cloud NetApp Volumes
* Google Cloud Bigtable
* Looker (Google Cloud core)
* Looker Studio
* Google Cloud Functions
* Cloud Load Balancing
* Traffic Director
* Document AI
* AutoML Translation
* Pub/Sub Lite
* API Gateway
* Agent Assist
* AlloyDB for PostgreSQL
* Cloud Firestore
* Cloud Logging
* Cloud Shell
* Cloud Memorystore
* Cloud Spanner
* Contact Center Insights
* Database Migration Service
* Dialogflow CX
* Dialogflow ES
* Google App Engine
* Google Cloud Composer
* Google Cloud Console
* Google Cloud DNS
* Google Cloud Pub/Sub
* Google Cloud SQL
* Google Compute Engine
* Identity Platform
* Managed Service for Apache Kafka
* Memorystore for Memcached
* Memorystore for Redis
* Memorystore for Redis Cluster
* Persistent Disk
* Personalized Service Health
* Speech-to-Text
* Text-to-Speech
* Vertex AI Search
* Retail API
* Vertex AI Feature Store
* BigQuery Data Transfer Service
* Google Cloud Marketplace
* Cloud NAT
* Hybrid Connectivity
* Cloud Vision
* Network Connectivity Center
* Cloud Workstations
* Google Security Operations
Google Workspace Products:
* AppSheet
* Gmail
* Google Calendar
* Google Drive
* Google Chat
* Google Voice
* Google Docs
* Google Meet
* Google Cloud Search
* Google Tasks
Updated
Jun 13 at 01:18pm KST
This incident has been resolved.
Updated
Jun 13 at 10:27am KST
[Google Cloud Platform] Vertex AI Online Prediction is fully recovered as of 18:18 PDT.
All services are fully recovered from the service issue.
We will publish an analysis of this incident once we have completed our internal investigation.
We thank you for your patience while we worked on resolving the issue.
Updated
Jun 13 at 09:59am KST
[Google Cloud Platform] Vertex AI Online Prediction: The issue causing elevated 5xx errors with some Model Garden models was fully resolved as of 17:05 PDT. Vertex AI serving is now back to normal in all regions except europe-west1 and asia-southeast1. Engineers are actively working to restore normal serving capacity in these two regions.
The ETA for restoring normal serving capacity in europe-west1 and asia-southeast1 is 19:45 PDT.
We will provide an update by Thursday, 2025-06-12 19:45 PDT with current details.
Updated
Jun 13 at 09:33am KST
[Google Cloud Platform] The impact on Personalized Service Health is now resolved, and updates should be reflected without any issues.
The issue with Google Cloud Dataflow is fully resolved as of 17:10 PDT.
The only remaining impact is on Vertex AI Online Prediction as follows:
Vertex AI Online Prediction: Customers may continue to experience elevated 5xx errors with some of the models available in the Model Garden. We are seeing a gradual decrease in error rates as our engineers perform appropriate mitigation actions.
The ETA for full resolution of these 5xx errors is 22:00 PDT.
We will provide an update by Thursday, 2025-06-12 22:00 PDT with current details.
Updated
Jun 13 at 09:06am KST
[Google Cloud Platform] The following Google Cloud products are still experiencing residual impact:
Google Cloud Dataflow: The Dataflow backlog has cleared up in all regions except us-central1. Customers may experience delays with Dataflow operations in us-central1 as the backlog clears up gradually. We do not have an ETA for Cloud Dataflow recovery in us-central1.
Vertex AI Online Prediction: Customers may continue to experience elevated 5xx errors with some of the models available in the Model Garden. We are seeing a gradual decrease in error rates as our engineers perform appropriate mitigation actions. The ETA for full resolution of these 5xx errors is 22:00 PDT.
Personalized Service Health: Updates on Personalized Service Health are delayed, and we recommend that customers continue using the Cloud Service Health dashboard for updates.
We will provide an update by Thursday, 2025-06-12 17:45 PDT with current details.
Updated
Jun 13 at 08:13am KST
[Google Cloud Platform] The following Google Cloud products are still experiencing residual impact:
Google Cloud Dataflow: Customers may experience delays with Dataflow operations as the backlog is clearing up gradually.
Vertex AI Online Prediction: Customers may continue to experience elevated 5xx errors with some of the models available in the Model Garden.
Personalized Service Health: Updates on Personalized Service Health are delayed, and we recommend that customers continue using the Cloud Service Health dashboard for updates.
We currently do not have an ETA for full mitigation of the above services.
We will provide an update by Thursday, 2025-06-12 17:00 PDT with current details.
Updated
Jun 13 at 07:16am KST
[Google Cloud Platform] Most Google Cloud products are fully recovered as of 13:45 PDT.
There is some residual impact for the products currently marked as affected on the dashboard. Please continue to monitor the services and the dashboard for individual product recoveries.
We will provide an update by Thursday, 2025-06-12 16:00 PDT with current details.
Updated
Jun 13 at 06:23am KST
[Google Cloud Platform] Most Google Cloud products have confirmed full service recovery.
A few services are still seeing some residual impact and the respective engineering teams are actively working on recovery of those services.
We expect the recovery to complete in less than an hour.
We will provide an update by Thursday, 2025-06-12 15:00 PDT with current details.
Updated
Jun 13 at 06:00am KST
[Google Cloud Platform] We have implemented mitigation for the issue in us-central1 and multi-region/us and we are seeing signs of recovery.
We have received confirmation from our internal monitoring and from customers that Google Cloud products are also seeing recovery in multiple regions, with signs of some recovery in us-central1 and multi-region/us as well.
We expect the recovery to complete in less than an hour.
We will provide an update by Thursday, 2025-06-12 14:30 PDT with current details.
Updated
Jun 13 at 05:57am KST
[Cloudflare] All Cloudflare services have been restored and are now fully operational. We are moving the incident to Monitoring while we watch platform metrics to confirm sustained stability.
Updated
Jun 13 at 05:32am KST
[Cloudflare] Cloudflare services are recovering quickly around the globe. WARP and Turnstile are operational, though a small residual impact remains and we’re working to eliminate it. The core KV service is restored, bringing dependent products back online. We expect further recovery over the next few minutes and a steady drop in impact.
Updated
Jun 13 at 05:16am KST
[Google Cloud Platform] We have identified the root cause and applied appropriate mitigations. Our infrastructure has recovered in all regions except us-central1.
Google Cloud products that rely on the affected infrastructure are seeing recovery in multiple locations.
Our engineers are aware of the customers still experiencing issues on us-central1 and multi-region/us and are actively working on full recovery.
We do not have an ETA for full recovery.
We will provide an update by Thursday, 2025-06-12 14:00 PDT with current details.
Updated
Jun 13 at 04:57am KST
[Cloudflare] Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on the KV service to store and disseminate information are unavailable, including:
Access
WARP
Browser Isolation
Browser Rendering
Durable Objects (SQLite backed Durable Objects only)
Workers KV
Realtime
Workers AI
Stream
Parts of the Cloudflare dashboard
Turnstile
AI Gateway
AutoRAG
Cloudflare engineers are working to restore services immediately. We are aware of the deep impact this outage has caused and are working with all hands on deck to restore all services as quickly as possible.
Updated
Jun 13 at 04:41am KST
[Google Cloud Platform] Our engineers have identified the root cause and have applied appropriate mitigations.
While our engineers have confirmed that the underlying dependency has recovered in all locations except us-central1, we are aware that customers are still experiencing varying degrees of impact on individual Google Cloud products. All the respective engineering teams are actively engaged and working on service recovery.
We do not have an ETA for full service recovery.
We will provide an update by Thursday, 2025-06-12 13:30 PDT with current details.
Updated
Jun 13 at 04:30am KST
[Google Cloud Platform] All locations except us-central1 have fully recovered. us-central1 is mostly recovered. We do not have an ETA for full recovery in us-central1.
We will provide an update by Thursday, 2025-06-12 13:00 PDT with current details.
Updated
Jun 13 at 04:12am KST
[Cloudflare] We are starting to see services recover. We still expect to see intermittent errors across the impacted services as systems handle retries and caches are filled.
Updated
Jun 13 at 04:09am KST
[Google Cloud Platform] Our engineers are continuing to mitigate the issue, and we have confirmation of recovery in some locations.
We do not have an ETA on full mitigation at this point.
We will provide an update by Thursday, 2025-06-12 12:45 PDT with current details.
Updated
Jun 13 at 04:02am KST
[Cloudflare] We are seeing a number of services suffer intermittent failures. We are continuing to investigate this and we will update this list as we assess the impact on a per-service level.
Impacted services:
Access
WARP
Durable Objects (SQLite backed Durable Objects only)
Workers KV
Realtime
Workers AI
Stream
Parts of the Cloudflare dashboard
AI Gateway
AutoRAG
Updated
Jun 13 at 03:59am KST
[Google Cloud Platform]
Summary: Multiple GCP products are experiencing service issues with API requests
Description: We are experiencing service issues with multiple GCP products beginning at Thursday, 2025-06-12 10:51 PDT.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2025-06-12 12:15 PDT with current details.
We apologize to all who are affected by the disruption.
Symptoms: Multiple GCP products are experiencing varying levels of service impact with API requests.
Workaround: None at this time.
Updated
Jun 13 at 03:48am KST
[Cloudflare] We are seeing a number of services suffer intermittent failures. We are continuing to investigate this and we will update this list as we assess the impact on a per-service level.
Impacted services:
Access
WARP
Durable Objects (SQLite backed Durable Objects only)
Workers KV
Realtime
Workers AI
Stream
Parts of the Cloudflare dashboard
Updated
Jun 13 at 03:47am KST
[Cloudflare] We are continuing to investigate this issue.
Updated
Jun 13 at 03:46am KST
[Google Cloud Platform]
Summary: Multiple GCP products are experiencing service issues
Description: We are experiencing service issues with multiple GCP products beginning at Thursday, 2025-06-12 10:51 PDT.
Our engineering team continues to investigate the issue.
We will provide an update by Thursday, 2025-06-12 12:15 PDT with current details.
We apologize to all who are affected by the disruption.
Symptoms: Multiple GCP products are experiencing varying levels of service impact.
Workaround: None at this time.
Updated
Jun 13 at 03:46am KST
[Cloudflare] We are seeing a number of services suffer intermittent failures. We are continuing to investigate this and we will update this list as we assess the impact on a per-service level.
Updated
Jun 13 at 03:31am KST
[Cloudflare] We are continuing to investigate this issue.
Updated
Jun 13 at 03:30am KST
[Cloudflare] We are seeing a number of services suffer intermittent failures. We are continuing to investigate this and we will update this list as we assess the impact on a per-service level.
Updated
Jun 13 at 03:20am KST
[Cloudflare] We are continuing to investigate this issue.
Created
Jun 13 at 03:19am KST
Dear customers,
Our NGuard service is currently unstable due to outages at Google Cloud Platform and Cloudflare.
We ask for your patience; our team is in contact with Google Cloud Platform and Cloudflare and working to resolve the issue.
We will provide further updates as soon as the issue is resolved.
We sincerely apologise for any inconvenience you may have experienced during this outage.