Recover the platform onto a fresh GCP account or credit-backed project¶
- Purpose: Re-establish the HybridOps GCP lane on a new GCP account, billing profile, or credit-backed project without losing the surrounding on-prem, Hetzner, WAN, or secret posture.
- Owner: Platform operations
- Trigger: Free-tier exhaustion, credit rollover, billing suspension, or a deliberate move to a fresh GCP account/project.
- Impact: GCP-native resources are rebuilt under a new project/account boundary; the runtime vault remains the canonical secret source and GCP Secret Manager is rehydrated from it.
- Severity: P2
- Pre-reqs: Runtime vault is intact and decryptable; access to the new GCP account and a billable project path is available; on-prem and shared control-plane lanes remain reachable.
- Rollback strategy: Stop after any failed phase, keep the previous GCP env/state untouched, correct the issue, and re-run from the last successful step.
Core principle¶
HybridOps does not treat GCP Secret Manager as the canonical source of truth.
The intended hierarchy is:
- runtime vault
- state-driven rebuild contracts
- GCP-native projection targets such as:
    - Secret Manager
    - GKE Workload Identity bindings
    - Cloud SQL / DMS resources
    - GKE clusters
    - object repositories
That means a fresh GCP account is primarily a rehydration exercise, not a secret-recovery crisis, as long as the runtime vault is healthy.
What changes vs what does not¶
Recreated on the fresh GCP side¶
- project id / project number
- billing attachment
- enabled APIs
- service accounts and IAM bindings
- object buckets and bucket IAM
- GKE clusters and kubeconfigs
- Cloud SQL instances and DMS jobs
- GCP runner VMs
- Secret Manager secrets and versions
Not recreated from scratch¶
- runtime vault values
- on-prem state and data lanes
- Hetzner control-plane services
- public and internal workloads repos
- DR / burst / decision-service contracts
Safe operating model¶
Do not introduce a second long-lived env solely because the cloud account changed.
HybridOps treats an env such as dev as a single logical lane that can contain
its on-prem, Hetzner, and GCP surfaces together. In steady state:
- same-env state resolution is the default
- shared is the only normal cross-env authority
- any other cross-env state reference should be reserved for controlled drills or migrations
That means a fresh-account recovery can be rehearsed in an isolated env, but the
clean end state is to rehydrate the existing dev lane rather than normalising
per-provider env splits such as dev-onprem and dev-gcp.
Preferred recovery posture:
- rehearse in a new env only when isolation is required
- keep the old env for reference while rehearsing
- once validated, rehydrate the real dev lane in place so cross-env dependencies do not become permanent
Phase 1: establish the new GCP identity and billing path¶
- Confirm the new account can create or attach to a project.
- Confirm billing is active.
- Confirm the required APIs can be enabled.
Quick checks:
gcloud auth list
gcloud projects list
gcloud beta billing accounts list
Typical failures to expect first:
- BILLING_DISABLED
- missing resourcemanager.projects.create
- missing billing-account attachment rights
If you hit those, stop there. Do not try to force later cluster or DR steps.
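If the specific blocker is a missing billing attachment and you hold billing-account user rights, a minimal repair sketch (project and billing-account ids are placeholders) is:
gcloud beta billing projects link <new-project-id> --billing-account <billing-account-id>
gcloud beta billing projects describe <new-project-id>
Anything beyond that belongs to whoever owns the billing account.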
Phase 2: initialise the new env cleanly¶
Create or choose a clean env and run GCP init there.
hyops init gcp --env <new-env>
Then verify:
hyops init status --env <new-env>
cat "$HOME/.hybridops/envs/<new-env>/meta/gcp.ready.json"
If the new project itself will be created by HybridOps, follow:
- Runbook - Init GCP credentials with hyops init gcp
- Runbook - Create a GCP project with Project Factory
Phase 3: recreate the GCP control-plane substrate¶
Rebuild these in order:
- project / project-factory
- object repository
- runner or network substrate
- GKE cluster
- kubeconfig publication
- secret-store path
- DR-specific managed services if required
That order avoids trying to bootstrap workloads or secrets into a project that is not ready yet.
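As a sketch, the first two steps chain together like this; the remaining pieces follow the same hyops apply pattern in the order above (instance name is a placeholder):
hyops apply --env <env> --module org/gcp/project-factory
hyops apply --env <env> --module org/gcp/object-repo --state-instance <instance>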
After org/gcp/project-factory succeeds, rerun:
hyops init gcp --env <env> --force
Recovery is not complete until the readiness marker shows:
- project_bootstrap_pending=false
- project_access_validated=true
- auth_mode="impersonation"
- impersonation_validated=true
Phase 3a: rebind bucket-backed state safely¶
When the recovered GCP lane needs a new object bucket lineage, do not try to
mutate the original org/gcp/object-repo slot in place. Bucket names are
immutable within a HyOps state slot, and the old slot may still be the correct
historical record for the retired project.
Recommended recovery pattern:
- create a new object-repo state instance
- apply that new instance against the recovered project
- update downstream repo_state_ref consumers to the new instance
- rerun the dependent backup or restore modules
Example:
hyops apply --env dev \
--module org/gcp/object-repo \
--state-instance pgbackrest_primary \
--inputs ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/instances/pgbackrest_primary.inputs.yml
Then point consumers at:
repo_state_ref: org/gcp/object-repo#pgbackrest_primary
If the recovered repo principal changed, rotate the vault secret that carries the
repository credential before rerunning DR or backup modules. For GCS-backed
pgBackRest flows that usually means replacing PG_BACKUP_GCS_SA_JSON with a key
for the new service_account_email published by the recovered object-repo
instance.
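A minimal rotation sketch, assuming the recovered instance published a new service_account_email; the vault write itself uses whatever vault-edit flow your install provides:
gcloud iam service-accounts keys create pgbackrest-sa.json \
  --iam-account <service_account_email>
Store the resulting JSON as PG_BACKUP_GCS_SA_JSON in the runtime vault, then delete the local key file.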
Use the new-instance pattern when:
- the old bucket belonged to a closed or suspended billing lineage
- the old object-repo slot is still ok but no longer represents the recovered project
- for Terraform Cloud-backed GCP modules, HyOps now keeps the current derived workspace name when the resolved project_id changes, instead of preserving a legacy workspace alias that still points at the old project state
- you need a clean rebuild without rewriting historical state
Do not overwrite the old bucket name in place solely because the account changed. That is the wrong recovery primitive for GCS-backed state. If the old bucket name is no longer available when the new instance is applied, choose a new globally unique bucket name in the replacement inputs and recreate the instance cleanly.
After all downstream consumers have been repointed and validated, remove the stale local no-instance slot so the runtime no longer advertises the retired project as the default object repository state:
rm ~/.hybridops/envs/dev/state/modules/org__gcp__object-repo/latest.json
rm ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/latest.inputs.yml
From that point on, operators should use explicit instance refs such as:
org/gcp/object-repo#pgbackrest_primary
org/gcp/object-repo#vyos_artifacts
If a workflow still uses a bare org/gcp/object-repo ref after multiple ready
instances exist, HyOps preflight and validate now fail early and require an
explicit instance.
Phase 4: rehydrate secrets into the new GCP project¶
Because the runtime vault remains canonical, the normal path is:
- unlock runtime vault
- persist allowlisted secrets into the new GCP project
- rebuild the consuming GCP-native surfaces
Example:
hyops vault password >/dev/null
hyops secrets gsm-persist --env <new-env> --scope dr
hyops secrets gsm-persist --env <new-env> --scope build
If the env also carries private Academy, Moodle, or external identity-provider
secrets, copy the env-local GSM map example to
<root>/config/secrets/gsm/allowed.csv first, then persist those additional
scopes deliberately.
Then verify secret-store readiness again where needed.
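As a quick spot check that the projection landed (PG_BACKUP_GCS_SA_JSON is used here because it appears in the DR flow above; substitute whichever secret your scope carries):
gcloud secrets list --project <new-project-id>
gcloud secrets versions access latest --secret PG_BACKUP_GCS_SA_JSON --project <new-project-id> >/dev/null && echo present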
Phase 5: restore the GKE burst lane¶
For burst, the minimum clean recovery chain is:
- platform/gcp/gke-cluster
- platform/gcp/gke-kubeconfig
- platform/k8s/argocd-bootstrap
- platform/k8s/gcp-secret-store
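As a sketch (env name is a placeholder):
hyops apply --env <new-env> --module platform/gcp/gke-cluster
hyops apply --env <new-env> --module platform/gcp/gke-kubeconfig
hyops apply --env <new-env> --module platform/k8s/argocd-bootstrap
hyops apply --env <new-env> --module platform/k8s/gcp-secret-store
kubectl get nodes
A clean kubectl get nodes against the published kubeconfig confirms the kubeconfig step actually reached the new cluster.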
Phase 6: restore the managed DR lane if required¶
For the managed PostgreSQL DR path, rebuild:
- platform/onprem/postgresql-dr-source posture if needed
- org/gcp/cloudsql-external-replica
- promote / failback drill overlays only after standby health is real again
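Before enabling promote or failback drill overlays, confirm the managed standby reports a healthy state (instance name is a placeholder):
gcloud sql instances describe <replica-instance> --project <new-project-id> --format='value(state)'
A healthy instance reports RUNNABLE; anything else means the drills should wait.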
Reference:
- Establish PostgreSQL Cloud SQL Standby in GCP (HyOps Blueprint)
Note: if a Terraform Cloud-backed GCP module still points at the old project after the env move, force-delete and recreate only that module workspace, then rerun the module so the new project-backed state becomes authoritative.
Common problems to expect¶
Billing is active but APIs still fail¶
This is common on fresh credit-backed projects.
Check and enable at least:
- compute.googleapis.com
- container.googleapis.com
- secretmanager.googleapis.com
- iamcredentials.googleapis.com
- sqladmin.googleapis.com
- datamigration.googleapis.com
- servicenetworking.googleapis.com
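A one-shot enable sketch for the recovered project:
gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  secretmanager.googleapis.com \
  iamcredentials.googleapis.com \
  sqladmin.googleapis.com \
  datamigration.googleapis.com \
  servicenetworking.googleapis.com \
  --project <new-project-id>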
Kubeconfig retrieval fails on a fresh workstation¶
Make sure the auth plugin is present:
hyops setup cloud-gcp --sudo
or:
sudo apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin
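Then confirm the plugin resolves on PATH:
gke-gcloud-auth-plugin --version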
Secret flows fail after project rebuild¶
Usually one of:
- the new GCP project id is wrong in env config
- Secret Manager was not rehydrated from runtime vault
- Workload Identity bindings were not recreated
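For the Workload Identity cause, one hedged spot check is the standard binding annotation on the consuming Kubernetes service account (names are placeholders):
kubectl get serviceaccount <ksa-name> -n <namespace> -o yaml | grep iam.gke.io/gcp-service-account
The first two causes are repaired by correcting the env config and rerunning the Phase 4 gsm-persist commands.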
Cloud SQL private service connection fails with AUTH_PERMISSION_DENIED¶
This is no longer the same problem as stale project drift.
Interpretation:
- the module is already targeting the recovered project correctly
- the effective Terraform identity can reach the project and create the PSA range
- but it still lacks the private peering permission on the project that owns the VPC
Repair path for a single-project VPC:
hyops init gcp --env dev --with-cli-login --force \
--project-id hybridops-dev-gcp-03 \
--region europe-west2
That reruns the current-project bootstrap path and re-ensures roles/servicenetworking.networksAdmin on the Terraform service account.
For Shared VPC:
- grant the effective Terraform identity the required network roles on the host project that owns the VPC
- minimum expected roles for the Cloud SQL private-service-access path are:
- roles/compute.networkAdmin
- roles/servicenetworking.networksAdmin
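A minimal grant sketch on the host project (identity and project ids are placeholders):
gcloud projects add-iam-policy-binding <host-project-id> \
  --member serviceAccount:<terraform-sa-email> \
  --role roles/compute.networkAdmin
gcloud projects add-iam-policy-binding <host-project-id> \
  --member serviceAccount:<terraform-sa-email> \
  --role roles/servicenetworking.networksAdmin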
Then rerun hyops preflight or hyops validate before apply.
Burst or DR modules still point at old project ids¶
This is a state/env hygiene problem, not a secret-loss problem.
Treat it by:
- checking env-scoped input overlays
- checking project-state refs
- checking any pinned project id fields before rerun
- recreating object-repo lineage with a new --state-instance when the old bucket belonged to the retired project
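A quick way to surface stale pins before rerunning, assuming the retired project id is known:
grep -R "<old-project-id>" ~/.hybridops/envs/dev/config/modules/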
Do not assume that --force on the old object-repo slot is the right fix. If the
bucket name itself has to change, the clean repair is a new state instance plus
updated repo_state_ref consumers.
DR node connectivity fails immediately after recreate¶
Freshly recreated GCP DR nodes can pass transport reachability before the guest
SSH service is ready for Ansible. If validation fails with an SSH banner timeout
or ssh service did not become ready yet, wait briefly and rerun validation
instead of rewriting module state.
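If you want a manual probe first, a plain SSH banner check works (user and host are placeholders):
ssh -o ConnectTimeout=10 <ansible-user>@<dr-node-ip> true
Once that returns cleanly, rerun the failed validation step.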
DR restore reaches the new bucket but backup.info is missing¶
That means the credential and bucket wiring are correct, but the recovered repo is empty. The fix is not another restore retry.
Required recovery sequence:
- point the surviving source-cluster backup module at the recovered object-repo instance
- run an on-demand backup into that repo
- use the resulting backup state as backup_state_ref for the DR restore
- rerun the DR restore after the new backup set is published
In practice this often means:
repo_state_ref: org/gcp/object-repo#pgbackrest_primary
backup_state_ref: platform/onprem/postgresql-ha-backup#postgresql_backup_run_onprem_dr
restore_target_timeline: ''
When backup_state_ref is present, HyOps resolves the backup label from state.
Leave restore_target_timeline blank unless the repository contains divergent
lineages and you are deliberately pinning one.
Do not treat an empty recovered bucket as a Terraform or workspace problem. It is a missing backup corpus problem.
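To confirm the corpus exists after the on-demand backup, list the stanza's backup.info directly (bucket, repo path, and stanza names are placeholders; the exact layout depends on your pgBackRest repo path settings):
gsutil ls gs://<recovered-bucket>/<repo-path>/backup/<stanza>/backup.info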
Verification checklist¶
Recovery is in a healthy state when:
- hyops init status --env <new-env> is green for GCP
- project/billing/API checks are green
- the runtime vault remains unchanged and decryptable
- GCP Secret Manager is repopulated from the runtime vault where required
- GKE burst baseline is healthy again
- managed Cloud SQL standby can be re-established if required
- no public docs or module defaults were changed just to fit the temporary account
Outcome¶
If this runbook is followed, a fresh GCP account or credit-backed student project is an operational rebuild, not a platform redesign.