
Recover the platform onto a fresh GCP account or credit-backed project

  • Purpose: Re-establish the HybridOps GCP lane on a new GCP account, billing profile, or credit-backed project without losing the surrounding on-prem, Hetzner, WAN, or secret posture.

  • Owner: Platform operations

  • Trigger: Free-tier exhaustion, credit rollover, billing suspension, or a deliberate move to a fresh GCP account/project.

  • Impact: GCP-native resources are rebuilt under a new project/account boundary; the runtime vault remains the canonical secret source and GCP Secret Manager is rehydrated from it.

  • Severity: P2

  • Pre-reqs: Runtime vault is intact and decryptable; access to the new GCP account and a billable project path is available; on-prem and shared control-plane lanes remain reachable.

  • Rollback strategy: Stop after any failed phase, keep the previous GCP env/state untouched, correct the issue, and re-run from the last successful step.

Core principle

HybridOps does not treat GCP Secret Manager as the canonical source of truth.

The intended hierarchy is:

  1. runtime vault
  2. state-driven rebuild contracts
  3. GCP-native projection targets such as:
       • Secret Manager
       • GKE Workload Identity bindings
       • Cloud SQL / DMS resources
       • GKE clusters
       • object repositories

That means a fresh GCP account is primarily a rehydration exercise, not a secret-recovery crisis, as long as the runtime vault is healthy.

What changes vs what does not

Recreated on the fresh GCP side

  • project id / project number
  • billing attachment
  • enabled APIs
  • service accounts and IAM bindings
  • object buckets and bucket IAM
  • GKE clusters and kubeconfigs
  • Cloud SQL instances and DMS jobs
  • GCP runner VMs
  • Secret Manager secrets and versions

Not recreated from scratch

  • runtime vault values
  • on-prem state and data lanes
  • Hetzner control-plane services
  • public and internal workloads repos
  • DR / burst / decision-service contracts

Safe operating model

Do not introduce a second long-lived env solely because the cloud account changed.

HybridOps treats an env such as dev as a single logical lane that can contain its on-prem, Hetzner, and GCP surfaces together. In steady state:

  • same-env state resolution is the default
  • shared is the only normal cross-env authority
  • any other cross-env state reference should be reserved for controlled drills or migrations

That means a fresh-account recovery can be rehearsed in an isolated env, but the clean end state is to rehydrate the existing dev lane rather than normalising per-provider env splits such as dev-onprem and dev-gcp.

Preferred recovery posture:

  • rehearse in a new env only when isolation is required
  • keep the old env for reference while rehearsing
  • once validated, rehydrate the real dev lane in place so cross-env dependencies do not become permanent

Phase 1: establish the new GCP identity and billing path

  1. Confirm the new account can create or attach to a project.
  2. Confirm billing is active.
  3. Confirm the required APIs can be enabled.

Quick checks:

gcloud auth list
gcloud projects list
gcloud beta billing accounts list

Typical failures to expect first:

  • BILLING_DISABLED
  • missing resourcemanager.projects.create
  • missing billing-account attachment rights

If you hit those, stop there. Do not try to force later cluster or DR steps.
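
If the only blocker is the billing attachment and you do hold rights on the billing account, the repair is a single link (a sketch; project and billing-account ids are placeholders):

gcloud beta billing projects link <new-project-id> \
  --billing-account <billing-account-id>

gcloud beta billing projects describe <new-project-id>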

Phase 2: initialise the new env cleanly

Create or choose a clean env and run GCP init there.

hyops init gcp --env <new-env>

Then verify:

hyops init status --env <new-env>
cat "$HOME/.hybridops/envs/<new-env>/meta/gcp.ready.json"

If the new project itself will be created by HybridOps, follow the project-factory flow in Phase 3.

Phase 3: recreate the GCP control-plane substrate

Rebuild these in order:

  1. project / project-factory
  2. object repository
  3. runner or network substrate
  4. GKE cluster
  5. kubeconfig publication
  6. secret-store path
  7. DR-specific managed services if required

That order avoids trying to bootstrap workloads or secrets into a project that is not ready yet.
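
The first two steps, expressed with the same hyops apply pattern used later in this runbook (a sketch; required inputs and instance flags are omitted and depend on your env, and the remaining steps are covered in Phases 3a, 5, and 6):

hyops apply --env <new-env> --module org/gcp/project-factory
hyops apply --env <new-env> \
  --module org/gcp/object-repo \
  --state-instance <instance>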

After org/gcp/project-factory succeeds, rerun:

hyops init gcp --env <env> --force

Recovery is not complete until the readiness marker shows:

  • project_bootstrap_pending=false
  • project_access_validated=true
  • auth_mode="impersonation"
  • impersonation_validated=true
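
A quick way to check those fields, assuming they sit at the top level of the readiness marker (a sketch, not the authoritative schema):

jq -e '.project_bootstrap_pending == false
  and .project_access_validated == true
  and .auth_mode == "impersonation"
  and .impersonation_validated == true' \
  "$HOME/.hybridops/envs/<env>/meta/gcp.ready.json"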

Phase 3a: rebind bucket-backed state safely

When the recovered GCP lane needs a new object bucket lineage, do not try to mutate the original org/gcp/object-repo slot in place. Bucket names are immutable within a HyOps state slot, and the old slot may still be the correct historical record for the retired project.

Recommended recovery pattern:

  1. create a new object-repo state instance
  2. apply that new instance against the recovered project
  3. update downstream repo_state_ref consumers to the new instance
  4. rerun the dependent backup or restore modules

Example:

hyops apply --env dev \
  --module org/gcp/object-repo \
  --state-instance pgbackrest_primary \
  --inputs ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/instances/pgbackrest_primary.inputs.yml

Then point consumers at:

repo_state_ref: org/gcp/object-repo#pgbackrest_primary

If the recovered repo principal changed, rotate the vault secret that carries the repository credential before rerunning DR or backup modules. For GCS-backed pgBackRest flows that usually means replacing PG_BACKUP_GCS_SA_JSON with a key for the new service_account_email published by the recovered object-repo instance.
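
If the credential needs to be reminted rather than copied, a minimal sketch for a new key (the email is the service_account_email published by the recovered instance; loading the JSON back into the runtime vault as PG_BACKUP_GCS_SA_JSON follows your normal vault update flow):

gcloud iam service-accounts keys create pgbackrest-gcs-key.json \
  --iam-account <service_account_email>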

Use this when:

  • the old bucket belonged to a closed or suspended billing lineage
  • the old object-repo slot is still healthy but no longer represents the recovered project
  • you need a clean rebuild without rewriting historical state

Note: for Terraform Cloud-backed GCP modules, HyOps now keeps the current derived workspace name when the resolved project_id changes, instead of preserving a legacy workspace alias that still points at the old project state.

Do not overwrite the old bucket name in place solely because the account changed. That is the wrong recovery primitive for GCS-backed state. If the old bucket name is no longer available when the new instance is applied, choose a new globally unique bucket name in the replacement inputs and recreate the instance cleanly.

After all downstream consumers have been repointed and validated, remove the stale local no-instance slot so the runtime no longer advertises the retired project as the default object repository state:

rm ~/.hybridops/envs/dev/state/modules/org__gcp__object-repo/latest.json
rm ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/latest.inputs.yml

From that point on, operators should use explicit instance refs such as:

org/gcp/object-repo#pgbackrest_primary
org/gcp/object-repo#vyos_artifacts

If a workflow still uses a bare org/gcp/object-repo ref after multiple ready instances exist, HyOps preflight and validate now fail early and require an explicit instance.

Phase 4: rehydrate secrets into the new GCP project

Because the runtime vault remains canonical, the normal path is:

  1. unlock runtime vault
  2. persist allowlisted secrets into the new GCP project
  3. rebuild the consuming GCP-native surfaces

Example:

hyops vault password >/dev/null
hyops secrets gsm-persist --env <new-env> --scope dr
hyops secrets gsm-persist --env <new-env> --scope build

If the env also carries private Academy, Moodle, or external identity-provider secrets, copy the env-local GSM map example to <root>/config/secrets/gsm/allowed.csv first, then persist those additional scopes deliberately.
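
For example (a sketch; the example filename is an assumption, and <root> is your checkout root):

cp <root>/config/secrets/gsm/allowed.csv.example <root>/config/secrets/gsm/allowed.csv
hyops secrets gsm-persist --env <new-env> --scope <additional-scope>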

Then verify secret-store readiness again where needed.

Phase 5: restore the GKE burst lane

For burst, the minimum clean recovery chain is:

  1. platform/gcp/gke-cluster
  2. platform/gcp/gke-kubeconfig
  3. platform/k8s/argocd-bootstrap
  4. platform/k8s/gcp-secret-store
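
Expressed as module applies in that order (a sketch following the hyops apply pattern from Phase 3a; required inputs are omitted and depend on your env):

hyops apply --env <new-env> --module platform/gcp/gke-cluster
hyops apply --env <new-env> --module platform/gcp/gke-kubeconfig
hyops apply --env <new-env> --module platform/k8s/argocd-bootstrap
hyops apply --env <new-env> --module platform/k8s/gcp-secret-store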

Phase 6: restore the managed DR lane if required

For the managed PostgreSQL DR path, rebuild:

  1. platform/onprem/postgresql-dr-source posture if needed
  2. org/gcp/cloudsql-external-replica
  3. promote / failback drill overlays only after standby health is real again
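
A sketch of the managed-replica step with the same apply pattern, assuming preflight takes the same --env flag as the other commands in this runbook (the DR-source posture and the drill overlays are env-specific and not shown):

hyops preflight --env <env>
hyops apply --env <env> --module org/gcp/cloudsql-external-replica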

Common problems to expect

Billing is active but APIs still fail

This is common on fresh credit-backed projects.

Check and enable at least:

  • compute.googleapis.com
  • container.googleapis.com
  • secretmanager.googleapis.com
  • iamcredentials.googleapis.com
  • sqladmin.googleapis.com
  • datamigration.googleapis.com
  • servicenetworking.googleapis.com
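
They can be enabled in one pass (adjust the list to the lanes you are actually rebuilding):

gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  secretmanager.googleapis.com \
  iamcredentials.googleapis.com \
  sqladmin.googleapis.com \
  datamigration.googleapis.com \
  servicenetworking.googleapis.com \
  --project <new-project-id>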

Kubeconfig retrieval fails on a fresh workstation

Make sure the auth plugin is present:

hyops setup cloud-gcp --sudo

or:

sudo apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin

Secret flows fail after project rebuild

Usually one of:

  • the new GCP project id is wrong in env config
  • Secret Manager was not rehydrated from runtime vault
  • Workload Identity bindings were not recreated
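
Two quick checks to localise which of those it is (project id and service-account email are placeholders):

# was Secret Manager actually repopulated in the new project?
gcloud secrets list --project <new-project-id>

# were the Workload Identity bindings recreated on the recovered service account?
gcloud iam service-accounts get-iam-policy <service-account-email>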

Cloud SQL private service connection fails with AUTH_PERMISSION_DENIED

This is no longer the same problem as stale project drift.

Interpretation:

  • the module is already targeting the recovered project correctly
  • the effective Terraform identity can reach the project and create the PSA range
  • but it still lacks the private peering permission on the project that owns the VPC

Repair path for a single-project VPC:

hyops init gcp --env dev --with-cli-login --force \
  --project-id hybridops-dev-gcp-03 \
  --region europe-west2

That reruns the current-project bootstrap path and re-ensures roles/servicenetworking.networksAdmin on the Terraform service account.

For Shared VPC:

  • grant the effective Terraform identity the required network roles on the host project that owns the VPC (see the sketch below)
  • minimum expected roles for the Cloud SQL private-service-access path:
       • roles/compute.networkAdmin
       • roles/servicenetworking.networksAdmin
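
A sketch of the host-project grants (host project id and the Terraform identity are placeholders; adapt if your org manages IAM declaratively):

for role in roles/compute.networkAdmin roles/servicenetworking.networksAdmin; do
  gcloud projects add-iam-policy-binding <host-project-id> \
    --member "serviceAccount:<terraform-sa-email>" \
    --role "$role"
done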

Then rerun hyops preflight or hyops validate before apply.

Burst or DR modules still point at old project ids

This is a state/env hygiene problem, not a secret-loss problem.

Treat it by:

  • checking env-scoped input overlays
  • checking project-state refs
  • checking any pinned project id fields before rerun
  • recreating object-repo lineage with a new --state-instance when the old bucket belonged to the retired project

Do not assume that --force on the old object-repo slot is the right fix. If the bucket name itself has to change, the clean repair is a new state instance plus updated repo_state_ref consumers.
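
A quick way to find lingering pins, assuming you know the retired project id (paths follow the env layout used earlier in this runbook):

grep -rn "<old-project-id>" \
  "$HOME/.hybridops/envs/<env>/config/modules/" \
  "$HOME/.hybridops/envs/<env>/state/modules/"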

DR node connectivity fails immediately after recreate

Freshly recreated GCP DR nodes can pass transport reachability before the guest SSH service is ready for Ansible. If validation fails with an SSH banner timeout or an "ssh service did not become ready yet" message, wait briefly and rerun validation instead of rewriting module state.

DR restore reaches the new bucket but backup.info is missing

That means the credential and bucket wiring are correct, but the recovered repo is empty. The fix is not another restore retry.

Required recovery sequence:

  1. point the surviving source-cluster backup module at the recovered object-repo instance
  2. run an on-demand backup into that repo
  3. use the resulting backup state as backup_state_ref for the DR restore
  4. rerun the DR restore after the new backup set is published

In practice this often means:

repo_state_ref: org/gcp/object-repo#pgbackrest_primary
backup_state_ref: platform/onprem/postgresql-ha-backup#postgresql_backup_run_onprem_dr
restore_target_timeline: ''

When backup_state_ref is present, HyOps resolves the backup label from state. Leave restore_target_timeline blank unless the repository contains divergent lineages and you are deliberately pinning one.

Do not treat an empty recovered bucket as a Terraform or workspace problem. It is a missing backup corpus problem.

Verification checklist

Recovery is in a healthy state when:

  • hyops init status --env <new-env> is green for GCP
  • project/billing/API checks are green
  • the runtime vault remains unchanged and decryptable
  • GCP Secret Manager is repopulated from the runtime vault where required
  • GKE burst baseline is healthy again
  • managed Cloud SQL standby can be re-established if required
  • no public docs or module defaults were changed just to fit the temporary account
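
A minimal spot-check of the first few items (a sketch; the burst and DR checks depend on which lanes you actually rebuilt):

# env-level GCP readiness
hyops init status --env <new-env>

# project and API posture
gcloud services list --enabled --project <new-project-id>

# secrets were projected back into the new project
gcloud secrets list --project <new-project-id>

# runtime vault still unlocks (same command used in Phase 4)
hyops vault password >/dev/null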

Outcome

If this runbook is followed, a fresh GCP account or credit-backed student project is an operational rebuild, not a platform redesign.
