
Recover the platform onto a fresh GCP account or credit-backed project

  • Purpose: Re-establish the HybridOps GCP lane on a new GCP account, billing profile, or credit-backed project without losing the surrounding on-prem, Hetzner, WAN, or secret posture.

  • Owner: Platform operations

  • Trigger: Free-tier exhaustion, credit rollover, billing suspension, or a deliberate move to a fresh GCP account/project.

  • Impact: GCP-native resources are rebuilt under a new project/account boundary; the runtime vault remains the canonical secret source and GCP Secret Manager is rehydrated from it.

  • Severity: P2

  • Pre-reqs: Runtime vault is intact and decryptable; access to the new GCP account and a billable project path is available; on-prem and shared control-plane lanes remain reachable.

  • Rollback strategy: Stop after any failed phase, keep the previous GCP env/state untouched, correct the issue, and re-run from the last successful step.

Core principle

HybridOps does not treat GCP Secret Manager as the canonical source of truth.

The intended hierarchy is:

  1. runtime vault
  2. state-driven rebuild contracts
  3. GCP-native projection targets such as:
       • Secret Manager
       • GKE Workload Identity bindings
       • Cloud SQL / DMS resources
       • GKE clusters
       • object repositories

That means a fresh GCP account is primarily a rehydration exercise, not a secret-recovery crisis, as long as the runtime vault is healthy.

What changes vs what does not

Recreated on the fresh GCP side

  • project id / project number
  • billing attachment
  • enabled APIs
  • service accounts and IAM bindings
  • object buckets and bucket IAM
  • GKE clusters and kubeconfigs
  • Cloud SQL instances and DMS jobs
  • GCP runner VMs
  • Secret Manager secrets and versions

Not recreated from scratch

  • runtime vault values
  • on-prem state and data lanes
  • Hetzner control-plane services
  • public and internal workloads repos
  • DR / burst / decision-service contracts

Safe operating model

Do not introduce a second long-lived env solely because the cloud account changed.

HybridOps treats an env such as dev as a single logical lane that can contain its on-prem, Hetzner, and GCP surfaces together. In steady state:

  • same-env state resolution is the default
  • shared is the only normal cross-env authority
  • any other cross-env state reference should be reserved for controlled drills or migrations

That means a fresh-account recovery can be rehearsed in an isolated env, but the clean end state is to rehydrate the existing dev lane rather than normalising per-provider env splits such as dev-onprem and dev-gcp.

Preferred recovery posture:

  • rehearse in a new env only when isolation is required
  • keep the old env for reference while rehearsing
  • once validated, rehydrate the real dev lane in place so cross-env dependencies do not become permanent

Phase 1: establish the new GCP identity and billing path

  1. Confirm the new account can create or attach to a project.
  2. Confirm billing is active.
  3. Confirm the required APIs can be enabled.

Quick checks:

gcloud auth list
gcloud projects list
gcloud beta billing accounts list

Typical failures to expect first:

  • BILLING_DISABLED
  • missing resourcemanager.projects.create
  • missing billing-account attachment rights

If you hit those, stop there. Do not try to force later cluster or DR steps.
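
If the only blocker is the billing attachment and you do hold rights on the billing account, the repair is a single link (a sketch; project and billing-account ids are placeholders):

gcloud beta billing projects link <new-project-id> \
  --billing-account <billing-account-id>

gcloud beta billing projects describe <new-project-id>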

Phase 2: initialise the new env cleanly

Create or choose a clean env and run GCP init there.

hyops init gcp --env <new-env>

Then verify:

hyops init status --env <new-env>
cat "$HOME/.hybridops/envs/<new-env>/meta/gcp.ready.json"

If the new project itself will be created by HybridOps, follow the project-factory flow in Phase 3.

Phase 3: recreate the GCP control-plane substrate

Rebuild these in order:

  1. project / project-factory
  2. object repository
  3. runner or network substrate
  4. GKE cluster
  5. kubeconfig publication
  6. secret-store path
  7. DR-specific managed services if required

That order avoids trying to bootstrap workloads or secrets into a project that is not ready yet.
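
The first two steps, expressed with the same hyops apply pattern used later in this runbook (a sketch; required inputs and instance flags are omitted and depend on your env, and the remaining steps are covered in Phases 3a, 5, and 6):

hyops apply --env <new-env> --module org/gcp/project-factory
hyops apply --env <new-env> \
  --module org/gcp/object-repo \
  --state-instance <instance>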

After org/gcp/project-factory succeeds, rerun:

hyops init gcp --env <env> --force

Recovery is not complete until the readiness marker shows:

  • project_bootstrap_pending=false
  • project_access_validated=true
  • auth_mode="impersonation"
  • impersonation_validated=true
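
A quick way to check those fields, assuming they sit at the top level of the readiness marker (a sketch, not the authoritative schema):

jq -e '.project_bootstrap_pending == false
  and .project_access_validated == true
  and .auth_mode == "impersonation"
  and .impersonation_validated == true' \
  "$HOME/.hybridops/envs/<env>/meta/gcp.ready.json"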

Phase 3a: rebind bucket-backed state safely

When the recovered GCP lane needs a new object bucket lineage, do not try to mutate the original org/gcp/object-repo slot in place. Bucket names are immutable within a HyOps state slot, and the old slot may still be the correct historical record for the retired project.

Recommended recovery pattern:

  1. create a new object-repo state instance
  2. apply that new instance against the recovered project
  3. update downstream repo_state_ref consumers to the new instance
  4. rerun the dependent backup or restore modules

Example:

hyops apply --env dev \
  --module org/gcp/object-repo \
  --state-instance pgbackrest_primary \
  --inputs ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/instances/pgbackrest_primary.inputs.yml

Then point consumers at:

repo_state_ref: org/gcp/object-repo#pgbackrest_primary

If the recovered repo principal changed, rotate the vault secret that carries the repository credential before rerunning DR or backup modules. For GCS-backed pgBackRest flows that usually means replacing PG_BACKUP_GCS_SA_JSON with a key for the new service_account_email published by the recovered object-repo instance.
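
If the credential needs to be reminted rather than copied, a minimal sketch for a new key (the email is the service_account_email published by the recovered instance; loading the JSON back into the runtime vault as PG_BACKUP_GCS_SA_JSON follows your normal vault update flow):

gcloud iam service-accounts keys create pgbackrest-gcs-key.json \
  --iam-account <service_account_email>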

Use this when:

  • the old bucket belonged to a closed or suspended billing lineage
  • the old object-repo slot is still healthy but no longer represents the recovered project
  • you need a clean rebuild without rewriting historical state

Note: for Terraform Cloud-backed GCP modules, HyOps now keeps the current derived workspace name when the resolved project_id changes, instead of preserving a legacy workspace alias that still points at the old project state.

Do not overwrite the old bucket name in place solely because the account changed. That is the wrong recovery primitive for GCS-backed state. If the old bucket name is no longer available when the new instance is applied, choose a new globally unique bucket name in the replacement inputs and recreate the instance cleanly.

After all downstream consumers have been repointed and validated, remove the stale local no-instance slot so the runtime no longer advertises the retired project as the default object repository state:

rm ~/.hybridops/envs/dev/state/modules/org__gcp__object-repo/latest.json
rm ~/.hybridops/envs/dev/config/modules/org__gcp__object-repo/latest.inputs.yml

From that point on, operators should use explicit instance refs such as:

org/gcp/object-repo#pgbackrest_primary
org/gcp/object-repo#vyos_artifacts

If a workflow still uses a bare org/gcp/object-repo ref after multiple ready instances exist, HyOps preflight and validate now fail early and require an explicit instance.

Phase 4: rehydrate secrets into the new GCP project

Because the runtime vault remains canonical, the normal path is:

  1. unlock runtime vault
  2. persist allowlisted secrets into the new GCP project
  3. rebuild the consuming GCP-native surfaces

Example:

hyops vault password >/dev/null
hyops secrets gsm-persist --env <new-env> --scope dr
hyops secrets gsm-persist --env <new-env> --scope build

If the env also carries private Academy, Moodle, or external identity-provider secrets, copy the env-local GSM map example to <root>/config/secrets/gsm/allowed.csv first, then persist those additional scopes deliberately.
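
For example (a sketch; the example filename is an assumption, and <root> is your checkout root):

cp <root>/config/secrets/gsm/allowed.csv.example <root>/config/secrets/gsm/allowed.csv
hyops secrets gsm-persist --env <new-env> --scope <additional-scope>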

Then verify secret-store readiness again where needed.

Phase 5: restore the GKE burst lane

For burst, the minimum clean recovery chain is:

  1. platform/gcp/gke-cluster
  2. platform/gcp/gke-kubeconfig
  3. platform/k8s/argocd-bootstrap
  4. platform/k8s/gcp-secret-store
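
Expressed as module applies in that order (a sketch following the hyops apply pattern from Phase 3a; required inputs are omitted and depend on your env):

hyops apply --env <new-env> --module platform/gcp/gke-cluster
hyops apply --env <new-env> --module platform/gcp/gke-kubeconfig
hyops apply --env <new-env> --module platform/k8s/argocd-bootstrap
hyops apply --env <new-env> --module platform/k8s/gcp-secret-store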

Phase 6: restore the managed DR lane if required

For the managed PostgreSQL DR path, rebuild:

  1. platform/onprem/postgresql-dr-source posture if needed
  2. org/gcp/cloudsql-external-replica
  3. promote / failback drill overlays only after standby health is real again
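
A sketch of the managed-replica step with the same apply pattern, assuming preflight takes the same --env flag as the other commands in this runbook (the DR-source posture and the drill overlays are env-specific and not shown):

hyops preflight --env <env>
hyops apply --env <env> --module org/gcp/cloudsql-external-replica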

Common problems to expect

Billing is active but APIs still fail

This is common on fresh credit-backed projects.

Check and enable at least:

  • compute.googleapis.com
  • container.googleapis.com
  • secretmanager.googleapis.com
  • iamcredentials.googleapis.com
  • sqladmin.googleapis.com
  • datamigration.googleapis.com
  • servicenetworking.googleapis.com
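
They can be enabled in one pass (adjust the list to the lanes you are actually rebuilding):

gcloud services enable \
  compute.googleapis.com \
  container.googleapis.com \
  secretmanager.googleapis.com \
  iamcredentials.googleapis.com \
  sqladmin.googleapis.com \
  datamigration.googleapis.com \
  servicenetworking.googleapis.com \
  --project <new-project-id>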

Kubeconfig retrieval fails on a fresh workstation

Make sure the auth plugin is present:

hyops setup cloud-gcp --sudo

or:

sudo apt-get install -y google-cloud-cli-gke-gcloud-auth-plugin

Secret flows fail after project rebuild

Usually one of:

  • the new GCP project id is wrong in env config
  • Secret Manager was not rehydrated from runtime vault
  • Workload Identity bindings were not recreated
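
Two quick checks to localise which of those it is (project id and service-account email are placeholders):

# was Secret Manager actually repopulated in the new project?
gcloud secrets list --project <new-project-id>

# were the Workload Identity bindings recreated on the recovered service account?
gcloud iam service-accounts get-iam-policy <service-account-email>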

Cloud SQL private service connection fails with AUTH_PERMISSION_DENIED

This is no longer the same problem as stale project drift.

Interpretation:

  • the module is already targeting the recovered project correctly
  • the effective Terraform identity can reach the project and create the PSA range
  • but it still lacks the private peering permission on the project that owns the VPC

Repair path for a single-project VPC:

hyops init gcp --env dev --with-cli-login --force \
  --project-id hybridops-dev-gcp-03 \
  --region europe-west2

That reruns the current-project bootstrap path and re-ensures roles/servicenetworking.networksAdmin on the Terraform service account.

For Shared VPC:

  • grant the effective Terraform identity the required network roles on the host project that owns the VPC (see the sketch below)
  • minimum expected roles for the Cloud SQL private-service-access path:
       • roles/compute.networkAdmin
       • roles/servicenetworking.networksAdmin
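
A sketch of the host-project grants (host project id and the Terraform identity are placeholders; adapt if your org manages IAM declaratively):

for role in roles/compute.networkAdmin roles/servicenetworking.networksAdmin; do
  gcloud projects add-iam-policy-binding <host-project-id> \
    --member "serviceAccount:<terraform-sa-email>" \
    --role "$role"
done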

Then rerun hyops preflight or hyops validate before apply.

Burst or DR modules still point at old project ids

This is a state/env hygiene problem, not a secret-loss problem.

Treat it by:

  • checking env-scoped input overlays
  • checking project-state refs
  • checking any pinned project id fields before rerun
  • recreating object-repo lineage with a new --state-instance when the old bucket belonged to the retired project

Do not assume that --force on the old object-repo slot is the right fix. If the bucket name itself has to change, the clean repair is a new state instance plus updated repo_state_ref consumers.
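
A quick way to find lingering pins, assuming you know the retired project id (paths follow the env layout used earlier in this runbook):

grep -rn "<old-project-id>" \
  "$HOME/.hybridops/envs/<env>/config/modules/" \
  "$HOME/.hybridops/envs/<env>/state/modules/"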

DR node connectivity fails immediately after recreate

Freshly recreated GCP DR nodes can pass transport reachability before the guest SSH service is ready for Ansible. If validation fails with an SSH banner timeout or an "ssh service did not become ready yet" message, wait briefly and rerun validation instead of rewriting module state.

DR restore reaches the new bucket but backup.info is missing

That means the credential and bucket wiring are correct, but the recovered repo is empty. The fix is not another restore retry.

Required recovery sequence:

  1. point the surviving source-cluster backup module at the recovered object-repo instance
  2. run an on-demand backup into that repo
  3. use the resulting backup state as backup_state_ref for the DR restore
  4. rerun the DR restore after the new backup set is published

In practice this often means:

repo_state_ref: org/gcp/object-repo#pgbackrest_primary
backup_state_ref: platform/onprem/postgresql-ha-backup#postgresql_backup_run_onprem_dr
restore_target_timeline: ''

When backup_state_ref is present, HyOps resolves the backup label from state. Leave restore_target_timeline blank unless the repository contains divergent lineages and you are deliberately pinning one.

Do not treat an empty recovered bucket as a Terraform or workspace problem. It is a missing backup corpus problem.

Verification checklist

Recovery is in a healthy state when:

  • hyops init status --env <new-env> is green for GCP
  • project/billing/API checks are green
  • the runtime vault remains unchanged and decryptable
  • GCP Secret Manager is repopulated from the runtime vault where required
  • GKE burst baseline is healthy again
  • managed Cloud SQL standby can be re-established if required
  • no public docs or module defaults were changed just to fit the temporary account
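
A minimal spot-check of the first few items (a sketch; the burst and DR checks depend on which lanes you actually rebuilt):

# env-level GCP readiness
hyops init status --env <new-env>

# project and API posture
gcloud services list --enabled --project <new-project-id>

# secrets were projected back into the new project
gcloud secrets list --project <new-project-id>

# runtime vault still unlocks (same command used in Phase 4)
hyops vault password >/dev/null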

Outcome

If this runbook is followed, a fresh GCP account or credit-backed student project is an operational rebuild, not a platform redesign.
