Automatic Failover Simulation - Line-by-Line Documentation

This document explains the simulate_automatic_failover project: what each piece does, why it is built that way, and the pros and cons.

Table of Contents

Project Overview
Infrastructure & Replication
Application Code (main.go)
Summary: Pros and Cons of Automatic Failover

Project Overview

A single primary is a single point of failure: when it dies, writes stop. High availability adds a standby that already has a near-current copy of the data (via streaming replication) and a failover mechanism that detects the primary's death and promotes the standby to take over.

The two hard parts are detection (is the primary really dead, or just slow?) and promotion (make the standby writable, then send traffic to it). This project implements both as an explicit control loop so the logic is visible.

Infrastructure & Replication

The Docker topology is the same streaming-replication setup used in simulate_read_replicas and simulate_read_write_splitting: one primary (port 5470) and one hot standby (port 5471) cloned with pg_basebackup -R. See those projects' DOCUMENTATION.md for the line-by-line replication walkthrough. The only differences are the ports and the container/volume names (failover_*).

What matters here: the standby is a hot standby — it replays WAL continuously and serves read-only queries — and it can be promoted to a full primary on demand.

Application Code (main.go)

FailoverManager — the control loop

type FailoverManager struct {
    current    *pgxpool.Pool  // whoever is the writable primary right now
    standby    *pgxpool.Pool  // promoted on failover
    failThresh int            // consecutive failures before failover
    probeEvery, probeTimout time.Duration // probe interval and per-probe deadline
    promoted   bool           // set once failover has completed
}

current always points at the node we believe is the writable primary. After a successful promotion it is swapped to point at the (former) standby.
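The excerpt does not show how `current` is protected when the watcher swaps it while other goroutines are reading it. The following is a minimal sketch of one way to do that, assuming a read-write mutex. The `node` type stands in for `*pgxpool.Pool`, and `manager`, `primary`, and `swap` are hypothetical names that do not come from main.go.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for *pgxpool.Pool so the sketch runs without a database.
type node struct{ name string }

// manager mirrors the shape of FailoverManager; the mutex is an assumption, not taken from main.go.
type manager struct {
	mu      sync.RWMutex
	current *node
	standby *node
}

// primary returns the node currently believed to be the writable primary.
func (m *manager) primary() *node {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.current
}

// swap points current at the standby after a successful promotion.
func (m *manager) swap() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.current = m.standby
}

func main() {
	m := &manager{current: &node{"primary:5470"}, standby: &node{"standby:5471"}}
	fmt.Println("before:", m.primary().name)
	m.swap()
	fmt.Println("after: ", m.primary().name)
}
```

Writers that call `primary()` on every request pick up the new node as soon as `swap()` returns, without holding a stale pointer.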

Detection — `healthy` + `WatchAndFailover`

func (m *FailoverManager) healthy(ctx context.Context) bool {
    // Bound the probe so a hung primary counts as a failure.
    cctx, cancel := context.WithTimeout(ctx, m.probeTimout)
    defer cancel()
    var one int
    return m.current.QueryRow(cctx, "SELECT 1").Scan(&one) == nil
}

Element	What It Does	Why We Do It
`SELECT 1`	Cheapest possible liveness check	Confirms the server actually answers queries
`WithTimeout`	Bounds how long a probe can hang	A hung primary must count as a failure
consecutive count	Requires `failThresh` misses in a row	One blip shouldn't trigger a disruptive failover
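To illustrate the `WithTimeout` row, here is a small sketch of how a deadline turns a hung server into a failed probe. It assumes nothing about the database: `check` is a hypothetical stand-in for the `SELECT 1` query, and `probe`, `hung`, `fast`, and `demo` are illustrative names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probe runs check under a deadline and reports whether it succeeded in time.
func probe(ctx context.Context, timeout time.Duration, check func(context.Context) error) bool {
	cctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return check(cctx) == nil
}

// hung simulates a primary that accepts the request but never answers.
func hung(ctx context.Context) error {
	<-ctx.Done() // blocks until the deadline fires
	return ctx.Err()
}

// fast simulates a healthy primary that answers immediately.
func fast(ctx context.Context) error { return nil }

// demo probes a healthy and a hung primary with the same 50ms budget.
func demo() (fastOK, hungOK bool) {
	ctx := context.Background()
	return probe(ctx, 50*time.Millisecond, fast), probe(ctx, 50*time.Millisecond, hung)
}

func main() {
	fastOK, hungOK := demo()
	fmt.Println("fast primary healthy:", fastOK)
	fmt.Println("hung primary healthy:", hungOK)
}
```

Without the deadline, the call against the hung server would block indefinitely and the loop would never get to count a failure.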

if failures >= m.failThresh {
    m.promote(ctx)   // commit to failover
}

Why a threshold? Networks hiccup. Failing over on a single missed probe causes flapping — needless promotions that themselves cause downtime. Requiring N consecutive failures trades a few seconds of detection delay for stability.
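The counting rule can be sketched in isolation. This is an illustration of the idea rather than the code in main.go: `detector`, `observe`, and `firstTrigger` are hypothetical names, and resetting the count on any successful probe is an assumption implied by the word "consecutive".

```go
package main

import "fmt"

// detector counts consecutive failed probes; a single success resets the count.
type detector struct {
	failThresh int
	failures   int
}

// observe records one probe result and reports whether failover should start.
func (d *detector) observe(healthy bool) bool {
	if healthy {
		d.failures = 0
		return false
	}
	d.failures++
	return d.failures >= d.failThresh
}

// firstTrigger replays probe results and returns the 1-based index of the probe
// that triggers failover, or 0 if none does.
func firstTrigger(failThresh int, results []bool) int {
	d := &detector{failThresh: failThresh}
	for i, ok := range results {
		if d.observe(ok) {
			return i + 1
		}
	}
	return 0
}

func main() {
	blips := []bool{true, false, true, false, false, true}
	outage := []bool{true, false, false, false}
	fmt.Println("blips trigger at probe:", firstTrigger(3, blips))   // 0: never three in a row
	fmt.Println("outage triggers at probe:", firstTrigger(3, outage)) // 4
}
```

The cost of this stability is detection delay: in the worst case roughly `failThresh` probe intervals, plus up to one probe timeout for each probe that hangs rather than failing fast.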

Promotion — `promote`

m.standby.Exec(ctx, "SELECT pg_promote(wait := true)") // blocks until promoted or the wait times out
// then poll until writable:
var inRecovery bool
m.standby.QueryRow(ctx, "SELECT pg_is_in_recovery()").Scan(&inRecovery)
if !inRecovery { m.current = m.standby; m.promoted = true }

Step	What It Does	Why We Do It
`pg_promote(wait:=true)`	Tells the standby to exit recovery and accept writes	This is PostgreSQL's built-in promotion call
poll `pg_is_in_recovery()`	Wait until the node reports it is no longer a replica	Don't route writes before it's truly writable
swap `current = standby`	Point the manager at the new primary	All later writes go to the promoted node

After this, the former standby is a normal read-write primary containing every change it had received and replayed before the crash. Anything the old primary committed but had not yet streamed is absent.
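The "poll until writable" step can be sketched as a bounded retry loop. This is a sketch under stated assumptions, not the loop from main.go: `recoveryChecker` is a hypothetical interface whose real implementation would run `SELECT pg_is_in_recovery()` against the standby, and `fakeStandby` exists only so the example runs without PostgreSQL.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// recoveryChecker is a hypothetical interface; a real implementation would
// query pg_is_in_recovery() on the standby.
type recoveryChecker interface {
	InRecovery(ctx context.Context) (bool, error)
}

// waitWritable polls until the node reports it has left recovery,
// the context is cancelled, or the attempts run out.
func waitWritable(ctx context.Context, n recoveryChecker, attempts int, every time.Duration) error {
	for i := 0; i < attempts; i++ {
		inRecovery, err := n.InRecovery(ctx)
		if err == nil && !inRecovery {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(every):
		}
	}
	return errors.New("standby still in recovery after all attempts")
}

// fakeStandby leaves recovery after a fixed number of polls.
type fakeStandby struct{ pollsLeft int }

func (f *fakeStandby) InRecovery(context.Context) (bool, error) {
	if f.pollsLeft > 0 {
		f.pollsLeft--
		return true, nil
	}
	return false, nil
}

// demo reports whether a standby that needs `polls` polls becomes writable within `attempts`.
func demo(polls, attempts int) bool {
	standby := &fakeStandby{pollsLeft: polls}
	return waitWritable(context.Background(), standby, attempts, time.Millisecond) == nil
}

func main() {
	fmt.Println("writable within 5 attempts:", demo(2, 5))
	fmt.Println("writable within 3 attempts:", demo(10, 3))
}
```

Only when `waitWritable` returns nil would the manager swap `current`; a bounded attempt count keeps a stuck promotion from blocking the loop forever.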

Simulating the crash — `simulatePrimaryCrash`

exec.Command("docker", "stop", "failover_primary").CombinedOutput()

The program stops the primary's container so the manager observes a genuine outage rather than a faked flag. If docker isn't reachable it prints the manual command instead.
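A sketch of the stop-or-fall-back behaviour described above, assuming the command runner is injected so both paths can be exercised without Docker. The names `runner`, `stopContainer`, `okRunner`, and `failRunner` and the exact fallback message are hypothetical; only the `docker stop failover_primary` command comes from the project.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

// runner abstracts command execution so the fallback path can be tested.
type runner func(name string, args ...string) ([]byte, error)

// dockerRunner shells out for real; pass it to stopContainer to stop the actual container.
func dockerRunner(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).CombinedOutput()
}

// stopContainer tries to stop the container and returns a message for the log.
func stopContainer(run runner, container string) string {
	out, err := run("docker", "stop", container)
	if err != nil {
		return "docker not reachable; run manually: docker stop " + container
	}
	return "stopped: " + strings.TrimSpace(string(out))
}

// okRunner fakes a successful docker stop, which prints the container name.
func okRunner(string, ...string) ([]byte, error) { return []byte("failover_primary\n"), nil }

// failRunner fakes a machine where the docker binary is unavailable.
func failRunner(string, ...string) ([]byte, error) { return nil, errors.New("docker: not found") }

func main() {
	fmt.Println(stopContainer(okRunner, "failover_primary"))
	fmt.Println(stopContainer(failRunner, "failover_primary"))
}
```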

The demo flow

1. Write a row to the primary; wait a second; confirm the standby replicated it.
2. docker stop the primary.
3. The probe loop logs OK until the stop, then logs failures 1/3, 2/3, 3/3.
4. On the 3rd failure, pg_promote() runs; the standby becomes writable.
5. A post-failover write succeeds; the final count includes the pre-crash row → no data lost.

Summary: Pros and Cons of Automatic Failover

Pros

Benefit	Explanation
High availability	Survives a primary crash without manual intervention
Fast recovery	Promotion takes seconds, not a human paging cycle
Data durability	Standby already holds replicated data up to the crash
Hands-off ops	No 3am "promote the replica" runbook to execute by hand

Cons

Drawback	Explanation
Split-brain risk	Two primaries if the old one returns un-fenced → divergence
Possible data loss	Async replication can lose the last un-replicated writes
False positives	A slow/partitioned primary may be promoted away unnecessarily
Operational complexity	Real HA needs consensus, fencing, and re-cloning the old node
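The "possible data loss" row can be made concrete by measuring how far the standby trails the primary. PostgreSQL reports WAL positions as LSNs (for example from `pg_current_wal_lsn()` on the primary and `pg_last_wal_receive_lsn()` on the standby), written as two hexadecimal halves separated by a slash. The sketch below converts two such values into a byte gap. The sample LSNs are made up, `parseLSN` and `unreplicatedBytes` are hypothetical helpers, and this project does not perform such a check.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts an LSN such as "0/3000060" into a byte position.
// The text form is the high 32 bits in hex, a slash, then the low 32 bits in hex.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("malformed LSN %q", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, err
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, err
	}
	return h<<32 | l, nil
}

// unreplicatedBytes returns how much WAL the primary has written
// that the standby has not yet received.
func unreplicatedBytes(primaryLSN, standbyLSN string) (uint64, error) {
	p, err := parseLSN(primaryLSN)
	if err != nil {
		return 0, err
	}
	s, err := parseLSN(standbyLSN)
	if err != nil {
		return 0, err
	}
	if s > p {
		return 0, nil // standby is not behind
	}
	return p - s, nil
}

func main() {
	gap, err := unreplicatedBytes("0/3000160", "0/3000060")
	fmt.Println(gap, err) // 256 <nil>
}
```

If the primary dies while this gap is non-zero, the WAL in the gap never reached the standby and the writes it describes are lost on failover.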

When to Use Automatic Failover

Always, for any service that must stay up — but use a battle-tested tool (Patroni, repmgr, pg_auto_failover, or a managed RDS/Cloud SQL HA option) rather than hand-rolled logic. Those add the consensus store, fencing, and quorum that prevent split-brain. The value of this simulation is understanding what those tools do under the hood.

Key Takeaways

Detection = repeated, timed health probes with a failure threshold.
A threshold prevents flapping on transient blips.
Promotion = pg_promote() + wait for pg_is_in_recovery() = false.
Async replication means failover can lose the last few writes (RPO > 0).
Production HA must add fencing/quorum to avoid split-brain — don't ship this as-is.

go run output

$ go run main.go
✅ Failover manager up: primary (5470) + standby (5471)

============================================================
🔁 AUTOMATIC FAILOVER SIMULATION
============================================================

✍️  WRITE to primary, confirm it replicates
--------------------------------------------------
   wrote 1 row; primary now has 5 customers
   standby replicated: 5 customers (read-only hot standby)

💥 SIMULATE PRIMARY CRASH
--------------------------------------------------
   💥 stopping primary container (docker stop failover_primary)...
   stopped: failover_primary

🩺 HEALTH-CHECK LOOP → automatic promotion
--------------------------------------------------
   probe  1: primary UNREACHABLE ❌ (1/3)
   probe  2: primary UNREACHABLE ❌ (2/3)
   probe  3: primary UNREACHABLE ❌ (3/3)
   → failure threshold reached; promoting standby
   ✅ standby promoted; it now accepts writes

✍️  WRITE after failover (now served by promoted node)
--------------------------------------------------
   write succeeded on the new primary; 6 customers total
   (includes both the pre- and post-failover rows → no data lost)

📊 SUMMARY
--------------------------------------------------
   Detection: consecutive failed health probes.
   Action:    pg_promote() the standby, wait until writable, reroute writes.
   Restore the old node with: docker compose up -d postgres_primary
