gpstate: fix false "Acting as Primary" report for hot standby mirrors #1743

Open
jangjang0401 wants to merge 1 commit into apache:main from jangjang0401:fix/gpstate-hot-standby-false-positive

@jangjang0401
Contributor

With hot_standby=on, mirrors accept SQL connections, so pg_isready
reports PQPING_OK for them. The previous code unconditionally mapped
PQPING_OK + role=mirror to "Acting as Primary", causing gpstate -s to
show a spurious warning on every mirror.

Fix this in clsSystemState.__buildGpStateData() by cross-checking
pg_stat_replication on the primary: if the mirror has an active WAL
receiver connection in streaming or catchup state, it is a legitimate
hot standby and the status is corrected to "Up". Only fall through to
"Acting as Primary" when no such replication connection exists, meaning
the segment truly promoted itself to primary.

This approach reuses the existing primary connection already established
by _add_replication_info(), so no additional database connections are
required. _add_replication_info() is updated to return the raw
replication state string to make this information available to the
caller.

What does this PR do?

Fixes a false positive in gpstate -s where every mirror segment is
reported as "Acting as Primary" when hot_standby=on is enabled.

Root cause:

  • With hot_standby=on, mirrors accept read-only SQL connections and
    pg_isready returns PQPING_OK(0) instead of PQPING_MIRROR_READY(64).
  • The legacy logic in gpgetstatususingtransition.py assumed
    PQPING_OK + role=mirror could only mean "the mirror was promoted to
    primary", which is no longer true under hot standby.

Fix:

  • In clsSystemState.__buildGpStateData(), cross-check the mirror's
    status against pg_stat_replication (already queried on the primary
    by _add_replication_info()).
  • If the mirror has an active WAL receiver connection (streaming or
    catchup), it is a legitimate hot standby → corrected to "Up".
  • If no such replication connection exists, the mirror genuinely
    promoted itself → "Acting as Primary" is preserved.
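
The decision described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `resolve_mirror_status` and `ACTIVE_REPLICATION_STATES` are hypothetical names, and the ping codes are the values quoted in the root-cause section.

```python
# Ping result codes as quoted in this PR description.
PQPING_OK = 0
PQPING_MIRROR_READY = 64

# Replication states treated as "the mirror has a live WAL receiver connection".
ACTIVE_REPLICATION_STATES = {"streaming", "catchup"}


def resolve_mirror_status(ping_result, replication_state):
    """Hypothetical helper: pick the display status for a mirror segment.

    ping_result       -- pg_isready result code for the mirror
    replication_state -- raw state string seen on the primary in
                         pg_stat_replication, or None if no row exists
    Returns None for cases this change does not touch.
    """
    if ping_result != PQPING_OK:
        # Not the ambiguous case; the existing mapping stays responsible.
        return None
    if replication_state in ACTIVE_REPLICATION_STATES:
        # The primary still sees this mirror replicating: healthy hot standby.
        return "Up"
    # No active replication connection: the mirror really promoted itself.
    return "Acting as Primary"
```

For example, `resolve_mirror_status(PQPING_OK, "streaming")` yields "Up", while `resolve_mirror_status(PQPING_OK, None)` keeps "Acting as Primary".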

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

None. The fix only changes how the status string is derived in
gpstate; cluster configuration, replication behavior, and segment
behavior are unchanged.

Test Plan

Tested on a 5-VM cluster (1 coordinator + 1 standby coordinator + 3
segment hosts, 6 primary / 6 mirror segments) with hot_standby=on:

Before the fix:

  • Every mirror reported Segment status = Acting as Primary
  • gpstate -s produced spurious warnings on every mirror
  • gp_segment_configuration, replication state, and gpstate -m all
    showed the cluster as healthy; only the gpstate -s output was wrong

After the fix:

  • Every mirror reports Segment status = Up
  • No "Acting as Primary" entries appear on healthy mirrors
  • No additional database connections are opened

Impact

Performance:
No additional database connections. The fix consumes data that
_add_replication_info() already collects from the primary; only the
return value of that function is changed.
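
As a rough sketch of why no extra connection is needed: a function that already walks the pg_stat_replication rows fetched from the primary can simply return the raw state it found. The function below is a simplified stand-in, not the real _add_replication_info(); its signature, the `display` dict, and matching on `application_name` are assumptions made for illustration.

```python
def add_replication_info(display, rows, mirror_app_name):
    """Simplified stand-in for a helper that fills display fields.

    display         -- dict of fields to show for the mirror (hypothetical)
    rows            -- pg_stat_replication rows already fetched from the
                       primary, each given as a dict
    mirror_app_name -- application_name used to pick the mirror's row
                       (an assumption for this sketch)
    Returns the raw replication state string, or None if no row matches.
    """
    for row in rows:
        if row.get("application_name") == mirror_app_name:
            display["replication_state"] = row["state"]
            # Returning the raw state is the only new behavior here.
            return row["state"]
    display["replication_state"] = "Unknown"
    return None
```

The caller can then pass the returned value straight into its status decision without querying the mirror.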

User-facing changes:
gpstate -s no longer reports false "Acting as Primary" warnings for
healthy hot-standby mirrors. Genuine promotion is still detected and
reported as before.

Dependencies:
None.

Additional Context

An alternative approach considered was to call pg_is_in_recovery()
directly on each mirror inside _get_segment_status() in
gpgetstatususingtransition.py. That approach was rejected because:

  1. It opens one additional DB connection per mirror on every gpstate
    invocation, which is undesirable in unhealthy-cluster scenarios
    where gpstate is run most frequently.
  2. Mirror-side recovery state is a weaker source of truth than the
    primary's view of its replication connections.

The pg_stat_replication approach reuses an existing connection and is
authoritative from the primary's perspective.
