Merged
@@ -0,0 +1,146 @@
---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.3.1

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Duplicate KB ID: KB260500167 conflicts with existing knowledge-base article.

The front-matter ID must be unique across all troubleshooting documents. The context reference snippet shows that docs/en/solutions/Capturing_data_for_an_intermittent_cluster_issue_with_paired_DaemonSets.md already uses this ID. Assign a new, distinct KB ID to this article.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/en/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md`
at line 7, The front-matter KB ID is set to the duplicate value `KB260500167`,
which is already used in another troubleshooting document. Replace this
duplicate ID with a new, distinct KB ID that is not currently used in any other
documentation files. Update the KB ID value in the front-matter of this document
to ensure uniqueness across all troubleshooting articles.

---

# Rook-Ceph upgrade leaves CephCluster Progressing because the pod network cannot reach the storage network

## Issue

During an Alauda Container Platform upgrade from 4.2.4 to 4.3.1, Alauda Build of Rook-Ceph is upgraded from Reef 18.2.7 to Squid 19.2.3. The OSD rollout can stall partway through, leaving the `CephCluster` in `Progressing` for an extended period.

In the affected environment, Rook-Ceph components had been customized before the upgrade: `rook-ceph-operator`, CSI provisioner components, or related Rook-Ceph pods were changed to run with `hostNetwork` so they could bypass a pod-network-to-storage-network connectivity gap. After the upgrade, the manual changes in the OLM CSV and component ConfigMaps are overwritten by the new default configuration. The affected components restart on the regular pod network, can no longer reach the Ceph storage network, and block the Rook-Ceph upgrade.

## Environment

- Alauda Container Platform: 4.2.4 upgraded to 4.3.1
- Rook-Ceph: Reef 18.2.7 upgraded to Squid 19.2.3
- Namespace: `rook-ceph`
- Applicable scenario: the pod network cannot reach the Ceph storage network, and the environment previously relied on manual `hostNetwork` changes for Rook-Ceph component connectivity

## Root Cause

The upgrade does not create the network failure by itself. The underlying problem is that the cluster does not satisfy a required network condition: pods on the regular pod network cannot reach the Ceph storage network. Before the upgrade, the environment depended on manual customization that moved selected Rook-Ceph control-plane or CSI components onto the host network.

During an upgrade, OLM renders and manages operator-related Deployments from the new CSV. Manual parameters in component ConfigMaps can also be replaced by defaults or by a later reconcile. For this reason, direct edits to a CSV or ConfigMap are not durable upgrade configuration. When those edits are reverted, Rook-Ceph components return to the regular pod network, lose access to the Ceph network, and the OSD upgrade and `CephCluster` reconciliation stall.

## Diagnostic Steps

Check whether the `CephCluster` remains in an upgrading or `Progressing` state:

```bash
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph describe cephcluster
```

Check whether the Rook-Ceph operator, tools, CSI provisioner, or related pods were recreated, and whether their Deployments are running on the regular pod network:

```bash
kubectl -n rook-ceph get pod -o wide
kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}'
```
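The jsonpath output above prints an empty value when `hostNetwork` is unset, which is easy to miss in a long list. A minimal sketch of a filter that keeps only the Deployments still on the regular pod network follows; the function name `list_pod_network_deploys` is hypothetical and not part of any Rook or kubectl tooling.

```bash
# Print the names of Deployments whose pod template does not set hostNetwork=true.
# Input: lines of the form "<name> hostNetwork=<value>", as produced by the
# jsonpath query above. An empty <value> means hostNetwork is unset.
list_pod_network_deploys() {
  awk '{ split($2, kv, "="); if (kv[2] != "true") print $1 }'
}

# Usage (requires cluster access):
#   kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}' \
#     | list_pod_network_deploys
```

Any Deployment printed by the filter runs on the regular pod network and therefore depends on pod-network-to-storage-network connectivity.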

Inspect the CSV to confirm whether the `hostNetwork` settings for `rook-ceph-operator`, `rook-ceph-tools`, or CSI-related Deployments have been reverted:

```bash
kubectl get csv -A | grep rook-ceph
kubectl -n rook-ceph get csv <rook-ceph-csv-name> -o yaml
```

Check whether `rook-ceph-operator-config` still contains the pre-upgrade parameter that forced host networking:

```bash
kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml
```

Verify connectivity from the regular pod network to the Ceph storage network. The following commands show the method only; replace `<storage-network-ip>` with a reachable Ceph MON, OSD, or other address on the storage network, and `<ceph-mon-ip>` with a Ceph MON address:

```bash
# Start a temporary pod on the regular pod network and open a shell in it
kubectl -n rook-ceph run network-check --rm -it --restart=Never \
  --image=busybox:1.36 -- sh

# Run the following inside the pod shell
ping -c 3 <storage-network-ip>
nc -vz <ceph-mon-ip> 3300
nc -vz <ceph-mon-ip> 6789
```

If a regular pod cannot reach the storage network while a host-network pod can, the stalled upgrade is most likely caused by missing network connectivity rather than by the Rook-Ceph version itself.
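When several MON or OSD endpoints need checking, the manual `nc` calls can be wrapped in a small loop. The sketch below is an illustration only: `probe_endpoints` is a hypothetical helper name, and it assumes the `nc` in the test image accepts `-z` and `-w`, as the BusyBox build used above is expected to.

```bash
# Probe each "<host>:<port>" argument with a 3-second connect timeout.
# Prints one result line per endpoint and returns non-zero if any is unreachable.
probe_endpoints() {
  rc=0
  for ep in "$@"; do
    host=${ep%:*}    # everything before the last colon
    port=${ep##*:}   # everything after the last colon
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "$ep reachable"
    else
      echo "$ep unreachable"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, inside the network-check pod (placeholders as in the commands above):
#   probe_endpoints <ceph-mon-ip>:3300 <ceph-mon-ip>:6789
```

Running the same loop from a pod on the regular pod network and from a host-network pod gives a direct side-by-side comparison.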

## Resolution

The long-term fix is to make the Ceph storage network reachable from the pod network. Rook-Ceph operator, CSI provisioner, tools, and other components that need Ceph access must be able to reach the storage network while running on the regular pod network.

Confirm the following items according to the site's network model:

- The Pod CIDR or CNI egress addresses can route to the Ceph public network and any required cluster network.
- Ceph MON ports `3300` and `6789`, and the required OSD port range, are allowed by network ACLs, firewalls, and security groups from the pod network or CNI SNAT addresses.
- If NetworkPolicy is used, egress from the relevant pods in the `rook-ceph` namespace to the Ceph storage network is allowed.
- If the CNI SNATs pod-to-external traffic, the storage network allows the translated source addresses.
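For the NetworkPolicy item, an egress allow rule could look like the sketch below, which writes a manifest to a local file for review. Everything site-specific here is an assumption: the policy name, the file name, the `192.0.2.0/24` storage CIDR (a documentation range), and the `6800-7300` OSD range, which is the Ceph default and may differ on site. The rule only makes sense where egress from the namespace is already restricted by another policy; in a namespace with no egress policy, creating it would itself limit egress from all selected pods to the listed destinations.

```bash
# Write an example egress policy to a local file for review; nothing is applied here.
cat > rook-ceph-egress-to-storage.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-ceph-storage   # hypothetical name
  namespace: rook-ceph
spec:
  podSelector: {}                      # all pods in the namespace; narrow as needed
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 192.0.2.0/24         # placeholder: replace with the storage network CIDR
      ports:
        - protocol: TCP
          port: 3300                   # Ceph MON (msgr2)
        - protocol: TCP
          port: 6789                   # Ceph MON (msgr1)
        - protocol: TCP
          port: 6800                   # assumed default OSD port range
          endPort: 7300
EOF
```

After adjusting the CIDR, selector, and ports to the site, the file can be applied with `kubectl apply -f rook-ceph-egress-to-storage.yaml`.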

After fixing the network path, recheck connectivity from a regular pod to the Ceph MON and OSD addresses, then verify that Rook-Ceph reconciliation resumes:

```bash
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get pod -o wide
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=200
```

The upgrade blockage is resolved when `CephCluster` returns to `Ready`, the OSD pods complete their rolling upgrade, and newly created PVCs can bind and mount normally.
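The `Ready` condition can also be checked in a script instead of by eye. The following is a minimal sketch, assuming the `CephCluster` resource reports its phase in `.status.phase`; the helper name `all_ready` is hypothetical.

```bash
# Succeed only if the argument is a non-empty, whitespace-separated list of
# phases and every phase is "Ready".
all_ready() {
  [ -n "$1" ] || return 1
  for phase in $1; do
    [ "$phase" = "Ready" ] || return 1
  done
}

# Usage (requires cluster access):
#   phases=$(kubectl -n rook-ceph get cephcluster -o jsonpath='{.items[*].status.phase}')
#   all_ready "$phases" && echo "CephCluster is Ready"
```

This covers only the `CephCluster` phase; the OSD rollout and PVC bind and mount checks described above still need to be confirmed separately.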

## Temporary Recovery

If production service must be restored before the network path is fixed, temporarily restore the pre-upgrade `hostNetwork` customization so the Rook-Ceph components can reach the storage network through the node network. This is only an emergency workaround. It is not a final fix, because manual CSV and ConfigMap edits can be overwritten again during a later upgrade or reconcile.

First identify the current Rook-Ceph CSV:

```bash
kubectl get csv -A | grep rook-ceph
```

Edit the Rook-Ceph CSV in the `rook-ceph` namespace. In the Deployment templates for `rook-ceph-operator`, `rook-ceph-tools`, or the CSI provisioner components that the site has confirmed require storage-network access, restore:

```yaml
hostNetwork: true
```

Then inspect or restore the temporary parameter in `rook-ceph-operator-config`:

```bash
kubectl -n rook-ceph edit configmap rook-ceph-operator-config
```

Example value:

```yaml
data:
  ROOK_ENFORCE_HOST_NETWORK: "true"
```

After the change, watch the affected pods restart and confirm that `CephCluster` continues progressing:

```bash
kubectl -n rook-ceph get pod -w
kubectl -n rook-ceph get cephcluster -w
```

After emergency recovery, schedule the network fix and remove the dependency on manual `hostNetwork` customization before the next upgrade.

## Pre-upgrade Prevention

Before upgrading Rook-Ceph, check whether the environment depends on manual customization:

```bash
kubectl -n rook-ceph get deploy -o yaml | grep -n "hostNetwork"
kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml
kubectl -n rook-ceph get csv -o yaml | grep -n "hostNetwork"
```

If the components can reach the storage network only through `hostNetwork`, fix pod-network-to-storage-network connectivity before upgrading the platform or Rook-Ceph. Do not treat direct CSV or operator ConfigMap edits as durable upgrade configuration.
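The three checks above can be folded into a single pass or fail test, for example in a pre-upgrade checklist script. The sketch below is illustrative; `depends_on_host_network` is a hypothetical helper that simply pattern-matches YAML text and does not parse it.

```bash
# Succeed if the YAML read from stdin sets hostNetwork: true or
# ROOK_ENFORCE_HOST_NETWORK: "true" anywhere in the text.
depends_on_host_network() {
  grep -Eq '(hostNetwork:[[:space:]]*true|ROOK_ENFORCE_HOST_NETWORK:[[:space:]]*"?true"?)'
}

# Usage (requires cluster access):
#   for kind in deploy csv "configmap rook-ceph-operator-config"; do
#     if kubectl -n rook-ceph get $kind -o yaml | depends_on_host_network; then
#       echo "WARNING: $kind relies on host networking"
#     fi
#   done
```

A warning from any of the three resources indicates a manual customization that a later upgrade or reconcile may overwrite.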

## Related Issue

- Jira: ACP-53205
@@ -0,0 +1,146 @@
---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - '4.3.1'
---
Comment on lines +1 to +8

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CRITICAL: Front-matter is missing required id field.

The front-matter template is incomplete. Per the context reference snippets (both EN and ZH Capturing_data_for_an_intermittent_cluster_issue_with_paired_DaemonSets.md), the id field is required after ProductsVersion and before the closing ---. Add a unique KB ID to both the EN and ZH files.

Example format:

---
kind:
  - Troubleshooting
products:
  - Alauda Container Platform
ProductsVersion:
  - '4.3.1'
id: <UNIQUE_KB_ID>
---

Replace <UNIQUE_KB_ID> with a new, distinct identifier (do not reuse existing KB IDs from other troubleshooting documents).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@docs/zh/solutions/Rook_Ceph_upgrade_stuck_because_pod_network_cannot_reach_storage_network_on_ACP.md`
around lines 1 - 8, The YAML front-matter in this troubleshooting document is
missing the required id field. Add a new line after the ProductsVersion field
and before the closing --- delimiter with the key id and assign it a unique KB
identifier that does not conflict with any existing troubleshooting document
IDs. The complete front-matter should have kind, products, ProductsVersion, and
id fields in that order.


# Rook-Ceph 升级后 CephCluster 持续 Progressing,因为 Pod 网络无法访问存储网络

## 问题

在 Alauda Container Platform 从 4.2.4 升级到 4.3.1 的过程中,Alauda Build of Rook-Ceph 从 Reef 18.2.7 升级到 Squid 19.2.3。升级期间 OSD 升级卡住,`CephCluster` 长时间处于 `Progressing` 状态。

现场曾在升级前对 Rook-Ceph 组件做过差异化配置:将 `rook-ceph-operator` 和 CSI provisioner 等组件改为 `hostNetwork` 运行,以绕过 Pod 网络无法访问存储网络的问题。升级后,OLM CSV 和组件 ConfigMap 中的手工修改被新版本默认配置覆盖,相关组件重新以普通 Pod 网络启动,导致它们无法访问 Ceph 存储网络,进而阻塞 Rook-Ceph 升级。

## 环境

- Alauda Container Platform: 4.2.4 升级到 4.3.1
- Rook-Ceph: Reef 18.2.7 升级到 Squid 19.2.3
- 命名空间: `rook-ceph`
- 适用场景: Pod 网络与 Ceph 存储网络未打通,且升级前通过 `hostNetwork` 手工规避过 Rook-Ceph 组件连通性问题

## 根本原因

该问题不是升级过程主动破坏了网络,而是集群原本存在网络前提不满足:普通 Pod 网络无法访问 Ceph 存储网络。升级前的环境依赖手工差异化配置,让部分 Rook-Ceph 控制面或 CSI 组件使用主机网络访问存储网络。

升级时,OLM 会根据新版本 CSV 重新渲染并管理 operator 相关 Deployment;组件 ConfigMap 中的手工参数也可能被默认配置或后续 reconcile 覆盖。因此,直接编辑 CSV 或 ConfigMap 形成的 `hostNetwork` 差异不是稳定配置。升级后这些差异被还原,Rook-Ceph 组件回到普通 Pod 网络后无法访问 Ceph 网络,导致 OSD 升级和 `CephCluster` reconcile 卡住。

## 诊断步骤

查看 `CephCluster` 是否持续处于升级中或 `Progressing` 状态:

```bash
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph describe cephcluster
```

查看 Rook-Ceph operator、tools、CSI provisioner 等 Pod 是否重建过,以及是否运行在普通 Pod 网络:

```bash
kubectl -n rook-ceph get pod -o wide
kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}'
```

检查 CSV 中与 `rook-ceph-operator`、`rook-ceph-tools` 或 CSI 相关 Deployment 的 `hostNetwork` 配置是否已经被还原:

```bash
kubectl get csv -A | grep rook-ceph
kubectl -n rook-ceph get csv <rook-ceph-csv-name> -o yaml
```

检查 `rook-ceph-operator-config` 中是否仍保留升级前用于强制 host network 的参数:

```bash
kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml
```

从普通 Pod 网络验证到 Ceph 存储网络的连通性。以下命令只给出检查方式,`<storage-network-ip>` 需要替换为现场 Ceph MON、OSD 或存储网络上可验证的地址,`<ceph-mon-ip>` 需要替换为 Ceph MON 地址:

```bash
# 在普通 Pod 网络中启动临时 Pod 并进入 shell
kubectl -n rook-ceph run network-check --rm -it --restart=Never \
  --image=busybox:1.36 -- sh

# 以下命令在 Pod 的 shell 中执行
ping -c 3 <storage-network-ip>
nc -vz <ceph-mon-ip> 3300
nc -vz <ceph-mon-ip> 6789
```

如果普通 Pod 无法访问存储网络,而 host network Pod 可以访问,则说明升级卡住的核心风险是网络连通性缺失,而不是 Rook-Ceph 版本本身。

## 解决方案

长期解决方案是打通 Pod 网络到 Ceph 存储网络的访问路径,使 Rook-Ceph operator、CSI provisioner、tools 以及其他需要访问 Ceph 的组件在普通 Pod 网络下也能访问存储网络。

根据现场网络模型,至少需要确认以下方向:

- Pod CIDR 或 CNI 出口地址能够路由到 Ceph public network 和必要的 cluster network。
- Ceph MON 端口 `3300`、`6789` 以及 OSD 所需端口范围在网络 ACL、防火墙和安全组中允许来自 Pod 网络或 CNI SNAT 地址访问。
- 如果使用 NetworkPolicy,需要允许 `rook-ceph` 命名空间内相关 Pod 到 Ceph 存储网络的 egress。
- 如果 CNI 对 Pod 到外部网络做 SNAT,需要确认存储网络侧允许 SNAT 后的源地址。

网络修复后,重新检查普通 Pod 到 Ceph MON 和 OSD 地址的连通性,再观察 Rook-Ceph reconcile 是否恢复:

```bash
kubectl -n rook-ceph get cephcluster
kubectl -n rook-ceph get pod -o wide
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=200
```

当 `CephCluster` 恢复到 `Ready`,OSD Pod 完成滚动升级,且新建 PVC 能正常绑定和挂载时,说明升级阻塞已解除。

## 临时恢复

如果生产环境必须先恢复升级流程,可以临时恢复升级前的 `hostNetwork` 差异配置,使 Rook-Ceph 组件重新通过主机网络访问存储网络。该方式只能作为应急规避,不应作为最终方案,因为 CSV 和 ConfigMap 手工修改可能在后续升级或 reconcile 中再次被覆盖。

先确认当前 Rook-Ceph CSV:

```bash
kubectl get csv -A | grep rook-ceph
```

编辑 `rook-ceph` 命名空间中的 Rook-Ceph CSV,在与 `rook-ceph-operator`、`rook-ceph-tools` 或现场确认需要访问存储网络的 CSI provisioner 相关的 Deployment 模板中补回:

```yaml
hostNetwork: true
```

再检查或补回 `rook-ceph-operator-config` 中的临时参数:

```bash
kubectl -n rook-ceph edit configmap rook-ceph-operator-config
```

示例值:

```yaml
data:
  ROOK_ENFORCE_HOST_NETWORK: "true"
```

修改后观察相关 Pod 是否重建,并确认 `CephCluster` 是否继续推进:

```bash
kubectl -n rook-ceph get pod -w
kubectl -n rook-ceph get cephcluster -w
```

应急恢复完成后,仍需要安排网络修复,并在下次升级前取消对手工 `hostNetwork` 差异的依赖。

## 升级前预防检查

在升级 Rook-Ceph 前,检查是否存在依赖手工差异化配置的环境:

```bash
kubectl -n rook-ceph get deploy -o yaml | grep -n "hostNetwork"
kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml
kubectl -n rook-ceph get csv -o yaml | grep -n "hostNetwork"
```

如果发现只有通过 `hostNetwork` 才能访问存储网络,应先修复 Pod 网络到存储网络的连通性,再执行平台或 Rook-Ceph 升级。不要把直接编辑 CSV 或 operator ConfigMap 作为可持久升级配置使用。

## 关联问题

- Jira: ACP-53205