docs: add Rook-Ceph upgrade network troubleshooting #791
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,146 @@ | ||
| --- | ||
| kind: | ||
| - Troubleshooting | ||
| products: | ||
| - Alauda Container Platform | ||
| ProductsVersion: | ||
| - 4.3.1 | ||
| --- | ||
|
|
||
| # Rook-Ceph upgrade leaves CephCluster Progressing because the pod network cannot reach the storage network | ||
|
|
||
| ## Issue | ||
|
|
||
| During an Alauda Container Platform upgrade from 4.2.4 to 4.3.1, Alauda Build of Rook-Ceph is upgraded from Reef 18.2.7 to Squid 19.2.3. During the upgrade, the OSD rollout can stall and the `CephCluster` remains in `Progressing` for an extended period. | ||
|
|
||
| In the affected environment, Rook-Ceph components had been customized before the upgrade: `rook-ceph-operator`, CSI provisioner components, or related Rook-Ceph pods were changed to run with `hostNetwork` so they could bypass a pod-network-to-storage-network connectivity gap. After the upgrade, the manual changes in the OLM CSV and component ConfigMaps are overwritten by the new default configuration. The affected components restart on the regular pod network, can no longer reach the Ceph storage network, and block the Rook-Ceph upgrade. | ||
|
|
||
| ## Environment | ||
|
|
||
| - Alauda Container Platform: 4.2.4 upgraded to 4.3.1 | ||
| - Rook-Ceph: Reef 18.2.7 upgraded to Squid 19.2.3 | ||
| - Namespace: `rook-ceph` | ||
| - Applicable scenario: the pod network cannot reach the Ceph storage network, and the environment previously relied on manual `hostNetwork` changes for Rook-Ceph component connectivity | ||
|
|
||
| ## Root Cause | ||
|
|
||
| The upgrade does not create the network failure by itself. The underlying problem is that the cluster does not satisfy a required network condition: pods on the regular pod network cannot reach the Ceph storage network. Before the upgrade, the environment depended on manual customization that moved selected Rook-Ceph control-plane or CSI components onto the host network. | ||
|
|
||
| During an upgrade, OLM renders and manages operator-related Deployments from the new CSV. Manual parameters in component ConfigMaps can also be replaced by defaults or by a later reconcile. For this reason, direct edits to a CSV or ConfigMap are not durable upgrade configuration. When those edits are reverted, Rook-Ceph components return to the regular pod network, lose access to the Ceph network, and the OSD upgrade and `CephCluster` reconciliation stall. | ||
|
|
||
| ## Diagnostic Steps | ||
|
|
||
| Check whether the `CephCluster` remains in an upgrading or `Progressing` state: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get cephcluster | ||
| kubectl -n rook-ceph describe cephcluster | ||
| ``` | ||
|
|
||
| Check whether the Rook-Ceph operator, tools, CSI provisioner, or related pods were recreated, and whether their Deployments are running on the regular pod network: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get pod -o wide | ||
| kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}' | ||
| ``` | ||
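In the output of the jsonpath query above, a Deployment that never sets `hostNetwork` prints an empty value (`hostNetwork=`), which is easy to overlook. The following is a minimal sketch, not part of the original article, that filters that output down to the Deployments running on the regular pod network. The function name `list_pod_network_deployments` is hypothetical.

```shell
# Read "NAME hostNetwork=VALUE" lines on stdin and print the names of
# Deployments whose pod template does not set hostNetwork to true.
# An empty value means the field is unset, which defaults to the pod network.
list_pod_network_deployments() {
  while read -r name setting; do
    [ -n "$name" ] || continue
    case "$setting" in
      hostNetwork=true) ;;              # on the host network, skip
      *) printf '%s\n' "$name" ;;       # unset or false: regular pod network
    esac
  done
}

# Usage, feeding it the same query as above:
#   kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}' \
#     | list_pod_network_deployments
```

Any component in the resulting list that needs Ceph access depends on pod-network-to-storage-network connectivity.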
|
|
||
| Inspect the CSV to confirm whether the `hostNetwork` settings for `rook-ceph-operator`, `rook-ceph-tools`, or CSI-related Deployments have been reverted: | ||
|
|
||
| ```bash | ||
| kubectl get csv -A | grep rook-ceph | ||
| kubectl -n rook-ceph get csv <rook-ceph-csv-name> -o yaml | ||
| ``` | ||
|
|
||
| Check whether `rook-ceph-operator-config` still contains the pre-upgrade parameter that forced host networking: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml | ||
| ``` | ||
|
|
||
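| Verify connectivity from the regular pod network to the Ceph storage network. The following commands show the method only. The `ping` and `nc` commands run inside the shell of the temporary pod. Replace `<storage-network-ip>` with the address of a Ceph MON, OSD, or other host on the storage network that is expected to answer, and `<ceph-mon-ip>` with the address of a Ceph MON: | ||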
|
|
||
| ```bash | ||
| kubectl -n rook-ceph run network-check --rm -it --restart=Never \ | ||
| --image=busybox:1.36 -- sh | ||
|
|
||
| ping <storage-network-ip> | ||
| nc -vz <ceph-mon-ip> 3300 | ||
| nc -vz <ceph-mon-ip> 6789 | ||
| ``` | ||
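When several MON addresses and ports need checking, the individual `nc` calls can be wrapped in a small loop. This is a sketch under assumptions: the helper names `probe` and `check_ports` are hypothetical, and it assumes the `nc` in the image accepts `-z` and `-w` (BusyBox `nc` does in recent builds). It is meant to be pasted into the shell of the temporary pod.

```shell
# Probe a single TCP port; -z connects without sending data, -w sets a timeout.
probe() {
  nc -z -w 3 "$1" "$2"
}

# check_ports HOST PORT...
# Prints one line per port and returns non-zero if any port is unreachable.
check_ports() {
  host=$1
  shift
  failed=0
  for port in "$@"; do
    if probe "$host" "$port" >/dev/null 2>&1; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (substitute a real MON address for the documentation address):
#   check_ports 192.0.2.10 3300 6789
```

Running the same function from a host-network pod or directly on a node gives the comparison that the next paragraph relies on.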
|
|
||
| If a regular pod cannot reach the storage network while a host-network pod can, the stalled upgrade is most likely caused by missing network connectivity rather than by the Rook-Ceph version itself. | ||
|
|
||
| ## Resolution | ||
|
|
||
| The long-term fix is to make the Ceph storage network reachable from the pod network. Rook-Ceph operator, CSI provisioner, tools, and other components that need Ceph access must be able to reach the storage network while running on the regular pod network. | ||
|
|
||
| Confirm the following items according to the site's network model: | ||
|
|
||
| - The Pod CIDR or CNI egress addresses can route to the Ceph public network and any required cluster network. | ||
| - Ceph MON ports `3300` and `6789`, and the required OSD port range, are allowed by network ACLs, firewalls, and security groups from the pod network or CNI SNAT addresses. | ||
| - If NetworkPolicy is used, egress from the relevant pods in the `rook-ceph` namespace to the Ceph storage network is allowed. | ||
| - If the CNI SNATs pod-to-external traffic, the storage network allows the translated source addresses. | ||
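For the NetworkPolicy item above, the following sketch renders an egress rule for a given storage CIDR. Everything specific in it is an assumption rather than part of the original article: the function name, the policy name `allow-ceph-storage-egress`, the empty `podSelector`, and the OSD range `6800-7300` (the Ceph default bind range, which a site may have changed). `endPort` also requires a Kubernetes version and CNI that support it.

```shell
# render_storage_egress_policy CIDR
# Prints a NetworkPolicy manifest allowing egress from pods in rook-ceph
# to the given storage network CIDR on the MON ports and an OSD port range.
render_storage_egress_policy() {
  storage_cidr=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ceph-storage-egress   # hypothetical name
  namespace: rook-ceph
spec:
  podSelector: {}                   # assumption: all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: ${storage_cidr}
      ports:
        - protocol: TCP
          port: 3300
        - protocol: TCP
          port: 6789
        - protocol: TCP
          port: 6800                # assumption: Ceph default OSD range
          endPort: 7300
EOF
}

# Usage: review the manifest before applying it.
#   render_storage_egress_policy 10.20.0.0/24 | kubectl apply --dry-run=client -f -
```

A policy like this is only appropriate in a namespace where egress is already restricted by other policies. Applied on its own, it makes the selected pods egress-isolated, so DNS and API server traffic would need their own rules.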
|
|
||
| After fixing the network path, recheck connectivity from a regular pod to the Ceph MON and OSD addresses, then verify that Rook-Ceph reconciliation resumes: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get cephcluster | ||
| kubectl -n rook-ceph get pod -o wide | ||
| kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=200 | ||
| ``` | ||
|
|
||
| The upgrade blockage is resolved when `CephCluster` returns to `Ready`, the OSD pods complete their rolling upgrade, and newly created PVCs can bind and mount normally. | ||
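Waiting for the `Ready` phase can be scripted instead of re-running `get cephcluster` by hand. The sketch below is an illustration with a hypothetical function name and default timings. It assumes a single `CephCluster` in the namespace and reads its `status.phase` field.

```shell
# wait_for_cephcluster_phase PHASE [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls the first CephCluster in rook-ceph until it reports PHASE.
# Returns 0 on success and 1 if the timeout elapses first.
wait_for_cephcluster_phase() {
  want=$1
  timeout=${2:-1800}
  interval=${3:-30}
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    phase=$(kubectl -n rook-ceph get cephcluster \
      -o jsonpath='{.items[0].status.phase}' 2>/dev/null) || phase=""
    echo "CephCluster phase: ${phase:-unknown}"
    if [ "$phase" = "$want" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Usage:
#   wait_for_cephcluster_phase Ready 3600 60
```

The phase alone does not confirm the other two conditions, so the OSD rollout and a test PVC still need to be checked separately.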
|
|
||
| ## Temporary Recovery | ||
|
|
||
| If production service must be restored before the network path is fixed, temporarily restore the pre-upgrade `hostNetwork` customization so the Rook-Ceph components can reach the storage network through the node network. This is only an emergency workaround. It is not a final fix, because manual CSV and ConfigMap edits can be overwritten again during a later upgrade or reconcile. | ||
|
|
||
| First identify the current Rook-Ceph CSV: | ||
|
|
||
| ```bash | ||
| kubectl get csv -A | grep rook-ceph | ||
| ``` | ||
|
|
||
| Edit the Rook-Ceph CSV in the `rook-ceph` namespace. In the Deployment templates for `rook-ceph-operator`, `rook-ceph-tools`, or the CSI provisioner components that the site has confirmed require storage-network access, restore: | ||
|
|
||
| ```yaml | ||
| hostNetwork: true | ||
| ``` | ||
|
|
||
| Then inspect or restore the temporary parameter in `rook-ceph-operator-config`: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph edit configmap rook-ceph-operator-config | ||
| ``` | ||
|
|
||
| Example value: | ||
|
|
||
| ```yaml | ||
| data: | ||
| ROOK_ENFORCE_HOST_NETWORK: "true" | ||
| ``` | ||
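As a non-interactive alternative to `kubectl edit`, the same key can be set with a merge patch, which is easier to record in a change log and to revert later. This is a sketch, and the wrapper name `set_enforce_host_network` is hypothetical. The ConfigMap name and key are the ones shown above.

```shell
# set_enforce_host_network VALUE
# Sets ROOK_ENFORCE_HOST_NETWORK in rook-ceph-operator-config to "true" or "false".
set_enforce_host_network() {
  value=$1
  kubectl -n rook-ceph patch configmap rook-ceph-operator-config \
    --type merge \
    -p "{\"data\":{\"ROOK_ENFORCE_HOST_NETWORK\":\"${value}\"}}"
}

# Usage:
#   set_enforce_host_network true    # emergency workaround
#   set_enforce_host_network false   # revert once the network path is fixed
```

Like the manual edit, this change can be overwritten by a later upgrade or reconcile.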
|
|
||
| After the change, watch the affected pods restart and confirm that the `CephCluster` reconciliation moves forward again: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get pod -w | ||
| kubectl -n rook-ceph get cephcluster -w | ||
| ``` | ||
|
|
||
| After emergency recovery, schedule the network fix and remove the dependency on manual `hostNetwork` customization before the next upgrade. | ||
|
|
||
| ## Pre-upgrade Prevention | ||
|
|
||
| Before upgrading Rook-Ceph, check whether the environment depends on manual customization: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get deploy -o yaml | grep -n "hostNetwork" | ||
| kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml | ||
| kubectl -n rook-ceph get csv -o yaml | grep -n "hostNetwork" | ||
| ``` | ||
|
|
||
| If the components can reach the storage network only through `hostNetwork`, fix pod-network-to-storage-network connectivity before upgrading the platform or Rook-Ceph. Do not treat direct CSV or operator ConfigMap edits as durable upgrade configuration. | ||
|
|
||
| ## Related Issue | ||
|
|
||
| - Jira: ACP-53205 | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,146 @@ | ||
| --- | ||
| kind: | ||
| - Troubleshooting | ||
| products: | ||
| - Alauda Container Platform | ||
| ProductsVersion: | ||
| - '4.3.1' | ||
| --- | ||
|
Review comment (Contributor) on lines +1 to +8:
CRITICAL: Front-matter is missing the required `id` field. The front-matter template is incomplete. Per the context reference snippets (both EN and ZH), the expected format is:
---
kind:
- Troubleshooting
products:
- Alauda Container Platform
ProductsVersion:
- '4.3.1'
id: <UNIQUE_KB_ID>
---
Replace `<UNIQUE_KB_ID>` with a unique KB ID.
|
||
|
|
||
| # Rook-Ceph upgrade leaves CephCluster stuck in Progressing because the pod network cannot reach the storage network | ||
|
|
||
| ## Issue | ||
|
|
||
| During the upgrade of Alauda Container Platform from 4.2.4 to 4.3.1, Alauda Build of Rook-Ceph is upgraded from Reef 18.2.7 to Squid 19.2.3. The OSD upgrade stalls during the process and `CephCluster` stays in the `Progressing` state for a long time. | ||
|
|
||
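| Before the upgrade, the site had applied custom configuration to the Rook-Ceph components: `rook-ceph-operator`, the CSI provisioner, and similar components were switched to run with `hostNetwork` to work around the pod network being unable to reach the storage network. After the upgrade, the manual changes in the OLM CSV and the component ConfigMaps were overwritten by the new version's defaults. The affected components restarted on the regular pod network, could no longer reach the Ceph storage network, and blocked the Rook-Ceph upgrade. | ||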
|
|
||
| ## Environment | ||
|
|
||
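| - Alauda Container Platform: 4.2.4 upgraded to 4.3.1 | ||
| - Rook-Ceph: Reef 18.2.7 upgraded to Squid 19.2.3 | ||
| - Namespace: `rook-ceph` | ||
| - Applicable scenario: the pod network and the Ceph storage network are not connected, and before the upgrade `hostNetwork` was applied manually to work around Rook-Ceph component connectivity problems | ||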
|
|
||
| ## Root Cause | ||
|
|
||
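| This problem is not caused by the upgrade actively breaking the network. The cluster already failed a network prerequisite: the regular pod network cannot reach the Ceph storage network. Before the upgrade, the environment relied on manual custom configuration that let some Rook-Ceph control-plane or CSI components reach the storage network through the host network. | ||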
|
|
||
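| During the upgrade, OLM re-renders and manages the operator-related Deployments from the new version's CSV. Manual parameters in component ConfigMaps can also be overwritten by defaults or by a later reconcile. The `hostNetwork` differences created by directly editing the CSV or ConfigMap are therefore not stable configuration. After the upgrade these differences are reverted, the Rook-Ceph components return to the regular pod network and cannot reach the Ceph network, and the OSD upgrade and `CephCluster` reconcile stall. | ||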
|
|
||
| ## Diagnostic Steps | ||
|
|
||
| Check whether `CephCluster` stays in an upgrading or `Progressing` state: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get cephcluster | ||
| kubectl -n rook-ceph describe cephcluster | ||
| ``` | ||
|
|
||
| Check whether the Rook-Ceph operator, tools, CSI provisioner, and similar pods have been recreated, and whether they run on the regular pod network: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get pod -o wide | ||
| kubectl -n rook-ceph get deploy -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.template.spec.hostNetwork}{"\n"}{end}' | ||
| ``` | ||
|
|
||
| Check whether the `hostNetwork` settings in the CSV for `rook-ceph-operator`, `rook-ceph-tools`, or CSI-related Deployments have been reverted: | ||
|
|
||
| ```bash | ||
| kubectl get csv -A | grep rook-ceph | ||
| kubectl -n rook-ceph get csv <rook-ceph-csv-name> -o yaml | ||
| ``` | ||
|
|
||
| Check whether `rook-ceph-operator-config` still keeps the pre-upgrade parameter used to force host networking: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml | ||
| ``` | ||
|
|
||
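| Verify connectivity from the regular pod network to the Ceph storage network. The following commands only illustrate the check. Replace `<storage-network-ip>` with the address of a Ceph MON, OSD, or other verifiable host on the site's storage network: | ||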
|
|
||
| ```bash | ||
| kubectl -n rook-ceph run network-check --rm -it --restart=Never \ | ||
| --image=busybox:1.36 -- sh | ||
|
|
||
| ping <storage-network-ip> | ||
| nc -vz <ceph-mon-ip> 3300 | ||
| nc -vz <ceph-mon-ip> 6789 | ||
| ``` | ||
|
|
||
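| If a regular pod cannot reach the storage network while a host-network pod can, the core risk behind the stalled upgrade is missing network connectivity, not the Rook-Ceph version itself. | ||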
|
|
||
| ## Resolution | ||
|
|
||
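| The long-term fix is to open the path from the pod network to the Ceph storage network, so that the Rook-Ceph operator, CSI provisioner, tools, and other components that need Ceph access can reach the storage network from the regular pod network. | ||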
|
|
||
| Depending on the site's network model, confirm at least the following: | ||
|
|
||
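| - The Pod CIDR or CNI egress addresses can route to the Ceph public network and any required cluster network. | ||
| - Ceph MON ports `3300` and `6789`, and the port range required by the OSDs, are allowed by network ACLs, firewalls, and security groups from the pod network or the CNI SNAT addresses. | ||
| - If NetworkPolicy is used, egress from the relevant pods in the `rook-ceph` namespace to the Ceph storage network must be allowed. | ||
| - If the CNI applies SNAT to pod-to-external traffic, confirm that the storage network side allows the translated source addresses. | ||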
|
|
||
| After the network is fixed, recheck connectivity from a regular pod to the Ceph MON and OSD addresses, then observe whether Rook-Ceph reconcile resumes: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get cephcluster | ||
| kubectl -n rook-ceph get pod -o wide | ||
| kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=200 | ||
| ``` | ||
|
|
||
| When `CephCluster` returns to `Ready`, the OSD pods finish their rolling upgrade, and newly created PVCs bind and mount normally, the upgrade blockage has been cleared. | ||
|
|
||
| ## Temporary Recovery | ||
|
|
||
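| If the production environment must resume the upgrade first, temporarily restore the pre-upgrade `hostNetwork` customization so that the Rook-Ceph components reach the storage network through the host network again. Use this only as an emergency workaround, not as the final fix, because manual CSV and ConfigMap edits can be overwritten again by a later upgrade or reconcile. | ||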
|
|
||
| First identify the current Rook-Ceph CSV: | ||
|
|
||
| ```bash | ||
| kubectl get csv -A | grep rook-ceph | ||
| ``` | ||
|
|
||
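| Edit the Rook-Ceph CSV in the `rook-ceph` namespace and, in the Deployment templates for `rook-ceph-operator`, `rook-ceph-tools`, or the CSI provisioner components that the site has confirmed need storage-network access, add back: | ||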
|
|
||
| ```yaml | ||
| hostNetwork: true | ||
| ``` | ||
|
|
||
| Then check or add back the temporary parameter in `rook-ceph-operator-config`: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph edit configmap rook-ceph-operator-config | ||
| ``` | ||
|
|
||
| Example value: | ||
|
|
||
| ```yaml | ||
| data: | ||
| ROOK_ENFORCE_HOST_NETWORK: "true" | ||
| ``` | ||
|
|
||
| After the change, watch whether the affected pods are recreated and confirm whether `CephCluster` continues to make progress: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get pod -w | ||
| kubectl -n rook-ceph get cephcluster -w | ||
| ``` | ||
|
|
||
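| After the emergency recovery, the network fix still needs to be scheduled, and the dependency on the manual `hostNetwork` difference must be removed before the next upgrade. | ||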
|
|
||
| ## Pre-upgrade Prevention Checks | ||
|
|
||
| Before upgrading Rook-Ceph, check whether the environment depends on manual custom configuration: | ||
|
|
||
| ```bash | ||
| kubectl -n rook-ceph get deploy -o yaml | grep -n "hostNetwork" | ||
| kubectl -n rook-ceph get configmap rook-ceph-operator-config -o yaml | ||
| kubectl -n rook-ceph get csv -o yaml | grep -n "hostNetwork" | ||
| ``` | ||
|
|
||
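| If the storage network can be reached only through `hostNetwork`, fix connectivity from the pod network to the storage network first, then run the platform or Rook-Ceph upgrade. Do not use direct edits to the CSV or the operator ConfigMap as durable upgrade configuration. | ||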
|
|
||
| ## Related Issue | ||
|
|
||
| - Jira: ACP-53205 | ||
Review comment:
Duplicate KB ID: `KB260500167` conflicts with an existing knowledge-base article. The front-matter ID must be unique across all troubleshooting documents. The context reference snippet shows that `docs/en/solutions/Capturing_data_for_an_intermittent_cluster_issue_with_paired_DaemonSets.md` already uses this ID. Assign a new, distinct KB ID to this article.