Suggest ssh-keygen -R on SSH connect host key mismatch #5645

Open
anton-107 wants to merge 1 commit into main from ssh-connect-host-key-changed-hint

@anton-107

Contributor

Changes

When a Databricks compute resource is recreated, it keeps the same deterministic SSH connection name (e.g. databricks-cpu-<hash>) but gets a new host key. The stale known_hosts entry then trips OpenSSH's strict host key checking, and databricks ssh connect exits with code 255 and the message "Host key verification failed."

Previously this landed in the generic "the cluster's container image is likely missing an OpenSSH server" branch, which is misleading for this case and gives the user no way forward:

Host key verification failed.
The SSH connection closed unexpectedly. If it dropped right after connecting, the cluster's container image is likely missing an OpenSSH server: ...

Now the CLI recognizes the host-key failure and prints an actionable suggestion instead:

Host key verification failed.
The host key for databricks-cpu-6e7644d0 has changed. Remove the stale entry and reconnect:
  ssh-keygen -R databricks-cpu-6e7644d0

When --user-known-hosts-file is set, the suggested command appends -f <file> (since ssh-keygen -R defaults to ~/.ssh/known_hosts).
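
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name suggestHostKeyFix, its parameters, and its return convention are assumptions made for the example, and only the message text and the ssh-keygen -R / -f flags come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// hostKeyFailureMarker is the fixed OpenSSH message quoted in the PR description.
const hostKeyFailureMarker = "Host key verification failed"

// suggestHostKeyFix is a hypothetical helper (not the PR's hostKeyChangedHint).
// It returns an empty string when the captured stderr does not contain the
// host-key failure message, so callers can fall through to other diagnostics.
func suggestHostKeyFix(stderrTail, hostName, knownHostsFile string) string {
	if !strings.Contains(stderrTail, hostKeyFailureMarker) {
		return ""
	}
	cmd := "ssh-keygen -R " + hostName
	// ssh-keygen -R edits ~/.ssh/known_hosts unless -f names another file.
	// Assumption: the path needs no shell quoting; a real CLI may want to quote it.
	if knownHostsFile != "" {
		cmd += " -f " + knownHostsFile
	}
	return fmt.Sprintf(
		"The host key for %s has changed. Remove the stale entry and reconnect:\n  %s",
		hostName, cmd)
}

func main() {
	// Default known_hosts location: no -f flag in the suggestion.
	fmt.Println(suggestHostKeyFix("Host key verification failed.\n", "databricks-cpu-6e7644d0", ""))
	// Custom known_hosts file: the suggestion gains -f <file>.
	fmt.Println(suggestHostKeyFix("Host key verification failed.\n", "databricks-cpu-6e7644d0", "/tmp/known_hosts"))
}
```

Returning an empty string for unrelated failures keeps the existing diagnostics reachable, which matches the description's statement that the other branches are unchanged.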

Implementation

spawnSSHClient previously piped ssh's stderr straight to os.Stderr, so the CLI never saw the failure message. It now tees stderr through a small bounded tail buffer (tailWriter, capped at 4 KB) so the user still sees the live output while the CLI retains the tail to inspect after exit. On exit code 255, a new first branch (hostKeyChangedHint) checks for OpenSSH's fixed Host key verification failed message and emits the hint; the existing server-logs and missing-sshd branches are unchanged for genuine connection drops.
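
To illustrate the tee-and-retain idea, here is a minimal sketch of a bounded tail writer. It is not the PR's tailWriter: the names tailBuffer and writerFunc, the field layout, and the trimming strategy are assumptions chosen for the example. The sketch forwards every write to the destination and keeps only the last limit bytes of what was actually forwarded.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// tailBuffer is a hypothetical stand-in for the PR's tailWriter. It passes
// writes through to dst and remembers at most limit trailing bytes.
type tailBuffer struct {
	dst   io.Writer
	limit int
	tail  []byte
}

// Write forwards p to dst, then records the bytes dst accepted. A short
// write from dst is reported to the caller unchanged.
func (t *tailBuffer) Write(p []byte) (int, error) {
	n, err := t.dst.Write(p)
	t.tail = append(t.tail, p[:n]...)
	if len(t.tail) > t.limit {
		// Copy the kept suffix so the old backing array can be released.
		t.tail = append([]byte(nil), t.tail[len(t.tail)-t.limit:]...)
	}
	return n, err
}

// String returns the retained tail for inspection after the process exits.
func (t *tailBuffer) String() string { return string(t.tail) }

// writerFunc adapts a plain function to io.Writer; it is only a convenience
// for exercising tailBuffer with custom destinations.
type writerFunc func(p []byte) (int, error)

func (f writerFunc) Write(p []byte) (int, error) { return f(p) }

func main() {
	// The user still sees the live output on stderr while the tail is kept.
	tw := &tailBuffer{dst: os.Stderr, limit: 4096}
	io.WriteString(tw, "Warning: some earlier ssh output\n")
	io.WriteString(tw, "Host key verification failed.\n")
	fmt.Printf("retained tail: %q\n", tw.String())
}
```

In a real client, a writer like this would presumably be assigned to the Stderr field of the exec.Cmd that runs ssh, and its contents inspected once the command returns exit code 255.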

Tests

Added unit tests in client_internal_test.go:

  • TestHostKeyChangedHint — host-key failure (with and without a custom known_hosts file) and an unrelated-failure case.
  • TestTailWriterRetainsTail — tail retention and short-write passthrough.

Also manually verified end-to-end against a live serverless connection: corrupted the recorded host key in an isolated known_hosts file, confirmed the new suggestion appears, and confirmed that running the suggested ssh-keygen -R command resolves it.

This pull request and its description were written by Isaac.

When a Databricks compute is recreated it keeps the same deterministic SSH
connection name but gets a new host key, so the stale known_hosts entry trips
OpenSSH's strict checking and `ssh connect` exits 255 with "Host key
verification failed." Until now that landed in the generic "container is likely
missing an OpenSSH server" branch, which is misleading and offers no fix.

Tee ssh's stderr through a bounded tail buffer so we can detect the host-key
failure after exit, and when it occurs print an actionable hint telling the user
to run `ssh-keygen -R <host>` (with `-f <file>` when --user-known-hosts-file is
set) and reconnect.

Co-authored-by: Isaac
@anton-107 anton-107 temporarily deployed to test-trigger-is June 18, 2026 14:41 — with GitHub Actions Inactive
@anton-107 anton-107 requested review from rclarey and rugpanov June 18, 2026 14:43
@eng-dev-ecosystem-bot

Collaborator

Integration test report

Commit: 748fc00

Run: 27767452555

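| Env | 🟨 KNOWN | 🔄 flaky | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 aws linux | 7 | | | 13 | 264 | 1011 | 6:29 |
| 🟨 aws windows | 7 | | | 13 | 266 | 1009 | 10:57 |
| 💚 aws-ucws linux | | | 7 | 13 | 360 | 925 | 6:02 |
| 💚 aws-ucws windows | | | 7 | 13 | 362 | 923 | 7:13 |
| 💚 azure linux | | | 1 | 15 | 267 | 1009 | 5:12 |
| 💚 azure windows | | | 1 | 15 | 269 | 1007 | 6:02 |
| 💚 azure-ucws linux | | | 1 | 15 | 365 | 921 | 6:39 |
| 🔄 azure-ucws windows | | 2 | 1 | 15 | 365 | 919 | 8:32 |
| 💚 gcp linux | | | 1 | 15 | 263 | 1012 | 5:40 |
| 💚 gcp windows | | | 1 | 15 | 265 | 1010 | 6:14 |
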
22 interesting tests: 13 SKIP, 7 KNOWN, 2 flaky

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 🟨 TestAccept | 🟨K | 🟨K | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R | 💚R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨K | 🟨K | 💚R | 💚R | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨K | 🟨K | 💚R | 💚R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🙈 TestAccept/ssh/connection | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S | 🙈S |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/root | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p |
| 🔄 TestFetchRepositoryInfoAPI_FromRepo/subdir | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | ✅p | 🔄f | ✅p | ✅p |

Top 20 slowest tests (at least 2 minutes):
duration env testname
4:19 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:15 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:09 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:00 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:38 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:37 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:15 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:12 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:05 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:56 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:55 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:50 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:47 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:42 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:42 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:41 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:39 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:36 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:32 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

@rugpanov rugpanov left a comment

looks solid!
