There was a problem hiding this comment.
This page is woefully out of date. There is an ongoing discussion to either remove this page altogether or commit to better maintenance, but for right now it's probably better to add the note than not.
| CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} includes a built-in fault tolerance demo in the {{ site.data.products.cloud }} Console that automatically runs a sample workload and simulates a node failure on your cluster, showing real-time metrics of query latency and failure rate during the outage and recovery. | ||
| {{ site.data.alerts.callout_info }} | ||
| The CockroachDB {{ site.data.products.cloud }} fault tolerance demo is in [Preview]({% link {{ page.version.version }}/cockroachdb-feature-availability.md %}). |
There was a problem hiding this comment.
Not for this PR, but it seems like we should have prebuilt macros for each of the visibilities. It's annoying that we have to add the version and availability link in each place we use this. Ideally we could do this and have it link up correctly.
fault tolerance demo is in {{site.data.visibility.preview}}.
There was a problem hiding this comment.
Agreed, some amount of these callouts can be turned into a macro/snippet. We're planning a migration to a new docs site that has its own tools, so rather than create a DOC ticket to do this in the current system we'll earmark it to investigate later.
| - A [CockroachDB {{ site.data.products.advanced }} cluster]({% link cockroachcloud/create-an-advanced-cluster.md %}) with at least three nodes. | ||
| - All nodes are healthy. | ||
| - The cluster's CPU utilization is below 30%. | ||
| - The cluster does not have a custom [replication zone configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}). |
There was a problem hiding this comment.
We dropped this one.
There are some others but I don't think most of them are worth listing.
The one additional requirement I think we should consider is that the cluster must be in an unlocked state. For example, if the cluster is already undergoing cluster disruption, is being scaled, or is under maintenance, users won't be able to run the demo. Anything that has locked the cluster will prevent the demo from starting. The messaging we show the user in this case is:
The fault tolerance demo cannot be run because this cluster is currently in a locked state. Try again once the cluster is available.
There was a problem hiding this comment.
Added
- The cluster is not currently in a locked state as a result of maintenance such as scaling.
I'm a little concerned about calling out a "locked state" when we don't currently have a formal "locked" cluster state exposed to the user, so I'm hoping with this wording we don't imply to the user that a "locked cluster" is a state they need to worry about in day-to-day operation. I'm hoping the cluster status rework project helps us out in the long run.
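For reference, the prerequisites discussed in this thread could be modeled as a single pre-flight check. This is a minimal sketch, not how the Cloud Console implements it: `ClusterSnapshot`, its fields, and `demo_blockers` are hypothetical names, and only the locked-state message is taken verbatim from the comment above. The other messages are placeholders.

```python
from dataclasses import dataclass

# Message quoted in the review thread for the locked-cluster case.
LOCKED_MESSAGE = (
    "The fault tolerance demo cannot be run because this cluster is "
    "currently in a locked state. Try again once the cluster is available."
)


@dataclass
class ClusterSnapshot:
    """Hypothetical view of a cluster; not a real CockroachDB Cloud API type."""
    node_count: int
    healthy_nodes: int
    cpu_utilization: float  # fraction between 0.0 and 1.0
    locked: bool            # e.g. scaling or maintenance in progress


def demo_blockers(cluster):
    """Return the reasons the demo could not start (empty list if eligible)."""
    blockers = []
    if cluster.locked:
        blockers.append(LOCKED_MESSAGE)
    if cluster.node_count < 3:
        blockers.append("The cluster must have at least three nodes.")
    if cluster.healthy_nodes < cluster.node_count:
        blockers.append("All nodes must be healthy.")
    if cluster.cpu_utilization >= 0.30:
        blockers.append("CPU utilization must be below 30%.")
    return blockers
```

Returning a list of reasons rather than a boolean mirrors the point made in the thread: the checks run automatically when the user tries to start the demo, so the only thing the user sees is the message explaining why it did not start.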
| - The cluster's CPU utilization is below 30%. | ||
| - The cluster does not have a custom [replication zone configuration]({% link {{ page.version.version }}/configure-replication-zones.md %}). | ||
| To run the fault tolerance demo, open the {{ site.data.products.cloud }} Console and navigate to **Actions > Fault tolerance demo**. Follow the prompts to check that your cluster is eligible and begin the demo. |
There was a problem hiding this comment.
Follow the prompts to check that your cluster is eligible
I wonder if this will be confusing? There aren't really any visible prompts to check eligibility since we run them automatically when you try to start the demo.
There was a problem hiding this comment.
Removed the second sentence. I think the UX is straightforward enough without the specific details written out here.
| To start using your CockroachDB {{ site.data.products.advanced }} cluster, refer to: | ||
| - [Connect to your cluster]({% link cockroachcloud/connect-to-your-cluster.md %}) | ||
| - Run the [fault tolerance demo]({% link {{ site.versions["stable"] }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud) |
There was a problem hiding this comment.
I noticed that the headline that produces this page anchor uses the macro {{ site.data.products.cloud }}, but the value is hard-coded here in the anchor. That means that if we ever changed the value of cloud, the anchor would break. This is obviously unlikely, and perhaps an automated link scanner would catch it, but it suggests that our current system for linking within pages is lacking. It would be better if we had a layer of indirection for each headline that would allow us to change its name without changing its id. Then we could use the id to look up the current name and generate the anchor on the fly. Obviously we wouldn't do any of that in this PR. Just food for thought.
There was a problem hiding this comment.
Yep, this is a known limitation of the current way Jekyll/Liquid handles variables. The band-aid solution is that our CI/CD runs an internal linkcheck for anchors against the rendered HTML before publishing, so we'll get notified in this situation and be able to fix it in flight.
That said, this is another thing we're hoping our new site has better solutions for.
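To illustrate the kind of anchor check described above, here is a minimal sketch that scans rendered HTML for `href` fragments with no matching `id` on the target page. It is an assumption about how such a check could work, not the team's actual CI job; `broken_anchors` and the `pages` mapping (page path to rendered HTML) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _AnchorCollector(HTMLParser):
    """Collects every element id and every <a href> on one rendered page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def broken_anchors(pages):
    """Return (source_page, href) pairs whose #fragment has no matching id.

    `pages` maps a page path to its rendered HTML. Links to pages outside
    `pages` (for example, external sites) are skipped.
    """
    parsed = {}
    for path, html in pages.items():
        collector = _AnchorCollector()
        collector.feed(html)
        parsed[path] = collector

    broken = []
    for path, collector in parsed.items():
        for href in collector.links:
            target, fragment = urldefrag(href)
            if not fragment:
                continue
            target = target or path  # "#foo" points at the current page
            if target in parsed and fragment not in parsed[target].ids:
                broken.append((path, href))
    return broken
```

Because the check runs on rendered output, a heading built from a Liquid variable is compared using its final text, which is why a change to the variable's value would surface as a broken anchor before publishing.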
| CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} includes a built-in fault tolerance demo in the {{ site.data.products.cloud }} Console that automatically runs a sample workload and simulates a node failure on your cluster, showing real-time metrics of query latency and failure rate during the outage and recovery. | ||
| {{ site.data.alerts.callout_info }} | ||
| The CockroachDB {{ site.data.products.cloud }} fault tolerance demo is in [Preview]({% link {{ page.version.version }}/cockroachdb-feature-availability.md %}). |
There was a problem hiding this comment.
nit: suggest lower-case "preview" since we changed everything on the feature availability page to be lowercased (really embarrassed i wrote this, don't care that much)
| ## Run a manual demo on a local machine | ||
| This guide walks you through a simple demonstration of CockroachDB's resilience on a local cluster deployment. Starting with a 6-node local cluster with the default 3-way replication, you'll run a sample workload, terminate a node to simulate failure, and see how the cluster continues uninterrupted. You'll then leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes. You'll then prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then take two nodes offline at the same time, and again see how the cluster continues uninterrupted. |
There was a problem hiding this comment.
this paragraph is pretty dense. non-blocking suggest to rewrite for more scannability, e.g. something like
This guide walks you through a simple demonstration of CockroachDB’s resilience on a local cluster deployment.
Starting with a 6-node local cluster using the default 3-way replication, you will:
- Run a sample workload
- Terminate one node to simulate a failure
- Observe the cluster continue serving traffic uninterrupted
Next, you’ll leave that node offline for long enough to watch the cluster repair itself by re-replicating missing data to other nodes.
Finally, you’ll prepare the cluster for 2 simultaneous node failures by increasing to 5-way replication, then:
- Take two nodes offline at the same time
- See how the cluster continues uninterrupted
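The step from 3-way to 5-way replication follows from majority quorum: a range stays available only while a majority of its replicas are reachable. A small sketch of that arithmetic, assuming each replica of a range lives on a different node and the failures happen at the same time (the function names are illustrative only):

```python
def quorum(replicas):
    """Smallest majority of a replica set of the given size."""
    return replicas // 2 + 1


def tolerable_failures(replicas):
    """How many replicas can be lost at once while a majority remains."""
    return replicas - quorum(replicas)


for factor in (3, 5):
    print(f"{factor}-way replication: quorum={quorum(factor)}, "
          f"tolerates {tolerable_failures(factor)} simultaneous failure(s)")
```

With 3-way replication a range tolerates one lost replica, and with 5-way replication it tolerates two, which matches the one-node and two-node failure steps in the guide.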
| ## Feb 24, 2026 | ||
| CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} users can now run a built-in [fault tolerance demo]({% link {{ site.versions["stable"] }}/demo-cockroachdb-resilience.md %}#run-a-guided-demo-in-cockroachdb-cloud) that allows you to monitor query execution during a simulated failure and recovery. The fault tolerance demo is in [Preview]({% link {{ site.versions["stable"] }}/cockroachdb-feature-availability.md %}). |
| ## Before you begin | ||
| ## Run a guided demo in CockroachDB {{ site.data.products.cloud }} | ||
| CockroachDB {{ site.data.products.cloud }} {{ site.data.products.advanced }} includes a built-in fault tolerance demo in the {{ site.data.products.cloud }} Console that automatically runs a sample workload and simulates a node failure on your cluster, showing real-time metrics of query latency and failure rate during the outage and recovery. |
There was a problem hiding this comment.
It's technically an availability zone failure, which could involve multiple nodes on larger clusters.
| - The cluster's CPU utilization is below 30%. | ||
| - The cluster is not currently in a locked state as a result of maintenance such as scaling. | ||
| To run the fault tolerance demo, open the {{ site.data.products.cloud }} Console and navigate to **Actions > Fault tolerance demo**. |
There was a problem hiding this comment.
should we have a disclaimer here that we don't recommend running the fault tolerance demo on your production cluster? It is a live cluster demo.
Other things to consider:
- You need cluster operator and/or cluster admin permissions to run the demo
- The demo injects a temporary database and workload into your cluster, and cleans up after the demo is complete. The clean up step may take a few minutes after your demo ends.
- You cannot run a second demo on a cluster if one is already running (this applies if you have multiple cluster admins for the same cluster).
- The demo can take 10-15 mins to complete end to end.
| - The cluster's CPU utilization is below 30%. | ||
| - The cluster is not currently in a locked state as a result of maintenance such as scaling. | ||
| To run the fault tolerance demo, open the {{ site.data.products.cloud }} Console and navigate to **Actions > Fault tolerance demo**. |
There was a problem hiding this comment.
@fantapop for the demo, do we have recommendations on size of cluster or anything like that?
https://cockroachlabs.atlassian.net/browse/DOC-11437