diff --git a/docs/docs/deployment-self-hosting/administration-and-maintenance/troubleshooting.md b/docs/docs/deployment-self-hosting/administration-and-maintenance/troubleshooting.md index 24edd88af33a..08e793325f93 100644 --- a/docs/docs/deployment-self-hosting/administration-and-maintenance/troubleshooting.md +++ b/docs/docs/deployment-self-hosting/administration-and-maintenance/troubleshooting.md @@ -8,14 +8,28 @@ Here are some common issues encountered when trying to set up Flagsmith in a sel ## Health Checks -If you are using health checks, make sure to use `/health` as the health-check endpoint for both the API and the frontend. +If you are using health checks, make sure to use `/health` as the health-check endpoint for both the API and the +frontend. ## API and Database Connectivity -The most common cause of issues when setting things up in AWS with an RDS database is missing Security Group permissions between the API application and the RDS database. You need to ensure that the attached security groups for ECS/Fargate/EC2 allow access to the RDS database. [AWS provide more detail about this here](https://aws.amazon.com/premiumsupport/knowledge-center/ecs-task-connect-rds-database/) and [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.RDSSecurityGroups.html). +The most common cause of issues when setting things up in AWS with an RDS database is missing Security Group permissions +between the API application and the RDS database. You need to ensure that the attached security groups for +ECS/Fargate/EC2 allow access to the RDS database. +[AWS provide more detail about this here](https://aws.amazon.com/premiumsupport/knowledge-center/ecs-task-connect-rds-database/) +and [here](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.RDSSecurityGroups.html). Make sure you have a `DATABASE_URL` environment variable set within the API application. 
## Frontend > API DNS Setup -If you are running the API and the frontend as separate applications, you need to make sure that the frontend is pointing to the API. Check the [Frontend environment variables](/deployment-self-hosting/core-configuration/environment-variables#frontend-environment-variables), particularly `API_URL`. +If you are running the API and the frontend as separate applications, you need to make sure that the frontend is +pointing to the API. Check the +[Frontend environment variables](/deployment-self-hosting/core-configuration/environment-variables#frontend-environment-variables), +particularly `API_URL`. + +## Runtime issues after setup + +This page covers setup-time problems. If Flagsmith starts successfully but misbehaves at runtime (task processor not +picking up jobs, migration failures on upgrade, intermittent 502s), see the +[Self-Hosted Troubleshooting](/guides/troubleshooting/self-hosted) guide. diff --git a/docs/docs/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks.md b/docs/docs/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks.md index e8eeecefcd94..3df0aa62f3eb 100644 --- a/docs/docs/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks.md +++ b/docs/docs/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks.md @@ -33,7 +33,7 @@ ORDER BY applied DESC 2. Run the rollback command inside a Flagsmith API container running the _current_ version of Flagsmith: ```bash -python manage.py rollbackmigrationsafter "" +python manage.py rollbackmigrationsappliedafter "" ``` 3. Roll back the Flagsmith API to the desired version. 
diff --git a/docs/docs/guides/_category_.json b/docs/docs/guides/_category_.json new file mode 100644 index 000000000000..1ab8786c2e1b --- /dev/null +++ b/docs/docs/guides/_category_.json @@ -0,0 +1,6 @@ +{ + "label": "Guides", + "position": 125, + "collapsible": true, + "collapsed": true +} diff --git a/docs/docs/guides/troubleshooting/_category_.json b/docs/docs/guides/troubleshooting/_category_.json new file mode 100644 index 000000000000..fd1e9bdf3d88 --- /dev/null +++ b/docs/docs/guides/troubleshooting/_category_.json @@ -0,0 +1,6 @@ +{ + "label": "Troubleshooting", + "position": 130, + "collapsible": true, + "collapsed": false +} diff --git a/docs/docs/guides/troubleshooting/http-errors.mdx b/docs/docs/guides/troubleshooting/http-errors.mdx new file mode 100644 index 000000000000..37a6bcb223c1 --- /dev/null +++ b/docs/docs/guides/troubleshooting/http-errors.mdx @@ -0,0 +1,183 @@ +--- +title: HTTP Errors +sidebar_label: HTTP Errors +sidebar_position: 2 +--- + +import Link from '@docusaurus/Link'; + +This page covers common HTTP error codes returned by the Flagsmith API and what to do about each one. + +## 401 and 403 Authentication failures + +A `401 Unauthorized` or `403 Forbidden` response means the API could not verify your credentials. + +### Common causes + +- **Wrong header name.** The Flags API expects `X-Environment-Key`, while the Admin API expects + `Authorization: Api-Key `. Mixing them up will fail silently with a `403`. +- **Wrong key type.** The Client-side Environment Key and the Server-side Environment Key are different values. If you + are using [local evaluation mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode), you need + the **Server-side Environment Key**. +- **Expired or revoked Admin API token.** Tokens can be deleted from Organisation Settings at any time. If another + team member rotated the token, your requests will start failing. 
+- **Self-hosted: no `ALLOW_ADMIN_INITIATION_VIA_CLI` or wrong `DJANGO_ALLOWED_HOSTS`.** Some self-hosted + configurations reject requests before they reach Flagsmith's authentication layer. + +### Steps to resolve + +1. Confirm which API you are calling: Flags (`/api/v1/flags/`) or Admin (`/api/v1/environments/`, `/api/v1/projects/`, + etc.). +2. Check you are sending the correct header: + +- [Flags API](/integrating-with-flagsmith/flagsmith-api-overview/flags-api): + `X-Environment-Key: ` +- [Admin API](/integrating-with-flagsmith/flagsmith-api-overview/admin-api): + `Authorization: Api-Key ` + +3. Verify the key value in the Flagsmith dashboard under the relevant Environment or Organisation settings. +4. For local evaluation, ensure you are using the **Server-side Environment Key**, not the Client-side key. + +**Related documentation:** +[Flags API Authentication](/integrating-with-flagsmith/flagsmith-api-overview/flags-api/authentication) • +[Admin API Authentication](/integrating-with-flagsmith/flagsmith-api-overview/admin-api/authentication) • +[Integration Approaches](/best-practices/integration-approaches) + +--- + +## 404 Endpoint not found + +A `404 Not Found` usually means the request URL is wrong, not that a resource is missing. + +:::note + +An unknown or invalid Environment Key returns `401 Unauthorized`, not `404`. If you suspect a bad key is the cause, see +[401 and 403 Authentication failures](#401-and-403-authentication-failures) above. + +::: + +### Common causes + +- **Wrong base URL.** The SaaS Edge API is at `https://edge.api.flagsmith.com/api/v1/`. The SaaS Admin API is at + `https://api.flagsmith.com/api/v1/`. Self-hosted deployments use your own domain. +- **Missing `/api/v1/` prefix.** All endpoints are nested under this path. +- **Trailing-slash mismatch.** Django (the framework behind the Flagsmith API) expects a trailing slash on most + endpoints. `/api/v1/flags` will redirect or 404 depending on server configuration. 
Use `/api/v1/flags/` instead. +- **EU vs. US region confusion.** If your project is on the EU cluster, the base URL differs from the default US + cluster. Check your project settings in the dashboard. + +### Steps to resolve + +1. Copy the full URL from the failing request (check browser dev tools or application logs). +2. Compare it against the [API overview](/integrating-with-flagsmith/flagsmith-api-overview) to confirm the path is + correct. +3. Ensure the URL ends with a trailing slash. +4. If you are self-hosting, verify that the `API_URL` environment variable in your frontend matches the API's actual + address. + +**Related documentation:** [Flagsmith API Overview](/integrating-with-flagsmith/flagsmith-api-overview) + +--- + +## 429 Rate limited + +A `429 Too Many Requests` response means you have exceeded a traffic limit. + +### How rate limiting works in Flagsmith + +- **SDK endpoints (Flags API)** are _not_ rate limited by design. However, your plan has a monthly request allowance. + You can review current usage and your plan tier under **Organisation → Admin Settings** in the dashboard, or see + [Billing & API Usage](/administration-and-security/billing-api-usage). If you exceed the allowance, Flagsmith may + block requests depending on your plan tier. +- **Admin API endpoints** are rate limited to **500 requests per minute** per user by default. Self-hosted deployments + can adjust this with the `USER_THROTTLE_RATE` environment variable. + +### Common causes + +- **Tight polling interval on a client-side SDK.** If `startListening` is set to a very short interval (e.g. 1000 ms) + across many clients, aggregate traffic can spike quickly. +- **Scripted Admin API calls without backoff.** Bulk operations (creating flags, updating segments) in a tight loop + will hit the 500/min limit. +- **Free plan limit exceeded.** Free-plan accounts are blocked after exceeding the monthly allowance. A warning email + is sent 7 days before blocking begins. 
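As a minimal sketch of how a bulk Admin API script might avoid the tight-loop pattern above, the helpers below pace requests under the 500 requests per minute default and compute a wait time after a `429`. The names `min_interval_seconds` and `backoff_delay` are hypothetical and not part of any Flagsmith SDK. The sketch also assumes that a `Retry-After` header, when present, carries a number of seconds; anything else falls back to exponential backoff with jitter.

```python
import random


def min_interval_seconds(limit_per_minute: int = 500) -> float:
    """Smallest gap between requests that stays under a per-minute limit."""
    return 60.0 / limit_per_minute


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    `retry_after` is the raw Retry-After header value, if the response had one.
    Assumption: a usable value is a number of seconds. An HTTP-date or any
    other unparsable value is ignored and the jittered backoff is used instead.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; fall through to backoff
    # Exponential ceiling (base, 2*base, 4*base, ...) capped at `cap`,
    # with a random wait below the ceiling so parallel scripts do not retry in lockstep.
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

A script would sleep for `min_interval_seconds()` between successful calls and for `backoff_delay(attempt, header_value)` after each `429`, giving up after a bounded number of attempts.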
+ +### Steps to resolve + +1. Check the `Retry-After` header in the 429 response for how long to wait. +2. For Admin API scripts, add exponential backoff or reduce request concurrency. +3. For SDK traffic, consider switching to + [local evaluation mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode) which fetches a + single environment document instead of per-request API calls. +4. **Self-hosted: enable server-side caching.** Setting `CACHE_FLAGS_SECONDS` and/or + `CACHE_ENVIRONMENT_DOCUMENT_SECONDS` collapses repeated identical requests into a single cache hit, reducing pressure + on both the database and the throttle. See + [Caching Strategies](/deployment-self-hosting/core-configuration/caching-strategies). +5. Review your plan usage in **Organisation → Admin Settings** in the dashboard. + +**Related documentation:** [System Limits](/administration-and-security/governance-and-compliance/system-limits) • +[Billing & API Usage](/administration-and-security/billing-api-usage) + +--- + +## 502 and 503 Transient or upstream failures + +A `502 Bad Gateway` or `503 Service Unavailable` means the API server did not return a valid response to the upstream +proxy or load balancer. + +### SaaS + +These are typically transient. The Flagsmith SaaS platform runs across multiple AWS regions behind a global edge +network; brief 502/503 errors can occur during deployments or region failovers. + +**What to do:** retry the request with exponential backoff. If the error persists for more than a few minutes, check the +[Flagsmith status page](https://status.flagsmith.com) or [contact support](/support). + +### Self-hosted + +Common causes include: + +- **API container is not running or has crashed.** Check `docker ps` or your orchestrator's pod status. +- **Database is unreachable.** Verify that `DATABASE_URL` is correct and that network/security-group rules allow the + connection. 
+- **Reverse proxy misconfiguration.** If you run Nginx, Traefik, or a cloud load balancer in front of Flagsmith, + ensure the upstream target and health-check path (`/health`) are correct. +- **Task processor overload.** If the task processor shares a database connection pool with the API and is running + behind, it can contribute to connection exhaustion. + +**Related documentation:** [Platform Architecture](/flagsmith-concepts/platform-architecture) • +[Self-Hosted Troubleshooting](/guides/troubleshooting/self-hosted) + +--- + +## 504 Gateway Timeout + +A `504 Gateway Timeout` means the API did not respond within the proxy or load balancer's timeout window. + +### Common causes + +- **Large environment document.** Environments with thousands of flags, segments, or identities produce a large + document that takes longer to serialise. +- **Cold cache.** If you have just deployed or restarted the API, the first few requests will hit the database + directly before the cache is populated. +- **Proxy timeout too short.** The default timeout on many reverse proxies (e.g. Nginx's `proxy_read_timeout`) is 60 + seconds; some cloud load balancers default to 30 seconds. + +### Steps to resolve + +1. **Enable caching.** Set `CACHE_FLAGS_SECONDS` and/or `CACHE_ENVIRONMENT_DOCUMENT_SECONDS` to reduce database load on + hot paths. +2. **Increase proxy timeouts.** If the API responds within its own timeout but the proxy cuts the connection, raise the + proxy's read timeout. +3. **Use local evaluation.** Server-side SDKs in local evaluation mode fetch the environment document once and evaluate + flags in-process, avoiding per-request latency entirely. The refresh interval defaults to 60 seconds and is + configurable per SDK. +4. **Review environment size.** Consider whether you can archive unused flags or split a large project into smaller + ones. +5. **Diagnose environment volume.** In the dashboard, check your project's flag and segment counts under **Project + Settings**. 
Environments with thousands of flags or deeply nested segment rules produce oversized documents; + archiving unused flags, simplifying segments, or moving experimental work into a separate project are often faster + wins than raising proxy timeouts. For numerical ceilings on a single project or environment, see + [System Limits](/administration-and-security/governance-and-compliance/system-limits). + +**Related documentation:** [Caching Strategies](/deployment-self-hosting/core-configuration/caching-strategies) • +[Local Evaluation Mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode) diff --git a/docs/docs/guides/troubleshooting/index.mdx b/docs/docs/guides/troubleshooting/index.mdx new file mode 100644 index 000000000000..2837d64a4d6f --- /dev/null +++ b/docs/docs/guides/troubleshooting/index.mdx @@ -0,0 +1,41 @@ +--- +title: Troubleshooting +sidebar_label: Troubleshooting +sidebar_position: 1 +--- + +import Link from '@docusaurus/Link'; + +This guide helps you diagnose common issues with the Flagsmith API, SDKs, and self-hosted deployments. Start from the +symptom you are seeing, and follow the steps to resolve it. + +## What are you seeing? + +### HTTP errors from the API + +Getting `4xx` or `5xx` responses when calling the Flagsmith API from your application code, an SDK, or a direct HTTP +request. + +Diagnose HTTP errors → + +### SDK behaving unexpectedly + +Flags are stale, default values are returned when they shouldn't be, or trait-based targeting isn't matching the way you +expect. + +Diagnose SDK issues → + +### Self-hosted runtime problems + +The task processor isn't picking up jobs, a database migration failed on upgrade, or your API containers are returning +intermittent errors. + +Diagnose self-hosted issues → + +--- + +:::tip + +If your issue isn't covered here, check the [FAQ](/support/faq) or [contact support](/support). 
+ +::: diff --git a/docs/docs/guides/troubleshooting/sdk-issues.mdx b/docs/docs/guides/troubleshooting/sdk-issues.mdx new file mode 100644 index 000000000000..bca4fda9d83f --- /dev/null +++ b/docs/docs/guides/troubleshooting/sdk-issues.mdx @@ -0,0 +1,224 @@ +--- +title: SDK Issues +sidebar_label: SDK Issues +sidebar_position: 3 +--- + +import Link from '@docusaurus/Link'; + +This page covers common runtime issues with Flagsmith's client-side and server-side SDKs: stale flags, unexpected +default values, and trait evaluation mismatches. + +## Flag updates are delayed or stale + +Your application is not seeing flag changes made in the Flagsmith dashboard, or changes take longer to appear than +expected. + +### Client-side SDKs + +By default, client-side SDKs fetch flags **once** during initialisation. If you need live updates without a page +refresh, you must enable polling explicitly: + +```javascript +flagsmith.startListening(30000); // poll every 30 seconds +``` + +Each poll counts against your plan's API request allowance. Choose an interval that balances freshness against cost. + +### Server-side SDKs in local evaluation mode + +Server-side SDKs in [local evaluation mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode) +fetch the full environment document on a refresh interval and evaluate flags locally. 
The default refresh interval is +**60 seconds** in most SDKs: + +| SDK | Parameter | Default | +| ------- | ----------------------------------------------- | ------------------------- | +| Python | `environment_refresh_interval_seconds` | 60 s | +| Java | `.withEnvironmentRefreshIntervalSeconds(...)` | 60 s | +| Ruby | `environment_refresh_interval_seconds` | 60 s | +| Node.js | `environmentRefreshIntervalSeconds` | 60 s | +| .NET | `EnvironmentRefreshInterval` (`TimeSpan`) | 60 s | +| Go | `flagsmith.WithEnvironmentRefreshInterval(...)` | 60 s | +| Rust | `environment_refresh_interval_mills` | 60 000 ms | +| PHP | `environmentTtl` | `null` (polling disabled) | + +If you lower this interval, the SDK will detect dashboard changes faster. If you raise it, you reduce network calls at +the cost of staleness. + +### Server-side SDKs in remote evaluation mode + +Remote evaluation makes a network call on every flag check. If responses feel stale, the issue is usually **server-side +caching**. Check whether `CACHE_FLAGS_SECONDS` or `CACHE_ENVIRONMENT_DOCUMENT_SECONDS` is set on the API. A high value +here delays how quickly dashboard changes reach your SDK. + +### Real-time updates (SSE) + +If you need near-instant propagation without polling, Flagsmith supports Server-Sent Events (SSE). SDKs subscribed to +the real-time stream are notified immediately when flag state changes and refresh their cached values on the next +evaluation. + +- **Enterprise feature.** Real-time updates require a Flagsmith Enterprise subscription on SaaS, or the real-time + service running alongside your self-hosted deployment. +- **Supported SDKs.** Not every SDK has SSE client support. Check the [Real-time Flags](/performance/real-time-flags) + page for the current SDK matrix. + +If you have enabled real-time updates but changes still aren't propagating, verify that the SDK is configured with the +correct real-time URL and that your network allows long-lived HTTP connections to that endpoint. 
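For the polling-based modes earlier in this section, a rough way to reason about delay is to add the SDK's polling or refresh interval to any server-side cache TTL. The sketch below is a back-of-the-envelope estimate, not documented Flagsmith behaviour: it assumes both layers behave as simple time-based caches, and `worst_case_staleness_seconds` is a hypothetical helper.

```python
def worst_case_staleness_seconds(sdk_interval_s: float, server_cache_ttl_s: float = 0.0) -> float:
    """Upper-bound estimate of how long a dashboard change can stay invisible.

    sdk_interval_s:     client polling interval or local-evaluation refresh
                        interval (use 0 for remote evaluation on every call).
    server_cache_ttl_s: assumed TTL of any server-side cache, for example a
                        value set via CACHE_FLAGS_SECONDS (0 if unset).
    """
    # Worst case: the server cached the old state just before the change,
    # and the SDK fetched that cached copy just before it expired.
    return sdk_interval_s + server_cache_ttl_s
```

With the 60-second default refresh interval and an assumed 30-second server cache, the estimate is 90 seconds, which is often enough to explain a change that "took a minute or so" to appear.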
+ +**Related documentation:** [Server-Side SDKs](/integrating-with-flagsmith/sdks/server-side) • +[Real-time Flags](/performance/real-time-flags) • +[Caching Strategies](/deployment-self-hosting/core-configuration/caching-strategies) + +--- + +## SDK initialisation times out + +The first call to the SDK (commonly `init`, `getFlags`, or constructing the client) hangs and eventually fails, leaving +your application without flag data. + +### Common causes + +- **Network path to the API is slow or blocked.** A corporate proxy, firewall, or VPN may be intercepting outbound + HTTPS to `edge.api.flagsmith.com` (or your self-hosted domain). The SDK can't distinguish "blocked" from "slow" and + will sit on the connection until the request timeout fires. +- **DNS resolution is slow.** Cold containers or restricted DNS (e.g. some Lambda VPC configurations) can add several + seconds to the first lookup. +- **Default request timeout is shorter than the network round-trip.** Most server-side SDKs default to a 10-second + HTTP request timeout (Node.js defaults to 30). On a cold start to a self-hosted instance behind a slow proxy, that + ceiling is easy to hit. +- **Local evaluation fetching a large environment document.** The first refresh in + [local evaluation mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode) downloads the full + environment. For projects with thousands of flags or segments, that response can take longer than a tight default + timeout allows. + +### Steps to resolve + +1. Confirm the SDK can reach the API at all: + `curl -v https://edge.api.flagsmith.com/api/v1/flags/ -H "X-Environment-Key: "` from the same host the SDK + is running on. A clean response rules out network and DNS issues. +2. Raise the SDK's request timeout. Each server-side SDK exposes a parameter for this. 
Examples: + `request_timeout_seconds` (Python, Ruby, Rust), `requestTimeoutSeconds` (Node.js), `RequestTimeout` (.NET), + `WithRequestTimeout(...)` (Go), or `connectTimeout` / `readTimeout` / `writeTimeout` on the OkHttp builder (Java). + See the [Server-Side SDKs reference](/integrating-with-flagsmith/sdks/server-side) for the exact parameter and + default value for your SDK. +3. Enable debug logging in the SDK to see whether the SDK is making the request, where the time is being spent, and + whether it's the initial fetch or a subsequent refresh that is slow. +4. If you're in local evaluation mode and the environment document is large, see the + [504 Gateway Timeout guidance](/guides/troubleshooting/http-errors#504-gateway-timeout). The same fixes (caching, + archiving unused flags, splitting projects) apply. +5. If init genuinely cannot succeed in time, configure an offline handler or a `default_flag_handler` so your + application has a sensible fallback rather than blocking on flag data. + +**Related documentation:** [Server-Side SDKs](/integrating-with-flagsmith/sdks/server-side) • +[Offline Handlers](/integrating-with-flagsmith/sdks/server-side#using-an-offline-handler) + +--- + +## SDK is returning default values + +Your SDK calls are returning the default or fallback value instead of the value configured in the Flagsmith dashboard. + +### Common causes + +- **SDK failed to initialise.** If the initial `getFlags` or `init` call failed (network error, wrong API URL, wrong + key), the SDK has no flag data and falls back to defaults. +- **Wrong Environment Key.** The Client-side Environment Key is different from the Server-side Environment Key. Using + the wrong one will either fail authentication or return an empty environment. +- **Local evaluation with the Client-side key.** Local evaluation requires the **Server-side Environment Key**. The + Client-side key does not have permission to fetch the full environment document. 
+- **Flag not enabled in this environment.** Flags can be enabled in one environment but disabled in another. Check the + dashboard for the specific environment your SDK is pointed at. +- **Offline handler returning stale data.** If you have configured an offline handler or local file handler, the SDK + may be reading from an outdated snapshot rather than the live API. + +### Steps to resolve + +1. Check your application logs for errors during SDK initialisation. Most SDKs log warnings when the initial fetch + fails. +2. Verify the Environment Key in the dashboard and confirm it matches the value your application is using. +3. For local evaluation, confirm you are using the Server-side key. +4. In the dashboard, navigate to the specific environment and check whether the flag is enabled and has the expected + value. +5. If you are using an offline handler, verify that the snapshot file is up to date and accessible. + +:::tip + +You can define a `default_flag_handler` (Python) or equivalent in other SDKs to control what happens when a flag is +missing or unreachable. This is useful for graceful degradation, but be aware that if it fires unexpectedly, it can mask +an initialisation problem. + +::: + +**Related documentation:** [Offline Handlers](/integrating-with-flagsmith/sdks/server-side#using-an-offline-handler) • +[Flags API Authentication](/integrating-with-flagsmith/flagsmith-api-overview/flags-api/authentication) + +--- + +## Trait evaluation is inconsistent + +Segment rules or identity overrides that depend on traits are not matching the way you expect. + +### Common causes + +- **Type mismatch.** Flagsmith stores trait values with a type (string, integer, float, boolean). If your SDK sends a + trait value as a string `"42"` but the segment rule expects the integer `42`, the comparison may not match. +- **Case sensitivity on trait keys.** Trait keys are case-sensitive. `email` and `Email` are two different traits. 
If + your SDK sets `Email` but the segment rule targets `email`, the rule will not match. +- **Trait not set on the identity.** If the trait has never been sent for a given identity, the segment rule will not + match. Traits are not automatically created; they only exist once your application sends them via the SDK or API. +- **Local evaluation with missing traits.** In local evaluation mode, the SDK evaluates segments using only the traits + you provide in the `get_identity_flags` call. If you omit a trait that the segment rule depends on, the rule will + not match, even if that trait exists on the server. + +### Steps to resolve + +1. In the dashboard, navigate to the identity and inspect its traits. Confirm the trait key, value, and type. +2. Compare the trait key casing in your code against the segment rule definition. +3. If using local evaluation, ensure you are passing **all** traits that your segment rules depend on. +4. Test the segment rule in the dashboard by adding the identity manually and checking whether it matches. + +:::tip Evaluation order + +If a flag value surprises you, keep the precedence in mind. Flagsmith evaluates in this order, highest priority first: + +1. **Identity overrides** (set directly on a specific identity). +2. **Segment overrides** (the first matching segment wins; segments are evaluated top-to-bottom on the feature). +3. **Environment default** (the value configured on the feature for the current environment). + +See [Flag evaluation precedence](/flagsmith-concepts/segments#flag-evaluation-precedence) for the full rules. + +::: + +**Related documentation:** [Identities & Traits](/flagsmith-concepts/identities) • +[Segments](/flagsmith-concepts/segments) + +--- + +## Local evaluation mode isn't behaving as expected + +You have enabled local evaluation but flags are not matching dashboard state, or certain features don't work. 
+ +### Common causes + +- **Using the Client-side Environment Key.** Local evaluation requires the **Server-side Environment Key**. The + Client-side key will either fail to fetch the environment document entirely or return an incomplete one. +- **Stale environment document.** The SDK refreshes the environment document on a timer (see the + [refresh interval table](#server-side-sdks-in-local-evaluation-mode) above). If you changed a flag very recently, + the SDK might not have picked it up yet. +- **Inconsistent identifier for percentage-based segments.** Local evaluation evaluates everything in-process, + including percentage splits. The SDK hashes the identifier you pass to `get_identity_flags(...)` to decide which + bucket the identity falls into. If that identifier differs from the one used elsewhere (or you make a call without + one), bucketing will diverge from what you saw via remote evaluation. + +### Steps to resolve + +1. Confirm you are using the Server-side Environment Key. +2. Check the SDK's refresh interval. If you need near-instant updates, lower the interval (but be aware this increases + network calls). +3. For percentage-based segments, verify that you are passing a consistent identity identifier. The hash-based bucketing + relies on it. +4. Enable debug logging in the SDK to see when the environment document is refreshed and what it contains. 
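To see why a consistent identifier matters for percentage splits, here is an illustrative sketch of hash-based bucketing. It is not Flagsmith's actual algorithm: the hash function, the input format, and the names `bucket_percentage` and `in_split` are assumptions chosen only to show the general technique.

```python
import hashlib


def bucket_percentage(segment_id: str, identifier: str) -> float:
    """Map a (segment, identifier) pair to a stable value in [0, 100).

    Illustrative only: the real engine may hash different inputs in a
    different way. The point is that the result depends solely on the inputs.
    """
    digest = hashlib.md5(f"{segment_id},{identifier}".encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10_000) / 100.0


def in_split(segment_id: str, identifier: str, percentage: float) -> bool:
    """True if the identity falls inside a `percentage` split of the segment."""
    return bucket_percentage(segment_id, identifier) < percentage
```

Because the bucket is a pure function of the identifier, the same identity always lands in the same bucket, while a different string for the same user (an email in one service, a numeric ID in another) is hashed independently and may land elsewhere.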
+ +**Related documentation:** +[Local Evaluation Mode](/integrating-with-flagsmith/integration-overview#local-evaluation-mode) • +[Server-Side SDKs](/integrating-with-flagsmith/sdks/server-side) diff --git a/docs/docs/guides/troubleshooting/self-hosted.mdx b/docs/docs/guides/troubleshooting/self-hosted.mdx new file mode 100644 index 000000000000..28d4c8236ce3 --- /dev/null +++ b/docs/docs/guides/troubleshooting/self-hosted.mdx @@ -0,0 +1,128 @@ +--- +title: Self-Hosted Issues +sidebar_label: Self-Hosted +sidebar_position: 4 +--- + +import Link from '@docusaurus/Link'; + +This page covers runtime symptoms specific to self-hosted Flagsmith deployments. For initial setup problems (health +checks, database connectivity, frontend DNS), see the +[deployment troubleshooting guide](/deployment-self-hosting/administration-and-maintenance/troubleshooting). + +## Task processor is not running jobs + +Tasks are queueing up but never being processed. Symptoms include: webhooks not firing, audit logs not being written, or +analytics data not appearing. + +### Common causes + +- **Task processor container is not running.** The task processor is a separate service that must be started alongside + the API. Check that a container with the `run-task-processor` command is running (`docker ps` or your orchestrator's + pod list). +- **`TASK_RUN_METHOD` not set to `TASK_PROCESSOR`.** If this environment variable is not set on the API container, + Flagsmith runs tasks in an unmanaged background thread inside the API process instead of sending them to the + processor. The processor will have nothing to pick up. +- **Database connectivity from the processor.** The task processor must be able to reach the same database as the API + (or a dedicated task processor database if you have configured one). Check `DATABASE_URL` and + `TASK_PROCESSOR_DATABASE_URL`. 
+- **Sleep interval too high.** The `TASK_PROCESSOR_SLEEP_INTERVAL_MS` environment variable controls how often each + worker thread checks for new tasks. The default is 500 ms. If this has been raised significantly, tasks will appear + to be delayed. + +### Steps to resolve + +1. Verify the task processor container is running and check its logs for errors. +2. Confirm that `TASK_RUN_METHOD=TASK_PROCESSOR` is set on the **API** container. +3. Check that `DATABASE_URL` (and `TASK_PROCESSOR_DATABASE_URL` if using a separate database) is correct and reachable + from the processor container. +4. Review the processor configuration: + +| Environment variable | Default | Description | +| ---------------------------------- | ------- | ------------------------------------------ | +| `TASK_PROCESSOR_SLEEP_INTERVAL_MS` | 500 | Milliseconds between polling for new tasks | +| `TASK_PROCESSOR_NUM_THREADS` | 5 | Worker threads per processor instance | +| `TASK_PROCESSOR_GRACE_PERIOD_MS` | 20 000 | Time before a task is considered stuck | +| `TASK_PROCESSOR_QUEUE_POP_SIZE` | 10 | Tasks retrieved per polling iteration | + +5. Check the monitoring endpoint at `GET /processor/monitoring`. It returns the number of tasks waiting in the queue. A + consistently growing number indicates the processor is not keeping up. + +**Related documentation:** +[Asynchronous Task Processor](/deployment-self-hosting/scaling-and-performance/asynchronous-task-processor) + +--- + +## Database migration failures on upgrade + +After upgrading the Flagsmith API image, the container fails to start with a migration error. + +### Common causes + +- **Skipped versions.** Flagsmith migrations are designed to be applied sequentially. If you jump from a much older + version to the latest, an intermediate migration may fail because it expects a schema state that was never reached. 
+- **Concurrent migration attempts.** If multiple API containers start simultaneously and all attempt to run + migrations, they can deadlock or conflict. Ensure only **one** container runs migrations at a time (use an init + container or a separate migration job). +- **Insufficient database permissions.** The database user must have permission to create, alter, and drop tables and + indexes. Read-only replicas will always fail migrations. + +### Steps to resolve + +1. Read the full traceback in the container logs to identify which migration failed and why. +2. If you skipped versions, consider upgrading incrementally through intermediate releases. +3. If you need to roll back, follow the + [rollback procedure](/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks). For versions + v2.151.0 and later, use: + +```bash +python manage.py rollbackmigrationsappliedafter "" +``` + +4. If concurrent containers caused a conflict, restart with a single replica, let migrations complete, then scale back + up. + +:::caution + +Rolling back migrations may result in data loss if new models or fields were added. Always take a full database backup +before attempting a rollback. + +::: + +**Related documentation:** +[Upgrades and Rollbacks](/deployment-self-hosting/administration-and-maintenance/upgrades-and-rollbacks) + +--- + +## Intermittent 502s from the API container + +The API returns `502 Bad Gateway` sporadically. The container is running and most requests succeed. + +### Common causes + +- **Worker processes crashing.** Flagsmith's API runs behind Gunicorn. If a worker runs out of memory or hits an + unhandled exception, Gunicorn kills and restarts it. Requests in flight during the restart receive a 502 from the + reverse proxy. +- **Too few workers.** The default Gunicorn worker count may not be enough for your traffic. If all workers are busy, + new connections queue at the proxy and may time out. 
+- **Request timeout mismatch.** If Gunicorn's `--timeout` is longer than your reverse proxy's upstream timeout, the + proxy will cut the connection before Gunicorn does, resulting in a 502 or 504, depending on the proxy. +- **Database connection exhaustion.** If the API and task processor share a connection pool and traffic spikes, the + database may reject new connections. This typically shows as a `502` to the client and an + `OperationalError: connection to server ...` in the API logs. + +### Steps to resolve + +1. Check the API container's logs for `[CRITICAL] WORKER TIMEOUT` messages from Gunicorn or `OperationalError` + exceptions from Django. +2. If workers are timing out, consider raising `GUNICORN_TIMEOUT` (default 30 s) or `GUNICORN_WORKERS` (default 3). See + [Flagsmith's Docker environment variables](/deployment-self-hosting/hosting-guides/docker) for the full list, or use + `GUNICORN_CMD_ARGS` to pass arbitrary Gunicorn flags. +3. Ensure your reverse proxy's upstream timeout is **equal to or greater than** Gunicorn's timeout. +4. Monitor database connection usage. If connections are exhausted, increase `CONN_MAX_AGE` or add a connection pooler + such as PgBouncer. +5. If memory is the bottleneck, raise the container's memory limit or switch Gunicorn to `--worker-class gevent` to + reduce per-worker memory usage.
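The log check in step 1 can be sketched as a small script. This is a rough illustration only, not part of Flagsmith: the function name, the category labels, and the sample lines are hypothetical, and the only strings taken from this guide are the two substrings it tells you to look for.

```python
from collections import Counter

# Substrings named in step 1 above. The category labels are illustrative.
PATTERNS = {
    "worker_timeout": "WORKER TIMEOUT",
    "db_connection": "OperationalError",
}


def classify_log_lines(lines):
    """Count how many log lines contain each 502-related substring."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Hypothetical sample input; real log lines carry timestamps and more context.
sample_logs = [
    "[CRITICAL] WORKER TIMEOUT",
    "OperationalError: connection to server ...",
    "an unrelated log line",
]

print(classify_log_lines(sample_logs))
```

If `worker_timeout` dominates, steps 2 and 3 are the likely place to start; if `db_connection` dominates, step 4 is probably more relevant.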
+ +**Related documentation:** [Caching Strategies](/deployment-self-hosting/core-configuration/caching-strategies) • +[Asynchronous Task Processor](/deployment-self-hosting/scaling-and-performance/asynchronous-task-processor) diff --git a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/admin-api/index.md b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/admin-api/index.md index 4f53ddbd3602..2b3015bf4498 100644 --- a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/admin-api/index.md +++ b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/admin-api/index.md @@ -3,16 +3,32 @@ title: Admin API sidebar_label: Admin API --- -The Admin API allows you to programmatically manage your Flagsmith projects, environments, features, segments, and users. Essentially, any action you can perform in the Flagsmith dashboard can also be accomplished via the Admin API. +The Admin API allows you to programmatically manage your Flagsmith projects, environments, features, segments, and +users. Essentially, any action you can perform in the Flagsmith dashboard can also be accomplished via the Admin API. This API is designed for automation, integrations, and building custom workflows on top of Flagsmith. ## API Explorer -You can explore the full Admin API via Swagger at [https://api.flagsmith.com/api/v1/docs/](https://api.flagsmith.com/api/v1/docs/). You can also get the OpenAPI specification in [JSON](https://api.flagsmith.com/api/v1/docs/?format=.json) or [YAML](https://api.flagsmith.com/api/v1/docs/?format=.yaml) format. +You can explore the full Admin API via Swagger at +[https://api.flagsmith.com/api/v1/docs/](https://api.flagsmith.com/api/v1/docs/). You can also get the OpenAPI +specification in [JSON](https://api.flagsmith.com/api/v1/docs/?format=.json) or +[YAML](https://api.flagsmith.com/api/v1/docs/?format=.yaml) format. 
-We also have a [Postman Collection](https://www.postman.com/flagsmith/workspace/flagsmith/overview) that you can use to experiment with the API. +We also have a [Postman Collection](https://www.postman.com/flagsmith/workspace/flagsmith/overview) that you can use to +experiment with the API. :::info -Our Admin API has a [Rate Limit](/administration-and-security/governance-and-compliance/system-limits#admin-api-rate-limit) that you should be aware of. -::: \ No newline at end of file + +Our Admin API has a +[Rate Limit](/administration-and-security/governance-and-compliance/system-limits#admin-api-rate-limit) that you should +be aware of. + +::: + +:::tip Troubleshooting + +Hitting `401`, `403`, or `429` responses from the Admin API? See the [HTTP Errors](/guides/troubleshooting/http-errors) +troubleshooting guide for common authentication and rate-limit fixes. + +::: diff --git a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/flags-api/index.md b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/flags-api/index.md index 75813b5c4302..8be081c4136d 100644 --- a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/flags-api/index.md +++ b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/flags-api/index.md @@ -3,7 +3,9 @@ title: Flags API Reference sidebar_label: Flags API --- -The Flags API is the public-facing API that your SDKs use to retrieve feature flags and remote configuration for your users. It's designed for high performance and low latency, with a globally distributed infrastructure to serve requests quickly, wherever your users are. +The Flags API is the public-facing API that your SDKs use to retrieve feature flags and remote configuration for your +users. It's designed for high performance and low latency, with a globally distributed infrastructure to serve requests +quickly, wherever your users are. This API is used for **reading** flag states and user traits, not for managing your projects. 
@@ -11,7 +13,15 @@ This API is used for **reading** flag states and user traits, not for managing y The two main endpoints you will interact with via the SDKs are: -- `/flags/`: Get all flags for a given environment. -- `/identities/`: Get all flags and traits for a specific user identity. +- `/flags/`: Get all flags for a given environment. +- `/identities/`: Get all flags and traits for a specific user identity. -For SaaS customers, the base URL for the Flags API is `https://edge.api.flagsmith.com/`. Our Edge API specification is detailed [here](/sdk-api/). \ No newline at end of file +For SaaS customers, the base URL for the Flags API is `https://edge.api.flagsmith.com/`. Our Edge API specification is +detailed [here](/sdk-api/). + +:::tip Troubleshooting + +If your SDK or HTTP client is getting unexpected responses from the Flags API, see the +[HTTP Errors](/guides/troubleshooting/http-errors) troubleshooting guide. + +::: diff --git a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/index.md b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/index.md index a1ca4ab3c6a9..31d72179f959 100644 --- a/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/index.md +++ b/docs/docs/integrating-with-flagsmith/flagsmith-api-overview/index.md @@ -4,11 +4,13 @@ sidebar_label: Overview sidebar_position: 10 --- -The Flagsmith API is divided into two distinct parts, each serving a different purpose. Understanding the difference is key to integrating with Flagsmith effectively. +The Flagsmith API is divided into two distinct parts, each serving a different purpose. Understanding the difference is +key to integrating with Flagsmith effectively. ### 1. The Flags API (Public SDK API) -This is the API that your client and server-side SDKs interact with to get flag and remote configuration values for your environments and users. It's designed to be fast, scalable, and publicly accessible. 
+This is the API that your client and server-side SDKs interact with to get flag and remote configuration values for your +environments and users. It's designed to be fast, scalable, and publicly accessible. - **Purpose:** Serving flags to your applications. - **Authentication:** Uses a public, non-secret **Environment Key**. @@ -18,10 +20,19 @@ This is the API that your client and server-side SDKs interact with to get flag ### 2. The Admin API (Private Admin API) -This is the API you use to programmatically manage your Flagsmith projects. Anything you can do in the Flagsmith dashboard, you can also do via the Admin API. +This is the API you use to programmatically manage your Flagsmith projects. Anything you can do in the Flagsmith +dashboard, you can also do via the Admin API. - **Purpose:** Creating, updating, and deleting projects, environments, flags, segments, and users. - **Authentication:** Uses a secret **Organisation API Token**. - **Security:** Requires a secret key that should never be exposed in client-side code. -[Learn more about the Admin API](/integrating-with-flagsmith/flagsmith-api-overview/admin-api). \ No newline at end of file +[Learn more about the Admin API](/integrating-with-flagsmith/flagsmith-api-overview/admin-api). + +:::tip Troubleshooting + +Seeing `401`, `403`, `429`, `502`, or `504` responses from the API? The +[HTTP Errors troubleshooting guide](/guides/troubleshooting/http-errors) walks through common causes and fixes for each +status code. 
+ +::: diff --git a/docs/docs/integrating-with-flagsmith/sdks/client-side-sdks/index.md b/docs/docs/integrating-with-flagsmith/sdks/client-side-sdks/index.md index 2f53c30b11bc..87197265b161 100644 --- a/docs/docs/integrating-with-flagsmith/sdks/client-side-sdks/index.md +++ b/docs/docs/integrating-with-flagsmith/sdks/client-side-sdks/index.md @@ -6,13 +6,15 @@ sidebar_position: 1 # Client-Side SDKs -Client-side SDKs are designed to run in browser environments, mobile applications, and other client-side contexts where you need to evaluate feature flags on the user's device. +Client-side SDKs are designed to run in browser environments, mobile applications, and other client-side contexts where +you need to evaluate feature flags on the user's device. ## Available SDKs - [JavaScript](/integrating-with-flagsmith/sdks/client-side-sdks/javascript) - For web applications - [React](/integrating-with-flagsmith/sdks/client-side-sdks/react) - For React applications -- [Next.js and SSR](/integrating-with-flagsmith/sdks/client-side-sdks/nextjs-and-ssr) - For Next.js applications with server-side rendering +- [Next.js and SSR](/integrating-with-flagsmith/sdks/client-side-sdks/nextjs-and-ssr) - For Next.js applications with + server-side rendering - [Android](/integrating-with-flagsmith/sdks/client-side-sdks/android) - For Android applications - [iOS](/integrating-with-flagsmith/sdks/client-side-sdks/ios) - For iOS applications - [Flutter](/integrating-with-flagsmith/sdks/client-side-sdks/flutter) - For Flutter applications @@ -27,3 +29,8 @@ Client-side SDKs are designed to run in browser environments, mobile application ## Getting Started Choose the SDK that matches your platform and follow the specific integration guide for your technology stack. 
+ +## Troubleshooting + +If flags aren't updating in the browser, the SDK is returning default values, or trait-based segment rules aren't +matching as expected, see the [SDK Issues](/guides/troubleshooting/sdk-issues) troubleshooting guide. diff --git a/docs/docs/integrating-with-flagsmith/sdks/server-side.mdx b/docs/docs/integrating-with-flagsmith/sdks/server-side.mdx index d3631be8319c..33897cceb1ad 100644 --- a/docs/docs/integrating-with-flagsmith/sdks/server-side.mdx +++ b/docs/docs/integrating-with-flagsmith/sdks/server-side.mdx @@ -25,6 +25,12 @@ Server Side SDKs can run in 2 different modes: Local Evaluation and Remote Evalu ::: +:::info Troubleshooting + +If the SDK is returning default values, flags look stale, or trait-based segment rules aren't matching as expected, see the [SDK Issues](/guides/troubleshooting/sdk-issues) troubleshooting guide. + +::: + ## SDK Overview diff --git a/docs/docs/support/index.mdx b/docs/docs/support/index.mdx index c5fd1e301e7b..668723d2aca8 100644 --- a/docs/docs/support/index.mdx +++ b/docs/docs/support/index.mdx @@ -15,6 +15,7 @@ assistants can help you navigate the codebase quickly - point them at the Beyond that, many issues have already been solved. A quick search can save you time: +- **[Troubleshooting Guide](/guides/troubleshooting)** - Step-by-step diagnostics for HTTP errors, SDK issues, and self-hosted problems - **[FAQ](/support/faq)** - Browse answers to common questions by category - **[GitHub Issues](https://github.com/Flagsmith/flagsmith/issues)** - Search open and closed issues for your problem - **[Release Notes](https://github.com/Flagsmith/flagsmith/releases)** - Check if your issue was fixed in a newer version