[FC-0118] docs: add ADR-0029 standardized error responses decision#38246
[FC-0118] docs: add ADR-0029 standardized error responses decision#38246taimoor-ahmed-1 wants to merge 2 commits intoopenedx:docs/ADRs-axim_api_improvementsfrom
Conversation
Add edx-platform/docs/decisions/0029-standardize-error-responses.rst describing objectives and the target contract for REST API error responses. Include implementation requirements, a standardized error payload example, and a DRF exception-handler snippet.
|
Thanks for the pull request, @taimoor-ahmed-1! This repository is currently maintained by Once you've gone through the following steps feel free to tag them in a comment and let them know that your changes are ready for engineering review. 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. DetailsWhere can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources: When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
| Implementation requirements: | ||
|
|
||
| * Use appropriate HTTP status codes (4xx/5xx). Avoid returning HTTP 200 for error conditions. | ||
| * Return a consistent payload with these core fields: | ||
|
|
||
| * ``type`` (URI identifying the problem type) | ||
| * ``title`` (short, human-readable summary) | ||
| * ``status`` (HTTP status code) | ||
| * ``detail`` (human-readable explanation specific to this occurrence) | ||
| * ``instance`` (URI identifying this specific occurrence, when available) |
There was a problem hiding this comment.
Currently there are errors in the platform where some part of the error document is taken and shown to the end user. In this case, the error message is translated, in other cases the message is not displayed to the end user. It seems like maybe it would be useful to have a field for a message that's meant to be shown to end users in MFEs and a second message that is the same consistent error text that can be used by APMs and other monitoring and logging systems.
There was a problem hiding this comment.
added the additional field, can you please review again
bradenmacdonald
left a comment
There was a problem hiding this comment.
Thanks, this is great. However, the ADR doesn't mention two of the biggest problems I see with errors in the platform today:
-
Many (most?) endpoints that are used as APIs return a standard Django HTML error response when there is a 500 error, instead of a JSON response. DRF doesn't seem to be properly catching errors.
GET /api/courses/v1/migrate_legacy_content_blocks/course-v1:foo+bar+x/is just one example but there are many. Any API endpoints that use our API classes should convert all errors to this new format. -
Sometimes the error responses have different CORS headers than the API itself, so when an error occurs, the browser rejects the response because it's missing CORS headers, and users see a misleading "CORS" error. If there are no security implications, error responses should be served with more open CORS policies.
| App-specific types may extend this catalog; they must still be absolute URIs. | ||
| * Ensure **all** API endpoints return JSON error responses — including on unhandled 500 errors. | ||
| The platform-level DRF exception handler must catch exceptions that would otherwise produce | ||
| Django's default HTML error page and return a JSON body in the standardized format instead. | ||
| Endpoints not using DRF's ``APIView`` must be identified and wrapped accordingly. |
There was a problem hiding this comment.
Thanks for adding this!
| .. code-block:: json | ||
|
|
||
| { | ||
| "type": "https://open-edx.org/errors/validation", |
There was a problem hiding this comment.
| "type": "https://open-edx.org/errors/validation", | |
| "type": "https://openedx.org/errors/validation", |
Open-edx.org is not a real domain. (This is in many places in the ADR)
Also, should we make these real URLs that have some content there, explaining the error, or are these strictly IDs?
There was a problem hiding this comment.
I wonder if this should link to an OpenAPI spec?
There was a problem hiding this comment.
Hmm so RFC 9457 specifies this as:
"The "instance" member is a JSON string containing a URI reference that identifies the specific occurrence of the problem.
When the "instance" URI is dereferenceable, the problem details object can be fetched from it."
My observation here is that while this sounds like a great idea, it would mean we have to create a storage for information of specific errors and then a URL endpoint that returns this information.
Is this really practical or desirable? What concrete problem we have would this solve?
jesperhodge
left a comment
There was a problem hiding this comment.
I love that we are doing this. Especially now I know that DRF does not catch some 5xx or unknown errors, this is really important.
I have a few general questions about the format and about ensuring that internal error information is not leaked outside: see my comments below.
And in general, I agree that we are best off with our own standardized format for errors that we define exactly the way we want. But could you explain a bit more why you see this as the preferable format?
In comparison for example with something like https://drf-standardized-errors.readthedocs.io/en/latest/quickstart.html
| * ``type`` (URI identifying the problem type) | ||
| * ``title`` (short, human-readable summary) | ||
| * ``status`` (HTTP status code) | ||
| * ``detail`` (developer-facing explanation specific to this occurrence; stable English text, safe |
There was a problem hiding this comment.
On the security front, let's be aware that DRF does not catch all exceptions, for example, it does not catch 5xx exceptions or famously IntegrityErrors. That level of detail - stack traces, error message - should also not be included in this detail, unless you're in mode DEBUG=True, in my opinion. Because that would expose too sensitive data to the user.
| "type": f"https://open-edx.org/errors/{_error_type(exc)}", | ||
| "title": _error_title(exc), | ||
| "status": response.status_code, | ||
| "detail": _flatten_detail(response.data), |
There was a problem hiding this comment.
There's a problem here because the DRF exception_handler does not catch all 5xx errors. Sending all the response data here would include error messages and stack traces.
A really important job of this function would be to catch any uncaught or unknown exceptions that are not handled by DRF, and send a standard "Internal Server Error" message.
| * ``type`` (URI identifying the problem type) | ||
| * ``title`` (short, human-readable summary) | ||
| * ``status`` (HTTP status code) | ||
| * ``detail`` (developer-facing explanation specific to this occurrence; stable English text, safe |
There was a problem hiding this comment.
How does this handle validation errors and such? Generally there's something like
"field1": {
"errors": [...]
},
"field2": {
"errors": [...]
}
| * ``status`` (HTTP status code) | ||
| * ``detail`` (developer-facing explanation specific to this occurrence; stable English text, safe | ||
| for log aggregators and APM tools) | ||
| * ``instance`` (URI identifying this specific occurrence, when available) |
There was a problem hiding this comment.
If this data okay to be exposed to the user?
| * ``title`` (short, human-readable summary) | ||
| * ``status`` (HTTP status code) | ||
| * ``detail`` (developer-facing explanation specific to this occurrence; stable English text, safe | ||
| for log aggregators and APM tools) |
There was a problem hiding this comment.
Why log aggregators?
Generally we should make sure to log errors and include error information, while this response is for REST APIs. So generally there is a central try/except that logs every error that makes it to this level; or if necessary, that would even be included in this API level error handling - not sure how it's currently done in edx-platform.
Either way I'm a bit confused by the mixing of both here?
There was a problem hiding this comment.
Also I looked at https://www.rfc-editor.org/rfc/rfc9457.html#name-detail a bit - apparently this is something that inspired this?
In general practice as well as in this RFC, "detail" is considered roughly a human-facing client description that an MFE for example could print to the user as an error message.
It's not supposed to have debugging information.
| * Carry a **short human-readable summary** plus a **specific explanation** for this request when helpful. | ||
| * Tie errors to the **request** when useful (e.g. request path or URL) for support and logging. | ||
| * Represent **validation failures** in a consistent way (e.g. field/path to messages) instead of ad-hoc nesting. | ||
| * Are **documented and enforced** in DRF (central exception handling + schema generation). |
There was a problem hiding this comment.
How about in OpenAPI specs?
Description
This PR adds an ADR decision to the edx-platform documentation to standardize how Open edX REST APIs return error payloads to clients. The ADR defines a single, predictable structured JSON error object contract (with consistent status codes and validation error formatting) and provides a reference DRF-style exception handler example.
related issue: #38167