update(link): Relax link schemas to support domain-level identifiers #292
xibz wants to merge 1 commit into cdevents:main from
This change updates all link schemas (START, END, RELATION, and embedded variants) to allow references to either a CDEvent contextId, a domainId, or both.
Previously, links could only reference event context IDs. This limited cross-system connectivity and encouraged embedding execution identifiers in customData purely for graph reconstruction.
By allowing domainId alongside contextId:
- Links can represent relationships between domain executions (e.g., pipelinerun) as well as individual events.
- Connectivity metadata no longer needs to be embedded in event payloads.
- Chain-first modeling constraints are relaxed, enabling relation-first graph modeling.
- The change remains backward compatible.
At least one of contextId or domainId is now required for link endpoints. AdditionalProperties are restricted to prevent schema drift.
This preserves existing semantics while improving flexibility and reducing customData pollution.
Signed-off-by: xibz <bjp@apple.com>
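As a rough illustration of the endpoint rule described above (at least one of contextId or domainId, no additional properties), here is a minimal Python sketch. The function name `validate_link_endpoint` and the assumption that an endpoint carries only these two keys are hypothetical; the actual schemas in this PR may allow other fields.

```python
from typing import Any, List, Mapping

# Assumption: an endpoint carries only these two keys. The real schema
# in the PR may define more; this set is illustrative.
ALLOWED_KEYS = ("contextId", "domainId")


def validate_link_endpoint(endpoint: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the endpoint passes."""
    problems: List[str] = []

    # Mirror "additionalProperties: false": reject unknown keys.
    extra = sorted(set(endpoint) - set(ALLOWED_KEYS))
    if extra:
        problems.append("unexpected properties: " + ", ".join(extra))

    # Mirror "at least one of contextId or domainId is required".
    has_identifier = any(
        isinstance(endpoint.get(key), str) and endpoint[key]
        for key in ALLOWED_KEYS
    )
    if not has_identifier:
        problems.append("at least one of contextId or domainId is required")

    return problems
```

An endpoint that only carries a contextId still validates under this sketch, which is the sense in which the change is described as backward compatible.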
|
Thanks @xibz - could you clarify the definition of |
The Core Problem
contextId requires the publisher to know the parent event's context ID. But if the parent isn't a CDEvent, there is no context ID to know.
Today: GitHub doesn't emit CDEvents. So we use domainId to link to GitHub PRs explicitly.
Solution
Introduce a domain-specific identifier which can be used to relate information. URNs will be used for domain IDs, where it follows the format of
Examples:
Example 1 (GH to CI)
Build event wants to link to GitHub PR
Publisher asks: "What is the GitHub PR's contextId?"
How domainId solves this
Example 2 (Jira to CI)
Imagine CircleCI wants to relate a Jira ticket
Result: CircleCI task can't link to Jira. Forced into customData.
Example 3: Datadog Alert Triggers Rollback Pipeline
Imagine you have a Datadog alert that monitors system health during releases. The rollback pipeline needs to link back to the alert that triggered it:
Example 4: Linking to Events Without Knowing Their Context ID
Imagine a consumer (like a dashboard or audit system) receives an event and wants to query for all related events, but doesn't know their context IDs upfront. A deployment fails. You want to find:
Without Problem: You have to parse customData and hope the IDs are there. No standardized way to query back. With domainId, you can link forward AND backward: with this:
This shows that domainId isn't just for "non-CDEvent systems"; it's also useful for querying across systems when you don't have context IDs.
Why it works
Each system uses what it knows. Each system knows its own context IDs (contextId). Each system also knows how to identify triggering systems (domainId URN). No system needs to know another system's internal IDs or context IDs.
FAQs
Because causality exists outside CDEvents.
If you don't link them, you lose that causality. If you can't link them with contextId (because they're not CDEvents), you're forced to hide it in customData. domainId lets you link anything, anywhere. That's why it matters.
Your engineers will. They'll put it in customData. Because causality is real whether CDEvents acknowledges it or not.
No. domainId is a stopgap until systems emit CDEvents natively. |
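The backward lookup described in Example 4 could be sketched roughly as follows. This is only an illustration: the link shape (a dict with `source` and `target` endpoints) and the function name `index_links_by_domain_id` are hypothetical and are not taken from the schemas in this PR.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def index_links_by_domain_id(
    links: Iterable[Mapping[str, Any]],
) -> Dict[str, List[Mapping[str, Any]]]:
    """Group links by every domainId that appears on either endpoint.

    Assumption: each link is a mapping with optional "source" and
    "target" endpoints, and each endpoint may carry a "domainId".
    """
    index: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for link in links:
        seen = set()  # avoid listing a link twice under the same domainId
        for side in ("source", "target"):
            endpoint = link.get(side) or {}
            domain_id = endpoint.get("domainId")
            if domain_id and domain_id not in seen:
                seen.add(domain_id)
                index[domain_id].append(link)
    return dict(index)
```

With an index like this, a consumer that only knows the URN of an external entity can find every link that mentions it, without parsing customData or knowing any contextId in advance.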
|
Thanks, I now understand your proposal better, I guess.
Side questions: Is "links" just for "trigger", "causeby", or can it be used to define other types of relation? (e.g. for a test, to define what the system under test is (a source, a change, an artifact), in which context (CI, environment), triggered by what (a scheduler, a change, a deployment, ...)) |
The proposal does not require links to always be present. It simply provides a standardized way to express causality when it is known. If a producer does not know the trigger, it emits no link. The key difference is: Today: With domainId: This proposal does not require perfect causality capture. It enables correct modeling when information exists.
Correlation is exactly one of the motivations. The intention is that domainId represents the canonical identity of an entity within its domain. subject.id is too flexible, hence the strict URN format. If a system later emits a native CDEvent for that entity, the subject.id of that event should correspond to the same logical identifier represented in the domainId. This allows dashboards, SIEM systems, and audit systems to correlate across both: domainId is not meant to replace subject.id, but to provide a stable cross-domain reference when contextId is unavailable.
Links are not limited to trigger/cause relationships. They are intended to model typed relationships between entities. Examples include: The goal is not only causality modeling, but explicit relationship modeling. This allows us to describe: |
|
About the urn, after some search, the pre-accepted proposal about converting the subject.id to a global id (#252), and the fact that the purpose of Provider is always An alternative can be to not use |
How does that sound @davidB? |
|
The schema of the Link is complex (IMO) (over?):
Maybe, it's also the opportunity to review and simplify it |
|
@davidB I think the complexity here is coming from the model trying to represent two genuinely different situations, rather than complexity for its own sake.
The embedded vs non-embedded split exists for a specific recovery case: sometimes the event graph becomes disconnected, and we need a way to reconnect it after the fact. For example, imagine system A is now disconnected from B, and B and C are also disconnected from each other. In that situation, we may want to go back and repair those connections manually so the graph reflects the real flow again. Non-embedded links exist for that purpose. They let us express a connection even when we are not embedding or directly referencing a concrete CDEvent in the normal path structure.
That is also why the distinction between END, PATH, and RELATION is intentional. END and PATH are meant to describe structural, navigable links in the event graph. RELATION is different: it is meant to describe a semantic relationship outside of that strict path model. I do not want to overload RELATION to cover everything, because then we lose an important distinction in the graph semantics.
In particular, when a link uses domainId, the implication is that we do not know the concrete CDEvent on the other side. At that point, it is not really the same thing as a direct CDEvent-to-CDEvent path. Preserving that distinction is important if we want the graph to carry meaning, rather than just storing generic connections.
So from my perspective, the schema is trying to model two different truths:
- we know the exact event connection
- we only know the semantic or domain-level connection
Those cases look similar at first glance, but collapsing them into one shape or one relation type would blur semantics that are important for reconstructing and interpreting the graph correctly.
That said, I do think it is fair to ask whether the current shape is the simplest way to express that distinction.
If there is a cleaner way to preserve those semantics without losing the separation, I would be very open to reviewing it. |
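To make the "two truths" concrete, a consumer could classify an endpoint roughly like this. It is a sketch under assumptions: the function name `endpoint_kind` and the labels `"event"` and `"domain"` are hypothetical, and treating an endpoint with both identifiers as an exact event reference is one possible reading, not something the PR specifies.

```python
from typing import Any, Mapping


def endpoint_kind(endpoint: Mapping[str, Any]) -> str:
    """Classify a link endpoint by how precisely it identifies its target.

    "event":  a concrete CDEvent is known (contextId present).
    "domain": only a domain-level identifier is known (domainId only).
    """
    if endpoint.get("contextId"):
        # Assumption: a known contextId wins even if a domainId is also set.
        return "event"
    if endpoint.get("domainId"):
        return "domain"
    raise ValueError("endpoint has neither contextId nor domainId")
```

Keeping this distinction explicit lets a graph builder treat exact event references as navigable path edges while handling domain-only references as semantic relations.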
You forgot the subjectType: The idea is to treat entities that do not come from a CDEvent the same way as those that do. If we reuse your 3 samples (GitHub, Jira, Datadog), they match existing subjectTypes (whether or not they were generated by a CDEvent). GitHub PR: urn:github:xibz:repo:pr:42 |
For known CDEvents subject types, I agree. But if the subject is outside CDEvents, modeling it as urn:cdevents:* is semantically confusing, because it implies a CDEvents classification rather than just providing an identifier. |
|
If the subject is outside CDEvents, we can encourage producers to create a "custom" one using a dotted notation, like for custom events. My issue is that without EDIT: without subjectType, we don't need a URN; we can just use the URI/ID and let the consumer handle it (to guess what it is). |
|
Using a common URN format helps systems parse identifiers consistently. But if the subject is external to CDEvents, the producer still has to decide how to classify it. That means two producers may assign different CDEvents subject types to the same thing. In that case, the format is standardized, but the meaning is not. |
That is a worst-case scenario. This solution is not meant to be foolproof when producers are unable to provide all of the relevant information. That is part of the purpose of domainId: to help create connections where they may not otherwise be obvious. If subjectType is known, producers can provide it as additional context for known CDEvent types, and that is a constraint we could add. For non-CDEvent types, though, I am not convinced it provides the same value. I would also argue that, for known CDEvent subject types, the domainId should already be sufficient to derive the subject ID in a consistent way. I am less convinced that this works well for custom types. And more broadly, your proposal ends up mirroring subjectId almost identically. I am not sure we gain enough by making the representation that close. |
|
If subjectType is unknown, we could be explicit about it. |