Skip to content

feat(registry): configurable driver registries via TOML config#360

Open
zeroshade wants to merge 28 commits intomainfrom
feat/configurable-registries
Open

feat(registry): configurable driver registries via TOML config#360
zeroshade wants to merge 28 commits intomainfrom
feat/configurable-registries

Conversation

@zeroshade
Copy link
Copy Markdown
Member

Summary

Adds support for configuring driver registries via TOML config files, allowing users to add custom registries at both the global (~/.config/columnar/dbc/config.toml) and project (dbc.toml) level.

What's New

  • Global registry config (~/.config/columnar/dbc/config.toml): loaded on every command via ConfigureRegistries() called in main()
  • Project registry config (dbc.toml): [[registries]] section respected by sync and add commands
  • Priority order: project → global → built-in defaults (first-match-wins, URL-deduplicated)
  • replace_defaults = true: opt-in flag to suppress built-in registries entirely
  • DBC_BASE_URL env var: still overrides everything (unchanged behavior)
  • Backward compatible: existing dbc.toml files without [[registries]] work unchanged

Config Format

Global (~/.config/columnar/dbc/config.toml):

[[registries]]
url = "https://my-registry.example.com"
name = "my-registry"

# replace_defaults = true  # omit built-in registries

Project (dbc.toml):

[[registries]]
url = "https://project-registry.example.com"
name = "project"

[drivers]
[drivers.my-driver]
version = ">=1.0.0"

Implementation Notes

  • URL validation requires http/https scheme and non-empty host
  • defaultRegistries is snapshotted eagerly in init() (after drivers.go init runs); env-sensitivity documented in comments
  • add.go decodes dbc.toml once (single file open) for both registry config and driver list
  • remove intentionally omits registry wiring — it never queries a registry

@zeroshade zeroshade requested a review from amoeba April 17, 2026 16:17
@zeroshade zeroshade force-pushed the feat/configurable-registries branch from 5a8bf36 to 5ac1ddd Compare April 17, 2026 16:19
@zeroshade zeroshade force-pushed the feat/configurable-registries branch 2 times, most recently from 0813c82 to 79e6f5c Compare April 22, 2026 21:33
@zeroshade zeroshade force-pushed the feat/configurable-registries branch from 79e6f5c to 40e5ad2 Compare April 22, 2026 21:39
@zeroshade zeroshade requested a review from kentkwu April 26, 2026 18:21
Copy link
Copy Markdown
Member

@amoeba amoeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick code review over the new pieces and left some comments. Even if we don't have an immediate use case for this, I think having it in dbc is useful.

Some other more general thoughts:

  1. Would we want CLI integration for this? I think it would be useful for dbc to be able to do all the work to show/add/remove registries from the project and user config.
  2. I didn't see any tests for how dbc install will work with this. Presumably dbc install will respect the new fields in the user config.
  3. Do we need to define a default priority order for registries to handle name conflicts? We have a lot of the same concerns as conda does with channels. I'm wishing for more tests in this PR for the interaction between the default, user, and project registries around name conflicts.

Comment thread registry_config.go Outdated

func SetProjectRegistries(entries []RegistryEntry, replaceDefaults *bool) error {
if os.Getenv("DBC_BASE_URL") != "" {
return nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should warn the user here.

Comment thread registry_config.go Outdated
return nil
}
if replaceDefaults != nil && *replaceDefaults && len(entries) == 0 {
return fmt.Errorf("replace_defaults = true requires at least one [[registries]] entry; omit replace_defaults or add a registry entry")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this error would be better surfaced earlier, such as when we initially parse config.toml or dbc.toml. How this is now, I think this would error like this,

error configuring project registries: replace_defaults = true requires at least one [[registries]] entry; omit replace_defaults or add a registry entry

As a user, it's unclear what I should do to fix it. Which file is my problem in?

If we validated this earlier, I bet we'd have enough context to know which file has the issue and we could tell them.

Comment thread registry_config.go Outdated
}
for _, e := range entries {
if e.URL == "" {
return fmt.Errorf("registry entry has empty url")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these checks needed or is this the best place to handle this? I see loadGlobalConfig already does validation and that seems like a better place. i.e., I much prefer being able to know that any []RegistryEntry I might get is always valid so I'd rather error before creating a RegistryEntry.

Comment thread cmd/dbc/main.go Outdated
})
}

func boolPtr(b bool) *bool { return &b }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the best place for this def?

Comment thread cmd/dbc/main.go Outdated
}

if configDir, err := internal.GetUserConfigPath(); err != nil {
fmt.Fprintf(os.Stderr, "warning: failed to locate config directory: %v\n", err)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to emit a warning? Won't this warn if the folder isn't found? Seems like we should be okay with that.

The warning when the registry config fails to load once found is fine.

Comment thread .gitignore Outdated
Comment on lines +52 to +54

# Sisyphus workflow artifacts (plans, evidence, notepads)
.sisyphus/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just a tool you use? can we revert this?

Comment thread cmd/dbc/main.go Outdated
if configDir, err := internal.GetUserConfigPath(); err != nil {
fmt.Fprintf(os.Stderr, "warning: failed to locate config directory: %v\n", err)
} else if err := dbc.ConfigureRegistries(configDir); err != nil {
fmt.Fprintf(os.Stderr, "warning: failed to load registry config: %v\n", err)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the error here includes the file with the issue. Whenever we show an error about a TOML file, we need to show the user the path to it.

Comment thread registry_config.go Outdated
Comment on lines +121 to +122
if u.Scheme != "http" && u.Scheme != "https" {
return Registry{}, false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the validation in here even needed? In the parent call, we already error when a RegistryEntry is invalid.

zeroshade added 3 commits May 1, 2026 13:53
Introduce RegistryEntry, GlobalConfig, LoadGlobalConfig, WithGlobalConfig,
and WithProjectRegistries. NewClient now merges project, global, and default
registries (project wins, then global, then defaults) with deduplication by
scheme+host+path. WithBaseURL still overrides all other registry sources.

Validation (empty URL, missing host, non-http scheme, replace_defaults=true
with no entries) runs at NewClient time so malformed configuration surfaces
immediately rather than at first request.
Extend DriversList with omitempty Registries and ReplaceDefaults fields so
projects can pin or override driver registries inline in dbc.toml. Existing
dbc.toml files without those keys continue to decode unchanged.

applyProjectRegistries rebuilds the process-wide dbc client with the merged
registry list; callers (add, sync) invoke it after decoding the list.
Load ~/.config/dbc/config.toml once at startup and thread it into the
default client via WithGlobalConfig. Project commands (add, sync) rebuild
the client with WithProjectRegistries after parsing dbc.toml so driver
lookups use the merged registry list. DBC_BASE_URL continues to short
circuit all registry merging.

Tests cover project-registry wiring, invalid project URLs, and
backwards-compatibility with dbc.toml files that omit the new fields.
@zeroshade zeroshade force-pushed the feat/configurable-registries branch from a41a2ad to 5f8660a Compare May 1, 2026 17:54
zeroshade added 10 commits May 1, 2026 14:06
Two issues from roborev review:

- DriversList.ReplaceDefaults is now *bool so a project can explicitly set
  replace_defaults = false to override a global config's replace_defaults
  = true (previously collapsed into unset vs false).
- NewClient now validates GlobalConfig.Registries URLs with the same path
  used for project registries, and rejects any configuration that produces
  an empty registry list (e.g. global replace_defaults = true with zero
  entries), so malformed config surfaces at NewClient time instead of at
  first request with silent zero-registry lookups.
- LoadGlobalConfig applies the same no-entries check when decoding
  config.toml so CLI users see the error at startup.
…ries

The existing TestAdd/TestSync project-registry tests route through
testBaseModel, whose getDriverRegistry stub bypasses dbcClient entirely.
Add direct coverage that swaps in a lookup routed to the real dbcClient
against an httptest server, proving applyProjectRegistries genuinely
rewires driver resolution to the merged registry list and that:

- a project [[registries]] entry routes Search() to that server;
- an explicit project replace_defaults=false restores built-in defaults
  even when the global config declared replace_defaults=true;
- DBC_BASE_URL continues to short-circuit project registry overrides.
LoadGlobalConfig rejecting global replace_defaults=true with zero entries
made the CLI diverge from NewClient: the CLI dropped the global config
with only a warning, silently re-enabling the built-in defaults even when
the project dbc.toml supplied the registries the user actually wanted.

Remove that premature check; the post-merge empty-list guard in NewClient
already catches the genuinely broken config (global replace_defaults=true
with no entries AND no project entries), while letting the valid
combination through. Adds both a library test and a CLI wiring test for
the project-supplies-entries case.
Eager initDBCClient() in main() failed startup for the valid
'global replace_defaults=true + project supplies registries' case —
NewClient was called before the project dbc.toml could merge in.
Remove the eager call and rely on lazy init via getDriverRegistry /
initDBCClient, so project commands get a chance to rebuild the client
via applyProjectRegistries first.

Non-project commands (search, info, install) still fail fast on the
broken case (global replace_defaults=true with no entries and no
project override) because NewClient rejects the empty merge — now
surfaced at first use rather than startup.

auth.go's requestDeviceCode now ensures init before touching
dbcClient.HTTPClient(). dbcClientOnce is a pointer so tests can simulate
fresh startup.

Adds TestAddCmdEndToEndThroughRealClient which runs AddCmd.GetModel()
against an httptest registry — proves the production getDriverRegistry
closure routes through the rebuilt dbcClient. Adds TestStartupDeferred*
tests for the main() ordering invariants.

Also replaces 'switch { case err == ... }' with 'switch err' in main().
…in GetDriverList

Two findings from roborev review of the branch:

MEDIUM: OAuth device-code login ran through initDBCClient just to get an
HTTPClient. A misconfigured global registry (e.g. replace_defaults=true
with no entries) broke 'dbc auth login' — the very recovery path the user
would run to fix the config. Add authHTTPClient() that builds a bare HTTP
client without touching registry validation, and route device-code calls
through it.

LOW: GetDriverList decoded the new [[registries]] / replace_defaults
fields but never called applyProjectRegistries, so library consumers
using that helper silently ignored project registry overrides. Call it
immediately after decoding.

Regression tests pin both behaviors:
- TestAuthHTTPClientDoesNotRequireRegistryConfig
- TestGetDriverListHonorsProjectRegistries
Three findings from review of the branch:

MEDIUM (1632): GetDriverList mutated the process-wide dbcClient via
applyProjectRegistries, which leaked registry state across calls — a
later invocation on a dbc.toml without [[registries]] would silently
reuse the previous call's client. Switch GetDriverList to build a
per-call dbc.Client scoped to the decoded file, so each call sees a
fresh registry merge. Test exercises two sequential calls against two
different servers to prove no leakage.

MEDIUM (1630): NewClient's pre-merge guard rejected project
replace_defaults=true with no project entries, even when the global
config supplied registries. Remove the pre-merge guard; the post-merge
empty-list check is the correct gate and lets 'drop defaults, keep
globals' through. Test updated and new test pins the
global-entries+project-replace_defaults case.

LOW (1629): startup tests didn't exercise the real main() loading path.
Extract loadStartupRegistryConfig() and add
TestStartupEndToEndGlobalReplaceDefaultsWithProjectEntries that writes
a real config.toml + dbc.toml, runs the full startup sequence, and
drives AddCmd.GetModel() against an httptest server.
Addresses review 1633:

MEDIUM: extract parseStartupArgs() and have main() use it, so the same
config-load + argv-parse helper runs under test. The startup regression
test now asserts at each step that dbcClient remains nil — if anyone
reintroduces eager initDBCClient() between config load and subcommand
dispatch the test fails at whichever step broke.

LOW: update WithProjectRegistries doc comment to match the actual
semantics: an error only when the MERGED registry list is empty. The
old text claimed replace_defaults=true with no entries is always
rejected, which contradicts the post-merge-only validation.
Addresses review 1634: the previous parseStartupArgs extraction only
covered the parser bit. Anyone reintroducing an eager initDBCClient()
between p.Parse() and GetModel() would still not have failed the test.

Extract a single runStartup() that covers the full path main() runs:
config load, argv parse, subcommand selection, and GetModel() for model
commands. main() now calls runStartup() and switches on the returned
startupKind for its exit cases. Tests drive runStartup() end-to-end,
asserting dbcClient remains nil throughout.

No functional change to main()'s behavior — same exit codes, same help
and version paths, same subcommand dispatch.
Addresses review 1635: after the runStartup refactor, main() always
called runStartup(configDir, ...) even when internal.GetUserConfigPath
failed, leaving configDir = "". runStartup then forwarded that to
LoadGlobalConfig(""), which reads ./config.toml from the current
working directory. That changed registry resolution based on cwd — a
regression against the pre-refactor behavior that simply skipped the
load on this error path.

Guard runStartup to skip the load when configDir is empty. Document the
sentinel in the helper's doc comment. Add TestRunStartupSkipsLoadWhenConfigDirEmpty
that plants a hostile ./config.toml and asserts runStartup leaves
globalRegistryConfig nil.
Addresses review 1636: runStartup("", ...) skipped the load but did not
clear an already-primed globalRegistryConfig, so a second in-process
call would silently reuse the previous invocation's global registries.
Set globalRegistryConfig = nil explicitly in the empty-configDir branch
and add TestRunStartupClearsStaleGlobalConfigWhenConfigDirEmpty, which
primes the global before calling runStartup("") and asserts the
state is cleared.
zeroshade added 15 commits May 1, 2026 15:29
Addresses review 1637: clearing only globalRegistryConfig was
insufficient because dbcClient is cached behind dbcClientOnce. A
repeat in-process runStartup would hit the already-run sync.Once and
reuse the prior invocation's client.

Introduce resetClientState() that clears dbcClient, dbcClientErr,
dbcClientOnce, and globalRegistryConfig, and call it at the top of
runStartup so every invocation starts from a clean slate regardless
of configDir. Replaces the narrower empty-configDir-only reset.

Strengthen the regression test to prime all three state paths
(globalRegistryConfig + dbcClient + a 'used' sync.Once) before
calling runStartup(""), and assert a subsequent initDBCClient()
produces a client whose registries do NOT include the stale URL.
Addresses review 1639: TestRunStartupClearsStaleClientState primed
globalRegistryConfig, dbcClient, and dbcClientOnce but left dbcClientErr
unseeded, so a regression that stopped resetting dbcClientErr in
resetClientState() would still pass. Prime a sentinel error and assert
runStartup clears it and a fresh initDBCClient() does not surface the
sentinel.
Two MEDIUM findings from the full-branch review (job 1641):

1. NewClient regression: passing WithRegistries() alongside
   WithGlobalConfig/WithProjectRegistries caused the explicit list to
   be treated as the 'defaults' slice in the merge, prepending extra
   entries and generally not honoring the documented 'sets the driver
   registries to use' contract. Track explicitRegistries separately
   and skip the config-file merge when WithRegistries was passed.
   Added TestNewClientWithRegistryOptions/WithRegistries_takes_precedence...

2. cmd/dbc add lock ordering: the project lock wrapped the registry
   lookup, so a slow registry fetch could hold .dbc.project.lock for
   up to the lookup timeout. Concurrent dbc commands would hit the
   10s lock timeout and fail with 'another dbc operation is in
   progress' even though no file mutation was happening. Restructure
   the add flow: decode + applyProjectRegistries + getDriverRegistry
   WITHOUT the lock, then acquire the lock only around the
   read-modify-write phase, re-reading the file under the lock so
   concurrent additions aren't clobbered. Added
   TestAddDoesNotHoldLockDuringRegistryLookup.
Addresses review 1645: the post-lock merge wrote the entire stale
pre-lock m.list back onto the freshly re-read on-disk state. That
could clobber a concurrent dbc add/remove that changed a different
driver, and unconditionally overwrote any concurrent edits to
[[registries]] / replace_defaults.

Narrow the merge to only the driver entries this invocation actually
added/replaced (iterating 'specs'), and leave current.Registries /
current.ReplaceDefaults untouched so concurrent edits to those fields
survive. Added TestAddPreservesConcurrentEdits which simulates a
concurrent editor modifying the file mid-registry-lookup (new driver
+ [[registries]] addition) and asserts both the new driver and the
concurrent changes are present after the add completes.
…faults

Addresses review 1646:

MEDIUM: a concurrent editor changing [[registries]] or replace_defaults
between the unlocked driver lookup and the locked write-back phase
could cause dbc add to write drivers that were validated against a
stale registry set — later sync/install would fail against the new
config. Under the lock, compare the re-read list's registry fields to
the pre-lock snapshot via a new registriesChanged() helper. On drift,
return a clear 'please retry' error rather than write inconsistent state.

LOW: TestAddPreservesConcurrentEdits claimed to cover replace_defaults
but the simulated concurrent content didn't set it. Split into two
tests: one for concurrent driver-only edits (preserved) and one for
registry-config drift (aborts). The drift test verifies the written
file still contains the concurrent replace_defaults=true, the
concurrent registry, and that the aborted add did NOT write its
driver entry.
Addresses review 1648: the previous pattern — read-without-lock,
lookup, reacquire-for-write — left a new torn-read race. A concurrent
writer using os.Create (the writer path for dbc add/remove) could
expose partial file contents to the unlocked initial decode, turning
a valid concurrent operation into a spurious TOML parse error or a
snapshot of torn state.

Take the project lock briefly for the initial read, release it before
the network lookup (so slow registry fetches still don't hold the
lock), and reacquire for the final merge/write. The lock is never
held across I/O that blocks on external services.

TestAddDoesNotHoldLockDuringRegistryLookup still passes — the read
lock is released before the stalled getDriverRegistry call begins.
Addresses review 1649 (LOW): the torn-read fix was uncovered by any
regression test. Add TestAddInitialReadBlocksWhileLockHeld which holds
the project lock externally and instruments getDriverRegistry to signal
when add progressed past the initial read. With the read lock in place,
add blocks before reaching the registry lookup and the test waits out a
300ms window before releasing the lock. Without the read lock fix, add
reads the file and drives into getDriverRegistry while the lock is held,
which the test detects via the channel signal and fails.

Verified the test fails under a reverted fix (python-edited add.go into
the pre-fix state) and passes on the current code.
Addresses review 1652 (LOW): the prior
TestAddInitialReadBlocksWhileLockHeld test used a 300ms timeout on a
registry-hook channel as the failure signal, which could spuriously
pass on slow CI runners and didn't exercise the torn-read failure mode
directly. Add TestAddInitialReadIsAtomicAgainstTornState: write
invalidTOML while holding the project lock, start dbc add, assert add
does NOT surface a decode error (it's blocked on the lock), then
restore validTOML and release. If the initial read skips the lock,
add reads the invalid TOML and the channel yields a TOML decode
error, which the test detects.

Verified: under a reverted fix (initial read without the lock) this
test fails with 'toml: expected character ='. Under the current code
the test passes — add blocks on the lock until valid content is
restored.
Addresses review 1654 (LOW): registriesChanged compared whole
RegistryEntry structs, so a concurrent edit that only renamed a
registry would false-positive as 'registry configuration changed'
and abort dbc add even though the URL resolution is identical.
Equivalent URL spellings (trailing slash, host casing) would also
false-positive.

Normalize URLs (lowercase scheme/host, strip trailing path slash) and
ignore Name when comparing. Documented the intent in the helper's
doc comment and added TestRegistriesChanged covering name-only
changes, trailing-slash variation, case differences, and tri-state
replace_defaults.

The 1654 MEDIUM on 'project commands rely on process-global state' is
intentionally left as-is: cmd/dbc is a single-command-per-process
binary, resetClientState() at the top of runStartup clears all cached
state at the start of every run, and in-process reuse is a test-only
concern that the test harness handles. Adding Once/Err resets to
setDBCClient would be defensive code for an unreachable path.
…tion

Addresses review 1657: normalizeRegistryURL dropped query, userinfo,
fragment, and full path — so two URLs like
'https://r.example.com/?tenant=a' and 'https://r.example.com/?tenant=b'
compared equal. That could hide a real registry config change from
the dbc add drift check, letting it write a driver resolved against
different registry settings than the final dbc.toml.

Rebuild the URL via url.URL.String() after lowercasing the scheme and
host and trimming a trailing path slash — only the truly no-op
differences. Query, userinfo, fragment, and interior path segments
are preserved. Added regression tests for tenant-selector query,
userinfo, and path-segment changes all comparing unequal.
Addresses review 1658 (LOW): fragments aren't sent on HTTP requests,
so two registry URLs that only differ in fragment resolve to the same
endpoint. The prior normalizeRegistryURL preserved fragments, which
meant a cosmetic '#a' -> '#b' edit triggered the config-drift abort
in dbc add. Clear Fragment and RawFragment before serializing; added
a regression test pinning that fragment-only changes compare equal.
…ntics

Addresses review 1660: mergeRegistries deduplicated entries using a
key of scheme://host + trimmed path, which collapsed semantically
distinct registries — tenant-bearing query strings, credential-bearing
userinfo, and path-mounted tenants were all treated as duplicates. The
merged client silently dropped configured registries and resolved
drivers against the wrong tenant/identity.

Update the dedupe key to the same normalization used by
cmd/dbc's registriesChanged drift check: lowercase scheme/host,
trim only a trailing path slash, drop fragment — preserve query,
userinfo, and path segments. Added regression tests in
TestMergeRegistries covering tenant query, userinfo, path-mounted
tenants, casing/trailing-slash collapse, and fragment collapse.
Three separate suites (TestLoadGlobalConfig, TestMergeRegistries,
TestNewClientWithRegistryOptions) were each a long chain of t.Run
closures with repeated setup. Convert each to a single table-driven
test with a small struct of inputs + expected outputs and one driver
loop. Adds an urls() helper so mergeRegistries cases can assert the
full ordered URL list in one expression.

Same coverage, much shorter, easier to scan and extend.
…able

Addresses review 1662 (LOW): the table refactor only checked
len(cfg.Registries) + ReplaceDefaults, dropping the pre-refactor
assertions on the parsed URL and Name fields. A regression in TOML
decoding, ordering, or field mapping would now pass unnoticed.

Replace the wantRegs int with a wantEntries []RegistryEntry field
containing the fully-specified expected contents, and compare via
assert.Equal so ordering, URL, and Name mismatches all fail the test.
Addresses review 1664 (LOW): registriesChanged compared raw project
fields, so semantically-equivalent edits produced false-positive drift
aborts — e.g. flipping nil → &false when defaults were already
inherited, or adding an exact duplicate registry entry that
mergeRegistries would dedupe anyway.

Rewrite the helper to run both DriversList values through the same
newDBCClient merge path and compare the resulting normalized URL
sets. That matches what the client actually uses for driver
resolution, so mergeRegistries no-ops and tri-state flips that don't
actually change the active registry set no longer abort dbc add.

Added three new subtests:
- nil vs &false replace_defaults (no global override): equal
- exact duplicate entries: equal
- name-only duplicate rename: equal

The 1664 MEDIUM on concurrent global config.toml edits is
intentionally left out of scope. The global-config-during-add race
is extremely unlikely in practice (user editing
~/.config/dbc/config.toml while a separate shell runs dbc add), and
the recovery is a simple retry — the already-written driver entry
is valid against the dbc.toml the user wrote, just not against the
fresh global config, which 'dbc sync' or a fresh add would
correctly report.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants