
feat(appkit): default analytics format to ARROW_STREAM (BREAKING) #387

Open
jamesbroadhead wants to merge 1 commit into stack/arrow-3-inline-arrow-fix from feat/default-arrow-stream


@jamesbroadhead
Contributor

Summary

Switches the default analytics format from JSON_ARRAY to ARROW_STREAM across the hook, the chart-data hook, the SQL warehouse connector defaults, and the analytics plugin request handler.

This is a breaking change for code that consumes useAnalyticsQuery("foo") as a row array (data.length, data[0].col, data.map(...)). Two migration paths are documented below.

Depends on #329

Merge #329 first. This PR is stacked on its branch and relies on the bidirectional disposition fallback added there — without that, the ARROW default doesn't work on warehouses that refuse ARROW_STREAM + INLINE.

Why ARROW_STREAM is a better default

Type fidelity. ARROW preserves column types end-to-end — INT stays number, BIGINT stays bigint, TIMESTAMP stays Date. The warehouse's JSON_ARRAY serialization stringifies everything ("1", "2.5", ISO strings); callers have to coerce on their side and any nuance is lost.
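
To make the type-fidelity point concrete, here is a minimal sketch of the coercion a JSON_ARRAY consumer ends up writing by hand. The column names, the `RawRow` / `TypedRow` shapes, and `coerceRow` are hypothetical and not part of appkit; the only thing taken from the paragraph above is that JSON_ARRAY delivers every value as a string.

```typescript
// Hypothetical row as it might arrive under JSON_ARRAY: every value is a string.
type RawRow = { order_id: string; total: string; created_at: string };

// The typed shape the caller actually wants to work with.
type TypedRow = { order_id: bigint; total: number; created_at: Date };

// Manual coercion the caller must write and keep in sync with the query schema.
function coerceRow(raw: RawRow): TypedRow {
  return {
    // BIGINT must go through BigInt(): Number() loses precision above 2^53.
    order_id: BigInt(raw.order_id),
    total: Number(raw.total),
    created_at: new Date(raw.created_at),
  };
}

const raw: RawRow = {
  order_id: "9007199254740993", // 2^53 + 1, not representable as a JS number
  total: "2.5",
  created_at: "2026-05-15T16:27:00.000Z",
};

const typed = coerceRow(raw);
```

With the ARROW_STREAM default this per-query coercion layer should be unnecessary, since the values are described above as arriving already typed.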

Wire compactness. Arrow IPC is ~3–5× smaller than JSON for typical numeric data, and the binary parse is materially faster than JSON.parse for large tables. Charts/dashboards over real result sets are the primary useAnalyticsQuery use case, and they benefit measurably.

Warehouse alignment. The warehouses that only support one INLINE format (the case #329 was originally written for) want ARROW_STREAM + INLINE. With the ARROW default they get a clean single statement call. With the JSON_ARRAY default they trigger #329's server-side retry-and-decode path — correct but wasteful.

Caveat — round trips on classic warehouses. Classic warehouses (and some serverless) reject ARROW_STREAM + INLINE and require ARROW_STREAM + EXTERNAL_LINKS. Under the ARROW default that's two statement calls (initial INLINE rejected, then EXTERNAL_LINKS) plus the /arrow-result fetch — vs one call with the JSON_ARRAY default. The type-fidelity + wire-compactness win dominates for the typical analytics workload, but if your app is a small-result-set lookup against a classic warehouse, you may want to opt into JSON_ARRAY explicitly.
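
The round-trip cost in this caveat can be pictured with a small sketch that only counts statement calls. `ExecuteFn`, `executeWithFallback`, and the fake `classicWarehouse` are hypothetical stand-ins: the real fallback lives in #329, is described as bidirectional, and is not reproduced here. This sketch models one direction (INLINE first, then EXTERNAL_LINKS) and, as a simplification, falls back on any error rather than on a specific rejection.

```typescript
type ResultDisposition = "INLINE" | "EXTERNAL_LINKS";
type ResultFormat = "ARROW_STREAM" | "JSON_ARRAY";

// Hypothetical executor: resolves on success, rejects if the warehouse
// refuses the format + disposition combination.
type ExecuteFn = (format: ResultFormat, disposition: ResultDisposition) => Promise<string>;

// Try INLINE first, then fall back to EXTERNAL_LINKS if the first call is rejected.
async function executeWithFallback(
  execute: ExecuteFn,
  format: ResultFormat,
): Promise<{ result: string; calls: number }> {
  let calls = 0;
  for (const disposition of ["INLINE", "EXTERNAL_LINKS"] as const) {
    calls += 1;
    try {
      return { result: await execute(format, disposition), calls };
    } catch (err) {
      // Nothing left to try after EXTERNAL_LINKS, so surface the error.
      if (disposition === "EXTERNAL_LINKS") throw err;
    }
  }
  throw new Error("unreachable");
}

// Fake "classic" warehouse that rejects ARROW_STREAM + INLINE, as in the caveat.
const classicWarehouse: ExecuteFn = async (format, disposition) => {
  if (format === "ARROW_STREAM" && disposition === "INLINE") {
    throw new Error("ARROW_STREAM + INLINE not supported");
  }
  return `${format}/${disposition}`;
};
```

Against this fake, `executeWithFallback(classicWarehouse, "ARROW_STREAM")` makes two statement calls while the JSON_ARRAY variant makes one, which is the trade-off a small-result-set lookup may want to avoid by pinning the format.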

Migration

For code that walks rows as an array (data.length, data[i].col, data.map):

  • Option A — opt back into JSON_ARRAY (smallest diff):
    const { data } = useAnalyticsQuery("q", params, { format: "JSON_ARRAY" });
  • Option B — use the Arrow API (gets type fidelity + smaller payloads):
    data?.numRows                  // instead of data.length
    data?.getChild("col")?.get(0)  // instead of data[0].col
    data?.toArray()                // if you need a plain row array
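
For call sites that cannot be rewritten all at once, Option B can be introduced behind a small helper. The sketch below is written against a structural `ArrowLikeTable` interface containing only the three members named above (`numRows`, `getChild`, `toArray`). The interface, `firstValue`, and the in-memory `fakeTable` are hypothetical scaffolding for illustration; the actual `TypedArrowTable` type returned by useAnalyticsQuery may differ.

```typescript
// Minimal structural view of the Arrow table members used in Option B.
interface ArrowLikeColumn {
  get(index: number): unknown;
}

interface ArrowLikeTable {
  numRows: number;
  getChild(name: string): ArrowLikeColumn | null;
  toArray(): unknown[];
}

// Reads the first value of a column. Returns undefined if the data has not
// loaded yet, the table is empty, or the column does not exist.
function firstValue(data: ArrowLikeTable | undefined, column: string): unknown {
  if (!data || data.numRows === 0) return undefined;
  return data.getChild(column)?.get(0);
}

// In-memory stand-in for a table, built from plain column arrays (test-only).
function fakeTable(columns: Record<string, unknown[]>): ArrowLikeTable {
  const names = Object.keys(columns);
  const numRows = names.length > 0 ? columns[names[0]].length : 0;
  return {
    numRows,
    getChild: (name) =>
      name in columns ? { get: (i) => columns[name][i] } : null,
    toArray: () => {
      const rows: Record<string, unknown>[] = [];
      for (let i = 0; i < numRows; i++) {
        const row: Record<string, unknown> = {};
        for (const n of names) row[n] = columns[n][i];
        rows.push(row);
      }
      return rows;
    },
  };
}
```

A component that previously read `data[0].col` could call `firstValue(data, "col")` instead, keeping the empty and not-yet-loaded cases in one place.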

In-repo migrations done in this PR

  • template/client/src/pages/analytics/AnalyticsPage.tsx is migrated to the Arrow API (this is the example new users get from databricks apps init, so it should demonstrate the new default).
  • apps/dev-playground/client/src/features/smart-dashboard/hooks/use-dashboard-data.ts — pinned to JSON_ARRAY (7 call sites, all feeding chart components that consume the row-array shape).
  • apps/dev-playground/client/src/routes/analytics.route.tsx — pinned to JSON_ARRAY (3 call sites).
  • apps/dev-playground/client/src/routes/sql-helpers.route.tsx — pinned to JSON_ARRAY.
  • packages/appkit-ui/src/react/table/table-wrapper.tsx (DataTable) — pinned to JSON_ARRAY. The table renders rows as a JS array; an Arrow-native version is a separate optimization.

Out-of-repo migrations needed (follow-up PRs)

These external repos all have call sites or examples that assume the JSON_ARRAY shape. None of them are blockers for landing this PR, but they should be updated when this lands or shortly after — flagged here so they don't get lost:

Code in external repos:

  • databricks/cli: experimental/aitools/templates/appkit/template/{{.project_name}}/client/src/App.tsx uses data.length / data[0].value.
  • databricks/app-templates: appkit-all-in-one/client/src/pages/analytics/AnalyticsPage.tsx and appkit-analytics/client/src/pages/analytics/AnalyticsPage.tsx use the same pattern.
  • databricks/devhub: examples/content-moderator/template/client/src/pages/AnalyticsPage.tsx uses useAnalyticsQuery; needs an audit of its consumers.

Docs in external repos (need text refresh, not code):

  • databricks/devhub: static/raw-docs/appkit/v0/plugins/analytics.md, static/raw-docs/appkit/v0/development/type-generation.md, static/raw-docs/appkit/v0/api/appkit-ui/data/DataTable.mdx, and the AI-skill references under .agents/skills/databricks-apps/references/appkit/*.md.
  • databricks/cli: experimental/aitools/templates/appkit/template/{{.project_name}}/docs/*.md (AI agent docs that show the JSON shape).

Tests

  • analytics.test.ts — flipped the "default format" assertion from JSON_ARRAY to ARROW_STREAM (the test verifies the default; its meaning is unchanged).
  • analytics.integration.test.ts — the cache test now explicitly requests JSON_ARRAY because the ARROW path bypasses cache by design (inline-stash ids drain on first read, so a cache hit would replay a dead id).
  • use-chart-data.test.ts — flipped two "auto-selects default" assertions.
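
The reason the cache test had to pin JSON_ARRAY can be illustrated with a toy model. `InlineStash` below is a hypothetical stand-in, not the appkit implementation: it reproduces only the property stated above, that a stash id can be read once and is gone afterwards.

```typescript
// Toy one-shot store: each id can be read exactly once.
class InlineStash {
  private entries = new Map<string, Uint8Array>();
  private nextId = 0;

  // Stores a payload and returns a one-shot id for it.
  put(payload: Uint8Array): string {
    const stashId = `stash-${this.nextId++}`;
    this.entries.set(stashId, payload);
    return stashId;
  }

  // Returns the payload and removes it; a second read of the same id
  // yields undefined.
  take(stashId: string): Uint8Array | undefined {
    const payload = this.entries.get(stashId);
    this.entries.delete(stashId);
    return payload;
  }
}

const stash = new InlineStash();

// A response cache that stored this object would hand the same id to
// every later caller.
const cachedResponse = { stashId: stash.put(new Uint8Array([1, 2, 3])) };

const firstRead = stash.take(cachedResponse.stashId); // the payload
const secondRead = stash.take(cachedResponse.stashId); // undefined: the id is dead
```

Under this model a cache hit on the ARROW path would return a response whose id no longer resolves, which is consistent with the path bypassing the cache by design.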

Full suite: 2,674 tests, all green.

This pull request was AI-assisted by Isaac.

useAnalyticsQuery now returns a TypedArrowTable by default instead of a
row array. Callers that need the JSON-row shape must pass
{ format: 'JSON_ARRAY' } explicitly. The default switch applies to the
hook, the chart-data hook, the SQL warehouse connector defaults, and
the analytics plugin request handler.

Why:
- ARROW_STREAM preserves column types (number stays number, bigint stays
  bigint) end-to-end. JSON_ARRAY stringifies everything on the wire.
- ARROW IPC is 3-5x more compact than JSON for numeric data and parses
  faster on the client.
- This PR stacks on the disposition-fallback PR, which makes both
  defaults work across all warehouse variants — but ARROW is the format
  the warehouses 'natively' want for INLINE, and aligning with that
  avoids the server-side decode the JSON_ARRAY fallback has to do
  against inline-arrow-only warehouses.

Migration:
- For tabular code that walks data.length / data[i], either:
  (a) opt back into JSON_ARRAY:
      useAnalyticsQuery('q', params, { format: 'JSON_ARRAY' });
  (b) switch to Arrow API: data.numRows / data.getChild('col')?.get(i)
      / data.toArray().
- DataTable.tsx, dev-playground analytics + dashboard routes, and the
  SQL-helpers route are all pinned to JSON_ARRAY in this PR to preserve
  their existing rendering.
- The template AnalyticsPage is updated to the Arrow API to demonstrate
  the new default.

BREAKING CHANGE: Default format for useAnalyticsQuery and the analytics
plugin request handler is now ARROW_STREAM instead of JSON_ARRAY.

Depends on #329 (the disposition-fallback PR); merge that first.

Signed-off-by: James Broadhead <jamesbroadhead@gmail.com>
@jamesbroadhead jamesbroadhead requested a review from a team as a code owner May 15, 2026 16:27
@jamesbroadhead jamesbroadhead requested review from atilafassina and removed request for a team May 15, 2026 16:27