Skip to content

optimizeOrExpression silently drops items when one OR branch references an unindexed field #1502

@michaelstewart

Description

@michaelstewart

Below is an AI generated summary of a bug I came across today:

Summary

optimizeOrExpression in packages/db/src/utils/index-optimization.ts is unsound. When a where clause contains an or(...) whose branches reference a mix of indexed and unindexed fields, the optimizer returns canOptimize: true with a result set that is the union of only the indexed branches — silently dropping items that would match the unindexed branch.

The result is that useLiveQuery / currentStateAsChanges returns an incomplete subset of the matching rows, with no warning or error. Callers don't fall back to a full scan because canOptimize is true.

Where the bug is

packages/db/src/utils/index-optimization.ts (verified on main and on the published 0.6.5 tarball — identical code):

function optimizeOrExpression(expression, collection) {
  if (expression.type !== 'func' || expression.args.length < 2) {
    return { canOptimize: false, matchingKeys: new Set() }
  }

  const results = []

  // Try to optimize each part, keep the optimizable ones
  for (const arg of expression.args) {
    const result = optimizeQueryRecursive(arg, collection)
    if (result.canOptimize) {
      results.push(result)
    }
    // ⚠️ non-optimizable args are dropped on the floor
  }

  if (results.length > 0) {
    const allMatchingSets = results.map((r) => r.matchingKeys)
    const unionedKeys = unionSets(allMatchingSets)
    return { canOptimize: true, matchingKeys: unionedKeys }
    // ⚠️ canOptimize: true even though some branches were skipped
  }

  return { canOptimize: false, matchingKeys: new Set() }
}

For OR semantics this is unsound. If any branch can't be optimized, the optimizer can't soundly compute the answer from the indexes alone — the missing branch could match items not in any indexed branch's result set. The function should return canOptimize: false and let the consumer fall back to a full scan.

The corresponding optimizeAndExpression is fine because for AND, dropping a non-optimizable arg only widens the result set (the consumer is expected to apply the full predicate to that superset). For OR it narrows it.

Repro

Conceptual scenario that triggers it in practice:

  1. A collection has items where some match field_A = X, others match field_B = Y, with no overlap.
  2. Some prior query has caused field_A to be auto-indexed (autoIndex: 'eager' is the default).
  3. field_B has not been queried in isolation, so no index exists on it.
  4. Run a live query with where: or(eq(field_A, X), eq(field_B, Y)).

Expected: rows matching either branch.
Actual: only rows matching field_A = X. Rows matching field_B = Y (and not field_A = X) are dropped.

If you reorder the queries so a query touches field_B first (causing auto-indexing on field_B), the bug disappears, because now both branches are optimizable.

Suggested fix

function optimizeOrExpression(expression, collection) {
  if (expression.type !== 'func' || expression.args.length < 2) {
    return { canOptimize: false, matchingKeys: new Set() }
  }

  const results = []
  for (const arg of expression.args) {
    const result = optimizeQueryRecursive(arg, collection)
    if (!result.canOptimize) {
      // OR with any non-optimizable branch cannot be soundly index-optimized
      return { canOptimize: false, matchingKeys: new Set() }
    }
    results.push(result)
  }

  const unionedKeys = unionSets(results.map((r) => r.matchingKeys))
  return { canOptimize: true, matchingKeys: unionedKeys }
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions