Skip to content

Isolate filter feature for clean rollback; flag lex-comparison pitfall#240

Merged
thodson-usgs merged 3 commits intoDOI-USGS:mainfrom
thodson-usgs:add-filter-pitfall-check
Apr 28, 2026
Merged

Isolate filter feature for clean rollback; flag lex-comparison pitfall#240
thodson-usgs merged 3 commits intoDOI-USGS:mainfrom
thodson-usgs:add-filter-pitfall-check

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented Apr 24, 2026

Summary

Two changes, in one PR because the second motivates the first:

  1. Isolate the CQL filter feature into its own module so the entire feature can be removed by deleting two source files, two test files, and two re-export lines. All filter and chunking logic moves out of api.py / utils.py into dataretrieval/waterdata/filters.py, and get_nearest_continuous moves into a sibling dataretrieval/waterdata/nearest.py. No public API change.

  2. Add a pre-flight check that raises on unquoted-numeric comparisons in cql-text filters (value > 1000, parameter_code IN (60, 61), value BETWEEN 5 AND 10). Every queryable on the Water Data OGC API is type=string server-side, so unquoted numeric literals either get rejected with HTTP 500 or silently produce lexicographic results.

Motivation

Concern raised on the R side (DOI-USGS/dataRetrieval#880):

If I saw a generic filter argument, my first thought would be SWEET, I want to answer all these interesting questions about the data. So I'll set filter=value >1000. The problem with that is it works, but since value is a character, it's filtering all the values that are alphabetically above "1000" (like "12"). … My gut says there would be way more people trying stuff like that and then either get the wrong results unknowingly, or complain that dataRetrieval is broken…

That's a real failure mode for the Python side too. The pre-flight check addresses it; the module isolation makes the whole feature easy to roll back if it turns out to cause more confusion than it solves.

What changed

New modules

  • dataretrieval/waterdata/filters.py (+331) — owns FILTER_LANG, top-level-OR splitting, URL-byte-budget probing, the lex-pitfall guard, and a chunked decorator that wraps the single-request fetch with all of the above.
  • dataretrieval/waterdata/nearest.py (+241) — get_nearest_continuous extracted as-is from api.py.
  • tests/waterdata_filters_test.py (+589) — all filter/chunking tests, plus 35 new tests for the lex-pitfall guard (every op × both orderings, IN, BETWEEN, negation, quoted-string false-positive guards, end-to-end through get_continuous).

Slimmed

  • dataretrieval/waterdata/api.py (−350 / +45) — chunking is now applied via @chunked(build_request=...) on the per-request fetch.
  • dataretrieval/waterdata/utils.py (−241 / +15) — filter helpers removed.
  • tests/waterdata_utils_test.py (−444) — filter tests moved to waterdata_filters_test.py.

Re-exports in dataretrieval/waterdata/__init__.py updated; no public symbol added or removed.

Behavior of the lex-pitfall check

>>> waterdata.get_continuous(
...     monitoring_location_id="USGS-02238500",
...     parameter_code="00060",
...     filter="value >= 1000",
...     filter_lang="cql-text",
... )
ValueError: Filter uses an unquoted numeric comparison against 'value'
(``value >= 1000``). Every queryable on the Water Data API is typed as
a string, so the server rejects unquoted numeric literals with HTTP
500; even quoting the literal gives a lexicographic comparison
(``value > '10'`` matches ``value='34.52'``, ``parameter_code = '60'``
matches nothing because the real codes are ``'00060'``-shaped). For a
true numeric filter, fetch a wider result and reduce in pandas.

Quoted literals (value >= '1000') are not flagged — the caller has signalled they want sort-order semantics.

Live-confirmed against USGS-02238500 / continuous:

filter="parameter_code = '00060'"   → 200, 5 rows (correct)
filter="parameter_code = 60"        → 500 Internal Server Error  ← caught client-side
filter="value > '10'"               → 200, 31 rows of '34.52', '63160', …  (lex)
filter="value > 10"                 → 500 Internal Server Error  ← caught client-side

Rollback path

If the filter feature is rolled back, the change is mechanical:

  • delete dataretrieval/waterdata/filters.py
  • delete tests/waterdata_filters_test.py
  • drop from .filters import FILTER_LANG and the FILTER_LANG entry in __all__ in dataretrieval/waterdata/__init__.py
  • drop the @chunked(build_request=...) decorator and filter / filter_lang kwargs from api.py / utils.py

get_nearest_continuous (in nearest.py) is independent and stays.

Test plan

  • pytest tests/waterdata_filters_test.py tests/waterdata_utils_test.py tests/waterdata_nearest_test.py111/111 pass.
  • Full non-live suite — 206/206 pass.
  • Live-probed actual server behavior against USGS-02238500 / continuous: unquoted numeric RHS consistently returns 500; quoted literal returns 200 with lex-sorted results.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the add-filter-pitfall-check branch from 559b466 to 68d1a81 Compare April 24, 2026 19:51
Move all CQL filter and chunking logic out of api.py / utils.py into a
dedicated dataretrieval/waterdata/filters.py module (with chunked as a
decorator on the per-request fetch), and extract get_nearest_continuous
into a sibling nearest.py — so the entire filter feature can be removed
by deleting two source files, two test files, and two re-export lines.
Adds a pre-flight check that raises on unquoted-numeric comparisons
(value > 1000, parameter_code IN (60, 61), value BETWEEN 5 AND 10),
since every Water Data API queryable is string-typed and the server
either returns HTTP 500 or silently produces lexicographically-sorted
wrong rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the add-filter-pitfall-check branch from fbd0f51 to d81dd33 Compare April 27, 2026 15:06
@thodson-usgs thodson-usgs changed the title Check for silent lexicographic comparison against string-typed value Isolate filter feature for clean rollback; flag lex-comparison pitfall Apr 27, 2026
@thodson-usgs thodson-usgs requested a review from Copilot April 28, 2026 13:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reorganizes Water Data OGC filter support into dedicated modules to enable easy rollback, while adding a client-side preflight guard that raises on unquoted numeric literals in cql-text filters (to prevent server 500s and lexicographic-comparison footguns). It also extracts get_nearest_continuous into its own module without changing the public API surface.

Changes:

  • Moved filter/chunking/URL-budget logic into dataretrieval/waterdata/filters.py and applied it via a @chunked(...) decorator around the single-request fetch path.
  • Added _check_numeric_filter_pitfall to reject unquoted numeric comparisons in cql-text filters, with extensive new tests.
  • Extracted get_nearest_continuous into dataretrieval/waterdata/nearest.py and updated imports/tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
dataretrieval/waterdata/filters.py New home for filter language type alias, chunking + URL-budget probe, and numeric-literal pitfall guard.
dataretrieval/waterdata/utils.py Removes inline chunking pipeline; wraps _fetch_once with @filters.chunked(...) and returns metadata from the (possibly aggregated) response.
dataretrieval/waterdata/nearest.py New module containing get_nearest_continuous extracted from api.py.
dataretrieval/waterdata/api.py Updates FILTER_LANG import and docstrings; removes inlined get_nearest_continuous.
dataretrieval/waterdata/types.py Removes FILTER_LANG type alias (now in filters.py).
dataretrieval/waterdata/__init__.py Updates re-exports for FILTER_LANG and get_nearest_continuous.
tests/waterdata_filters_test.py New test module covering filter passthrough, chunking behavior, and numeric-literal pitfall guard.
tests/waterdata_utils_test.py Removes filter/chunking tests moved into waterdata_filters_test.py.
tests/waterdata_nearest_test.py Updates imports/patch targets to follow the new nearest.py module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dataretrieval/waterdata/filters.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@thodson-usgs thodson-usgs marked this pull request as ready for review April 28, 2026 14:16
@thodson-usgs thodson-usgs merged commit 8cbff61 into DOI-USGS:main Apr 28, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants