Add a heuristic that tells if the standard diff is possible #5

@jcpitre

In #1 we realized that some GTFS datasets use IDs (e.g. shape_id) that are regenerated for every dataset (see #4).
In that case the first version of the gtfs-diff engine cannot find meaningful differences.

As a stop-gap measure, we should have some kind of heuristic that quickly tells us if the diff engine can be used on a given dataset.

Copilot Suggestion:

Do a cheap O(N) pre-flight check that scans only the id columns of each file (no full row parsing). For every file present in both feeds, we compute:

churn = size((base_ids OR new_ids) - (base_ids AND new_ids)) / size(base_ids OR new_ids)

If the weighted overall churn across all files reaches the threshold (50%, user-definable), the diff is aborted with a clear error message listing the files that have high churn. This prevents the engine from producing a meaningless diff when a publisher has regenerated all IDs between versions.
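
A minimal Python sketch of the check described above, for discussion only. It assumes the ID sets have already been extracted per file, and it assumes each file is weighted by the size of its ID union, since the weighting is not specified here. The names `churn`, `weighted_churn` and `check_diffable` are hypothetical and not part of gtfs-diff.

```python
from typing import Dict, Set, Tuple


def churn(base_ids: Set[str], new_ids: Set[str]) -> float:
    """Share of IDs that appear in only one of the two feeds."""
    union = base_ids | new_ids
    if not union:
        return 0.0
    # base_ids ^ new_ids is the symmetric difference: union minus intersection.
    return len(base_ids ^ new_ids) / len(union)


def weighted_churn(
    base: Dict[str, Set[str]], new: Dict[str, Set[str]]
) -> Tuple[float, Dict[str, float]]:
    """Return (overall churn, per-file churn) for files present in both feeds.

    Assumption: each file is weighted by the size of its ID union.
    """
    per_file: Dict[str, float] = {}
    changed = 0
    total = 0
    for name in sorted(base.keys() & new.keys()):
        per_file[name] = churn(base[name], new[name])
        changed += len(base[name] ^ new[name])
        total += len(base[name] | new[name])
    overall = changed / total if total else 0.0
    return overall, per_file


def check_diffable(
    base: Dict[str, Set[str]], new: Dict[str, Set[str]], threshold: float = 0.5
) -> None:
    """Raise ValueError if the overall churn reaches the threshold."""
    overall, per_file = weighted_churn(base, new)
    if overall >= threshold:
        high = [name for name, value in per_file.items() if value >= threshold]
        raise ValueError(
            f"Overall ID churn {overall:.0%} reaches threshold {threshold:.0%}; "
            f"high-churn files: {', '.join(high)}"
        )


if __name__ == "__main__":
    base = {"shapes.txt": {"a", "b"}, "stops.txt": {"1", "2", "3", "4"}}
    new = {"shapes.txt": {"c", "d"}, "stops.txt": {"1", "2", "3", "4"}}
    try:
        check_diffable(base, new)
    except ValueError as err:
        print(err)
```

In this example every shape_id was regenerated while the stop IDs are unchanged, so shapes.txt has a churn of 100% and the overall churn is 4 changed IDs out of 8. A different weighting (for instance per file rather than per ID) would give a different overall value, so the choice should be settled before implementing.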
