ETL: incremental company registry sync using change detection #28

@francescobianco

Description

Scenario

Teams building data warehouses over the Italian company registry (Registro Imprese) need to run incremental syncs that fetch only the records changed since the last run, rather than full reloads.

Current approach (inefficient)

// Full reload every night: very expensive in API credits
$page = 1;
do {
    $response = json_decode($client->get('https://company.openapi.com/IT-search', [
        'page' => $page++,
    ]), true);
    // Guard against a missing 'items' key and skip the final, empty page
    $items = $response['items'] ?? [];
    if (!empty($items)) {
        syncToWarehouse($items);
    }
} while (!empty($items));

Better approach

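If the API supports filtering by modification time (for example an updated_after parameter, which is unconfirmed), request only the records changed since the last run. Otherwise, store a local hash of each record and compare it on each run.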

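// Hypothetical: 'updated_after' is not a confirmed API parameter (see open questions)
$lastRun = $store->get('last_sync_timestamp');
$startedAt = gmdate('c'); // capture before the request so changes made during the run are not missed
$response = json_decode($client->get('https://company.openapi.com/IT-search', [
    'updated_after' => $lastRun,
]), true);
syncToWarehouse($response['items'] ?? []);
$store->set('last_sync_timestamp', $startedAt); // assumes $store has a set() counterpart to get()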
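
The hash-based fallback mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names `recordHash` and `detectChanges`, the `id` field used as the record key, and the idea that the previous run's hashes are available as a plain `id => hash` array. Nothing here comes from the SDK.

```php
<?php
declare(strict_types=1);

/**
 * Returns a stable fingerprint for a record. Keys are sorted recursively so
 * that field order in the API response does not change the hash.
 */
function recordHash(array $record): string
{
    $normalize = function (array $value) use (&$normalize): array {
        ksort($value);
        foreach ($value as $k => $v) {
            if (is_array($v)) {
                $value[$k] = $normalize($v);
            }
        }
        return $value;
    };

    return hash('sha256', json_encode($normalize($record), JSON_THROW_ON_ERROR));
}

/**
 * Splits fetched records into those that are new or changed since the last run.
 *
 * $knownHashes maps record id => hash stored by the previous run.
 * The 'id' field name is hypothetical; use whatever uniquely identifies a company.
 * Returns [changedRecords, updatedHashes].
 */
function detectChanges(array $records, array $knownHashes, string $idField = 'id'): array
{
    $changed = [];
    foreach ($records as $record) {
        $id = (string) $record[$idField];
        $hash = recordHash($record);
        if (($knownHashes[$id] ?? null) !== $hash) {
            $changed[] = $record;
            $knownHashes[$id] = $hash;
        }
    }

    return [$changed, $knownHashes];
}
```

Note that this approach still fetches every record, so it reduces warehouse writes but not API credits. Only a server-side filter or a push mechanism would cut the number of requests.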

Open questions

  • Does the company search API support updated_after or equivalent filtering?
  • Is there a webhook/event stream alternative to polling for changes?
  • Could the SDK expose a since() query builder to make this pattern cleaner?
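
To make the last question concrete, here is a minimal sketch of what a since() builder might look like. The class name `CompanySearchQuery`, its methods, and the `updated_after` parameter are all hypothetical. The SDK does not currently expose this, and whether the API accepts such a parameter is exactly what the first question asks.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical immutable query builder; not part of the current SDK.
 */
final class CompanySearchQuery
{
    /** @var array<string, mixed> */
    private array $params = [];

    /** Restricts results to records changed after the given instant (normalized to UTC). */
    public function since(DateTimeInterface $timestamp): self
    {
        $clone = clone $this;
        $utc = DateTimeImmutable::createFromInterface($timestamp)
            ->setTimezone(new DateTimeZone('UTC'));
        // 'updated_after' is an assumed parameter name, not a confirmed one
        $clone->params['updated_after'] = $utc->format(DateTimeInterface::ATOM);

        return $clone;
    }

    /** Selects a result page; values below 1 are clamped to 1. */
    public function page(int $page): self
    {
        $clone = clone $this;
        $clone->params['page'] = max(1, $page);

        return $clone;
    }

    /** Returns the query parameters to pass to the HTTP client. */
    public function toArray(): array
    {
        return $this->params;
    }
}
```

A caller could then write `(new CompanySearchQuery())->since($lastRun)->page(1)->toArray()` and pass the result to the client as the query parameters, which keeps timestamp formatting in one place.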
