@@ -281,26 +283,110 @@ Every document returned from `listDocuments`/`getDocument` must include:
 {
   externalId: string // Source-specific unique ID
   title: string // Document title
-  content: string // Extracted plain text
+  content: string // Extracted plain text (or '' if contentDeferred)
+  contentDeferred?: boolean // true = content will be fetched via getDocument
   mimeType: 'text/plain' // Always text/plain (content is extracted)
-  contentHash: string // SHA-256 of content (change detection)
+  contentHash: string // Metadata-based hash for change detection
   sourceUrl?: string // Link back to original (stored on document record)
   metadata?: Record<string, unknown> // Source-specific data (fed to mapTags)
 }
```
 
-## Content Hashing (Required)
+## Content Deferral (Required for file/content-download connectors)
 
-The sync engine uses content hashes for change detection:
+**All connectors that require per-document API calls to fetch content MUST use `contentDeferred: true`.** This is the standard pattern: `listDocuments` returns lightweight metadata stubs, and the sync engine fetches content lazily via `getDocument` only for new or changed documents.
+
+This pattern is critical for reliability: the sync engine processes documents in batches and enqueues each batch for processing immediately. If a sync times out, all previously batched documents are already queued. Without deferral, content downloads during listing can exhaust the sync task's time budget before any documents are saved.
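The lazy-fetch flow described in the added paragraphs can be sketched roughly as follows. This is a simplified, synchronous illustration, not the sync engine's actual code: the `Connector` interface, `syncOnce`, and the `knownHashes` map are hypothetical names, and a real connector would almost certainly be asynchronous and batched. Only the document field names are taken from the interface in the diff.

```typescript
// Document shape, reduced to the fields this sketch needs.
interface ExternalDocument {
  externalId: string;
  title: string;
  content: string;
  contentDeferred?: boolean;
  mimeType: 'text/plain';
  contentHash: string;
}

// Hypothetical connector surface exposing the two methods named in the text.
interface Connector {
  listDocuments(): ExternalDocument[];
  getDocument(externalId: string): ExternalDocument;
}

// Illustrative sync pass: content is fetched only for new or changed documents.
function syncOnce(
  connector: Connector,
  knownHashes: Map<string, string>,
): ExternalDocument[] {
  const processed: ExternalDocument[] = [];
  for (const stub of connector.listDocuments()) {
    // An unchanged hash means the stored copy is current, so nothing is downloaded.
    if (knownHashes.get(stub.externalId) === stub.contentHash) continue;
    // Deferred stubs carry no content; fetch it now, one document at a time.
    const doc = stub.contentDeferred
      ? connector.getDocument(stub.externalId)
      : stub;
    knownHashes.set(doc.externalId, doc.contentHash);
    processed.push(doc);
  }
  return processed;
}
```

In this sketch the listing step never downloads content, so a slow download delays a single document rather than the whole listing.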
+
+### When to use `contentDeferred: true`
+
+- The service's list API does NOT return document content (only metadata)
+- Content requires a separate download/export API call per document
[intervening diff lines are missing from this excerpt]
+- The list API already returns the full content inline (e.g., Slack messages, Reddit posts, HubSpot notes)
+- No per-document API call is needed to get content
+
+### Content Hash Strategy
+
+Use a **metadata-based** `contentHash`, never a content-based hash. The hash must be derivable from the list response metadata alone, so the sync engine can detect changes without downloading content.
+
+Good metadata hash sources:
+- `modifiedTime` / `lastModifiedDateTime`: changes when the file is edited
[intervening diff lines are missing from this excerpt]
+**Critical invariant:** The `contentHash` MUST be identical whether produced by `listDocuments` (stub) or `getDocument` (full doc). Both should use the same stub function to guarantee this.
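One way to satisfy the invariant is to derive the hash from list metadata in a single helper and build the full document on top of the stub. The sketch below is illustrative only: `FileMeta`, `metadataHash`, `toStub`, and `toFullDocument` are hypothetical names, the choice of `id` plus `modifiedTime` as hash input is an assumption, and SHA-256 via Node's `crypto` module is just one reasonable digest.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical list-response metadata; the field names are assumptions.
interface FileMeta {
  id: string;
  name: string;
  modifiedTime: string; // e.g. an ISO-8601 timestamp returned by the list API
}

// Reduced document shape for this sketch.
interface DocSketch {
  externalId: string;
  title: string;
  content: string;
  contentDeferred?: boolean;
  contentHash: string;
}

// Hash built from metadata only, so it can be computed without downloading content.
function metadataHash(file: FileMeta): string {
  return createHash('sha256')
    .update(`${file.id}:${file.modifiedTime}`)
    .digest('hex');
}

// Stub as a listing step might return it: no content, deferred fetch.
function toStub(file: FileMeta): DocSketch {
  return {
    externalId: file.id,
    title: file.name,
    content: '',
    contentDeferred: true,
    contentHash: metadataHash(file),
  };
}

// Full document reuses the stub, so contentHash cannot drift between the two paths.
function toFullDocument(file: FileMeta, content: string): DocSketch {
  return { ...toStub(file), content, contentDeferred: false };
}
```

Because `toFullDocument` spreads the stub instead of recomputing fields, the hash stays the same regardless of the downloaded content, and it changes only when the metadata does.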
+
+### Implementation Pattern
+
+
```typescript
+// 1. Create a stub function (sync, no API calls)
+function fileToStub(file: ServiceFile): ExternalDocument {