diff --git a/docs/filesystem_on_elastickv_design.md b/docs/filesystem_on_elastickv_design.md
new file mode 100644
index 00000000..36f420fc
--- /dev/null
+++ b/docs/filesystem_on_elastickv_design.md
@@ -0,0 +1,437 @@
# Filesystem on Elastickv Design

## 1. Background

Elastickv already provides:

1. Raft-replicated KV state per shard group.
2. Range-based shard routing via `distribution.Engine`.
3. A transactional write path, with the current limitation that cross-shard transactions are not supported.

This design adds a filesystem layer on top of Elastickv with fixed-size file chunks and a placement policy that keeps chunks of the same file on one shard in normal operation. The backend API is protocol-neutral and can be exposed through multiple frontends, including FUSE.

## 2. Goals and Non-goals

### 2.1 Goals

1. Provide filesystem primitives (`create`, `read`, `write`, `truncate`, `rename`, `unlink`, `mkdir`, `readdir`) on Elastickv.
2. Store file data in fixed-size chunks.
3. Make same-file multi-shard placement rare by design (single-home shard per file).
4. Keep file data I/O on one shard for common operations.
5. Support resumable whole-file migration for rebalancing.
6. Define a protocol-neutral filesystem API that can be mapped to FUSE operations without semantic gaps in supported features.

### 2.2 Non-goals (Phase 1)

1. Full POSIX parity (`mmap` coherency, hard-link edge cases, full advisory locking).
2. Automatic striping of a single large file across many shards by default.
3. Cross-directory atomic rename when source and destination directories are in different shard-home domains.
4. FUSE-only architecture (the backend remains frontend-agnostic).

## 3. Requirements

### 3.1 Functional

1. Random read/write by offset.
2. Sparse file support (unwritten chunks are implicit zeroes).
3. Directory namespace with inode-based metadata.
4. Crash-safe create/delete/move/migration operations.

### 3.2 Placement

1. 
All chunks of one file should have a single home shard in the normal state.
2. During rebalance, temporary dual placement is allowed only in the `MIGRATING` state.
3. Control-plane split must avoid selecting split keys inside one file's key-range prefix.

### 3.3 Consistency

1. Read-after-write for a file handle.
2. Monotonic file size and mtime updates.
3. No orphaned visible directory entry pointing to a non-existent inode after recovery.

### 3.4 FUSE compatibility requirements

1. Stable inode number (`st_ino`) and generation for the object lifetime; in Phase 1 inode IDs are not reused (`root=1` fixed), so the generation is constant `1`.
2. `getattr` returns coherent `mode`, `nlink`, `uid`, `gid`, `size`, `mtime`, `ctime`, `atime` with nanosecond precision.
3. `readdir` supports resumable offset cookies and emits `"."` and `".."` on the first page (empty-cookie start).
4. `rename` is atomic within a supported domain; unsupported cross-domain rename returns `EXDEV`.
5. `unlink`/`rmdir` return the expected errno (`ENOENT`, `ENOTEMPTY`, `EISDIR`, `ENOTDIR`) and support remove-while-open semantics via delayed physical GC.
6. The `write` durability boundary is explicit:
   - `write` + `flush` guarantees visibility to the same mount session.
   - `fsync` guarantees Raft-committed durability.
7. Unsupported operations required by a given FUSE mount mode return an explicit `ENOSYS` or `EOPNOTSUPP` (not a silent no-op).

## 4. High-level Architecture

Components:

1. `FS Core API` (new backend service): protocol-neutral filesystem primitives and placement logic.
2. `Frontend Adapters`: the FUSE adapter and non-FUSE clients map into `FS Core API`.
3. `Namespace Store`: inode + directory metadata in Elastickv keys.
4. `File Data Store`: fixed-size chunk values in Elastickv keys.
5. `Placement Manager`: assigns a home shard per file and runs whole-file moves.

Data flow:

1. Lookup path -> inode metadata.
2. Resolve the file's home shard.
3. 
Read/write chunk keys under that file-home prefix.
4. Adapter returns a protocol-specific response/errno (FUSE or RPC).

## 5. Data Model and Key Layout

### 5.1 IDs

1. `inode_id`: 64-bit non-zero unique ID (`root=1` reserved) that maps directly to FUSE `st_ino`.
2. `inode_generation`: 64-bit generation exported to FUSE (`1` in Phase 1 because inode IDs are never reused).
3. `home_slot`: 64-bit placement token derived at create time.
4. `chunk_index`: 64-bit unsigned index.

`inode_id` allocation policy:

1. IDs are generated by a distributed random allocator (not strictly sequential).
2. On collision, the allocator retries with a new random value.
3. This avoids sequential-key concentration in the `!fs|ino|...` and `!fs|home|...` ranges during high create throughput.

### 5.2 Key schema (logical)

```text
!fs|ino|<inode_id>                           -> InodeMeta
!fs|dir|<parent_inode_id>|<name>             -> DirEntry(child_inode_id, type)
!fs|dirv|<dir_inode_id>                      -> DirVersion
!fs|home|<inode_id>                          -> Home(home_slot, state, epoch)
!fs|chk|<home_slot>|<inode_id>|<chunk_index> -> ChunkPayload
!fs|ref|<inode_id>|<fh_id>|<client_id>       -> OpenHandleLease(ttl)
!fs|intent|<intent_id>                       -> IntentState(kind, cursor, payload)
!fs|job|move|<inode_id>                      -> MoveJobState
```

Notes:

1. `!fs|chk|...` keys include `home_slot` before `inode_id` to cluster same-file data.
2. `!fs|home|...` is the source of truth for home shard assignment.
3. `epoch` protects against stale writers during migration.
4. `!fs|ref|...` keeps remove-while-open semantics for FUSE-compatible behavior.
5. `!fs|intent|...` stores resumable crash-recovery intents for create/delete/move.

### 5.3 Composite key encoding rules

1. The `|` separators shown above are logical notation only; keys are encoded as binary tuples.
2. Fixed-width numeric fields (`inode_id`, `home_slot`, `chunk_index`, `fh_id`) are encoded as 8-byte big-endian.
3. Variable fields (`name`, `client_id`, `intent_id`) use a byte-safe order-preserving encoding, so legal filenames including `|` never collide with separators.
4. 
Directory entry key encoding for `<name>` preserves bytewise name ordering to keep `readdir` scans lexicographically stable.

### 5.4 Internal transaction key families

Elastickv transaction internals are not filesystem objects but are part of the overall keyspace and routing behavior:

1. `!txn|lock|...`
2. `!txn|int|...`
3. `!txn|cmt|...`
4. `!txn|rb|...`
5. `!txn|meta|...`

### 5.5 Inode metadata

`InodeMeta` includes:

1. file type, mode, uid/gid.
2. logical size.
3. `atime`, `mtime`, `ctime` (nanoseconds).
4. chunk size (fixed cluster-wide in Phase 1, still stored for forward compatibility).
5. current `home_slot` and `epoch` snapshot.
6. `nlink`.
7. `inode_generation`.

## 6. Fixed-size Chunking

### 6.1 Chunk size

1. Default: `4 MiB`.
2. Allowed values: `1 MiB`, `2 MiB`, `4 MiB`, `8 MiB` (cluster setting).
3. A file uses one configured chunk size for its lifetime in Phase 1.

### 6.2 Offset mapping

For byte offset `off`:

1. `chunk_index = off / chunk_size`
2. `chunk_offset = off % chunk_size`

The write path may touch multiple contiguous chunk indexes.

### 6.3 Partial chunk write

1. Read the existing chunk (or treat a missing chunk as zero-filled).
2. Patch the modified byte range.
3. Write back the full chunk atomically as one KV value.
4. Update the inode size if needed.

### 6.4 Sparse behavior

1. A missing chunk key means an all-zero chunk.
2. `truncate` to a larger size creates holes without materializing zero chunks.

## 7. Placement Strategy to Avoid Same-file Shard Scatter

### 7.1 Single-home rule

At file creation:

1. The placement manager computes `home_slot` using rendezvous hashing over active shard groups.
2. `!fs|home|<inode_id>` is written once (`state=ACTIVE`, `epoch=1`).
3. All chunk keys use that `home_slot`.

Result: a file is naturally concentrated on one shard-home range.
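The single-home computation above can be sketched as rendezvous (highest-random-weight) hashing. This is an illustrative sketch, not Elastickv's actual API: the function name `chooseHomeSlot`, the FNV-1a scoring, and plain `uint64` group IDs are all assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// score ranks a (file, group) pair; the file is homed on the group with
// the highest score. Any stable 64-bit hash works; FNV-1a is used for brevity.
func score(inodeID, groupID uint64) uint64 {
	h := fnv.New64a()
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[0:8], inodeID)
	binary.BigEndian.PutUint64(buf[8:16], groupID)
	h.Write(buf[:])
	return h.Sum64()
}

// chooseHomeSlot picks one home shard group per file. Because the score
// depends only on (inodeID, groupID), adding or removing a group moves
// only the files whose winning group changed, which is the stability
// property the single-home rule relies on.
func chooseHomeSlot(inodeID uint64, activeGroups []uint64) uint64 {
	best, bestScore := activeGroups[0], uint64(0)
	for _, g := range activeGroups {
		if s := score(inodeID, g); s >= bestScore {
			best, bestScore = g, s
		}
	}
	return best
}

func main() {
	groups := []uint64{101, 102, 103}
	// Every chunk of the same inode resolves to the same home slot.
	fmt.Println(chooseHomeSlot(42, groups) == chooseHomeSlot(42, groups))
}
```

Deterministic, state-free selection also means any node can recompute the initial placement without consulting `!fs|home|...`, which stays authoritative once written.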
### 7.2 Route key normalization

Extend shard route normalization (currently the `kv.routeKey` behavior in `kv/shard_key.go` for internal key families) to treat:

1. `!fs|chk|<home_slot>|<inode_id>|<chunk_index>`
2. transaction-wrapped variants (`!txn|...`)

as one logical routing domain per `(home_slot, inode_id)` when evaluating split candidates and stale-route protection.

### 7.3 Split guardrail

In split planning:

1. Candidate split keys inside the same `(home_slot, inode_id)` prefix are rejected.
2. The split key is snapped to the nearest file boundary prefix.
3. If a range mostly contains one huge file, auto-split is skipped and flagged as `FILE_PINNED_HOTSPOT`.

This directly reduces accidental same-file scatter caused by range splitting.

`FILE_PINNED_HOTSPOT` follow-up actions:

1. Increment `fs_file_pinned_hotspot_total` and emit an operator alert with inode/home metadata.
2. Keep serving with single-home placement as the safe default.
3. The operator may trigger `MoveFile(inode, target_group)` to shift load.
4. If the hotspot persists on a single huge file, the operator may enable the manual striping policy (Phase 2+) for that inode.

### 7.4 Rebalancing unit

Rebalancing moves whole files by default:

1. Copy all chunks to the target home slot.
2. Fence writes with an epoch bump.
3. Switch `!fs|home|<inode_id>` to the new slot.
4. Garbage-collect the old chunks after a grace window.

Exception:

1. An optional manual striping mode for very large files (Phase 2+) may place extents on multiple homes.
2. That mode is opt-in and disabled by default to preserve same-file affinity.

## 8. Operation Flows

### 8.1 Create file

1. Resolve the parent directory inode.
2. Allocate `inode_id` via the distributed random allocator.
3. Assign `home_slot`.
4. Write `!fs|ino|<inode_id>`, `!fs|home|<inode_id>`, and `!fs|dir|<parent_inode_id>|<name>` with an intent marker.
5. If create is requested with open semantics, allocate `fh_id` and write `!fs|ref|...`.
6. Commit and clear the intent.

Recovery replays unfinished intents idempotently.
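The read and write flows that follow both translate a byte range into per-chunk operations using the §6.2 mapping. A minimal Go sketch of that translation, with an assumed 4 MiB chunk size and illustrative names (`chunkSpan`, `spansFor` are not part of the design's API):

```go
package main

import "fmt"

const chunkSize = 4 << 20 // 4 MiB default from §6.1

// chunkSpan describes one contiguous piece of a byte-range I/O:
// n bytes starting at chunkOffset inside chunk chunkIndex.
type chunkSpan struct {
	chunkIndex  uint64
	chunkOffset uint64
	n           uint64
}

// spansFor translates [off, off+length) into per-chunk spans, following
// chunk_index = off / chunk_size and chunk_offset = off % chunk_size.
func spansFor(off, length uint64) []chunkSpan {
	var out []chunkSpan
	for length > 0 {
		idx := off / chunkSize
		co := off % chunkSize
		n := chunkSize - co // bytes left in this chunk
		if n > length {
			n = length
		}
		out = append(out, chunkSpan{idx, co, n})
		off += n
		length -= n
	}
	return out
}

func main() {
	// A 6 MiB write at offset 3 MiB touches chunk 0 (1 MiB tail),
	// chunk 1 (full 4 MiB), and chunk 2 (1 MiB head).
	for _, s := range spansFor(3<<20, 6<<20) {
		fmt.Println(s.chunkIndex, s.chunkOffset, s.n)
	}
}
```

Spans that cover a full chunk can skip the read half of the read-modify-write in §6.3, since the whole value is replaced.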
### 8.2 Read

1. Resolve `inode_id` and `home_slot`.
2. Translate `[offset, length]` into chunk indexes.
3. Batch read `!fs|chk|...`.
4. Fill holes with zero bytes.

### 8.3 Write

1. Resolve inode + home (`epoch` checked).
2. For each touched chunk: read-modify-write.
3. Update the inode size/mtime.
4. If the epoch changed due to migration, retry with a refreshed home mapping.

### 8.4 Rename

1. Same-directory rename: atomic update of directory keys.
2. Cross-directory rename:
   - If source and destination are within the same supported atomic domain, perform an atomic replace.
   - If not in the same atomic domain (Phase 1 limitation), return `EXDEV` for FUSE adapter compatibility.

### 8.5 Truncate and unlink

1. `truncate` shrink: delete chunks above the new EOF and patch the tail chunk.
2. `unlink`: remove the dentry and decrement `nlink`; when `nlink==0`, mark the inode as orphaned.
3. Physical chunk GC runs only when `nlink==0` and no open-handle lease exists (`!fs|ref|...` empty).

### 8.6 Open, flush, fsync, release

1. `open` allocates `fh_id` and writes/refreshes `!fs|ref|<inode_id>|<fh_id>|<client_id>`.
2. `flush` pushes adapter-side buffered writes (if any) to `FS Core API`.
3. `fsync` waits for Raft commit durability on metadata and touched chunks.
4. `release` removes the handle lease; if the inode is orphaned and has no more leases, enqueue final GC.

### 8.7 Readdir

1. Directory entries are ordered lexicographically by name.
2. The offset cookie is an opaque cursor that encodes the directory snapshot version plus a position.
3. A `readdir` started with an empty cookie captures a directory snapshot (`dir_version`) and returns `"."` and `".."` first.
4. Subsequent paginated calls with returned cookies read from the same snapshot; concurrent newer mutations are not visible.
5. For non-empty cookie pages, `"."` and `".."` are omitted and listing resumes from the encoded position.

## 9. 
Failure Handling and Recovery

### 9.1 Intent log

Use intent keys for multi-step metadata/data transitions:

1. `CREATE_INTENT`
2. `DELETE_INTENT`
3. `MOVE_INTENT`

Intent records live under `!fs|intent|<intent_id>` and store a step cursor plus payload for resuming after a crash.

### 9.2 Epoch fencing

Writers include `(inode_id, epoch)` from the home mapping.

1. If the shard owner sees a stale epoch, it returns a retryable stale-home error.
2. The client refreshes `!fs|home|<inode_id>` and retries.

### 9.3 Idempotency

All move/create/delete steps are idempotent by key overwrite semantics and job cursor checkpoints.

### 9.4 Open-handle lease recovery

1. `!fs|ref|...` entries have a TTL and must be heartbeated by active clients.
2. A lease reaper clears expired refs.
3. Orphan inode GC checks active refs after reaping to preserve FUSE remove-while-open behavior.

## 10. APIs

### 10.1 FS Core API (protocol-neutral)

1. `Resolve(parent_inode, name) -> inode`
2. `GetAttr(inode) -> Stat`
3. `SetAttr(inode, mask, attrs) -> Stat`
4. `Open(inode, flags, client_id) -> fh_id`
5. `Read(inode, fh_id, offset, size) -> bytes`
6. `Write(inode, fh_id, offset, bytes) -> n`
7. `Flush(inode, fh_id)`
8. `Fsync(inode, fh_id, datasync)`
9. `Release(inode, fh_id, client_id)`
10. `Create(parent_inode, name, mode, uid, gid, flags, client_id) -> inode, fh_id`
11. `Mkdir(parent_inode, name, mode, uid, gid) -> inode`
12. `Unlink(parent_inode, name)`
13. `Rmdir(parent_inode, name)`
14. `Rename(old_parent, old_name, new_parent, new_name, flags)`
15. `Readdir(inode, cookie, limit) -> entries, next_cookie`
16. `StatFS(inode) -> capacity, free, files, free_files`

Error model:

1. Return typed filesystem errors (`ENOENT`, `EEXIST`, `ENOTDIR`, `EISDIR`, `ENOTEMPTY`, `EXDEV`, `EPERM`, `EACCES`, `EINVAL`, `EFBIG`, `ENOSYS`, `EOPNOTSUPP`).
2. The frontend adapter maps these 1:1 to protocol-specific errno/status.
### 10.2 FUSE mapping contract

| FUSE op | FS Core API | Requirement |
|---|---|---|
| `lookup` | `Resolve` + `GetAttr` | Stable inode and generation (`generation=1` in Phase 1) |
| `getattr` | `GetAttr` | `Stat` fields complete and coherent |
| `setattr` | `SetAttr` | `size`/time/mode updates with masks |
| `open` | `Open` | Return valid `fh_id` |
| `read` | `Read` | Offset-based, hole=zero |
| `write` | `Write` | Partial overwrite via chunk RMW |
| `flush` | `Flush` | Push session buffer |
| `fsync` | `Fsync` | Raft durability barrier |
| `release` | `Release` | Remove lease/ref |
| `create` | `Create` | Atomic name reservation + inode allocation |
| `mkdir` | `Mkdir` | Parent type and permission checks |
| `unlink` | `Unlink` | Delayed physical delete when open refs exist |
| `rmdir` | `Rmdir` | Must fail with `ENOTEMPTY` if children exist |
| `rename` | `Rename` | Atomic in-domain, `EXDEV` out-of-domain |
| `readdir` | `Readdir` | Stable cookie pagination; dot entries on first page |
| `statfs` | `StatFS` | Capacity and inode counters exposed |

Unsupported op handling:

1. `link`, `symlink`, `readlink`, and file-locking ioctls are out of scope for Phase 1.
2. The FUSE adapter returns `ENOSYS` or `EOPNOTSUPP` explicitly.

Internal placement API:

1. `GetFileHome(inode)`
2. `MoveFile(inode, target_group)`
3. `ListFilePlacementStats()`

## 11. Metrics and SLOs

Key metrics:

1. `fs_file_multi_shard_detected_total`
2. `fs_file_move_inflight`
3. `fs_chunk_read_ops_total`, `fs_chunk_write_ops_total`
4. `fs_chunk_read_latency_ms`, `fs_chunk_write_latency_ms`
5. `fs_home_epoch_conflict_total`
6. `fs_open_handle_lease_active`
7. `fs_orphan_inode_gc_pending`
8. `fs_file_pinned_hotspot_total`
9. `fs_inode_id_collision_retry_total`

SLO targets (initial):

1. `same_file_single_home_ratio >= 99.9%`
2. p99 chunk write latency under the target workload: `< 50 ms` (cluster dependent)

## 12. Implementation Plan

Phase 1:

1. 
Key schema + metadata/chunk CRUD.
2. Fixed chunk read/write/truncate.
3. Single-home placement at file create.
4. Split guardrail for file boundaries.
5. Protocol-neutral `FS Core API` and a thin FUSE adapter with the required errno mapping.

Phase 2:

1. Whole-file migration job.
2. Epoch fencing + stale-home retry path.
3. Placement metrics and operator commands.
4. Open-handle lease reaper and orphan-GC hardening.

Phase 3:

1. Cross-directory atomic rename improvements.
2. Optional large-file striping mode (opt-in).
3. Optional support for currently unsupported FUSE ops (`link`/`symlink`).

## 13. Test Plan

1. Unit:
   - chunk boundary math
   - sparse reads
   - key encode/decode
   - split guard boundary selection
2. Integration:
   - create/read/write/truncate/unlink
   - open-unlink-read-release lifecycle
   - rename in-domain succeeds, cross-domain returns `EXDEV`
   - readdir cookie resume under concurrent directory mutation (no duplicates/skips within one snapshot sequence)
   - same-file operations stay on one shard
   - restart recovery with unfinished intents
   - TTL lease expiry and orphan GC correctness
3. Jepsen-like:
   - concurrent append + read under node failure
   - migration during read/write load
4. FUSE compatibility:
   - libfuse integration tests for `lookup/getattr/open/read/write/readdir/rename/unlink/fsync`
   - errno conformance checks for unsupported operations

## 14. Open Questions

1. Metadata scalability: should the directory namespace stay in one domain or be hash-partitioned in Phase 2?
2. Chunk compression policy: per-file toggle vs cluster default?
3. Read cache location: client-side only or shard-local chunk cache?