diff --git a/docs/filesystem_on_elastickv_design.md b/docs/filesystem_on_elastickv_design.md
new file mode 100644
index 00000000..36f420fc
--- /dev/null
+++ b/docs/filesystem_on_elastickv_design.md
@@ -0,0 +1,437 @@
# Filesystem on Elastickv Design

## 1. Background

Elastickv already provides:

1. Raft-replicated KV state per shard group.
2. Range-based shard routing via `distribution.Engine`.
3. A transactional write path, with the current limitation that cross-shard transactions are not supported.

This design adds a filesystem layer on top of Elastickv with fixed-size file chunks and a placement policy that keeps chunks of the same file on one shard in normal operation. The backend API is protocol-neutral and can be exposed through multiple frontends, including FUSE.

## 2. Goals and Non-goals

### 2.1 Goals

1. Provide filesystem primitives (`create`, `read`, `write`, `truncate`, `rename`, `unlink`, `mkdir`, `readdir`) on Elastickv.
2. Store file data in fixed-size chunks.
3. Make same-file multi-shard placement rare by design (single-home shard per file).
4. Keep file data I/O on one shard for common operations.
5. Support resumable whole-file migration for rebalancing.
6. Define a protocol-neutral filesystem API that can be mapped to FUSE operations without semantic gaps in supported features.

### 2.2 Non-goals (Phase 1)

1. Full POSIX parity (`mmap` coherency, hard-link edge cases, full advisory locking).
2. Automatic striping of a single large file across many shards by default.
3. Cross-directory atomic rename when source and destination directories are in different shard-home domains.
4. FUSE-only architecture (the backend remains frontend-agnostic).

## 3. Requirements

### 3.1 Functional

1. Random read/write by offset.
2. Sparse file support (unwritten chunks are implicit zeroes).
3. Directory namespace with inode-based metadata.
4. Crash-safe create/delete/move/migration operations.

### 3.2 Placement

1. 
All chunks of one file should have a single home shard in the normal state.
2. During rebalance, temporary dual placement is allowed only in the `MIGRATING` state.
3. Control-plane split must avoid selecting split keys inside one file's key-range prefix.

### 3.3 Consistency

1. Read-after-write for a file handle.
2. Monotonic file size and mtime updates.
3. No orphaned visible directory entry pointing to a non-existent inode after recovery.

### 3.4 FUSE compatibility requirements

1. Stable inode number (`st_ino`) and generation for the object lifetime; in Phase 1 inode IDs are not reused (`root=1` fixed), so the generation is constant `1`.
2. `getattr` returns coherent `mode`, `nlink`, `uid`, `gid`, `size`, `mtime`, `ctime`, `atime` with nanosecond precision.
3. `readdir` supports resumable offset cookies and emits `"."` and `".."` on the first page (empty-cookie start).
4. `rename` is atomic within a supported domain; unsupported cross-domain rename returns `EXDEV`.
5. `unlink`/`rmdir` return the expected errno (`ENOENT`, `ENOTEMPTY`, `EISDIR`, `ENOTDIR`) and support remove-while-open semantics via delayed physical GC.
6. The `write` durability boundary is explicit:
   - `write` + `flush` guarantees visibility to the same mount session.
   - `fsync` guarantees Raft-committed durability.
7. Unsupported operations required by a given FUSE mount mode return an explicit `ENOSYS` or `EOPNOTSUPP` (not a silent no-op).

## 4. High-level Architecture

Components:

1. `FS Core API` (new backend service): protocol-neutral filesystem primitives and placement logic.
2. `Frontend Adapters`: the FUSE adapter and non-FUSE clients map into `FS Core API`.
3. `Namespace Store`: inode + directory metadata in Elastickv keys.
4. `File Data Store`: fixed-size chunk values in Elastickv keys.
5. `Placement Manager`: assigns a home shard per file and runs whole-file moves.

Data flow:

1. Lookup path -> inode metadata.
2. Resolve the file's home shard.
3. 
Read/write chunk keys under that file-home prefix.
4. Adapter returns a protocol-specific response/errno (FUSE or RPC).

## 5. Data Model and Key Layout

### 5.1 IDs

1. `inode_id`: 64-bit non-zero unique ID (`root=1` reserved) that maps directly to FUSE `st_ino`.
2. `inode_generation`: 64-bit generation exported to FUSE (`1` in Phase 1 because inode IDs are never reused).
3. `home_slot`: 64-bit placement token derived at create time.
4. `chunk_index`: 64-bit unsigned index.

`inode_id` allocation policy:

1. IDs are generated by a distributed random allocator (not strictly sequential).
2. On collision, the allocator retries with a new random value.
3. This avoids sequential-key concentration in the `!fs|ino|...` and `!fs|home|...` ranges during high create throughput.

### 5.2 Key schema (logical)

```text
!fs|ino|<inode_id>                           -> InodeMeta
!fs|dir|<parent_inode_id>|<name>             -> DirEntry(child_inode_id, type)
!fs|dirv|<dir_inode_id>                      -> DirVersion
!fs|home|<inode_id>                          -> Home(home_slot, state, epoch)
!fs|chk|<home_slot>|<inode_id>|<chunk_index> -> ChunkPayload
!fs|ref|<inode_id>|<fh_id>|<client_id>       -> OpenHandleLease(ttl)
!fs|intent|<intent_id>                       -> IntentState(kind, cursor, payload)
!fs|job|move|<inode_id>                      -> MoveJobState
```

Notes:

1. `!fs|chk|...` keys include `home_slot` before `inode_id` to cluster same-file data.
2. `!fs|home|...` is the source of truth for home shard assignment.
3. `epoch` protects against stale writers during migration.
4. `!fs|ref|...` keeps remove-while-open semantics for FUSE-compatible behavior.
5. `!fs|intent|...` stores resumable crash-recovery intents for create/delete/move.

### 5.3 Composite key encoding rules

1. The `|` separators shown above are logical notation only; keys are encoded as binary tuples.
2. Fixed-width numeric fields (`inode_id`, `home_slot`, `chunk_index`, `fh_id`) are encoded as 8-byte big-endian.
3. Variable fields (`name`, `client_id`, `intent_id`) use a byte-safe order-preserving encoding, so legal filenames including `|` never collide with separators.
4. 
Directory entry key encoding for `<name>` preserves bytewise name ordering to keep `readdir` scans lexicographically stable.

### 5.4 Internal transaction key families

Elastickv transaction internals are not filesystem objects but are part of the overall keyspace and routing behavior:

1. `!txn|lock|...`
2. `!txn|int|...`
3. `!txn|cmt|...`
4. `!txn|rb|...`
5. `!txn|meta|...`

### 5.5 Inode metadata

`InodeMeta` includes:

1. file type, mode, uid/gid.
2. logical size.
3. `atime`, `mtime`, `ctime` (nanoseconds).
4. chunk size (fixed cluster-wide in Phase 1, still stored for forward compatibility).
5. current `home_slot` and `epoch` snapshot.
6. `nlink`.
7. `inode_generation`.

## 6. Fixed-size Chunking

### 6.1 Chunk size

1. Default: `4 MiB`.
2. Allowed values: `1 MiB`, `2 MiB`, `4 MiB`, `8 MiB` (cluster setting).
3. A file uses one configured chunk size for its lifetime in Phase 1.

### 6.2 Offset mapping

For byte offset `off`:

1. `chunk_index = off / chunk_size`
2. `chunk_offset = off % chunk_size`

The write path may touch multiple contiguous chunk indexes.

### 6.3 Partial chunk write

1. Read the existing chunk (or treat a missing chunk as zero-filled).
2. Patch the modified byte range.
3. Write back the full chunk atomically as one KV value.
4. Update the inode size if needed.

### 6.4 Sparse behavior

1. A missing chunk key means an all-zero chunk.
2. `truncate` to a larger size creates holes without materializing zero chunks.

## 7. Placement Strategy to Avoid Same-file Shard Scatter

### 7.1 Single-home rule

At file creation:

1. The placement manager computes `home_slot` using rendezvous hashing over active shard groups.
2. `!fs|home|<inode_id>` is written once (`state=ACTIVE`, `epoch=1`).
3. All chunk keys use that `home_slot`.

Result: a file is naturally concentrated on one shard-home range.
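The single-home computation above can be sketched as rendezvous (highest-random-weight) hashing. This is an illustrative sketch, not Elastickv's actual API: the function name `chooseHomeSlot`, the FNV-1a scoring, and plain `uint64` group IDs are all assumptions made for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// score ranks a (file, group) pair; the file is homed on the group with
// the highest score. Any stable 64-bit hash works; FNV-1a is used for brevity.
func score(inodeID, groupID uint64) uint64 {
	h := fnv.New64a()
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[0:8], inodeID)
	binary.BigEndian.PutUint64(buf[8:16], groupID)
	h.Write(buf[:])
	return h.Sum64()
}

// chooseHomeSlot picks one home shard group per file. Because the score
// depends only on (inodeID, groupID), adding or removing a group moves
// only the files whose winning group changed, which is the stability
// property the single-home rule relies on.
func chooseHomeSlot(inodeID uint64, activeGroups []uint64) uint64 {
	best, bestScore := activeGroups[0], uint64(0)
	for _, g := range activeGroups {
		if s := score(inodeID, g); s >= bestScore {
			best, bestScore = g, s
		}
	}
	return best
}

func main() {
	groups := []uint64{101, 102, 103}
	// Every chunk of the same inode resolves to the same home slot.
	fmt.Println(chooseHomeSlot(42, groups) == chooseHomeSlot(42, groups))
}
```

Deterministic, state-free selection also means any node can recompute the initial placement without consulting `!fs|home|...`, which stays authoritative once written.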
### 7.2 Route key normalization

Extend shard route normalization (currently the `kv.routeKey` behavior in `kv/shard_key.go` for internal key families) to treat:

1. `!fs|chk|<home_slot>|<inode_id>|<chunk_index>`
2. transaction-wrapped variants (`!txn|...`)

as one logical routing domain per `(home_slot, inode_id)` when evaluating split candidates and stale-route protection.

### 7.3 Split guardrail

In split planning:

1. Candidate split keys inside the same `(home_slot, inode_id)` prefix are rejected.
2. The split key is snapped to the nearest file boundary prefix.
3. If a range mostly contains one huge file, auto-split is skipped and flagged as `FILE_PINNED_HOTSPOT`.

This directly reduces accidental same-file scatter caused by range splitting.

`FILE_PINNED_HOTSPOT` follow-up actions:

1. Increment `fs_file_pinned_hotspot_total` and emit an operator alert with inode/home metadata.
2. Keep serving with single-home placement as the safe default.
3. The operator may trigger `MoveFile(inode, target_group)` to shift load.
4. If the hotspot persists on a single huge file, the operator may enable the manual striping policy (Phase 2+) for that inode.

### 7.4 Rebalancing unit

Rebalancing moves whole files by default:

1. Copy all chunks to the target home slot.
2. Fence writes with an epoch bump.
3. Switch `!fs|home|<inode_id>` to the new slot.
4. Garbage-collect the old chunks after a grace window.

Exception:

1. An optional manual striping mode for very large files (Phase 2+) may place extents on multiple homes.
2. That mode is opt-in and disabled by default to preserve same-file affinity.

## 8. Operation Flows

### 8.1 Create file

1. Resolve the parent directory inode.
2. Allocate `inode_id` via the distributed random allocator.
3. Assign `home_slot`.
4. Write `!fs|ino|<inode_id>`, `!fs|home|<inode_id>`, and `!fs|dir|<parent_inode_id>|<name>` with an intent marker.
5. If create is requested with open semantics, allocate `fh_id` and write `!fs|ref|...`.
6. Commit and clear the intent.

Recovery replays unfinished intents idempotently.
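The read and write flows that follow both translate a byte range into per-chunk operations using the §6.2 mapping. A minimal Go sketch of that translation, with an assumed 4 MiB chunk size and illustrative names (`chunkSpan`, `spansFor` are not part of the design's API):

```go
package main

import "fmt"

const chunkSize = 4 << 20 // 4 MiB default from §6.1

// chunkSpan describes one contiguous piece of a byte-range I/O:
// n bytes starting at chunkOffset inside chunk chunkIndex.
type chunkSpan struct {
	chunkIndex  uint64
	chunkOffset uint64
	n           uint64
}

// spansFor translates [off, off+length) into per-chunk spans, following
// chunk_index = off / chunk_size and chunk_offset = off % chunk_size.
func spansFor(off, length uint64) []chunkSpan {
	var out []chunkSpan
	for length > 0 {
		idx := off / chunkSize
		co := off % chunkSize
		n := chunkSize - co // bytes left in this chunk
		if n > length {
			n = length
		}
		out = append(out, chunkSpan{idx, co, n})
		off += n
		length -= n
	}
	return out
}

func main() {
	// A 6 MiB write at offset 3 MiB touches chunk 0 (1 MiB tail),
	// chunk 1 (full 4 MiB), and chunk 2 (1 MiB head).
	for _, s := range spansFor(3<<20, 6<<20) {
		fmt.Println(s.chunkIndex, s.chunkOffset, s.n)
	}
}
```

Spans that cover a full chunk can skip the read half of the read-modify-write in §6.3, since the whole value is replaced.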
### 8.2 Read

1. Resolve `inode_id` and `home_slot`.
2. Translate `[offset, length]` into chunk indexes.
3. Batch read `!fs|chk|...`.
4. Fill holes with zero bytes.

### 8.3 Write

1. Resolve inode + home (`epoch` checked).
2. For each touched chunk: read-modify-write.
3. Update the inode size/mtime.
4. If the epoch changed due to migration, retry with a refreshed home mapping.

### 8.4 Rename

1. Same-directory rename: atomic update of directory keys.
2. Cross-directory rename:
   - If source and destination are within the same supported atomic domain, perform an atomic replace.
   - If not in the same atomic domain (Phase 1 limitation), return `EXDEV` for FUSE adapter compatibility.

### 8.5 Truncate and unlink

1. `truncate` shrink: delete chunks above the new EOF and patch the tail chunk.
2. `unlink`: remove the dentry and decrement `nlink`; when `nlink==0`, mark the inode as orphaned.
3. Physical chunk GC runs only when `nlink==0` and no open-handle lease exists (`!fs|ref|...` empty).

### 8.6 Open, flush, fsync, release

1. `open` allocates `fh_id` and writes/refreshes `!fs|ref|<inode_id>|<fh_id>|<client_id>`.
2. `flush` pushes adapter-side buffered writes (if any) to `FS Core API`.
3. `fsync` waits for Raft commit durability on metadata and touched chunks.
4. `release` removes the handle lease; if the inode is orphaned and has no more leases, enqueue final GC.

### 8.7 Readdir

1. Directory entries are ordered lexicographically by name.
2. The offset cookie is an opaque cursor that encodes the directory snapshot version plus a position.
3. A `readdir` started with an empty cookie captures a directory snapshot (`dir_version`) and returns `"."` and `".."` first.
4. Subsequent paginated calls with returned cookies read from the same snapshot; concurrent newer mutations are not visible.
5. For non-empty cookie pages, `"."` and `".."` are omitted and listing resumes from the encoded position.

## 9. 
Failure Handling and Recovery

### 9.1 Intent log

Use intent keys for multi-step metadata/data transitions:

1. `CREATE_INTENT`
2. `DELETE_INTENT`
3. `MOVE_INTENT`

Intent records live under `!fs|intent|<intent_id>` and store a step cursor plus payload for resuming after a crash.

### 9.2 Epoch fencing

Writers include `(inode_id, epoch)` from the home mapping.

1. If the shard owner sees a stale epoch, it returns a retryable stale-home error.
2. The client refreshes `!fs|home|<inode_id>` and retries.

### 9.3 Idempotency

All move/create/delete steps are idempotent by key overwrite semantics and job cursor checkpoints.

### 9.4 Open-handle lease recovery

1. `!fs|ref|...` entries have a TTL and must be heartbeated by active clients.
2. A lease reaper clears expired refs.
3. Orphan inode GC checks active refs after reaping to preserve FUSE remove-while-open behavior.

## 10. APIs

### 10.1 FS Core API (protocol-neutral)

1. `Resolve(parent_inode, name) -> inode`
2. `GetAttr(inode) -> Stat`
3. `SetAttr(inode, mask, attrs) -> Stat`
4. `Open(inode, flags, client_id) -> fh_id`
5. `Read(inode, fh_id, offset, size) -> bytes`
6. `Write(inode, fh_id, offset, bytes) -> n`
7. `Flush(inode, fh_id)`
8. `Fsync(inode, fh_id, datasync)`
9. `Release(inode, fh_id, client_id)`
10. `Create(parent_inode, name, mode, uid, gid, flags, client_id) -> inode, fh_id`
11. `Mkdir(parent_inode, name, mode, uid, gid) -> inode`
12. `Unlink(parent_inode, name)`
13. `Rmdir(parent_inode, name)`
14. `Rename(old_parent, old_name, new_parent, new_name, flags)`
15. `Readdir(inode, cookie, limit) -> entries, next_cookie`
16. `StatFS(inode) -> capacity, free, files, free_files`

Error model:

1. Return typed filesystem errors (`ENOENT`, `EEXIST`, `ENOTDIR`, `EISDIR`, `ENOTEMPTY`, `EXDEV`, `EPERM`, `EACCES`, `EINVAL`, `EFBIG`, `ENOSYS`, `EOPNOTSUPP`).
2. The frontend adapter maps these 1:1 to protocol-specific errno/status.
### 10.2 FUSE mapping contract

| FUSE op | FS Core API | Requirement |
|---|---|---|
| `lookup` | `Resolve` + `GetAttr` | Stable inode and generation (`generation=1` in Phase 1) |
| `getattr` | `GetAttr` | `Stat` fields complete and coherent |
| `setattr` | `SetAttr` | `size`/time/mode updates with masks |
| `open` | `Open` | Return valid `fh_id` |
| `read` | `Read` | Offset-based, hole=zero |
| `write` | `Write` | Partial overwrite via chunk RMW |
| `flush` | `Flush` | Push session buffer |
| `fsync` | `Fsync` | Raft durability barrier |
| `release` | `Release` | Remove lease/ref |
| `create` | `Create` | Atomic name reservation + inode allocation |
| `mkdir` | `Mkdir` | Parent type and permission checks |
| `unlink` | `Unlink` | Delayed physical delete when open refs exist |
| `rmdir` | `Rmdir` | Must fail with `ENOTEMPTY` if children exist |
| `rename` | `Rename` | Atomic in-domain, `EXDEV` out-of-domain |
| `readdir` | `Readdir` | Stable cookie pagination; dot entries on first page |
| `statfs` | `StatFS` | Capacity and inode counters exposed |

Unsupported op handling:

1. `link`, `symlink`, `readlink`, and file-locking ioctls are out of scope for Phase 1.
2. The FUSE adapter returns `ENOSYS` or `EOPNOTSUPP` explicitly.

Internal placement API:

1. `GetFileHome(inode)`
2. `MoveFile(inode, target_group)`
3. `ListFilePlacementStats()`

## 11. Metrics and SLOs

Key metrics:

1. `fs_file_multi_shard_detected_total`
2. `fs_file_move_inflight`
3. `fs_chunk_read_ops_total`, `fs_chunk_write_ops_total`
4. `fs_chunk_read_latency_ms`, `fs_chunk_write_latency_ms`
5. `fs_home_epoch_conflict_total`
6. `fs_open_handle_lease_active`
7. `fs_orphan_inode_gc_pending`
8. `fs_file_pinned_hotspot_total`
9. `fs_inode_id_collision_retry_total`

SLO targets (initial):

1. `same_file_single_home_ratio >= 99.9%`
2. p99 chunk write latency under the target workload: `< 50 ms` (cluster dependent)

## 12. Implementation Plan

Phase 1:

1. 
Key schema + metadata/chunk CRUD.
2. Fixed chunk read/write/truncate.
3. Single-home placement at file create.
4. Split guardrail for file boundaries.
5. Protocol-neutral `FS Core API` and a thin FUSE adapter with the required errno mapping.

Phase 2:

1. Whole-file migration job.
2. Epoch fencing + stale-home retry path.
3. Placement metrics and operator commands.
4. Open-handle lease reaper and orphan-GC hardening.

Phase 3:

1. Cross-directory atomic rename improvements.
2. Optional large-file striping mode (opt-in).
3. Optional support for currently unsupported FUSE ops (`link`/`symlink`).

## 13. Test Plan

1. Unit:
   - chunk boundary math
   - sparse reads
   - key encode/decode
   - split guard boundary selection
2. Integration:
   - create/read/write/truncate/unlink
   - open-unlink-read-release lifecycle
   - rename in-domain succeeds, cross-domain returns `EXDEV`
   - readdir cookie resume under concurrent directory mutation (no duplicates/skips within one snapshot sequence)
   - same-file operations stay on one shard
   - restart recovery with unfinished intents
   - TTL lease expiry and orphan GC correctness
3. Jepsen-like:
   - concurrent append + read under node failure
   - migration during read/write load
4. FUSE compatibility:
   - libfuse integration tests for `lookup/getattr/open/read/write/readdir/rename/unlink/fsync`
   - errno conformance checks for unsupported operations

## 14. Open Questions

1. Metadata scalability: should the directory namespace stay in one domain or be hash-partitioned in Phase 2?
2. Chunk compression policy: per-file toggle vs cluster default?
3. Read cache location: client-side only or shard-local chunk cache?