diff --git a/docs/design_docs/chunk_aggregation_by_prefix.md b/docs/design_docs/chunk_aggregation_by_prefix.md new file mode 100644 index 0000000..9240dd3 --- /dev/null +++ b/docs/design_docs/chunk_aggregation_by_prefix.md @@ -0,0 +1,122 @@ +# [WITHDRAWN] Chunk Aggregation by Prefix + +## Goal + +To address the "document explosion" and storage bloat issues caused by the current chunking mechanism, while preserving the benefits of content-addressable storage and efficient delta synchronisation. This design aims to significantly reduce the number of documents in the database and simplify Garbage Collection (GC). + +## Motivation + +Our current synchronisation solution splits files into content-defined chunks, with each chunk stored as a separate document in CouchDB, identified by its hash. This architecture effectively leverages CouchDB's replication for automatic deduplication and efficient transfer. + +However, this approach faces significant challenges as the number of files and edits increases: +1. **Document Explosion:** A large vault can generate millions of chunk documents, severely degrading CouchDB's performance, particularly during view building and replication. +2. **Storage Bloat & GC Difficulty:** Obsolete chunks generated during edits are difficult to identify and remove. Since CouchDB's deletion (`_deleted: true`) is a soft delete, and compaction is a heavy, space-intensive operation, unused chunks perpetually consume storage, making GC impractical for many users. +3. **The "Eden" Problem:** A previous attempt, "Keep newborn chunks in Eden", aimed to mitigate this by embedding volatile chunks within the parent document. While it reduced the number of standalone chunks, it introduced a new issue: the parent document's history (`_revs_info`) became excessively large, causing its own form of database bloat and making compaction equally necessary but difficult to manage. + +This new design addresses the root cause—the sheer number of documents—by aggregating chunks into sets. + +## Prerequisites + +- The new implementation must maintain the core benefit of deduplication to ensure efficient synchronisation. +- The solution must not introduce a single point of bottleneck and should handle concurrent writes from multiple clients gracefully. +- The system must provide a clear and feasible strategy for Garbage Collection. +- The design should be forward-compatible, allowing for a smooth migration path for existing users. + +## Outlined Methods and Implementation Plans + +### Abstract + +This design introduces a two-tiered document structure to manage chunks: **Index Documents** and **Data Documents**. Chunks are no longer stored as individual documents. Instead, they are grouped into `Data Documents` based on a common hash prefix. The existence and location of each chunk are tracked by `Index Documents`, which are also grouped by the same prefix. This approach dramatically reduces the total document count. + +### Detailed Implementation + +**1. Document Structure:** + +- **Index Document:** Maps chunk hashes to their corresponding Data Document ID. Identified by a prefix of the chunk hash. + - `_id`: `idx:{prefix}` (e.g., `idx:a9f1b`) + - Content: + ```json + { + "_id": "idx:a9f1b", + "_rev": "...", + "chunks": { + "a9f1b12...": "dat:a9f1b-001", + "a9f1b34...": "dat:a9f1b-001", + "a9f1b56...": "dat:a9f1b-002" + } + } + ``` +- **Data Document:** Contains the actual chunk data as base64-encoded strings. Identified by a prefix and a sequential number. 
+ - `_id`: `dat:{prefix}-{sequence}` (e.g., `dat:a9f1b-001`) + - Content: + ```json + { + "_id": "dat:a9f1b-001", + "_rev": "...", + "chunks": { + "a9f1b12...": "...", // base64 data + "a9f1b34...": "..." // base64 data + } + } + ``` + +**2. Configuration:** + +- `chunk_prefix_length`: The number of characters from the start of a chunk hash to use as a prefix (e.g., `5`). This determines the granularity of aggregation. +- `data_doc_size_limit`: The maximum size for a single Data Document to prevent it from becoming too large (e.g., 1MB). When this limit is reached, a new Data Document with an incremented sequence number is created. + +**3. Write/Save Operation Flow:** + +When a client creates new chunks: +1. For each new chunk, determine its hash prefix. +2. Read the corresponding `Index Document` (e.g., `idx:a9f1b`). +3. From the index, determine which of the new chunks already exist in the database. +4. For the **truly new chunks only**: + a. Read the last `Data Document` for that prefix (e.g., `dat:a9f1b-005`). + b. If it is nearing its size limit, create a new one (`dat:a9f1b-006`). + c. Add the new chunk data to the Data Document and save it. +5. Update the `Index Document` with the locations of the newly added chunks. + +**4. Handling Write Conflicts:** + +Concurrent writes to the same `Index Document` or `Data Document` from multiple clients will cause conflicts (409 Conflict). This is expected and must be handled gracefully. Since additions are incremental, the client application must implement a **retry-and-merge loop**: +1. Attempt to save the document. +2. On a conflict, re-fetch the latest version of the document from the server. +3. Merge its own changes into the latest version. +4. Attempt to save again. +5. Repeat until successful or a retry limit is reached. + +**5. Garbage Collection (GC):** + +GC becomes a manageable, periodic batch process: +1. Scan all file metadata documents to build a master set of all *currently referenced* chunk hashes. +2. Iterate through all `Index Documents`. For each chunk listed: + a. If the chunk hash is not in the master reference set, it is garbage. + b. Remove the garbage entry from the `Index Document`. + c. Remove the corresponding data from its `Data Document`. +3. If a `Data Document` becomes empty after this process, it can be deleted. + +## Test Strategy + +1. **Unit Tests:** Implement tests for the conflict resolution logic (retry-and-merge loop) to ensure robustness. +2. **Integration Tests:** + - Verify that concurrent writes from multiple simulated clients result in a consistent, merged state without data loss. + - Run a full synchronisation scenario and confirm the resulting database has a significantly lower document count compared to the previous implementation. +3. **GC Test:** Simulate a scenario where files are deleted, run the GC process, and verify that orphaned chunks are correctly removed from both Index and Data documents, and that storage is reclaimed after compaction. +4. **Migration Test:** Develop and test a "rebuild" process for existing users, which migrates their chunk data into the new aggregated structure. + +## Documentation Strategy + +- This design document will be published to explain the new architecture. +- The configuration options (`chunk_prefix_length`, etc.) will be documented for advanced users. +- A guide for the migration/rebuild process will be provided. + +## Future Work + +The separation of index and data opens up a powerful possibility. 
While this design initially implements both within CouchDB, the `Data Documents` could be offloaded to a dedicated object storage service such as **S3, MinIO, or Cloudflare R2**. + +In such a hybrid model, CouchDB would handle only the lightweight `Index Documents` and file metadata, serving as a high-speed synchronisation and coordination layer. The bulky chunk data would reside in a more cost-effective and scalable blob store. This would represent the ultimate evolution of this architecture, combining the best of both worlds. + +## Consideration and Conclusion + +This design directly addresses the scalability limitations of the original chunk-per-document model. By aggregating chunks into sets, it significantly reduces the document count, which in turn improves database performance and makes maintenance feasible. The explicit handling of write conflicts and a clear strategy for garbage collection make this a robust and sustainable long-term solution. It effectively resolves the problems identified in previous approaches, including the "Eden" experiment, by tackling the root cause of database bloat. This architecture provides a solid foundation for future growth and scalability. \ No newline at end of file diff --git a/docs/design_docs/intention_of_chunks.md b/docs/design_docs/intention_of_chunks.md new file mode 100644 index 0000000..a1ffe70 --- /dev/null +++ b/docs/design_docs/intention_of_chunks.md @@ -0,0 +1,127 @@ +# [WIP] The design intent explanation for using metadata and chunks + +## Abstract + +## Goal + +- To explain the following: + - What metadata and chunks are + - The design intent of using metadata and chunks + +## Background and Motivation + +We are using PouchDB and CouchDB for storing files and synchronising them. PouchDB is a JavaScript database that stores data on the device (browser, and of course, Obsidian), while CouchDB is a NoSQL database that stores data on the server. The two databases can be synchronised to keep data consistent across devices via the CouchDB replication protocol. This is a powerful and flexible way to store and synchronise data, including conflict management, but it is not well suited for files. Therefore, we needed to manage how to store files and synchronise them. + +## Terminology + +- Password: + - A string used to authenticate the user. + +- Passphrase: + - A string used to encrypt and decrypt data. + - This is not a password. + +- Encrypt: + - To convert data into a format that is unreadable to anyone. + - Can be decrypted by the user who has the passphrase. + - Should be 1:n, containing random data to ensure that even the same data, when encrypted, results in different outputs. + +- Obfuscate: + - To convert data into a format that is not easily readable. + - Can be decrypted by the user who has the passphrase. + - Should be 1:1, containing no random data, and the same data is always obfuscated to the same result. It is necessarily unreadable. + +- Hash: + - To convert data into a fixed-length string that is not easily readable. + - Cannot be decrypted. + - Should be 1:1, containing no random data, and the same data is always hashed to the same result. + +## Designs + +### Principles + +- To synchronise and handle conflicts, we should keep the history of modifications. +- No data should be lost. Even though some extra data may be stored, it should be removed later, safely. +- Each stored data item should be as small as possible to transfer efficiently, but not so small as to be inefficient. 
+- Any type of file should be supported, including binary files.
+- Encryption should be supported efficiently.
+- This method should not depart too far from the PouchDB/CouchDB philosophy. It needs to leave room for other `remote`s, to benefit from custom replicators.
+
+As a result, we have adopted the following design.
+
+- Files are stored as one metadata entry and multiple chunks.
+- Chunks are content-addressable, and the metadata contains the ids of the chunks.
+- Chunks may be referenced from multiple metadata entries. They should be efficiently managed to avoid redundancy.
+
+### Metadata Design
+
+The metadata contains the following information:
+
+| Field    | Type                 | Description                  | Note                                                                                                  |
+| -------- | -------------------- | ---------------------------- | ----------------------------------------------------------------------------------------------------- |
+| _id      | string               | The id of the metadata       | It is created from the file path                                                                        |
+| _rev     | string               | The revision of the metadata | It is created by PouchDB                                                                                |
+| children | [string]             | The ids of the chunks        |                                                                                                         |
+| path     | string               | The path of the file         | If path obfuscation has been enabled, this value is stored encrypted                                   |
+| size     | number               | The size of the metadata     | Not respected; kept for troubleshooting                                                                |
+| ctime    | string               | The creation timestamp       | Not used to compare files, but applied when writing to storage                                         |
+| mtime    | string               | The modification timestamp   | Used to compare files, and written to storage                                                          |
+| type     | `plain` \| `newnote` | The type of the file         | Children of type `plain` are not base64 encoded, while those of `newnote` are                          |
+| e_       | boolean              | The file is encrypted        | Encryption is processed during transfer to the remote. In local storage, this property does not exist  |
+
+#### Decision Rule for `_id` of Metadata
+
+```ts
+// Note: This is pseudo code.
+let _id = PATH;
+if (!HANDLE_FILES_AS_CASE_SENSITIVE) {
+    _id = _id.toLowerCase();
+}
+if (_id.startsWith("_")) {
+    _id = "/" + _id;
+}
+if (SHOULD_OBFUSCATE_PATH) {
+    _id = `f:${OBFUSCATE_PATH(_id, E2EE_PASSPHRASE)}`;
+}
+return _id;
+```
+
+#### Expected Questions
+
+- Why do we need to handle files as case-insensitive?
+  - Some filesystems are case-sensitive, while others are not. For example, Windows is not case-sensitive, while Linux is. Therefore, files are handled as case-insensitive so that conflicts between such devices can be managed.
+  - The trade-off is that files whose paths differ only in case cannot be managed separately, so this behaviour can be disabled if all of your devices use case-sensitive filesystems.
+- Why obfuscate the path?
+  - E2EE only encrypts the content of the file, not its metadata. Hence, E2EE alone is not enough to protect the vault completely. The path is also part of the metadata, so it should be obfuscated. This is a trade-off between security and performance. However, if you title a note with sensitive information, you should obfuscate the path.
+- What is `f:`?
+  - It is a prefix indicating that the path is obfuscated, used to distinguish obfuscated paths from normal ones. When enumerating files, Self-hosted LiveSync has to scan documents to find the metadata while excluding chunks and other documents, and the prefix makes this distinction possible.
+  - Why does an unobfuscated path not start with `f:`?
+    - For compatibility. Self-hosted LiveSync, by its nature, must also be able to handle documents created by other versions as far as possible.
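+
+#### Illustrative Sketch of the Decision Rule
+
+To make the decision rule above concrete, the following is a runnable TypeScript sketch. The names (`IdSettings`, `computeMetadataId`, `obfuscatePath`) and the settings shape are assumptions made for this example only, and the obfuscation function is a deterministic placeholder rather than the actual implementation.
+
+```ts
+// Illustrative sketch only: helper names and the settings shape are assumed
+// for this example; the real plugin derives the obfuscated value differently.
+interface IdSettings {
+    handleFilesAsCaseSensitive: boolean;
+    obfuscatePath: boolean;
+    e2eePassphrase: string;
+}
+
+// Placeholder for the real obfuscation: deterministic (1:1) and keyed by the
+// passphrase, so the same path always yields the same result.
+function obfuscatePath(path: string, passphrase: string): string {
+    let h = 0;
+    for (const ch of `${passphrase}:${path}`) {
+        h = (h * 31 + (ch.codePointAt(0) ?? 0)) >>> 0;
+    }
+    return h.toString(16).padStart(8, "0");
+}
+
+function computeMetadataId(path: string, settings: IdSettings): string {
+    let id = path;
+    if (!settings.handleFilesAsCaseSensitive) {
+        id = id.toLowerCase();
+    }
+    if (id.startsWith("_")) {
+        // Avoid clashing with CouchDB's reserved `_`-prefixed document ids.
+        id = "/" + id;
+    }
+    if (settings.obfuscatePath) {
+        // `f:` marks the id as an obfuscated path.
+        id = `f:${obfuscatePath(id, settings.e2eePassphrase)}`;
+    }
+    return id;
+}
+```
+
+For example, under these assumptions `computeMetadataId("_templates/daily.md", settings)` yields a `/`-prefixed id, or an `f:`-prefixed one when path obfuscation is enabled, exactly as the pseudo code describes.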
+ +### Chunk Design + +#### Chunk Structure + +The chunk contains the following information: + +| Field | Type | Description | Note | +| ----- | ------------ | ------------------------- | ----------------------------------------------------------------------------------------------------- | +| _id | `h:{string}` | The id of the chunk | It is created from the hash of the chunk content | +| _rev | string | The revision of the chunk | It is created by PouchDB | +| data | string | The content of the chunk | | +| type | `leaf` | Fixed | | +| e_ | boolean | The chunk is encrypted | Encryption is processed during transfer to the remote. In local storage, this property does not exist | + +**SORRY, TO BE WRITTEN, BUT WE HAVE IMPLEMENTED `v2`, WHICH REQUIRES MORE INFORMATION.** + +### How they are unified + +## Deduplication and Optimisation + +## Synchronisation Strategy + +## Performance Considerations + +## Security and Privacy + +## Edge Cases diff --git a/docs/design_docs/tired_chunk_pack.md b/docs/design_docs/tired_chunk_pack.md new file mode 100644 index 0000000..99710e1 --- /dev/null +++ b/docs/design_docs/tired_chunk_pack.md @@ -0,0 +1,117 @@ +# [IN DESIGN] Tiered Chunk Storage with Live Compaction + +** VERY IMPORTANT NOTE: This design must be used with the new journal synchronisation method. Otherwise, we risk introducing the bloat of changes from hot-pack into the Bucket. (CouchDB/PouchDB can synchronise only the most recent changes, or resolve conflicts.) Previous Journal Sync **IS NOT**. Please proceed with caution. ** + +## Goal + +To establish a highly efficient, robust, and scalable synchronisation architecture by introducing a tiered storage system inspired by Log-Structured Merge-Trees (LSM-Trees). This design aims to address the challenges of real-time synchronisation, specifically the massive generation of transient data, while minimising storage bloat and ensuring high performance. + +## Motivation + +Our previous designs, including "Chunk Aggregation by Prefix", successfully addressed the "document explosion" problem. However, the introduction of real-time editor synchronisation exposed a new, critical challenge: the constant generation of short-lived "garbage" chunks during user input. This "garbage storm" places immense pressure on storage, I/O, and the Garbage Collection (GC) process. + +A simple aggregation strategy is insufficient because it treats all data equally, mixing valuable, stable chunks with transient, garbage chunks in permanent storage. This leads to storage bloat and inefficient compaction. We require a system that can intelligently distinguish between "hot" (volatile) and "cold" (stable) data, processing them in the most efficient manner possible. + +## Outlined Methods and Implementation Plans + +### Abstract + +This design implements a two-tiered storage system within CouchDB. +1. **Level 0 – Hot Storage:** A set of "Hot-Packs", one for each active client. These act as fast, append-only logs for all newly created chunks. They serve as a temporary staging area, absorbing the "garbage storm" of real-time editing. +2. **Level 1 – Cold Storage:** The permanent, immutable storage for stable chunks, consisting of **Index Documents** for fast lookups and **Data Documents (Cold-Packs)** for storing chunk data. + +A background "Compaction" process continuously promotes stable chunks from Hot Storage to Cold Storage, while automatically discarding garbage. This keeps the permanent storage clean and highly optimised. + +### Detailed Implementation + +**1. 
Document Structure:** + +- **Hot-Pack Document (Level 0):** A per-client, append-only log. + - `_id`: `hotpack:{client_id}` (`client_id` could be the same as the `deviceNodeID` used in the `accepted_nodes` in MILESTONE_DOC; enables database 'lockout' for safe synchronisation) + - Content: A log of chunk creation events. + ```json + { + "_id": "hotpack:a9f1b12...", + "_rev": "...", + "log": [ + { "hash": "abc...", "data": "...", "ts": ..., "file_id": "file1" }, + { "hash": "def...", "data": "...", "ts": ..., "file_id": "file2" } + ] + } + ``` + +- **Index Document (Level 1):** A fast, prefix-based lookup table for stable chunks. + - `_id`: `idx:{prefix}` (e.g., `idx:a9f1b`) + - Content: Maps a chunk hash to the ID of the Cold-Pack it resides in. + ```json + { + "_id": "idx:a9f1b", + "chunks": { "a9f1b12...": "dat:1678886400" } + } + ``` + +- **Cold-Pack Document (Level 1):** An immutable data block created by the compaction process. + - `_id`: `dat:{timestamp_or_uuid}` (e.g., `dat:1678886400123`) + - Content: A collection of stable chunks. + ```json + { + "_id": "dat:1678886400123", + "chunks": { "a9f1b12...": "...", "c3d4e5f...": "..." } + } + ``` + +- **Hot-Pack List Document:** A central registry of all active Hot-Packs. This might be a computed document that clients maintain in memory on startup. + - `_id`: `hotpack_list` + - Content: `{"active_clients": ["hotpack:a9f1b12...", "hotpack:c3d4e5f..."]}` + +**2. Write/Save Operation Flow (Real-time Editing):** + +1. A client generates a new chunk. +2. It **immediately appends** the chunk object (`{hash, data, ts, file_id}`) to its **own** Hot-Pack document's `log` array within its local PouchDB. This operation is extremely fast. +3. The PouchDB synchronisation process replicates this change to the remote CouchDB and other clients in the background. No other Hot-Packs are consulted during this write operation. + +**3. Read/Load Operation Flow:** + +To find a chunk's data: +1. The client first consults its in-memory list of active Hot-Pack IDs (see section 5). +2. It searches for the chunk hash in all **Hot-Pack documents**, starting from its own, then others. It reads them in reverse log order (newest first). +3. If not found, it consults the appropriate **Index Document (`idx:...`)** to get the ID of the Cold-Pack. +4. It then reads the chunk data from the corresponding **Cold-Pack document (`dat:...`)**. + +**4. Compaction & Promotion Process (The "GC"):** + +This is a background task run periodically by clients, or triggered when the number of unprocessed log entries exceeds a threshold (to maintain the ability to synchronise with the remote database, which has a limited document size). +1. The client takes its own Hot-Pack (`hotpack:{client_id}`) and scans its `log` array from the beginning (oldest first). +2. For each chunk in the log, it checks if the chunk is still referenced in the latest revision of any file. + - **If not referenced (Garbage):** The log entry is simply discarded. + - **If referenced (Stable):** The chunk is added to a "promotion batch". +3. After scanning a certain number of log entries, the client takes the "promotion batch". +4. It creates one or more new, immutable **Cold-Pack (`dat:...`)** documents to store the chunk data from the batch. +5. It updates the corresponding **Index (`idx:...`)** documents to point to the new Cold-Pack(s). +6. Once the promotion is successfully saved to the database, it **removes the processed entries from its Hot-Pack's `log` array**. 
This is a critical step to prevent reprocessing and keep the Hot-Pack small. + +**5. Hot-Pack List Management:** + +To know which Hot-Packs to read, clients will: +1. On startup, load the `hotpack_list` document into memory. +2. Use PouchDB's live `changes` feed to monitor the creation of new `hotpack:*` documents. +3. Upon detecting an unknown Hot-Pack, the client updates its in-memory list and attempts to update the central `hotpack_list` document (on a best-effort basis, with conflict resolution). + +## Planned Test Strategy + +1. **Unit Tests:** Test the Compaction/Promotion logic extensively. Ensure garbage is correctly identified and stable chunks are promoted correctly. +2. **Integration Tests:** Simulate a multi-client real-time editing session. + - Verify that writes are fast and responsive. + - Confirm that transient garbage chunks do not pollute the Cold Storage. + - Confirm that after a period of inactivity, compaction runs and the Hot-Packs shrink. +3. **Stress Tests:** Simulate many clients joining and leaving to test the robustness of the `hotpack_list` management. + +## Documentation Strategy + +- This design document will serve as the core architectural reference. +- The roles of each document type (Hot-Pack, Index, Cold-Pack, List) will be clearly explained for future developers. +- The logic of the Compaction/Promotion process will be detailed. + +## Consideration and Conclusion + +This tiered storage design is a direct evolution, born from the lessons of previous architectures. It embraces the ephemeral nature of data in real-time applications. By creating a "staging area" (Hot-Packs) for volatile data, it protects the integrity and performance of the permanent "cold" storage. The Compaction process acts as a self-cleaning mechanism, ensuring that only valuable, stable data is retained long-term. This is not just an optimisation; it is a fundamental shift that enables robust, high-performance, and scalable real-time synchronisation on top of CouchDB. \ No newline at end of file diff --git a/docs/design_docs/tired_chunk_pack_bucket.md b/docs/design_docs/tired_chunk_pack_bucket.md new file mode 100644 index 0000000..4271885 --- /dev/null +++ b/docs/design_docs/tired_chunk_pack_bucket.md @@ -0,0 +1,97 @@ +# [IN DESIGN] Tiered Chunk Storage for Bucket Sync + +## Goal + +To evolve the "Journal Sync" mechanism by integrating the Tiered Storage architecture. This design aims to drastically reduce the size and number of sync packs, minimise storage consumption on the backend bucket, and establish a clear, efficient process for Garbage Collection, all while remaining protocol-agnostic. + +## Motivation + +The original "Journal Sync" liberates us from CouchDB's protocol, but it still packages and transfers entire document changes, including bulky and often transient chunk data. In a real-time or frequent-editing scenario, this results in: +1. **Bloated Sync Packs:** Packs become large with redundant or short-lived chunk data, increasing upload and download times. +2. **Inefficient Storage:** The backend bucket stores numerous packs containing overlapping and obsolete chunk data, wasting space. +3. **Impractical Garbage Collection:** Identifying and purging obsolete *chunk data* from within the pack-based journal history is extremely difficult. + +This new design addresses these problems by fundamentally changing *what* is synchronised in the journal packs. We will synchronise lightweight metadata and logs, while handling bulk data separately. 
+ +## Outlined methods and implementation plans + +### Abstract + +This design adapts the Tiered Storage model for a bucket-based backend. The backend bucket is partitioned into distinct areas for different data types. The "Journal Sync" process is now responsible for synchronising only the "hot" volatile data and lightweight metadata. A separate, asynchronous "Compaction" process, which can be run by any client, is responsible for migrating stable data into permanent, deduplicated "cold" storage. + +### Detailed Implementation + +**1. Bucket Structure:** + +The backend bucket will have four distinct logical areas (prefixes): +- `packs/`: For "Journal Sync" packs, containing the journal of metadata and Hot-Log changes. +- `hot_logs/`: A dedicated area for each client's "Hot-Log," containing newly created, volatile chunks. +- `indices/`: For prefix-based Index files, mapping chunk hashes to their permanent location in Cold Storage. +- `cold_chunks/`: For deduplicated, stable chunk data, stored by content hash. + +**2. Data Structures (Client-side PouchDB & Backend Bucket):** + +- **Client Metadata:** Standard file metadata documents, kept in the client's PouchDB. +- **Hot-Log (in `hot_logs/`):** A per-client, append-only log file on the bucket. + - Path: `hot_logs/{client_id}.jsonlog` + - Content: A sequence of JSON objects, one per line, representing chunk creation events. `{"hash": "...", "data": "...", "ts": ..., "file_id": "..."}` + +- **Index File (in `indices/`):** A JSON file for a given hash prefix. + - Path: `indices/{prefix}.json` + - Content: Maps a chunk hash to its content hash (which is its key in `cold_chunks/`). `{"hash_abc...": true, "hash_def...": true}` + +- **Cold Chunk (in `cold_chunks/`):** The raw, immutable, deduplicated chunk data. + - Path: `cold_chunks/{chunk_hash}` + +**3. "Journal Sync" - Send/Receive Operation (Not Live):** + +This process is now extremely lightweight. +1. **Send:** + a. The client takes all newly generated chunks and **appends them to its own Hot-Log file (`hot_logs/{client_id}.jsonlog`)** on the bucket. + b. The client updates its local file metadata in PouchDB. + c. It then creates a "Journal Sync" pack containing **only the PouchDB journal of the file metadata changes.** This pack is very small as it contains no chunk data. + d. The pack is uploaded to `packs/`. + +2. **Receive:** + a. The client downloads new packs from `packs/` and applies the metadata journal to its local PouchDB. + b. It downloads the latest versions of all **other clients' Hot-Log files** from `hot_logs/`. + c. Now the client has a complete, up-to-date view of all metadata and all "hot" chunks. + +**4. Read/Load Operation Flow:** + +To find a chunk's data: +1. The client searches for the chunk hash in its local copy of all **Hot-Logs**. +2. If not found, it downloads and consults the appropriate **Index file (`indices/{prefix}.json`)**. +3. If the index confirms existence, it downloads the data from **`cold_chunks/{chunk_hash}`**. + +**5. Compaction & Promotion Process (Asynchronous "GC"):** + +This is a deliberate, offline-capable process that any client can choose to run. +1. The client "leases" its own Hot-Log for compaction. +2. It reads its entire `hot_logs/{client_id}.jsonlog`. +3. For each chunk in the log, it checks if the chunk is referenced in the *current, latest state* of the file metadata. + - **If not referenced (Garbage):** The log entry is discarded. + - **If referenced (Stable):** The chunk is added to a "promotion batch." +4. 
For each chunk in the promotion batch: + a. It checks the corresponding `indices/{prefix}.json` to see if the chunk already exists in Cold Storage. + b. If it does not exist, it **uploads the chunk data to `cold_chunks/{chunk_hash}`** and updates the `indices/{prefix}.json` file. +5. Once the entire Hot-Log has been processed, the client **deletes its `hot_logs/{client_id}.jsonlog` file** (or truncates it to empty), effectively completing the cycle. + +## Test strategy + +1. **Component Tests:** Test the Compaction process independently. Ensure it correctly identifies stable versus garbage chunks and populates the `cold_chunks/` and `indices/` areas correctly. +2. **Integration Tests:** + - Simulate a multi-client sync cycle. Verify that sync packs in `packs/` are small. + - Confirm that `hot_logs/` are correctly created and updated. + - Run the Compaction process and verify that data migrates correctly to cold storage and the hot log is cleared. +3. **Conflict Tests:** Simulate two clients trying to compact the same index file simultaneously and ensure the outcome is consistent (for example, via a locking mechanism or last-write-wins). + +## Documentation strategy + +- This design document will be the primary reference for the bucket-based architecture. +- The structure of the backend bucket (`packs/`, `hot_logs/`, etc.) will be clearly defined. +- A detailed description of how to run the Compaction process will be provided to users. + +## Consideration and Conclusion + +By applying the Tiered Storage model to "Journal Sync", we transform it into a remarkably efficient system. The synchronisation of everyday changes becomes extremely fast and lightweight, as only metadata journals are exchanged. The heavy lifting of data deduplication and permanent storage is offloaded to a separate, asynchronous Compaction process. This clear separation of concerns makes the system highly scalable, minimises storage costs, and finally provides a practical, robust solution for Garbage Collection in a protocol-agnostic, bucket-based environment. \ No newline at end of file
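+
+### Appendix: Illustrative Sketch of a Compaction Pass
+
+As a non-normative illustration of the Compaction & Promotion process described above, the following TypeScript sketch performs one pass over a client's own Hot-Log against an abstract bucket client. Every name here (`BucketClient`, `HotLogEntry`, `compactHotLog`, `isReferenced`) is an assumption made for this example, not the plugin's actual API; leasing and error handling are omitted.
+
+```ts
+// Illustrative sketch only. `BucketClient` is an assumed, minimal interface
+// over an S3-compatible store; leasing and error handling are out of scope.
+interface HotLogEntry {
+    hash: string;
+    data: string; // chunk payload
+    ts: number;
+    file_id: string;
+}
+
+interface BucketClient {
+    getText(key: string): Promise<string | null>;
+    putText(key: string, body: string): Promise<void>;
+    exists(key: string): Promise<boolean>;
+    delete(key: string): Promise<void>;
+}
+
+// One compaction pass over this client's own Hot-Log.
+async function compactHotLog(
+    bucket: BucketClient,
+    clientId: string,
+    prefixLength: number,
+    isReferenced: (hash: string) => boolean // derived from current file metadata
+): Promise<void> {
+    const logKey = `hot_logs/${clientId}.jsonlog`;
+    const raw = await bucket.getText(logKey);
+    if (raw === null) return; // nothing to compact
+
+    const entries: HotLogEntry[] = raw
+        .split("\n")
+        .filter((line) => line.trim().length > 0)
+        .map((line) => JSON.parse(line));
+
+    for (const entry of entries) {
+        // Garbage: unreferenced entries are simply dropped.
+        if (!isReferenced(entry.hash)) continue;
+
+        // Stable: promote into cold storage if not already present.
+        const coldKey = `cold_chunks/${entry.hash}`;
+        if (!(await bucket.exists(coldKey))) {
+            await bucket.putText(coldKey, entry.data);
+        }
+
+        // Record the chunk in the prefix-based index
+        // (a real implementation would batch these updates per prefix).
+        const prefix = entry.hash.slice(0, prefixLength);
+        const indexKey = `indices/${prefix}.json`;
+        const index = JSON.parse((await bucket.getText(indexKey)) ?? "{}");
+        if (!index[entry.hash]) {
+            index[entry.hash] = true;
+            await bucket.putText(indexKey, JSON.stringify(index));
+        }
+    }
+
+    // The whole log has been processed: clear it to complete the cycle.
+    await bucket.delete(logKey);
+}
+```
+
+In practice this pass would run only under the lease described in its step 1, and index updates would be batched per prefix to limit round trips.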