Re: [PR] HDDS-10611. Design document for MPU GC Optimization [ozone]

via GitHub Sat, 21 Feb 2026 01:10:34 -0800


ivandika3 commented on code in PR #9793:
URL: https://github.com/apache/ozone/pull/9793#discussion_r2835975183



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+---
+title: Multipart Upload GC Pressure Optimizations
+summary: Change Multipart Upload Logic to improve OM GC Pressure
+date: 2026-02-19
+jira: HDDS-10611
+status: implemented
+author: Abhishek Pal, Rakesh Radhakrishnan
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Ozone MPU Optimization - Design Doc
+
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)
+* [2.1 Approach-1: Reuse multipartInfoTable with empty part list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list)
+* [2.2 Approach-2: Introduce new multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable)

Review Comment:
   A few suggestions:
   - I think it's better to inline the chosen approach (i.e. Approach 1) instead 
of having a separate section (e.g. 2.4 Chosen Approach). For example, we can 
structure it as "2.1 Chosen Approach: Reuse multipartInfoTable with empty part 
list" and then "2.2 Alternative Approach 1: Introduce new multipartMetadataTable".
   - For each approach, you can add the pros and cons (advantages and 
disadvantages) so that the reviewers can understand the tradeoffs of both 
approaches.



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)

Review Comment:
   This backward compatibility section seems to be missing, but it is 
important for ensuring compatibility with the previous MPU flow.
   
   A few things to consider:
   - Are we going to have separate OM requests / responses (e.g. 
`S3InitiateMultipartUploadRequestV2`, etc.) or are we going to change the 
existing OM requests (i.e. handle both the old and new flows)?
   - What happens if there are ongoing incomplete MPU uploads during an 
upgrade? For example, if the OM sees that there is a non-empty key, it will use 
the old flow (i.e. add to `partKeyInfoList`), otherwise it will use the 
flattened MPU part table.
   - What will the MPU query flow look like? Does `listParts` need to query 
both the multipartInfoTable and the part table? Additionally, how do we 
differentiate between the `listParts` of an old MPU and a new MPU?
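   
   To make the `listParts` question concrete, here is a minimal sketch of one 
possible read path, assuming the per-upload `schemaVersion` field from the doc 
is what selects the flow. `ListPartsSketch`, `MpuEntry` and `PARTS_TABLE` are 
hypothetical stand-ins (a `TreeMap` plays the role of a sorted RocksDB table), 
not existing Ozone classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these types are real Ozone classes. */
public class ListPartsSketch {

  /** Stand-in for a multipartInfoTable value. */
  static final class MpuEntry {
    final int schemaVersion;         // 0 = legacy (parts inline), 1 = split table
    final List<String> inlineParts;  // only populated when schemaVersion == 0

    MpuEntry(int schemaVersion, List<String> inlineParts) {
      this.schemaVersion = schemaVersion;
      this.inlineParts = inlineParts;
    }
  }

  /** Stand-in for the flattened part table, kept sorted by key. */
  static final TreeMap<String, String> PARTS_TABLE = new TreeMap<>();

  /** Lists the parts of one upload from wherever that upload stores them. */
  static List<String> listParts(String mpuKey, MpuEntry entry) {
    if (entry.schemaVersion == 0) {
      // Old flow: the single multipartInfoTable row already holds every part.
      return new ArrayList<>(entry.inlineParts);
    }
    // New flow: prefix scan over the rows that belong to this upload only.
    String prefix = mpuKey + "/";
    List<String> parts = new ArrayList<>();
    for (Map.Entry<String, String> row
        : PARTS_TABLE.tailMap(prefix, true).entrySet()) {
      if (!row.getKey().startsWith(prefix)) {
        break;  // walked past this upload's key range
      }
      parts.add(row.getValue());
    }
    return parts;
  }
}
```

   With this shape, a single upload never needs both sources: the 
multipartInfoTable row is always read first, and the part table is scanned only 
for uploads marked with the new schema version.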



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)
+* [2.1 Approach-1: Reuse multipartInfoTable with empty part list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list)
+* [2.2 Approach-2: Introduce new multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable)
+* [2.3 Summary](#23-summary)
+* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1)
+3. [Upgrades](#3-upgrades)
+4. [Benchmarking and Performance](#4-benchmarking-and-performance)
+5. [Open Questions](#5-open-questions)
+
+---
+
+## 1. Motivation
+Presently Ozone has several overheads when uploading large files via Multipart upload (MPU). This document presents a detailed design for optimizing the MPU storage layout to reduce these overheads.
+
+### Problem with the current MPU schema
+**Current design:**
+* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
+* Value = full `OmMultipartKeyInfo` with all parts inline.
+
+**Implications:**
+1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, adds one part, serializes it, and writes it back (HDDS-10611).
+2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
+3. GC pressure grows with the size of the object (HDDS-10611).
+
+#### a) Deserialization overhead
+| Operation     | Current                                                 |
+|:--------------|:--------------------------------------------------------|
+| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |
+
+#### b) WAL overhead
+Assuming one MPU part info object takes ~1.5KB.
+
+| Scenario    | Current WAL                     |
+|:------------|:--------------------------------|
+| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |
+
+#### c) GC pressure
+Current: Large short-lived objects per part commit.
+
+#### Existing Storage Layout Overview
+```protobuf
+MultipartKeyInfo {
+  uploadID : string
+  creationTime : uint64
+  type : ReplicationType
+  factor : ReplicationFactor (optional)
+  partKeyInfoList : repeated PartKeyInfo ← grows with each part
+  objectID : uint64 (optional)
+  updateID : uint64 (optional)
+  parentID : uint64 (optional)
+  ecReplicationConfig : optional
+}
+```
+
+---
+
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object.
+
+### Split-table design (V1)
+Split MPU metadata into:
+* **Metadata table:** Lightweight per-MPU metadata (no part list).
+* **Parts table:** One row per part (flat structure).
+
+**New MultipartPartInfo Structure:**
+```protobuf
+message MultipartPartInfo {
+  required string partName = 1;
+  required uint32 partNumber = 2;
+  required string volumeName = 3;
+  required string bucketName = 4;
+  required string keyName = 5;
+  required uint64 dataSize = 6;
+  required uint64 modificationTime = 7;
+  repeated KeyLocationList keyLocationList = 8;
+  repeated hadoop.hdds.KeyValue metadata = 9;
+  optional FileEncryptionInfoProto fileEncryptionInfo = 10;
+  optional FileChecksumProto fileChecksum = 11;
+}
+```
+
+### Comparison: V0 (legacy) vs V1
+| Metric              | Current (V0)                  | Split-Table (V1)                                 |
+|:--------------------|:------------------------------|:-------------------------------------------------|
+| **Commit part N**   | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo |
+| **1,000 parts WAL** | ~733 MB                       | ~1.5 MB (or ~600KB with optimized info)          |
+| **GC Pressure**     | Large short-lived objects     | Small metadata + single-part objects             |
+
+---
+
+### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list
+Reuse the existing table but introduce a new `multipartPartsTable`.
+
+**Storage Layout:**
+* **multipartInfoTable (RocksDB):**
+  * V0: Key → `OmMultipartKeyInfo` { parts inline }
+  * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 }
+* **multipartPartsTable (RocksDB) [V1 only]:**
+  * `/uploadId/part00001` → `PartKeyInfo`
+  * `/uploadId/part00002` → `PartKeyInfo`
+  * `/uploadId/part00003` → `PartKeyInfo`
+
+
+```protobuf
+message MultipartKeyInfo {
+    required string uploadID = 1;
+    required uint64 creationTime = 2;
+    required hadoop.hdds.ReplicationType type = 3;
+    optional hadoop.hdds.ReplicationFactor factor = 4;
+    repeated PartKeyInfo partKeyInfoList = 5;
+    optional uint64 objectID = 6;
+    optional uint64 updateID = 7;
+    optional uint64 parentID = 8;
+    optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 9;
+    optional uint32 schemaVersion = 10; // default 0
+}
+```
+
+#### V0: OmMultipartKeyInfo (parts inline)
+```
+OmMultipartKeyInfo {
+  uploadID
+  creationTime
+  type
+  factor
+  partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ]   ← all parts inline
+  objectID
+  updateID
+  parentID
+  schemaVersion: 0 (or absent)
+}
+```
+##### V1: OmMultipartKeyInfo (empty list + schemaVersion)
+```
+OmMultipartKeyInfo {
+  uploadID
+  creationTime
+  type
+  factor
+  partKeyInfoList: []   ← empty
+  objectID
+  updateID
+  parentID
+  schemaVersion: 1
+}
+```
+
+#### Example (for a 10 part MPU)
+
+---
+#### MultipartInfoTable :
+```
+Key:   `/vol1/bucket1/mp_file1/abc123-uuid-456`
+
+Value:
+OmMultipartKeyInfo {
+  uploadID: "abc123-uuid-456"
+  creationTime: 1738742400000
+  type: RATIS
+  factor: THREE
+  partKeyInfoList: []
+  objectID: 1001
+  updateID: 12345
+  parentID: 0
+  schemaVersion: 1
+}
+```
+
+#### MultipartPartsTable – 10 rows:
+
+```text
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00001
+Value: PartKeyInfo { partName: ".../part1", partNumber: 1, partKeyInfo: KeyInfo{blocks, size,...} }
+
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00002
+Value: PartKeyInfo { partName: ".../part2", partNumber: 2, partKeyInfo: KeyInfo{...} }
+...
+...
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00010
+Value: PartKeyInfo { partName: ".../part10", partNumber: 10, partKeyInfo: KeyInfo{...} }
+```
+
+### 2.2. Approach-2 : Introduce new multipartMetadataTable
+
+Split metadata and introduce two new tables:
+- **multipartMetadataTable**: lightweight per-MPU metadata (no part list).
+- **multipartPartsTable**: one row per part (no aggregation).
+
+Below is the new metadata table info object structure:
+```protobuf
+message MultipartMetadataInfo {
+  required string uploadID = 1;
+  required uint64 creationTime = 2;
+  required hadoop.hdds.ReplicationType type = 3;
+  optional hadoop.hdds.ReplicationFactor factor = 4;
+  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 5;
+  optional uint64 objectID = 6;
+  optional uint64 updateID = 7;
+  optional uint64 parentID = 8;
+  optional uint32 schemaVersion = 9; // default 0
+}
+```
+
+#### Storage Layout Overview
+
+* **multipartInfoTable (RocksDB):**
+  * V0: `/vol/bucket/key/uploadId` → `OmMultipartKeyInfo { partKeyInfoList: [...] }`
+
+
+* **multipartMetadataTable (RocksDB)**
+  * V1: `/vol/bucket/key/uploadId` → `MultipartMetadata { schemaVersion: 1 }`
+
+
+* **multipartPartsTable (RocksDB) [v1 only]**
+  * `/vol/bucket/key/uploadId/part00001`  → `PartKeyInfo` 
+  * `/vol/bucket/key/uploadId/part00002`  → `PartKeyInfo` 
+  * `/vol/bucket/key/uploadId/part00003`  → `PartKeyInfo`
+  * `...`
+
+#### multipartMetadataInfo Table – 1 row
+**V1: OmMultipartMetadataInfo (metadata only)**
+```text
+OmMultipartMetadataInfo {
+  uploadID
+  creationTime
+  type (ReplicationType)
+  factor (ReplicationFactor)
+  objectID
+  updateID
+  parentID
+  ecReplicationConfig
+  schemaVersion: 1
+}
+```
+
+```protobuf
+message MultipartMetadata {
+  required string uploadID = 1;
+  required uint64 creationTime = 2;
+  required hadoop.hdds.ReplicationType type = 3;
+  optional hadoop.hdds.ReplicationFactor factor = 4;
+  optional uint64 objectID = 5;
+  optional uint64 updateID = 6;
+  optional uint64 parentID = 7;
+  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 8;
+  optional uint32 schemaVersion = 9;
+  // NO partKeyInfoList - moved to new table
+}
+```
+
+#### Example:
+
+---
+```
+Key: /vol1/bucket1/mp_file1/abc123-uuid-456
+
+Value:
+MultipartMetadata {
+  uploadID: "abc123-uuid-456"
+  creationTime: 1738742400000
+  type: RATIS
+  factor: THREE
+  objectID: 1001
+  updateID: 12345
+  parentID: 0
+  schemaVersion: 1
+}
+```
+
+---
+
+### 2.3. Summary
+* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag.
+* **Approach-2:** Dedicated table, cleanest separation, requires new lookup logic.
+
+----
+### 2.4. Chosen Approach: Approach-1
+We have chosen **Approach-1: Reuse multipartInfoTable with empty part list**
+as the preferred implementation for MPU optimization (V1).
+
+This approach is favored because it introduces minimal changes to the existing `OmMultipartKeyInfo` protobuf structure.
+<br>
+By simply introducing an optional `schemaVersion` field and ensuring the partKeyInfoList is empty for V1 entries,
+we maintain backward compatibility.
+
+The key advantages are:
+* **Minimal Protobuf Change**: Older clients and processes can still read the multipartInfoTable entries without issue,
+    as the **core structure remains the same**.
+* **Compatibility**: Older uploads (V0) **remain fully functional**, and new uploads (V1) can be distinguished by
+    the schemaVersion. This significantly reduces the risk of breaking existing functionality.
+* **Simplicity**: The transition logic between V0 and V1 is straightforward, primarily checking the
+    `schemaVersion` field upon read.
+---
+
+## 3. Upgrades
+Add a new feature in `OMLayoutFeature`:
+```java
+MPU_PARTS_TABLE_SPLIT(10, "Split multipart table into separate table for parts and key");
+```
+`schemaVersion` is set to `1` only when the upgrade is finalized.

Review Comment:
   I'm still not fully familiar with the OM upgrade flow, but here are some 
questions:
   1. What happens if an old client sends an MPU request to an OM with the new 
layout feature? From what I gather, the old client's writes will fail?
   2. What happens to the incomplete multipart uploads in `multipartInfoTable`? 
Are we going to abort all of them as part of the upgrade?



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list
+Reuse the existing table but introduce a new `multipartPartsTable`.
+
+**Storage Layout:**
+* **multipartInfoTable (RocksDB):**
+  * V0: Key → `OmMultipartKeyInfo` { parts inline }
+  * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 }
+* **multipartPartsTable (RocksDB) [V1 only]:**
+  * `/uploadId/part00001` → `PartKeyInfo`
+  * `/uploadId/part00002` → `PartKeyInfo`
+  * `/uploadId/part00003` → `PartKeyInfo`

Review Comment:
   I think `part` prefix is redundant so we can simply make the key `/upload
   
   Additionally, regarding the padding issue (@jojochuang), instead of adding 0 
padding (which might require additional stripping during query and might add 
more space overhead), it might be better to write a new OM table key codec 
(e.g. `OmMultipartPartsTableKey` with two fields: 1) `String uploadId` and 
2) `Integer partNumber`) that uses a fixed-length long / int (you can refer to 
how `LongCodec` and `IntegerCodec` serialize and deserialize a long / int to 
fixed-size bytes).
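   
   As a rough illustration of the fixed-length idea, the sketch below appends 
the part number as a 4-byte big-endian int after the upload ID, so that plain 
bytewise key comparison (RocksDB's default ordering) sorts part 2 before part 
10 without any zero padding. `PartKeySketch` and its methods are hypothetical 
and do not mirror the real `IntegerCodec` API; the sketch also assumes 
non-negative part numbers and does not handle upload IDs of different lengths 
sharing a prefix, which a real codec would have to address (e.g. with a 
delimiter or length prefix).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a composite key: uploadId bytes + fixed 4-byte part number. */
public class PartKeySketch {

  /** Encodes uploadId as UTF-8 followed by partNumber as a big-endian int. */
  static byte[] encode(String uploadId, int partNumber) {
    byte[] id = uploadId.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(id.length + Integer.BYTES)
        .put(id)
        .putInt(partNumber)  // ByteBuffer is big-endian by default
        .array();
  }

  /** Reads the part number back from the last four bytes of the key. */
  static int decodePartNumber(byte[] key) {
    return ByteBuffer.wrap(key, key.length - Integer.BYTES, Integer.BYTES).getInt();
  }

  /** Reads the upload ID back from everything before the last four bytes. */
  static String decodeUploadId(byte[] key) {
    return new String(key, 0, key.length - Integer.BYTES, StandardCharsets.UTF_8);
  }

  /** Unsigned lexicographic comparison, the order a bytewise comparator applies. */
  static int compareBytewise(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    byte[] part2 = encode("abc123-uuid-456", 2);
    byte[] part10 = encode("abc123-uuid-456", 10);
    // Negative result: part 2 sorts before part 10 under bytewise ordering.
    System.out.println(compareBytewise(part2, part10) < 0);
    System.out.println(decodeUploadId(part10) + " / " + decodePartNumber(part10));
  }
}
```

   Every key of one upload then has the same length, and no stripping of 
padding is needed when the part number is read back.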
   
   



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)

Review Comment:
   Nit: IMO V1 and V2 would be better than V0 and V1. However, I understand that 
this seems to be related to the `schemaVersion` protobuf field, which defaults to 0.



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object.

Review Comment:
   We can add a section at the end about other systems (industry practice) to 
show that a flattened schema is a common practice on RocksDB. For example, in 
the MVCC setups of CockroachDB or TiKV, the key usually has a timestamp or 
version suffix.
   
   In the future, what we learn from this can be applied when we want to 
support S3 versioning (which should also have a flat schema).



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)
+* [2.1 Approach-1: Reuse multipartInfoTable with empty part list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list)
+* [2.2 Approach-2: Introduce new multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable)
+* [2.3 Summary](#23-summary)
+* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1)
+3. [Upgrades](#3-upgrades)
+4. [Benchmarking and Performance](#4-benchmarking-and-performance)
+5. [Open Questions](#5-open-questions)

Review Comment:
   Please also specify these. Open questions can be changed to "FAQ".



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+---
+title: Multipart Upload GC Pressure Optimizations
+summary: Change Multipart Upload Logic to improve OM GC Pressure
+date: 2026-02-19
+jira: HDDS-10611
+status: implemented
+author: Abhishek Pal, Rakesh Radhakrishnan
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Ozone MPU Optimization - Design Doc
+
+
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)
+* [Backward Compatibility](#backward-compatibility)
+* [Split-table design (V1)](#split-table-design-v1)
+* [Comparison: V0 (legacy) vs V1](#comparison-v0-legacy-vs-v1)
+* [2.1 Approach-1: Reuse multipartInfoTable with empty part 
list](#21-approach-1--reuse-multipartinfotable-with-empty-part-list)
+* [2.2 Approach-2: Introduce new 
multipartMetadataTable](#22-approach-2--introduce-new-multipartmetadatatable)
+* [2.3 Summary](#23-summary)
+* [2.4 Chosen Approach: Approach-1](#24-chosen-approach-approach-1)
+3. [Upgrades](#3-upgrades)
+4. [Benchmarking and Performance](#4-benchmarking-and-performance)
+5. [Open Questions](#5-open-questions)
+
+---
+
+## 1. Motivation
+Presently Ozone has several overheads when uploading large files via Multipart 
upload (MPU). This document presents a detailed design for optimizing the MPU 
storage layout to reduce these overheads.
+
+### Problem with the current MPU schema
+**Current design:**
+* One row per MPU: `key = /{vol}/{bucket}/{key}/{uploadId}`
+* Value = full `OmMultipartKeyInfo` with all parts inline.
+
+**Implications:**
+1. Each MPU part commit reads the full `OmMultipartKeyInfo`, deserializes it, 
adds one part, serializes it, and writes it back (HDDS-10611).
+2. RocksDB WAL logs each full write → WAL growth (HDDS-8238).
+3. GC pressure grows with the size of the object (HDDS-10611).
+
+#### a) Deserialization overhead
+| Operation     | Current                                                 |
+|:--------------|:--------------------------------------------------------|
+| Commit part N | Read + deserialize whole OmMultipartKeyInfo (N-1 parts) |
+
+#### b) WAL overhead
+Assuming one MPU part info object takes ~1.5KB.
+
+| Scenario    | Current WAL                     |
+|:------------|:--------------------------------|
+| 1,000 parts | ~733 MB (1+2+...+1000) × 1.5 KB |
+
+#### c) GC pressure
+Current: Large short-lived objects per part commit.
+
+#### Existing Storage Layout Overview
+```protobuf
+MultipartKeyInfo {
+  uploadID : string
+  creationTime : uint64
+  type : ReplicationType
+  factor : ReplicationFactor (optional)
+  partKeyInfoList : repeated PartKeyInfo ← grows with each part
+  objectID : uint64 (optional)
+  updateID : uint64 (optional)
+  parentID : uint64 (optional)
+  ecReplicationConfig : optional
+}
+```
+
+---
+
+## 2. Proposal
+The idea is to split the content of `MultipartInfoTable`. Part information will be stored separately in a flattened schema (one row per part) instead of one giant object.
+
+### Split-table design (V1)
+Split MPU metadata into:
+* **Metadata table:** Lightweight per-MPU metadata (no part list).
+* **Parts table:** One row per part (flat structure).
+
+**New MultipartPartInfo Structure:**
+```protobuf
+message MultipartPartInfo {
+  required string partName = 1;
+  required uint32 partNumber = 2;
+  required string volumeName = 3;
+  required string bucketName = 4;
+  required string keyName = 5;
+  required uint64 dataSize = 6;
+  required uint64 modificationTime = 7;
+  repeated KeyLocationList keyLocationList = 8;
+  repeated hadoop.hdds.KeyValue metadata = 9;
+  optional FileEncryptionInfoProto fileEncryptionInfo = 10;
+  optional FileChecksumProto fileChecksum = 11;
+}
+```
+
+### Comparison: V0 (legacy) vs V1
+| Metric              | Current (V0)                  | Split-Table (V1)                                 |
+|:--------------------|:------------------------------|:-------------------------------------------------|
+| **Commit part N**   | Read + deserialize whole list | Read Metadata (~200B) + write single PartKeyInfo |
+| **1,000 parts WAL** | ~733 MB                       | ~1.5 MB (or ~600KB with optimized info)          |
+| **GC Pressure**     | Large short-lived objects     | Small metadata + single-part objects             |
+
+---
+
+### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list

Review Comment:
   We can add a comparison of each MPU flow under the old and new designs.

   For example, MPU init should stay the same. However, MPU part commit, complete, and abort should differ, mainly in how they interact with the new MPU parts table.



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+## Table of Contents
+1. [Motivation](#1-motivation)
+2. [Proposal](#2-proposal)

Review Comment:
   I think the proposal can be broken down into more sections. For example, the whole section on the new table can be a separate subsection under 2.1. We can then describe the changes to the MPU flows in 2.2, and so on.



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+### 2.1. Approach-1 : Reuse multipartInfoTable with empty part list
+Reuse the existing table but introduce a new `multipartPartsTable`.
+
+**Storage Layout:**
+* **multipartInfoTable (RocksDB):**
+  * V0: Key → `OmMultipartKeyInfo` { parts inline }
+  * V1: Key → `OmMultipartKeyInfo` { empty list, schemaVersion: 1 }
+* **multipartPartsTable (RocksDB) [V1 only]:**
+  * `/uploadId/part00001` → `PartKeyInfo`
+  * `/uploadId/part00002` → `PartKeyInfo`
+  * `/uploadId/part00003` → `PartKeyInfo`
+
+
+```protobuf
+message MultipartKeyInfo {
+    required string uploadID = 1;
+    required uint64 creationTime = 2;
+    required hadoop.hdds.ReplicationType type = 3;
+    optional hadoop.hdds.ReplicationFactor factor = 4;
+    repeated PartKeyInfo partKeyInfoList = 5;
+    optional uint64 objectID = 6;
+    optional uint64 updateID = 7;
+    optional uint64 parentID = 8;
+    optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 9;
+    optional uint32 schemaVersion = 10; // default 0
+}
+```
+
+#### V0: OmMultipartKeyInfo (parts inline)
+```
+OmMultipartKeyInfo {
+  uploadID
+  creationTime
+  type
+  factor
+  partKeyInfoList: [ PartKeyInfo, PartKeyInfo, ... ]   ← all parts inline
+  objectID
+  updateID
+  parentID
+  schemaVersion: 0 (or absent)
+}
+```
+#### V1: OmMultipartKeyInfo (empty list + schemaVersion)
+```
+OmMultipartKeyInfo {
+  uploadID
+  creationTime
+  type
+  factor
+  partKeyInfoList: []   ← empty
+  objectID
+  updateID
+  parentID
+  schemaVersion: 1
+}
+```
+
+#### Example (for a 10 part MPU)
+
+---
+#### MultipartInfoTable :
+```
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456
+
+Value:
+OmMultipartKeyInfo {
+  uploadID: "abc123-uuid-456"
+  creationTime: 1738742400000
+  type: RATIS
+  factor: THREE
+  partKeyInfoList: []
+  objectID: 1001
+  updateID: 12345
+  parentID: 0
+  schemaVersion: 1
+}
+```
+
+#### MultipartPartsTable – 10 rows:
+
+```text
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00001
+Value: PartKeyInfo { partName: ".../part1", partNumber: 1, partKeyInfo: KeyInfo{blocks, size,...} }
+
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00002
+Value: PartKeyInfo { partName: ".../part2", partNumber: 2, partKeyInfo: KeyInfo{...} }
+...
+...
+Key:   /vol1/bucket1/mp_file1/abc123-uuid-456/part00010
+Value: PartKeyInfo { partName: ".../part10", partNumber: 10, partKeyInfo: KeyInfo{...} }
+```

Review Comment:
   Are we including the `/{vol}/{bucket}/{key}/` prefix in the `MultipartPartsTable` keys? In the earlier Storage Layout section, only the upload ID and part are specified (e.g. `/uploadId/part00001`).
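
   To make the question concrete, here is a minimal Python sketch of the two key layouts that appear in the doc. `part_key_full`, `part_key_short`, and `parts_of` are hypothetical helpers, not Ozone APIs; the five-digit zero padding follows the `part00001` examples.

```python
def part_key_full(volume, bucket, key, upload_id, part_number):
    """Layout used in the 10-part example: full MPU key as the prefix."""
    return f"/{volume}/{bucket}/{key}/{upload_id}/part{part_number:05d}"


def part_key_short(upload_id, part_number):
    """Layout used in the Storage Layout list: upload ID only."""
    return f"/{upload_id}/part{part_number:05d}"


def parts_of(parts_table, prefix):
    """Emulate a prefix scan: keys in sorted order that share one prefix."""
    return [k for k in sorted(parts_table) if k.startswith(prefix)]


# Insert parts out of order; the padded keys still sort numerically.
table = {
    part_key_full("vol1", "bucket1", "mp_file1", "abc123-uuid-456", n): n
    for n in (10, 2, 1)
}
keys = parts_of(table, "/vol1/bucket1/mp_file1/abc123-uuid-456/")
```

   With either layout, all parts of one upload share a common prefix, so complete and abort can iterate over them in part order. The full layout additionally reuses the `multipartInfoTable` key as that prefix. Without zero padding, `part10` would sort before `part2`.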



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+### 2.2. Approach-2 : Introduce new multipartMetadataTable

Review Comment:
   I think Approach-2 is only worthwhile if we are changing `OmMultipartKeyInfo` entirely (e.g. removing unnecessary fields). So I agree with using Approach-1.



##########
hadoop-hdds/docs/content/design/mpu-gc-optimization.md:
##########
@@ -0,0 +1,336 @@
+### 2.2. Approach-2 : Introduce new multipartMetadataTable
+
+Split metadata and introduce two new tables:
+- **multipartMetadataTable**: lightweight per-MPU metadata (no part list).
+- **multipartPartsTable**: one row per part (no aggregation).
+
+Below is the new metadata table info object structure:
+```protobuf
+message MultipartMetadataInfo {
+  required string uploadID = 1;
+  required uint64 creationTime = 2;
+  required hadoop.hdds.ReplicationType type = 3;
+  optional hadoop.hdds.ReplicationFactor factor = 4;
+  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 5;
+  optional uint64 objectID = 6;
+  optional uint64 updateID = 7;
+  optional uint64 parentID = 8;
+  optional uint32 schemaVersion = 9; // default 0
+}
+```
+
+#### Storage Layout Overview
+
+* **multipartInfoTable (RocksDB):**
+  * V0: `/vol/bucket/key/uploadId` → `OmMultipartKeyInfo { partKeyInfoList: [...] }`
+
+
+* **multipartMetadataTable (RocksDB)**
+  * V1: `/vol/bucket/key/uploadId` → `MultipartMetadata { schemaVersion: 1 }`
+
+
+* **multipartPartsTable (RocksDB) [v1 only]**
+  * `/vol/bucket/key/uploadId/part00001`  → `PartKeyInfo` 
+  * `/vol/bucket/key/uploadId/part00002`  → `PartKeyInfo` 
+  * `/vol/bucket/key/uploadId/part00003`  → `PartKeyInfo`
+  * `...`
+
+#### multipartMetadataInfo Table – 1 row
+**V1: OmMultipartMetadataInfo (metadata only)**
+```text
+OmMultipartMetadataInfo {
+  uploadID
+  creationTime
+  type (ReplicationType)
+  factor (ReplicationFactor)
+  objectID
+  updateID
+  parentID
+  ecReplicationConfig
+  schemaVersion: 1
+}
+```
+
+```protobuf
+message MultipartMetadata {
+  required string uploadID = 1;
+  required uint64 creationTime = 2;
+  required hadoop.hdds.ReplicationType type = 3;
+  optional hadoop.hdds.ReplicationFactor factor = 4;
+  optional uint64 objectID = 5;
+  optional uint64 updateID = 6;
+  optional uint64 parentID = 7;
+  optional hadoop.hdds.ECReplicationConfig ecReplicationConfig = 8;
+  optional uint32 schemaVersion = 9;
+  // NO partKeyInfoList - moved to new table
+}
+```
+
+#### Example:
+
+---
+```
+Key: /vol1/bucket1/mp_file1/abc123-uuid-456
+
+Value:
+MultipartMetadata {
+  uploadID: "abc123-uuid-456"
+  creationTime: 1738742400000
+  type: RATIS
+  factor: THREE
+  objectID: 1001
+  updateID: 12345
+  parentID: 0
+  schemaVersion: 1
+}
+```
+
+---
+
+### 2.3. Summary
+* **Approach-1:** Minimal change, same value type, uses `schemaVersion` flag.
+* **Approach-2:** Dedicated table, cleanest separation, requires new lookup logic.
+
+----
+### 2.4. Chosen Approach: Approach-1
+We have chosen **Approach-1: Reuse multipartInfoTable with empty part list**
+as the preferred implementation for MPU optimization (V1).
+
+This approach is favored because it introduces minimal changes to the existing `OmMultipartKeyInfo` protobuf structure.
+<br>
+By simply introducing an optional `schemaVersion` field and ensuring the partKeyInfoList is empty for V1 entries,
+we maintain backward compatibility.
+
+The key advantages are:
+* **Minimal Protobuf Change**: Older clients and processes can still read the multipartInfoTable entries without issue,
+    as the **core structure remains the same**.
+* **Compatibility**: Older uploads (V0) **remain fully functional**, and new uploads (V1) can be distinguished by
+    the schemaVersion. This significantly reduces the risk of breaking existing functionality.
+* **Simplicity**: The transition logic between V0 and V1 is straightforward, primarily checking the
+    `schemaVersion` field upon read.
+---

Review Comment:
   As mentioned before, "Summary" and "Chosen Approach" can be folded into the "Approach-1" section.
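
   If the "Simplicity" point is moved inline, a short sketch of the read path could accompany it. The following Python is only an illustration, with dicts standing in for the two tables. `list_parts` is a hypothetical helper, and treating an absent `schemaVersion` as 0 follows the `// default 0` comment in the proto.

```python
def list_parts(info_table, parts_table, mpu_key):
    """Return the parts of one upload, dispatching on schemaVersion."""
    info = info_table[mpu_key]
    if info.get("schemaVersion", 0) == 0:
        # V0: parts are stored inline in the multipartInfoTable row.
        return list(info["partKeyInfoList"])
    # V1: the inline list is empty; parts live in multipartPartsTable.
    prefix = f"{mpu_key}/part"
    return [parts_table[k] for k in sorted(parts_table) if k.startswith(prefix)]


info_table = {
    "/vol1/bucket1/old/u0": {"partKeyInfoList": ["p1", "p2"]},  # V0 entry
    "/vol1/bucket1/new/u1": {"partKeyInfoList": [], "schemaVersion": 1},
}
parts_table = {
    "/vol1/bucket1/new/u1/part00002": "p2",
    "/vol1/bucket1/new/u1/part00001": "p1",
}
```

   Both uploads return the same part list through the same call, which is the behaviour the compatibility bullet describes for V0 and V1 entries living side by side.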



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
