errose28 commented on code in PR #8388:
URL: https://github.com/apache/ozone/pull/8388#discussion_r2096510844


##########
hadoop-hdds/docs/content/design/dn-min-space-configuration.md:
##########
@@ -0,0 +1,108 @@
+---
+title: Minimum free space configuration for datanode volumes
+summary: Describe proposal for minimum free space configuration which volume 
must have to function correctly.
+date: 2025-05-05
+jira: HDDS-12928
+status: implemented
+author: Sumit Agrawal
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Abstract
+Volume in the datanode stores the container data and metadata (rocks db 
co-located on the volume).
+There are various parallel operation going on such as import container, export 
container, write and delete data blocks,
+container repairs, create and delete containers. The space is also required 
for volume db to perform compaction at regular interval.
+This is hard to capture exact usages and free available space. So, this is 
required to configure minimum free space
+so that datanode operation can perform without any corruption and environment 
being stuck and support read of data.
+
+This free space is used to ensure volume allocation if `required space < 
(volume available space - free space - reserved space)`
+Any container creation and import container need ensure this constrain is met. 
And block write need ensure that this space is available if new blocks are 
written.
+Note: Any issue related to ensuring free space is tracked with separate JIRA.
+
+# Existing configuration (before HDDS-12928)
+Two configurations are provided,
+- hdds.datanode.volume.min.free.space  (default: 5GB)
+- hdds.datanode.volume.min.free.space.percent
+
+1. If nothing is configured, takes default value as 5GB
+2. if both are configured, priority to hdds.datanode.volume.min.free.space
+3. else respective configuration is used.
+
+# Problem Statement
+
+- With 5GB default configuration, its not avoiding full disk scenario due to 
error in ensuring free space availability.
+This is due to container size being imported is 5GB which is near boundary, 
and other parallel operation.
+- Volume DB size can increase with increase in disk space as container and 
blocks it can hold can more and hence metadata.
+- Volume DB size can also vary due to small files and big files combination, 
as more small files can lead to more metadata.
+
+Solution involves
+- appropriate default min free space
+- depends on disk size variation
+
+# Approach 1 Combination of minimum free space and percent increase on disk 
size
+
+Configuration:
+1. Minimum free space: hdds.datanode.volume.min.free.space: default value 
`20GB`
+2. disk size variation: hdds.datanode.volume.min.free.space.percent: default 
0.1% or 0.001 ratio
+
+Minimum free space = Max (`<Min free space>`, `<percent disk space>`)
+
+| Disk space | Min Free Space (percent: 1%) | Min Free Space ( percent: 0.1%) |
+| -- |------------------------------|---------------------------------|
+| 100 GB | 20 GB                        | 20 GB (min space default)       |
+| 1 TB | 20 GB                        | 20 GB (min space default)       |
+| 10 TB | 100 GB                       | 20 GB  (min space default) |
+| 100 TB | 1 TB                         | 100 GB                          |
+
+considering above table with this solution,
+- 0.1 % to be sufficient to hold almost all cases, as not observed any dn 
volume db to be more that 1-2 GB
+
+# Approach 2 Only minimum free space configuration
+
+Considering above approach, 20 GB as default should be sufficient for most of 
the disk, as usually disk size is 10-15TB as seen.
+Higher disk is rarely used, and instead multiple volumes are attached to same 
DN with multiple disk.
+
+Considering this scenario, Minimum free space: 
`hdds.datanode.volume.min.free.space` itself is enough and
+percent based configuration can be removed.
+
+### Compatibility
+If `hdds.datanode.volume.min.free.space.percent` is configured, this should 
not have any impact
+as default value is increased to 20GB which will consider most of the use case.
+
+# Approach 3 Combination of maximum free space and percent configuration on 
disk size
+
+Configuration:
+1. Maximum free space: hdds.datanode.volume.min.free.space: default value 
`20GB`
+2. disk size variation: hdds.datanode.volume.min.free.space.percent: default 
10% or 0.1 ratio
+
+Minimum free space = **Min** (`<Max free space>`, `<percent disk space>`)
+> Difference with approach `one` is, Min function over the 2 above 
configuration
+
+| Disk space | Min Free Space (20GB, 10% of disk) |
+| -- |------------------------------------|
+| 10 GB | 1 GB (=Min(20GB, 1GB)              |
+| 100 GB | 10 GB (=Min(20GB, 10GB)            |
+| 1 TB | 20 GB   (=Min(20GB, 100GB)         |
+| 10 TB | 20 GB (=Min(20GB, 1TB)             |
+| 100 TB | 20GB  (=Min(20GB, 10TB)            |
+
+This case is more useful for test environment where disk space is less and no 
need any additional configuration.
+
+# Conclusion
+1. Going with Approach 1

Review Comment:
   Let me try to add some guiding principles for config modifications that can 
help us compare one decision against another. The following are usability 
issues that can occur with config keys:
   1. **Inconsistent config format**: Configs that operate on similar entities 
(space usage, address + port, percentages) but read those values differently.
   2. **Hidden config dependencies**: One configuration, with its own value 
unchanged, behaves differently depending on the value applied to a different 
config.
       - This does not include invalid config combinations that fail component 
startup, since those are easily caught and called out with an error message. We 
know that no actively running system will have such a configuration.
   
   Both `hdds.datanode.du.reserved{.percent}` and 
`hdds.datanode.min.free.space{.percent}` have issues here, and this is our 
chance to fix them. Now let's look at how our options either help or hurt the 
above points.
   
   ## Inconsistent config format
   
   `hdds.datanode.du.reserved` and `hdds.datanode.min.free.space` are both used 
to configure space reservation on datanode drives, so as stated in point 1 it 
is most intuitive if they accept the same value format. It is ok if one format 
is more useful for one config than the other. For example, per-volume 
configuration may be required for `hdds.datanode.du.reserved` but not for 
`hdds.datanode.min.free.space`. It's still ok for both to have that option: it 
is not invalid for `hdds.datanode.min.free.space`, there is still only one set 
of formatting options for users to remember, and there is only one parser in 
the code. If we pick and choose different valid formats for each config, we 
will have two formats to remember and two parsers in the code. So even removing 
allowed config formats from `hdds.datanode.min.free.space` that are still 
present in `hdds.datanode.du.reserved` actually adds complexity. Based on this, 
`hdds.datanode.du.reserved` and `hdds.datanode.min.free.space` must accept 
values of the same format to avoid introducing new config usability problems.
   
   ## Hidden config dependencies
   
   Next let's look at how the percent variations affect point 2. If both the 
percent and non-percent variations are specified, anything other than failing 
startup creates this problem. So if both keys are given, like 
`hdds.datanode.min.free.space.percent` and `hdds.datanode.min.free.space`, the 
combination must be considered invalid and fail datanode startup.
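
   As a rough illustration of the fail-fast rule above, a startup check could 
look like the sketch below. The class `MinFreeSpaceConfigCheck`, its `validate` 
method, and the use of a plain `Map` in place of Ozone's configuration object 
are all assumptions made for this example, not existing Ozone code.

```java
import java.util.Map;

// Hypothetical sketch: rejects a config that sets both the size key and the
// percent key, so one key can never silently change the meaning of the other.
public final class MinFreeSpaceConfigCheck {
  public static final String SIZE_KEY = "hdds.datanode.min.free.space";
  public static final String PERCENT_KEY = "hdds.datanode.min.free.space.percent";

  private MinFreeSpaceConfigCheck() { }

  /** Throws if both variations are present; otherwise does nothing. */
  public static void validate(Map<String, String> conf) {
    if (conf.containsKey(SIZE_KEY) && conf.containsKey(PERCENT_KEY)) {
      throw new IllegalStateException(
          "Only one of " + SIZE_KEY + " and " + PERCENT_KEY + " may be set");
    }
  }
}
```

   Failing here means the invalid combination is reported once, with an error 
message, instead of being resolved by a priority rule that users have to 
remember.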
   
   There is another option though: get rid of the percentage specific *config 
keys* but still support percentage based configuration with the one 
`hdds.datanode.min.free.space` config. Let's look at why this works:
   - `hdds.datanode.du.reserved` needs to support volume specific configuration 
in the form of `<volume-path>:reserved-size` since not all volumes may be used 
as spill for compute, or the volumes may be utilized differently.
       - This means we will always have a parsing method like 
[VolumeUsage#getReserved](https://github.com/apache/ozone/blob/2a1a6bf124007821eb68779663bbaad371ea668f/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeUsage.java#L195)
 to handle converting config strings into long values for a volume.
   - `hdds.datanode.min.free.space` and `hdds.datanode.du.reserved` should 
support the same value format, so `hdds.datanode.min.free.space` also needs to 
use this same parser.
   - If we already need a string parser for both configs, we might as well 
make it differentiate between percentage and size based configs too.
   
   ## Proposal to address all requirements
   
   The following layout meets all the constraints defined above:
   - Only two config keys: `hdds.datanode.min.free.space` and 
`hdds.datanode.du.reserved`
   - The valid formats for either config key are:
       - A fixed size, like `20GB`
       - A percentage expressed as a float ratio, like `0.001` (that is, 
0.1%). The lack of a unit differentiates it from the first option.
       - A mapping of volumes to sizes, like `/data/hdds1:20GB,/data/hdds2:10GB`
   - Only one parser is required for both types of configs.
       - This is not new since a parser is already required and cannot be 
removed without removing support for per-volume configuration in 
`hdds.datanode.du.reserved`.
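
   To make the three formats concrete, here is a minimal sketch of one parser 
that could serve both keys. It is not the existing `VolumeUsage#getReserved` 
implementation: the class `SpaceConfigParser`, the method names, the binary 
unit multipliers, and the choice to throw when a volume is missing from a 
mapping are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a single parser shared by both config keys.
public final class SpaceConfigParser {
  // A number followed by a unit, e.g. "20GB". Binary units are an assumption.
  private static final Pattern SIZE = Pattern.compile(
      "(\\d+(?:\\.\\d+)?)\\s*(B|KB|MB|GB|TB)", Pattern.CASE_INSENSITIVE);
  // A bare number with no unit is read as a ratio of the volume capacity.
  private static final Pattern RATIO = Pattern.compile("\\d+(?:\\.\\d+)?");

  private SpaceConfigParser() { }

  /** Resolves a config value to a byte count for one volume. */
  public static long resolve(String value, String volumePath, long capacityBytes) {
    String v = value.trim();
    if (v.contains(":")) {
      // Per-volume mapping such as "/data/hdds1:20GB,/data/hdds2:10GB".
      for (String entry : v.split(",")) {
        int idx = entry.lastIndexOf(':');
        if (idx < 0) {
          throw new IllegalArgumentException("Bad mapping entry: " + entry);
        }
        if (entry.substring(0, idx).trim().equals(volumePath)) {
          return parseSize(entry.substring(idx + 1).trim());
        }
      }
      // Assumption: an unlisted volume is an error rather than "no reservation".
      throw new IllegalArgumentException("No entry for volume " + volumePath);
    }
    if (RATIO.matcher(v).matches()) {
      double ratio = Double.parseDouble(v);
      if (ratio > 1.0) {
        throw new IllegalArgumentException("Ratio must be between 0 and 1: " + v);
      }
      return Math.round(capacityBytes * ratio);
    }
    return parseSize(v);
  }

  /** Parses a fixed size with a unit suffix into bytes. */
  static long parseSize(String s) {
    Matcher m = SIZE.matcher(s);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a size: " + s);
    }
    double number = Double.parseDouble(m.group(1));
    long unit;
    switch (m.group(2).toUpperCase(Locale.ROOT)) {
      case "B": unit = 1L; break;
      case "KB": unit = 1L << 10; break;
      case "MB": unit = 1L << 20; break;
      case "GB": unit = 1L << 30; break;
      default: unit = 1L << 40; break; // "TB", the only remaining match
    }
    return Math.round(number * unit);
  }
}
```

   With a parser of this shape, both keys accept the same strings, and the 
presence or absence of a unit is the only thing that separates a fixed size 
from a ratio.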
   
   We should never introduce usability issues in our configurations. We have 
enough of them already : ) If you can show how an alternate proposal meets all 
the configuration requirements without impacting usability we can consider that 
as well, but currently none of the proposals in the doc satisfy this.
   




