ChenSammi commented on code in PR #8528:
URL: https://github.com/apache/ozone/pull/8528#discussion_r2128160461


##########
hadoop-hdds/docs/content/feature/Topology.md:
##########
@@ -23,86 +23,190 @@ summary: Configuration for rack-awarness for improved 
read/write
   limitations under the License.
 -->
 
-Ozone can use topology related information (for example rack placement) to 
optimize read and write pipelines. To get full rack-aware cluster, Ozone 
requires three different configuration.
-
- 1. The topology information should be configured by Ozone.
- 2. Topology related information should be used when Ozone chooses 3 different 
datanodes for a specific pipeline/container. (WRITE)
- 3. When Ozone reads a Key it should prefer to read from the closest node. 
-
-<div class="alert alert-warning" role="alert">
-
-Ozone uses RAFT replication for Open containers (write), and an async 
replication for closed, immutable containers (cold data). As RAFT requires 
low-latency network, topology awareness placement is available only for closed 
containers. See the [page about Containers]({{< ref "concept/Containers.md">}}) 
about more information related to Open vs Closed containers.
-
-</div>
-
-## Topology hierarchy
-
-Topology hierarchy can be configured with using 
`net.topology.node.switch.mapping.impl` configuration key. This configuration 
should define an implementation of the 
`org.apache.hadoop.net.CachedDNSToSwitchMapping`. As this is a Hadoop class, 
the configuration is exactly the same as the Hadoop Configuration
-
-### Static list
-
-Static list can be configured with the help of ```TableMapping```:
-
-```XML
+Apache Ozone uses topology information (e.g., rack placement) to optimize data 
access and improve resilience. A fully rack-aware cluster needs:
+
+1.  Configured network topology.
+2.  Topology-aware DataNode selection for container replica placement (write 
path).
+3.  Prioritized reads from topologically closest DataNodes (read path).
+
+## Applicability to Container Types
+
+Ozone's topology-aware placement strategies vary by container replication type 
and state:
+
+* **RATIS Replicated Containers:** Ozone uses RAFT replication for open containers (write path) and asynchronous replication for closed, immutable containers (cold data). Because RAFT requires a low-latency network, topology-aware placement is available only for closed containers. See the [page about Containers]({{< ref "concept/Containers.md">}}) for more information on open vs. closed containers.
+* **Erasure Coded (EC) Containers:** EC demands topology awareness from the initial write. For an EC key, OM allocates a block group of `d + p` (data plus parity) distinct DataNodes selected by SCM's `ECPipelineProvider` to ensure rack diversity and fault tolerance. This topology-aware selection is integral to the EC write path for new blocks. \[2]
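
To see the block-group sizing in practice, a bucket can be created with an EC replication configuration from the Ozone shell. This is a sketch assuming a running cluster; the volume and bucket names are hypothetical, and `rs-3-2-1024k` is one common Reed-Solomon scheme (`d = 3`, `p = 2`):

```shell
# Hypothetical example: keys in this bucket use RS(3,2) erasure coding,
# so each block group spans d + p = 5 distinct DataNodes chosen by SCM.
# Requires a running Ozone cluster; names are illustrative only.
ozone sh volume create /vol1
ozone sh bucket create --type EC --replication rs-3-2-1024k /vol1/ecbucket
```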
+
+## Configuring Topology Hierarchy
+
+Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack 
awareness, configured via `net.topology.node.switch.mapping.impl` in 
`core-site.xml`. This key specifies a 
`org.apache.hadoop.net.CachedDNSToSwitchMapping` implementation. \[1]
+
+Two primary methods exist:
+
+### 1. Static List: `TableMapping`
+
+Maps IPs/hostnames to racks using a predefined file.
+
+* **Configuration:** Set `net.topology.node.switch.mapping.impl` to 
`org.apache.hadoop.net.TableMapping` and `net.topology.table.file.name` to the 
mapping file's path. \[1]
+    ```xml
+    <property>
+      <name>net.topology.node.switch.mapping.impl</name>
+      <value>org.apache.hadoop.net.TableMapping</value>
+    </property>
+    <property>
+      <name>net.topology.table.file.name</name>
+      <value>/etc/ozone/topology.map</value>
+    </property>
+    ```
+* **File Format:** A two-column text file (IP/hostname, rack path per line). 
Unlisted nodes go to `/default-rack`. \[1]
+  Example `topology.map`:
+    ```
+    192.168.1.100 /rack1
+    datanode101.example.com /rack1
+    192.168.1.102 /rack2
+    datanode103.example.com /rack2
+    ```
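
Because unlisted nodes silently fall back to `/default-rack`, it can help to sanity-check the mapping file before deploying it. A minimal sketch (`check_topology_map` is a hypothetical helper, not part of Ozone or Hadoop):

```shell
#!/bin/bash
# check_topology_map: verify each non-empty line of a topology file has
# exactly two fields and the rack path begins with "/".  Exits non-zero
# if any malformed line is found.
check_topology_map() {
  awk 'NF > 0 && (NF != 2 || $2 !~ /^\//) {
         printf "malformed line %d: %s\n", NR, $0; bad = 1
       }
       END { exit bad }' "$1"
}

# Demo against a throwaway copy of the example mapping.
tmpmap=$(mktemp)
printf '%s\n' \
  '192.168.1.100 /rack1' \
  'datanode101.example.com /rack1' > "$tmpmap"
check_topology_map "$tmpmap" && echo "topology map OK"
```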
+
+### 2. Dynamic List: `ScriptBasedMapping`
+
+Uses an external script to resolve rack locations for IPs.
+
+* **Configuration:** Set `net.topology.node.switch.mapping.impl` to 
`org.apache.hadoop.net.ScriptBasedMapping` and `net.topology.script.file.name` 
to the script's path. \[1]
+    ```xml
+    <property>
+      <name>net.topology.node.switch.mapping.impl</name>
+      <value>org.apache.hadoop.net.ScriptBasedMapping</value>
+    </property>
+    <property>
+      <name>net.topology.script.file.name</name>
+      <value>/etc/ozone/determine_rack.sh</value>
+    </property>
+    ```
+* **Script:** Admin-provided, executable script. Ozone passes IP addresses or hostnames (up to `net.topology.script.number.args` per invocation, default 100) as arguments; the script must output one rack path per input, in order.
+  Example `determine_rack.sh`:
+    ```bash
+    #!/bin/bash
+    # This is a simplified example. A real script might query a CMDB or use 
other logic.
+    while [ $# -gt 0 ] ; do
+      nodeAddress=$1
+      if [[ "$nodeAddress" == "192.168.1.100" || "$nodeAddress" == 
"datanode101.example.com" ]]; then
+        echo "/rack1"
+      elif [[ "$nodeAddress" == "192.168.1.102" || "$nodeAddress" == 
"datanode103.example.com" ]]; then
+        echo "/rack2"
+      else
+        echo "/default-rack"
+      fi
+      shift
+    done
+    ```
+  Ensure the script is executable (`chmod +x /etc/ozone/determine_rack.sh`).
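
Hadoop expects exactly one rack path per input address, in the same order; if the output count does not match, resolution fails and nodes typically end up on `/default-rack`. That contract can be checked with a self-contained sketch (hypothetical, exercising a throwaway copy of the example script):

```shell
#!/bin/bash
# Hypothetical check (not part of Ozone): confirm a mapping script
# prints exactly one rack path per input address, in order.
tmpdir=$(mktemp -d)
cat > "$tmpdir/determine_rack.sh" <<'EOF'
#!/bin/bash
while [ $# -gt 0 ] ; do
  case "$1" in
    192.168.1.100|datanode101.example.com) echo "/rack1" ;;
    192.168.1.102|datanode103.example.com) echo "/rack2" ;;
    *) echo "/default-rack" ;;
  esac
  shift
done
EOF
chmod +x "$tmpdir/determine_rack.sh"

out=$("$tmpdir/determine_rack.sh" 192.168.1.100 datanode103.example.com unknown-host)
expected=$'/rack1\n/rack2\n/default-rack'
[ "$out" = "$expected" ] && echo "batch contract OK"
```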
+
+**Topology Mapping Best Practices:**
+
+* **Accuracy:** Mappings must be accurate and current.
+* **Static Mapping:** Simpler for small, stable clusters; requires manual 
updates.
+* **Dynamic Mapping:** Flexible for large/dynamic clusters. Script 
performance, correctness, and reliability are vital; ensure it's idempotent and 
handles batch lookups efficiently.
+
+## Container Placement Policies for Replicated (RATIS) Containers
+
+SCM uses a pluggable policy for placing additional replicas of *closed* 
RATIS-replicated containers, configured by `ozone.scm.container.placement.impl` 
in `ozone-site.xml`. Policies are in 
`org.apache.hadoop.hdds.scm.container.placement.algorithms`. \[1, 3]
+
+### 1. `SCMContainerPlacementRackAware` (Default)
+
+* **Function:** Distributes replicas across racks for fault tolerance (e.g., 
for 3 replicas, aims for at least two racks). Similar to HDFS placement. \[1]
+* **Use Cases:** Production clusters needing rack-level fault tolerance.
+* **Configuration:**
+    ```xml
+    <property>
+      <name>ozone.scm.container.placement.impl</name>
+      
<value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware</value>
+    </property>
+    ```
+* **Best Practices:** Requires accurate topology mapping.
+* **Limitations:** Designed for single-layer rack topologies (e.g., 
`/rack/node`). Not recommended for multi-layer hierarchies (e.g., 
`/dc/row/rack/node`) as it may not interpret deeper levels correctly. \[1]
+
+### 2. `SCMContainerPlacementRandom`
+
+* **Function:** Randomly selects healthy, available DataNodes meeting basic 
criteria (space, no existing replica), ignoring rack topology. \[1, 4]
+* **Use Cases:** Small/dev/test clusters, or if rack fault tolerance for 
closed replicas isn't critical.
+* **Configuration:**
+    ```xml
+    <property>
+      <name>ozone.scm.container.placement.impl</name>
+      
<value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRandom</value>
+    </property>
+    ```
+* **Best Practices:** Not for production needing rack failure resilience.
+
+### 3. `SCMContainerPlacementCapacity`
+
+* **Function:** Selects DataNodes by available capacity (favors lower disk 
utilization) to balance disk usage. \[5, 6]
+* **Use Cases:** Heterogeneous storage clusters or where even disk utilization 
is key.
+* **Configuration:**
+    ```xml
+    <property>
+      <name>ozone.scm.container.placement.impl</name>
+      
<value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementCapacity</value>
+    </property>
+    ```
+* **Best Practices:** Prevents uneven node filling.
+* **Interaction:** Unlike `SCMContainerPlacementRackAware`, this policy selects purely by available capacity among healthy candidate nodes and does not itself enforce rack-level topology constraints. If rack fault tolerance is required, use `SCMContainerPlacementRackAware` instead.
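
The capacity comparison at the heart of this policy can be sketched as follows (simplified illustration, not Ozone code; `pick_lower_utilization` is a hypothetical helper, and the real policy compares utilization between randomly chosen candidates):

```shell
#!/bin/bash
# Simplified sketch (not Ozone code): given candidate DataNodes that
# already passed the basic health/space checks, prefer the one with the
# lowest disk utilization.  Input lines: "<node> <used-percent>".
pick_lower_utilization() {
  sort -k2 -n | head -n 1 | awk '{ print $1 }'
}

printf '%s\n' 'dn1 82' 'dn2 40' 'dn3 67' | pick_lower_utilization
# prints: dn2
```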

Review Comment:
   It doesn't seem true. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

