ChenSammi commented on code in PR #8528:
URL: https://github.com/apache/ozone/pull/8528#discussion_r2128160461
##########
hadoop-hdds/docs/content/feature/Topology.md:
##########
@@ -23,86 +23,190 @@ summary: Configuration for rack-awarness for improved
read/write
limitations under the License.
-->
-Ozone can use topology related information (for example rack placement) to
optimize read and write pipelines. To get full rack-aware cluster, Ozone
requires three different configuration.
-
- 1. The topology information should be configured by Ozone.
- 2. Topology related information should be used when Ozone chooses 3 different
datanodes for a specific pipeline/container. (WRITE)
- 3. When Ozone reads a Key it should prefer to read from the closest node.
-
-<div class="alert alert-warning" role="alert">
-
-Ozone uses RAFT replication for Open containers (write), and an async
replication for closed, immutable containers (cold data). As RAFT requires
low-latency network, topology awareness placement is available only for closed
containers. See the [page about Containers]({{< ref "concept/Containers.md">}})
about more information related to Open vs Closed containers.
-
-</div>
-
-## Topology hierarchy
-
-Topology hierarchy can be configured with using
`net.topology.node.switch.mapping.impl` configuration key. This configuration
should define an implementation of the
`org.apache.hadoop.net.CachedDNSToSwitchMapping`. As this is a Hadoop class,
the configuration is exactly the same as the Hadoop Configuration
-
-### Static list
-
-Static list can be configured with the help of ```TableMapping```:
-
-```XML
+Apache Ozone uses topology information (e.g., rack placement) to optimize data
access and improve resilience. A fully rack-aware cluster needs:
+
+1. Configured network topology.
+2. Topology-aware DataNode selection for container replica placement (write
path).
+3. Prioritized reads from topologically closest DataNodes (read path).
+
+## Applicability to Container Types
+
+Ozone's topology-aware placement strategies vary by container replication type
and state:
+
+* **RATIS Replicated Containers:** Ozone uses RAFT replication for open
containers (write path) and asynchronous replication for closed, immutable
containers (cold data). Because RAFT requires a low-latency network,
topology-aware placement is available only for closed containers. See the
[page about Containers](concept/Containers.md) for more information about open
vs. closed containers.
+* **Erasure Coded (EC) Containers:** EC demands topology awareness from the
initial write. For an EC key, OM allocates a block group of `d+p` distinct
DataNodes selected by SCM's `ECPipelineProvider` to ensure rack diversity and
fault tolerance. This topology-aware selection is integral to the EC write path
for new blocks. \[2]
+
+## Configuring Topology Hierarchy
+
+Ozone determines DataNode network locations (e.g., racks) using Hadoop's rack
awareness, configured via `net.topology.node.switch.mapping.impl` in
`core-site.xml`. This key specifies a
`org.apache.hadoop.net.CachedDNSToSwitchMapping` implementation. \[1]
+
+Two primary methods exist:
+
+### 1. Static List: `TableMapping`
+
+Maps IPs/hostnames to racks using a predefined file.
+
+* **Configuration:** Set `net.topology.node.switch.mapping.impl` to
`org.apache.hadoop.net.TableMapping` and `net.topology.table.file.name` to the
mapping file's path. \[1]
+ ```xml
+ <property>
+ <name>net.topology.node.switch.mapping.impl</name>
+ <value>org.apache.hadoop.net.TableMapping</value>
+ </property>
+ <property>
+ <name>net.topology.table.file.name</name>
+ <value>/etc/ozone/topology.map</value>
+ </property>
+ ```
+* **File Format:** A two-column text file (IP/hostname, rack path per line).
Unlisted nodes go to `/default-rack`. \[1]
+ Example `topology.map`:
+ ```
+ 192.168.1.100 /rack1
+ datanode101.example.com /rack1
+ 192.168.1.102 /rack2
+ datanode103.example.com /rack2
+ ```
+
+### 2. Dynamic List: `ScriptBasedMapping`
+
+Uses an external script to resolve rack locations for IPs.
+
+* **Configuration:** Set `net.topology.node.switch.mapping.impl` to
`org.apache.hadoop.net.ScriptBasedMapping` and `net.topology.script.file.name`
to the script's path. \[1]
+ ```xml
+ <property>
+ <name>net.topology.node.switch.mapping.impl</name>
+ <value>org.apache.hadoop.net.ScriptBasedMapping</value>
+ </property>
+ <property>
+ <name>net.topology.script.file.name</name>
+ <value>/etc/ozone/determine_rack.sh</value>
+ </property>
+ ```
+* **Script:** Admin-provided, executable script. Ozone passes IPs (up to
`net.topology.script.number.args`, default 100) as arguments; script outputs
rack paths (one per line).
+ Example `determine_rack.sh`:
+ ```bash
+ #!/bin/bash
+ # This is a simplified example. A real script might query a CMDB or use
+ # other logic.
+ # Print one rack path per address passed on the command line.
+ while [ $# -gt 0 ]; do
+   nodeAddress="$1"
+   if [[ "$nodeAddress" == "192.168.1.100" || "$nodeAddress" == "datanode101.example.com" ]]; then
+     echo "/rack1"
+   elif [[ "$nodeAddress" == "192.168.1.102" || "$nodeAddress" == "datanode103.example.com" ]]; then
+     echo "/rack2"
+   else
+     echo "/default-rack"
+   fi
+   shift
+ done
+ ```
+ Ensure the script is executable (`chmod +x /etc/ozone/determine_rack.sh`).
+
+**Topology Mapping Best Practices:**
+
+* **Accuracy:** Mappings must be accurate and current.
+* **Static Mapping:** Simpler for small, stable clusters; requires manual
updates.
+* **Dynamic Mapping:** Flexible for large/dynamic clusters. Script
performance, correctness, and reliability are vital; ensure it's idempotent and
handles batch lookups efficiently.
+
+## Container Placement Policies for Replicated (RATIS) Containers
+
+SCM uses a pluggable policy for placing additional replicas of *closed*
RATIS-replicated containers, configured by `ozone.scm.container.placement.impl`
in `ozone-site.xml`. Policies are in
`org.apache.hadoop.hdds.scm.container.placement.algorithms`. \[1, 3]
+
+### 1. `SCMContainerPlacementRackAware` (Default)
+
+* **Function:** Distributes replicas across racks for fault tolerance (e.g.,
for 3 replicas, aims for at least two racks). Similar to HDFS placement. \[1]
+* **Use Cases:** Production clusters needing rack-level fault tolerance.
+* **Configuration:**
+ ```xml
+ <property>
+ <name>ozone.scm.container.placement.impl</name>
+ <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware</value>
+ </property>
+ ```
+* **Best Practices:** Requires accurate topology mapping.
+* **Limitations:** Designed for single-layer rack topologies (e.g.,
`/rack/node`). Not recommended for multi-layer hierarchies (e.g.,
`/dc/row/rack/node`) as it may not interpret deeper levels correctly. \[1]
+
+### 2. `SCMContainerPlacementRandom`
+
+* **Function:** Randomly selects healthy, available DataNodes meeting basic
criteria (space, no existing replica), ignoring rack topology. \[1, 4]
+* **Use Cases:** Small/dev/test clusters, or if rack fault tolerance for
closed replicas isn't critical.
+* **Configuration:**
+ ```xml
+ <property>
+ <name>ozone.scm.container.placement.impl</name>
+ <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRandom</value>
+ </property>
+ ```
+* **Best Practices:** Not for production needing rack failure resilience.
+
+### 3. `SCMContainerPlacementCapacity`
+
+* **Function:** Selects DataNodes by available capacity (favors lower disk
utilization) to balance disk usage. \[5, 6]
+* **Use Cases:** Heterogeneous storage clusters or where even disk utilization
is key.
+* **Configuration:**
+ ```xml
+ <property>
+ <name>ozone.scm.container.placement.impl</name>
+ <value>org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementCapacity</value>
+ </property>
+ ```
+* **Best Practices:** Prevents uneven node filling.
+* **Interaction:** Typically respects topology constraints first (like
`SCMContainerPlacementRackAware` in simple topologies), then chooses by
capacity among valid nodes. Verify this interaction for specific needs.
Review Comment:
It doesn't seem true.
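
For context on the disputed **Interaction** bullet: as far as I understand `SCMContainerPlacementCapacity`, it picks two random candidate DataNodes and keeps the less utilized one, without consulting rack information. The Python sketch below illustrates only that selection rule. It is an assumption to verify against the Ozone source, and `Node`, `choose_node`, and `choose_nodes` are hypothetical names for illustration, not Ozone APIs.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str          # carried along only to show it is never consulted
    used_ratio: float  # fraction of capacity in use, 0.0 to 1.0


def choose_node(candidates, rng):
    """Pick two distinct random candidates and keep the less utilized one."""
    if len(candidates) == 1:
        return candidates[0]
    first, second = rng.sample(candidates, 2)
    return first if first.used_ratio <= second.used_ratio else second


def choose_nodes(healthy, count, rng=None):
    """Select `count` distinct nodes by repeating the two-choice rule."""
    rng = rng or random.Random()
    if count > len(healthy):
        raise ValueError("not enough healthy nodes")
    pool = list(healthy)
    chosen = []
    for _ in range(count):
        node = choose_node(pool, rng)
        pool.remove(node)  # a node holds at most one replica
        chosen.append(node)
    return chosen


nodes = [
    Node("dn1", "/rack1", 0.10),
    Node("dn2", "/rack1", 0.20),
    Node("dn3", "/rack1", 0.30),
    Node("dn-full", "/rack2", 0.95),
]
picked = choose_nodes(nodes, 3, random.Random(0))
print([n.name for n in picked])
```

In this sketch, nothing prevents all three replicas from landing on `/rack1`, which is the behavior the **Interaction** bullet would need to rule out if the claim were true.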
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]