jsancio commented on code in PR #12597:
URL: https://github.com/apache/kafka/pull/12597#discussion_r967491694


##########
docs/ops.html:
##########
@@ -1373,6 +1373,27 @@ <h5 class="anchor-heading"><a id="ext4" 
class="anchor-link"></a><a href="#ext4">
     <li>delalloc: Delayed allocation means that the filesystem avoids 
allocating any blocks until the physical write occurs. This allows ext4 to 
allocate a large extent instead of smaller pages and helps ensure the data is 
written sequentially. This feature is great for throughput. It does seem to 
involve some locking in the filesystem which adds a bit of latency variance.
   </ul>
 
+  <h4 class="anchor-heading"><a id="replace_disk" class="anchor-link"></a><a 
href="#replace_disk">Replace KRaft Controller Disk</a></h4>
+  <p>When Kafka is configured to use KRaft instead of ZooKeeper, the 
controllers store the cluster metadata in the directory specified in 
<code>metadata.log.dir</code>, <code>log.dir</code> or <code>log.dirs</code>. 
See the documentation for <code>metadata.log.dir</code> for details.</p>
+
+  <p>If the data in the cluster metadata directory (disk) is lost, either 
because of a hardware failure or because the hardware needs to be replaced, 
care should be taken when provisioning the new controller node. The new 
controller node should not be formatted and started until the majority of the 
controllers have all of the committed data. To determine if the majority of the 
controllers have the committed data, run the 
<code>kafka-metadata-quorum.sh</code> tool to describe the replication 
status:</p>
+
+  <pre class="line-numbers"><code class="language-bash"> &gt; 
bin/kafka-metadata-quorum.sh --bootstrap-server broker_host:port describe 
--replication
+ NodeId  LogEndOffset    Lag     LastFetchTimestamp      LastCaughtUpTimestamp 
  Status
+ 1       25806           0       1662500992757           1662500992757         
  Leader
+ ...     ...             ...     ...                     ...                   
  ...
+  </code></pre>
+
+  <p>Check and wait until the <code>Lag</code> is small for the majority of the 
controllers. Check and wait until the <code>LastFetchTimestamp</code> and 
<code>LastCaughtUpTimestamp</code> are close to each other for the majority of 
the controllers. At this point it is safer to format the controller's metadata 
log directory. This can be done by running the <code>kafka-storage.sh</code> 
command.</p>
+
+  <pre class="line-numbers"><code class="language-bash"> &gt; 
bin/kafka-storage.sh format --cluster-id uuid --config 
server_properties</code></pre>
+
+  <p>If multiple log directories and a metadata directory are used, but only 
one of them is being replaced, it may be necessary to run 
<code>kafka-storage.sh format</code> with the <code>--ignore-formatted</code> 
option.</p>

Review Comment:
   I agree. I split this sentence into multiple sentences. Hopefully it is 
clearer. I think it is important to document this scenario so that the user has 
a way to get unblocked.
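
   For anyone who wants to script the "check and wait" step from the doc text, 
here is a minimal Python sketch of the majority check. It is not part of this 
PR. It assumes the six-column layout shown in the example output above, and it 
assumes that controllers appear with a Status of Leader or Follower while 
other nodes appear as Observer. The names <code>Replica</code>, 
<code>parse_replication</code> and <code>majority_caught_up</code>, and the 
thresholds <code>max_lag</code> and <code>max_gap_ms</code>, are hypothetical; 
the doc only says the lag should be "small" and the timestamps "close".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    node_id: int
    log_end_offset: int
    lag: int
    last_fetch_ms: int
    last_caught_up_ms: int
    status: str


def parse_replication(text: str) -> List[Replica]:
    """Parse data rows of `describe --replication` output.

    Assumes six whitespace-separated columns per row; the header row and
    any row whose first five fields are not integers are skipped.
    """
    replicas = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        try:
            numbers = [int(f) for f in fields[:5]]
        except ValueError:
            continue  # header or placeholder row
        replicas.append(Replica(*numbers, status=fields[5]))
    return replicas


def majority_caught_up(replicas: List[Replica],
                       max_lag: int = 10,
                       max_gap_ms: int = 5000) -> bool:
    """Return True if more than half of the controllers look caught up.

    max_lag and max_gap_ms are illustrative thresholds, not Kafka defaults.
    Rows with status Leader or Follower are treated as controllers (assumption).
    """
    voters = [r for r in replicas if r.status in ("Leader", "Follower")]
    if not voters:
        return False
    caught_up = [
        r for r in voters
        if r.lag <= max_lag
        and abs(r.last_fetch_ms - r.last_caught_up_ms) <= max_gap_ms
    ]
    return len(caught_up) > len(voters) // 2
```

   The output of the tool would be captured as text and passed to 
<code>parse_replication</code>; a caller could poll until 
<code>majority_caught_up</code> returns True before running 
<code>kafka-storage.sh format</code> on the replacement node.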


