slfan1989 commented on code in PR #7009:
URL: https://github.com/apache/ozone/pull/7009#discussion_r1752947669


##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/storage/TestContainerCommandsEC.java:
##########
@@ -809,11 +816,48 @@ private void createKeyAndWriteData(String keyString, OzoneBucket bucket,
         new HashMap<>())) {
       assertInstanceOf(KeyOutputStream.class, out.getOutputStream());
       for (int i = 0; i < numChunks; i++) {
+        // We generally wait until the data is written to the last chunk
+        // before attempting to trigger CloseContainer.
+        // We use an asynchronous approach for this trigger,
+        // aiming to ensure that closing the container does not interfere with the write operation.
+        // However, this process often needs to be executed multiple times before it takes effect.
+        if (i == numChunks - 1 && triggerRetry) {
+          triggerRetryByCloseContainer(out);
+        }
         out.write(inputChunks[i]);
       }
     }
   }
 
+  private void triggerRetryByCloseContainer(OzoneOutputStream out) {

Review Comment:
   In production, we hit a case where data was being written to a DN while the 
Container on that DN was closed. I designed the following steps to replicate 
that scenario: writing data and closing the Container concurrently. Because the 
entire process is asynchronous, we may need to run it several times to 
reproduce the issue seen in production.
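   
   To make the race easier to reason about, here is a minimal, self-contained sketch of the pattern described above: a close is triggered asynchronously just before the last chunk is written, and the close request is repeated until it takes effect. `CloseRaceSketch`, `FakeContainer`, and `requestsNeeded` are hypothetical names for illustration only; none of this is Ozone code or the code in this PR.
   
   ```java
   import java.util.concurrent.CompletableFuture;
   import java.util.concurrent.atomic.AtomicBoolean;
   import java.util.concurrent.atomic.AtomicInteger;

   public class CloseRaceSketch {

     /** Hypothetical stand-in for a datanode container; not an Ozone class. */
     public static final class FakeContainer {
       private final AtomicBoolean closed = new AtomicBoolean(false);
       private final AtomicInteger closeRequests = new AtomicInteger();
       private final int requestsNeeded;

       FakeContainer(int requestsNeeded) {
         this.requestsNeeded = requestsNeeded;
       }

       /** Models a close command that only takes effect after several attempts. */
       boolean requestClose() {
         if (closeRequests.incrementAndGet() >= requestsNeeded) {
           closed.set(true);
         }
         return closed.get();
       }

       /** Rejects writes once the container is closed. */
       void write(byte[] chunk) {
         if (closed.get()) {
           throw new IllegalStateException("ContainerState is CLOSED");
         }
       }

       boolean isClosed() {
         return closed.get();
       }
     }

     /** Outcome of one run; whether the last write is rejected depends on timing. */
     public static final class Result {
       public final int chunksWritten;
       public final boolean lastWriteRejected;
       public final boolean containerClosed;

       Result(int chunksWritten, boolean lastWriteRejected, boolean containerClosed) {
         this.chunksWritten = chunksWritten;
         this.lastWriteRejected = lastWriteRejected;
         this.containerClosed = containerClosed;
       }
     }

     public static Result run(int numChunks, int requestsNeeded) {
       FakeContainer container = new FakeContainer(requestsNeeded);
       CompletableFuture<Void> closer = CompletableFuture.completedFuture(null);
       int written = 0;
       boolean rejected = false;
       for (int i = 0; i < numChunks; i++) {
         if (i == numChunks - 1) {
           // Trigger the close on another thread so it races with the last write.
           closer = CompletableFuture.runAsync(() -> {
             while (!container.requestClose()) {
               Thread.yield();
             }
           });
         }
         try {
           container.write(new byte[1024]);
           written++;
         } catch (IllegalStateException e) {
           rejected = true;
         }
       }
       closer.join(); // wait until the close has taken effect
       return new Result(written, rejected, container.isClosed());
     }

     public static void main(String[] args) {
       Result r = run(5, 3);
       System.out.println("written=" + r.chunksWritten
           + " lastWriteRejected=" + r.lastWriteRejected
           + " closed=" + r.containerClosed);
     }
   }
   ```
   
   In this sketch the last write may or may not be rejected on a given run, which is why the scenario sometimes has to be repeated before the failure shows up.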
   
   > Example
   
   ```
   2024-09-11 07:46:20,917 [FixedThreadPoolWithAffinityExecutor-1-0] INFO  
container.IncrementalContainerReportHandler 
(AbstractContainerReportHandler.java:updateContainerState(312)) - Moving 
container #1 to CLOSED state, datanode 
4366cc44-4875-4f4d-8afb-5ac7ed9ba40d(bogon/192.168.1.16) reported CLOSED 
replica with index 4.
   07:46:20.914 [4366cc44-4875-4f4d-8afb-5ac7ed9ba40d-ChunkReader-0] ERROR 
DNAudit - user=null | ip=null | op=PUT_BLOCK {blockData=[blockId=conID: 1 
locID: 113750153625600007 bcsId: 0 replicaIndex: 4, size=2097152]} | ret=FAILURE
   java.lang.Exception: Requested operation not allowed as ContainerState is 
CLOSED
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:431)
 ~[classes/:?]
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:197)
 ~[classes/:?]
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
 [classes/:?]
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:196)
 [classes/:?]
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112)
 [classes/:?]
        at 
org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105)
 [classes/:?]
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49)
 [classes/:?]
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
 [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_412]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_412]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_412]
   2024-09-11 07:46:20,929 [client-write-TID-0] WARN  io.KeyOutputStream 
(ECKeyOutputStream.java:logStreamError(200)) - Put block failed: S S S F S
   2024-09-11 07:46:20,929 [client-write-TID-0] WARN  io.KeyOutputStream 
(ECKeyOutputStream.java:logStreamError(202)) - Failure for replica index: 4, 
DatanodeDetails: 4366cc44-4875-4f4d-8afb-5ac7ed9ba40d(bogon/192.168.1.16)
   java.io.IOException: Unexpected Storage Container Exception: 
org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: 
Requested operation not allowed as ContainerState is CLOSED
        at 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:815)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:351)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$1(ECBlockOutputStream.java:280)
        at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
        at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: 
org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: 
Requested operation not allowed as ContainerState is CLOSED
        at 
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:787)
        at 
org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:349)
        ... 7 more
   ```
   
   > Reconstruction Result
   
   We can see that during the recovery process, the BlockGroupLength differs 
across DNs: replica index 4 reports 3145728, while replica indexes 2, 3, and 5 
report 4194304.
   
   ```
   2024-09-11 07:46:21,375 [main] INFO  
reconstruction.ECReconstructionCoordinator 
(ECReconstructionCoordinator.java:logBlockGroupDetails(356)) - Block group 
details for conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: null. 
Replication Config EC{rs-3-2-1024k}. Calculated safe length: 3145728. 
   2024-09-11 07:46:21,375 [main] INFO  
reconstruction.ECReconstructionCoordinator 
(ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: 
conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 2 replica Index: 2 
block length: 1048576 block group length: 4194304 chunk list: 
     chunkNum: 1 length: 1048576 offset: 0
   2024-09-11 07:46:21,375 [main] INFO  
reconstruction.ECReconstructionCoordinator 
(ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: 
conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 3 replica Index: 3 
block length: 1048576 block group length: 4194304 chunk list: 
     chunkNum: 1 length: 1048576 offset: 0
   2024-09-11 07:46:21,375 [main] INFO  
reconstruction.ECReconstructionCoordinator 
(ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: 
conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 4 replica Index: 4 
block length: 1048576 block group length: 3145728 chunk list: 
     chunkNum: 1 length: 1048576 offset: 0
   2024-09-11 07:46:21,375 [main] INFO  
reconstruction.ECReconstructionCoordinator 
(ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: 
conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 5 replica Index: 5 
block length: 2097152 block group length: 4194304 chunk list: 
     chunkNum: 1 length: 1048576 offset: 0
     chunkNum: 2 length: 1048576 offset: 1048576
   ```
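   
   The mismatch in the log above can be checked mechanically. The following is a small sketch that groups replica indexes by the block group length they report, using the values copied from the log lines. `BlockGroupLengthCheck` and its methods are hypothetical helpers for illustration, not part of Ozone, and the sketch does not reproduce how `ECReconstructionCoordinator` calculates the safe length.
   
   ```java
   import java.util.ArrayList;
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   import java.util.TreeMap;

   public class BlockGroupLengthCheck {

     /** Groups replica indexes by the block group length each one reports. */
     public static Map<Long, List<Integer>> groupByLength(Map<Integer, Long> lengthByReplica) {
       Map<Long, List<Integer>> groups = new TreeMap<>();
       for (Map.Entry<Integer, Long> e : lengthByReplica.entrySet()) {
         groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
       }
       return groups;
     }

     /** Block group lengths per replica index, copied from the log above. */
     public static Map<Integer, Long> loggedLengths() {
       Map<Integer, Long> lengths = new LinkedHashMap<>();
       lengths.put(2, 4194304L);
       lengths.put(3, 4194304L);
       lengths.put(4, 3145728L);
       lengths.put(5, 4194304L);
       return lengths;
     }

     public static void main(String[] args) {
       Map<Long, List<Integer>> groups = groupByLength(loggedLengths());
       // More than one group means the replicas disagree.
       System.out.println("consistent = " + (groups.size() == 1));
       groups.forEach((len, replicas) ->
           System.out.println(len + " -> replicas " + replicas));
       // Prints:
       // consistent = false
       // 3145728 -> replicas [4]
       // 4194304 -> replicas [2, 3, 5]
     }
   }
   ```
   
   In this run, the smaller value reported by replica index 4 equals the "Calculated safe length" in the first log line.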



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

