slfan1989 commented on code in PR #7009:
URL: https://github.com/apache/ozone/pull/7009#discussion_r1752947669
##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/hdds/scm/storage/TestContainerCommandsEC.java:
##########
@@ -809,11 +816,48 @@ private void createKeyAndWriteData(String keyString,
OzoneBucket bucket,
new HashMap<>())) {
assertInstanceOf(KeyOutputStream.class, out.getOutputStream());
for (int i = 0; i < numChunks; i++) {
+ // We generally wait until the data is written to the last chunk
+ // before attempting to trigger CloseContainer.
+ // We use an asynchronous approach for this trigger, aiming to ensure
+ // that closing the container does not interfere with the write operation.
+ // However, this process often needs to be executed multiple times
+ // before it takes effect.
+ if (i == numChunks - 1 && triggerRetry) {
+ triggerRetryByCloseContainer(out);
+ }
out.write(inputChunks[i]);
}
}
}
+ private void triggerRetryByCloseContainer(OzoneOutputStream out) {
Review Comment:
In the production environment, we encountered a situation where data was
being written to a certain DN while the container on that DN was closed. I
designed the following steps to replicate the scenario we see in production:
write data and close the container concurrently. Since the entire process is
asynchronous, we may need to execute it multiple times to reproduce the issue
seen in production.
> Example
```
2024-09-11 07:46:20,917 [FixedThreadPoolWithAffinityExecutor-1-0] INFO container.IncrementalContainerReportHandler (AbstractContainerReportHandler.java:updateContainerState(312)) - Moving container #1 to CLOSED state, datanode 4366cc44-4875-4f4d-8afb-5ac7ed9ba40d(bogon/192.168.1.16) reported CLOSED replica with index 4.
07:46:20.914 [4366cc44-4875-4f4d-8afb-5ac7ed9ba40d-ChunkReader-0] ERROR DNAudit - user=null | ip=null | op=PUT_BLOCK {blockData=[blockId=conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 4, size=2097152]} | ret=FAILURE
java.lang.Exception: Requested operation not allowed as ContainerState is CLOSED
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:431) ~[classes/:?]
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:197) ~[classes/:?]
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89) [classes/:?]
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:196) [classes/:?]
    at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:112) [classes/:?]
    at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:105) [classes/:?]
    at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:262) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:49) [classes/:?]
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:329) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:314) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:833) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [ratis-thirdparty-misc-1.0.6.jar:1.0.6]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_412]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_412]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_412]
2024-09-11 07:46:20,929 [client-write-TID-0] WARN io.KeyOutputStream (ECKeyOutputStream.java:logStreamError(200)) - Put block failed: S S S F S
2024-09-11 07:46:20,929 [client-write-TID-0] WARN io.KeyOutputStream (ECKeyOutputStream.java:logStreamError(202)) - Failure for replica index: 4, DatanodeDetails: 4366cc44-4875-4f4d-8afb-5ac7ed9ba40d(bogon/192.168.1.16)
java.io.IOException: Unexpected Storage Container Exception: org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: Requested operation not allowed as ContainerState is CLOSED
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:815)
    at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:351)
    at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.lambda$executePutBlock$1(ECBlockOutputStream.java:280)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: Requested operation not allowed as ContainerState is CLOSED
    at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:787)
    at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.validateResponse(ECBlockOutputStream.java:349)
    ... 7 more
```
> Reconstruction Result
We can see that during the recovery process, the block group length differs
across the DNs: replica index 4 reports 3145728 while the other replicas
report 4194304.
```
2024-09-11 07:46:21,375 [main] INFO reconstruction.ECReconstructionCoordinator (ECReconstructionCoordinator.java:logBlockGroupDetails(356)) - Block group details for conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: null. Replication Config EC{rs-3-2-1024k}. Calculated safe length: 3145728.
2024-09-11 07:46:21,375 [main] INFO reconstruction.ECReconstructionCoordinator (ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 2 replica Index: 2 block length: 1048576 block group length: 4194304 chunk list:
chunkNum: 1 length: 1048576 offset: 0
2024-09-11 07:46:21,375 [main] INFO reconstruction.ECReconstructionCoordinator (ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 3 replica Index: 3 block length: 1048576 block group length: 4194304 chunk list:
chunkNum: 1 length: 1048576 offset: 0
2024-09-11 07:46:21,375 [main] INFO reconstruction.ECReconstructionCoordinator (ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 4 replica Index: 4 block length: 1048576 block group length: 3145728 chunk list:
chunkNum: 1 length: 1048576 offset: 0
2024-09-11 07:46:21,375 [main] INFO reconstruction.ECReconstructionCoordinator (ECReconstructionCoordinator.java:logBlockGroupDetails(387)) - Block Data for: conID: 1 locID: 113750153625600007 bcsId: 0 replicaIndex: 5 replica Index: 5 block length: 2097152 block group length: 4194304 chunk list:
chunkNum: 1 length: 1048576 offset: 0
chunkNum: 2 length: 1048576 offset: 1048576
```
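The "Calculated safe length: 3145728" above is consistent with taking the minimum block group length acknowledged across the replicas (replica 4, which hit the CLOSED container, only acknowledged 3145728 while the others report 4194304). A sketch of that reading, where the helper below is illustrative rather than the coordinator's actual code:

```java
import java.util.Arrays;

// Illustrative helper, assuming the coordinator treats the smallest
// block group length reported by any replica's PutBlock metadata as the
// safe length to reconstruct up to. This matches the log above, where
// replicas 2, 3 and 5 report 4194304 but replica 4 reports 3145728.
public class SafeLengthSketch {
  public static long safeBlockGroupLength(long[] reportedLengths) {
    return Arrays.stream(reportedLengths).min().orElse(0L);
  }
}
```

For the values in the log, `safeBlockGroupLength(new long[] {4194304L, 4194304L, 3145728L, 4194304L})` yields 3145728, matching the coordinator's calculated safe length.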
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]