Are you seeing this for Ratis writes or only EC? Have you changed the EC pipeline limit to a value higher than 5? I wonder if a smaller number of open write pipelines could contribute to this problem too.

On Thu, Sep 8, 2022 at 3:35 AM Kaijie Chen <c...@apache.org> wrote:

> Thanks Stephen for explaining,
>
> > I have a few thoughts on this, but my knowledge may be outdated.
> >
> > 1. During putBlock, the DN notices the usage has gone beyond 90%, so it
> > sends a close command to SCM via its heartbeat.
> >
> > 2. SCM closes the container on the SCM side. At this point, SCM will not
> > allocate any more blocks to it but there may be some currently being
> > written, previously allocated.
>
> Even if this works correctly, it is possible for too many blocks to be
> allocated between 2 heartbeats.
>
> > 3. The 5GB container limit is a soft-limit - it's ok for a container to go
> > beyond this size.
>
> We observed all closed containers are less than 5GB on disk.
>
> > 4. It was my understanding, although I cannot find the code right now, that
> > there is some "grace period" for inflight blocks to complete writing when a
> > container starts to close. If we stop allocating blocks in SCM because the
> > close process has been triggered, then the grace period should allow most
> > inflight blocks to complete writing.
> >
> > Does the grace period still exist, and if so, is it not helping with this
> > problem?
>
> I'm not sure, but we can see a lot of errors like this in the client log.
> Please see the attachment for more details.
>
> 2022-09-06 15:43:57,044 [pool-2-thread-63] WARN io.KeyOutputStream: Rewriting stripe to new block group
> 2022-09-06 15:43:57,058 [pool-2-thread-55] WARN io.KeyOutputStream: EC stripe write failed: S S S S S S S S S S S S F S
> 2022-09-06 15:43:57,058 [pool-2-thread-55] WARN io.KeyOutputStream: Failure for replica index: 13, DatanodeDetails: ...
> java.io.IOException: Unexpected Storage Container Exception: org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: Requested operation not allowed as ContainerState is CLOSED
>     at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:629)