Are you seeing this for Ratis writes too, or only for EC? Have you raised
the EC pipeline limit above 5? I wonder whether having fewer open write
pipelines could contribute to this problem too.
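
To make the pipeline question concrete, here is a rough back-of-envelope
sketch. It assumes block allocations are spread evenly across the open
pipelines, which is a simplification and not necessarily how SCM picks a
pipeline. The allocation rate and heartbeat interval are hypothetical
numbers, not measurements from this cluster.

```python
def blocks_per_container(alloc_per_sec: float, heartbeat_sec: float,
                         open_pipelines: int) -> float:
    """Estimate blocks landing on one open container between two heartbeats.

    Assumption (not from the thread): allocations are spread evenly
    across the open pipelines.
    """
    if open_pipelines <= 0:
        raise ValueError("open_pipelines must be positive")
    return alloc_per_sec * heartbeat_sec / open_pipelines


# Hypothetical workload: 10 block allocations per second,
# 30 seconds between heartbeats.
for pipelines in (5, 20):
    print(pipelines, blocks_per_container(10, 30, pipelines))
```

Under these assumptions, the fewer pipelines that are open, the more
blocks pile onto each container before a close request can take effect.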

On Thu, Sep 8, 2022 at 3:35 AM Kaijie Chen <c...@apache.org> wrote:

> Thanks Stephen for explaining,
>
>  > I have a few thoughts on this, but my knowledge may be outdated.
>  >
>  > 1. During putBlock, the DN notices the usage has gone beyond 90%, so it
>  > sends a close command to SCM via its heartbeat.
>  >
>  > 2. SCM closes the container on the SCM side. At this point, SCM will not
>  > allocate any more blocks to it, but some previously allocated blocks
>  > may still be in the middle of being written.
>
> Even if this works correctly, it is possible for too many blocks to be
> allocated between two heartbeats.
>
>  > 3. The 5GB container limit is a soft limit - it's OK for a container to
>  > go beyond this size.
>
> We observed that all closed containers are smaller than 5GB on disk.
>
>  > 4. It was my understanding, although I cannot find the code right now,
>  > that there is some "grace period" for inflight blocks to complete
>  > writing when a container starts to close. If we stop allocating blocks
>  > in SCM because the close process has been triggered, then the grace
>  > period should allow most inflight blocks to complete writing.
>  >
>  > Does the grace period still exist, and if so, is it not helping with
>  > this problem?
>
> I'm not sure, but we can see a lot of errors like this in the client log.
> Please see the attachment for more details.
>
>   2022-09-06 15:43:57,044 [pool-2-thread-63] WARN io.KeyOutputStream: Rewriting stripe to new block group
>   2022-09-06 15:43:57,058 [pool-2-thread-55] WARN io.KeyOutputStream: EC stripe write failed: S S S S S S S S S S S S F S
>   2022-09-06 15:43:57,058 [pool-2-thread-55] WARN io.KeyOutputStream: Failure for replica index: 13, DatanodeDetails: ...
>   java.io.IOException: Unexpected Storage Container Exception: org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException: Requested operation not allowed as ContainerState is CLOSED
>           at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:629)
>
