I believe the flow is:
1. Datanode notices the container is near full.
2. Datanode sends close container action to SCM on its next heartbeat.
3. SCM closes the container and sends a close container command on the
heartbeat response.
4. Datanodes get the response and close the container. If it is a Ratis
container, the leader will send the close via Ratis.
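
A minimal sketch of the ordering above, as I understand it. This is a toy
model, not Ozone code: the class, enum, and method names are all made up for
illustration. It only records which side still treats the container as
writable once each step has completed.

```java
// Hypothetical toy model of the four steps; none of these names are Ozone classes.
class CloseFlowSketch {
    // The four steps of the flow, in order.
    enum Step { DN_NEAR_FULL, DN_HEARTBEAT_ACTION, SCM_CLOSE_AND_COMMAND, DN_CLOSE }

    // True if SCM may still allocate blocks in the container once `completed` is done.
    // Assumption from the flow above: SCM only stops at step 3.
    static boolean scmStillAllocates(Step completed) {
        return completed.ordinal() < Step.SCM_CLOSE_AND_COMMAND.ordinal();
    }

    // True if the datanode still accepts writes once `completed` is done.
    // Assumption from the flow above: the datanode only closes at step 4.
    static boolean datanodeAcceptsWrites(Step completed) {
        return completed.ordinal() < Step.DN_CLOSE.ordinal();
    }

    public static void main(String[] args) {
        for (Step s : Step.values()) {
            System.out.println(s + ": scmAllocates=" + scmStillAllocates(s)
                + ", dnAcceptsWrites=" + datanodeAcceptsWrites(s));
        }
    }
}
```

In this model SCM is still allocating after steps 1 and 2, and the only
interval in which SCM has stopped while the datanode still accepts writes is
the one between steps 3 and 4.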

There is a "grace period" of sorts between steps 1 and 2, but it does not
help here because SCM keeps allocating blocks in this container until
step 3. Perhaps a short pause between steps 3 and 4 would help, either on
the SCM side or the datanode side. This would provide a "grace period"
between when SCM stops allocating blocks for the container and when the
container is actually closed. I'm not sure exactly how this would be
implemented given the current code, but it seems like a simple option we
should try before other, more complicated solutions.
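
To make the idea concrete, here is a rough sketch of the timing argument. It
is a toy model and not a proposal for the actual implementation: every name
and number is hypothetical, and it assumes the failures come from writes to
blocks that were allocated just before step 3 but reach the datanode after
the container has closed.

```java
// Hypothetical toy model of the proposed pause; not Ozone code.
class GracePeriodSketch {
    // Time at which the datanode closes the container: the moment SCM stops
    // allocating blocks (step 3) plus the proposed pause before step 4.
    static long closeTimeMs(long scmStopMs, long graceMs) {
        if (graceMs < 0) {
            throw new IllegalArgumentException("graceMs must be >= 0");
        }
        return scmStopMs + graceMs;
    }

    // A write succeeds in this model if it reaches the datanode strictly
    // before the container closes. allocMs is when the client got its block.
    static boolean writeSucceeds(long allocMs, long writeDelayMs,
                                 long scmStopMs, long graceMs) {
        if (allocMs >= scmStopMs) {
            throw new IllegalArgumentException(
                "SCM no longer allocates blocks at " + allocMs);
        }
        long arrivalMs = allocMs + writeDelayMs;
        return arrivalMs < closeTimeMs(scmStopMs, graceMs);
    }

    public static void main(String[] args) {
        // Block allocated 10 ms before SCM stops; write arrives 50 ms later.
        System.out.println("no pause:     " + writeSucceeds(990, 50, 1000, 0));
        System.out.println("100 ms pause: " + writeSucceeds(990, 50, 1000, 100));
    }
}
```

In this model the pause only has to cover the time a client needs to start
writing a block it was given just before step 3; a write that takes longer
than the pause to arrive would still fail.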

Ethan

On Thu, Sep 8, 2022 at 4:04 AM Kaijie Chen <c...@apache.org> wrote:

>  > Are you seeing this for Ratis writes or only EC? Have you changed the EC
>  > pipeline limit to a higher value than 5? I wonder if a lesser number of
>  > open write pipelines could contribute to this problem too.
>
> This exception is reproducible in both RATIS and EC.
>
> https://paste.ubuntu.com/p/NjpQ64PYfR/plain/
>
> EC pipeline limit was set to 30 in the previous email.
> Increasing pipeline will help, but it doesn't solve the problem from the
> root.
>
