I believe the flow is:

1. Datanode notices the container is near full.
2. Datanode sends a close container action to SCM on its next heartbeat.
3. SCM closes the container and sends a close container command on the heartbeat response.
4. Datanodes get the response and close the container. If it is a Ratis container, the leader will send the close via Ratis.

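To make the ordering concrete, here is a rough toy model of the four steps as I understand them. It is only a sketch: `Scm`, `Datanode`, `near_full_ratio`, and the other names are made up for illustration and are not Ozone classes or config keys, and it models a single datanode replica with no Ratis involved.

```python
from dataclasses import dataclass, field


@dataclass
class Scm:
    """Toy SCM (hypothetical): tracks which containers it still allocates for."""
    open_containers: set

    def allocate_block(self, container_id):
        # SCM issues blocks only for containers it still considers open.
        return container_id in self.open_containers

    def handle_heartbeat(self, close_actions):
        # Step 3: close each reported container on the SCM side and
        # return close commands in the heartbeat response.
        self.open_containers -= set(close_actions)
        return list(close_actions)


@dataclass
class Datanode:
    """Toy datanode (hypothetical) holding one replica of each container."""
    capacity: int
    near_full_ratio: float = 0.9  # assumed threshold, not an Ozone default
    used: dict = field(default_factory=dict)
    closed: set = field(default_factory=set)
    pending_close_actions: list = field(default_factory=list)

    def write(self, container_id, size):
        if container_id in self.closed:
            return False  # writes to a closed container are rejected
        self.used[container_id] = self.used.get(container_id, 0) + size
        # Step 1: notice the container is near full and queue a close action.
        if (self.used[container_id] >= self.near_full_ratio * self.capacity
                and container_id not in self.pending_close_actions):
            self.pending_close_actions.append(container_id)
        return True

    def heartbeat(self, scm):
        # Step 2: report queued close actions on the next heartbeat.
        actions, self.pending_close_actions = self.pending_close_actions, []
        commands = scm.handle_heartbeat(actions)
        # Step 4: apply the close commands from the heartbeat response.
        self.closed.update(commands)


def run_demo():
    scm = Scm(open_containers={1})
    dn = Datanode(capacity=100)
    dn.write(1, 95)                           # crosses the threshold (step 1)
    allocated_before = scm.allocate_block(1)  # SCM still issues blocks here
    dn.heartbeat(scm)                         # steps 2 to 4, back to back
    allocated_after = scm.allocate_block(1)   # SCM has stopped allocating
    late_write_ok = dn.write(1, 1)            # write using an earlier block
    return allocated_before, allocated_after, late_write_ok
```

In this model `run_demo()` returns `(True, False, False)`: SCM keeps allocating after step 1, stops only at step 3, and a write that uses a block handed out before step 3 is rejected once step 4 has run.
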
There is a "grace period" of sorts between steps 1 and 2, but it does not help here because SCM does not stop issuing blocks for this container until after step 3. Perhaps a pause between steps 3 and 4 would help, on either the SCM or the datanode side. This would provide a "grace period" between when SCM stops allocating blocks for the container and when the container is actually closed. I'm not sure exactly how this would be implemented given the current code, but it seems like a simple option we should try before other, more complicated solutions.

Ethan

On Thu, Sep 8, 2022 at 4:04 AM Kaijie Chen <c...@apache.org> wrote:

> > Are you seeing this for Ratis writes or only EC? Have you changed the EC
> > pipeline limit to a higher value than 5? I wonder if a lesser number of
> > open write pipelines could contribute to this problem too.
>
> This exception is reproducible in both RATIS and EC.
>
> https://paste.ubuntu.com/p/NjpQ64PYfR/plain/
>
> EC pipeline limit was set to 30 in the previous email.
> Increasing pipeline will help, but it doesn't solve the problem from the
> root.