I am not sure I understand every bit of the discussion, but let me add some thoughts here as well.
Waiting on clients to finish writes and only then checking whether the container is full sounds like a bad idea to me: it would tie the number of writes that can be done in parallel to the available space in the open containers, and with that it could decrease write performance. The best solution would be the simplest one, and someone may already have suggested something like this, but here is my idea for a very simple logic.

Some background to establish where I am coming from:
- SCM, as I understand it, knows how much space is still unused in a container, and it can track block allocations against the remaining space, as it is the service that actually allocates blocks.
- A container is a directory on a DataNode, so the 5GB limit is just an arbitrary number we came up with as the container size. If a container grows to 10GB, then that is just that; there are no further implications of a larger container (except for re-replication and validating the health of the container).
- Our 256MB block size is also an arbitrary number. It is not based on any science or on data from clusters about average file sizes; we just made it up by doubling the value HDFS used. It is simply the unit of allocation and an upper limit on one unit of uniquely identifiable data.

Now, if SCM holds a counter per container with the number of open allocations, and allows over-allocation by 50% of the remaining space in the container, we will end up with containers somewhat larger than 5GB in the average case, but we get a simple system that ensures allocated blocks can be finished by the clients. Here is how:
- At allocation, SCM checks whether the available space and the number of open allocations allow a new block to be allocated in the container. If not, it checks another open container; if no open container can allocate more blocks, it creates a new container.
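To make the bookkeeping concrete, here is a rough Java sketch of the allocation check I have in mind. Everything in it (the class name, `tryAllocateBlock`, `blockFinished`) is made up for illustration; it is not the actual SCM code:

```java
// Rough sketch only; all names here are hypothetical, not actual SCM code.
public class ContainerAllocationTracker {
  private static final long SOFT_LIMIT = 5L * 1024 * 1024 * 1024;  // 5GB container soft limit
  private static final long BLOCK_SIZE = 256L * 1024 * 1024;       // 256MB block unit

  private long usedBytes;       // bytes already committed to the container
  private int openAllocations;  // blocks allocated but not yet finalized

  // Admit a new block only if the worst-case container size (committed data
  // plus every outstanding allocation written out in full) stays within the
  // soft limit plus 50% of the currently remaining space.
  public synchronized boolean tryAllocateBlock() {
    long remaining = Math.max(0, SOFT_LIMIT - usedBytes);
    long worstCase = usedBytes + (openAllocations + 1) * BLOCK_SIZE;
    if (worstCase > SOFT_LIMIT + remaining / 2) {
      return false;  // caller tries another open container, or creates a new one
    }
    openAllocations++;
    return true;
  }

  // Called when the DN reports the final putBlock (via ICR or heartbeat),
  // or when the abandoned-block cleanup routine frees a half-written block.
  public synchronized void blockFinished(long bytesWritten) {
    openAllocations--;
    usedBytes += bytesWritten;
  }
}
```

On an empty container this admits 30 outstanding 256MB allocations (30 × 256MB = 7.5GB worst case), which is where the 7.5GB worst-case container size below comes from.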
(A block can be allocated if the theoretical size of the container, considering the open allocations, cannot exceed the 5GB soft limit by more than 50% of the remaining space in the container.)
- At the final putBlock, the DN reports in the ICR or in the heartbeat that the block write is finished in container X, so that SCM can decrement the counter.
- If a block is abandoned, (afaik) we have a routine that deletes the half-written block data; that routine can also ask SCM to decrement the counter while freeing up the space.
- If a block has been abandoned for over a week (the token lifetime), we can decrement the counter (though I believe this should already be handled by abandoned block detection).
- If the container's open allocations drop enough that the container size can no longer exceed the 5GB soft limit, SCM can open the container up for allocations again.
- If all open allocations are closed (the counter is 0) and the container exceeds the 5GB soft limit, SCM closes the container.

This logic seems simple enough. It does not require a lease, as we do not have one today either and we somehow (as far as I know) handle abandoned blocks. It also does not require any big changes in the protocols; the only things needed are to store the counter in SCM's container data structure and to report from the DN to SCM that the block had its final putBlock call.

In case of a DDoS-style block allocation, we will end up with enough containers to hold the data, and we can have 7.5GB containers in the worst case, while in the general case we will most likely have containers around 5GB. Of course, if needed, we can introduce a hard limit on the number of open containers (pipelines), and if we exceed that, the client would need to wait to allocate more...
(some form of QoS mentioned in the doc), but I am not really a fan of this idea, as it introduces possibly large delays, and not just for the rogue client. Do we know what the consequence would be of having, say, 1000 open containers on a 10-node cluster? (That would mean ~5TB of open allocations...)

Please let me know if this idea fails to account for some problems. I drafted it quickly to come up with something fairly simple that I believe helps solve the problem with the least amount of disruption, but I possibly have not thought through all the failure scenarios, and I have not checked what is actually in the code at the moment. I just wanted to share my idea quickly and validate it with folks who know the actual code in the codebase better, to see whether it is a feasible alternative.

Pifta

anu engineer <anu.engin...@gmail.com> wrote (on Fri, Sep 9, 2022, 19:10):

> Extending the same thought from Steven. If you are going to do a small
> delay, it is better to do it via a Lease.
>
> So SCM could offer a lease for 60 seconds, with a provision to reacquire
> the lease one more time.
> This does mean that a single container inside the data node technically
> could become larger than 5GB (but that is possible even today).
>
> I do think a lease or a timeout based approach (as suggested by Steven)
> might be easier than pre-allocating blocks.
>
> Thanks
> Anu
>
>
> On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
> <sodonn...@cloudera.com.invalid> wrote:
>
> > > 4. Datanode wait until no write commands to this container, then close
> > it.
> >
> > This could be done on SCM, with a simple delay. Ie hold back the close
> > commands for a "normal" close for some configurable amount of time. Eg if
> > we hold for 60 seconds, it is likely almost all blocks will get written. If
> > a very small number fail, it is likely OK.
> >
> > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen <c...@apache.org> wrote:
> >
> > > Thanks Ethan.
> > > Yes this could be a simpler solution.
> > > The main idea is allowing container size limit to be exceeded,
> > > to ensure all allocated blocks can be finished.
> > >
> > > We can change it to something like this:
> > > 1. Datanode notices the container is near full.
> > > 2. Datanode sends close container action to SCM immediately.
> > > 3. SCM closes the container and stops allocating new blocks in it.
> > > 4. Datanode wait until no write commands to this container, then close it.
> > >
> > > It's still okay to wait for the next heartbeat in step 2.
> > > Step 4 is a little bit tricky, we need a lease or timeout to determine the
> > > time.
> > >
> > > Kaijie
> > >
> > > ---- On Fri, 09 Sep 2022 08:54:34 +0800 Ethan Rose wrote ---
> > > > I believe the flow is:
> > > > 1. Datanode notices the container is near full.
> > > > 2. Datanode sends close container action to SCM on its next heartbeat.
> > > > 3. SCM closes the container and sends a close container command on the
> > > > heartbeat response.
> > > > 4. Datanodes get the response and close the container. If it is a Ratis
> > > > container, the leader will send the close via Ratis.
> > > >
> > > > There is a "grace period" of sorts between steps 1 and 2, but this does
> > > > not help the situation because SCM does not stop issuing blocks to this
> > > > container until after step 3. Perhaps some amount of pause between steps 3
> > > > and 4 would help, either on the SCM or datanode side. This would provide a
> > > > "grace period" between when SCM stops allocating blocks for the container
> > > > and when the container is actually closed. I'm not sure exactly how this
> > > > would be implemented in the code given the current setup, but it seems like
> > > > a simple option we should try before other more complicated solutions.
> > > >
> > > > Ethan
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > >

-- 
Pifta
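PS: the open/close decision at the end of my bullet list could be as simple as the following sketch (again, hypothetical names and a sketch only, not the real SCM code):

```java
// Hypothetical sketch of the close/reopen decision described in the list above.
public final class ContainerStateHelper {
  private static final long SOFT_LIMIT = 5L * 1024 * 1024 * 1024;  // 5GB soft limit
  private static final long BLOCK_SIZE = 256L * 1024 * 1024;       // 256MB block unit

  // Close once every open allocation has been finalized (counter == 0)
  // and the container already exceeds the soft limit.
  public static boolean shouldClose(long usedBytes, int openAllocations) {
    return openAllocations == 0 && usedBytes > SOFT_LIMIT;
  }

  // Reopen for allocations once the outstanding allocations can no longer
  // push the container past the soft limit, even if all are written in full.
  public static boolean canReopen(long usedBytes, int openAllocations) {
    return usedBytes + openAllocations * BLOCK_SIZE <= SOFT_LIMIT;
  }
}
```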