As I said in my previous email, I prefer working on these kinds of issues by creating detailed models so I can reason about them. It is also quite possible that I am not understanding the current issue in its full depth.
For what it is worth, I think your solution is correct, but it only solves the over-allocation case: the case where everyone is trying to write more than the container size. Now apply the same logic you proposed -- counting the total number of blocks you can write in a container -- to the under-allocation case. Here is an example. Let us start from the assumption that clients are well behaved and chunk their blocks at the size we are proposing.

Assumptions:
1. Block size is 64MB.
2. Container size is 5GB.
3. Therefore, the number of blocks you can write is 80 blocks per container.
4. Imagine a scenario where a client writes single bytes to files; that is, each I/O is a single byte into a different file.
5. Further imagine that all those small files land in the same container. What do you have now? A container with 80 bytes (ignoring the metadata for now).
6. If you only keep the allocated-block count, I am afraid you might stop allocating blocks as soon as you hit 80 blocks, which in this case would be a measly 80 bytes.

Is that an ideal scenario? No, and that is the reason Ozone does not just go by counting allocated blocks, but instead tries to grow containers to an expected size. In short, the average number of blocks and the size of containers depend on your workload, and on the BIG promise we made to ourselves that we would get away from the small-file problem of HDFS. Now take this case to the extreme: if all files in a cluster are one-byte files, will the number of containers explode? If we follow the "let us count the blocks" algorithm, yes it will, because for every 80 files we will allocate a new container.
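The under-allocation scenario above can be put in numbers with a small toy model (not Ozone code; the 64MB block size and 5GB container size are taken from the example, giving 5 GB / 64 MB = 80 block slots per container):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024              # 64 MB block size (from the example)
CONTAINER_SIZE = 5 * 1024**3               # 5 GB container
MAX_BLOCKS = CONTAINER_SIZE // BLOCK_SIZE  # 80 block slots per container

def count_only_policy(num_files: int, bytes_per_file: int):
    """Model a 'count the allocated blocks' policy: every file burns one
    block slot no matter how few bytes it actually writes."""
    containers = math.ceil(num_files / MAX_BLOCKS)
    bytes_in_full_container = MAX_BLOCKS * bytes_per_file
    utilization = bytes_in_full_container / CONTAINER_SIZE
    return containers, utilization

# One million 1-byte files: 12,500 containers, each "full" at only 80 bytes.
containers, utilization = count_only_policy(1_000_000, 1)
print(containers, utilization)
```

The point of the sketch is the utilization figure: under a count-only policy a container can be declared full while holding a vanishingly small fraction of its 5GB capacity.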
I know we are dealing with a very delicate balance, and no algorithm will truly make us happy; it is a question of which trade-offs make us least sad. If we really want to pack as much data as possible into containers, and reduce the number of containers, IMHO we need to go beyond block counts. While the case I am illustrating is extreme, I wanted to show that we have to solve for both under-allocation and over-allocation in the containers. Again, this is a theoretical case which might never happen, and if you are confident that the block-counting algorithm will work, please go ahead. Just my 2 cents.

Thanks
Anu

On Wed, Sep 14, 2022 at 10:19 AM István Fajth <fapi...@gmail.com> wrote:

> Thank you for the comments Anu.
>
> I think if we choose wisely how we enable over-allocation, we will not go
> above 5.5GB in almost any case; let me reason about this later on.
>
> I would like to get some more clarity on the lease, as that also seems to
> be easy, but I feel that it is orthogonal to our current problem and does
> not solve it any more than a "simple delay" would.
> Let me share why I think so, and please tell me why I am wrong if I do :)
> The current problem, as I understand it, is that some clients allocate
> blocks around the same time from the same container and write a full
> block, or at least a good amount of data compared to the block size; but
> because there is no check whether the container has enough space for all
> the allocated blocks, some clients fail to finish the write, as the
> container gets closed while they are still writing data to it.
> An arbitrary delay might help somewhat; a lease that can be renewed for 2
> minutes might help somewhat; but a slow client will still fail the write,
> and we end up with dead data in the containers, which I believe (but am
> not sure) will eventually be cleaned by some background process that
> checks for unreferenced blocks and removes them from containers, while
> the client has to rewrite the whole data.
>
> In this situation, if we do not calculate how many blocks we may allocate
> in an open container, there will still be cases where a client cannot
> finish the write in time, while we can end up with arbitrarily large
> containers, as with a lease there will be a lot of writes that may finish
> in time without any constraint on size. (Imagine a scenario where a
> distributed client job allocates 10k blocks and then starts to write into
> all of them on a cluster where we have 500 open containers.)
>
> Now if we allow over-allocation by 50% of the free space or by 2 blocks
> (whichever is more), then we end up with 5.5GB containers in most cases,
> but we also have an upper bound, as blocks in an empty container can be
> allocated up to 7.5GB worth of total data if all blocks are fully
> written.
> The average case would still be 5.5GB: we do not see this problem often
> so far, so in most cases data is written fine and the container is closed
> afterwards, at least based on the problem description in the shared
> document. Why am I saying this? If things go smoothly and normally, there
> are two cases. If 50% of the remaining space is bigger than 2 blocks, we
> will go beyond 5.5GB with allocation only if all the blocks allocated
> last are fully written. If the remaining space is less than 2 blocks,
> then we will over-allocate by 2 blocks, which, if fully written, will
> leave the container closed with ~5.5GB of data.
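The bound described in the quoted paragraphs can be sanity-checked with a few lines. This is a sketch with invented names, assuming the 256MB block size and 5GB soft limit from the thread: the total allocatable data is the free space plus max(50% of free space, 2 blocks), so an empty container tops out at 7.5GB and a nearly full one at roughly 5.5GB.

```python
BLOCK_MB = 256            # assumed block size
SOFT_LIMIT_MB = 5 * 1024  # 5 GB container soft limit

def max_total_allocation_mb(used_mb: int) -> int:
    """Worst-case container size if every allocatable block is fully
    written: used data, plus free space, plus an over-allocation
    allowance of max(50% of free space, 2 blocks)."""
    free = max(SOFT_LIMIT_MB - used_mb, 0)
    over = max(free // 2, 2 * BLOCK_MB)
    return used_mb + free + over

# Empty container: 5120 + 2560 = 7680 MB, i.e. the 7.5 GB worst case.
print(max_total_allocation_mb(0))
# Nearly full container: the allowance falls back to 2 blocks, ~5.5 GB.
print(max_total_allocation_mb(SOFT_LIMIT_MB - 100))
```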
> Now if you consider the number of full blocks compared to the number of
> partial blocks on a real system, the fact that the largest size of a
> container is still limited to 7.5GB, and that the average should not go
> much beyond the original 5GB, I think this is enough to ensure that
> writes can be finished while containers do not grow beyond the limit too
> much.
>
> On the other side, with leases we can have far larger containers in the
> extreme case, but we do not need to open containers right away when there
> is a flood of block allocations (though sooner or later we need to open
> new ones).
>
> So the options seem to be:
> - containers that might be heavily over-allocated, and writes limited by
> the lease timeout, some of which may still fail the write in the same
> way;
> - containers with a size close to our current limit, with the burden of
> opening containers during a flood of block allocations, plus metadata and
> bookkeeping around the number of open allocations in SCM to limit the
> number of allocations that can go to one container.
>
> The first one seems simple, but does not solve the problem, just hides it
> a bit more via delays in container closure. The second one seems to solve
> the problem, as it closes the container when all writes have been
> finished or abandoned, while limiting the number of writes that can go
> into a container, at the price of having more open containers (for which
> I still don't know what effects too many open containers might have).
>
> Pifta
>
> anu engineer <anu.engin...@gmail.com> wrote (on Tue, Sep 13, 2022,
> 19:24):
>
> > What I am arguing is that if you are going to use time, use it in a way
> > that plays well with a distributed system.
> >
> > That is where leases come from.
> > What you are suggesting is very similar to the original design, which
> > says keep track of allocated blocks (the original design doc that was
> > shared), and your argument is that the tracking can be done in terms of
> > metadata if we are willing to make the size of containers variable.
> > That is absolutely true. I think we will need to really think about
> > whether we can make containers arbitrarily large, and what the impact
> > is on the rest of the system.
> >
> > My point was simply this: there is no notion of a "simple delay" in a
> > distributed system, since we will not be able to reason about it.
> > If you want to create such delays, do it via a lease mechanism -- see
> > my other email in response to Steven. There are subtle correctness
> > issues with arbitrary delays.
> >
> > Personally, I am fine with making containers larger, but I think it
> > really should be done after:
> >
> > 1. Measuring the average container sizes in a long-running Ozone
> > cluster.
> > 2. Measuring the average time it takes to replicate a container in a
> > cluster.
> > 3. Modelling the recovery time after Ozone node loss. The issue is that
> > on clusters with a small number of nodes, Ozone does not do as good a
> > job as HDFS at scattering the blocks.
> >
> > So it really needs some data plotted on a curve with cluster size and
> > scatter, and that will give you a ballpark for recovery times. A long,
> > long time ago we did an extensive model of this, and more or less
> > convinced ourselves that the 5 GB size is viable. You should do a
> > similar exercise to convince yourself that much larger container sizes
> > are possible.
> >
> > Another option you have is to support splitting closed containers: let
> > a container grow to any size, and once it is closed, do a special
> > operation via SCM to split it into two or so.
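The modelling exercise suggested above might start from a back-of-envelope sketch like the following. Every number and the scatter model here are illustrative assumptions, not measurements from any cluster:

```python
def recovery_hours(node_capacity_tb: float, cluster_nodes: int,
                   scatter: float, nic_gbit: float = 10.0) -> float:
    """Rough time to re-replicate one lost node's data.

    scatter in (0, 1]: fraction of surviving nodes that hold replicas of
    the lost containers and can source re-replication in parallel. Fewer,
    larger containers tend to lower scatter on small clusters, which is
    the concern raised in the thread.
    """
    data_gb = node_capacity_tb * 1024
    sources = max(1, int((cluster_nodes - 1) * scatter))
    aggregate_gb_per_s = sources * nic_gbit / 8  # network as the bottleneck
    return data_gb / aggregate_gb_per_s / 3600

# 20-node cluster, 48 TB per node: well-scattered small containers vs
# large containers concentrated on a few source nodes.
print(recovery_hours(48, 20, scatter=0.9))   # ~0.64 h
print(recovery_hours(48, 20, scatter=0.2))   # ~3.6 h
```

Plotting this over cluster size and scatter, as the email suggests, would show where larger containers start to hurt recovery time on small clusters.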
> > Best of luck with either approach. The proposal you are making is
> > certainly doable, but it just needs some data analysis.
> >
> > Thanks
> > Anu
> >
> > On Tue, Sep 13, 2022 at 2:55 AM István Fajth <fapi...@gmail.com> wrote:
> >
> > > I am not sure I understand every bit of the discussion, but let me
> > > add some thoughts here as well.
> > >
> > > Waiting on clients to finish writes, and seeing if the container is
> > > full afterwards, sounds like a horrible idea: it would decrease the
> > > number of writes that can be done in parallel, making it a function
> > > of the available space in the open containers and thereby potentially
> > > decreasing write performance.
> > >
> > > The best solution would be the simplest, and someone might already
> > > have suggested something like this... but here is my idea for a very
> > > simple logic.
> > > Some information to establish where I am coming from:
> > > - SCM, as I understand it, knows how much space is not already used
> > > in a container, and it can track the block allocations for the
> > > remaining space, as it is the service that actually allocates a
> > > block.
> > > - The container is a directory on a DataNode, and hence the 5GB limit
> > > is an arbitrary number we came up with as the container size; in
> > > theory, if it goes up to 10GB then that is just that, and there are
> > > no further implications of a larger container (except for
> > > re-replication and validating the health of the container).
> > > - Our block size is also an arbitrary number, 256MB, which is not
> > > based on any science or on any data from clusters about average file
> > > size; we just made it up by doubling the one HDFS used. So it is just
> > > the unit of allocation and an upper limit for one unit of uniquely
> > > identifiable data.
> > > Now, if in SCM we hold a counter per container with the number of
> > > open allocations, and allow over-allocation by 50% of the remaining
> > > space in the container, we will end up with containers somewhat
> > > larger than 5GB in the average case, but we might get a simple system
> > > that ensures allocated blocks can be finished by the clients. Here is
> > > how:
> > > - At allocation, SCM checks whether the available space and the
> > > number of open allocations allow a new block to be allocated in the
> > > container; if not, it checks another open container, and if no open
> > > container can allocate more blocks, it creates a new container. (A
> > > block can be allocated if the theoretical size of the container,
> > > considering the open allocations, cannot exceed the 5GB soft limit by
> > > more than 50% of the remaining space in the container.)
> > > - At the final putBlock, the DN reports in the ICR or in the
> > > heartbeat that the block write is finished in container x, so that
> > > SCM can decrement the counter.
> > > - If a block is abandoned, (AFAIK) we have a routine that deletes the
> > > half-written block data; we can also ask SCM to decrement the counter
> > > from this routine while freeing up the space.
> > > - If a block is abandoned for over a week (the token lifetime), we
> > > can decrement the counter (but I believe this should happen with
> > > abandoned block detection).
> > > - If the container's open allocations go down enough that the
> > > container size can no longer exceed the 5GB soft limit, we can open
> > > up the container for allocations again in SCM.
> > > - If all the open allocations are closed (the counter is 0) and the
> > > container exceeds the 5GB soft limit, we close the container from
> > > SCM.
> > >
> > > This logic seems simple enough. It does not require a lease, as we do
> > > not have one today either, and we somehow (as far as I know) handle
> > > abandoned blocks. It also does not require any big changes in the
> > > protocols; the only things needed are to store the counter within
> > > SCM's container data structure, and to report from the DN to SCM that
> > > the block had its final putBlock call.
> > >
> > > In case of a DDoS-style block allocation, we will end up with enough
> > > containers to hold the data, and we can have 7.5GB containers in the
> > > worst case, while in the general case we will most likely have around
> > > 5GB containers.
> > > Of course, if needed, we can introduce a hard limit on the number of
> > > open containers (pipelines), and if we exceed that, the client would
> > > need to wait to allocate more (some form of the QoS mentioned in the
> > > doc). But I am not really a fan of this idea, as it introduces
> > > possibly large delays, not just for the rogue client. Do we know the
> > > consequences of having, say, 1000 open containers on a 10-node
> > > cluster? (That would mean ~5TB of open allocations...)
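The bookkeeping proposed in the bullet points above can be sketched as a toy state object. All names here are invented (this is not SCM code), and the allocation check combines the two formulations in the thread: the worst-case container size must stay within the 5GB soft limit plus max(50% of remaining space, 2 blocks).

```python
from dataclasses import dataclass

BLOCK_MB = 256           # assumed block size
SOFT_LIMIT_MB = 5 * 1024 # 5 GB soft limit

@dataclass
class ContainerState:
    used_mb: float = 0.0       # data already committed to the container
    open_allocations: int = 0  # blocks allocated but not yet putBlock'd

    def worst_case_mb(self) -> float:
        """Size if every outstanding allocation is fully written."""
        return self.used_mb + self.open_allocations * BLOCK_MB

    def can_allocate(self) -> bool:
        free = max(SOFT_LIMIT_MB - self.used_mb, 0)
        allowance = max(0.5 * free, 2 * BLOCK_MB)
        return self.worst_case_mb() + BLOCK_MB <= SOFT_LIMIT_MB + allowance

    def allocate(self) -> bool:
        if self.can_allocate():
            self.open_allocations += 1
            return True
        return False

    def finish_block(self, written_mb: float) -> None:
        """Final putBlock reported by the DN (via ICR/heartbeat):
        decrement the counter and account the written data."""
        self.open_allocations -= 1
        self.used_mb += written_mb

# Empty container: allocation stops once the worst case reaches 7.5 GB,
# i.e. after 30 outstanding 256 MB blocks.
c = ContainerState()
while c.allocate():
    pass
print(c.open_allocations, c.worst_case_mb())
```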
> > > Please let me know if this idea does not account for some problems.
> > > I just quickly drafted it to come up with something fairly simple
> > > that I believe helps solve the problem with the least amount of
> > > disruption, but I possibly did not think through all the possible
> > > failure scenarios, and I haven't checked what is actually in the code
> > > at the moment; I just wanted to share my idea quickly and validate it
> > > with folks who know the actual code in the codebase better, to see if
> > > it is a feasible alternative.
> > >
> > > Pifta
> > >
> > > anu engineer <anu.engin...@gmail.com> wrote (on Fri, Sep 9, 2022,
> > > 19:10):
> > >
> > > > Extending the same thought from Steven: if you are going to do a
> > > > small delay, it is better to do it via a lease.
> > > >
> > > > So SCM could offer a lease for 60 seconds, with a provision to
> > > > reacquire the lease one more time.
> > > > This does mean that a single container inside the data node could
> > > > technically become larger than 5GB (but that is possible even
> > > > today).
> > > >
> > > > I do think a lease or a timeout-based approach (as suggested by
> > > > Steven) might be easier than pre-allocating blocks.
> > > >
> > > > Thanks
> > > > Anu
> > > >
> > > > On Fri, Sep 9, 2022 at 12:47 AM Stephen O'Donnell
> > > > <sodonn...@cloudera.com.invalid> wrote:
> > > >
> > > > > > 4. Datanode waits until there are no write commands to this
> > > > > > container, then closes it.
> > > > >
> > > > > This could be done on SCM, with a simple delay, i.e. hold back
> > > > > the close commands for a "normal" close for some configurable
> > > > > amount of time. E.g. if we hold for 60 seconds, it is likely
> > > > > almost all blocks will get written. If a very small number fail,
> > > > > it is likely OK.
> > > > > On Fri, Sep 9, 2022 at 5:01 AM Kaijie Chen <c...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Thanks Ethan. Yes, this could be a simpler solution.
> > > > > > The main idea is allowing the container size limit to be
> > > > > > exceeded, to ensure all allocated blocks can be finished.
> > > > > >
> > > > > > We can change it to something like this:
> > > > > > 1. Datanode notices the container is near full.
> > > > > > 2. Datanode sends a close container action to SCM immediately.
> > > > > > 3. SCM closes the container and stops allocating new blocks in
> > > > > > it.
> > > > > > 4. Datanode waits until there are no write commands to this
> > > > > > container, then closes it.
> > > > > >
> > > > > > It's still okay to wait for the next heartbeat in step 2.
> > > > > > Step 4 is a little bit tricky; we need a lease or timeout to
> > > > > > determine the time.
> > > > > >
> > > > > > Kaijie
> > > > > >
> > > > > > ---- On Fri, 09 Sep 2022 08:54:34 +0800 Ethan Rose wrote ---
> > > > > > > I believe the flow is:
> > > > > > > 1. Datanode notices the container is near full.
> > > > > > > 2. Datanode sends a close container action to SCM on its next
> > > > > > > heartbeat.
> > > > > > > 3. SCM closes the container and sends a close container
> > > > > > > command on the heartbeat response.
> > > > > > > 4. Datanodes get the response and close the container. If it
> > > > > > > is a Ratis container, the leader will send the close via
> > > > > > > Ratis.
> > > > > > >
> > > > > > > There is a "grace period" of sorts between steps 1 and 2, but
> > > > > > > this does not help the situation, because SCM does not stop
> > > > > > > issuing blocks to this container until after step 3. Perhaps
> > > > > > > some amount of pause between steps 3 and 4 would help, either
> > > > > > > on the SCM or datanode side.
> > > > > > > This would provide a "grace period" between when SCM stops
> > > > > > > allocating blocks for the container and when the container is
> > > > > > > actually closed. I'm not sure exactly how this would be
> > > > > > > implemented in the code given the current setup, but it seems
> > > > > > > like a simple option we should try before other, more
> > > > > > > complicated solutions.
> > > > > > >
> > > > > > > Ethan
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > > > > For additional commands, e-mail: dev-h...@ozone.apache.org
>
> --
> Pifta
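As a postscript on the modified close flow discussed deepest in the thread (Kaijie's steps 1-4 with a grace period), it can be sketched as a toy state machine. All names are invented, not Ozone code: SCM stops allocating as soon as it sees the close action, while the container only closes once in-flight writes drain or the grace period expires.

```python
import time

class ContainerCloseFlow:
    """Toy model of: stop allocations first, close after writes drain
    or a grace period (e.g. the 60s suggested in the thread) expires."""

    def __init__(self):
        self.state = "OPEN"  # OPEN -> CLOSING -> CLOSED
        self.inflight_writes = 0

    def on_near_full(self):
        """Steps 1-3: DN reports near full, SCM stops placing blocks here."""
        self.state = "CLOSING"

    def can_allocate(self) -> bool:
        return self.state == "OPEN"

    def try_close(self, deadline: float) -> bool:
        """Step 4: close when no writes remain, or the grace period is up."""
        if self.state == "CLOSING" and (
                self.inflight_writes == 0 or time.monotonic() >= deadline):
            self.state = "CLOSED"
        return self.state == "CLOSED"

c = ContainerCloseFlow()
c.inflight_writes = 2
c.on_near_full()                  # SCM stops allocating immediately
deadline = time.monotonic() + 60  # 60-second grace period
print(c.can_allocate(), c.try_close(deadline))  # False False: writes in flight
c.inflight_writes = 0
print(c.try_close(deadline))      # True: drained before the deadline
```

The design point the sketch isolates is the ordering: no new allocations can land between the near-full report and the actual close, which is exactly the gap the current flow leaves open.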