Re: [RFC] Proposal: Reserve Space for Allocated Blocks

anu engineer Tue, 13 Sep 2022 10:09:35 -0700

Hi Stephen,

Thank you for the comments. My earlier comment was very brief, and I did
not explain things in sufficient detail.

Your comments made me realize that, and I have taken the opportunity to
clarify what I was thinking. Hopefully, it is much clearer now.

1. The lease is a simple lock with a time-out. The datanode will not close
the container while a lease (a "please keep it open" lock) is outstanding.

2. A client cannot extend the lease indefinitely. A lease is capped at two
tries, and I am proposing something like 60 seconds for each attempt.
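
To make the two points above concrete, here is a minimal sketch in Python.
The names (Lease, LEASE_SECONDS, MAX_GRANTS, extend) are made up for
illustration and are not Ozone APIs. It also assumes that an extension adds
60 seconds to the existing expiry, which is one way to get a hard cap.

```python
from dataclasses import dataclass, replace

LEASE_SECONDS = 60  # proposed length of each attempt
MAX_GRANTS = 2      # assumption: the initial grant plus one extension


@dataclass(frozen=True)
class Lease:
    """Hypothetical lease: a lock on a container that times out by itself."""
    container_id: int
    issued_at: float       # seconds on the issuer's clock
    grants_used: int = 1   # 1 = initial grant, 2 = extended once

    def expires_at(self) -> float:
        # Each grant adds one fixed period, so the lifetime is bounded.
        return self.issued_at + self.grants_used * LEASE_SECONDS

    def is_valid(self, now: float) -> bool:
        return self.issued_at <= now < self.expires_at()

    def extend(self) -> "Lease":
        """Return an extended lease, or refuse once the cap is reached."""
        if self.grants_used >= MAX_GRANTS:
            raise PermissionError("lease cannot be extended again")
        return replace(self, grants_used=self.grants_used + 1)
```

With these numbers, a lease can never outlive 120 seconds from the moment it
was issued, no matter what the client does.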

Now let me discuss each of the concerns.

 >> Adding a lease will be quite complex, and may not be needed if the
simple delay works.

That is correct if the "simple delay" works. The problem is that "simple
delay" is never simple. You will not be able to reason about the
correctness of the system.

With a simple delay, we would not be able to tell whether the client did a
write, whether someone is waiting, or worse, whether the delay is always
introduced for no purpose, because all we see is a simple delay, with no
information about the actual writes or clients.

So it would help if we had some way to introduce a "deterministic delay"
instead of a simple delay. Such a delay allows us to reason about the state
from the SCM point of view, the client point of view, or the datanode point
of view. Overall, that is much better for the debuggability of the system.

If there is a write failure, it does not depend on some arbitrary notion
of delay. Instead, the client comes and says: I want to write, SCM has given
me an allocated block, and SCM gave me (the client) 60 seconds to write my
data. Let us say there is a failure or a bug. With a lease-based system, we
will be able to reason about the mistakes we are making in introducing the
notion of a "simple delay." To be completely honest, I am just proposing a
formal method that is easy to reason about and debug, and that delivers the
idea you are offering: "the simple delay."

If the client runs out of that time, the client can request a lease
extension one more time. However, it is still deterministic since SCM can
decide if the lease can be issued, and we are arguing that the client
should extend the lease only once (this is not a strictly necessary
condition).

I am arguing that "delay" is a good notion, but it should not be arbitrary,
and it should not be something that we cannot see and measure. If you are
going to use time, use it through a well-known pattern: leases. Now I can
log the lease info on a client if needed, and if I run into bugs, I have
some way of reasoning about them.

>> Then on SCM you will need to track all open blocks, lease expiry due to
crashed clients etc

I don't understand this assertion. Perhaps there may be things in Ozone now
that I have no clue about. It has been a while.

But let us play this from first principles.

1. You issue a lease; this is a signed lease that says: "Client, you have
60 seconds from 11:00:00."

2. The client is attempting to write and sends this lease.

3. The client crashes. Why do we care? The lease is gone at the time-out
anyway. We use leases in the first place (as opposed to a Distributed Lock
Manager) because we don't need to care about client crashes.

4. Why does SCM need to track anything? The Datanode has to honor the
request if the client has a block write lease and the write arrives within
time.

5. If we know that any lease is issued for 60 seconds for a block write,
all the datanode needs to do before closing a container is tell the SCM to
stop giving block write leases for that container and wait for 121 seconds
(two 60-second lease periods, plus 1 second for extra safety). This
guarantees that any outstanding client with a lease has had sufficient time
to write.
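
The steps above can be sketched as follows. This is only an illustration
under stated assumptions: the lease is signed with an HMAC over a secret
shared by SCM and the datanodes, and every name here (sign_lease,
datanode_accepts, earliest_close_time) is hypothetical, not an Ozone API.
The point is that SCM stores nothing per lease, and the datanode decides
using only the lease itself and its own clock.

```python
import hashlib
import hmac

LEASE_SECONDS = 60    # proposed lease length for a block write
MAX_GRANTS = 2        # assumption: initial grant plus one extension
SAFETY_SECONDS = 1    # extra margin before closing
CLOSE_WAIT_SECONDS = MAX_GRANTS * LEASE_SECONDS + SAFETY_SECONDS  # 121


def sign_lease(secret: bytes, container_id: int, issued_at: int) -> str:
    """SCM side: sign (container, start time). Nothing is remembered."""
    message = f"{container_id}:{issued_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def datanode_accepts(secret: bytes, container_id: int, issued_at: int,
                     signature: str, now: int) -> bool:
    """Datanode side: honor a write only with a genuine, unexpired lease."""
    expected = sign_lease(secret, container_id, issued_at)
    if not hmac.compare_digest(expected, signature):
        return False  # forged or altered lease
    return issued_at <= now < issued_at + LEASE_SECONDS


def earliest_close_time(stop_issuing_at: int) -> int:
    """When a container may close after SCM stops issuing leases for it."""
    return stop_issuing_at + CLOSE_WAIT_SECONDS
```

If the client crashes in step 3, nothing has to be cleaned up: the lease
simply stops passing the time check.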

Please let me know if you imagine this differently. I am struggling to
understand where the state is coming from; the whole deal with leases is
that it is easy to forget state, and we can still get to a consistent state
if we wait.

Now let us talk about real disasters: say SCM crashes completely. (Assume
there is no HA for the time being, to make it interesting as a technical
exercise, or to make the point that leases need not be replicated.)

Now, if any request comes from any datanode saying "I am going to close the
container," the standard response works: "Please wait 121 seconds and then
close."

From that moment onwards, SCM remembers that it cannot issue any leases
for this container, which it has just learned about.

Now, what about a close notice, "I want to close container X," that arrived
just a few seconds *before the crash*? After recovery, SCM will not issue
any leases for 120 seconds. *This is the most important part.*

That way, the whole write pipeline is stalled, but we are in a safe place.
This theoretical discussion illustrates how leases would work and create no
additional state requirements.
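
Here is a minimal sketch of that recovery rule, assuming a hypothetical
LeaseIssuer class on the SCM side (the class and its methods are invented
for illustration). The only thing it needs after a restart is its own start
time; the set of closing containers is rebuilt from the close requests that
arrive afterwards.

```python
LEASE_SECONDS = 60
MAX_GRANTS = 2                              # assumption: two 60-second tries
QUIET_SECONDS = MAX_GRANTS * LEASE_SECONDS  # 120: no leases after a restart


class LeaseIssuer:
    """Hypothetical SCM-side issuer that persists no lease state."""

    def __init__(self, started_at: float):
        self.started_at = started_at  # time of the (re)start
        self.closing = set()          # container ids learned since the start

    def note_close_request(self, container_id: int) -> int:
        """Record a closing container; return how long the datanode waits."""
        self.closing.add(container_id)
        return QUIET_SECONDS + 1      # the standard "wait 121 seconds" reply

    def may_issue(self, container_id: int, now: float) -> bool:
        """Refuse every lease during the quiet period after a restart."""
        if now < self.started_at + QUIET_SECONDS:
            return False  # a close notice may have been lost in the crash
        return container_id not in self.closing
```

Any lease handed out before the crash has expired by the time the quiet
period ends, so a close notice that was lost in the crash cannot race with a
newly issued lease.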

>> You would also need to decide what to do with a block that is kept open
for many hours or days by a persistent client slowly writing to one or more
files - we cannot keep the Container open indefinitely.

Not possible at all. A lease is not something that is infinite. It is NOT a
lock. By definition, a lease expires after "X" amount of time. In this
case, I am picking a number like 60 seconds.

So, if we create a model in which a client can hold a lease for at most two
60-second attempts, then the maximum time limit would be 120 seconds for a
block write. We can pick whatever number looks reasonable.

Please let me know if this makes sense.

Thanks

Anu

On Tue, Sep 13, 2022 at 2:33 AM Stephen O'Donnell
<sodonn...@cloudera.com.invalid> wrote:

> I would recommend a proof of concept on your cluster, doing the simplest
> thing first, which is holding the close commands in SCM for a configurable
> delay, and see if you can alleviate the problem that way.
>
> Adding a lease will be quite complex, and may not be needed if the simple
> delay works. You will need a heartbeat thread from client to OM, then relay
> that from OM to SCM, placing more front end load onto SCM. Then on SCM you
> will need to track all open blocks, lease expiry due to crashed clients
> etc. You would also need to decide what to do with a block that is kept
> open for many hours or days by a persistent client slowly writing to one or
> more files - we cannot keep the container open indefinitely.
>
> The front end load on SCM here is probably minimal, but there is work under
> way to cache the container locations in OM partly to speed up KeyInfo
> calls, but also to prevent increasing client load increasing load on SCM.
>
> On Tue, Sep 13, 2022 at 9:19 AM Kaijie Chen <c...@apache.org> wrote:
>
> > Thanks Anu and Steven for the suggestion. Granting a lease to client
> > sounds like a more controllable way.
> >
> > However, if I understand correctly, clients don't talk to SCM directly.
> > Does it mean OM has to relay the renew lease request to SCM?
> > Is there a better way to implement it?
> >
> > Regards
> > Kaijie
> >
> >  ---- On Sat, 10 Sep 2022 01:10:28 +0800  anu engineer  wrote ---
> >  > Extending the same thought from Steven. If you are going to do a small
> >  > delay, it is better to do it via a Lease.
> >  >
> >  > So SCM could offer a lease for 60 seconds, with a provision to reacquire
> >  > the lease one more time.
> >  > This does mean that a single container inside the data node technically
> >  > could become larger than 5GB (but that is possible even today).
> >  >
> >  > I do think a lease or a timeout based approach (as suggested by Steven)
> >  > might be easier than pre-allocating blocks.
> >  >
> >  > Thanks
> >  > Anu
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > For additional commands, e-mail: dev-h...@ozone.apache.org
> >
> >
>
