Re: Design doc to fix HDDS-5905

Kota Uenishi Mon, 08 Nov 2021 02:41:26 -0800

Important correction: *Thus, the max value of long does not have the
first bit as 1.*
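
To illustrate the concern in the quoted message below, here is a minimal Java sketch. It assumes a hypothetical layout in which the top 2 bits of a long hold the epoch and the remaining 62 bits hold the transaction index. The names (ObjectIdOrder, makeId) are made up for illustration and are not the actual Ozone code.

```java
final class ObjectIdOrder {
    // Assumption: epoch in the 2 most significant bits, transaction index below.
    static final int EPOCH_SHIFT = 62;

    // Hypothetical helper that packs an epoch and a transaction index into a long.
    static long makeId(long epoch, long txIndex) {
        return (epoch << EPOCH_SHIFT) | txIndex;
    }

    public static void main(String[] args) {
        long epoch1 = makeId(1, 100);
        long epoch2 = makeId(2, 100);

        // Sign bit of Long.MAX_VALUE (the "first bit" discussed in this thread).
        System.out.println("sign bit of MAX_VALUE: " + (Long.MAX_VALUE >>> 63));

        // With epoch 2 the sign bit is set, so the ID is negative as a signed long.
        System.out.println("epoch 2 id is negative: " + (epoch2 < 0));

        // Signed comparison puts the epoch 2 id before the epoch 1 id.
        System.out.println("signed:   " + Long.compare(epoch1, epoch2));

        // Unsigned comparison treats the same bits as a 64-bit unsigned value.
        System.out.println("unsigned: " + Long.compareUnsigned(epoch1, epoch2));
    }
}
```

Under this assumed layout, IDs from epochs 2 and 3 are negative, so a signed comparison orders them before IDs from epochs 0 and 1, while Long.compareUnsigned keeps the epoch order.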


On Mon, Nov 8, 2021 at 6:49 PM Kota Uenishi <k...@preferred.jp> wrote:
>
> Hi Bharat,
>
> Thank you for suggesting the object ID. By design, I understand
> that the object ID is more suitable for the delete table use case, given
> the requirement for monotonicity. I took a glance at HDDS-4315 and I
> have one question.
>
> Looking at the code, the object ID seems to use its most significant
> 2 bits as the epoch ID. But it is mostly implemented with Java's primitive
> type long, which is a signed integer. Thus, the max value of long
> does not have the first bit as 0. As a result, object IDs in epochs 2 and
> 3 will have negative values as a long, and in that case
> monotonicity under integer comparison will be broken. I doubt that it's
> safe to compare object IDs this way. The comparison would only be safe
> after encoding into a binary or unsigned hex array - but that's not
> straightforward and the comparison could be buggy IMO.
> If the epoch is only supposed to range from 0 to 1, it would be safe.
> Can we assume that, or is the comparison always supposed to be safe?
>
> > One thing I just want to say, we recommend HA or ratis enabled.
>
> Thank you for the advice. Our cluster runs 1.1.0 but we explicitly
> disabled Ratis when upgrading to 1.1 from 1.0. So I guess it's still
> safe. Maybe enabling Ratis after upgrading to 1.2 would be safe
> regarding the object ID issue in 1.1, if I understand correctly.
>
> Thanks, and sorry for being late,
> Kota
>
> On Tue, Nov 2, 2021 at 2:12 AM Bharat Viswanadham
> <bviswanad...@cloudera.com.invalid> wrote:
> >
> > Hi Kota,
> >
> > >My question is that, is transaction index always available for non-HA
> > >cluster?
> >
> > Yes, the transaction index is available for non-HA also. But when you move
> > from non-HA to HA, the transaction index starts again from 0, as it is a
> > newly set up cluster and the Ratis transaction index starts from 0 again.
> > So, to avoid object IDs colliding, we generate a unique object ID based on
> > the transaction ID, and we also persist the transaction ID and resume from
> > it after restarts (HDDS-4315). Maybe we can also use the object ID here to
> > avoid collisions in an upgrade scenario from non-HA to HA.
> >
> > *Example Scenario* where using the transaction index might cause a problem:
> > (This is a very theoretical example)
> > Let's say transaction ID 100 deletes key1 before the upgrade.
> > Now transaction ID 100 deletes key1 again after the upgrade; we might miss
> > block cleanup. (Like the scenario described in HDDS-5905)
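
The scenario above can be sketched in Java as follows. This is only an illustration under assumptions: the table key format (zero-padded hex transaction index as the prefix, then the key name) and all names (DeletedTableCollision, tableKey, simulate) are hypothetical, and a TreeMap stands in for the real deleted table.

```java
import java.util.TreeMap;

final class DeletedTableCollision {
    // Hypothetical key format: transaction index prefix, then the key name.
    static String tableKey(long txIndex, String keyName) {
        return String.format("%016x/%s", txIndex, keyName);
    }

    // Replays the two deletions from the scenario against an in-memory table.
    static TreeMap<String, String> simulate() {
        TreeMap<String, String> deletedTable = new TreeMap<>();

        // Before the upgrade: transaction 100 deletes key1.
        deletedTable.put(tableKey(100, "key1"), "blocks-before-upgrade");

        // After the upgrade the index restarted from 0, so a later
        // transaction 100 deletes a re-created key1 and reuses the same key.
        deletedTable.put(tableKey(100, "key1"), "blocks-after-upgrade");

        return deletedTable;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = simulate();
        // Only one entry survives; the earlier block list was overwritten.
        System.out.println(table);
    }
}
```

In this sketch the second put overwrites the first, so the blocks recorded before the upgrade would no longer be reachable from the table.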
> >
> > Considering the above issue, I think using the transaction ID might be a
> > problem here; otherwise, for HA or single-node Ratis-enabled deployments,
> > using the transaction ID should be fine.
> >
> >
> > One thing I just want to say: we recommend HA or Ratis enabled. (Before
> > HDDS-4315, we had a problem of transaction IDs being generated from 0 again
> > after a restart, so object IDs might not be unique in the cluster.)
> > Also, Ratis has been enabled by default since the 1.1.0 release (
> > HDDS-4498 <https://issues.apache.org/jira/browse/HDDS-4498>)
> >
> >
> >
> > Thanks,
> > Bharat
> >
> >
> >
> > On Sun, Oct 31, 2021 at 5:29 PM Kota Uenishi <k...@preferred.jp> wrote:
> >
> > > Thank you for the review, Lokesh and Bharat.
> > >
> > > I understand that the transaction id would be better than a timestamp,
> > > especially because of the computation cost of getting a timestamp. In
> > > this case, the ordering of deletion keys does not have to be
> > > strictly monotonic; mild monotonicity is enough, where clock skews
> > > in the range of hours or days would be acceptable. I'll update the doc.
> > >
> > > My question is that, is transaction index always available for non-HA
> > > cluster? For example, our 1.1.0 cluster is not using HA for OM nor for
> > > SCM and we are not planning to upgrade even to single-node Ratis
> > > (still using
> > > org.apache.hadoop.hdds.scm.pipeline.leader.choose.algorithms.DefaultLeaderChoosePolicy
> > > for ozone.scm.pipeline.leader-choose.policy).
> > >
> > > Bharat, on RepeatedKeyInfo;
> > > Yes, in my plan, RepeatedKeyInfo is still needed for data format
> > > compatibility and I'm not planning to change proto. Especially,
> > > changing proto format will make upgrade & downgrade extremely
> > > difficult IMO. I know it doesn't have to be a list any more, but it's
> > > just in theory.
> > >
> > > On Sat, Oct 30, 2021 at 4:45 AM Bharat Viswanadham
> > > <bviswanad...@cloudera.com.invalid> wrote:
> > > >
> > > > Hi Kota,
> > > > Thanks for taking up HDDS-5905 and quickly coming up with a design.
> > > >
> > > > I liked the overall approach, but one thing: instead of timestamps, I
> > > > agree with Lokesh that we can use the transaction index, which will
> > > > also make implementation easy. (With timestamps, we need to propagate
> > > > them from the leader, handle clock skews, and handle leader changes.)
> > > >
> > > > And one question: do we still plan to use RepeatedKeyInfo? With this
> > > > change it will no longer be a list. You are not planning to change the
> > > > proto?
> > > >
> > > >
> > > > Thanks,
> > > > Bharat
> > > >
> > > >
> > > > On Thu, Oct 28, 2021 at 11:12 PM Lokesh Jain <lj...@apache.org> wrote:
> > > >
> > > > > Hey Kota
> > > > >
> > > > > I really like the proposed approach because it makes sure that blocks
> > > > > are deleted in order of key deletion. I would suggest using the Ratis
> > > > > transaction id as the prefix. I don’t think we will need a random
> > > > > suffix with that approach, as the transaction id would avoid any
> > > > > collisions. Further, it avoids the cost of generating timestamps.
> > > > >
> > > > > Thanks
> > > > > Lokesh
> > > > >
> > > > > > On 29-Oct-2021, at 7:52 AM, Kota Uenishi <k...@preferred.jp> wrote:
> > > > > >
> > > > > > Hi Bharat & devs,
> > > > > >
> > > > > > I've written up some of my ideas to fix HDDS-5905, which is a
> > > > > > block-leak issue mentioned by Bharat. It involves some data format
> > > > > > changes in the deletion table, so I want to get a broader range of
> > > > > > feedback from committers in addition to Bharat. If it looks good to
> > > > > > you, I want to start writing up a patch. Please take a look!
> > > > > >
> > > > > > The proposal:
> > > > > > https://docs.google.com/document/d/1KeyhiE1i5SqRSgLy-pIOGW9X6mUYb8iYEkEoDAEQD9Q/edit#
> > > > > > HDDS-5905: https://issues.apache.org/jira/browse/HDDS-5905
> > > > > >
> > > > > > --
> > > > > > --
> > > > > > Kota UENISHI, Engineer
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > > > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > > > > >
> > >
> > >
> > >
> > > --
> > > --
> > > Kota UENISHI, Engineer
>
>
>
> --
> --
> Kota UENISHI, Engineer



-- 
--
Kota UENISHI, Engineer

