Important correction: *Thus, the max value of long does not have the first bit as 1.*
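
To make the concern concrete, here is a minimal sketch. It assumes a hypothetical layout (epoch in the top 2 bits, transaction index in the low 62 bits); the class and method names are made up for illustration and are not the actual code from HDDS-4315.

```java
public class ObjectIdOrder {
    // Assumed layout, for illustration only: top 2 bits = epoch,
    // low 62 bits = transaction index.
    static final int EPOCH_SHIFT = 62;

    // Hypothetical helper that packs an epoch and a transaction index into a long.
    static long objectId(long epoch, long txIndex) {
        return (epoch << EPOCH_SHIFT) | txIndex;
    }

    public static void main(String[] args) {
        long epoch1 = objectId(1, 100);
        long epoch2 = objectId(2, 100);

        // Epoch 2 sets the sign bit, so the value is negative as a signed long.
        System.out.println(epoch2 < 0);                              // true

        // Signed comparison puts epoch 2 before epoch 1: ordering is broken.
        System.out.println(Long.compare(epoch1, epoch2) < 0);        // false

        // Unsigned comparison keeps the epoch order.
        System.out.println(Long.compareUnsigned(epoch1, epoch2) < 0); // true
    }
}
```

If epochs 2 and 3 can ever occur, any place that orders object IDs would need something like `Long.compareUnsigned` (or an unsigned, fixed-width encoding of the key) rather than a plain `<` on long.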

On Mon, Nov 8, 2021 at 6:49 PM Kota Uenishi <k...@preferred.jp> wrote:
>
> Hi Bharat,
>
> Thank you for the suggestion of object ID. By design, I understand
> that object ID is more suitable for the delete table use case, regarding
> the requirement for monotonicity. I took a glance at HDDS-4315 and I
> have one question.
>
> Looking at the code, the object ID seems to have the most significant
> 2 bits as epoch ID. But it's mostly implemented by Java's primitive
> type of long, which is a signed integer. Thus, the max value of long
> does not have the first bit as 0. That said, object IDs in epoch 2 and
> 3 are supposed to have negative value in long, and in that case,
> monotonicity in integer comparison will be broken. I doubt if it's
> safe in comparing object IDs. The comparison would only be safe by
> encoding into binary or unsigned hex array - but it's not
> straightforward and the comparison could be buggy IMO.
> If the epoch is only supposed to range from 0 to 1, it would be safe.
> Can we assume it, or is the comparison always supposed to be safe?
>
> > One thing I just want to say, we recommend HA or ratis enabled.
>
> Thank you for the advice. Our cluster runs 1.1.0 but we explicitly
> disabled Ratis when upgrading to 1.1 from 1.0. So I guess it's still
> safe. Maybe enabling Ratis after upgrading to 1.2 would be safe
> regarding the object ID issue in 1.1, if I understand correctly.
>
> Thanks, and sorry for being late,
> Kota
>
> On Tue, Nov 2, 2021 at 2:12 AM Bharat Viswanadham
> <bviswanad...@cloudera.com.invalid> wrote:
> >
> > Hi Kota,
> >
> > > My question is that, is transaction index always available for non-HA
> > > cluster?
> >
> > Yes, transaction index is available for non-HA also. But when you move from
> > non-HA to HA the transaction index starts again from 0, as it is a
> > newly setup cluster and ratis transaction index starts from 0 again.
> > So, to avoid the issue of object IDs colliding, we have generated a unique
> > Object ID based on transaction ID, and we also persist the transaction ID
> > and start from it after restarts (HDDS-4315). Maybe we can use ObjectID to
> > not collide in an upgrade scenario from non-HA to HA here also.
> >
> > *Example Scenario* where it might cause a problem using transaction index:
> > (This is like a very theoretical example)
> > Let's say transaction ID 100 deletes key1 before upgrade.
> > Now transaction ID 100 deletes key1 after upgrade, we might miss block clean
> > up. (Like the scenario described in HDDS-5905)
> >
> > Considering the above issue, I am thinking using transaction ID might be an
> > issue, otherwise for HA/ratis enabled deployments for single nodes using
> > transaction ID we should be good.
> >
> > One thing I just want to say, we recommend HA or ratis enabled. (As before
> > HDDS-4315, we have a problem of generating transaction IDs from 0 again
> > after a restart, which might not have unique object IDs in the cluster.
> > And also we have enabled ratis by default from the 1.1.0 release
> > (HDDS-4498 <https://issues.apache.org/jira/browse/HDDS-4498>).)
> >
> > Thanks,
> > Bharat
> >
> > On Sun, Oct 31, 2021 at 5:29 PM Kota Uenishi <k...@preferred.jp> wrote:
> >
> > > Thank you for the review, Lokesh and Bharat.
> > >
> > > I understand that transaction id would be better than timestamp,
> > > especially because of the computation cost of getting a timestamp. In this
> > > case, the sorting of deletion keys does not have to be strictly
> > > monotonic; mild monotonicity, like where clock skews in the range of
> > > hours or days would be acceptable, is enough. I'll update the doc.
> > >
> > > My question is that, is transaction index always available for non-HA
> > > cluster?
> > > For example, our 1.1.0 cluster is not using HA for OM nor for
> > > SCM and we are not planning to upgrade to even
> > > single-node Ratis (still using
> > > org.apache.hadoop.hdds.scm.pipeline.leader.choose.algorithms.DefaultLeaderChoosePolicy
> > > for ozone.scm.pipeline.leader-choose.policy).
> > >
> > > Bharat, on RepeatedKeyInfo;
> > > Yes, in my plan, RepeatedKeyInfo is still needed for data format
> > > compatibility and I'm not planning to change proto. Especially,
> > > changing proto format will make upgrade & downgrade extremely
> > > difficult IMO. I know it doesn't have to be a list any more, but it's
> > > just in theory.
> > >
> > > On Sat, Oct 30, 2021 at 4:45 AM Bharat Viswanadham
> > > <bviswanad...@cloudera.com.invalid> wrote:
> > > >
> > > > Hi Kota,
> > > > Thanks for taking up HDDS-5905 and quickly coming up with a design.
> > > >
> > > > I liked the overall approach, but one thing instead of timestamps, I agree
> > > > with Lokesh, we can use transaction index, and also this will make
> > > > implementation easy. (As with timestamp, we need to propagate this from the
> > > > leader, handle clock skews, and need to handle leader changes.)
> > > >
> > > > And one question, so do we plan to use RepeatedKeyInfo, now with this
> > > > change it will be no more list. You are not planning to change proto?
> > > >
> > > > Thanks,
> > > > Bharat
> > > >
> > > > On Thu, Oct 28, 2021 at 11:12 PM Lokesh Jain <lj...@apache.org> wrote:
> > > >
> > > > > Hey Kota
> > > > >
> > > > > I really like the proposed approach because it makes sure that blocks are
> > > > > deleted in order of key deletion. I would suggest using Ratis transaction
> > > > > id as the prefix. I don’t think we will need a random suffix with that
> > > > > approach as transaction id would avoid any collisions. Further it avoids the
> > > > > cost of generating timestamps.
> > > > >
> > > > > Thanks
> > > > > Lokesh
> > > > >
> > > > > > On 29-Oct-2021, at 7:52 AM, Kota Uenishi <k...@preferred.jp> wrote:
> > > > > >
> > > > > > Hi Bharat & devs,
> > > > > >
> > > > > > I've written up some of my idea to fix HDDS-5905, which is a
> > > > > > block-leak issue mentioned by Bharat. It involves some data format
> > > > > > change in deletion table, so I want to get broader range of feedback
> > > > > > from committers in addition to Bharat. If it looks good to you, I want
> > > > > > to start writing up a patch. Please take a look!
> > > > > >
> > > > > > The proposal:
> > > > > > https://docs.google.com/document/d/1KeyhiE1i5SqRSgLy-pIOGW9X6mUYb8iYEkEoDAEQD9Q/edit#
> > > > > > HDDS-5905: https://issues.apache.org/jira/browse/HDDS-5905
> > > > > >
> > > > > > --
> > > > > > --
> > > > > > Kota UENISHI, Engineer
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
> > > > > > For additional commands, e-mail: dev-h...@ozone.apache.org
> > > > > >
> > >
> > > --
> > > --
> > > Kota UENISHI, Engineer
> > >
>
> --
> --
> Kota UENISHI, Engineer

--
--
Kota UENISHI, Engineer

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@ozone.apache.org
For additional commands, e-mail: dev-h...@ozone.apache.org