Hi Luís,

What I'm thinking primarily is that we only need to compare the compaction
values as LONG for the offset and timestmap "type" (I still think it is
worth defining `timestamp` as a special compaction value, with the reasons
below).

Not sure if you've seen my other comment earlier regarding the offset /
timestmap, I'm pasting / editing them here to illustrate my idea:

--------------

I think maybe we have a mis-communication here: I'm not against the idea of
using headers, but just trying to argue that we could make `timestamp`
field a special config value that is referring to the timestamp field in
the metadata. So from log cleaner's pov:

1. if the config value is "offset", look into the offset field, *comparing
their value as long*
2. if the config value is "timestamp", look into the timestamp field,
*comparing
their value as long*
3. otherwise, say the config value is "foo", search for key "foo" in the
message header, comparing the value as *byte arrays*

I.e. "offset" and "timestamp" are treated as special cases other than case
3) above.

--------------

I think your main concern is that "Although the byte[] can be compared, it
is not actually comparable as the versioning is based on a long", while I'm
thinking we can indeed generalize it: there is not hard reasons that the
"compaction value" has to be a long, and since the goal of this KIP is to
generalize the log compaction logic to consider header fields, why not
allowing it to be of any types than enforcing them still to be a long type?
Users are then responsible to encode their compaction field according to
the byte array lexico-ordering to full fill their ordering semantics. It is
more flexible to enforce users to encode their compaction field always as a
long type. Let me know WDYT.



Also I have some minor comments on the wiki itself:

1) "When both records being compared contain a matching "compaction value",
then the record with the highest offset will be kept;"

Should it be "the record with the highest value will be kept"?




Guozhang


On Fri, Apr 20, 2018 at 1:05 AM, Luís Cabral <luis_cab...@yahoo.com.invalid>
wrote:

>  Guozhang, is this reply ok with you?
>
>
> If you insist on the byte[] comparison directly, then I would need some
> suggestions on how to represent a "version" with it, and then the KIP could
> be changed to that.
>     On Tuesday, April 17, 2018, 2:44:16 PM GMT+2, Luís Cabral <
> luis_cab...@yahoo.com> wrote:
>
>  Oops, missed that email...
>
> bq. It is because when we compare the bytes we do not treat them as longs
> atall, so we just compare them based on bytes; I admit that if users's
> headertypes have some semantic meanings (e.g. it is encoded from a long)
> they weare forcing them to choose the encoder that obeys key
> lexicographicordering; but I felt it is more general than enforcing any
> fields that maybe used for log cleaner to be defined as a special type.
>
> Yes, you can compare bytes between each other (its what that code does).
> You can then assume (or infer) that the encoding used allows for
> lexicographic ordering, which I hope you do not do a lot of. This is
> (logically) the same as converting to String and then comparing the
> strings, except that it allows for abstracting from the String encoding
> (again, either with assumptions or with inferred knowledge).
> This is purely academic, however, as the versioning is based on a long,
> which is not compatible with this approach. So, is this comment a
> fact-check stating that it is possible to compare byte[] overall, or is it
> about trying to use it in this KIP?
>
> Cheers
>
> PS (because I'm stubborn): It is still not comparable, this comparison is
> all based on assumptions about the content of the byte array, but I hope we
> can leave this stuff to Stack Overflow instead of debating it here :)
>
>



-- 
-- Guozhang

Reply via email to