Hi Matthias,
Concerning the naming: I like `markAsPartitioned`, because it describes
what this operation is actually doing!
Hi Sophie,
I see the concern about poor code cohesion. We declare key mapping in
one place of code, then later in another place we say
"markAsPartitioned()". When we change the code six months later, we
might forget to remove markAsPartitioned(), especially if it's placed in
another method or class. But I don't understand why do you propose to
include this config into Grouped/Joined/StreamJoined, because from this
point of view it's not a better solution?
The best approach regarding the cohesion might be to to add an extra
'preservePartition' flag to every key-changing operation, that is
1) selectKey
2) map
3) flatMap
4) transform
5) flatTransform
in order to tell if the provided mapping require repartition or not.
Indeed, this is a mapping operation property, not grouping one! BTW: the
idea of adding extra parameter to `selectKey` was once coined by John
Roesler.
Arguments in favour for this approach: 1) better code cohesion from the
point of view of the user, 2) 'smarter' code (the decision is taken
depending on metadata provided for all the upstream mappings), 3)
overall safer for the user.
Arguments against: invasive KStreams API change, 5 more method
overloads. Further on, when we add a new key-changing operation to
KStream, we must add an overloaded version with 'preservePartition'.
When we add a new overloaded version for existing operation, we actually
might need to add two or more overloaded versions. This will soon become
a mess.
I thought that since `markAsPartitioned` is intended for advanced users,
they will use it with care. When you're in a position where every
serialization/deserialization round matters for the latency, you're
extremely careful with the topology and you will not thoughtlessly add
new key-changing operations without controlling how it's going to change
the overall topology.
By the way, if we later find a better solution, it's way more easy to
deprecate a single `markAsPartitioned` operation than 5 method overloads.
What do you think?
04.08.2021 4:23, Sophie Blee-Goldman пишет:
Do we really need a whole DSL operator for this? I think the original name
for this
operator -- `cancelRepartition()` -- is itself a sign that this is not an
operation on the
stream itself but rather a command/request to whichever operator would have
otherwise triggered this repartition.
What about instead adding a new field to the Grouped/Joined/StreamJoined
config
objects that signals them to skip the repartitioning?
The one downside to this specific proposal is that you would then need to
specify
this for every stateful operation downstream of the key-changing operation.
So a
better alternative might be to introduce this `skipRepartition` field, or
whatever we
want to call it, to the config object of the operator that's actually doing
the key
changing operation which is apparently preserving the partitioning.
Imo this would be more "safe" relative to the current proposal, as the user
has to
explicitly consider whether every key changing operation is indeed
preserving the
partitioning. Otherwise you could code up a topology with several key
changing
operations at the beginning which do require repartitioning. Then you get
to the end
of the topology and insert one final key changing operation that doesn't,
assume
you can just cancel the repartition, and suddenly you're wondering why your
results
are all screwed up
On Tue, Aug 3, 2021 at 6:02 PM Matthias J. Sax <mj...@apache.org> wrote:
Thanks for the KIP Ivan!
I think it's a good feature to give advanced users more control, and
allow them to build more efficient application.
Not sure if I like the proposed named though (the good old "naming
things" discussion :))
Did you consider alternatives? What about
- markAsPartitioned()
- markAsKeyed()
- skipRepartition()
Not sure if there are other idea on a good name?
-Matthias
On 6/24/21 7:45 AM, Ivan Ponomarev wrote:
Hello,
I'd like to start a discussion for KIP-759:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-759%3A+Unneeded+repartition+canceling
This is an offshoot of the discussion of KIP-655 for a `distinct`
operator, which turned out to be a separate proposal.
The proposal is quite trivial, however, we still might consider
alternatives (see 'Possible Alternatives' section).
Regards,
Ivan