Hello Bill,

Thanks for the KIP. Glad to see that we can likely shooting two birds with
one stone. I have some concerns though about those "two birds" themselves:

1. About not breaking compatibility of stream-stream join materialized
stores: I think this is a valid issue to tackle, but after thinking about
it once more I'm not sure if exposing Materialized would be a good solution
here. My rationles:

1.a For stream-stream join, our current usage of window-store is not ideal,
and we want to modify it in the near future to be more efficient. Not
allowing users to override such state store backend gives us such freedom
(which was also considered in the original DSL design), whereas getting a
Materialized<WindowStore> basically kicks out that freedom out of the
window.
1.b For strema-stream join, in our original design we intend to "never"
want users to query the state, since it is just for buffering the upcoming
records from the stream. Now I know that some users may indeed want to
query it from the debugging perspective, but still I concerned about
whether leveraging IQ for debugging purposes would be the right solution
here. And adding Materialized object opens the door to let users query
about it (unless we did something intentionally to still forbids it), which
also restricts us in the future.

2. About the coupling between Materialized.name() and queryable: again I
think this is a valid issue. But I'm not sure if the current
"withQuerryingDisabled / Enabled" at Materialized is the best approach.
Here I think I agree with John, that generally speaking it's better be a
control function on the `KTable` itself, rather than on `Materialized`, so
fixing it via adding functions through `Materialized` seems not a natural
approach either.


Overall, I'm thinking maybe we should still use two stones rather than one
to kill these two birds, and probably for this KIP we just focus on 1)
above. And for that I'd like to not expose the Materialized either for
rationales that I've listed above. Instead, we just restrict KIP-307 to NOT
use the Joined.name for state store names and always use internal names as
well, which admittedly indeed leaves a hole of not being able to cover all
internal names here, but now I feel this `hole` may better be filled by,
e.g. not creating changelog topics but just use the upstream to
re-bootstrap the materialized store, more concretely: when materializing
the store, try to piggy-back the changelog topic on an existing topic, e.g.
a) if the stream is coming directly from some source topic (including
repartition topic), make that as changelog topic and if it is repartition
topic change the retention / data purging policy necessarily as well; b) if
the stream is coming from some stateless operators, delegate that stateless
operator to the parent stream similar as a); if the stream is coming from a
stream-stream join which is the only stateful operator that can result in a
stream, consider merging the join into multi-way joins (yes, this is a very
hand-wavy thought, but the point here is that we do not try to tackle it
now but leave it for a better solution :).


Guozhang



On Wed, Jun 19, 2019 at 11:41 AM John Roesler <j...@confluent.io> wrote:

> Hi Bill,
>
> Thanks for the KIP! Awesome job catching this unexpected consequence
> of the prior KIPs before it was released.
>
> The proposal looks good to me. On top of just fixing the problem, it
> seems to address two other pain points:
> * that naming a state store automatically causes it to become queriable.
> * that there's currently no way to configure the bytes store for join
> windows.
>
> It's awesome that we can fix this issue and two others with one feature.
>
> I'm wondering about a missing quadrant from the truth table involving
> whether a Materialized is stored or not and querying is
> enabled/disabled... What should be the behavior if there is no store
> configured (e.g., if Materialized with only serdes) and querying is
> enabled?
>
> It seems we have two choices:
> 1. we can force creation of a state store in this case, so the store
> can be used to serve the queries
> 2. we can provide just a queriable view, basically letting IQ query
> into the "KTableValueGetter", which would transparently construct the
> query response by applying the operator logic to the upstream state if
> the operator state isn't already stored.
>
> Offhand, it seems like the second is actually a pretty awesome
> capability. But it might have an awkward interaction with the current
> semantics. Presently, if I provide a Materialized.withName, it implies
> that querying should be enabled AND that the view should actually be
> stored in a state store. Under option 2 above, this behavior would
> change to NOT provision a state store and instead just consult the
> ValueGetter. To get back to the current behavior, users would have to
> add a "bytes store supplier" to the Materialized to indicate that,
> yes, they really want a state store there.
>
> Behavior changes are always kind of scary, but I think in this case,
> it might actually be preferable. In the event where only the name is
> provided, it means that people just wanted to make the operation
> result queriable. If we automatically convert this to a non-stored
> view, then simply upgrading results in the same observable behavior
> and semantics, but a linear reduction in local storage requirements
> and disk i/o, as well as a corresponding linear reduction in memory
> usage both on and off heap.
>
> What do you think?
> -John
>
> On Tue, Jun 18, 2019 at 9:21 PM Bill Bejeck <bbej...@gmail.com> wrote:
> >
> > All,
> >
> > I'd like to start a discussion for adding a Materialized configuration
> > object to KStream.join for naming state stores involved in joins.
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-479%3A+Add+Materialized+to+Join
> >
> > Your comments and suggestions are welcome.
> >
> > Thanks,
> > Bill
>


-- 
-- Guozhang

Reply via email to