Hey,
... and group by. And yes, there is no logical context we can present.
The context that is present has nothing to do with the record currently being processed.
It just doesn't come out right.
https://en.wikipedia.org/wiki/Relational_algebra#Aggregation
I am all in on this approach.
Best Jan
On 17.11.2017 07:05, Guozhang Wang wrote:
Matthias,
For this idea, are you proposing that for any many-to-one mapping
operations (for now only join operators), we will strip off the record
context in the resulting records and claim "we cannot infer its traced
context anymore"?
Guozhang
On Thu, Nov 16, 2017 at 1:03 PM, Matthias J. Sax <matth...@confluent.io>
wrote:
Any thoughts about my latest proposal?
-Matthias
On 11/10/17 10:02 PM, Jan Filipiak wrote:
Hi,
I think this is the better way. Naming is always tricky; "Source" is kind of taken.
I had TopicBackedK[Source|Table] in mind,
but for the user it's already way better IMHO.
Thank you for reconsidering.
Best Jan
On 10.11.2017 22:48, Matthias J. Sax wrote:
I was thinking about the source stream/table idea once more and it seems
it would not be too hard to implement:
We add two new classes
SourceKStream extends KStream
and
SourceKTable extends KTable
and return both from StreamsBuilder#stream and StreamsBuilder#table
As both are sub-classes, this change is backward compatible. We change
the return type of any single-record transform to these new types, too,
and use KStream/KTable as the return type for any multi-record operation.
The new RecordContext API is added to both new classes. For old classes,
we only implement KIP-149 to get access to the key.
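A rough sketch of the shape (interfaces rather than classes, to mirror KStream/KTable; all names and signatures below are illustrative only, not final):

import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Predicate;
import org.apache.kafka.streams.kstream.ValueMapper;

// Illustrative only: the metadata accessor proposed in KIP-159.
interface RecordContext {
    String topic();
    int partition();
    long offset();
    long timestamp();
}

// Returned by StreamsBuilder#stream(); carries a well-defined RecordContext.
interface SourceKStream<K, V> extends KStream<K, V> {

    // single-record operations keep the source type (covariant return types) ...
    @Override
    SourceKStream<K, V> filter(Predicate<? super K, ? super V> predicate);

    @Override
    <VR> SourceKStream<K, VR> mapValues(ValueMapper<? super V, ? extends VR> mapper);

    // ... plus rich overloads that hand a RecordContext to the user function
    // (e.g. a hypothetical RichPredicate), while multi-record operations
    // (joins, aggregations) keep returning the plain KStream/KTable types
    // inherited from KStream, where no single RecordContext can be inferred.
}

// Returned by StreamsBuilder#table(); analogous to SourceKStream.
interface SourceKTable<K, V> extends KTable<K, V> { }

With that, builder.stream("topic").filter(...).mapValues(...) would still expose the record context, while e.g. a groupByKey().count() result would only be a plain KTable.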
WDYT?
-Matthias
On 11/9/17 9:13 PM, Jan Filipiak wrote:
Okay,
looks like it would _at least work_ for cached KTableSources.
But we make it easier for the user to make mistakes by putting
features into places where they don't make sense and don't
help anyone.
I once again think that my suggestion is easier to implement and
more correct. I will use this email to express my disagreement with the
proposed KIP (-1, non-binding of course) and to state that I am open to any
questions regarding this. I will also do the usual thing and point out that
the friends over at Hive got it correct as well.
One cannot use their virtual columns
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
anywhere that does not read directly from the sources.
With KSQL in mind, it makes me sad how this is evolving here.
Best Jan
On 10.11.2017 01:06, Guozhang Wang wrote:
Hello Jan,
Regarding your question about caching: today we already keep the record
context with the cached entry, so when we flush the cache, which may generate
new records to forward, we set the record context appropriately; then, after
the flush is completed, we reset the context to the record that was being
processed before the flush happened. But I think when Jeyhun does the PR it is
a good time to double-check these stages to make sure we are not introducing
any regressions.
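Just to illustrate the ordering described above -- this is not the actual caching code, and all class and method names below are made up:

// Pseudo-code sketch of the ordering only -- not the real implementation.
// CacheEntry, MutableContext, and forwardDownstream() are made-up names.
final class CacheFlushSketch<K, V> {

    static final class CacheEntry<K, V> {
        final K key;
        final V value;
        final Object recordContext;   // context captured when the entry was cached
        CacheEntry(final K key, final V value, final Object recordContext) {
            this.key = key;
            this.value = value;
            this.recordContext = recordContext;
        }
    }

    interface MutableContext {
        Object recordContext();
        void setRecordContext(Object context);
    }

    private final MutableContext context;

    CacheFlushSketch(final MutableContext context) {
        this.context = context;
    }

    void flush(final Iterable<CacheEntry<K, V>> dirtyEntries) {
        // remember the context of the record that triggered the flush
        final Object current = context.recordContext();
        try {
            for (final CacheEntry<K, V> entry : dirtyEntries) {
                // forward each cached record with the context it was cached with
                context.setRecordContext(entry.recordContext);
                forwardDownstream(entry.key, entry.value);
            }
        } finally {
            // restore the context of the record being processed before the flush
            context.setRecordContext(current);
        }
    }

    private void forwardDownstream(final K key, final V value) {
        // stand-in for forwarding to downstream processors
    }
}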
Guozhang
On Mon, Nov 6, 2017 at 8:54 PM, Jan Filipiak <
jan.filip...@trivago.com>
wrote:
I agree completely.
Exposing this information in a place where it doesn't _naturally_ belong
might really become a bad blocker in the long run.
Concerning your first point: I would argue it's not too hard to have the user
keep track of these. If we still don't want the user
to keep track of these, I would argue that all > projection only <
transformations on a source-backed KTable/KStream
could also return a KTable/KStream instance of the type we return from the
topology builder.
Only after an operation that goes beyond projection or filter would one
return a KTable that no longer grants access to this.
Even then it's already difficult: I have never run a topology with caching, but I
am not even 100% sure what the RecordContext means behind
a materialized KTable with caching. Topic and partition can probably be explained
with some reasoning, but the offset is probably only the offset that caused the
flush? So one might as well think about dropping offsets from this RecordContext.
Best Jan
On 07.11.2017 03:18, Guozhang Wang wrote:
Regarding the API design (the proposed set of overloads vs. one overload
on #map to enrich the record), I think what we have represents a good
trade-off between API succinctness and user convenience: on one hand we
definitely want to keep as few overloaded functions as possible. But on
the other hand, if we only do that in, say, the #map() function, then this
enrichment could be overkill: think of a topology that has 7 operators
in a chain, where users want to access the record context on operators #2
and #6 only; with the "enrichment" manner they need to do the enrichment
on operator #2 and keep it that way until #6. In addition, the RecordContext
fields (topic, offset, etc.) are really orthogonal to the key-value payloads
themselves, so I think separating them into this object is a cleaner way.
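To make the trade-off concrete, here is a small sketch of the enrichment alternative; the Enriched wrapper and the rich lambda in the final comment are hypothetical, just to show the shape of the two options:

// Hypothetical wrapper type a user would need for the "enrich on #map()" approach.
final class Enriched<V> {
    final V value;
    final String topic;
    final long offset;

    Enriched(final V value, final String topic, final long offset) {
        this.value = value;
        this.topic = topic;
        this.offset = offset;
    }
}

// With enrichment, operator #2 maps V -> Enriched<V>, so operators #3-#5 must all
// be written against Enriched<V> even though only operator #6 reads the metadata,
// and something downstream has to unwrap it again.
//
// With a separate RecordContext object, operators #2 and #6 just use the rich
// overload where they need it and the value type stays untouched, e.g. (strawman):
//
//   stream.mapValues((value, recordContext) -> value + "@" + recordContext.offset());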
Regarding the RecordContext inheritance, this is actually a good point that
has not been discussed thoroughly before. Here are my two cents: one
natural way would be to inherit the record context from the "triggering"
record; for example, in a join operator, if the record from stream A
triggers the join, then the record context is inherited from that record.
This is also aligned with the lower-level PAPI interface. A counter-argument,
though, would be that this is sort of leaking the internal implementation of
the DSL, so that moving forward, if we did some refactoring of our join
implementations such that the triggering record can change, the RecordContext
would also be different. I do not know how much it would really affect end
users, but would like to hear your opinions.
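As a concrete example of the "triggering record" rule (topic names are made up; the behavior in the comments is the proposed semantics, not an existing API):

// Sketch only: inside the topology-building code, default serdes assumed.
KStream<String, String> joinedClicks(final StreamsBuilder builder) {
    final KStream<String, String> clicks = builder.stream("clicks");      // "topic A"
    final KStream<String, String> profiles = builder.stream("profiles");  // "topic B"

    return clicks.join(
        profiles,
        (click, profile) -> click + "/" + profile,
        JoinWindows.of(TimeUnit.MINUTES.toMillis(5)));
}
// If a "clicks" record arrives and finds a match in the "profiles" window, the
// joined record would inherit the clicks record's context (topic, partition,
// offset); if the "profiles" side triggers the join, the context would come
// from "profiles" instead. A refactoring that changes which side triggers
// would change the inherited RecordContext -- which is the leakage concern above.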
Agreed to 100% exposing this information
Guozhang
On Mon, Nov 6, 2017 at 1:00 PM, Jeyhun Karimov <
je.kari...@gmail.com>
wrote:
Hi Jan,
Sorry for late reply.
The API Design doesn't look appealing
In terms of API design we tried to preserve the Java functional interfaces.
We applied the same set of rich methods to KTable to make it compatible
with the rest of the overloaded APIs.
It should be 100% sufficient to offer a KTable + KStream that is directly
fed from a topic, with one additional overload for the #map() methods, to
cover every use case while keeping the API in a much better state.
- IMO this seems like a workaround rather than a direct solution.
Perhaps we should continue this discussion in the DISCUSS thread.
Cheers,
Jeyhun
On Mon, Nov 6, 2017 at 9:14 PM Jan Filipiak
<jan.filip...@trivago.com>
wrote:
Hi.
I do understand that it might come in handy.
From my POV, in any relational algebra this is only a projection.
Currently we hide these "fields" that come with the input record.
It should be 100% sufficient to offer a KTable + KStream that is directly
fed from a topic, with one additional overload for the #map() methods, to
cover every use case while keeping the API in a much better state.
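Something along these lines, as a sketch only; MapperWithContext and the RecordContext placeholder below are made-up names for illustration:

import org.apache.kafka.streams.KeyValue;

// Placeholder for the metadata accessor discussed in the KIP (topic/partition/offset).
interface RecordContext {
    String topic();
    int partition();
    long offset();
}

// The single extra overload would take a mapper that also sees the context.
interface MapperWithContext<K, V, KR, VR> {
    KeyValue<KR, VR> apply(K key, V value, RecordContext recordContext);
}

// Offered on the topic-backed KStream returned by StreamsBuilder#stream() only:
//
//   <KR, VR> KStream<KR, VR> map(MapperWithContext<? super K, ? super V, KR, VR> mapper);
//
// Usage: project the metadata into key/value once, right at the source, and keep
// the rest of the DSL untouched:
//
//   builder.stream("input")
//          .map((key, value, ctx) -> KeyValue.pair(key, value + "@" + ctx.offset()));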
best Jan
On 06.11.2017 17:52, Matthias J. Sax wrote:
Jan,
I understand what you are saying. However, having a RecordContext is
super useful for operations that are applied to an input topic. Many users
requested this feature -- it's much more convenient than falling back to
transform() to implement, for example, a filter() that wants to access
some metadata.
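For reference, the transform() fallback for a metadata-aware filter looks roughly like this today (the topic name and the filter condition are made up for the example):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

class MetadataFilterExample {

    // Emulates filter() with access to record metadata by falling back to transform().
    static KStream<String, String> filteredByMetadata(final StreamsBuilder builder) {
        final KStream<String, String> input = builder.stream("input-topic");  // made-up topic

        return input.transform(() ->
            new Transformer<String, String, KeyValue<String, String>>() {
                private ProcessorContext context;

                @Override
                public void init(final ProcessorContext context) {
                    this.context = context;  // gives access to topic()/partition()/offset()
                }

                @Override
                public KeyValue<String, String> transform(final String key, final String value) {
                    // made-up condition: keep only records read past some offset
                    if (context.offset() > 1000L) {
                        return KeyValue.pair(key, value);
                    }
                    return null;  // returning null drops the record
                }

                @Override
                public KeyValue<String, String> punctuate(final long timestamp) {
                    return null;  // no scheduled output
                }

                @Override
                public void close() { }
            });
    }
}

That is a lot of boilerplate compared to a filter() variant that simply hands the metadata to the predicate.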
Because we cannot distinguish different "origins" of a KStream/KTable, I
am not sure if there would be a better way to do this. The only
"workaround" I see is to have two KStream/KTable interfaces each, and we
would use the first one for a KStream/KTable with a "proper" RecordContext.
But this does not seem to be a good solution either.
Note, a KTable can also be read directly from a topic. I agree that
using RecordContext on a KTable that is the result of an aggregation is
questionable, but I don't see that as a reason to down-vote the KIP.
WDYT about this?
-Matthias
On 11/1/17 10:19 PM, Jan Filipiak wrote:
-1 non binding
I don't get the motivation.
In 80% of my DSL processors there is no such thing as a reasonable
RecordContext.
After a join, the record I am processing belongs to at least two topics.
After a group-by, the record I am processing was created from multiple
offsets.
The API Design doesn't look appealing
Best Jan
On 01.11.2017 22:02, Jeyhun Karimov wrote:
Dear community,
It seems the discussion for KIP-159 [1] has finally converged. I would
like to initiate voting for this KIP.
[1]
https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams
Cheers,
Jeyhun