Hi Jincheng,
Sounds good!
You should have Wiki permissions now.
Thanks, Fabian
Am Do., 29. Nov. 2018 um 14:46 Uhr schrieb jincheng sun <
sunjincheng...@gmail.com>:
> Thanks Fabian&Piotrek,
>
> Your feedback sounds very good!
> So far we on the same page about how to handle group keys. I will up
Thanks Fabian&Piotrek,
Your feedback sounds very good!
So far we on the same page about how to handle group keys. I will update
the google doc according our discussion and I'd like to convert it to a
FLIP. Thus, it would be great if any of you can grant me the write access
to Confluence. My Co
Hi Jincheng & Fabian,
+1 From my point of view.
I like this idea that you have to close `flatAggregate` with `select`
statement. In a way it will be consistent with normal `groupBy` and indeed it
solves the problem of mixing table and scalar functions.
I would be against supporting `select(‘*
Hi,
OK, not supporting select('*) in the first version, sounds like a good
first step, +1 for this.
However, I don't think that select('*) returning only the result columns of
the agg function would be a significant break in semantics.
Since aggregate()/flatAggregate() is the last command and (vi
Hi Fabian,
Thank you for listing the detailed example of forcing the use of select.
If I didn't make it clear before, I would like to share my thoughts about
the group keys here:
1. agg/flatagg(Expression) keeps a single Expression;
2. The way to force users to use select is as follows(As y
I don't think we came to a conclusion yet.
Are you suggesting that 'w would disappear if we do the following:
val tabX = tab.window(Tumble ... as 'w)
.groupBy('w, 'k1, 'k2)
.flatAgg(tableAgg('a))
.select('*)
Given that tableAgg returns [col1, col2], what would the schema of tabX be?
Thanks Fabian, if we enforcing select, as i said before user should using
'w.start, 'w.end, 'w.rowtime, 'w.proctime, 'w.XXX' etc. In this way we
should not defined the type of 'w, we can keep the current way of using 'w.
I'll file the JIRA. and open the PR ASAP.
Thanks,
Jincheng
Fabian Hueske 于
I thought about it again.
Enforcing select won't help us with the choice of how to represent 'w.
select('*) and
select('a, 'b, 'w).
would still be valid expressions and we need to decide how 'w is
represented.
As I said before, Tuple, Row, and Map have disadvantages because the syntax
'w.rowtime
Before we have a good support for nest-table, may be forcing the use of
select is good way, at least not causing compatibility issues.
Fabian Hueske 于2018年11月26日周一 下午6:48写道:
> I think the question is what is the data type of 'w.
>
> Until now, I assumed it would be a nested tuple (Row or Tuple).
I think the question is what is the data type of 'w.
Until now, I assumed it would be a nested tuple (Row or Tuple).
Accessing nested fields in Row, Tuple or Pojo is done with get, i.e.,
'w.get("rowtime").
Using a Map would not help because the access would be 'w.at("rowtime").
We can of course a
Yes,I agree the problem is needs attention. IMO. It depends on how we
define the ‘w type. The way you above defines the 'w type as a tuple. If
you serialize 'w to a Map, the compatibility will be better. Even more we
can define ‘w as a special type. UDF and Sink can't be used directly. Must
use 'w.
Something like:
val x = tab.window(Tumble ... as 'w)
.groupBy('w, 'k1, 'k2)
.flatAgg(tableAgg('a)).as('w, 'k1, 'k2, 'col1, 'col2)
x.insertInto("sinkTable") // fails because result schema has changed from
((start, end, rowtime), k1, k2, col1, col2) to ((start, end, rowtime,
newProperty), k
Hi Fabian,
I don't fully understand the question you mentioned:
Any query that relies on the composite type with three fields will fail
after adding a forth field.
I am appreciate if you can give some detail examples ?
Regards,
JIncheng
Fabian Hueske 于2018年11月23日周五 下午4:41写道:
> Hi,
>
> My
Hi,
My concerns are about the case when there is no additional select() method,
i.e.,
tab.window(Tumble ... as 'w)
.groupBy('w, 'k1, 'k2)
.flatAgg(tableAgg('a)).as('w, 'k1, 'k2, 'col1, 'col2)
In this case, 'w is a composite field consisting of three fields (end,
start, rowtime).
Once we
Thanks Fabian,
Thanks a lot for your feedback, and very important and necessary design
reminders!
Yes, your are right! Spark is the specified grouping columns displayed
before 1.3, but the grouping columns are implicitly passed in spark1.4 and
later. The reason for changing this behavior is that
Hi all,
First of all, it is correct that the flatMap(Expression*) and
flatAggregate(Expression*) methods would mix scalar and table values.
This would be a new concept that is not present in the current API.
>From my point of view, the semantics are quite clear, but I understand that
others are mo
Hi Jincheng,
#1) ok, got it.
#3)
> From points of my view I we can using
> `Expression`, and after the discussion decided to use Expression*, then
> improve it. In any case, we can use Expression, and there is an opportunity
> to become Expression* (compatibility). If we use Expression* directly,
Hi Piotrek,
#1)We have unbounded and bounded group window aggregate, for unbounded case
we should early fire the result with retract message, we can not using
watermark, because unbounded aggregate never finished. (for improvement we
can introduce micro-batch in feature), for bounded window we nev
Hi Jincheng,
> #1) No,watermark solves the issue of the late event. Here, the performance
> problem is caused by the update emit mode. i.e.: When current calculation
> result is output, the previous calculation result needs to be retracted.
Hmm, yes I missed this. For time-windowed cases (some ag
Hi shaoxuan & Hequn,
Thanks for your suggestion,I'll file the JIRAs later.
We can prepare PRs while continuing to move forward the ongoing discussion.
Regards,
Jincheng
jincheng sun 于2018年11月21日周三 下午7:07写道:
> Hi Piotrek,
> Thanks for your feedback, and thanks for share your thoughts!
>
> #1)
Hi Piotrek,
Thanks for your feedback, and thanks for share your thoughts!
#1) No,watermark solves the issue of the late event. Here, the performance
problem is caused by the update emit mode. i.e.: When current calculation
result is output, the previous calculation result needs to be retracted.
#
Hi,
1.
> In fact, in addition to the design of APIs, there will be various
> performance optimization details, such as: table Aggregate function
> emitValue will generate multiple calculation results, in extreme cases,
> each record will trigger a large number of retract messages, this will have
Hi,
Thank you all for the great proposal and discussion!
I also prefer to move on to the next step, so +1 for opening the JIRAs to
start the work.
We can have more detailed discussion there. Btw, we can start with JIRAs
which we have agreed on.
Best,
Hequn
On Tue, Nov 20, 2018 at 11:38 PM Shaoxu
+1. I agree that we should open the JIRAs to start the work. We may
have better ideas on the flavor of the interface when implement/review
the code.
Regards,
shaoxuan
On 11/20/18, jincheng sun wrote:
> Hi all,
>
> Thanks all for the feedback.
>
> @Piotr About not using abbreviations naming, +1
Hi all,
Thanks all for the feedback.
@Piotr About not using abbreviations naming, +1,I like
your proposal!Currently both DataSet and DataStream API are using `aggregate`,
BTW,I find other language also not using abbreviations naming,such as R.
Sometimes the interface of the API is really diffic
Hi Fabian & Piotr, thanks for the feedback!
I appreciate your concerns, both on timestamp attributes as well as on
implicit group keys. At the same time, I'm also concerned with the proposed
approach of allowing Expression* as parameters, especially for
flatMap/flatAgg. So far, we never allowed a
Hi,
Isn’t the problem of multiple expressions limited only to `flat***` functions
and to be more specific only to having two (or more) different table functions
passed as an expressions? `.flatAgg(TableAggA('a), scalarFunction1(‘b),
scalarFunction2(‘c))` seems to be well defined (duplicate resu
Hi Jincheng,
I said before, that I think that the append() method is better than
implicitly forwarding keys, but still, I believe it adds unnecessary boiler
plate code.
Moreover, I haven't seen a convincing argument why map(Expression*) is
worse than map(Expression). In either case we need to do
Hi Fabian/Xiaowei,
I am very sorry for my late reply! Glad to see your reply, and sounds
pretty good!
I agree that the approach with append() which can clearly defined the
result schema is better which Fabian mentioned.
In addition and append() and also contains non-time attributes, e.g.:
tab
Hi Jincheng,
Thanks for the summary!
I like the approach with append() better than the implicit forwarding as it
clearly indicates which fields are forwarded.
However, I don't see much benefit over the flatMap(Expression*) variant, as
we would still need to analyze the full expression tree to ensu
Hi all,
We are discussing very detailed content about this proposal. We are trying
to design the API in many aspects (functionality, compatibility, ease of
use, etc.). I think this is a very good process. Only such a detailed
discussion, In order to develop PR more clearly and smoothly in the late
Hi Fabian,
I think that the key question you raised is if we allow extra parameters in
the methods map/flatMap/agg/flatAgg. I can see why allowing that may appear
more convenient in some cases. However, it might also cause some confusions
if we do that. For example, do we allow multiple UDFs in th
Hi,
* Re emit:
I think we should start with a well understood semantics of full
replacement. This is how the other agg functions work.
As was said before, there are open questions regarding an append mode
(checkpointing, whether supporting retractions or not and if yes how to
declare them, ...).
S
Hi xiaowei,
Yes, I agree with you that the semantics of TableAggregateFunction emit is
much more complex than AggregateFunction. The fundamental difference is
that TableAggregateFunction emits a "table" while AggregateFunction outputs
(a column of) a "row". In the case of AggregateFunction it only
Hi Xiaowei,
Thank you for mentioned such key points. Yes, I think those points are very
important for the clear definition of the semantics of Table
AggregateFunction!I'd like share my thoughts about the those questions:
1. Do we allow multi-staged TableAggregate in this case?
>From the points of
Hi,
Thanks for the great design document!
It answers my question regarding handling of retraction messages.
Overall, I like the proposal. It is well scoped and the proposed changes
are well described.
I left a question regarding the handling of time attributes for
multi-column output functions.
Hi Jincheng,
Thanks for adding the public interfaces! I think that it's a very good
start. There are a few points that we need to have more discussions.
- TableAggregateFunction - this is a very complex beast, definitely the
most complex user defined objects we introduced so far. I think th
Hi, Xiaowei,
Thanks for bring up the discuss of Table API Enhancement Outline !
I quickly looked at the overall content, these are good expressions of our
offline discussions. But from the points of my view, we should add the
usage of public interfaces that we will introduce in this propose. So,
Hi All,
As Jincheng brought up in the previous email, there are a set of
improvements needed to make Table API more complete/self-contained. To give
a better overview on this, Jincheng, Jiangjie, Shaoxuan and myself
discussed offline a bit and came up with an initial outline.
Table API Enhancemen
39 matches
Mail list logo