Re: [DISCUSS] Table API Enhancement Outline

2018-11-29 Thread Fabian Hueske
Hi Jincheng, Sounds good! You should have Wiki permissions now. Thanks, Fabian Am Do., 29. Nov. 2018 um 14:46 Uhr schrieb jincheng sun < sunjincheng...@gmail.com>: > Thanks Fabian&Piotrek, > > Your feedback sounds very good! > So far we on the same page about how to handle group keys. I will up

Re: [DISCUSS] Table API Enhancement Outline

2018-11-29 Thread jincheng sun
Thanks Fabian&Piotrek, Your feedback sounds very good! So far we on the same page about how to handle group keys. I will update the google doc according our discussion and I'd like to convert it to a FLIP. Thus, it would be great if any of you can grant me the write access to Confluence. My Co

Re: [DISCUSS] Table API Enhancement Outline

2018-11-29 Thread Piotr Nowojski
Hi Jincheng & Fabian, +1 From my point of view. I like this idea that you have to close `flatAggregate` with `select` statement. In a way it will be consistent with normal `groupBy` and indeed it solves the problem of mixing table and scalar functions. I would be against supporting `select(‘*

Re: [DISCUSS] Table API Enhancement Outline

2018-11-29 Thread Fabian Hueske
Hi, OK, not supporting select('*) in the first version, sounds like a good first step, +1 for this. However, I don't think that select('*) returning only the result columns of the agg function would be a significant break in semantics. Since aggregate()/flatAggregate() is the last command and (vi

Re: [DISCUSS] Table API Enhancement Outline

2018-11-28 Thread jincheng sun
Hi Fabian, Thank you for listing the detailed example of forcing the use of select. If I didn't make it clear before, I would like to share my thoughts about the group keys here: 1. agg/flatagg(Expression) keeps a single Expression; 2. The way to force users to use select is as follows(As y

Re: [DISCUSS] Table API Enhancement Outline

2018-11-27 Thread Fabian Hueske
I don't think we came to a conclusion yet. Are you suggesting that 'w would disappear if we do the following: val tabX = tab.window(Tumble ... as 'w) .groupBy('w, 'k1, 'k2) .flatAgg(tableAgg('a)) .select('*) Given that tableAgg returns [col1, col2], what would the schema of tabX be?

Re: [DISCUSS] Table API Enhancement Outline

2018-11-27 Thread jincheng sun
Thanks Fabian, if we enforcing select, as i said before user should using 'w.start, 'w.end, 'w.rowtime, 'w.proctime, 'w.XXX' etc. In this way we should not defined the type of 'w, we can keep the current way of using 'w. I'll file the JIRA. and open the PR ASAP. Thanks, Jincheng Fabian Hueske 于

Re: [DISCUSS] Table API Enhancement Outline

2018-11-27 Thread Fabian Hueske
I thought about it again. Enforcing select won't help us with the choice of how to represent 'w. select('*) and select('a, 'b, 'w). would still be valid expressions and we need to decide how 'w is represented. As I said before, Tuple, Row, and Map have disadvantages because the syntax 'w.rowtime

Re: [DISCUSS] Table API Enhancement Outline

2018-11-26 Thread jincheng sun
Before we have a good support for nest-table, may be forcing the use of select is good way, at least not causing compatibility issues. Fabian Hueske 于2018年11月26日周一 下午6:48写道: > I think the question is what is the data type of 'w. > > Until now, I assumed it would be a nested tuple (Row or Tuple).

Re: [DISCUSS] Table API Enhancement Outline

2018-11-26 Thread Fabian Hueske
I think the question is what is the data type of 'w. Until now, I assumed it would be a nested tuple (Row or Tuple). Accessing nested fields in Row, Tuple or Pojo is done with get, i.e., 'w.get("rowtime"). Using a Map would not help because the access would be 'w.at("rowtime"). We can of course a

Re: [DISCUSS] Table API Enhancement Outline

2018-11-25 Thread jincheng sun
Yes,I agree the problem is needs attention. IMO. It depends on how we define the ‘w type. The way you above defines the 'w type as a tuple. If you serialize 'w to a Map, the compatibility will be better. Even more we can define ‘w as a special type. UDF and Sink can't be used directly. Must use 'w.

Re: [DISCUSS] Table API Enhancement Outline

2018-11-23 Thread Fabian Hueske
Something like: val x = tab.window(Tumble ... as 'w) .groupBy('w, 'k1, 'k2) .flatAgg(tableAgg('a)).as('w, 'k1, 'k2, 'col1, 'col2) x.insertInto("sinkTable") // fails because result schema has changed from ((start, end, rowtime), k1, k2, col1, col2) to ((start, end, rowtime, newProperty), k

Re: [DISCUSS] Table API Enhancement Outline

2018-11-23 Thread jincheng sun
Hi Fabian, I don't fully understand the question you mentioned: Any query that relies on the composite type with three fields will fail after adding a forth field. I am appreciate if you can give some detail examples ? Regards, JIncheng Fabian Hueske 于2018年11月23日周五 下午4:41写道: > Hi, > > My

Re: [DISCUSS] Table API Enhancement Outline

2018-11-23 Thread Fabian Hueske
Hi, My concerns are about the case when there is no additional select() method, i.e., tab.window(Tumble ... as 'w) .groupBy('w, 'k1, 'k2) .flatAgg(tableAgg('a)).as('w, 'k1, 'k2, 'col1, 'col2) In this case, 'w is a composite field consisting of three fields (end, start, rowtime). Once we

Re: [DISCUSS] Table API Enhancement Outline

2018-11-22 Thread jincheng sun
Thanks Fabian, Thanks a lot for your feedback, and very important and necessary design reminders! Yes, your are right! Spark is the specified grouping columns displayed before 1.3, but the grouping columns are implicitly passed in spark1.4 and later. The reason for changing this behavior is that

Re: [DISCUSS] Table API Enhancement Outline

2018-11-22 Thread Fabian Hueske
Hi all, First of all, it is correct that the flatMap(Expression*) and flatAggregate(Expression*) methods would mix scalar and table values. This would be a new concept that is not present in the current API. >From my point of view, the semantics are quite clear, but I understand that others are mo

Re: [DISCUSS] Table API Enhancement Outline

2018-11-22 Thread Piotr Nowojski
Hi Jincheng, #1) ok, got it. #3) > From points of my view I we can using > `Expression`, and after the discussion decided to use Expression*, then > improve it. In any case, we can use Expression, and there is an opportunity > to become Expression* (compatibility). If we use Expression* directly,

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread jincheng sun
Hi Piotrek, #1)We have unbounded and bounded group window aggregate, for unbounded case we should early fire the result with retract message, we can not using watermark, because unbounded aggregate never finished. (for improvement we can introduce micro-batch in feature), for bounded window we nev

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread Piotr Nowojski
Hi Jincheng, > #1) No,watermark solves the issue of the late event. Here, the performance > problem is caused by the update emit mode. i.e.: When current calculation > result is output, the previous calculation result needs to be retracted. Hmm, yes I missed this. For time-windowed cases (some ag

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread jincheng sun
Hi shaoxuan & Hequn, Thanks for your suggestion,I'll file the JIRAs later. We can prepare PRs while continuing to move forward the ongoing discussion. Regards, Jincheng jincheng sun 于2018年11月21日周三 下午7:07写道: > Hi Piotrek, > Thanks for your feedback, and thanks for share your thoughts! > > #1)

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread jincheng sun
Hi Piotrek, Thanks for your feedback, and thanks for share your thoughts! #1) No,watermark solves the issue of the late event. Here, the performance problem is caused by the update emit mode. i.e.: When current calculation result is output, the previous calculation result needs to be retracted. #

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread Piotr Nowojski
Hi, 1. > In fact, in addition to the design of APIs, there will be various > performance optimization details, such as: table Aggregate function > emitValue will generate multiple calculation results, in extreme cases, > each record will trigger a large number of retract messages, this will have

Re: [DISCUSS] Table API Enhancement Outline

2018-11-21 Thread Hequn Cheng
Hi, Thank you all for the great proposal and discussion! I also prefer to move on to the next step, so +1 for opening the JIRAs to start the work. We can have more detailed discussion there. Btw, we can start with JIRAs which we have agreed on. Best, Hequn On Tue, Nov 20, 2018 at 11:38 PM Shaoxu

Re: [DISCUSS] Table API Enhancement Outline

2018-11-20 Thread Shaoxuan Wang
+1. I agree that we should open the JIRAs to start the work. We may have better ideas on the flavor of the interface when implement/review the code. Regards, shaoxuan On 11/20/18, jincheng sun wrote: > Hi all, > > Thanks all for the feedback. > > @Piotr About not using abbreviations naming, +1

Re: [DISCUSS] Table API Enhancement Outline

2018-11-19 Thread jincheng sun
Hi all, Thanks all for the feedback. @Piotr About not using abbreviations naming, +1,I like your proposal!Currently both DataSet and DataStream API are using `aggregate`, BTW,I find other language also not using abbreviations naming,such as R. Sometimes the interface of the API is really diffic

Re: [DISCUSS] Table API Enhancement Outline

2018-11-18 Thread Xiaowei Jiang
Hi Fabian & Piotr, thanks for the feedback! I appreciate your concerns, both on timestamp attributes as well as on implicit group keys. At the same time, I'm also concerned with the proposed approach of allowing Expression* as parameters, especially for flatMap/flatAgg. So far, we never allowed a

Re: [DISCUSS] Table API Enhancement Outline

2018-11-15 Thread Piotr Nowojski
Hi, Isn’t the problem of multiple expressions limited only to `flat***` functions and to be more specific only to having two (or more) different table functions passed as an expressions? `.flatAgg(TableAggA('a), scalarFunction1(‘b), scalarFunction2(‘c))` seems to be well defined (duplicate resu

Re: [DISCUSS] Table API Enhancement Outline

2018-11-15 Thread Fabian Hueske
Hi Jincheng, I said before, that I think that the append() method is better than implicitly forwarding keys, but still, I believe it adds unnecessary boiler plate code. Moreover, I haven't seen a convincing argument why map(Expression*) is worse than map(Expression). In either case we need to do

Re: [DISCUSS] Table API Enhancement Outline

2018-11-13 Thread jincheng sun
Hi Fabian/Xiaowei, I am very sorry for my late reply! Glad to see your reply, and sounds pretty good! I agree that the approach with append() which can clearly defined the result schema is better which Fabian mentioned. In addition and append() and also contains non-time attributes, e.g.: tab

Re: [DISCUSS] Table API Enhancement Outline

2018-11-09 Thread Fabian Hueske
Hi Jincheng, Thanks for the summary! I like the approach with append() better than the implicit forwarding as it clearly indicates which fields are forwarded. However, I don't see much benefit over the flatMap(Expression*) variant, as we would still need to analyze the full expression tree to ensu

Re: [DISCUSS] Table API Enhancement Outline

2018-11-08 Thread jincheng sun
Hi all, We are discussing very detailed content about this proposal. We are trying to design the API in many aspects (functionality, compatibility, ease of use, etc.). I think this is a very good process. Only such a detailed discussion, In order to develop PR more clearly and smoothly in the late

Re: [DISCUSS] Table API Enhancement Outline

2018-11-07 Thread Xiaowei Jiang
Hi Fabian, I think that the key question you raised is if we allow extra parameters in the methods map/flatMap/agg/flatAgg. I can see why allowing that may appear more convenient in some cases. However, it might also cause some confusions if we do that. For example, do we allow multiple UDFs in th

Re: [DISCUSS] Table API Enhancement Outline

2018-11-07 Thread Fabian Hueske
Hi, * Re emit: I think we should start with a well understood semantics of full replacement. This is how the other agg functions work. As was said before, there are open questions regarding an append mode (checkpointing, whether supporting retractions or not and if yes how to declare them, ...). S

Re: [DISCUSS] Table API Enhancement Outline

2018-11-06 Thread Shaoxuan Wang
Hi xiaowei, Yes, I agree with you that the semantics of TableAggregateFunction emit is much more complex than AggregateFunction. The fundamental difference is that TableAggregateFunction emits a "table" while AggregateFunction outputs (a column of) a "row". In the case of AggregateFunction it only

Re: [DISCUSS] Table API Enhancement Outline

2018-11-06 Thread jincheng sun
Hi Xiaowei, Thank you for mentioned such key points. Yes, I think those points are very important for the clear definition of the semantics of Table AggregateFunction!I'd like share my thoughts about the those questions: 1. Do we allow multi-staged TableAggregate in this case? >From the points of

Re: [DISCUSS] Table API Enhancement Outline

2018-11-06 Thread Fabian Hueske
Hi, Thanks for the great design document! It answers my question regarding handling of retraction messages. Overall, I like the proposal. It is well scoped and the proposed changes are well described. I left a question regarding the handling of time attributes for multi-column output functions.

Re: [DISCUSS] Table API Enhancement Outline

2018-11-06 Thread Xiaowei Jiang
Hi Jincheng, Thanks for adding the public interfaces! I think that it's a very good start. There are a few points that we need to have more discussions. - TableAggregateFunction - this is a very complex beast, definitely the most complex user defined objects we introduced so far. I think th

Re: [DISCUSS] Table API Enhancement Outline

2018-11-06 Thread jincheng sun
Hi, Xiaowei, Thanks for bring up the discuss of Table API Enhancement Outline ! I quickly looked at the overall content, these are good expressions of our offline discussions. But from the points of my view, we should add the usage of public interfaces that we will introduce in this propose. So,

[DISCUSS] Table API Enhancement Outline

2018-11-05 Thread Xiaowei Jiang
Hi All, As Jincheng brought up in the previous email, there are a set of improvements needed to make Table API more complete/self-contained. To give a better overview on this, Jincheng, Jiangjie, Shaoxuan and myself discussed offline a bit and came up with an initial outline. Table API Enhancemen