Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Dian Fu Tue, 05 Jan 2021 20:00:29 -0800

Hi all,

I have updated the FLIP to cover temporal joins, SQL hints, and window TVFs.


Regards,
Dian

> On Jan 5, 2021, at 11:58 AM, Dian Fu <dian0511...@gmail.com> wrote:
> 
> Thanks a lot for your comments!
> 
> Regarding Python Table API examples: I thought it would be straightforward
> to use these operations in the Python Table API, so I had not added them.
> However, the suggestion makes sense to me, and I have added some examples of
> how to use them in the Python Table API to make it clearer.
> 
> Regarding dropDuplicates vs deduplicate: +1 for deduplicate. It is more
> consistent with the feature/concept, which is already clearly documented in
> Flink.
> 
> Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of
> fillna for now. Compared to coalesce, fillna can handle multiple columns in
> one method call. As for the naming convention, the names
> "fillna/dropna/replace" come from Pandas [1][2][3].
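
To make the difference concrete, here is a minimal pure-Python sketch over a list of dicts. It is only an illustration of the two call styles, not Flink or Pandas code: the `coalesce` and `fillna` helpers and the sample rows are hypothetical and are not the API proposed in the FLIP.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None (SQL COALESCE semantics)."""
    for value in values:
        if value is not None:
            return value
    return None


def fillna(rows: List[Row], value: Any, columns: Optional[List[str]] = None) -> List[Row]:
    """Hypothetical helper: replace None with `value` in the given columns.

    When `columns` is None, every column is handled in the same call.
    """
    result = []
    for row in rows:
        targets = list(row.keys()) if columns is None else columns
        new_row = dict(row)
        for name in targets:
            new_row[name] = coalesce(row[name], value)
        result.append(new_row)
    return result


rows = [{"a": None, "b": 2, "c": None}, {"a": 1, "b": None, "c": 3}]

# coalesce style: one expression per column, so only "a" is handled here.
per_column = [dict(row, a=coalesce(row["a"], 0)) for row in rows]

# fillna style: a single call covers every column.
all_columns = fillna(rows, 0)
```

With hundreds of columns, the coalesce style needs one expression per column, while the fillna style stays a single call.
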
> 
> Regarding `event-time/processing-time temporal join, SQL Hints, window
> TVF`: Good catch! We should definitely support them in the Table API. I will
> update the FLIP to cover these features.
> 
> [1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
> [2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
> [3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
> 
>> On Jan 4, 2021, at 10:59 PM, Timo Walther <twal...@apache.org> wrote:
>> 
>> Hi Dian,
>> 
>> Thanks for the proposed FLIP. I haven't taken a deep look at the proposal
>> yet but will do so shortly. In general, we should aim to make the Table API
>> as concise and self-explanatory as possible. E.g. `dropna` does not sound
>> obvious to me.
>> 
>> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more
>> top-level functions, maybe we should also consider introducing more building
>> blocks, e.g. for applying an expression to every column. A more functional
>> approach (e.g. with a lambda function) could solve more use cases.
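
As a rough illustration of the building-block idea, the pure-Python sketch below applies a lambda to every column of every row. `map_columns` is a hypothetical name chosen for this example and is not an existing or proposed Table API method.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def map_columns(rows: List[Row], fn: Callable[[str, Any], Any]) -> List[Row]:
    """Hypothetical building block: apply fn(column_name, value) to every column."""
    return [{name: fn(name, value) for name, value in row.items()} for row in rows]


rows = [{"a": None, "b": 2}, {"a": 1, "b": None}]

# Null filling expressed through the generic building block.
filled = map_columns(rows, lambda name, value: 0 if value is None else value)

# The same building block covers other per-column cases, e.g. value replacement.
replaced = map_columns(rows, lambda name, value: -1 if value == 2 else value)
```

One generic operation then stands in for several special-purpose ones, at the cost of a less discoverable name for each individual task.
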
>> 
>> Regards,
>> Timo
>> 
>> On 04.01.21 15:35, Seth Wiesman wrote:
>>> This makes sense, I have some questions about method names.
>>> 
>>> What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
>>> think that drop is the right word to use for this operation: it implies
>>> records are filtered, whereas this operator actually issues updates and
>>> retractions. Also, deduplicate is already how we talk about this feature in
>>> the docs, so I think it would be easier for users to find.
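
The distinction between filtering and issuing retractions can be sketched in plain Python. This is a simplified model assuming keep-last-row semantics, not Flink's implementation: the row-kind labels follow Flink's changelog notation, while `deduplicate_keep_last` and the sample events are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row]


def deduplicate_keep_last(rows: Iterable[Row], key: str) -> List[Change]:
    """Emit a changelog that keeps only the latest row per key.

    "+I" inserts a row for a new key, "-U" retracts the previously emitted
    row for a key, and "+U" emits the row that replaces it.
    """
    latest: Dict[Any, Row] = {}
    changelog: List[Change] = []
    for row in rows:
        k = row[key]
        if k in latest:
            # A duplicate key does not drop anything: it updates the result.
            changelog.append(("-U", latest[k]))
            changelog.append(("+U", row))
        else:
            changelog.append(("+I", row))
        latest[k] = row
    return changelog


events = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
changes = deduplicate_keep_last(events, "id")
```

The third event produces two output records rather than being filtered out, which is why "drop" is a misleading name for the operation.
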
>>> 
>>> For null handling, I don't know how closely we want to stick to SQL
>>> conventions, but what about making `coalesce` a top-level method? Something
>>> like:
>>> 
>>>     myTable.coalesce($("a"), 1).as("a")
>>> 
>>> We can require the next method to be an `as`. There is already precedent
>>> for this sort of thing: `GroupedTable#aggregate` can only be followed by
>>> `select`.
>>> 
>>> Seth
>>> 
>>> On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <weizhong0...@gmail.com> wrote:
>>>> Hi Dian,
>>>> 
>>>> Big +1 for making the Table API easier to use. Java users and Python users
>>>> can both benefit from it. I think it would be better if we add some Python
>>>> API examples.
>>>> 
>>>> Best,
>>>> Wei
>>>> 
>>>> 
>>>>> On Jan 4, 2021, at 20:03, Dian Fu <dian0511...@gmail.com> wrote:
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I'd like to start a discussion about introducing a few convenient
>>>>> operations in the Table API from the perspective of ease of use.
>>>>> 
>>>>> Currently, some tasks are not easy to express in the Table API, e.g.
>>>>> deduplication and top-N, or become hard to express when a table has
>>>>> hundreds of columns, e.g. null data handling.
>>>>> 
>>>>> I'd like to propose introducing a few operations in the Table API with
>>>>> the following purposes:
>>>>> - Enable Table API users to easily leverage the powerful features already
>>>>>   in SQL, e.g. deduplication and top-N.
>>>>> - Provide some convenient operations, e.g. a series of operations for
>>>>>   null data handling (which may become a problem when there are hundreds
>>>>>   of columns), and data sampling and splitting (a very common use case in
>>>>>   ML, which usually needs to split a table into multiple tables for
>>>>>   training and validation separately).
>>>>> 
>>>>> Please refer to FLIP-155 [1] for more details.
>>>>> 
>>>>> Looking forward to your feedback!
>>>>> 
>>>>> Regards,
>>>>> Dian
>>>>> 
>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API
>>>> 
>>>> 
>> 
>
