Thanks a lot for your comments!
Regarding the Python Table API examples: I thought it would be straightforward
to use these operations in the Python Table API and so had not added any.
However, the suggestion makes sense to me, and I have added some examples
showing how to use them in the Python Table API to make it clearer.
Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more
consistent with the feature/concept that is already documented clearly in
Flink.
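To illustrate the point raised about updates and retractions, here is a
minimal plain-Python sketch of keep-last deduplication. It is not Flink code:
the helper `deduplicate_keep_last` and the change-kind labels "+I", "-U" and
"+U" are assumptions made only for this illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row]  # (change kind, row); kinds used here: "+I", "-U", "+U"


def deduplicate_keep_last(rows: Iterable[Row], key: str) -> List[Change]:
    """Hypothetical helper: emit a changelog keeping only the latest row per key."""
    latest: Dict[Any, Row] = {}
    changelog: List[Change] = []
    for row in rows:
        k = row[key]
        if k in latest:
            # The key was seen before: retract the old row, then emit the new one.
            changelog.append(("-U", latest[k]))
            changelog.append(("+U", row))
        else:
            # First row for this key: a plain insert.
            changelog.append(("+I", row))
        latest[k] = row
    return changelog


events = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "c"},
]
changes = deduplicate_keep_last(events, key="id")
```

The second row with `id` 1 is not silently dropped; it replaces the first one
through a retraction followed by an update, which is why "drop" is a
misleading name for this operation.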
Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of
fillna for now. Compared to coalesce, fillna can handle multiple columns in
one method call. As for the naming convention, the names
"fillna/dropna/replace" come from Pandas [1][2][3].
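The difference can be sketched in plain Python as follows. This is neither
the Table API nor Pandas: the helpers `coalesce` and `fillna` below are
hypothetical and only mimic the semantics under discussion, with rows modeled
as dicts and null modeled as `None`.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def coalesce(*values: Any) -> Any:
    """Return the first non-None argument, in the spirit of SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def fillna(rows: List[Row], value: Any,
           columns: Optional[List[str]] = None) -> List[Row]:
    """Hypothetical helper: replace None with `value` in the given columns
    (all columns if `columns` is omitted). The input rows are not modified."""
    result = []
    for row in rows:
        targets = list(row.keys()) if columns is None else columns
        filled = dict(row)
        for col in targets:
            filled[col] = coalesce(row[col], value)
        result.append(filled)
    return result


rows = [{"a": None, "b": 2, "c": None}, {"a": 1, "b": None, "c": 3}]

# One fillna call covers every column ...
filled_all = fillna(rows, 0)
# ... whereas coalesce has to be written out once per column.
filled_a_only = [{**r, "a": coalesce(r["a"], 0)} for r in rows]
```

With hundreds of columns, the per-column coalesce form has to be repeated for
each column, while the fillna form stays a single call.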
Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`:
Good catch! We should definitely support them in the Table API. I will update
the FLIP to cover these functionalities.
[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
> On Jan 4, 2021, at 10:59 PM, Timo Walther <[email protected]> wrote:
>
> Hi Dian,
>
> thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet
> but will do so shortly. In general, we should aim to make the Table API as
> concise and self-explaining as possible. E.g. `dropna` does not sound obvious
> to me.
>
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more
> top-level functions, maybe we should also consider introducing more building
> blocks e.g. for applying an expression to every column. A more functional
> approach (e.g. with lambda functions) could solve more use cases.
>
> Regards,
> Timo
>
> On 04.01.21 15:35, Seth Wiesman wrote:
>> This makes sense, I have some questions about method names.
>> What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
>> think that drop is the right word to use for this operation; it implies
>> records are filtered, whereas this operator actually issues updates and
>> retractions. Also, deduplicate is already how we talk about this feature in
>> the docs, so I think it would be easier for users to find.
>> For null handling, I don't know how close we want to stick with SQL
>> conventions but what about making `coalesce` a top-level method? Something
>> like:
>> myTable.coalesce($("a"), 1).as("a")
>> We can require the next method to be an `as`. There is already precedent
>> for this sort of thing, `GroupedTable#aggregate` can only be followed by
>> `select`.
>> Seth
>> On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <[email protected]> wrote:
>>> Hi Dian,
>>>
>>> Big +1 for making the Table API easier to use. Java users and Python users
>>> can both benefit from it. I think it would be better if we add some Python
>>> API examples.
>>>
>>> Best,
>>> Wei
>>>
>>>
>>>> On Jan 4, 2021, at 20:03, Dian Fu <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'd like to start a discussion about introducing a few convenient
>>> operations in Table API from the perspective of ease of use.
>>>>
>>>> Currently some tasks are not easy to express in Table API e.g.
>>> deduplication, topn, etc, or not easy to express when there are hundreds of
>>> columns in a table, e.g. null data handling, etc.
>>>>
>>>> I'd like to propose to introduce a few operations in Table API with the
>>> following purposes:
>>>> - Let Table API users easily leverage the powerful features already
>>> in SQL, e.g. deduplication, topn, etc.
>>>> - Provide some convenient operations, e.g. introducing a series of
>>> operations for null data handling (it may become a problem when there are
>>> hundreds of columns), data sampling and splitting (which is a very common
>>> use case in ML which usually needs to split a table into multiple tables
>>> for training and validation separately).
>>>>
>>>> Please refer to FLIP-155 [1] for more details.
>>>>
>>>> Looking forward to your feedback!
>>>>
>>>> Regards,
>>>> Dian
>>>>
>>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API
>>>
>>>
>