Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Timo Walther Mon, 04 Jan 2021 06:59:29 -0800

Hi Dian,

thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet but will do so shortly. In general, we should aim to make the Table API as concise and self-explaining as possible. E.g. `dropna` does not sound obvious to me.

Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more top-level functions, maybe we should also consider introducing more building blocks, e.g. for applying an expression to every column. A more functional approach (e.g. with lambda functions) could solve more use cases.
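
To make the "building block" idea concrete, here is a minimal plain-Java sketch. It is a toy model only: a row is a map from column name to value, and `mapColumns` and `coalesce` are hypothetical helpers invented for illustration, not existing or proposed Table API methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Toy model (assumption): a "row" is a column-name -> value map, not a Flink Table.
final class ColumnMapSketch {

    // Applies fn(columnName, value) to every column and returns a new row,
    // preserving the original column order.
    static Map<String, Object> mapColumns(
            Map<String, Object> row, BiFunction<String, Object, Object> fn) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            out.put(e.getKey(), fn.apply(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Single-value coalesce: returns value unless it is null, else fallback.
    static Object coalesce(Object value, Object fallback) {
        return value != null ? value : fallback;
    }
}
```

With such a building block, null filling over hundreds of columns becomes one lambda, e.g. `mapColumns(row, (name, v) -> coalesce(v, 1))`, and other per-column operations reuse the same entry point.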


Regards,
Timo

On 04.01.21 15:35, Seth Wiesman wrote:

This makes sense, I have some questions about method names.

What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
think that drop is the right word to use for this operation; it implies
records are filtered, whereas this operator actually issues updates and
retractions. Also, deduplicate is already how we talk about this feature in
the docs, so I think it would be easier for users to find.
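
As a rough illustration of the difference between filtering and issuing updates, here is a toy keep-last deduplication in plain Java. It is a sketch under assumptions, not Flink's operator: the class and method names are made up, values are assumed non-null, and the `+I` / `-U` / `+U` prefixes only borrow Flink's changelog notation for insert, update-before, and update-after.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (assumption): keeps the latest value per key and reports
// the changelog entries a downstream consumer would see.
final class DeduplicateSketch {
    private final Map<String, String> lastByKey = new HashMap<>();

    // Processes one record and returns the changes it causes.
    List<String> process(String key, String value) {
        List<String> changes = new ArrayList<>();
        String previous = lastByKey.put(key, value);
        if (previous == null) {
            // First record for this key: plain insert.
            changes.add("+I(" + key + "," + value + ")");
        } else {
            // Later record: retract the old result, then emit the new one.
            changes.add("-U(" + key + "," + previous + ")");
            changes.add("+U(" + key + "," + value + ")");
        }
        return changes;
    }
}
```

Nothing is silently dropped here: a second record for the same key produces two output changes, which is why "drop" undersells what the operation does.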

For null handling, I don't know how close we want to stick with SQL
conventions but what about making `coalesce` a top-level method? Something
like:

myTable.coalesce($("a"), 1).as("a")

We can require the next method to be an `as`. There is already precedent
for this sort of thing, `GroupedTable#aggregate` can only be followed by
`select`.
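
One way to enforce "the next method must be `as`" is to return an intermediate type that exposes only that method. The following plain-Java sketch shows the pattern; `Table`, `CoalescedTable`, and the string-based plan are hypothetical stand-ins for illustration, not the real Flink classes.

```java
// Toy model (assumption): the "plan" is just a string describing the operations.
final class StagedApiSketch {

    static final class Table {
        final String plan;

        Table(String plan) {
            this.plan = plan;
        }

        // Returns an intermediate type, so the caller cannot continue
        // with arbitrary Table methods before naming the result.
        CoalescedTable coalesce(String column, Object fallback) {
            return new CoalescedTable(plan, column, fallback);
        }
    }

    static final class CoalescedTable {
        private final String plan;
        private final String column;
        private final Object fallback;

        CoalescedTable(String plan, String column, Object fallback) {
            this.plan = plan;
            this.column = column;
            this.fallback = fallback;
        }

        // The only available follow-up: name the result and get a Table back.
        Table as(String alias) {
            return new Table(plan + " -> coalesce(" + column + ", " + fallback + ") AS " + alias);
        }
    }
}
```

Because `CoalescedTable` has no other public operations, a call chain that skips `as` fails at compile time rather than at runtime.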

Seth

On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <weizhong0...@gmail.com> wrote:

Hi Dian,

Big +1 for making the Table API easier to use. Java users and Python users
can both benefit from it. I think it would be better if we add some Python
API examples.

Best,
Wei

On Jan 4, 2021, at 20:03, Dian Fu <dian0511...@gmail.com> wrote:

Hi all,

I'd like to start a discussion about introducing a few convenient
operations in Table API from the perspective of ease of use.

Currently some tasks are not easy to express in Table API, e.g.
deduplication, top-N, etc., or not easy to express when there are hundreds
of columns in a table, e.g. null data handling.

I'd like to propose introducing a few operations in Table API with the
following purposes:

- Let Table API users easily leverage the powerful features already
in SQL, e.g. deduplication, top-N, etc.
- Provide some convenient operations, e.g. a series of
operations for null data handling (it may become a problem when there are
hundreds of columns), data sampling and splitting (a very common
use case in ML, which usually needs to split a table into multiple tables
for training and validation separately).
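
For the splitting use case, a minimal plain-Java sketch of a seeded random split into training and validation sets could look like this. It is a toy model over an in-memory list; `RandomSplitSketch.split` and its parameters are hypothetical names chosen for illustration and do not reflect the API proposed in the FLIP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy model (assumption): rows live in memory; a real table would be split lazily.
final class RandomSplitSketch {

    // Assigns each row to the training set with probability trainFraction,
    // otherwise to the validation set. Returns [train, validation].
    static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        Random random = new Random(seed); // fixed seed makes the split reproducible
        List<T> train = new ArrayList<>();
        List<T> validation = new ArrayList<>();
        for (T row : rows) {
            if (random.nextDouble() < trainFraction) {
                train.add(row);
            } else {
                validation.add(row);
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(validation);
        return parts;
    }
}
```

Every row lands in exactly one part, and the part sizes only approximate the requested fraction because each row is assigned independently.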


Please refer to FLIP-155 [1] for more details.

Looking forward to your feedback!

Regards,
Dian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API
