Hi Dian,

thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet but will do so shortly. In general, we should aim to make the Table API as concise and self-explaining as possible. E.g. `dropna` does not sound obvious to me.

Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more top-level functions, maybe we should also consider introducing more building blocks, e.g. for applying an expression to every column. A more functional approach (e.g. with lambda functions) could solve more use cases.
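
To make the "building block" idea concrete, here is a minimal sketch in plain Java. It is not the Flink Table API: a row is modelled as a simple map, and `ColumnMapSketch` and `mapColumns` are hypothetical names used only for illustration. The point is that one generic "apply this to every column" operation, combined with a lambda, could cover null handling without a dedicated top-level method.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

public class ColumnMapSketch {

    // Hypothetical helper (an assumption, not a Flink method): applies `fn` to
    // every (column name, value) pair of a row and keeps the column order.
    static Map<String, Object> mapColumns(
            Map<String, Object> row, BiFunction<String, Object, Object> fn) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : row.entrySet()) {
            result.put(column.getKey(), fn.apply(column.getKey(), column.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", 42);

        // A single lambda expresses "replace nulls with 1" for all columns at once.
        Map<String, Object> filled =
                mapColumns(row, (name, value) -> value == null ? 1 : value);
        System.out.println(filled);
    }
}
```

The same helper would also accept a lambda that inspects the column name, so per-column defaults or other transformations would not need additional methods.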

Regards,
Timo

On 04.01.21 15:35, Seth Wiesman wrote:
This makes sense; I have some questions about method names.

What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
think that "drop" is the right word for this operation: it implies that
records are filtered, whereas this operator actually issues updates and
retractions. Also, "deduplicate" is already how we talk about this feature in
the docs, so I think it would be easier for users to find.

For null handling, I don't know how closely we want to stick to SQL
conventions, but what about making `coalesce` a top-level method? Something
like:

myTable.coalesce($("a"), 1).as("a")

We can require the next method to be an `as`. There is already precedent
for this sort of thing, `GroupedTable#aggregate` can only be followed by
`select`.

Seth

On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <weizhong0...@gmail.com> wrote:

Hi Dian,

Big +1 for making the Table API easier to use. Java users and Python users
can both benefit from it. I think it would be better if we add some Python
API examples.

Best,
Wei


On Jan 4, 2021, at 20:03, Dian Fu <dian0511...@gmail.com> wrote:

Hi all,

I'd like to start a discussion about introducing a few convenient
operations in Table API from the perspective of ease of use.

Currently some tasks are not easy to express in the Table API, e.g.
deduplication and top-N, or become hard to express when a table has hundreds
of columns, e.g. null data handling.

I'd like to propose to introduce a few operations in Table API with the
following purposes:
- Let Table API users easily leverage the powerful features already
in SQL, e.g. deduplication, top-N, etc.
- Provide some convenient operations, e.g. introducing a series of
operations for null data handling (it may become a problem when there are
hundreds of columns), data sampling and splitting (which is a very common
use case in ML which usually needs to split a table into multiple tables
for training and validation separately).

Please refer to FLIP-155 [1] for more details.

Looking forward to your feedback!

Regards,
Dian

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API



