Ah thank you Yuxia!

Following up on the conversation... Returning an empty null array,
ARRAY<NULL>, proved a challenge. Not impossible, but I think it would
require further discussion as to whether we want to expose values with the
data type NULL in the output - currently there are a number of guardrails in
Flink ([1] and [2], to name a few) to make sure we do not do that.

Going back to my previous suggestions, would it be wise to introduce a
stop-gap solution for the time being, where we initialize the data type of
an empty array to some non-null data type, e.g. STRING or INT?
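
For illustration, if we defaulted to INT, the stop-gap would behave roughly
like this (a sketch, not a committed design; the column name is a
placeholder):

SELECT ARRAY[];
-- parses and returns an empty ARRAY<INT>

SELECT COALESCE(string_array_column, ARRAY[]);
-- still needs an explicit CAST(ARRAY[] AS ARRAY<STRING>), since the
-- default element type is INT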

[1]
https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/calcite/FlinkTypeFactory.scala#L350
[2]
https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/BinaryArrayData.java#L110-L114

On Fri, Oct 28, 2022 at 10:17 PM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote:

> `ARRAY<INT>`/ `INT ARRAY` is for declaring data type in DDL like `create
> table t1(a INT ARRAY)`.
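>
> A quick sketch of the difference:
>
> -- DDL: ARRAY<INT> / INT ARRAY declare a column's data type
> CREATE TABLE t1 (a INT ARRAY);
> -- query: the ARRAY[...] value constructor takes elements, not a type
> SELECT ARRAY[1, 2, 3];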
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "eric xiao" <xiao.eric...@gmail.com>
> To: "dev" <dev@flink.apache.org>
> Sent: Saturday, October 29, 2022, 7:50:22 AM
> 主题: Re: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
>
> Upon further investigation of the Flink docs on array types, I noticed
> that the data type docs [1] give the following syntax for declaring an
> array *with a data type*:
>
> > ARRAY<t>
> > t ARRAY
> >
> > The type can be declared using ARRAY<t> where t is the data type of the
> > contained elements.
> > t ARRAY is a synonym for being closer to the SQL standard. For example,
> > INT ARRAY is equivalent to ARRAY<INT>.
>
>
> But when I try to run:
> SELECT ARRAY<INT> [1]
> or
> SELECT INT ARRAY [1]
>
> In Flink SQL I get this error:
> org.apache.flink.table.api.SqlParserException: SQL parse failed. Incorrect
> syntax near the keyword 'INT' at line 1, column 8.
> Was expecting one of:
>     "ABS" ...
>     "ALL" ...
>     "ARRAY" ...
>     "AVG" ...
>     "CARDINALITY" ...
>
> Am I missing something?
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/types/#array
>
>
>
> On Fri, Oct 28, 2022 at 11:27 AM eric xiao <xiao.eric...@gmail.com> wrote:
>
> > I seem to still get the datatype array<void>:
> > from pyspark.sql import SparkSession
> > spark = (
> >     SparkSession.builder
> >     .getOrCreate()
> > )
> > spark.sql("SELECT ARRAY() + 1")
> >
> > AnalysisException: cannot resolve '(array() + 1)' due to data type
> > mismatch: differing types in '(array() + 1)' (array<void> and int).;
> > line 1 pos 7;
> > 'Project [unresolvedalias((array() + 1), None)]
> > +- OneRowRelation
> >
> > Perhaps it has to do with me using PySpark vs Java or Scala Spark?
> >
> > $ pip3 freeze | grep pyspark
> > pyspark==3.3.1
> >
> >
> > Regarding the coalesce query, I should have been a bit more explicit with
> > my example:
> > SELECT COALESCE(ARRAY(1), CAST(ARRAY() as ARRAY<INT>))
> >
> > When I was referring to int_column I meant an int array column 😅. It
> > makes sense that you cannot COALESCE a non-array data type with an array
> > data type - that is also why the SELECT ARRAY() + 1 query fails.
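> >
> > For example (int_array_column / int_column are placeholder names, with
> > types ARRAY<INT> and INT respectively):
> >
> > -- works: both operands are ARRAY<INT>
> > SELECT COALESCE(int_array_column, CAST(ARRAY() AS ARRAY<INT>));
> > -- fails: INT and ARRAY<...> do not match
> > SELECT COALESCE(int_column, ARRAY());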
> >
> > ---
> >
> >> Unfortunately, ARRAY is fully managed by Calcite and may also be deeply
> >> integrated into the parser (at least this is the case for ROW).
> >>
> >
> > I think that is indeed the problem, and people have suggested overriding
> > some of the functions inherited from Calcite's SqlArrayValueConstructor
> > class to allow for the creation of empty arrays. Those suggestions
> > worked, as shown in my exploratory PR:
> > https://github.com/apache/flink/pull/21156.
> >
> > I think this is one change we will need to make in Flink.
> >
> >> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> >> should result in ARRAY<NULL> if the type can not be derived by the
> >> surrounding call.
> >
> > Are the SQL functions ARRAY_OF and ROW_OF supported in any other SQL
> > dialect? I did a quick search and came up blank. I am still fairly new
> > to the community, so I don't know what the stance is on trying to stay
> > close to ANSI SQL - having helper functions is always nice, though.
> >
> > I am inclined to think we should let the parser figure out the typing
> > of the array based on the surrounding call and, if it cannot, use a
> > default type such as ARRAY<NULL>.
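> >
> > Hypothetically, that would look something like:
> >
> > -- element type derived from the surrounding call: ARRAY[] becomes ARRAY<INT>
> > SELECT COALESCE(int_array_column, ARRAY[]);
> > -- no surrounding context to derive from: falls back to ARRAY<NULL>
> > SELECT ARRAY[];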
> >
> > On Fri, Oct 28, 2022 at 4:07 AM Timo Walther <twal...@apache.org> wrote:
> >
> >> Actually, the new type inference stack for UDFs is smart enough to solve
> >> this issue. It could derive a data type for the array from the
> >> surrounding call (expected data type).
> >>
> >> So this can be supported with the right type inference logic:
> >> cast(ARRAY() as int)
> >>
> >> Unfortunately, ARRAY is fully managed by Calcite and may also be deeply
> >> integrated into the parser (at least this is the case for ROW).
> >> TBH if I were to design a FLIP for the collection functions, I would
> >> actually propose to introduce `ARRAY_OF()`, `ROW_OF()` to have full
> >> control over the type inference in our stack. In our stack, this also
> >> means that NULL is unknown. Calcite distinguishes between NULL and
> >> unknown.
> >>
> >> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> >> should result in ARRAY<NULL> if the type can not be derived by the
> >> surrounding call.
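> >>
> >> As a sketch (ARRAY_OF here is a hypothetical name, not an existing
> >> function):
> >>
> >> SELECT ARRAY_OF(1, 2);  -- type inference fully owned by our stack
> >> SELECT ARRAY();         -- no surrounding call to derive from: ARRAY<NULL>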
> >>
> >> Regards,
> >> Timo
> >>
> >> On 28.10.22 03:46, yuxia wrote:
> >> > For an empty array, it seems different engines use different data types:
> >> > Hive: string
> >> > Spark: string ?
> >> > Trino:  Unknown
> >> > BigQuery: Integer
> >> >
> >> > I have tried with Hive and Spark, but haven't tried with Trino and
> >> > BigQuery.
> >> >
> >> > I'm a little doubtful about Spark's behavior. But from my side, it
> >> > seems Spark actually uses the string type, which is different from
> >> > your investigation.
> >> > I tried with the following SQL in spark-cli:
> >> > `
> >> > select array() + 1
> >> > `
> >> >
> >> > The exception is
> >> > `
> >> > Error in query: cannot resolve '(array() + 1)' due to data type
> >> > mismatch: differing types in '(array() + 1)' (array<string> and
> >> > int).; line 1 pos 7;
> >> > 'Project [unresolvedalias((array() + 1), None)]
> >> > +- OneRowRelation
> >> > `
> >> >
> >> >
> >> > Seems it's hard to decide which data type Flink should use. I'm
> >> > interested in the reason why you would like to use the Integer type.
> >> > I haven't checked whether the SQL standard specifies it, but from my
> >> > side, I prefer to follow Hive/Spark.
> >> >
> >> > BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail
> >> > in Hive and Spark.
> >> >
> >> >
> >> > Best regards,
> >> > Yuxia
> >> >
> >> > ----- Original Message -----
> >> > From: "eric xiao" <xiao.eric...@gmail.com>
> >> > To: "dev" <dev@flink.apache.org>
> >> > Sent: Thursday, October 27, 2022, 9:13:51 PM
> >> > 主题: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
> >> >
> >> > Hi,
> >> >
> >> > I would like to propose a solution to this JIRA issue. I looked at
> >> > the comments and there was some guidance around where in the code we
> >> > should make updates to allow for this behaviour. But I believe two
> >> > questions still remain open:
> >> >
> >> >     1. Is this expected behaviour (i.e. should users be able to
> >> >     create an empty array)?
> >> >     2. If this is indeed expected behaviour, what should the data
> >> >     type of the empty array be?
> >> >
> >> > I did some digging into other query engines / databases in hopes of
> >> > answering these two questions - my findings can be found at the end
> >> > of this message.
> >> >
> >> > *Q: *Is this expected behaviour (i.e. should users be able to create
> >> > an empty array)?
> >> > *A: *Yes, I would say so, and it is something we should add to the
> >> > Flink SQL API.
> >> >
> >> > *Q: *What should the data type of the empty array be?
> >> > *A: *This question is a bit harder to answer, and I think it would
> >> > require two steps.
> >> >
> >> > *Step 1: Pick a default data type to initialize the empty array.*
> >> >
> >> > We can use an "empty data type" such as NULL or VOID.
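> >> >
> >> > For example (hypothetical output, assuming we pick NULL as the
> >> > default):
> >> >
> >> > SELECT ARRAY[];
> >> > -- would return [] typed as ARRAY<NULL>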
> >> >
> >> > *Step 2: Create or reuse type coercion to make using empty arrays
> >> easier.*
> >> >
> >> > The above should unblock users from creating empty arrays, but
> >> > consider what happens if one uses an empty array in a COALESCE
> >> > operation:
> >> >
> >> > i.e. SELECT COALESCE(int_column, ARRAY[])
> >> >
> >> > I believe they will get a query error because the type of int_column
> >> > (INTEGER) and that of the empty array (NULL, VOID) do not match. Thus
> >> > a user will need to cast the empty array:
> >> >
> >> > i.e. SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))
> >> >
> >> > for the COALESCE query to execute successfully.
> >> >
> >> > -----
> >> > *Trino*
> >> >> EXPLAIN SELECT ARRAY[]
> >> > Fragment 0 [SINGLE]
> >> >      Output layout: [expr]
> >> >      Output partitioning: SINGLE []
> >> >      Output[columnNames = [_col0]]
> >> >      │   Layout: [expr:array(unknown)]
> >> >      │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >> >      │   _col0 := expr
> >> >      └─ Values[]
> >> >             Layout: [expr:array(unknown)]
> >> >             Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >> >
> >> >
> >> > ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type? *Unknown.*
> >> >
> >> > *Spark*
> >> >> sc.sql.sql("SELECT ARRAY[]").explain()
> >> >   DataFrame[array(): array<void>]
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type? *Void.*
> >> >
> >> > *BigQuery*
> >> >> SELECT ARRAY[]
> >> > Field name   Type      Mode
> >> > f0_          INTEGER   REPEATED
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type? *Integer.*
> >> >
> >> > Best,
> >> >
> >> > Eric
> >> >
> >>
> >>
>
