Upon further investigation of the Flink docs on array types, I noticed the data type docs [1] give the following syntax for declaring an array *with a data type*:

> ARRAY<t>
> t ARRAY
>
> The type can be declared using ARRAY<t> where t is the data type of the
> contained elements.
>
> t ARRAY is a synonym for being closer to the SQL standard. For example,
> INT ARRAY is equivalent to ARRAY<INT>.

But when I try to run:

SELECT ARRAY<INT> [1]

or

SELECT INT ARRAY [1]

in Flink SQL, I get this error:

org.apache.flink.table.api.SqlParserException: SQL parse failed. Incorrect
syntax near the keyword 'INT' at line 1, column 8. Was expecting one of:
    "ABS" ...
    "ALL" ...
    "ARRAY" ...
    "AVG" ...
    "CARDINALITY" ...

Am I missing something?
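For what it's worth, the ARRAY<t> syntax does seem to parse wherever a *type* is expected rather than a value, so my reading is that the docs describe type declarations only, not a value constructor. A minimal check (assuming a recent Flink version; t ARRAY should behave the same way):

  -- type positions: these parse
  CREATE TABLE t (a ARRAY<INT>) WITH ('connector' = 'datagen');
  SELECT CAST(NULL AS ARRAY<INT>);

  -- value constructor: parses
  SELECT ARRAY[1, 2];

  -- type syntax used as a constructor: the parse error above
  SELECT ARRAY<INT>[1];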
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/types/#array

On Fri, Oct 28, 2022 at 11:27 AM eric xiao <xiao.eric...@gmail.com> wrote:

> I seem to still get the datatype array<void>:
>
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .getOrCreate()
> )
>
> spark.sql("SELECT ARRAY() + 1")
>
> AnalysisException: cannot resolve '(array() + 1)' due to data type mismatch:
> differing types in '(array() + 1)' (array<void> and int).; line 1 pos 7;
> 'Project [unresolvedalias((array() + 1), None)]
> +- OneRowRelation
>
> Perhaps it has to do with me using PySpark vs Java or Scala Spark?
>
> $ pip3 freeze | grep pyspark
> pyspark==3.3.1
>
> Regarding the coalesce query, I should have been a bit more explicit with
> my example:
>
> SELECT COALESCE(ARRAY(1), CAST(ARRAY() as ARRAY<INT>))
>
> When I was referring to int_column I meant an int array column 😅. It
> makes sense why you cannot COALESCE a non-array data type with an array
> data type - this is also why the SELECT ARRAY() + 1 query fails.
>
> ---
>
>> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
>> integrated also into the parser (at least this is the case for ROW).
>
> I think that is the problem indeed, and people have suggested overriding
> some of the functions inherited from the Calcite SqlArrayValueConstructor
> class to allow for the creation of empty arrays. Those suggestions worked,
> as shown in my exploratory PR: https://github.com/apache/flink/pull/21156.
>
> I think this is one change we will need to make in Flink.
>
>> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
>> should result in ARRAY<NULL> if the type can not be derived by the
>> surrounding call.
>
> Are the SQL functions ARRAY_OF and ROW_OF supported in any other SQL
> dialect? I did a quick search and came up blank. I am still fairly new to
> the community, so I don't know what the stance on trying to stay close to
> ANSI SQL is - having helper functions is always nice though.
>
> I am inclined to say we should let the parser figure out the typing of
> the array based on the surrounding call and, if it cannot, use a default
> type such as ARRAY<NULL>.
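> To make that concrete, the behaviour I am imagining would look something
> like this (hypothetical - neither statement works in Flink today):
>
>   -- element type derived from the surrounding call: ARRAY<INT>
>   SELECT COALESCE(ARRAY[1], ARRAY[]);
>
>   -- nothing to derive the type from: default to ARRAY<NULL>
>   SELECT ARRAY[];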
>
> On Fri, Oct 28, 2022 at 4:07 AM Timo Walther <twal...@apache.org> wrote:
>
>> Actually, the new type inference stack for UDFs is smart enough to solve
>> this issue. It could derive a data type for the array from the
>> surrounding call (expected data type).
>>
>> So this can be supported with the right type inference logic:
>> cast(ARRAY() as int)
>>
>> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
>> integrated also into the parser (at least this is the case for ROW).
>> TBH if I were to design a FLIP for the collection functions, I would
>> actually propose to introduce `ARRAY_OF()`, `ROW_OF()` to have full
>> control over the type inference in our stack. In our stack, this also
>> means that NULL is unknown. Calcite distinguishes between NULL and
>> unknown.
>>
>> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
>> should result in ARRAY<NULL> if the type can not be derived by the
>> surrounding call.
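>>
>> As a rough sketch of what that could look like (hypothetical, since
>> neither ARRAY_OF nor this inference exists yet):
>>
>>   -- expected type flows into the call: resolves to ARRAY<INT>
>>   SELECT CAST(ARRAY_OF() AS ARRAY<INT>);
>>
>>   -- no expected type from the surrounding call: resolves to ARRAY<NULL>
>>   SELECT ARRAY_OF();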
>>
>> Regards,
>> Timo
>>
>> On 28.10.22 03:46, yuxia wrote:
>> > For an empty array, it seems different engines use different data types:
>> >
>> > Hive: string
>> > Spark: string?
>> > Trino: Unknown
>> > BigQuery: Integer
>> >
>> > I have tried with Hive and Spark, but haven't tried with Trino and
>> > BigQuery.
>> >
>> > I'm a little in doubt about Spark's behavior. But from my side, it
>> > seems Spark actually uses the string type, which is different from your
>> > investigation. I tried with the following SQL in spark-cli:
>> > `
>> > select array() + 1
>> > `
>> >
>> > The exception is
>> > `
>> > Error in query: cannot resolve '(array() + 1)' due to data type
>> > mismatch: differing types in '(array() + 1)' (array<string> and int).;
>> > line 1 pos 7;
>> > 'Project [unresolvedalias((array() + 1), None)]
>> > +- OneRowRelation
>> > `
>> >
>> > Seems it's hard to decide which data type Flink should use. I'm
>> > interested in the reason why you would like to use the Integer type.
>> > I haven't checked whether the SQL standard specifies it. But from my
>> > side, I prefer to follow Hive/Spark.
>> >
>> > BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail in
>> > Hive and Spark.
>> >
>> > Best regards,
>> > Yuxia
>> >
>> > ----- Original Message -----
>> > From: "eric xiao" <xiao.eric...@gmail.com>
>> > To: "dev" <dev@flink.apache.org>
>> > Sent: Thursday, October 27, 2022, 9:13:51 PM
>> > Subject: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
>> >
>> > Hi,
>> >
>> > I would like to propose a solution to this JIRA issue. I looked at the
>> > comments and there was some guidance around where in the code we should
>> > make updates to allow for this behaviour. But I believe two questions
>> > still remain open:
>> >
>> >    1. Is this expected behaviour (i.e. users should not be able to
>> >    create an empty array)?
>> >    2. If this is indeed expected behaviour, what should the data type
>> >    be of the empty array?
>> >
>> > I did some digging into other query engines / databases in hopes of
>> > answering these two questions - the results can be found at the end of
>> > this email.
>> >
>> > *Q:* Is this expected behaviour (i.e. users should not be able to
>> > create an empty array)?
>> > *A:* Yes, I would say being able to create an empty array is expected
>> > behaviour and something we should add to the Flink SQL API.
>> >
>> > *Q:* What should the data type be of the empty array?
>> > *A:* This question is a bit harder to answer, and I think it would
>> > require two steps.
>> >
>> > *Step 1: Pick a default data type to initialize the empty array.*
>> >
>> > We can use an "empty data type" such as NULL or VOID.
>> >
>> > *Step 2: Create or reuse type coercion to make using empty arrays
>> > easier.*
>> >
>> > The above should unblock users from creating empty arrays, but consider
>> > what happens if one uses an empty array in a COALESCE operation, i.e.
>> >
>> > SELECT COALESCE(int_column, ARRAY[])
>> >
>> > I believe they will get a query error where the type of int_column
>> > (INTEGER) and the type of the empty array (NULL, VOID) do not match.
>> > Thus a user will need to cast the empty array, i.e.
>> >
>> > SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))
>> >
>> > for the COALESCE query to execute successfully.
>> >
>> > -----
>> >
>> > *Trino*
>> >
>> > > EXPLAIN SELECT ARRAY[]
>> > Fragment 0 [SINGLE]
>> >     Output layout: [expr]
>> >     Output partitioning: SINGLE []
>> >     Output[columnNames = [_col0]]
>> >     │   Layout: [expr:array(unknown)]
>> >     │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
>> >     │   _col0 := expr
>> >     └─ Values[]
>> >            Layout: [expr:array(unknown)]
>> >            Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
>> >            ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Unknown.*
>> >
>> > *Spark*
>> >
>> > > spark.sql("SELECT ARRAY()")
>> > DataFrame[array(): array<void>]
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Void.*
>> >
>> > *BigQuery*
>> >
>> > > SELECT ARRAY[]
>> > Field name    Type       Mode
>> > f0_           INTEGER    REPEATED
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Integer.*
>> >
>> > Best,
>> >
>> > Eric