Ah, thank you Yuxia! Following up on the conversation...

Trying to return an empty null array ARRAY<NULL> proved a challenge. Not impossible, but I think it would require further discussion as to whether we want to expose values with the data type NULL in the output - currently there are a lot of guardrails in Flink ([1] and [2], to name a few) to make sure we do not do that.
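For illustration only, here is a toy Python sketch of the guardrail idea being discussed. All names are hypothetical and this is not Flink's actual FlinkTypeFactory / BinaryArrayData code - it only models element-type derivation that refuses to expose the NULL type unless a non-null fallback (the stop-gap) is supplied:

```python
# Toy model of empty-array type derivation; hypothetical names only,
# illustrating the guardrail discussion, not Flink's real internals.

def derive_array_type(element_types, fallback=None):
    """Return the type of ARRAY[...], mimicking a guardrail that
    refuses to expose the NULL data type in the output."""
    non_null = {t for t in element_types if t != "NULL"}
    if len(non_null) > 1:
        raise TypeError(f"incompatible element types: {sorted(non_null)}")
    if non_null:
        return f"ARRAY<{non_null.pop()}>"
    # Empty array (or all-NULL elements): the guardrail rejects
    # ARRAY<NULL>; the stop-gap substitutes a default element type.
    if fallback is None:
        raise TypeError("cannot expose the NULL data type in the output")
    return f"ARRAY<{fallback}>"

print(derive_array_type(["INT", "INT"]))         # ARRAY<INT>
print(derive_array_type([], fallback="STRING"))  # ARRAY<STRING> (stop-gap)
```

Under this sketch, `derive_array_type([])` with no fallback raises, which mirrors why a bare `ARRAY[]` currently fails.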
Going back to my previous suggestions: would it be wise to introduce a
stop-gap solution for the time being, where we initialize the data type of
an empty array to some non-null data type, e.g. STRING or INTEGER?

[1]
https://github.com/apache/flink/blob/master/flink-table/flink-table-planner/src/main/scala/org/apache/flink/table/planner/calcite/FlinkTypeFactory.scala#L350
[2]
https://github.com/apache/flink/blob/master/flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/BinaryArrayData.java#L110-L114

On Fri, Oct 28, 2022 at 10:17 PM yuxia <luoyu...@alumni.sjtu.edu.cn> wrote:

> `ARRAY<INT>` / `INT ARRAY` is for declaring a data type in DDL, like
> `create table t1(a INT ARRAY)`.
>
> Best regards,
> Yuxia
>
> ----- Original Message -----
> From: "eric xiao" <xiao.eric...@gmail.com>
> To: "dev" <dev@flink.apache.org>
> Sent: Saturday, October 29, 2022 7:50:22 AM
> Subject: Re: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
>
> Upon further investigation of the Flink docs on array types, I noticed
> that the data type docs [1] give the following syntax for declaring an
> array *with a data type*:
>
> > ARRAY<t>
> > t ARRAY
> >
> > The type can be declared using ARRAY<t> where t is the data type of the
> > contained elements.
> > t ARRAY is a synonym for being closer to the SQL standard. For example,
> > INT ARRAY is equivalent to ARRAY<INT>.
>
> But when I try to run:
>
> SELECT ARRAY<INT> [1]
>
> or
>
> SELECT INT ARRAY [1]
>
> in Flink SQL, I get this error:
>
> org.apache.flink.table.api.SqlParserException: SQL parse failed.
> Incorrect syntax near the keyword 'INT' at line 1, column 8.
> Was expecting one of:
>     "ABS" ...
>     "ALL" ...
>     "ARRAY" ...
>     "AVG" ...
>     "CARDINALITY" ...
>
> Am I missing something?
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/types/#array
>
> On Fri, Oct 28, 2022 at 11:27 AM eric xiao <xiao.eric...@gmail.com> wrote:
>
> > I seem to still get the data type array<void>:
> >
> > from pyspark.sql import SparkSession
> > spark = (
> >     SparkSession.builder
> >     .getOrCreate()
> > )
> > spark.sql("SELECT ARRAY() + 1")
> >
> > AnalysisException: cannot resolve '(array() + 1)' due to data type
> > mismatch: differing types in '(array() + 1)' (array<void> and int).;
> > line 1 pos 7;
> > 'Project [unresolvedalias((array() + 1), None)]
> > +- OneRowRelation
> >
> > Perhaps it has to do with me using PySpark vs. Java or Scala Spark?
> >
> > $ pip3 freeze | grep pyspark
> > pyspark==3.3.1
> >
> > Regarding the coalesce query, I should have been a bit more explicit
> > with my example:
> >
> > SELECT COALESCE(ARRAY(1), CAST(ARRAY() AS ARRAY<INT>))
> >
> > When I was referring to int_column, I meant an int array column 😅. It
> > makes sense why you cannot COALESCE a non-array data type with an array
> > data type - this is also why the SELECT ARRAY() + 1 query fails.
> >
> > ---
> >
> >> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
> >> integrated also into the parser (at least this is the case for ROW).
> >
> > I think that is indeed the problem, and people have suggested overriding
> > some of the functions inherited from the Calcite SqlArrayValueConstructor
> > class to allow for the creation of empty arrays. Those suggestions
> > worked, as shown in my exploratory PR:
> > https://github.com/apache/flink/pull/21156.
> >
> > I think this is one change we will need to make in Flink.
> >
> >> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> >> should result in ARRAY<NULL> if the type can not be derived by the
> >> surrounding call.
> >
> > Are the SQL functions ARRAY_OF and ROW_OF supported in any other SQL
> > dialect? I did a quick search and came up blank.
> > I am still fairly new to the community, so I don't know what the stance
> > on trying to stay close to ANSI SQL is - having helper functions is
> > always nice, though.
> >
> > I am inclined to say we should let the parser figure out the typing of
> > the array based on the surrounding call and, if it cannot, use a default
> > type such as ARRAY<NULL>.
> >
> > On Fri, Oct 28, 2022 at 4:07 AM Timo Walther <twal...@apache.org> wrote:
> >
> >> Actually, the new type inference stack for UDFs is smart enough to
> >> solve this issue. It could derive a data type for the array from the
> >> surrounding call (expected data type).
> >>
> >> So this can be supported with the right type inference logic:
> >> cast(ARRAY() as int)
> >>
> >> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
> >> integrated also into the parser (at least this is the case for ROW).
> >> TBH, if I were to design a FLIP for the collection functions, I would
> >> actually propose to introduce `ARRAY_OF()`, `ROW_OF()` to have full
> >> control over the type inference in our stack. In our stack, this also
> >> means that NULL is unknown. Calcite distinguishes between NULL and
> >> unknown.
> >>
> >> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> >> should result in ARRAY<NULL> if the type cannot be derived from the
> >> surrounding call.
> >>
> >> Regards,
> >> Timo
> >>
> >> On 28.10.22 03:46, yuxia wrote:
> >> > For an empty array, it seems different engines use different data
> >> > types:
> >> > Hive: string
> >> > Spark: string?
> >> > Trino: unknown
> >> > BigQuery: integer
> >> >
> >> > I have tried with Hive and Spark, but haven't tried with Trino and
> >> > BigQuery.
> >> >
> >> > I have a little doubt about Spark's behavior. But from my side, it
> >> > seems Spark actually uses the string type, which is different from
> >> > your investigation.
> >> > I tried with the following SQL in spark-cli:
> >> >
> >> > select array() + 1
> >> >
> >> > The exception is:
> >> >
> >> > Error in query: cannot resolve '(array() + 1)' due to data type
> >> > mismatch: differing types in '(array() + 1)' (array<string> and
> >> > int).; line 1 pos 7;
> >> > 'Project [unresolvedalias((array() + 1), None)]
> >> > +- OneRowRelation
> >> >
> >> > It seems hard to decide which data type Flink should use. I'm
> >> > interested in the reason why you would like to use the integer type.
> >> > I haven't checked whether the SQL standard specifies it. But from my
> >> > side, I prefer to follow Hive/Spark.
> >> >
> >> > BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail
> >> > in Hive and Spark.
> >> >
> >> > Best regards,
> >> > Yuxia
> >> >
> >> > ----- Original Message -----
> >> > From: "eric xiao" <xiao.eric...@gmail.com>
> >> > To: "dev" <dev@flink.apache.org>
> >> > Sent: Thursday, October 27, 2022 9:13:51 PM
> >> > Subject: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
> >> >
> >> > Hi,
> >> >
> >> > I would like to propose a solution to this JIRA issue. I looked at
> >> > the comments, and there was some guidance around where in the code we
> >> > should update to allow for this behaviour. But I believe two
> >> > questions remain open:
> >> >
> >> > 1. Is this expected behaviour (i.e. users should not be able to
> >> >    create an empty array)?
> >> > 2. If this is indeed expected behaviour, what should the data type
> >> >    of the empty array be?
> >> >
> >> > I did some digging into other query engines / databases in hopes of
> >> > answering these two questions - that can be found at the end of this
> >> > thread.
> >> >
> >> > *Q:* Is this expected behaviour (i.e. users should not be able to
> >> > create an empty array)?
> >> > *A:* Yes, I would say this is expected behaviour and something we
> >> > should add to the Flink SQL API.
> >> > *Q:* What should the data type of the empty array be?
> >> > *A:* This question is a bit harder to answer, and I think it would
> >> > require two steps.
> >> >
> >> > *Step 1: Pick a default data type to initialize the empty array.*
> >> >
> >> > We can use an "empty data type" such as NULL or VOID.
> >> >
> >> > *Step 2: Create or reuse type coercion to make using empty arrays
> >> > easier.*
> >> >
> >> > The above should unblock users from creating empty arrays, but
> >> > consider using an empty array in a COALESCE operation, e.g.:
> >> >
> >> > SELECT COALESCE(int_column, ARRAY[])
> >> >
> >> > I believe they will get a query error because the type of int_column
> >> > (INTEGER) and of the empty array (NULL, VOID) do not match. Thus a
> >> > user will need to cast the empty array, e.g.:
> >> >
> >> > SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))
> >> >
> >> > for the COALESCE query to execute successfully.
> >> >
> >> > -----
> >> > *Trino*
> >> >> EXPLAIN SELECT ARRAY[]
> >> > Fragment 0 [SINGLE]
> >> >     Output layout: [expr]
> >> >     Output partitioning: SINGLE []
> >> >     Output[columnNames = [_col0]]
> >> >     │   Layout: [expr:array(unknown)]
> >> >     │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >> >     │   _col0 := expr
> >> >     └─ Values[]
> >> >            Layout: [expr:array(unknown)]
> >> >            Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >> >            ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type? *Unknown.*
> >> >
> >> > *Spark*
> >> >> sc.sql.sql("SELECT ARRAY[]").explain()
> >> > DataFrame[array(): array<void>]
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type? *Void.*
> >> >
> >> > *BigQuery*
> >> >> SELECT ARRAY[]
> >> > Field name  Type     Mode
> >> > f0_         INTEGER  REPEATED
> >> >
> >> > Expected behaviour? *Yes.*
> >> > Array data type?
> >> > *Integer.*
> >> >
> >> > Best,
> >> >
> >> > Eric
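[Editor's note] The type-inference direction discussed in the thread - let the empty array take its element type from the expected type of the surrounding call, and fall back to ARRAY<NULL> otherwise - can be summarized as a minimal, hypothetical Python sketch. None of these names come from Flink's actual TypeInference stack; they only model the idea:

```python
# Hypothetical sketch of "derive the empty array's type from the
# surrounding call"; not Flink's actual type inference API.

def infer_empty_array_type(expected=None):
    """Type of ARRAY[]: adopt the caller's expected type if present,
    otherwise fall back to the unknown/NULL element type."""
    if expected is not None and expected.startswith("ARRAY<"):
        return expected          # e.g. COALESCE(arr_col, ARRAY[])
    return "ARRAY<NULL>"         # no surrounding context

def coalesce_type(*arg_types):
    """Toy COALESCE type check: all argument types must agree."""
    distinct = set(arg_types)
    if len(distinct) != 1:
        raise TypeError(f"differing types in COALESCE: {sorted(distinct)}")
    return distinct.pop()

# COALESCE(int_array_col, ARRAY[]): the empty array adopts ARRAY<INT>.
expected = "ARRAY<INT>"
print(coalesce_type(expected, infer_empty_array_type(expected)))  # ARRAY<INT>

# A bare SELECT ARRAY[] has no expected type:
print(infer_empty_array_type())  # ARRAY<NULL>
```

In this toy model, `coalesce_type("ARRAY<INT>", "ARRAY<NULL>")` fails just like the Spark `array() + 1` examples above, which is exactly the case the expected-type derivation (or an explicit CAST) is meant to avoid.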