Upon further investigation of the Flink docs on array types, I noticed the data type docs [1] give the following syntax for declaring an array *with a data type*:

> ARRAY<t>
> t ARRAY
>
> The type can be declared using ARRAY<t> where t is the data type of the
> contained elements.
>
> t ARRAY is a synonym for being closer to the SQL standard. For example,
> INT ARRAY is equivalent to ARRAY<INT>.

But when I try to run:

SELECT ARRAY<INT> [1]

or

SELECT INT ARRAY [1]

in Flink SQL, I get this error:

org.apache.flink.table.api.SqlParserException: SQL parse failed. Incorrect
syntax near the keyword 'INT' at line 1, column 8. Was expecting one of:
    "ABS" ...
    "ALL" ...
    "ARRAY" ...
    "AVG" ...
    "CARDINALITY" ...

Am I missing something?
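For what it's worth, the ARRAY<t> syntax does seem to parse wherever a *type* is expected rather than a value, so my reading is that the docs describe type declarations only, not a value constructor. A minimal check (assuming a recent Flink version; t ARRAY should behave the same way):

  -- type positions: these parse
  CREATE TABLE t (a ARRAY<INT>) WITH ('connector' = 'datagen');
  SELECT CAST(NULL AS ARRAY<INT>);

  -- value constructor: parses
  SELECT ARRAY[1, 2];

  -- type syntax used as a constructor: the parse error above
  SELECT ARRAY<INT>[1];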
[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/types/#array

On Fri, Oct 28, 2022 at 11:27 AM eric xiao <xiao.eric...@gmail.com> wrote:

> I seem to still get the datatype array<void>:
>
> from pyspark.sql import SparkSession
>
> spark = (
>     SparkSession.builder
>     .getOrCreate()
> )
>
> spark.sql("SELECT ARRAY() + 1")
>
> AnalysisException: cannot resolve '(array() + 1)' due to data type mismatch:
> differing types in '(array() + 1)' (array<void> and int).; line 1 pos 7;
> 'Project [unresolvedalias((array() + 1), None)]
> +- OneRowRelation
>
> Perhaps it has to do with me using PySpark vs Java or Scala Spark?
>
> $ pip3 freeze | grep pyspark
> pyspark==3.3.1
>
> Regarding the coalesce query, I should have been a bit more explicit with
> my example:
>
> SELECT COALESCE(ARRAY(1), CAST(ARRAY() as ARRAY<INT>))
>
> When I was referring to int_column I meant an int array column 😅. It
> makes sense why you cannot COALESCE a non-array data type with an array
> data type - this is also why the SELECT ARRAY() + 1 query fails.
>
> ---
>
>> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
>> integrated also into the parser (at least this is the case for ROW).
>
> I think that is the problem indeed, and people have suggested overriding
> some of the functions inherited from the Calcite SqlArrayValueConstructor
> class to allow for the creation of empty arrays. Those suggestions worked,
> as shown in my exploratory PR: https://github.com/apache/flink/pull/21156.
>
> I think this is one change we will need to make in Flink.
>
>> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
>> should result in ARRAY<NULL> if the type can not be derived by the
>> surrounding call.
>
> Are the SQL functions ARRAY_OF and ROW_OF supported in any other SQL
> dialect? I did a quick search and came up blank. I am still fairly new to
> the community, so I don't know what the stance on trying to stay close to
> ANSI SQL is - having helper functions is always nice though.
>
> I am inclined to say we should let the parser figure out the typing of
> the array based on the surrounding call and, if it cannot, use a default
> type such as ARRAY<NULL>.
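> To make that concrete, the behaviour I am imagining would look something
> like this (hypothetical - neither statement works in Flink today):
>
>   -- element type derived from the surrounding call: ARRAY<INT>
>   SELECT COALESCE(ARRAY[1], ARRAY[]);
>
>   -- nothing to derive the type from: default to ARRAY<NULL>
>   SELECT ARRAY[];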
>
> On Fri, Oct 28, 2022 at 4:07 AM Timo Walther <twal...@apache.org> wrote:
>
>> Actually, the new type inference stack for UDFs is smart enough to solve
>> this issue. It could derive a data type for the array from the
>> surrounding call (expected data type).
>>
>> So this can be supported with the right type inference logic:
>> cast(ARRAY() as int)
>>
>> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
>> integrated also into the parser (at least this is the case for ROW).
>> TBH if I were to design a FLIP for the collection functions, I would
>> actually propose to introduce `ARRAY_OF()`, `ROW_OF()` to have full
>> control over the type inference in our stack. In our stack, this also
>> means that NULL is unknown. Calcite distinguishes between NULL and
>> unknown.
>>
>> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
>> should result in ARRAY<NULL> if the type can not be derived by the
>> surrounding call.
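>>
>> As a rough sketch of what that could look like (hypothetical, since
>> neither ARRAY_OF nor this inference exists yet):
>>
>>   -- expected type flows into the call: resolves to ARRAY<INT>
>>   SELECT CAST(ARRAY_OF() AS ARRAY<INT>);
>>
>>   -- no expected type from the surrounding call: resolves to ARRAY<NULL>
>>   SELECT ARRAY_OF();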
>>
>> Regards,
>> Timo
>>
>> On 28.10.22 03:46, yuxia wrote:
>> > For an empty array, it seems different engines use different data types:
>> >
>> > Hive: string
>> > Spark: string?
>> > Trino: Unknown
>> > BigQuery: Integer
>> >
>> > I have tried with Hive and Spark, but haven't tried with Trino and
>> > BigQuery.
>> >
>> > I'm a little in doubt about Spark's behavior. But from my side, it
>> > seems Spark actually uses the string type, which is different from your
>> > investigation. I tried with the following SQL in spark-cli:
>> > `
>> > select array() + 1
>> > `
>> >
>> > The exception is
>> > `
>> > Error in query: cannot resolve '(array() + 1)' due to data type
>> > mismatch: differing types in '(array() + 1)' (array<string> and int).;
>> > line 1 pos 7;
>> > 'Project [unresolvedalias((array() + 1), None)]
>> > +- OneRowRelation
>> > `
>> >
>> > Seems it's hard to decide which data type Flink should use. I'm
>> > interested in the reason why you would like to use the Integer type.
>> > I haven't checked whether the SQL standard specifies it. But from my
>> > side, I prefer to follow Hive/Spark.
>> >
>> > BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail in
>> > Hive and Spark.
>> >
>> > Best regards,
>> > Yuxia
>> >
>> > ----- Original Message -----
>> > From: "eric xiao" <xiao.eric...@gmail.com>
>> > To: "dev" <dev@flink.apache.org>
>> > Sent: Thursday, October 27, 2022, 9:13:51 PM
>> > Subject: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
>> >
>> > Hi,
>> >
>> > I would like to propose a solution to this JIRA issue. I looked at the
>> > comments and there was some guidance around where in the code we should
>> > make updates to allow for this behaviour. But I believe two questions
>> > still remain open:
>> >
>> >    1. Is this expected behaviour (i.e. users should not be able to
>> >    create an empty array)?
>> >    2. If this is indeed expected behaviour, what should the data type
>> >    be of the empty array?
>> >
>> > I did some digging into other query engines / databases in hopes of
>> > answering these two questions - the results can be found at the end of
>> > this email.
>> >
>> > *Q:* Is this expected behaviour (i.e. users should not be able to
>> > create an empty array)?
>> > *A:* Yes, I would say being able to create an empty array is expected
>> > behaviour and something we should add to the Flink SQL API.
>> >
>> > *Q:* What should the data type be of the empty array?
>> > *A:* This question is a bit harder to answer, and I think it would
>> > require two steps.
>> >
>> > *Step 1: Pick a default data type to initialize the empty array.*
>> >
>> > We can use an "empty data type" such as NULL or VOID.
>> >
>> > *Step 2: Create or reuse type coercion to make using empty arrays
>> > easier.*
>> >
>> > The above should unblock users from creating empty arrays, but consider
>> > what happens if one uses an empty array in a COALESCE operation, i.e.
>> >
>> > SELECT COALESCE(int_column, ARRAY[])
>> >
>> > I believe they will get a query error where the type of int_column
>> > (INTEGER) and the type of the empty array (NULL, VOID) do not match.
>> > Thus a user will need to cast the empty array, i.e.
>> >
>> > SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))
>> >
>> > for the COALESCE query to execute successfully.
>> >
>> > -----
>> >
>> > *Trino*
>> >
>> > > EXPLAIN SELECT ARRAY[]
>> > Fragment 0 [SINGLE]
>> >     Output layout: [expr]
>> >     Output partitioning: SINGLE []
>> >     Output[columnNames = [_col0]]
>> >     │   Layout: [expr:array(unknown)]
>> >     │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
>> >     │   _col0 := expr
>> >     └─ Values[]
>> >            Layout: [expr:array(unknown)]
>> >            Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
>> >            ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Unknown.*
>> >
>> > *Spark*
>> >
>> > > spark.sql("SELECT ARRAY()")
>> > DataFrame[array(): array<void>]
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Void.*
>> >
>> > *BigQuery*
>> >
>> > > SELECT ARRAY[]
>> > Field name    Type       Mode
>> > f0_           INTEGER    REPEATED
>> >
>> > Expected behaviour? *Yes.*
>> > Array data type? *Integer.*
>> >
>> > Best,
>> >
>> > Eric