I seem to still get the datatype array<void>:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .getOrCreate()
    )
    spark.sql("SELECT ARRAY() + 1")

    AnalysisException: cannot resolve '(array() + 1)' due to data type mismatch:
    differing types in '(array() + 1)' (array<void> and int).; line 1 pos 7;
    'Project [unresolvedalias((array() + 1), None)]
    +- OneRowRelation
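For what it's worth, the element type can also be checked directly, without forcing the type mismatch through + 1. A minimal check (assuming the same pyspark==3.3.1 environment as above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Print the type Spark assigns to a bare empty-array literal; on 3.3.1 this
    # reports array<void>, matching the explain() output quoted further down.
    print(spark.sql("SELECT ARRAY()").dtypes)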
Perhaps it has to do with me using PySpark vs Java or Scala Spark?

    $ pip3 freeze | grep pyspark
    pyspark==3.3.1

Regarding the coalesce query, I should have been a bit more explicit with my example:

    SELECT COALESCE(ARRAY(1), CAST(ARRAY() as ARRAY<INT>))

When I was referring to int_column I meant an int array column 😅. It makes sense why you cannot COALESCE a non-array data type with an array data type - this is also why the SELECT ARRAY() + 1 query fails.

---

> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
> integrated also into the parser (at least this is the case for ROW).

I think that is the problem indeed, and people have suggested overriding some of the functions inherited from the Calcite SqlArrayValueConstructor class to allow for the creation of empty arrays. Those suggestions worked, as shown in my exploratory PR: https://github.com/apache/flink/pull/21156. I think this is one change we will need to make in Flink.

> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> should result in ARRAY<NULL> if the type can not be derived by the
> surrounding call.

Are the SQL functions ARRAY_OF and ROW_OF supported in any other SQL dialect? I did a quick search and came up blank. I am still fairly new to the community, so I don't know what the stance on staying close to ANSI SQL is - having helper functions is always nice though. I am inclined to let the parser figure out the typing of the array based on the surrounding call and, if it cannot, use a default type such as ARRAY<NULL>.
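In Spark terms, "deriving the type from the surrounding call" is exactly what the explicit CAST provides, which is also why the corrected COALESCE example above resolves. A small sketch, assuming the same PySpark session as before:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A surrounding CAST is enough context to pin the empty array's element
    # type down to INT.
    print(spark.sql("SELECT CAST(ARRAY() AS ARRAY<INT>) AS empty_ints").dtypes)

    # Both COALESCE branches therefore resolve to ARRAY<INT>.
    spark.sql(
        "SELECT COALESCE(ARRAY(1), CAST(ARRAY() as ARRAY<INT>)) AS result"
    ).printSchema()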
On Fri, Oct 28, 2022 at 4:07 AM Timo Walther <twal...@apache.org> wrote:

> Actually, the new type inference stack for UDFs is smart enough to solve
> this issue. It could derive a data type for the array from the
> surrounding call (expected data type).
>
> So this can be supported with the right type inference logic:
> cast(ARRAY() as int)
>
> Unfortunately, ARRAY is fully managed by Calcite and maybe deeply
> integrated also into the parser (at least this is the case for ROW).
> TBH if I were to design a FLIP for the collection functions, I would
> actually propose to introduce `ARRAY_OF()`, `ROW_OF()` to have full
> control over the type inference in our stack. In our stack, this also
> means that NULL is unknown. Calcite distinguished between NULL and unknown.
>
> So if we wanna go the easy path (without introducing ARRAY_OF), ARRAY()
> should result in ARRAY<NULL> if the type can not be derived by the
> surrounding call.
>
> Regards,
> Timo
>
> On 28.10.22 03:46, yuxia wrote:
> > For an empty array, it seems different engines use different data types:
> > Hive: string
> > Spark: string ?
> > Trino: Unknown
> > BigQuery: Integer
> >
> > I have tried with Hive and Spark, but haven't tried with Trino and
> > BigQuery.
> >
> > I'm a little in doubt about Spark's behavior. But from my side, Spark
> > seems to actually use the string type, which is different from your
> > investigation. I tried the following SQL in spark-cli:
> > `
> > select array() + 1
> > `
> >
> > The exception is:
> > `
> > Error in query: cannot resolve '(array() + 1)' due to data type
> > mismatch: differing types in '(array() + 1)' (array<string> and int).;
> > line 1 pos 7;
> > 'Project [unresolvedalias((array() + 1), None)]
> > +- OneRowRelation
> > `
> >
> > It seems hard to decide which data type Flink should use. I'm
> > interested in the reason why you would like to use the Integer type.
> > I haven't checked whether the SQL standard specifies it, but from my
> > side, I prefer to follow Hive/Spark.
> >
> > BTW: the query `SELECT COALESCE(1, cast(ARRAY() as int))` will fail in
> > Hive and Spark.
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "eric xiao" <xiao.eric...@gmail.com>
> > To: "dev" <dev@flink.apache.org>
> > Sent: Thursday, October 27, 2022, 9:13:51 PM
> > Subject: [DISCUSS] FLINK-20578 Cannot create empty array using ARRAY[]
> >
> > Hi,
> >
> > I would like to propose a solution to this JIRA issue. I looked at the
> > comments and there was some guidance around where in the code we should
> > update to allow for this behaviour. But I believe there are still two
> > questions that remain open:
> >
> >    1. Is this expected behaviour (i.e. users should not be able to create
> >    an empty array)?
> >    2. If this is indeed expected behaviour, what should the data type be of
> >    the empty array?
> >
> > I did some digging into other query engines / databases in hopes of
> > answering the following two questions - that can be found at the end of
> > this thread.
> >
> > *Q:* Is this expected behaviour (i.e. users should not be able to create an
> > empty array)?
> > *A:* Yes, I would say this is expected behaviour and something we should add
> > into the Flink SQL API.
> >
> > *Q:* What should the data type be of the empty array?
> > *A:* This question is a bit harder to answer and I think it would require
> > two steps.
> >
> > *Step 1: Pick a default data type to initialize the empty array.*
> >
> > We can use an "empty data type" such as NULL, VOID.
> >
> > *Step 2: Create or reuse type coercion to make using empty arrays easier.*
> >
> > The above should unblock users from creating empty arrays, but consider what
> > happens if one uses an empty array in a COALESCE operation, i.e.
> >
> > SELECT COALESCE(int_column, ARRAY[])
> >
> > I believe they will get a query issue where the type for int_column (INTEGER)
> > and the empty array (NULL, VOID) do not match. Thus a user will need to
> > cast the empty array, i.e.
> >
> > SELECT COALESCE(int_column, CAST(ARRAY[] AS INT))
> >
> > for the COALESCE query to execute successfully.
> >
> > -----
> > *Trino*
> > > EXPLAIN SELECT ARRAY[]
> > Fragment 0 [SINGLE]
> >     Output layout: [expr]
> >     Output partitioning: SINGLE []
> >     Output[columnNames = [_col0]]
> >     │   Layout: [expr:array(unknown)]
> >     │   Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >     │   _col0 := expr
> >     └─ Values[]
> >            Layout: [expr:array(unknown)]
> >            Estimates: {rows: 1 (55B), cpu: 0, memory: 0B, network: 0B}
> >            ("$literal$"(from_base64('AwAAAFJMRQAAAAAKAAAAQllURV9BUlJBWQEAAAABgAAAAAA=')))
> >
> > Expected behaviour? *Yes.*
> > Array data type? *Unknown.*
> >
> > *Spark*
> > > sc.sql.sql("SELECT ARRAY[]").explain()
> > DataFrame[array(): array<void>]
> >
> > Expected behaviour? *Yes.*
> > Array data type? *Void.*
> >
> > *BigQuery*
> > > SELECT ARRAY[]
> > Field name    Type       Mode
> > f0_           INTEGER    REPEATED
> >
> > Expected behaviour? *Yes.*
> > Array data type? *Integer.*
> >
> > Best,
> >
> > Eric
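To make Step 2 of the quoted proposal a bit more concrete, here is a small Spark sketch of the cast-based workaround. Spark is used only because the behaviour can already be tested there today; the inline VALUES table and the int_array_col name are made up purely for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # A made-up table with an int-array column that is NULL for one row.
    spark.sql(
        "SELECT * FROM VALUES (ARRAY(1, 2)), (CAST(NULL AS ARRAY<INT>)) "
        "AS t(int_array_col)"
    ).createOrReplaceTempView("t")

    # The explicit CAST gives the empty-array literal the same element type as
    # the column, so both COALESCE branches agree on ARRAY<INT>.
    spark.sql(
        "SELECT COALESCE(int_array_col, CAST(ARRAY() AS ARRAY<INT>)) AS filled FROM t"
    ).show(truncate=False)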