Does Spark 3.1.1 support Parquet nested-column predicate pushdown for array-type and map-type columns?

2021-05-18 Thread 石鹏磊
I found that a filter like 'item[0].id > 1' cannot be pushed down. Does anyone know why? -- Penglei Shi
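For context, one way to check this is to inspect the PushedFilters entry in the physical plan. A minimal sketch, assuming a hypothetical Parquet path /tmp/items and an array-of-struct column named item as in the question (as of Spark 3.1, nested predicate pushdown covers struct fields, but filters that index into array or map elements are generally not pushed down):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

    # Hypothetical dataset: an array-of-struct column named "item".
    df = spark.createDataFrame([
        Row(item=[Row(id=1), Row(id=2)]),
        Row(item=[Row(id=3)]),
    ])
    df.write.mode("overwrite").parquet("/tmp/items")

    # Filter on an array element, then look for PushedFilters in the plan.
    read_back = spark.read.parquet("/tmp/items")
    read_back.filter(col("item")[0]["id"] > 1).explain()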

S3 Access Issues - Spark

2021-05-18 Thread KhajaAsmath Mohammed
Hi, I have written a sample Spark job that reads data residing in HBase. I keep getting the error below; any suggestions to resolve this, please? Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3
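A common fix is to supply the credentials the error names through the Hadoop configuration. A sketch with placeholder keys and a hypothetical bucket path (in practice, instance profiles or a credential provider are preferable to hard-coded keys):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3-read")
        # Keys for the legacy s3:// scheme named in the error message:
        .config("spark.hadoop.fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY")
        .config("spark.hadoop.fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")
        # Equivalent settings if reading via the newer s3a:// scheme:
        .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://your-bucket/path/")  # hypothetical path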

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
Thanks a lot Mich, this works, though I have to test it for scalability. One question though: if we don't specify any column in partitionBy, will it shuffle all the records into one executor? Because this is what seems to be happening. Thanks once again! Regards, Kushagra Deep On Tue, May 18,
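For reference: yes, a window with no partitionBy does move every row into a single partition, and Spark logs a warning to that effect. A minimal illustration, assuming hypothetical data:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-partitions").getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "amount_6m")

    # No partitionBy: Spark warns "No Partition Defined for Window operation!
    # Moving all data to a single partition" -- all rows go through one task.
    w_all = Window.orderBy(monotonically_increasing_id())
    numbered = df.withColumn("rn", row_number().over(w_all))
    print(numbered.rdd.getNumPartitions())  # 1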

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Ok, this should hopefully work as it uses row_number.

    from pyspark.sql.window import Window
    import pyspark.sql.functions as F
    from pyspark.sql.functions import row_number

    def spark_session(appName):
        return SparkSession.builder \
            .appName(appName) \
            .enableHiveSupport() \
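The snippet is truncated, but the idea is to number the rows of each single-column DataFrame and join on that number. A sketch of the full approach, assuming the row order defines the pairing (the column names amount_6m and amount_9m come from the thread):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("merge-two-dataframes").getOrCreate()

    df1 = spark.createDataFrame([(100,), (200,)], ["amount_6m"])
    df2 = spark.createDataFrame([(50,), (60,)], ["amount_9m"])

    # Assign a positional row number to each frame. Note: with no
    # partitionBy this window pulls all rows into one partition.
    w = Window.orderBy(monotonically_increasing_id())
    df1_rn = df1.withColumn("rn", row_number().over(w))
    df2_rn = df2.withColumn("rn", row_number().over(w))

    # Pair the frames row by row and drop the helper column.
    merged = df1_rn.join(df2_rn, "rn").drop("rn")
    merged.show()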

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
The use case is to calculate PSI/CSI values. And yes, the union is one-to-one by row, as you showed. On Tue, May 18, 2021, 20:39 Mich Talebzadeh wrote: > > Hi Kushagra, > > A bit late on this but what is the business use case for this merge? > > You have two data frames each with one column and you

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Hi Kushagra, a bit late on this, but what is the business use case for this merge? You have two data frames, each with one column, and you want to UNION them in a certain way, but the correlation is not known. In other words, this UNION is as is?

    amount_6m | amount_9m
    100       | 50

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh
Something like below:

    root
     |-- window: struct (nullable = false)
     |    |-- start: timestamp (nullable = true)
     |    |-- end: timestamp (nullable = true)
     |-- avg(temperature): double (nullable = true)

    import pyspark.sql.function
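That schema is what a windowed aggregation produces. A minimal sketch of the kind of query that yields it, assuming a hypothetical streaming source (the built-in rate source stands in for the real one, with a derived temperature column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, avg, col

    spark = SparkSession.builder.appName("stream-avg").getOrCreate()

    # Hypothetical source; any stream with an event-time column works.
    readings = (
        spark.readStream.format("rate").load()
        .withColumn("temperature", (col("value") % 40).cast("double"))
    )

    # Average temperature over 5-minute event-time windows. The result has
    # a struct column "window" with start/end fields, matching the schema.
    ResultM = (
        readings
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(col("timestamp"), "5 minutes"))
        .agg(avg("temperature"))
    )
    ResultM.printSchema()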

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh
Ok, let me provide some suggestions here. ResultM is a data frame, and if you do ResultM.printSchema() you will get the struct column called window with two fields, namely start and end, plus the average temperature. Just try to confirm that now. HTH, Mich On Tue, 18 May 2021 at 14:15, Giuseppe Ri

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-18 Thread Maziyar Panahi
Hi Rao, Yes, I have created this ticket: https://issues.apache.org/jira/browse/SPARK-35066 It's not assigned to anybody, so I don't have any ETA on a fix or possible workarounds. Best, Maziyar > On 18 May 2021, at 07:42, Rao, Abhishek (Nok