Does Spark 3.1.1 support Parquet nested-column predicate pushdown for array-type and map-type columns?

2021-05-18 Thread 石鹏磊
I found that a filter like 'item[0].id > 1' cannot be pushed down. Does anyone know why? -- Penglei Shi
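For context, one way to check this is to inspect the PushedFilters entry in the physical plan. A minimal sketch, assuming a hypothetical Parquet path /tmp/items and an array-of-struct column named item as in the question (as of Spark 3.1, nested predicate pushdown covers struct fields, but filters that index into array or map elements are generally not pushed down):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("pushdown-check").getOrCreate()

    # Hypothetical dataset: an array-of-struct column named "item".
    df = spark.createDataFrame([
        Row(item=[Row(id=1), Row(id=2)]),
        Row(item=[Row(id=3)]),
    ])
    df.write.mode("overwrite").parquet("/tmp/items")

    # Filter on an array element, then look for PushedFilters in the plan.
    read_back = spark.read.parquet("/tmp/items")
    read_back.filter(col("item")[0]["id"] > 1).explain()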

S3 Access Issues - Spark

2021-05-18 Thread KhajaAsmath Mohammed
Hi, I have written a sample Spark job that reads data residing in HBase. I keep getting the error below; any suggestions to resolve this, please? Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3
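A common fix is to supply the credentials the error names through the Hadoop configuration. A sketch with placeholder keys and a hypothetical bucket path (in practice, instance profiles or a credential provider are preferable to hard-coded keys):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3-read")
        # Keys for the legacy s3:// scheme named in the error message:
        .config("spark.hadoop.fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY")
        .config("spark.hadoop.fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY")
        # Equivalent settings if reading via the newer s3a:// scheme:
        .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
        .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://your-bucket/path/")  # hypothetical path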

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
Thanks a lot Mich, this works, though I have to test it for scalability. One question though: if we don't specify any column in partitionBy, will it shuffle all the records into one executor? Because this is what seems to be happening. Thanks once again! Regards, Kushagra Deep On Tue, May 18,
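For reference: yes, a window with no partitionBy does move every row into a single partition, and Spark logs a warning to that effect. A minimal illustration, assuming hypothetical data:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-partitions").getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "amount_6m")

    # No partitionBy: Spark warns "No Partition Defined for Window operation!
    # Moving all data to a single partition" -- all rows go through one task.
    w_all = Window.orderBy(monotonically_increasing_id())
    numbered = df.withColumn("rn", row_number().over(w_all))
    print(numbered.rdd.getNumPartitions())  # 1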

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Ok, this should hopefully work as it uses row_number.

    from pyspark.sql.window import Window
    import pyspark.sql.functions as F
    from pyspark.sql.functions import row_number

    def spark_session(appName):
        return SparkSession.builder \
            .appName(appName) \
            .enableHiveSupport() \
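The snippet is truncated, but the idea is to number the rows of each single-column DataFrame and join on that number. A sketch of the full approach, assuming the row order defines the pairing (the column names amount_6m and amount_9m come from the thread):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("merge-two-dataframes").getOrCreate()

    df1 = spark.createDataFrame([(100,), (200,)], ["amount_6m"])
    df2 = spark.createDataFrame([(50,), (60,)], ["amount_9m"])

    # Assign a positional row number to each frame. Note: with no
    # partitionBy this window pulls all rows into one partition.
    w = Window.orderBy(monotonically_increasing_id())
    df1_rn = df1.withColumn("rn", row_number().over(w))
    df2_rn = df2.withColumn("rn", row_number().over(w))

    # Pair the frames row by row and drop the helper column.
    merged = df1_rn.join(df2_rn, "rn").drop("rn")
    merged.show()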

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
The use case is to calculate PSI/CSI values. And yes, the union is one-to-one by row, as you showed. On Tue, May 18, 2021, 20:39 Mich Talebzadeh wrote: > > Hi Kushagra, > > A bit late on this but what is the business use case for this merge? > > You have two data frames each with one column and you

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Hi Kushagra, a bit late on this, but what is the business use case for this merge? You have two data frames, each with one column, and you want to UNION them in a certain way, but the correlation is not known. In other words, this UNION is as is?

    amount_6m | amount_9m
    100       | 50

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh
Something like below:

    root
     |-- window: struct (nullable = false)
     |    |-- start: timestamp (nullable = true)
     |    |-- end: timestamp (nullable = true)
     |-- avg(temperature): double (nullable = true)

    import pyspark.sql.function
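That schema is what a windowed aggregation produces. A minimal sketch of the kind of query that yields it, assuming a hypothetical streaming source (the built-in rate source stands in for the real one, with a derived temperature column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, avg, col

    spark = SparkSession.builder.appName("stream-avg").getOrCreate()

    # Hypothetical source; any stream with an event-time column works.
    readings = (
        spark.readStream.format("rate").load()
        .withColumn("temperature", (col("value") % 40).cast("double"))
    )

    # Average temperature over 5-minute event-time windows. The result has
    # a struct column "window" with start/end fields, matching the schema.
    ResultM = (
        readings
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(col("timestamp"), "5 minutes"))
        .agg(avg("temperature"))
    )
    ResultM.printSchema()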

Re: Calculate average from Spark stream

2021-05-18 Thread Mich Talebzadeh
Ok, let me provide some suggestions here. ResultM is a data frame, and if you do ResultM.printSchema() you will get the struct column called window with two fields, namely start and end, plus the average temperature. Just try to confirm that now. HTH, Mich On Tue, 18 May 2021 at 14:15, Giuseppe Ri

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-18 Thread Maziyar Panahi
Hi Rao, Yes, I have created this ticket: https://issues.apache.org/jira/browse/SPARK-35066 It's not assigned to anybody, so I don't have any ETA on a fix or possible workarounds. Best, Maziyar > On 18 May 2021, at 07:42, Rao, Abhishek (Nok