Re: Spark Doubts

2022-06-25 Thread Tufan Rakshit
Please find the answers inline. 1) Can I apply predicate pushdown filters if I have data stored in S3, or should it be used only while reading from DBs? It can be applied in S3 if you store Parquet, CSV, JSON, or Avro. It does not depend on the DB; it's supported in object store li

Re: Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread Tufan Rakshit
There are a few solutions: 1. Please make sure your driver has enough memory to broadcast the smaller DataFrame. 2. Please change the config "spark.sql.autoBroadcastJoinThreshold": "2g" (this is an example). 3. Please use a hint in the join; you need to scroll down a bit: https://spark.apache.org/docs/l
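A minimal sketch of options 2 and 3, not taken from the thread. The helper names `configure_broadcast_threshold` and `join_with_hint` are hypothetical; `spark` is assumed to be an existing SparkSession, and `left` and `right` existing DataFrames. Setting the threshold to "-1" disables automatic broadcast joins, and the "merge" hint (Spark 3.0+) asks the planner for a sort-merge join.

```python
def configure_broadcast_threshold(spark, threshold="-1"):
    """Set the size limit below which Spark auto-broadcasts one join side.

    `spark` is assumed to be a SparkSession. "-1" turns automatic
    broadcast joins off; a size such as "2g" (the example value from
    the message above) raises the limit instead.
    """
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold)


def join_with_hint(left, right, key, strategy="merge"):
    """Join two DataFrames with an explicit join strategy hint on `right`.

    "merge" requests a sort-merge join; "broadcast" requests a
    broadcast hash join. A hint is a request to the planner, not a
    guarantee.
    """
    return left.join(right.hint(strategy), on=key)
```

Usage would look like `configure_broadcast_threshold(spark)` followed by `join_with_hint(orders, customers, "customer_id")`, where `orders`, `customers`, and `customer_id` are placeholder names.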

Re: about cpu cores

2022-07-10 Thread Tufan Rakshit
It mainly depends on your cluster manager: YARN or Kubernetes? Best Tufan On Sun, 10 Jul 2022 at 14:38, Sean Owen wrote: > Jobs consist of tasks, each of which consumes a core (can be set to >1 > too, but that's a different story). If there are more tasks ready to > execute than available cores,

Re: about cpu cores

2022-07-11 Thread Tufan Rakshit
So on average, for every 4 cores you get back 3.6 cores in YARN, but you can use only 3. In Kubernetes you get back 3.6 and can also use 3.6. Best Tufan On Mon, 11 Jul 2022 at 11:02, Yong Walt wrote: > We were using Yarn. thanks. > > On Sun, Jul 10, 2022 at 9:02 PM Tufan Raksh
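The arithmetic in this message can be sketched as follows. This is an illustration only: the function name `usable_millicores` is hypothetical, the 90% allocatable share is simply the 3.6-of-4 ratio quoted above, and the rounding assumes YARN hands out whole vcores while Kubernetes accepts fractional CPU requests (expressed here in millicores).

```python
def usable_millicores(node_cores, allocatable_pct=90):
    """Estimate CPU usable by executors on one node, in millicores.

    allocatable_pct=90 is an assumption taken from the "3.6 out of 4"
    figure in the message; real overhead depends on the cluster setup.
    """
    allocatable = node_cores * 1000 * allocatable_pct // 100
    return {
        "allocatable": allocatable,
        # YARN allocates whole vcores, so the fraction is lost.
        "yarn": (allocatable // 1000) * 1000,
        # Kubernetes CPU requests can be fractional.
        "kubernetes": allocatable,
    }


print(usable_millicores(4))
# {'allocatable': 3600, 'yarn': 3000, 'kubernetes': 3600}
```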

Re: [Building] Building with JDK11

2022-07-15 Thread Tufan Rakshit
Maybe try IntelliJ or some other IDE with SBT. Maven has always been magical for me. Best Tufan On Sat, 16 Jul 2022 at 00:11, Sean Owen wrote: > Java 8 binaries are probably on your PATH > > On Fri, Jul 15, 2022, 5:01 PM Szymon Kuryło > wrote: > >> Hello, >> >> I'm trying to build a Java 11 Sp

Re: Question regarding how to make spar Scala to evenly divide the spark job between executors

2022-07-17 Thread Tufan Rakshit
Hey, could you provide some pseudocode? Also, what kind of machine are you using per executor? How many cores per executor? What's the size of the input data and what's the size of the output? What kind of errors are you getting? Best Tufan On Sun, 17 Jul 2022 at 00:31, Orkhan Dadashov wrote: >

Re: [EXTERNAL] Partial data with ADLS Gen2

2022-07-24 Thread Tufan Rakshit
Just use Delta. Best Tufan > On 24 Jul 2022, at 12:20, Shay Elbaz wrote: > > This is a known issue. Apache Iceberg, Hudi and Delta Lake are among the > possible solutions. > Alternatively, instead of writing the output directly to the "official" > location, write it t

Re: Help with Shuffle Read performance

2022-09-29 Thread Tufan Rakshit
That's total nonsense. EMR is total crap; use Kubernetes, I will help you. Can you please provide the size of the shuffle file that is generated in each task? What's the total number of partitions that you have? What machines are you using? Are you using an SSD? Best Tufan On

Re: Re: Re: Build SPARK from source with SBT failed

2023-03-07 Thread Tufan Rakshit
I use M1 Apple silicon with Java 11 from Zulu, and run SBT-based build jobs in Kubernetes. Best Tufan On Tue, 7 Mar 2023 at 16:11, Sean Owen wrote: > No, it's that JAVA_HOME wasn't set to .../Home. It is simply not finding > javac, in the error. Zulu supports M1. > > On Tue, Mar 7, 2023 at 9:0