> # add an action to try and debug
>
> # this should not change the physical plan. I.e. we still have the
> same number of shuffles
>
> # which results in the same number of stages. We are just not
> building up a plan with thousands
# execution plan should be the same for each join
# rawCountsSDF.explain()
self.logger.info("END\n")
return retNumReadsDF
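For anyone following along, here is a minimal, self-contained sketch of the
point being made above: calling an action such as count() executes the plan
but does not rewrite it, so explain() should print the same physical plan
(same shuffles, same stages) before and after. The dataframes and column
names here are made up for illustration, not Andy's actual data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# two tiny dataframes joined on a shared key (hypothetical data)
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "x"])
right = spark.createDataFrame([(1, "c"), (2, "d")], ["id", "y"])

joined = left.join(right, on="id", how="inner")

# physical plan before any action
joined.explain()

# count() is an action: it runs the plan but does not change it
joined.count()

# same plan as above; the action added no shuffles or stages
joined.explain()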
From: David Diebold
Date: Monday, January 3, 2022 at 12:39 AM
To: Andrew Davidson, "user @spark"
Subject:
Hello Andy,
Are you sure you want to perform lots of join operations, and not simple
unions?
Are you doing inner joins or outer joins?
Can you give us a rough idea of your list size, plus each individual
dataset size?
Having a look at the execution plan would help; maybe the high number of j
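For reference, a rough sketch of the union-style alternative David is
hinting at. The per-sample dataframes, the gene_id/count column names, and
the sample labels are all assumptions made to keep the example runnable,
not Andy's actual schema:

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical per-sample dataframes, each with (gene_id, count)
dfs = [
    spark.createDataFrame([("g1", 10), ("g2", 5)], ["gene_id", "count"]),
    spark.createDataFrame([("g1", 7), ("g2", 3)], ["gene_id", "count"]),
]
names = ["sample_1", "sample_2"]

# tag each dataframe with its sample label
tagged = [df.withColumn("sample", F.lit(n)) for df, n in zip(dfs, names)]

# one chain of unions instead of thousands of joins; a union is a
# narrow transformation and adds no shuffles to the plan
longDF = reduce(lambda a, b: a.unionByName(b), tagged)

# pivot back to one count column per sample if a wide table is needed
wideDF = longDF.groupBy("gene_id").pivot("sample").agg(F.first("count"))
wideDF.show()

The appeal of this layout is that each join of two dataframes typically adds
a shuffle, while a union only concatenates partitions, so the plan stays flat
no matter how many samples are folded in.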
Hi Gourav
I will give Databricks a try.
Each dataset gets loaded into a data frame.
I select one column from the data frame.
I join the column to the accumulated joins from previous data frames in
the list.
To debug, I think I am going to put an action and log statement after each
join (a rough sketch follows below). I do not thi
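A hedged reconstruction of the loop Andy describes, with an action and a
log statement after each join so that a failure surfaces at the join that
caused it. The dataframe list, join key, column names, and logger setup are
guesses made only to keep the sketch self-contained:

import logging
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("join_debug")

# hypothetical stand-ins for the real per-sample dataframes
dfs = [
    spark.createDataFrame([("g1", 10), ("g2", 5)], ["gene_id", "count_0"]),
    spark.createDataFrame([("g1", 7), ("g2", 3)], ["gene_id", "count_1"]),
    spark.createDataFrame([("g1", 2), ("g2", 9)], ["gene_id", "count_2"]),
]

# start from the first dataframe, then fold the rest in one join at a time
retNumReadsDF = dfs[0]
for i, df in enumerate(dfs[1:], start=1):
    # select only the join key and the single count column
    col = df.select("gene_id", df.columns[1])
    retNumReadsDF = retNumReadsDF.join(col, on="gene_id", how="inner")
    # action + log statement after each join; count() forces execution,
    # so the log shows how far the job got before any failure
    n = retNumReadsDF.count()
    logger.info("join %d complete, %d rows", i, n)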
Hi Andrew,
Any chance you might give Databricks a try in GCP?
The above transformations look complicated to me. Why are you adding
dataframes to a list?
Regards,
Gourav Sengupta
On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson wrote:
Hi
I am having trouble debugging my driver. It runs correctly on smaller data
sets but fails on large ones. It is very hard to figure out what the bug is. I
suspect it may have something to do with the way Spark is installed and
configured. I am using Google Cloud Platform Dataproc PySpark.
The lo