I prefer not to do a .cache() due to memory limits, but I did try a
persist() with DISK_ONLY.
I did the repartition(), followed by a .count(), followed by a persist()
with DISK_ONLY.
That didn't change the number of tasks either.
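Roughly what I ran, as a sketch (PySpark; the DataFrame name and the
partition count are placeholders, and here the persist is marked before the
count so the action materializes the on-disk copy):

    from pyspark import StorageLevel

    df = df.repartition(200)              # 200 is just an illustrative target
    df.persist(StorageLevel.DISK_ONLY)    # mark for disk-only persistence
    df.count()                            # action to materialize the persisted partitions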
On Sun, Jul 1, 2018, 15:50 Alexander Czech wrote:
You could try to force a repartition right at that point by producing a
cached version of the DF with .cache(), if memory allows it.
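Something like this, as a rough sketch (df and the partition count are
placeholders):

    df = df.repartition(200).cache()   # 200 is illustrative
    df.count()                         # action that actually populates the cache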
On Sun, Jul 1, 2018 at 5:04 AM, Abdeali Kothari wrote:
I've tried that too - it doesn't work. It does a repartition, but not right
after the broadcast join - it does a lot more processing and does the
repartition right before my next sort-merge join (stage 12 I described
above).
As the heavy processing is before the sort-merge join, it still doesn't help.
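To illustrate the shape of the job (a sketch with made-up DataFrame and
column names, a placeholder expensive_udf, and an arbitrary partition count
of 500):

    from pyspark.sql import functions as F

    joined = big_df.join(F.broadcast(lookup_df), "key")          # the broadcast join
    joined = joined.repartition(500)                             # where I ask for the repartition
    heavy  = joined.withColumn("out", expensive_udf(F.col("x"))) # the heavy per-record processing
    result = heavy.join(other_df, "key2")                        # the next sort-merge join (stage 12)
    result.explain(True)  # the Exchange shows up just before the SortMergeJoin,
                          # not right after the BroadcastHashJoin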
Abdeali Kothari wrote:
My entire CSV is less than 20KB.
Somewhere in between, I do a broadcast join with 3500 records in another
file.
After the broadcast join I have a lot of processing to do. Overall, the
time to process a single record goes up to 5 minutes on 1 executor.
I'm trying to increase the partitions that my data is split into.
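For context, the join itself is just a broadcast of the small lookup file
(a sketch; file paths and the join column are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    main_df   = spark.read.csv("main.csv", header=True)    # the <20KB, ~500-record file
    lookup_df = spark.read.csv("lookup.csv", header=True)  # the ~3500-record file
    joined = main_df.join(F.broadcast(lookup_df), "key")   # broadcast join, no shuffle of main_df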
Abdeali Kothari wrote:
> I am using Spark 2.3.0 and trying to read a CSV file which has 500
> records.
> When I try to read it, spark says that it has two stages: 10, 11 and then
> they join into stage 12.
What's your CSV size per file? I think the Spark optimizer may put many
files into one task when the individual files are small.
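If it's that small-file packing, these are the settings that govern it
(a sketch; spark is the SparkSession and the values shown are just the
defaults):

    # Spark packs multiple small files into a single read task based on these:
    spark.conf.set("spark.sql.files.maxPartitionBytes", 134217728)  # max bytes per read task (default 128 MB)
    spark.conf.set("spark.sql.files.openCostInBytes", 4194304)      # assumed cost to open a file (default 4 MB)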