> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint)
I wonder if Avro is a better candidate for this; because it is row-oriented, it should be faster to write and read for such a task. I had never heard about checkpoint.
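Something along these lines is what I had in mind (untested; it assumes the spark-avro package matching Spark 2.4.3 is on the classpath, and df, spark and the path are just placeholders):

    # write the intermediate result row-oriented as Avro, then read it back
    # needs e.g. --packages org.apache.spark:spark-avro_2.11:2.4.3
    df.write.format("avro").mode("overwrite").save("hdfs:///tmp/intermediate_avro")
    df = spark.read.format("avro").load("hdfs:///tmp/intermediate_avro")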
Enrico Minack writes:
> It is not about very large or small, it is
It is not about very large or small; it is about how large your cluster is w.r.t. your data. Caching is only useful if you have the necessary memory available across your executors. Otherwise you could either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed have to do t
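A rough sketch of those two options (untested; paths and names are placeholders):

    # option 1: materialize on HDFS as Parquet and read it back
    df.write.mode("overwrite").parquet("hdfs:///tmp/df_materialized")
    df = spark.read.parquet("hdfs:///tmp/df_materialized")

    # option 2: checkpoint, which writes to the checkpoint directory and cuts the lineage
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
    df = df.checkpoint()  # eager by default, so this runs a job immediately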
> .dropDuplicates() \
>     .cache()
> Since df_actions is cached, you can count inserts and updates quickly
> with only that one join in df_actions:
Hi Enrico. I am wondering if this is OK for very large tables? Is caching faster than recomputing both the inserts and updates?
Thanks
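For reference, the pattern I am asking about is roughly this (paraphrased and untested; variable and column names are made up):

    from pyspark.sql.functions import col

    df_actions = df_joined.dropDuplicates() \
                          .cache()

    # both counts reuse the cached df_actions instead of recomputing the join
    insert_count = df_actions.filter(col("action") == "insert").count()
    update_count = df_actions.filter(col("action") == "update").count()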
Enrico Minack writes:
Hi,
Thank you both for your suggestions! These have been eye-openers for me. Just to clarify, I need the counts for logging and auditing purposes; otherwise I would exclude the step. I should have also mentioned that
while I am processing around 30 GB of raw data, the individual outputs are
relat
Ashley,
I want to suggest a few optimizations. The problem might go away, but at the very least performance should improve.
The freeze problems could have many causes; the Spark UI SQL pages and stage detail pages would be useful. You can send them privately if you wish.
1. the repartition(1) shoul
Hi Ashley,
Apologies, I'm reading this on my phone as my work laptop doesn't let me access personal email.
Are you actually doing anything with the counts (printing to a log, writing to a table)?
If you're not doing anything with them, get rid of them and the caches entirely.
If you do want to do somethin
Thanks David,
I did experiment with the .cache() call and have to admit I didn't see any marked improvement on the sample that I was running, so yes, I am a bit apprehensive about including it (I'm not even sure why I actually left it in).
When you say "do the count as the final step", are you referring
Hi Ashley,
I'm not an expert, but I think this is because Spark uses lazy execution and doesn't actually do any work until you perform an action on the DataFrame, such as a write or a count.
If you remove the count steps it will work out a more efficient execution plan, reducing the number
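A made-up illustration of what I mean:

    df = spark.read.parquet("some/input")   # nothing is executed yet
    df2 = df.filter(df.value > 0)           # still only building the plan
    n = df2.count()                         # action: the whole plan runs here
    df2.write.parquet("some/output")        # another action: the plan runs again
                                            # unless df2 was cached beforehand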
Hi,
I am currently working on an app that uses PySpark to produce a daily insert-and-update delta capture, output as Parquet. This is running on an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores of 2 GB memory each), running Spark 2.4.3.
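For reference, that setup corresponds to configuration roughly along these lines (illustrative values only, not my exact settings):

    from pyspark.sql import SparkSession

    # one reading of "6 worker cores of 2 GB memory each" against a standalone master
    spark = SparkSession.builder \
        .appName("daily-delta-capture") \
        .master("spark://localhost:7077") \
        .config("spark.executor.memory", "2g") \
        .config("spark.executor.cores", "1") \
        .config("spark.cores.max", "6") \
        .getOrCreate()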
This is being achieved by reading