Hi Ashley,
Apologies, I'm reading this on my phone as my work laptop doesn't let me
access personal email.
Are you actually doing anything with the counts (printing them to a log,
writing them to a table)?
If you're not doing anything with them, get rid of them and the caches
entirely.
If you do want to do something with them, do the count as the final step.
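Something like this (a rough sketch; the DataFrame and path names are made
up):

    # Write first; the transformation runs exactly once as part of the write.
    delta_df.write.mode("append").parquet("/path/to/delta")

    # If you still want the number for a log line, count the written
    # output instead of re-running the whole pipeline:
    n = spark.read.parquet("/path/to/delta").count()
    print("rows written: %d" % n)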
Thanks, David,
I did experiment with the .cache() method and have to admit I didn't see
any marked improvement on the sample I was running, so yes, I am a bit
apprehensive about including it (not even sure why I actually left it in).
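For what it's worth, my understanding is that .cache() only pays off when
the same DataFrame feeds more than one action; a minimal sketch (column and
path names invented):

    df = spark.read.parquet("/data/source")   # hypothetical input
    changed = df.filter(df.op_type == "U")    # hypothetical filter

    changed.cache()                  # only helps if 'changed' is reused
    n = changed.count()              # first action materialises the cache
    changed.write.parquet("/data/updates")    # second action reads the cache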
When you say "do the count as the final step", are you referring …
Hi Ashley,
I'm not an expert, but I think this is because Spark uses lazy execution
and doesn't actually perform any work until you run some kind of write,
count, or other action on the DataFrame.
If you remove the count steps, it will work out a more efficient execution
plan, reducing the number of passes over the data.
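A minimal illustration (paths invented); each action below triggers its own
job, so every extra count is an extra pass:

    df = spark.read.parquet("/data/in")    # nothing is read yet: lazy
    out = df.where(df.amount > 0)          # still lazy, just plan-building

    out.count()                            # action 1: scans the source
    out.write.parquet("/data/out")         # action 2: scans it again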
I hate to be "that guy", but I'd like to know myself.
I tried to set up something similar, except I created a "service" account
which starts the Spark service, but like you, I kept getting file
permission errors when submitting jobs under my own login. My current
workaround was to su to the service account.
Hi,
I am currently working on an app using PySpark to produce a daily
insert-and-update delta capture, output as Parquet. This is running on
an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores
with 2 GB of memory each) running Spark 2.4.3.
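For context, the session setup this implies would look roughly like the
sketch below (the master URL and app name are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("spark://myhost:7077")          # standalone master
        .appName("daily-delta-capture")
        .config("spark.cores.max", "6")         # 6 of the 8 cores
        .config("spark.executor.memory", "2g")  # 2 GB per executor
        .getOrCreate()
    )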
This is being achieved by reading …
Hi
Does anyone have experience with Ceph or Lustre as a replacement for HDFS
for Spark storage (Parquet, ORC, ...)?
Is HDFS still far superior to these alternatives?
Thanks
--
nicolas paris