Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, apologies, I'm reading this on my phone as my work laptop doesn't let me access personal email. Are you actually doing anything with the counts (printing to log, writing to table)? If you're not doing anything with them, get rid of them and the caches entirely. If you do want to do somethin

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Thanks David, I did experiment with .cache() and have to admit I didn't see any marked improvement on the sample that I was running, so yes, I am a bit apprehensive about including it (not even sure why I actually left it in). When you say "do the count as the final step", are you referring

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi Ashley, I'm not an expert, but I think this is because Spark uses lazy execution and doesn't actually run anything until you call an action on the dataframe, such as a write or a count. If you remove the count steps it will work out a more efficient execution plan reducing the number
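To make the lazy-execution point concrete, here is a minimal plain-Python sketch of the idea. It is a toy model, not Spark's API: the `LazyFrame` class, its `stats` counter and the `demo()` helper are all hypothetical names made up for illustration. It only tries to show why a count after every step can replay the upstream work, and why a cache or a single final action avoids that.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataset (NOT Spark's API)."""

    def __init__(self, source=None, parent=None, predicate=None, stats=None):
        self._source = source        # callable that "reads" the raw data
        self._parent = parent        # upstream frame in the lineage
        self._predicate = predicate  # recorded transformation, not yet run
        self.stats = stats if stats is not None else {"reads": 0}
        self._keep = False           # set by cache()
        self._materialized = None    # rows kept after the first action

    def filter(self, predicate):
        # Transformation: only records the step, nothing is executed here.
        return LazyFrame(parent=self, predicate=predicate, stats=self.stats)

    def cache(self):
        # Marks the frame so its rows are kept once an action computes them.
        self._keep = True
        return self

    def _run(self):
        if self._materialized is not None:
            return self._materialized
        if self._parent is None:
            self.stats["reads"] += 1  # one full read of the source
            rows = list(self._source())
        else:
            rows = [r for r in self._parent._run() if self._predicate(r)]
        if self._keep:
            self._materialized = rows
        return rows

    def count(self):
        # Action: forces the recorded lineage to run.
        return len(self._run())

    def write(self):
        # Action: returns the rows, standing in for writing them out.
        return list(self._run())


def demo():
    def read_source():
        return range(10)

    # 1) A count after every step, then a write: three actions, three reads.
    df = LazyFrame(source=read_source)
    evens = df.filter(lambda x: x % 2 == 0)
    evens.count()
    big = evens.filter(lambda x: x > 3)
    big.count()
    big.write()
    reads_with_counts = df.stats["reads"]

    # 2) Same pipeline with the intermediate frame cached: one read.
    df_c = LazyFrame(source=read_source)
    evens_c = df_c.filter(lambda x: x % 2 == 0).cache()
    evens_c.count()
    big_c = evens_c.filter(lambda x: x > 3)
    big_c.count()
    big_c.write()
    reads_with_cache = df_c.stats["reads"]

    # 3) No intermediate counts, a single action at the end: one read.
    df_s = LazyFrame(source=read_source)
    out = df_s.filter(lambda x: x % 2 == 0).filter(lambda x: x > 3).write()
    reads_single_action = df_s.stats["reads"]

    return reads_with_counts, reads_with_cache, reads_single_action, out
```

Calling `demo()` returns `(3, 1, 1, [4, 6, 8])`: the uncached pipeline reads the source once per action, while the cached and single-action versions read it once. Real Spark is more involved (the optimizer, partitioning and cache eviction all play a part), so treat this only as an illustration of the reasoning, not as a performance prediction.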

Re: Start a standalone server as root and use it with user accounts

2020-02-12 Thread WranglingData
I hate to be "that guy", but I'd like to know myself. I tried to set up something similar, except I created a "service" account which starts the Spark service, but like you, I kept on getting file permission errors when submitting jobs under my own login. My current workaround was to su to the ser

Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Hi, I am currently working on an app using PySpark to produce a daily insert and update delta capture, output as Parquet. This is running on an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores of 2 GB memory each) running Spark 2.4.3. This is being achieved by reading

Ceph / Lustre VS hdfs comparison

2020-02-12 Thread Nicolas PARIS
Hi, does anyone have experience with Ceph / Lustre as a replacement for HDFS for Spark storage (Parquet, ORC..)? Is HDFS still far superior to the former? Thanks -- nicolas paris