Spark Memory Optimization

2017-06-23 Thread Tw UxTLi51Nus
Hi, I have a Spark-SQL DataFrame (reading from parquet) with some 20 columns. The data is divided into chunks of about 50 million rows each. Among the columns is a "GROUP_ID", which is basically a string of 32 hexadecimal characters. Following the guide [0], I thought to improve performance …
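
One way such a column is often shrunk (a sketch of one possible technique, not necessarily what guide [0] suggests; the session setup, input path, and new column name below are hypothetical) is to store the 32-character hex string as 16 raw bytes via unhex:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, unhex}

    // Hypothetical session and input path, for illustration only.
    val spark = SparkSession.builder().appName("group-id-compaction").getOrCreate()
    val df = spark.read.parquet("/data/events")   // ~20 columns, GROUP_ID among them

    // unhex turns the 32-character hex string into 16 raw bytes (BinaryType),
    // roughly halving the column's size before JVM string overhead is even counted.
    val compact = df
      .withColumn("GROUP_ID_BIN", unhex(col("GROUP_ID")))
      .drop("GROUP_ID")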

Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
Using persist() is sort of a "hack" or a hint (depending on your perspective :-)) to make the RDD use disk, not memory. As I mentioned though, the disk I/O has consequences, mainly (I think) making sure you have enough disks so that I/O doesn't become a bottleneck. Increasing partitions, I think, is the other …
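
A minimal sketch of that combination, persisting at DISK_ONLY while also increasing the partition count; the input glob and the number 400 are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(new SparkConf().setAppName("disk-only-sketch"))

    // Hypothetical input glob and partition count.
    val lines = sc.textFile("/data/txt/*.txt")

    // More, smaller partitions keep each task's working set modest; DISK_ONLY
    // materializes the partitions on local disk rather than in executor memory.
    val spilled = lines
      .repartition(400)
      .persist(StorageLevel.DISK_ONLY)

    spilled.count()   // the first action writes the disk-backed copy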

Re: Spark memory optimization

2014-07-07 Thread Igor Pernek
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling cache/persist), so do I still need to use the DISK_ONLY storage level? However, I do use reduceByKey and sortByKey. Mayur, you mentioned that sortByKey requires data to fit in memory. Is there any way to work around this (maybe …
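
One commonly suggested workaround (a sketch, not a reply from this thread): both reduceByKey and sortByKey accept an explicit partition count, so each task only has to process a smaller slice of the data. The key extraction and the count of 400 below are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-sketch"))

    // Hypothetical pairs; only the extra numPartitions arguments matter here.
    val pairs = sc.textFile("/data/txt/*.txt").map(line => (line.take(8), 1L))

    // Both shuffles take an explicit partition count, so each reduce/sort task
    // only holds one (smaller) partition's worth of data at a time.
    val counts = pairs.reduceByKey(_ + _, 400)
    val sorted = counts.sortByKey(ascending = true, numPartitions = 400)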

Re: Spark memory optimization

2014-07-04 Thread Surendranauth Hiraman
When using DISK_ONLY, keep in mind that the disk I/O cost is pretty high. Make sure you are writing to multiple disks for best operation. And even with DISK_ONLY, we've found that there is a minimum threshold for executor RAM (spark.executor.memory), which for us seemed to be around 8 GB. If you find that …
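
For reference, a minimal sketch of setting that executor-memory floor in code; the "8g" value simply mirrors the threshold mentioned above, and the app name is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only; "8g" mirrors the floor described above.
    val conf = new SparkConf()
      .setAppName("disk-only-job")
      .set("spark.executor.memory", "8g")

    val sc = new SparkContext(conf)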

Re: Spark memory optimization

2014-07-04 Thread Mayur Rustagi
I would go with Spark only if you are certain that you are going to scale out in the near future. You can change the default storage level of RDDs to DISK_ONLY; that might remove issues around any RDD leveraging memory. There are some functions, particularly sortByKey, that require data to fit in memory to work …

Spark memory optimization

2014-07-04 Thread Igor Pernek
Hi all! I have a folder with 150 GB of txt files (around 700 files, each about 200 MB on average). I'm using Scala to process the files and calculate some aggregate statistics in the end. I see two possible approaches to do that: - manually loop through all the files, do the calculations per file and merge …
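
The message is cut off here in the archive; presumably the second approach is Spark, given the thread. A minimal sketch of that Spark-based route, under an assumed layout (one record per line, first field numeric) with a hypothetical path and aggregate:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("txt-aggregates"))

    // sc.textFile accepts a glob, so the ~700 files become one partitioned RDD.
    val lines = sc.textFile("/data/txt/*.txt")

    // Example aggregate: line count plus the sum of the first numeric field.
    val (count, sum) = lines
      .map(_.split("\\s+")(0).toDouble)
      .aggregate((0L, 0.0))(
        (acc, v) => (acc._1 + 1, acc._2 + v),
        (a, b)   => (a._1 + b._1, a._2 + b._2))

    println(s"lines=$count sum=$sum")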