Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
Using persist() is sort of a "hack" or a hint (depending on your perspective :-)) to make the RDD use disk, not memory. As I mentioned, though, the disk I/O has consequences, mainly (I think) making sure you have enough disks so that I/O doesn't become a bottleneck. Increasing partitions, I think, is the other…
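A minimal sketch of the two suggestions above, persisting to disk and raising the partition count. The input path, parsing, and the partition count of 400 are placeholders, not details from the original job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object DiskPersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("disk-persist-sketch"))

        // Hypothetical input; substitute your own data source and parsing.
        val records = sc.textFile("hdfs:///data/events")
          .map(_.toLowerCase)

        // The "hint": keep this RDD on disk only, so it never competes for
        // executor heap. The trade-off is extra disk I/O every time it is reused.
        val onDisk = records.persist(StorageLevel.DISK_ONLY)

        // More partitions means smaller partitions, so each task holds
        // less data in memory at a time.
        val repartitioned = onDisk.repartition(400)

        println(repartitioned.count())
        sc.stop()
      }
    }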

Re: Spark memory optimization

2014-07-07 Thread Igor Pernek
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling cache/persist); do I still need to use the DISK_ONLY storage level? However, I do use reduceByKey and sortByKey. Mayur, you mentioned that sortByKey requires data to fit in memory. Is there any way to work around this (maybe…
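One hedged sketch of the partition-count angle for this question: both reduceByKey and sortByKey accept an explicit number of partitions, which shrinks how much data each task has to hold at once. Paths and the count of 400 are invented for illustration; on Spark 1.x you would also need import org.apache.spark.SparkContext._ for the pair-RDD implicits:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShufflePartitionsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions-sketch"))

        // Hypothetical key/value pipeline; swap in the real parsing logic.
        val pairs = sc.textFile("hdfs:///data/events")
          .map(line => (line.split(",")(0), 1L))

        // Both shuffles take an explicit partition count. A higher count means
        // smaller partitions, so each task keeps less data in memory at once.
        val counts = pairs.reduceByKey(_ + _, 400)
        val sorted = counts.sortByKey(true, 400) // ascending, numPartitions

        sorted.saveAsTextFile("hdfs:///data/events-sorted")
        sc.stop()
      }
    }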

Re: Spark memory optimization

2014-07-04 Thread Surendranauth Hiraman
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure you are writing to multiple disks for best operation. And even with DISK_ONLY, we've found that there is a minimum threshold for executor RAM (spark.executor.memory), which for us seemed to be around 8 GB. If you find that…
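A sketch of the two knobs mentioned here, executor memory and spreading spill/shuffle files across disks, expressed as SparkConf settings. The 8g value just echoes the number above, the directory paths are hypothetical, and on YARN the node manager's local dirs take precedence over spark.local.dir:

    import org.apache.spark.{SparkConf, SparkContext}

    object ExecutorConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("executor-config-sketch")
          // Per-executor heap; the thread suggests ~8 GB as a practical floor
          // for this workload, but the right value is workload-specific.
          .set("spark.executor.memory", "8g")
          // Comma-separated scratch dirs: spread shuffle and DISK_ONLY blocks
          // across several physical disks so I/O is not a single-disk bottleneck.
          .set("spark.local.dir", "/disk1/spark,/disk2/spark,/disk3/spark")

        val sc = new SparkContext(conf)
        // ... build and run the job ...
        sc.stop()
      }
    }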

Re: Spark memory optimization

2014-07-04 Thread Mayur Rustagi
I would go with Spark only if you are certain that you are going to scale out in the near future. You can change the default storage of an RDD to DISK_ONLY; that might remove issues around any RDD leveraging memory. There are some functions, particularly sortByKey, that require data to fit in memory to work…
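As far as I know there is no single global switch that changes the default storage level of every RDD; in practice you approximate "default to disk" by calling persist(StorageLevel.DISK_ONLY) on the RDDs you want kept, which a small helper can make less noisy. A sketch under that assumption, with the helper name, paths, and parsing all invented for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    object DiskOnlyDefaultSketch {
      // Hypothetical helper: storage level is chosen per RDD via persist(),
      // so an extension method is one way to apply DISK_ONLY consistently.
      implicit class DiskOnlyOps[T](rdd: RDD[T]) {
        def onDisk(): RDD[T] = rdd.persist(StorageLevel.DISK_ONLY)
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("disk-only-default-sketch"))

        val counts = sc.textFile("hdfs:///data/events")   // hypothetical input
          .map(line => (line.split(",")(0), 1L))
          .onDisk()                                        // intermediate RDD kept on disk
          .reduceByKey(_ + _)
          .onDisk()

        // Per the caveat in this thread, sortByKey may still need each
        // partition's data to fit in memory while sorting, so keep partitions small.
        counts.sortByKey().saveAsTextFile("hdfs:///data/events-sorted")
        sc.stop()
      }
    }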