Using persist() is sort of a "hack" or a hint (depending on your
perspective :-)) to make the RDD use disk instead of memory. As I
mentioned, though, the disk I/O has consequences, mainly (I think) that
you need enough disks so that I/O doesn't become a bottleneck.
Increasing partitions, I think, is the other option.
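For concreteness, here is a minimal sketch of what I mean; it assumes an existing SparkContext `sc`, and the input path, key extraction, and partition count are all made up for illustration:

import org.apache.spark.storage.StorageLevel

// Illustrative only: the path and the key extraction are placeholders.
val pairs = sc.textFile("hdfs:///data/input")
  .map(line => (line.split("\t")(0), 1L))

// Hint the RDD onto disk rather than memory (plain cache() defaults to MEMORY_ONLY).
val onDisk = pairs.persist(StorageLevel.DISK_ONLY)

// More partitions means a smaller working set per task, at the cost of a shuffle.
val spread = onDisk.repartition(512)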
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling
cache/persist), so do I still need to use the DISK_ONLY storage level?
However, I do use reduceByKey and sortByKey. Mayur, you mentioned that
sortByKey requires the data to fit in memory. Is there any way to work
around this (maybe by increasing the number of partitions)?
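For reference, the relevant part of my job looks roughly like this (a simplified sketch; `records` and the field names are placeholders for my actual input):

// Simplified sketch; `records` and `r.key` stand in for my real input.
val counts = records
  .map(r => (r.key, 1L))
  .reduceByKey(_ + _)          // no cache()/persist() anywhere in the job

val ranked = counts.sortByKey() // this is the step I'm worried about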
When using DISK_ONLY, keep in mind that disk I/O is pretty heavy. Make sure
you are writing to multiple disks for best results. And even with
DISK_ONLY, we've found that there is a minimum threshold for executor RAM
(spark.executor.memory), which for us seemed to be around 8 GB.
If you find that your jobs still struggle even with DISK_ONLY, that memory
floor is the first thing I would check.
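As a rough sketch of the kind of setup that worked for us (the app name, paths, and sizes here are illustrative, not a recommendation for your cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; tune for your own hardware.
val conf = new SparkConf()
  .setAppName("disk-only-job")
  .set("spark.executor.memory", "8g") // roughly the floor we observed
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark") // spread spill/shuffle I/O across disks
val sc = new SparkContext(conf)

spark.local.dir takes a comma-separated list, which is how you get the spill and shuffle files spread across more than one disk.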
I would go with Spark only if you are certain that you are going to scale
out in the near future.
You can change the default storage level of the RDD to DISK_ONLY; that
might remove issues around any RDD leveraging memory. There are some
functions, sortByKey in particular, that require the data to fit in memory
to work.
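One thing that might help there (a rough sketch; `pairs` stands in for whatever keyed RDD you are sorting, and 1000 is an arbitrary number) is to pass a larger partition count so each partition only has to sort a smaller slice:

// `pairs` is a placeholder (key, value) RDD.
val counts = pairs.reduceByKey(_ + _, 1000)

// A larger numPartitions means each partition's in-memory sort handles
// a smaller slice of the data.
val sorted = counts.sortByKey(ascending = true, numPartitions = 1000)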