Hi,
I have a Spark-SQL Dataframe (reading from parquet) with some 20
columns. The data is divided into chunks of about 50 million rows each.
Among the columns is a "GROUP_ID", which is basically a string of 32
hexadecimal characters.
Following the guide [0], I thought to improve on performance.
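For context, a stripped-down sketch of the setup (the path, app name and the groupBy are placeholders, just to show the shape of the data):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-parquet"))
    val sqlContext = new SQLContext(sc)

    // The ~20-column DataFrame, read from parquet (path is a placeholder).
    val df = sqlContext.read.parquet("hdfs:///data/events.parquet")

    // GROUP_ID is a 32-character hex string; counting rows per group is
    // only an illustrative operation.
    df.groupBy("GROUP_ID").count().show(10)

    sc.stop()
  }
}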
Using persist() is sort of a "hack" or a hint (depending on your
perspective :-)) to make the RDD use disk rather than memory. As I mentioned,
though, disk I/O has its consequences: mainly (I think) you need enough disks
that I/O doesn't become the bottleneck.
Increasing partitions is, I think, the other option.
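Roughly what I mean, assuming spark-shell's sc (the path and the partition count are made up):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input")

// Hint Spark to keep this RDD on disk instead of in memory.
val onDisk = rdd.persist(StorageLevel.DISK_ONLY)

// Spread the data over more, smaller partitions (400 is just an example).
val repartitioned = onDisk.repartition(400)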
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling
cache/persist); do I still need to use the DISK_ONLY storage level?
However, I do use reduceByKey and sortByKey. Mayur, you mentioned that
sortByKey requires the data to fit in memory. Is there any way to work around
this (maybe by increasing the number of partitions)?
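For reference, would passing an explicit numPartitions to those shuffles be the way to go? Something like this, assuming spark-shell's sc (the input format, key extraction and partition counts are made up):

val pairs = sc.textFile("hdfs:///data/input")
  .map(line => (line.split("\t")(0), 1L))

// Both shuffles accept an explicit numPartitions argument, so each
// partition stays small enough to process.
val reduced = pairs.reduceByKey(_ + _, 400)
val sorted  = reduced.sortByKey(ascending = true, numPartitions = 400)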
When using DISK_ONLY, keep in mind that disk I/O gets pretty heavy. Make sure
you are writing to multiple disks for best performance. And even with
DISK_ONLY, we've found that there is a minimum threshold for executor RAM
(spark.executor.memory), which for us seemed to be around 8 GB.
If you find that you are still hitting memory problems, that setting is the
first thing to check.
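For example, we just set it on the SparkConf (or pass --executor-memory 8g to spark-submit); the 8g is only what worked for us, so tune it for your cluster:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("disk-only-job")
  .set("spark.executor.memory", "8g")   // roughly the floor we observed

val sc = new SparkContext(conf)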
I would go with Spark only if you are certain that you are going to scale
out in the near future.
You can change the default storage level of the RDD to DISK_ONLY; that might
remove issues around any RDD leveraging memory. There are some functions,
particularly sortByKey, that require the data to fit in memory to work.
Hi all!
I have a folder with 150 GB of txt files (around 700 files, about 200 MB each
on average).
I'm using Scala to process the files and calculate some aggregate statistics
at the end. I see two possible approaches to do that:
- manually loop through all the files, do the calculations per file and merge
the results at the end
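For reference, reading the whole folder as one RDD would look roughly like this (sc.textFile accepts a directory; the path and the tab-separated numeric column are made up):

import org.apache.spark.{SparkConf, SparkContext}

object AggregateStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("aggregate-stats"))

    // One RDD of lines over all ~700 files in the folder.
    val lines = sc.textFile("hdfs:///data/txt-folder")

    // Example statistic: parse a numeric first column and compute
    // count/mean/stdev in a single pass.
    val values = lines.map(_.split("\t")(0).toDouble)
    val stats = values.stats()

    println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")
    sc.stop()
  }
}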