Hello,
A bit of background:
I have a dataset of about 200 million records with around 10 columns. The dataset is roughly 1.5 TB in size and is split across about 600 files.
When I read this dataset through the SparkContext, it creates around 3,000 partitions by default if I do not specify the number of partitions in the textFile() call.
Now I see that even though my Spark application has around 400 executors assigned to it, the data is spread across only about 200 of them. I am using the .cache() method to hold the data in memory.
Each of these 200 executors, with a total of 6 GB of memory available, ends up holding multiple blocks and thus uses up its entire memory caching the data.
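
To make it concrete, the read side of the job looks roughly like this (the path, app name, and variable names below are just placeholders, not the real ones):

    // Rough sketch of the current setup; path and names are placeholders
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("cache-job")
    val sc   = new SparkContext(conf)

    // ~600 input files, ~1.5 TB total; no partition count passed in,
    // so textFile() ends up with ~3000 partitions on its own
    val records = sc.textFile("hdfs:///data/my_dataset/*")

    records.cache()   // keep the RDD in executor memory
    records.count()   // action that actually materializes the cache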

Even though I have about 400 machines, only about 200 of them are actually 
being used.
Now, my question is:

How do I partition my data so that all 400 executors hold some chunk of it, thereby better parallelizing my work?
So, instead of about 200 machines holding roughly 6 GB of data each, I would like about 400 machines holding roughly 3 GB each.

Any idea how I can go about achieving this?
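
For what it's worth, this is the kind of thing I was imagining, though I'm not sure it is the right approach (the number of partitions per executor below is just a guess on my part):

    // Sketch of what I'm considering: repartition with a shuffle so the
    // cached blocks spread over all 400 executors (the multiplier is a guess)
    val numExecutors = 400
    val spread = records.repartition(numExecutors * 3)  // a few partitions per executor

    spread.cache()
    spread.count()   // materialize the cache on the new partitioning

Is repartition() the right tool here, or is there a better way to spread the cached blocks?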
Thanks,
Vinay
