On Wed, Apr 23, 2014 at 12:06 AM, jaeholee <jho...@lbl.gov> wrote:
> How do you determine the number of partitions? For example, I have 16
> workers, and the number of cores and the worker memory set in spark-env.sh
> are:
>
> CORE = 8
> MEMORY = 16g
>
So you have the capacity to work on 16 * 8 = 128 tasks at a time. If you use
fewer than 128 partitions, some CPUs will go unused and the job will complete
more slowly than it could. But depending on what you are actually doing with
the data, you may want to use many more partitions than that. I find it fairly
difficult to estimate the memory use -- it is much easier to just try with
more partitions and see whether that is enough :). The overhead of using more
partitions than necessary is pretty small.

> The .csv data I have is about 500MB, but I am eventually going to use a file
> that is about 15GB.
>
> Is the MEMORY variable in spark-env.sh different from spark.executor.memory
> that you mentioned? If they're different, how do I set
> spark.executor.memory?

It is probably the same thing. Sorry for confusing you. To be sure, check the
Spark master web UI while the application is connected. You will see how much
RAM is used per worker on the main page.

You can set "spark.executor.memory" through
SparkConf.set("spark.executor.memory", "10g") when creating the SparkContext.
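Putting the two together, here is a rough Scala sketch of what I mean. The app
name, the file path, and the exact numbers are just placeholders -- adjust
them for your cluster:

    import org.apache.spark.{SparkConf, SparkContext}

    // Set the executor memory on the SparkConf before creating the context.
    val conf = new SparkConf()
      .setAppName("csv-job")                    // placeholder app name
      .set("spark.executor.memory", "10g")

    val sc = new SparkContext(conf)

    // Ask for at least this many partitions when reading the file. With
    // 16 workers * 8 cores you want at least 128; a few times that is
    // usually fine too, since the per-partition overhead is small.
    val lines = sc.textFile("hdfs://.../data.csv", 128)

Once the application is connected, you can confirm on the master web UI how
much memory each worker actually got.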