A more concrete example: I run the pi.py Spark Python example in *yarn-cluster* mode (set via --master) through SparkLauncher in Java.
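For reference, the launch code looks roughly like this (a minimal sketch; the Spark home and script paths below are placeholders, not my actual setup):

    import org.apache.spark.launcher.SparkLauncher;

    public class PiLauncher {
        public static void main(String[] args) throws Exception {
            // Launch the pi.py example on YARN in cluster mode.
            // Paths are placeholders for illustration only.
            Process spark = new SparkLauncher()
                .setSparkHome("/opt/spark")
                .setAppResource("/opt/spark/examples/src/main/python/pi.py")
                .setMaster("yarn-cluster")
                .launch();   // forks a spark-submit child process under the hood
            System.exit(spark.waitFor());
        }
    }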
While the program is running, these are the stats of how much virtual memory each process takes:

SparkSubmit process: 11.266 *gigabytes* of virtual memory
ApplicationMaster process: 2303480 *bytes* of virtual memory

Why does the SparkSubmit process take so much virtual memory in yarn-cluster mode? (This usually causes the YARN container to be killed with an out-of-memory exception.)

On Tue, Jul 14, 2015 at 9:39 AM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:

> Hi all,
>
> If you want to launch a Spark job from Java programmatically, you need to
> use SparkLauncher.
>
> SparkLauncher uses ProcessBuilder to create the new process, and Java
> seems to handle process creation in an inefficient way:
>
> "When you execute a process, you must first fork() and then exec().
> Forking creates a child process by duplicating the current process. Then,
> you call exec() to change the “process image” to a new “process image”,
> essentially executing different code within the child process.
> ...
> When we want to fork a new process, we have to copy the ENTIRE Java JVM...
> What we are really doing is requesting the same amount of memory the JVM
> has been allocated."
>
> Source: http://bryanmarty.com/2012/01/14/forking-jvm/
> That post also compares different solutions for launching new processes
> in Java.
>
> If our main program's JVM already uses a large amount of memory (say, 6 GB),
> then creating a new process through SparkLauncher requires 12 GB of
> (virtual) memory to be available, even though the child will never use it.
>
> It would be very helpful if someone could share his/her experience
> handling this memory inefficiency when creating new processes in Java.
>
> --
> Best regards,
> Elkhan Dadashov
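P.S. One possible mitigation I came across (an assumption on my side, not something I have verified with SparkLauncher): on Linux, recent JDKs let you choose how ProcessBuilder spawns the child via the jdk.lang.Process.launchMechanism system property (known values: FORK, VFORK, POSIX_SPAWN). With POSIX_SPAWN or VFORK the child does not duplicate the parent JVM's address space the way plain fork() does, so the transient 2x virtual-memory requirement should not occur. A minimal sketch:

    // Run with: java -Djdk.lang.Process.launchMechanism=POSIX_SPAWN SpawnDemo
    // (Linux-only JDK property; it is read once, when the JDK's process
    // implementation class is first used, so set it on the command line.)
    public class SpawnDemo {
        public static void main(String[] args) throws Exception {
            // Every ProcessBuilder (and hence SparkLauncher) launch in this
            // JVM now uses posix_spawn() instead of fork()+exec().
            Process p = new ProcessBuilder("/bin/true").start();
            System.out.println("child exited with " + p.waitFor());
        }
    }

If anyone has tried this with SparkLauncher, it would be good to know whether it helps.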