Hi Vinay, just out of curiosity: why are you converting your DataFrames into RDDs before the join? Joins work quite well directly on DataFrames.
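Something like the following usually works fine (just a rough sketch, not your actual code: the table names, the filter columns and the "key" join column are placeholders, and I'm assuming a Spark 1.x sqlContext such as the one spark-shell gives you):

    import org.apache.spark.sql.functions.col

    // Read the two tables and filter them as DataFrames
    // ("table_a"/"table_b" and the filter columns are placeholders).
    val dfA = sqlContext.table("table_a").filter(col("some_col") > 0)
    val dfB = sqlContext.table("table_b").filter(col("other_col") === "x")

    // Join directly on the common key column, no .rdd conversion needed.
    // Staying in the DataFrame API lets Catalyst plan the join and Tungsten
    // manage its memory, instead of falling back to plain RDD shuffles.
    val joined = dfA.join(dfB, "key")

    joined.write.parquet("/tmp/joined_output")  // placeholder sink

And if you really need RDD semantics afterwards, you can still call .rdd on the joined DataFrame instead of converting both sides before the join.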
As for your problem, it looks like you gave your executors more memory than you physically have. As an example of executor configuration:

> Cluster of 6 nodes, 16 cores/node, 64 GB RAM/node => gives: 17 executors,
> 19 GB/exec, 5 cores/exec
> No more than 5 cores per exec
> Leave some cores/RAM for the driver

More on the matter here:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications

(I have put the same sizing written out as code in a P.S. at the very end of this mail, after your quoted message.)

--
Bedrytski Aliaksandr
sp...@bedryt.ski

On Fri, Aug 12, 2016, at 01:41, Muttineni, Vinay wrote:
> Hello,
> I have a Spark job that basically reads data from two tables into two
> DataFrames, which are subsequently converted to RDDs. I then join them
> based on a common key.
> Each table is about 10 TB in size, but after filtering, the two RDDs
> are about 500 GB each.
> I have 800 executors with 8 GB memory per executor.
> Everything works fine until the join stage, but the join stage is
> throwing the error below.
> I tried increasing the partitions before the join stage, but it doesn't
> change anything.
> Any ideas how I can fix this and what I might be doing wrong?
> Thanks,
> Vinay
>
> ExecutorLostFailure (executor 208 exited caused by one of the running
> tasks) Reason: Container marked as failed:
> container_1469773002212_96618_01_000246 on host:. Exit status: 143.
> Diagnostics: Container
> [pid=31872,containerID=container_1469773002212_96618_01_000246] is
> running beyond physical memory limits. Current usage: 15.2 GB of 15.1 GB
> physical memory used; 15.9 GB of 31.8 GB virtual memory used.
> Killing container.
> Dump of the process-tree for container_1469773002212_96618_01_000246:
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>   SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 31883 31872 31872 31872 (java) 519517 41888 17040175104 3987193
>   /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p
>   -Xms14336m -Xmx14336m
>   -Djava.io.tmpdir=/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/tmp
>   -Dspark.driver.port=32988 -Dspark.ui.port=0 -Dspark.akka.frameSize=256
>   -Dspark.yarn.app.container.log.dir=/hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246
>   -XX:MaxPermSize=256m
>   org.apache.spark.executor.CoarseGrainedExecutorBackend
>   --driver-url spark://CoarseGrainedScheduler@10.12.7.4:32988
>   --executor-id 208 --hostname x.com --cores 11
>   --app-id application_1469773002212_96618
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/__app__.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/mysql-connector-java-5.0.8-bin.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-core-3.2.10.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-api-jdo-3.2.6.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-rdbms-3.2.9.jar
> |- 31872 16580 31872 31872 (bash) 0 0 9146368 267 /bin/bash -c
>   LD_LIBRARY_PATH=/apache/hadoop/lib/native:/apache/hadoop/lib/native/Linux-amd64-64:
>   /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
>   -Xms14336m -Xmx14336m
>   -Djava.io.tmpdir=/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/tmp
>   '-Dspark.driver.port=32988' '-Dspark.ui.port=0' '-Dspark.akka.frameSize=256'
>   -Dspark.yarn.app.container.log.dir=/hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246
>   -XX:MaxPermSize=256m
>   org.apache.spark.executor.CoarseGrainedExecutorBackend
>   --driver-url spark://CoarseGrainedScheduler@1.4.1.6:32988
>   --executor-id 208 --hostname x.com --cores 11
>   --app-id application_1469773002212_96618
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/__app__.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/mysql-connector-java-5.0.8-bin.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-core-3.2.10.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-api-jdo-3.2.6.jar
>   --user-class-path file:/hadoop/11/scratch/local/usercache/appcache/application_1469773002212_96618/container_1469773002212_96618_01_000246/datanucleus-rdbms-3.2.9.jar
>   1> /hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246/stdout
>   2> /hadoop/12/scratch/logs/application_1469773002212_96618/container_1469773002212_96618_01_000246/stderr
>
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
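P.S. To make the sizing example above a bit more concrete, here is roughly how those numbers would be written down in code. This is only a sketch for that hypothetical 6-node cluster, not a recommendation for your 800-executor setup, and the same values can just as well be passed to spark-submit as --num-executors / --executor-cores / --executor-memory:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical 6 x (16 cores, 64 GB RAM) cluster:
    // 3 executors per node, 5 cores each, ~19 GB heap each, leaving one core
    // and some RAM per node for the OS/daemons and one executor slot for the
    // YARN ApplicationMaster (hence 17 executors instead of 18).
    val conf = new SparkConf()
      .setAppName("join-job") // placeholder name
      .set("spark.executor.instances", "17")
      .set("spark.executor.cores", "5")
      .set("spark.executor.memory", "19g")
      // YARN kills the container when heap + off-heap overhead exceed its
      // limit (that is the "running beyond physical memory limits" error you
      // are seeing), so leave explicit headroom for the overhead (value in MB).
      .set("spark.yarn.executor.memoryOverhead", "2048")

    val sc = new SparkContext(conf)

The point is simply that the executor memory plus the overhead has to fit into what YARN can actually give you on each node, with some room to spare.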