I am trying to read an XML file which is 1 GB in size. I am getting the error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit' after reading 7 partitions in local mode. In YARN mode it throws 'java.lang.OutOfMemoryError: Java heap space' after reading 3 partitions.
Any suggestions?

PySpark shell command:

```
pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
```

DataFrame creation command:

```python
df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
```
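For reference, the failing stage in the trace below is an `RDD.treeAggregate`, which looks like spark-xml's schema-inference pass over the whole file. spark-xml's reader accepts a user-supplied schema, which should skip that pass, though if a single record really is too large, the same buffering would presumably hit the same limit during the actual read. A rough sketch of what I mean (the field names are placeholders, not the real `<GGL>` structure):

```python
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder fields -- replace with the real structure of a <GGL> element.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("value", StringType(), True),
])

# With an explicit schema, spark-xml should not need to scan the whole
# file to infer one before the actual read.
df = sqlContext.read.format('com.databricks.spark.xml') \
    .schema(schema) \
    .options(rowTag='GGL') \
    .load('GGL_1.2G.xml')
```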
Log output from the read, up to the failure:

```
16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
    at java.io.DataOutputStream.write(DataOutputStream.java:88)
    at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
    at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
    at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
    at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
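Reading the trace: `XmlRecordReader.readUntilMatch` appears to copy bytes into a `ByteArrayOutputStream` until it sees the closing row tag, and 'Requested array size exceeds VM limit' means that buffer tried to grow past the maximum Java array size (roughly 2 GB). That would be consistent with either a single enormous `<GGL>` record or a closing `</GGL>` tag that never matches. One way to check the largest `<GGL>` record outside Spark is a streaming scan in plain Python; this is only a sketch, and it assumes a local copy of the file with `<GGL>` elements sitting directly under the root:

```python
import xml.etree.ElementTree as ET

largest = 0
context = ET.iterparse('GGL_1.2G.xml', events=('start', 'end'))
_, root = next(context)  # the first event is the start of the root element

for event, elem in context:
    if event == 'end' and elem.tag.endswith('GGL'):
        # Serialize the finished record to measure its size in bytes.
        largest = max(largest, len(ET.tostring(elem)))
        # Detach finished records so memory stays flat; this assumes
        # <GGL> elements are direct children of the root.
        root.clear()

print('largest <GGL> record: %d bytes' % largest)
```

If that number is anywhere near 2^31 bytes, no amount of executor memory would help and the records themselves would need to be restructured; if it is small, the mismatch is more likely in how the row tag is being matched.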