Hi Arun,
I have a few questions. Does your XML file contain a few huge documents? If a single row is very large (say, 500MB), it would consume a lot of memory, because if I remember correctly the reader has to hold at least one whole row in memory while iterating. I remember this happening to me before while processing a huge record for testing.

How about trying to increase --executor-memory? Also, could you try selecting only a few fields to prune the data with the latest version, just to make doubly sure, if you don't mind? I put rough sketches of both below.

Lastly, would you mind opening an issue at https://github.com/databricks/spark-xml/issues if you still face this problem? I will try my best to take a look.
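For the memory option, a minimal sketch of the shell invocation. The 4G/8G values are only placeholders; tune them to your cluster. Note that --executor-memory takes effect on YARN but not in local mode, where executors run inside the driver JVM, so for local mode you would raise --driver-memory instead:

    pyspark --master yarn --executor-memory 8G --driver-memory 4G \
        --jars /tmp/spark-xml_2.10-0.3.3.jar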
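For the pruning option, one way is to pass an explicit schema, which should also skip schema inference (the treeAggregate pass visible in your stack trace). A minimal sketch, assuming hypothetical field names 'id' and 'name' under your 'GGL' row tag; substitute the fields you actually need:

    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical fields; replace with your real ones.
    schema = StructType([
        StructField('id', StringType(), True),
        StructField('name', StringType(), True)])

    # Supplying the schema up front avoids the inference pass over the
    # whole file, though each row still has to be buffered while parsing.
    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='GGL') \
        .schema(schema) \
        .load('GGL_1.2G.xml')
    df.select('id', 'name').show()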
Thank you.

2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:

> I am trying to read an XML file which is 1GB in size. I am getting the
> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
> after reading 7 partitions in local mode. In YARN mode, it throws
> 'java.lang.OutOfMemoryError: Java heap space' after reading 3 partitions.
>
> Any suggestions?
>
> PySpark shell command:
>
>     pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
>
> DataFrame creation command:
>
>     df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>
> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>         at java.io.DataOutputStream.write(DataOutputStream.java:88)
>         at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>         at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>         at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit