Hi Arun,

I have a few questions.

Does your XML file contain a few huge documents? If a single row is very
large (say, 500MB), it would consume a lot of memory, because, if I
remember correctly, the reader has to hold at least one whole row in
memory while iterating. I remember this happening to me before while
processing a huge record for testing purposes.
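
If it helps, here is a rough way to check the biggest row size outside
Spark (just a sketch: it assumes each row starts with '<GGL' and ends
with '</GGL>' in the raw text, so please adjust it to your actual tags
and layout):

    # Stream the file and track the byte length of each <GGL>...</GGL> block.
    max_len = 0
    cur = None
    with open('GGL_1.2G.xml', 'rb') as f:
        for line in f:
            if b'<GGL' in line:
                cur = 0  # a new row starts on this line
            if cur is not None:
                cur += len(line)
                if b'</GGL>' in line:
                    max_len = max(max_len, cur)
                    cur = None
    print('largest row is about %d bytes' % max_len)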


How about trying to increase --executor-memory?
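
For example, something like this on YARN (8G is just an arbitrary
starting point, not a tuned recommendation):

    pyspark --master yarn --executor-memory 8G --driver-memory 8G \
        --jars /tmp/spark-xml_2.10-0.3.3.jar

In local mode everything runs inside the driver JVM, so --driver-memory
is the setting that matters there.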


Also, if you don't mind, could you try selecting only a few fields with
the latest version, so that the data is pruned, just to be doubly sure?
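
For example (the field names below are placeholders; please substitute
fields that actually exist under your 'GGL' rows):

    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='GGL') \
        .load('GGL_1.2G.xml') \
        .select('fieldA', 'fieldB')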


Lastly, if you still face this problem, would you mind opening an issue
at https://github.com/databricks/spark-xml/issues?

I will do my best to take a look.


Thank you.


2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:

> I am trying to read an XML file which is 1GB in size.  I am getting an
> error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
> after reading 7 partitions in local mode.  In YARN mode, it
> throws a 'java.lang.OutOfMemoryError: Java heap space' error after reading
> 3 partitions.
>
> Any suggestion?
>
> PySpark Shell Command:    pyspark --master local[4] --driver-memory 3G
> --jars /tmp/spark-xml_2.10-0.3.3.jar
>
>
>
> Dataframe Creation Command:   df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>
>
>
> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>         at java.util.Arrays.copyOf(Arrays.java:2271)
>         at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>         at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>         at java.io.DataOutputStream.write(DataOutputStream.java:88)
>         at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>         at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>         at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>         at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>         at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit