It seems a bit weird. Could we open an issue and discuss it at the repository link I sent?
Let me try to reproduce your case with your data if possible.

On 17 Nov 2016 2:26 a.m., "Arun Patel" <arunp.bigd...@gmail.com> wrote:

> I tried the options below.
>
> 1) Increased executor memory, up to the maximum possible 14GB. Same error.
> 2) Tried the new version - spark-xml_2.10:0.4.1. Same error.
> 3) Tried lower-level rowTags. It worked for a lower-level rowTag and
> returned 16000 rows.
>
> Are there any workarounds for this issue? I tried playing with
> spark.memory.fraction and spark.memory.storageFraction, but it did not
> help. Appreciate your help on this!
>
> On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel <arunp.bigd...@gmail.com> wrote:
>
>> Thanks for the quick response.
>>
>> It's a single XML file and I am using a top-level rowTag, so it creates
>> only one row in a DataFrame with 5 columns. One of these columns will
>> contain most of the data as a StructType. Is there a limitation on storing
>> data in a single cell of a DataFrame?
>>
>> I will check with the new version, try different rowTags, and increase
>> executor memory tomorrow. I will open a new issue as well.
>>
>> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>
>>> Hi Arun,
>>>
>>> I have a few questions.
>>>
>>> Does your XML file have a few huge documents? In the case of a row with
>>> a huge size (like 500MB), it would consume a lot of memory, because, if
>>> I remember correctly, it at least has to hold a whole row in order to
>>> iterate. I remember this happened to me before while processing a huge
>>> record for test purposes.
>>>
>>> How about trying to increase --executor-memory?
>>>
>>> Also, could you try selecting only a few fields to prune the data with
>>> the latest version, just to be doubly sure, if you don't mind?
>>>
>>> Lastly, do you mind opening an issue at
>>> https://github.com/databricks/spark-xml/issues if you still face this
>>> problem? I will do my best to take a look.
>>>
>>> Thank you.
>>>
>>> 2016-11-16 9:12 GMT+09:00 Arun Patel <arunp.bigd...@gmail.com>:
>>>
>>>> I am trying to read an XML file which is 1GB in size. I am getting a
>>>> 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit'
>>>> error after reading 7 partitions in local mode. In YARN mode, it throws
>>>> a 'java.lang.OutOfMemoryError: Java heap space' error after reading 3
>>>> partitions.
>>>>
>>>> Any suggestions?
>>>>
>>>> PySpark shell command:
>>>> pyspark --master local[4] --driver-memory 3G --jars /tmp/spark-xml_2.10-0.3.3.jar
>>>>
>>>> DataFrame creation command:
>>>> df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGL').load('GGL_1.2G.xml')
>>>>
>>>> 16/11/15 18:27:04 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 25978 ms on localhost (1/10)
>>>> 16/11/15 18:27:04 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:268435456+134217728
>>>> 16/11/15 18:27:55 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 2309 bytes result sent to driver
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,ANY, 2266 bytes)
>>>> 16/11/15 18:27:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
>>>> 16/11/15 18:27:55 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 51001 ms on localhost (2/10)
>>>> 16/11/15 18:27:55 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:402653184+134217728
>>>> 16/11/15 18:28:19 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2309 bytes result sent to driver
>>>> 16/11/15 18:28:19 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,ANY, 2266 bytes)
>>>> 16/11/15 18:28:19 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
>>>> 16/11/15 18:28:19 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 24336 ms on localhost (3/10)
>>>> 16/11/15 18:28:19 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:536870912+134217728
>>>> 16/11/15 18:28:40 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2309 bytes result sent to driver
>>>> 16/11/15 18:28:40 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,ANY, 2266 bytes)
>>>> 16/11/15 18:28:40 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
>>>> 16/11/15 18:28:40 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 20895 ms on localhost (4/10)
>>>> 16/11/15 18:28:40 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:671088640+134217728
>>>> 16/11/15 18:29:01 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 2309 bytes result sent to driver
>>>> 16/11/15 18:29:01 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,ANY, 2266 bytes)
>>>> 16/11/15 18:29:01 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
>>>> 16/11/15 18:29:01 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 20793 ms on localhost (5/10)
>>>> 16/11/15 18:29:01 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:805306368+134217728
>>>> 16/11/15 18:29:22 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 2309 bytes result sent to driver
>>>> 16/11/15 18:29:22 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,ANY, 2266 bytes)
>>>> 16/11/15 18:29:22 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
>>>> 16/11/15 18:29:22 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 21306 ms on localhost (6/10)
>>>> 16/11/15 18:29:22 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:939524096+134217728
>>>> 16/11/15 18:29:43 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 2309 bytes result sent to driver
>>>> 16/11/15 18:29:43 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, localhost, partition 8,ANY, 2266 bytes)
>>>> 16/11/15 18:29:43 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
>>>> 16/11/15 18:29:43 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 21130 ms on localhost (7/10)
>>>> 16/11/15 18:29:43 INFO NewHadoopRDD: Input split: hdfs://singlenodevm:8020/user/arunp/GGL_1.2G.xml:1073741824+134217728
>>>> 16/11/15 18:29:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
>>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>>>>     at java.util.Arrays.copyOf(Arrays.java:2271)
>>>>     at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
>>>>     at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>>>>     at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122)
>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>>     at com.databricks.spark.xml.XmlRecordReader.readUntilMatch(XmlInputFormat.scala:188)
>>>>     at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:156)
>>>>     at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:141)
>>>>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:168)
>>>>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>     at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>>     at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>>     at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>>     at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1142)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$25.apply(RDD.scala:1143)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$22.apply(RDD.scala:717)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>> 16/11/15 18:29:48 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
>>>> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
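
For reference, below is a minimal PySpark sketch of the workarounds discussed in this thread: launching with more executor memory, reading with a lower-level rowTag so that each row stays small, and pruning to a few fields right after the load. It assumes the pyspark shell (where sqlContext is predefined); the tag 'GGLRecord' and the columns 'id' and 'name' are hypothetical placeholders rather than names from the actual file, and the memory values are only starting points.

    # Launch with the spark-xml jar and more memory (example values only):
    #   pyspark --master yarn --executor-memory 8G --driver-memory 4G \
    #           --jars /tmp/spark-xml_2.10-0.4.1.jar

    # Read with a lower-level, repeating rowTag so each row maps to a small
    # element instead of one huge top-level document. 'GGLRecord' is a
    # hypothetical child tag; substitute a repeating element from the real file.
    df = sqlContext.read.format('com.databricks.spark.xml') \
        .options(rowTag='GGLRecord') \
        .load('GGL_1.2G.xml')

    # Prune early to only the fields that are actually needed.
    # 'id' and 'name' are placeholder column names.
    small_df = df.select('id', 'name')
    small_df.show(5)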