Thanks, Ryan. I am still running into this now-rarer issue. For now I have moved away from Parquet, but I will create a bug in JIRA if I can produce code that easily reproduces it.
Thanks,
Aniket

On Mon, Nov 21, 2016, at 3:24 PM, Ryan Blue wrote:

> Aniket,
>
> The solution was to add a sort so that only one file is written at a time,
> which minimizes the memory footprint of columnar formats like Parquet.
> That's been released for quite a while, so memory issues caused by Parquet
> are rarer now. If you're using the Parquet default settings and a recent
> Spark version, you should be fine.
>
> rb
>
> --
> Ryan Blue
> Software Engineer
> Netflix

On Sun, Nov 20, 2016 at 3:35 AM, Aniket wrote:

> Was anyone able to find a solution or a recommended configuration for this?
> I am running into the same "java.lang.OutOfMemoryError: Direct buffer
> memory", but during Snappy compression.
>
> Thanks,
> Aniket

On Tue, Sep 23, 2014 at 7:04 PM, Aaron Davidson wrote:

> This may be related: https://github.com/Parquet/parquet-mr/issues/211
>
> Perhaps if we change our configuration settings for Parquet it would get
> better, but the performance characteristics of Snappy are pretty bad here
> under some circumstances.

On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger wrote:

> Cool, that's pretty much what I was thinking as far as configuration goes.
>
> Running on Mesos. Worker nodes are Amazon xlarge, so 4 cores / 15 GB. I've
> tried executor memory sizes as high as 6 GB. Default HDFS block size is
> 64 MB, with about 25 GB of total data written by a job with 128 partitions.
> The exception comes when trying to read the data (all columns).
>
> The schema looks like this:
>
>   case class A(
>     a: Long,
>     b: Long,
>     c: Byte,
>     d: Option[Long],
>     e: Option[Long],
>     f: Option[Long],
>     g: Option[Long],
>     h: Option[Int],
>     i: Long,
>     j: Option[Int],
>     k: Seq[Int],
>     l: Seq[Int],
>     m: Seq[Int]
>   )
>
> We're just going back to gzip for now, but it might be nice to help someone
> else avoid running into this.

On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust wrote:

> I actually submitted a patch to do this yesterday:
> https://github.com/apache/spark/pull/2493
>
> Can you tell us more about your configuration? In particular, how much
> memory and how many cores do the executors have, and what does the schema
> of your data look like?

On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger wrote:

> So as a related question, is there any reason the settings in SQLConf
> aren't read from the Spark context's conf? I understand why the SQL conf
> is mutable, but it's not particularly user-friendly to have most Spark
> configuration set via e.g. spark-defaults.conf or --properties-file, yet
> have Spark SQL ignore those settings.
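Cody's interim workaround (going back to gzip) and his SQLConf question both come down to where the codec setting lives. A minimal sketch against the 1.x-era SQLContext API, with a made-up app name; whether the spark-defaults.conf entry is honoured by Spark SQL is exactly what the patch in PR 2493 addresses, so older versions may only respect the programmatic form:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Sketch only: the app name is hypothetical.
    val conf = new SparkConf().setAppName("parquet-codec-example")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Programmatic override of the Parquet codec; valid values include
    // "gzip", "snappy", and "uncompressed".
    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    // The equivalent spark-defaults.conf line, once SQLConf settings are
    // read from the Spark conf:
    //   spark.sql.parquet.compression.codec  gzip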
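Ryan's point at the top of the thread is that the writer now sorts so that each task writes only one file at a time. A minimal sketch of applying the same idea by hand with the Spark 2.x DataFrame writer; the input path, output path, and the "day" partition column are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sorted-parquet-write").getOrCreate()

    // Hypothetical input; any DataFrame with a partition column works the same way.
    val events = spark.read.json("/data/events-json")

    // Sorting by the partition column means a task finishes one output file
    // before opening the next, so only one set of Parquet column buffers is
    // held in memory at a time.
    events.sort("day")
      .write
      .partitionBy("day")
      .parquet("/data/events-parquet")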
On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger wrote:

> After commit 8856c3d8 switched from gzip to snappy as the default Parquet
> compression codec, I'm seeing the following when trying to read Parquet
> files saved using the new default (same schema and roughly the same size
> as files that were previously working):
>
>   java.lang.OutOfMemoryError: Direct buffer memory
>       java.nio.Bits.reserveMemory(Bits.java:658)
>       java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>       java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>       parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
>       parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
>       java.io.DataInputStream.readFully(DataInputStream.java:195)
>       java.io.DataInputStream.readFully(DataInputStream.java:169)
>       parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
>       parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
>       parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
>       parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
>       parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:339)
>       parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
>       parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
>       parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:265)
>       parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
>       parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>       parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>       parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>       parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>       org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
>       org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>       scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>       scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
>       scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
>       org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
>       org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
>       org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>       org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>       org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>       org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>       org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>       org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>       org.apache.spark.scheduler.Task.run(Task.scala:54)
>       org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       java.lang.Thread.run(Thread.java:722)
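The "Direct buffer memory" failure above is java.nio.Bits.reserveMemory hitting the JVM's direct-memory limit while SnappyDecompressor allocates direct buffers. The thread settles on changing the codec; one stopgap nobody above suggests, offered here only as a labelled assumption, is to raise that limit on the executors. A sketch, with an illustrative 512m value:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: the 512m figure is illustrative, not a recommendation from
    // the thread. -XX:MaxDirectMemorySize bounds java.nio direct allocations.
    val conf = new SparkConf()
      .setAppName("parquet-read")
      .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")
    val sc = new SparkContext(conf)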