Hi Don,

It’s not so much map() vs flatMap(). You can return a collection and have Spark flatten the result.
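As a rough illustration of "return a collection and have Spark flatten the result": the function you hand to flatMap() or mapPartitions() can return an Iterator instead of a fully built Seq, so each element is only allocated when it is pulled. The sketch below is plain Scala so it runs without a Spark session; ImageRecord, Tile, and the 1024-byte payload are hypothetical stand-ins, not types from your job.

```scala
// Hypothetical record types, used only for illustration.
final case class ImageRecord(id: Int, tiles: Int)
final case class Tile(imageId: Int, index: Int, pixels: Array[Byte])

object LazyExpand {
  // Iterator.tabulate is lazy: each Tile is allocated only when next() is
  // called, so no Seq holding every Tile is built up front.
  def expand(rec: ImageRecord): Iterator[Tile] =
    Iterator.tabulate(rec.tiles)(i => Tile(rec.id, i, new Array[Byte](1024)))

  // Same shape as the function Dataset.mapPartitions takes:
  // Iterator[T] => Iterator[U].
  def expandPartition(records: Iterator[ImageRecord]): Iterator[Tile] =
    records.flatMap(expand)

  def main(args: Array[String]): Unit = {
    val out = expandPartition(Iterator(ImageRecord(1, 3), ImageRecord(2, 2)))
    println(out.map(t => s"${t.imageId}:${t.index}").mkString(","))
  }
}
```

In a Spark job this would presumably be wired in as ds.mapPartitions(LazyExpand.expandPartition) with an Encoder for Tile in scope. Whether it actually lowers peak memory depends on what consumes the iterator downstream, so treat it as a starting point rather than a guaranteed fix.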
My point was more to change from Seq[BigDataStructure] to Seq[SmallDataStructure]. If the use case is really storing image data, I would try to use Seq[Vector] and store the values as a sparse array to reduce the overall size of the object.

Secondly, Databricks released an open source image processing utility library in Deep Learning Pipelines, specifically for reading in images and loading them as arrays in DataFrames or Datasets efficiently.

https://github.com/databricks/spark-deep-learning/blob/f088de45daec06865ac02a9ec1323eb2c9eebb89/src/main/scala/com/databricks/sparkdl/ImageUtils.scala

You can potentially reuse this code.

Richard Garris
Principal Architect
Databricks, Inc
650.200.0840
rlgar...@databricks.com

On December 17, 2017 at 3:12:41 PM, Don Drake (dondr...@gmail.com) wrote:

Hey Richard,

Good to hear from you as well. I thought I would ask if there was something Scala specific I was missing in handling these large classes.

I can tweak my job to do a map() and then only one large object will be created at a time and returned, which should allow me to lower my executor memory size.

Thanks.

-Don

On Thu, Dec 14, 2017 at 2:58 PM, Richard Garris <rlgar...@databricks.com> wrote:

> Hi Don,
>
> Good to hear from you. I think the problem is that regardless of whether
> you use yield or a generator - Spark internally will produce the entire
> result as a single large JVM object which will blow up your heap space.
>
> Would it be possible to shrink the overall size of the image object
> storing it as a vector or Array vs a large Java class object?
>
> That might be the more prudent approach.
>
> -RG
>
> Richard Garris
> Principal Architect
> Databricks, Inc
> 650.200.0840
> rlgar...@databricks.com
>
> On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
> wrote:
>
> This sounds like something mapPartitions should be able to do, not
> sure if there's an easier way.
>
> On Thu, Dec 14, 2017 at 10:20 AM, Don Drake <dondr...@gmail.com> wrote:
> > I'm looking for some advice when I have a flatMap on a Dataset that is
> > creating and returning a sequence of a new case class
> > (Seq[BigDataStructure]) that contains a very large amount of data, much
> > larger than the single input record (think images).
> >
> > In python, you can use generators (yield) to bypass creating a large
> > list of structures and returning the list.
> >
> > I'm programming this in Scala and was wondering if there are any
> > similar tricks to optimally return a list of classes?? I found the
> > for/yield semantics, but it appears the compiler is just creating a
> > sequence for you and this will blow through my Heap given the number
> > of elements in the list and the size of each element.
> >
> > Is there anything else I can use?
> >
> > Thanks.
> >
> > -Don
> >
> > --
> > Donald Drake
> > Drake Consulting
> > http://www.drakeconsulting.com/
> > https://twitter.com/dondrake
> > 800-733-2143
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
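
For reference, the sparse-storage suggestion made earlier in this thread could look roughly like the sketch below. It is plain Scala so it runs without Spark; SparsePixels is a hypothetical container that only mirrors the (size, indices, values) layout of a sparse vector, and the approach only pays off when most pixel values are zero, which is an assumption about the data rather than something stated in the thread.

```scala
// Hypothetical sparse container: total length plus the positions and
// values of the non-zero entries only.
final case class SparsePixels(size: Int, indices: Array[Int], values: Array[Double])

object SparseEncode {
  // Keep only the non-zero entries of a dense pixel array.
  def toSparse(dense: Array[Double]): SparsePixels = {
    val nonZero = dense.indices.filter(i => dense(i) != 0.0).toArray
    SparsePixels(dense.length, nonZero, nonZero.map(i => dense(i)))
  }

  // Rebuild the dense array, e.g. right before the pixels are needed.
  def toDense(sp: SparsePixels): Array[Double] = {
    val out = new Array[Double](sp.size) // zero-filled by default
    var k = 0
    while (k < sp.indices.length) {
      out(sp.indices(k)) = sp.values(k)
      k += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val dense = Array(0.0, 0.0, 7.0, 0.0, 3.5, 0.0)
    val sp = toSparse(dense)
    println(s"stored ${sp.values.length} of ${sp.size} values")
  }
}
```

In an actual job, Spark's own ml.linalg sparse vectors would likely replace the hand-rolled case class; the round trip through toDense is only there to show that no information is lost.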