Hi Don,

Good to hear from you. I think the problem is that, regardless of whether you use yield or a generator, Spark internally will produce the entire result as a single large JVM object, which will blow up your heap space.
Would it be possible to shrink the overall size of the image object by storing it as a vector or Array rather than as a large Java class object? That might be the more prudent approach.

-RG

Richard Garris
Principal Architect
Databricks, Inc
650.200.0840
rlgar...@databricks.com

On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com) wrote:

This sounds like something mapPartitions should be able to do, not sure if there's an easier way.

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake <dondr...@gmail.com> wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In Python, you can use generators (yield) to avoid creating a large list of
> structures and returning the list.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes? I found the for/yield
> semantics, but it appears the compiler is just creating a sequence for you,
> and this will blow through my heap given the number of elements in the list
> and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143

--
Marcelo
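As a rough illustration of the mapPartitions suggestion and of the generator question in the thread, the sketch below returns a Scala Iterator instead of a Seq, so each element is built only when it is pulled. It is plain Scala with no Spark dependency; InputRecord, the fields of BigDataStructure, the 1 KB payload, and the built counter are all hypothetical stand-ins, not taken from the original code. Whether this actually relieves heap pressure in a given job is an assumption that depends on what the downstream stages do with the rows.

```scala
// Hypothetical stand-ins for the types mentioned in the thread.
final case class InputRecord(id: Int, tiles: Int)
final case class BigDataStructure(recordId: Int, tileIndex: Int, payload: Array[Byte])

object LazyExpansion {
  // Counts constructed elements, only to make the laziness observable.
  var built = 0

  // Returns an Iterator rather than a Seq: each element is constructed
  // only when the consumer calls next().
  def expand(rec: InputRecord): Iterator[BigDataStructure] =
    Iterator.range(0, rec.tiles).map { i =>
      built += 1
      BigDataStructure(rec.id, i, new Array[Byte](1024)) // placeholder payload
    }

  // Iterator in, Iterator out, with no intermediate collection. This is the
  // shape of function that Dataset.mapPartitions expects.
  def expandPartition(records: Iterator[InputRecord]): Iterator[BigDataStructure] =
    records.flatMap(expand)

  def main(args: Array[String]): Unit = {
    val out = expandPartition(Iterator(InputRecord(1, 3), InputRecord(2, 2)))
    println(s"built before consuming: $built")
    out.next()
    println(s"built after one next(): $built")
    println(s"remaining elements: ${out.size}")
  }
}
```

In a Spark job, something like ds.mapPartitions(LazyExpansion.expandPartition) or ds.flatMap(rec => LazyExpansion.expand(rec)) would be the corresponding call, assuming an Encoder for BigDataStructure is in scope. A for/yield written over an Iterator (rather than over a Seq or Range) also desugars to map/flatMap on that Iterator and stays lazy.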