Hi Don,

It’s not so much map() vs flatMap(). You can return a collection and have
Spark flatten the result.
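
As a rough sketch of what "return a collection" can look like, the function passed to flatMap does not have to build a full Seq. In the Spark 2.x API, Dataset.flatMap accepts a function returning a TraversableOnce, which includes Iterator. The plain-Scala example below uses no Spark at all and only shows the laziness of the iterator itself. Tile, LazyTiles, and tilesFor are hypothetical names made up for illustration, and whether peak memory actually drops in a real job depends on what Spark does downstream.

```scala
// Hypothetical stand-in for the large per-element structure in this thread.
final case class Tile(id: Int, pixels: Array[Byte])

object LazyTiles {
  // Counts how many tiles have been built, to make the laziness observable.
  var built = 0

  // Returns an Iterator instead of a Seq: each Tile is created only when
  // the consumer asks for it, so this function never holds the full list.
  def tilesFor(record: Int, tilesPerRecord: Int): Iterator[Tile] =
    Iterator.range(0, tilesPerRecord).map { i =>
      built += 1
      Tile(record * tilesPerRecord + i, new Array[Byte](1024))
    }
}
```

Calling Iterator(0, 1).flatMap(r => LazyTiles.tilesFor(r, 3)) builds nothing until the result is consumed, and then builds one Tile per call to next().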

My point was more about changing from Seq[BigDataStructure] to
Seq[SmallDataStructure].

If the use case is really storing image data - I would try to use
Seq[Vector] and store the values as a sparse array to reduce the overall
size of the object.
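
To illustrate the sparse idea, here is a minimal plain-Scala sketch that keeps only the non-zero pixels of a flattened image. SparsePixels, fromDense, and toDense are hypothetical names for illustration, not part of any library. Spark ML's Vectors.sparse(size, indices, values) uses the same size/indices/values layout. This only saves space when most values are zero, which is an assumption about the images.

```scala
// Hypothetical container: only the non-zero pixels of a flattened image are kept.
final case class SparsePixels(size: Int, indices: Array[Int], values: Array[Double])

object SparsePixels {
  // Build the sparse form from a dense, flattened pixel array.
  def fromDense(dense: Array[Double]): SparsePixels = {
    val nonZero = dense.indices.filter(i => dense(i) != 0.0).toArray
    SparsePixels(dense.length, nonZero, nonZero.map(i => dense(i)))
  }

  // Rebuild the dense array when the full image is needed again.
  def toDense(sparse: SparsePixels): Array[Double] = {
    val out = new Array[Double](sparse.size) // zero-initialized by the JVM
    var k = 0
    while (k < sparse.indices.length) {
      out(sparse.indices(k)) = sparse.values(k)
      k += 1
    }
    out
  }
}
```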

Second, Databricks released an open source image processing utility
library as part of Deep Learning Pipelines, specifically for reading in
images and loading them efficiently as arrays in DataFrames or Datasets.

https://github.com/databricks/spark-deep-learning/blob/f088de45daec06865ac02a9ec1323eb2c9eebb89/src/main/scala/com/databricks/sparkdl/ImageUtils.scala

You can potentially reuse this code.

Richard Garris

Principal Architect

Databricks, Inc

650.200.0840

rlgar...@databricks.com

On December 17, 2017 at 3:12:41 PM, Don Drake (dondr...@gmail.com) wrote:

Hey Richard,

Good to hear from you as well. I thought I would ask if there was
something Scala-specific I was missing in handling these large classes.

I can tweak my job to do a map() and then only one large object will be
created at a time and returned, which should allow me to lower my executor
memory size.

Thanks.

-Don


On Thu, Dec 14, 2017 at 2:58 PM, Richard Garris <rlgar...@databricks.com>
wrote:

> Hi Don,
>
> Good to hear from you. I think the problem is that regardless of whether
> you use yield or a generator - Spark internally will produce the entire
> result as a single large JVM object which will blow up your heap space.
>
> Would it be possible to shrink the overall size of the image object by
> storing it as a Vector or Array instead of a large Java class object?
>
> That might be the more prudent approach.
>
> -RG
>
> Richard Garris
>
> Principal Architect
>
> Databricks, Inc
>
> 650.200.0840
>
> rlgar...@databricks.com
>
> On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
> wrote:
>
> This sounds like something mapPartitions should be able to do, not
> sure if there's an easier way.
>
> On Thu, Dec 14, 2017 at 10:20 AM, Don Drake <dondr...@gmail.com> wrote:
> > I'm looking for some advice when I have a flatMap on a Dataset that is
> > creating and returning a sequence of a new case class
> > (Seq[BigDataStructure]) that contains a very large amount of data, much
> > larger than the single input record (think images).
> >
> > In Python, you can use generators (yield) to bypass creating a large
> > list of structures and returning the list.
> >
> > I'm programming this in Scala and was wondering if there are any
> > similar tricks to optimally return a list of classes? I found the
> > for/yield semantics, but it appears the compiler is just creating a
> > sequence for you, and this will blow through my heap given the number
> > of elements in the list and the size of each element.
> >
> > Is there anything else I can use?
> >
> > Thanks.
> >
> > -Don
> >
> > --
> > Donald Drake
> > Drake Consulting
> > http://www.drakeconsulting.com/
> > https://twitter.com/dondrake
> > 800-733-2143
>
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143
