This seems like an easy thing to do, but I've been banging my head against
the wall for hours trying to get it to work.

I'm processing a Spark DataFrame (in Python).  As I process it, I want
to hold some data from one record in local variables in memory, and
then use those values later while processing a subsequent record.  But
I can't see any way to do this.

I tried using:

dataframe.select(a_custom_udf_function('some_column'))

... and then reading/writing to local variables in the UDF, but I
can't get this to work properly.
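
In case it helps, here's a simplified sketch of roughly what I was
attempting (the DataFrame setup, column name, and string munging are
just placeholders for my real logic):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    dataframe = spark.createDataFrame(
        [('a',), ('b',), ('c',)], ['some_column'])  # stand-in data

    previous_value = None  # state I hoped to carry from record to record

    @udf(returnType=StringType())
    def a_custom_udf_function(value):
        global previous_value
        # use whatever was saved while processing the previous record ...
        result = '%s_after_%s' % (value, previous_value)
        # ... and remember this record for the next one
        previous_value = value
        return result

    dataframe.select(a_custom_udf_function('some_column')).show()

The UDF runs fine, but I can't get a value saved from one record to
reliably show up when the UDF runs on a later record.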

My next guess would be to use dataframe.foreach(a_custom_function) and
try to save data to local variables in there, but I suspect that may
not work either.
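
Something along these lines is what I have in mind (again simplified,
and assuming the same dataframe as above):

    seen_values = []  # local state I'd hope to share between records

    def a_custom_function(row):
        # compare this record against whatever was saved from earlier ones
        if seen_values:
            print('%s follows %s' % (row['some_column'], seen_values[-1]))
        seen_values.append(row['some_column'])

    dataframe.foreach(a_custom_function)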


What's the correct way to do something like this in Spark?  In Hadoop
I would simply declare local variables and read and write to them in
my map function as needed (knowing that a) the same map function gets
called repeatedly for records with many different keys, and b) there
are many instances of my code spread across many machines, so each map
function instance only sees a subset of the records).  But in Spark it
seems extraordinarily difficult to create local variables that can be
read from and written to across different records in the DataFrame.
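
For comparison, this is the sort of thing I mean on the Hadoop side: a
(simplified) Hadoop Streaming style mapper in Python, where the state
lives in ordinary local variables and each mapper process only ever
sees its own subset of the records:

    import sys

    previous_record = None  # ordinary local state, reused across records

    for line in sys.stdin:
        record = line.rstrip('\n')
        if previous_record is not None:
            # use data saved while processing an earlier record
            print('%s\t%s' % (previous_record, record))
        previous_record = record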

Perhaps there's something obvious I'm missing here?  If so, any help would
be greatly appreciated!

Thanks,

DR
