RE: is dataframe thread safe?

Mendelson, Assaf Sun, 12 Feb 2017 05:32:01 -0800

There is no threads within maps here. The idea is to have two jobs on two 
different threads which use the same dataframe (which is cached btw).
This does not override spark’s parallel execution of transformation or any 
such. The documentation (job scheduling) actually hints at this option but 
doesn’t say specifically if it is supported when the same dataframe is used.
As for configuring the scheduler, this would not work. First it would mean that 
the same cached dataframe cannot be used, I would have to add some additional 
configuration such as alluxio (and it would still have to 
serialize/deserialize) as opposed to using the cached data. Furthermore, 
multi-tenancy between applications is limited to either dividing the cluster 
between the applications or using dynamic allocation (which has its own 
overheads).

Therefore Sean’s answer is what I was looking for (and hoping for…)
Assaf

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Sunday, February 12, 2017 2:46 PM
To: Sean Owen
Cc: Mendelson, Assaf; user
Subject: Re: is dataframe thread safe?

I did not doubt that the submission of several jobs of one application makes 
sense. However, he want to create threads within maps etc., which looks like 
calling for issues (not only for running the application itself, but also for 
operating it in production within a shared cluster). I would rely for parallel 
execution of the transformations on the out-of-the-box functionality within 
Spark.

For me he looks for a solution that can be achieved by a simple configuration 
of the scheduler in Spark, yarn or mesos. In this way the application would be 
more maintainable in production.

On 12 Feb 2017, at 11:45, Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>> wrote:
No this use case is perfectly sensible. Yes it is thread safe.
On Sun, Feb 12, 2017, 10:30 Jörn Franke 
<jornfra...@gmail.com<mailto:jornfra...@gmail.com>> wrote:
I think you should have a look at the spark documentation. It has something 
called scheduler who does exactly this. In more sophisticated environments yarn 
or mesos do this for you.

Using threads for transformations does not make sense.

On 12 Feb 2017, at 09:50, Mendelson, Assaf 
<assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com>> wrote:
I know spark takes care of executing everything in a distributed manner, 
however, spark also supports having multiple threads on the same spark 
session/context and knows (Through fair scheduler) to distribute the tasks from 
them in a round robin.

The question is, can those two actions (with a different set of 
transformations) be applied to the SAME dataframe.

Let’s say I want to do something like:

Val df = ???
df.cache()
df.count()

def f1(df: DataFrame): Unit = {
  val df1 = df.groupby(something).agg(some aggs)
  df1.write.parquet(“some path”)
}

def f2(df: DataFrame): Unit = {
  val df2 = df.groupby(something else).agg(some different aggs)
  df2.write.parquet(“some path 2”)
}

f1(df)
f2(df)

df.unpersist()

if the aggregations do not use the full cluster (e.g. because of data skewness, 
because there aren’t enough partitions or any other reason) then this would 
leave the cluster under utilized.

However, if I would call f1 and f2 on different threads, then df2 can use free 
resources f1 has not consumed and the overall utilization would improve.

Of course, I can do this only if the operations on the dataframe are thread 
safe. For example, if I would do a cache in f1 and an unpersist in f2 I would 
get an inconsistent result. So my question is, what, if any are the legal 
operations to use on a dataframe so I could do the above.

Thanks,
                Assaf.

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Sunday, February 12, 2017 10:39 AM
To: Mendelson, Assaf
Cc: user
Subject: Re: is dataframe thread safe?

I am not sure what you are trying to achieve here. Spark is taking care of 
executing the transformations in a distributed fashion. This means you must not 
use threads - it does not make sense. Hence, you do not find documentation 
about it.

On 12 Feb 2017, at 09:06, Mendelson, Assaf 
<assaf.mendel...@rsa.com<mailto:assaf.mendel...@rsa.com>> wrote:
Hi,
I was wondering if dataframe is considered thread safe. I know the spark 
session and spark context are thread safe (and actually have tools to manage 
jobs from different threads) but the question is, can I use the same dataframe 
in both threads.
The idea would be to create a dataframe in the main thread and then in two sub 
threads do different transformations and actions on it.
I understand that some things might not be thread safe (e.g. if I unpersist in 
one thread it would affect the other. Checkpointing would cause similar 
issues), however, I can’t find any documentation as to what operations (if any) 
are thread safe.

Thanks,
                Assaf.

RE: is dataframe thread safe?

Reply via email to