No, this use case is perfectly sensible. Yes, it is thread safe.

On Sun, Feb 12, 2017, 10:30 Jörn Franke <jornfra...@gmail.com> wrote:
> I think you should have a look at the Spark documentation. It has
> something called a scheduler that does exactly this. In more sophisticated
> environments YARN or Mesos do this for you.
>
> Using threads for transformations does not make sense.
>
> On 12 Feb 2017, at 09:50, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
> I know Spark takes care of executing everything in a distributed manner;
> however, Spark also supports having multiple threads on the same Spark
> session/context and knows (through the fair scheduler) to distribute the
> tasks from them in round robin.
>
> The question is: can those two actions (with a different set of
> transformations) be applied to the SAME dataframe?
>
> Let's say I want to do something like:
>
> val df = ???
> df.cache()
> df.count()
>
> def f1(df: DataFrame): Unit = {
>   val df1 = df.groupBy(something).agg(some aggs)
>   df1.write.parquet("some path")
> }
>
> def f2(df: DataFrame): Unit = {
>   val df2 = df.groupBy(something else).agg(some different aggs)
>   df2.write.parquet("some path 2")
> }
>
> f1(df)
> f2(df)
>
> df.unpersist()
>
> If the aggregations do not use the full cluster (e.g. because of data
> skew, because there aren't enough partitions, or for any other reason),
> this would leave the cluster underutilized.
>
> However, if I were to call f1 and f2 on different threads, then df2 could
> use free resources f1 has not consumed, and the overall utilization would
> improve.
>
> Of course, I can do this only if the operations on the dataframe are
> thread safe. For example, if I did a cache in f1 and an unpersist in f2, I
> would get an inconsistent result. So my question is: what, if any, are the
> legal operations to use on a dataframe so I could do the above?
>
> Thanks,
> Assaf.
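The pattern in the quoted snippet - launch two independent jobs from separate threads over one shared, read-only dataset - can be sketched without a Spark cluster. The following is a minimal plain-Python analogue (not Spark code; the data, `f1`, and `f2` are stand-ins for the cached DataFrame and the two aggregations): because both threads only read the shared data, no locking is needed, which is the same property that makes concurrent actions on one cached DataFrame safe.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared, read-only data standing in for the cached DataFrame.
data = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def f1(rows):
    # First "aggregation": sum of values per key.
    out = {}
    for k, v in rows:
        out[k] = out.get(k, 0) + v
    return out

def f2(rows):
    # Second "aggregation": row count per key.
    out = {}
    for k, _ in rows:
        out[k] = out.get(k, 0) + 1
    return out

# Launch both jobs concurrently; each thread only reads the shared
# data, so no synchronization is required.
with ThreadPoolExecutor(max_workers=2) as pool:
    sums = pool.submit(f1, data)
    counts = pool.submit(f2, data)
    print(sums.result())    # {'a': 4, 'b': 6}
    print(counts.result())  # {'a': 2, 'b': 2}
```

The caveat from the email carries over directly: the pattern is safe only while every thread treats the shared object as read-only; a mutating call (cache in one thread, unpersist in another) reintroduces a race.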
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* Sunday, February 12, 2017 10:39 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: is dataframe thread safe?
>
> I am not sure what you are trying to achieve here. Spark takes care of
> executing the transformations in a distributed fashion. This means you
> must not use threads - it does not make sense. Hence, you do not find
> documentation about it.
>
> On 12 Feb 2017, at 09:06, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
> Hi,
>
> I was wondering if a dataframe is considered thread safe. I know the
> Spark session and Spark context are thread safe (and actually have tools
> to manage jobs from different threads), but the question is: can I use
> the same dataframe in both threads?
>
> The idea would be to create a dataframe in the main thread and then, in
> two sub-threads, do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other; checkpointing would
> cause similar issues), however, I can't find any documentation as to what
> operations (if any) are thread safe.
>
> Thanks,
> Assaf.
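For reference, the fair scheduler mentioned in the thread is enabled through configuration rather than code. A typical setup (the pool name here is illustrative, not required) is a one-line change in `spark-defaults.conf`:

```
# spark-defaults.conf — switch job scheduling from FIFO to fair sharing
spark.scheduler.mode   FAIR
```

Each thread can then optionally pin its jobs to a named pool before triggering actions, via `sc.setLocalProperty("spark.scheduler.pool", "pool1")`; without this, all threads share the default pool and their jobs still get round-robin treatment under FAIR mode.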