No, this use case is perfectly sensible. Yes, it is thread safe.

On Sun, Feb 12, 2017, 10:30 Jörn Franke <jornfra...@gmail.com> wrote:
> I think you should have a look at the Spark documentation. It has
> something called a scheduler that does exactly this. In more sophisticated
> environments YARN or Mesos do this for you.
>
> Using threads for transformations does not make sense.
>
> On 12 Feb 2017, at 09:50, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
> I know Spark takes care of executing everything in a distributed manner;
> however, Spark also supports having multiple threads on the same Spark
> session/context and knows (through the fair scheduler) to distribute the
> tasks from them in round robin.
>
> The question is: can those two actions (with a different set of
> transformations) be applied to the SAME dataframe?
>
> Let's say I want to do something like:
>
> val df = ???
> df.cache()
> df.count()
>
> def f1(df: DataFrame): Unit = {
>   val df1 = df.groupBy(something).agg(some aggs)
>   df1.write.parquet("some path")
> }
>
> def f2(df: DataFrame): Unit = {
>   val df2 = df.groupBy(something else).agg(some different aggs)
>   df2.write.parquet("some path 2")
> }
>
> f1(df)
> f2(df)
>
> df.unpersist()
>
> If the aggregations do not use the full cluster (e.g. because of data
> skew, because there aren't enough partitions, or for any other reason),
> this would leave the cluster underutilized.
>
> However, if I were to call f1 and f2 on different threads, then df2 could
> use free resources f1 has not consumed, and the overall utilization would
> improve.
>
> Of course, I can do this only if the operations on the dataframe are
> thread safe. For example, if I did a cache in f1 and an unpersist in f2, I
> would get an inconsistent result. So my question is: what, if any, are the
> legal operations to use on a dataframe so I could do the above?
>
> Thanks,
> Assaf.
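The pattern in the quoted snippet - launch two independent jobs from separate threads over one shared, read-only dataset - can be sketched without a Spark cluster. The following is a minimal plain-Python analogue (not Spark code; the data, `f1`, and `f2` are stand-ins for the cached DataFrame and the two aggregations): because both threads only read the shared data, no locking is needed, which is the same property that makes concurrent actions on one cached DataFrame safe.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared, read-only data standing in for the cached DataFrame.
data = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def f1(rows):
    # First "aggregation": sum of values per key.
    out = {}
    for k, v in rows:
        out[k] = out.get(k, 0) + v
    return out

def f2(rows):
    # Second "aggregation": row count per key.
    out = {}
    for k, _ in rows:
        out[k] = out.get(k, 0) + 1
    return out

# Launch both jobs concurrently; each thread only reads the shared
# data, so no synchronization is required.
with ThreadPoolExecutor(max_workers=2) as pool:
    sums = pool.submit(f1, data)
    counts = pool.submit(f2, data)
    print(sums.result())    # {'a': 4, 'b': 6}
    print(counts.result())  # {'a': 2, 'b': 2}
```

The caveat from the email carries over directly: the pattern is safe only while every thread treats the shared object as read-only; a mutating call (cache in one thread, unpersist in another) reintroduces a race.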
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* Sunday, February 12, 2017 10:39 AM
> *To:* Mendelson, Assaf
> *Cc:* user
> *Subject:* Re: is dataframe thread safe?
>
> I am not sure what you are trying to achieve here. Spark takes care of
> executing the transformations in a distributed fashion. This means you
> must not use threads - it does not make sense. Hence, you do not find
> documentation about it.
>
> On 12 Feb 2017, at 09:06, Mendelson, Assaf <assaf.mendel...@rsa.com> wrote:
>
> Hi,
>
> I was wondering if a dataframe is considered thread safe. I know the
> Spark session and Spark context are thread safe (and actually have tools
> to manage jobs from different threads), but the question is: can I use
> the same dataframe in both threads?
>
> The idea would be to create a dataframe in the main thread and then, in
> two sub-threads, do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other; checkpointing would
> cause similar issues), however, I can't find any documentation as to what
> operations (if any) are thread safe.
>
> Thanks,
> Assaf.
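For reference, the fair scheduler mentioned in the thread is enabled through configuration rather than code. A typical setup (the pool name here is illustrative, not required) is a one-line change in `spark-defaults.conf`:

```
# spark-defaults.conf — switch job scheduling from FIFO to fair sharing
spark.scheduler.mode   FAIR
```

Each thread can then optionally pin its jobs to a named pool before triggering actions, via `sc.setLocalProperty("spark.scheduler.pool", "pool1")`; without this, all threads share the default pool and their jobs still get round-robin treatment under FAIR mode.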