Thanks Robert and Brian. As for "writing the RDD somewhere": I can certainly write a bunch of files to disk or S3. Are there any other options? -Yushu
On Mon, May 23, 2022 at 11:40 AM Brian Hulette <bhule...@google.com> wrote:

> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spark runner-specific feature around this, or at
> least packaging up Robert's proposed solution?
>
> There could be other interesting integrations in this space too, e.g.
> using Spark RDDs as a cache for Interactive Beam.
>
> Brian
>
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw <rober...@google.com> wrote:
>
>> The easiest way to do this would be to write the RDD somewhere then
>> read it from Beam.
>>
>> On Mon, May 23, 2022 at 9:39 AM Yushu Yao <yao.yu...@gmail.com> wrote:
>> >
>> > Hi Folks,
>> >
>> > I know this is not the optimal way to use beam :-) But assume I only
>> > use the spark runner.
>> >
>> > I have a spark library (very complex) that emits a spark dataframe (or
>> > RDD).
>> > I also have an existing complex beam pipeline that can do post
>> > processing on the data inside the dataframe.
>> >
>> > However, the beam part needs a pcollection to start with. The question
>> > is, how can I convert a spark RDD into a pcollection?
>> >
>> > Thanks
>> > -Yushu
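
For the archives, Robert's suggested handoff (write the RDD somewhere, then read it from Beam) would, with the real libraries, be roughly `rdd.saveAsTextFile(path)` on the Spark side and `beam.io.ReadFromText(path + "/*")` on the Beam side. The sketch below is a stdlib-only stand-in for that file round-trip, so the paths and the two-partition split are illustrative assumptions, not Spark behavior:

```python
# Hand data from Spark to Beam through files: the Spark side writes
# part files (as rdd.saveAsTextFile does), the Beam side reads them
# back by glob pattern (as beam.io.ReadFromText does).
import glob
import os
import tempfile

def spark_style_save(records, out_dir):
    """Write records as numbered part files, mimicking RDD.saveAsTextFile."""
    os.makedirs(out_dir, exist_ok=True)
    # Pretend the RDD had two partitions.
    half = len(records) // 2
    for i, chunk in enumerate([records[:half], records[half:]]):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            f.writelines(line + "\n" for line in chunk)

def beam_style_read(pattern):
    """Read every line from files matching a glob, mimicking ReadFromText."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines

with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "handoff")
    spark_style_save(["a", "b", "c", "d"], out)
    result = beam_style_read(os.path.join(out, "part-*"))
    print(result)  # ['a', 'b', 'c', 'd']
```

The point is that the handoff format is just line-oriented files behind a shared path (local disk or S3), so the two systems never need to share an in-memory representation.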