RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Hi folks, I know this is not the optimal way to use Beam :-) But assume I only use the Spark runner. I have a (very complex) Spark library that emits a Spark DataFrame (or RDD). I also have an existing complex Beam pipeline that can do post-processing on the data inside the DataFrame. However, t

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
To add a bit more to what Robert suggested. Right, in general we can’t read a Spark RDD directly with Beam (the Spark runner uses RDDs under the hood, but that’s a different story), but you can write the results to any storage and in any data format that Beam supports and then read them with a corresponding Beam
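The hand-off described above (write the Spark output to storage, then read it back with a Beam IO) might look roughly like the following sketch. Everything specific here is an assumption, not something stated in the thread: the newline-delimited JSON format, the directory layout, and the helper names `write_handoff`, `parse_line`, and `run_beam` are hypothetical. The writer is a plain-Python stand-in for the Spark side; with a real DataFrame, `df.write.mode("overwrite").json(out_dir)` produces the same kind of `part-*.json` files.

```python
import json
import os
import tempfile


def write_handoff(rows, out_dir):
    """Stand-in for the Spark side (hypothetical helper).

    With a real DataFrame this step would be
    df.write.mode("overwrite").json(out_dir), which also writes
    newline-delimited JSON part files.
    """
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "part-00000.json")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


def parse_line(line):
    """Turn one JSON line back into a dict for the Beam pipeline."""
    return json.loads(line)


def run_beam(in_dir):
    """Read the hand-off files into a PCollection and post-process them."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "ReadHandoff" >> beam.io.ReadFromText(os.path.join(in_dir, "part-*.json"))
            | "Parse" >> beam.Map(parse_line)
            # Placeholder for the existing post-processing transforms.
            | "PostProcess" >> beam.Map(print)
        )


if __name__ == "__main__":
    handoff_dir = os.path.join(tempfile.mkdtemp(), "handoff")
    write_handoff([{"id": 1, "value": "a"}, {"id": 2, "value": "b"}], handoff_dir)
    run_beam(handoff_dir)
```

For S3 or another shared filesystem, the same idea should apply with the path swapped for the corresponding URI, provided both Spark and Beam are configured to reach it.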

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Thanks Robert and Brian. As for "writing the RDD somewhere", I can totally write a bunch of files on disk/s3. Any other options?
-Yushu
On Mon, May 23, 2022 at 11:40 AM Brian Hulette wrote:
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spar

Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
> On 23 May 2022, at 20:40, Brian Hulette wrote:
>
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spark runner-specific feature around this, or at
> least packaging up Robert's proposed solution?
I’m not sure that a runner specific featu