Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
Hi, I think this feature is valid. Every runner for which Beam is not a 'native' SDK uses some form of translation context, which maps each PCollection to the internal representation in the particular SDK of the runner (an RDD in this case). It should be possible to "import" an RDD into the specific runner…
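The translation-context idea above can be sketched in plain Java. Everything below is a hypothetical stand-in, not Beam's actual API: a HashMap plays the role of the runner's translation context, and `Pc`/`Rdd` are toy substitutes for PCollection and RDD. "Importing" an RDD would then amount to pre-registering a mapping so the translator finds it instead of translating a Beam transform:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Toy stand-ins for Beam's PCollection and Spark's RDD (hypothetical names).
final class Pc<T> { final String name; Pc(String name) { this.name = name; } }
final class Rdd<T> { final java.util.List<T> data; Rdd(java.util.List<T> data) { this.data = data; } }

// Minimal sketch of a translation context: the runner keeps a
// PCollection -> RDD map and looks entries up before translating anything itself.
final class ToyTranslationContext {
    private final Map<Pc<?>, Rdd<?>> datasets = new HashMap<>();

    // "Importing" an external RDD is just pre-registering the mapping.
    <T> void importRdd(Pc<T> pc, Rdd<T> rdd) { datasets.put(pc, rdd); }

    @SuppressWarnings("unchecked")
    <T> Rdd<T> resolve(Pc<T> pc) {
        return (Rdd<T>) Objects.requireNonNull(
                datasets.get(pc), "not translated yet: " + pc.name);
    }
}
```

The real Spark runner keeps an analogous map in its translation layer; the sketch only illustrates why a pre-registered entry would make an external RDD visible to the rest of the translated pipeline.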

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
+dev@beam On 5/24/22 11:40, Jan Lukavský wrote: Hi, I think this feature is valid. Every runner for which Beam is not a 'native' SDK uses some form of translation context, which maps PCollection to internal representation of the particular SDK of the runner (RDD in this case)…

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Yushu Yao
Looks like it's a valid use case. Wondering if anyone can give some high-level guidelines on how to implement this? I can give it a try. -Yushu On Tue, May 24, 2022 at 2:42 AM Jan Lukavský wrote: > +dev@beam > On 5/24/22 11:40, Jan Lukavský wrote: > > Hi, > I think this feature is valid. Every runner…

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
Yes, I suppose it might be more complex than the code snippet; that was just to demonstrate the idea. Also, the "exportRDD" would probably return WindowedValue instead of plain T. On 5/24/22 17:23, Reuven Lax wrote: Something like this seems reasonable. Beam PCollections also have a timestamp a…
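The point about returning WindowedValue rather than plain T can be illustrated with a toy wrapper. WindowedValue is a real Beam class (it also carries windows and pane info), but the sketch below is a simplified stand-in, and `exportRdd` is the hypothetical method from this thread: an exported element keeps the timestamp Beam tracks for it instead of dropping it.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for Beam's WindowedValue<T>: value plus event timestamp.
final class ToyWindowedValue<T> {
    final T value;
    final Instant timestamp;
    ToyWindowedValue(T value, Instant timestamp) { this.value = value; this.timestamp = timestamp; }
}

final class ExportSketch {
    // Hypothetical exportRDD: rather than plain T, hand back the windowed form
    // so timestamps survive the trip out of the Beam pipeline.
    static <T> List<ToyWindowedValue<T>> exportRdd(List<T> values, Instant ts) {
        return values.stream()
                .map(v -> new ToyWindowedValue<>(v, ts))
                .collect(Collectors.toList());
    }
}
```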

Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Moritz Mack
Hi Yushu, Have a look at org.apache.beam.runners.spark.translation.EvaluationContext in the Spark runner. It maintains the mapping between PCollections and RDDs (wrapped in the BoundedDataset helper). As Reuven just pointed out, values are timestamped (and windowed) in Beam, therefore BoundedDataset…
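The pointers above (EvaluationContext and BoundedDataset are real classes in Beam's Spark runner) suggest what an import path would have to do: stamp each raw RDD element with a timestamp before registering the dataset, because the runner stores windowed values, not bare values. A hedged sketch with toy stand-ins rather than the real Beam/Spark types; the names `importDataset` and `DEFAULT_TS` are hypothetical, and epoch is only a placeholder for whatever timestamp policy an import would actually use:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy timestamped element, mirroring how Spark-runner datasets hold
// windowed values rather than bare values.
final class ToyStamped<T> {
    final T value;
    final Instant timestamp;
    ToyStamped(T value, Instant timestamp) { this.value = value; this.timestamp = timestamp; }
}

// Sketch of the EvaluationContext pattern: a name -> dataset registry where
// an "import" stamps raw elements before registering them.
final class ToyEvaluationContext {
    static final Instant DEFAULT_TS = Instant.EPOCH; // placeholder timestamp policy
    private final Map<String, List<ToyStamped<?>>> datasets = new HashMap<>();

    <T> void importDataset(String name, List<T> raw) {
        datasets.put(name, raw.stream()
                .map(v -> (ToyStamped<?>) new ToyStamped<>(v, DEFAULT_TS))
                .collect(Collectors.toList()));
    }

    List<ToyStamped<?>> getDataset(String name) { return datasets.get(name); }
}
```

The design question the sketch surfaces is exactly the one raised in the thread: an imported RDD has no Beam-assigned timestamps or windows, so the import boundary has to choose them.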