+dev@beam <mailto:d...@beam.apache.org>
On 5/24/22 11:40, Jan Lukavský wrote:
Hi,
I think this feature is valid. Every runner for which Beam is not a
'native' SDK uses some form of translation context, which maps each
PCollection to an internal representation in the particular runner's
SDK (an RDD in this case). It should be possible to "import" an RDD
into the specific runner via something like
SparkRunner runner = ....;
PCollection<...> pCollection = runner.importRDD(rdd);
and similarly
RDD<...> rdd = runner.exportRDD(pCollection);
Yes, this would obviously be runner-specific, but that is actually the
point. It would enable using features and libraries that Beam does not
have, or micro-optimizing a particular step with runner-specific
features that Beam lacks. We actually had this feature (at least as a
prototype) many years ago, when Euphoria was still a separate project.
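To make the idea concrete, a rough Java sketch of the shape such an API
could take (importRDD and exportRDD are hypothetical names; nothing
like this exists in the current SparkRunner):

// Hypothetical sketch only; these methods do not exist today.
SparkPipelineOptions options = ....;
SparkRunner runner = SparkRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);

JavaRDD<String> rdd = ....; // produced by the existing Spark library

// The runner would register the RDD in its translation context, so the
// returned PCollection resolves to this RDD during pipeline translation.
PCollection<String> input = runner.importRDD(p, rdd);

// ... apply Beam transforms to 'input' ...

// and the reverse direction:
JavaRDD<String> output = runner.exportRDD(input);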
Jan
On 5/23/22 20:58, Alexey Romanenko wrote:
On 23 May 2022, at 20:40, Brian Hulette <bhule...@google.com> wrote:
Yeah I'm not sure of any simple way to do this. I wonder if it's
worth considering building some Spark runner-specific feature around
this, or at least packaging up Robert's proposed solution?
I’m not sure that a runner-specific feature is a good way to do this,
since the other runners won’t be able to support it. Or am I missing
something?
There could be other interesting integrations in this space too,
e.g. using Spark RDDs as a cache for Interactive Beam.
Another option could be to add something like SparkIO (or
FlinkIO/whatever) to read/write data from/to Spark data structures
for such cases (Spark-schema-to-Beam-schema conversion could also be
supported). And, dreaming a bit more, for those who need a mixed
pipeline (e.g. Spark + Beam), such connectors could support push-down
of pure Spark pipelines and then use the result downstream in Beam.
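Sketching what that might look like (SparkIO and every method on it
below are hypothetical; no such connector exists in Beam today):

// Hypothetical SparkIO, mirroring the style of existing Beam IO connectors.
PCollection<Row> rows = pipeline.apply(
    SparkIO.<Row>read()
        .withSparkMaster("spark://....")
        // a pushed-down pure Spark sub-pipeline producing a Dataset<Row>
        .withDataset(spark -> spark.read().parquet("hdfs://..../input")
                                   .filter("amount > 0")));
// The Dataset's Spark schema would be converted to a Beam schema here,
// so downstream transforms could treat 'rows' as a schema-aware PCollection.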
—
Alexey
Brian
On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw
<rober...@google.com> wrote:
The easiest way to do this would be to write the RDD somewhere, then
read it from Beam.
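For example, assuming the records can be serialized to text and both
sides can reach the same filesystem (the staging path below is made up):

// Spark side: stage the RDD as text files.
JavaRDD<String> rdd = ....; // output of the Spark library, one record per line
rdd.saveAsTextFile("hdfs:///tmp/handoff/run-1");

// Beam side: read the staged files back as a PCollection.
Pipeline p = Pipeline.create(options);
PCollection<String> lines =
    p.apply(TextIO.read().from("hdfs:///tmp/handoff/run-1/part-*"));

Any format with a Beam IO works the same way, e.g. Avro or Parquet if
you need to carry a schema across.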
On Mon, May 23, 2022 at 9:39 AM Yushu Yao <yao.yu...@gmail.com>
wrote:
>
> Hi Folks,
>
> I know this is not the optimal way to use Beam :-) But assume I only
> use the Spark runner.
>
> I have a Spark library (very complex) that emits a Spark dataframe
> (or RDD).
> I also have an existing complex Beam pipeline that can do
> post-processing on the data inside the dataframe.
>
> However, the Beam part needs a PCollection to start with. The
> question is, how can I convert a Spark RDD into a PCollection?
>
> Thanks
> -Yushu
>