Hi all, I'm on Spark 1.6.1.

I am trying to "DataFrame-ize" a complex function I have that currently
operates on a Dataset and returns another Dataset with a new "column"
added to it. I'm trying to fit this into the new ML "Model" (i.e.
Transformer) pattern, where I receive a DataFrame, ensure the input
column exists, perform my transform, and append the result as a new
column.
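
For concreteness, this is roughly the skeleton I'm aiming for (a minimal
sketch against the 1.6 ML API; the class name and the "input"/"output"
column names are placeholders):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

class MyTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("myTransformer"))

  // verify the input column exists and declare the appended output column
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains("input"), "column 'input' must exist")
    StructType(schema.fields :+ StructField("output", DoubleType, nullable = false))
  }

  override def transform(df: DataFrame): DataFrame = {
    transformSchema(df.schema)
    ??? // this is the part I can't see how to express over a DataFrame
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}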

From reviewing other ML Model code, the typical approach I see is a UDF
that maps the input column to the output column. My problem is that this
requires the UDF to operate on each record one by one.
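
That is, something like this (a sketch; the function and column names are
made up):

import org.apache.spark.sql.functions.udf

// stand-in for the real per-record logic
val score = udf { (x: Double) => x * 2.0 }
val out = df.withColumn("output", score(df("input")))

That pattern works when a record's output depends only on that record,
which isn't my situation.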

In my case I run a chain of RDD/Dataset operations on the original input
column: a flatMap, a join with another cached RDD, a calculation, and a
reduce.
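
Simplified, the current pipeline looks like this (tokenize and the lookup
RDD are stand-ins for my real logic; each record is keyed by a Long id):

import org.apache.spark.rdd.RDD

// hypothetical stand-in for the real per-record expansion
def tokenize(s: String): Seq[String] = s.split("\\s+").toSeq

// input: (recordId, inputColumn) pairs; lookup: a cached side RDD
def scores(input: RDD[(Long, String)],
           lookup: RDD[(String, Double)]): RDD[(Long, Double)] =
  input
    .flatMap { case (id, text) => tokenize(text).map(tok => (tok, id)) }
    .join(lookup)                                   // (token, (id, weight))
    .map { case (_, (id, weight)) => (id, weight) } // the calculation step
    .reduceByKey(_ + _)                             // reduce to one value per record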

How can I express this kind of multi-record transform with DataFrames and
still attach the result back as a new column?

thanks,
Thunder
