Hi all, Spark 1.6.1 here. I am trying to "DataFrame-ize" a complex function I have that currently operates on a Dataset and returns another Dataset with a new "column" added to it. I'm trying to fit this into the new ML Model format, where I receive a DataFrame, ensure the input column exists, then perform my transform and append the result as a new column.
From reviewing other ML Model code, the way I see this done is typically with a UDF on the input column that produces the output column. My problem is that a UDF operates on each record one by one, whereas in my case I run a chain of RDD/Dataset operations (flatMap, join against another cached RDD, run a calculation, reduce) on the original input column. How can I do this with DataFrames? thanks, Thunder
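Concretely, the shape I'm after looks roughly like the sketch below: a custom Transformer that tags each row with an id, drops to the RDD API for the multi-step computation, then joins the results back on as a new column. Everything specific here is a stand-in for my real logic: the "tokens" input column (an array<string>), the `weights` RDD, and the token-weight summing are hypothetical, just to show the flatMap/join/reduce chain.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonicallyIncreasingId
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Sketch only: column names and the per-record computation are placeholders.
class ChainedTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("chained"))

  override def transform(df: DataFrame): DataFrame = {
    val sqlContext = df.sqlContext
    import sqlContext.implicits._

    // Hypothetical side data standing in for the cached RDD I join against.
    val weights = sqlContext.sparkContext
      .parallelize(Seq(("a", 1.0), ("b", 2.0)))
      .cache()

    // Tag each row so results can be matched back after the RDD stage.
    val withId = df.withColumn("_rowId", monotonicallyIncreasingId())

    // Drop to the RDD API for the multi-step computation.
    val results = withId.select("_rowId", "tokens").rdd
      .flatMap { row =>
        // emit (token, rowId) pairs for the join
        row.getSeq[String](1).map(tok => (tok, row.getLong(0)))
      }
      .join(weights)                            // (token, (rowId, weight))
      .map { case (_, (rowId, w)) => (rowId, w) }
      .reduceByKey(_ + _)                       // one result per row
      .toDF("_rowId", "score")

    // Re-attach the computed column and drop the temporary id.
    withId.join(results, "_rowId").drop("_rowId")
  }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("score", DoubleType, nullable = false))

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```

One thing I'm unsure about: the inner join silently drops rows whose tokens match nothing in the side RDD (a left outer join plus a default would keep them), and the extra join obviously isn't free compared to a per-record UDF. Is this id-tag-and-join-back pattern the idiomatic way to do it, or is there a DataFrame-native approach I'm missing?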