>
> I am trying to implement org.apache.spark.ml.Transformer interface in
> Java 8.
>
My understanding is the sudo code for transformers is something like
>
> @Override
>
> public DataFrame transform(DataFrame df) {
>
> 1. Select the input column
>
> 2. Create a new column
>
> 3. Append the new column to the df argument and return
>
> }
>
The following line can be used inside of the transform function to return a
Dataframe that has been augmented with a new column using the stem lambda
function (defined as a UDF below).
return df.withColumn("filteredInput", expr("stem(rawInput)"));
This is producing a new column called filterInput (that is appended to
whatever columns are already there) by passing the column rawInput to your
arbitrary lambda function.
> Based on my experience the current DataFrame api is very limited. You can
> not apply a complicated lambda function. As a work around I convert the
> data frame to a JavaRDD, apply my complicated lambda, and then convert the
> resulting RDD back to a Data Frame.
>
This is exactly what this code is doing. You are defining an arbitrary
lambda function as a UDF. The difference here, when compared to a JavaRDD
map, is that you can use this UDF to append columns without having to
manually append the new data to some existing object.
sqlContext.udf().register("stem", new UDF1<String, String>() {
@Override
public String call(String str) {
return // TODO: stemming code here
}
}, DataTypes.StringType);
Now I select the “new column” from the Data Frame and try to call
> df.withColumn().
>
>
> I can try an implement this as a UDF. How ever I need to use several 3rd
> party jars. Any idea how insure the workers will have the required jar
> files? If I was submitting a normal java app I would create an uber jar
> will this work with UDFs?
>
Yeah, UDFs are run the same way as your RDD lambda functions.