In my transformSchema I do specify that the output column type is a VectorUDT :
override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  checkInputColumn(schema, map(inputCol), ArrayType(FloatType, false))
  addOutputColumn(schema, ...
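For reference, here is a rough sketch of an equivalent transformSchema written against public APIs only (the require-based check and the direct use of new VectorUDT are my assumptions; VectorUDT is not public in every Spark release, so this may need adapting):

import org.apache.spark.mllib.linalg.VectorUDT
import org.apache.spark.sql.types.{ArrayType, FloatType, StructField, StructType}

override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
  val map = this.paramMap ++ paramMap
  // Same input check as above, expressed with a plain require.
  require(schema(map(inputCol)).dataType == ArrayType(FloatType, false),
    s"Column ${map(inputCol)} must be ArrayType(FloatType, false)")
  // Declare the output column as a vector so downstream code sees a VectorUDT.
  StructType(schema.fields :+ StructField(map(outputCol), new VectorUDT, nullable = false))
}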
My guess is that the `createDataFrame` call is failing here. Can you check
whether the schema being passed to it includes the column name and type for the
newly zipped `features` column?
Joseph probably knows this better, but AFAIK the DenseVector here will need
to be marked as a VectorUDT while creating ...
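To make that concrete, here is a rough sketch of declaring the zipped column explicitly when rebuilding the DataFrame (featureRDD, the "features" column name, and new VectorUDT are my assumptions):

import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType}

// featureRDD: RDD[Array[Double]], one feature vector per input row, in the same order.
val newSchema = StructType(dataSet.schema.fields :+
  StructField("features", new VectorUDT, nullable = false))

val rowsWithFeatures = dataSet.rdd.zip(featureRDD).map { case (row, feat) =>
  Row.fromSeq(row.toSeq :+ Vectors.dense(feat))
}

val withFeatures = dataSet.sqlContext.createDataFrame(rowsWithFeatures, newSchema)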
Following your suggestion, I ended up with the following implementation:

override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  val schema = transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val features = da...
One workaround could be to convert the DataFrame into an RDD inside the
transform function, use mapPartitions/broadcast to work with the
JNI calls, and then convert back to a DataFrame.
Thanks
Shivaram
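A rough sketch of that workaround (the CaffeJniModel wrapper, its forward method, serializedModelBytes, the "data" column, and newSchema — the input schema plus the new feature column — are all placeholders):

import org.apache.spark.sql.Row

val modelBytesBc = sc.broadcast(serializedModelBytes)  // e.g. the .caffemodel file as bytes

val newRows = dataFrame.rdd.mapPartitions { rows =>
  // Instantiate the JNI-backed model once per partition, not once per row.
  val model = new CaffeJniModel(modelBytesBc.value)
  rows.map { row =>
    val input = row.getAs[Seq[Float]]("data")
    Row.fromSeq(row.toSeq :+ model.forward(input.toArray))
  }
}

// Re-attach a schema (input schema plus the extra column) to get back to a DataFrame.
val result = dataFrame.sqlContext.createDataFrame(newRows, newSchema)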
On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa wrote:
Dear all,
I'm still struggling to make a pre-trained Caffe model transformer for
DataFrames work. The main problem is that creating a Caffe model inside the
UDF is very slow and consumes a lot of memory.
Some of you suggested broadcasting the model. The problem with broadcasting
is that I use a JNI interface ...
I think it would be interesting to have an equivalent of mapPartitions for
DataFrames. There are many use cases where data are processed in batches. Another
example is a simple linear classifier Ax = b, where A is the matrix of feature
vectors, x the model, and b the output. Here again the product A ...
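For the linear-classifier case, a batched product per partition could look roughly like this (Breeze ships with Spark; featureRDD and the broadcast model vector xBc are assumptions):

import breeze.linalg.{DenseMatrix, DenseVector}

// featureRDD: RDD[Array[Double]]; xBc: broadcast of the model vector x as Array[Double].
val b = featureRDD.mapPartitions { iter =>
  val x = DenseVector(xBc.value)
  val rows = iter.toArray
  if (rows.isEmpty) Iterator.empty
  else {
    // Stack the partition's feature vectors into A and compute A * x in one call.
    val A = DenseMatrix.tabulate(rows.length, x.length)((i, j) => rows(i)(j))
    (A * x).toArray.iterator
  }
}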
I see. I think your best bet is to create the cnnModel on the master and
then serialize it to send to the workers. If it's big (1M or so), then you
can broadcast it and use the broadcast variable in the UDF. There is not a
great way to do something equivalent to mapPartitions with UDFs right now.
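A small sketch of that approach (cnnModel, its featurize method, and the column names are placeholders; the model has to be serializable for broadcasting to work):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

val modelBc = sc.broadcast(cnnModel)

// The UDF only captures the broadcast handle; the model itself ships once per executor.
val featurize = udf { (image: Seq[Float]) =>
  Vectors.dense(modelBc.value.featurize(image.toArray).map(_.toDouble))
}

val withFeatures = df.withColumn("features", featurize(df("image")))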
Here is my current implementation with the current master version of Spark:

class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... {

  override def transformSchema(...) { ... }

  override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
    ...
I see, thanks for clarifying!
I'd recommend following existing implementations in spark.ml transformers.
You'll need to define a UDF which operates on a single Row to compute the
value for the new column. You can then use the DataFrame DSL to create the
new column; the DSL provides a nice syntax ...
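Along those lines, a transform could look roughly like this (the body of the UDF is a placeholder for the actual featurization):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {
  transformSchema(dataSet.schema, paramMap, logging = true)
  val map = this.paramMap ++ paramMap
  val featurize = udf { (input: Seq[Float]) =>
    // Placeholder: run the pre-trained model here and return the result as a Vector.
    Vectors.dense(input.map(_.toDouble).toArray)
  }
  dataSet.withColumn(map(outputCol), featurize(dataSet(map(inputCol))))
}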
class DeepCNNFeature extends Transformer ... {

  override def transform(data: DataFrame, paramMap: ParamMap): DataFrame = {
    // How can I do a mapPartitions on the underlying RDD and then add the column?
  }
}
On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa wrote:
Hi Joseph,
Thank you for the tips. I understand what I should do when my data are
represented as an RDD. The thing that I can't figure out is how to do the
same thing when the data is viewed as a DataFrame and I need to add the
result of my pre-trained model as a new column in the DataFrame. Precisely ...
Hi Jao,
You can use external tools and libraries if they can be called from your
Spark program or script (with appropriate conversion of data types, etc.).
The best way to apply a pre-trained model to a dataset would be to call the
model from within a closure, e.g.:
myRDD.map { myDatum => preTrai...
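For completeness, the closure pattern spelled out (loadCaffeModel, modelPath, and predict are placeholders; broadcasting is optional but avoids re-shipping a large model with every task):

// Load once on the driver, then reference only the broadcast handle in the closure.
val preTrainedModel = loadCaffeModel(modelPath)
val modelBc = sc.broadcast(preTrainedModel)

val predictions = myRDD.map { myDatum =>
  modelBc.value.predict(myDatum)
}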