Hi Jao,

You're right that defining serialize and deserialize is the main task in
implementing a UDT.  They basically translate between your native
representation (ByteImage) and SQL DataTypes.  The sqlType you defined
looks correct, and using a row of length 4 is the right approach.  Beyond
that, it is mostly a matter of copying data to and from SQL Rows.  There are
quite a few examples of that in the codebase; I'd recommend searching for
the particular DataTypes you're using.
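As a rough sketch (untested, and assuming the 1.2-era catalyst API with
GenericMutableRow), the two methods could look something like this -- the
field order just has to match the StructType you return from sqlType:

```scala
// Sketch only: assumes Spark 1.2-era Row-based UDTs and GenericMutableRow.
override def serialize(obj: Any): Row = {
  obj match {
    case img: ByteImage =>
      val row = new GenericMutableRow(4)
      // Slots 0-3 mirror the fields declared in sqlType.
      row.setInt(0, img.channels)
      row.setInt(1, img.width)
      row.setInt(2, img.height)
      row.update(3, img.data)
      row
  }
}

override def deserialize(datum: Any): ByteImage = {
  datum match {
    case row: Row =>
      require(row.length == 4,
        s"ByteImageUDT.deserialize given row with length ${row.length}")
      new ByteImage(
        row.getInt(0),
        row.getInt(1),
        row.getInt(2),
        row.getAs[Array[Byte]](3))
  }
}
```

Note that deserialize should return your ByteImage (not Vector), and
likewise userClass should be classOf[ByteImage].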

Are there particular issues you're running into?

Joseph

On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaon...@gmail.com>
wrote:

> Hi all,
>
> I'm trying to implement a pipeline for computer vision based on the latest
> ML package in Spark. The first step of my pipeline is to decode images (jpeg
> for instance) stored in a Parquet file.
> For this, I began by creating a UserDefinedType that represents a decoded
> image stored in an array of bytes. Here is my first attempt:
>
> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
> class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])
>
> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>
>   override def sqlType: StructType = {
>     // type: 0 = sparse, 1 = dense
>     // We only use "values" for dense vectors, and "size", "indices", and
>     // "values" for sparse vectors. The "values" field is nullable because
>     // we might want to add binary vectors later, which use "size" and
>     // "indices", but not "values".
>     StructType(Seq(
>       StructField("channels", IntegerType, nullable = false),
>       StructField("width", IntegerType, nullable = false),
>       StructField("height", IntegerType, nullable = false),
>       StructField("data", BinaryType, nullable = false)))
>   }
>
>   override def serialize(obj: Any): Row = {
>     val row = new GenericMutableRow(4)
>     val img = obj.asInstanceOf[ByteImage]
>     ...
>   }
>
>   override def deserialize(datum: Any): Vector = {
>     ....
>   }
>
>   override def pyUDT: String = "pyspark.mllib.linalg.VectorUDT"
>
>   override def userClass: Class[Vector] = classOf[Vector]
> }
>
>
> I took the VectorUDT as a starting point, but there are a lot of things that
> I don't really understand. So any help on defining the serialize and
> deserialize methods would be appreciated.
>
> Best Regards,
>
> Jao
>
>
