You can groupBy multiple columns on a DataFrame, so why do you need such a
complicated schema?

Suppose your df has the flat schema (x, y, u, v, z); then you can group on the key columns directly:

df.groupBy($"x", $"y").agg(...)

Is this what you want?
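
For example, here is a minimal runnable sketch of that approach (the sample
data, the SparkSession setup, and the sum/max aggregates are assumptions for
illustration, substitute your own):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, sum}

val spark = SparkSession.builder()
  .appName("groupBy-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Flat schema (x, y, u, v, z) built from the sample values in your mail
val df = Seq(
  (10, 11, 10, 2, 11),
  (10, 11, 10, 2, 11),
  (20, 11, 10, 2, 11)
).toDF("x", "y", "u", "v", "z")

// Group on the two key columns directly instead of a nested Row key;
// sum and max here are placeholder aggregates
df.groupBy($"x", $"y")
  .agg(sum($"u").as("sum_u"), max($"z").as("max_z"))
  .show()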

On Fri, Dec 8, 2017 at 11:51 AM, Sandip Mehta <sandip.mehta....@gmail.com>
wrote:

> Hi,
>
> During my aggregation I end up with the following schema.
>
> Row(Row(val1,val2), Row(val1,val2,val3...))
>
> import org.apache.spark.sql.Row
>
> val values = Seq(
>   (Row(10, 11), Row(10, 2, 11)),
>   (Row(10, 11), Row(10, 2, 11)),
>   (Row(20, 11), Row(10, 2, 11))
> )
>
>
> The first tuple is used to group the relevant records for aggregation. I
> have used the following to create the dataset.
>
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val s = StructType(Seq(
>   StructField("x", IntegerType, true),
>   StructField("y", IntegerType, true)
> ))
> val s1 = StructType(Seq(
>   StructField("u", IntegerType, true),
>   StructField("v", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
>
> val ds = sparkSession.sqlContext.createDataset(
>   sparkSession.sparkContext.parallelize(values)
> )(Encoders.tuple(RowEncoder(s), RowEncoder(s1)))
>
> Is this the correct way of representing this?
>
> How do I create a dataset and a row encoder for such a use case, so that I
> can do groupByKey on it?
>
>
>
> Regards
> Sandeep
>
