Perhaps you are looking for GROUP BY and collect_set, which would allow you
to stay in SQL.  I'll add that in Spark 1.4 you can get access to items of
a row by name.

On Fri, May 15, 2015 at 10:48 AM, Edward Sargisson <[email protected]>
wrote:

> Hi all,
> This might be a question to be answered or feedback for a possibly new
> feature depending:
>
> We have source data which is events about the state changes of an entity
> (identified by an ID) represented as nested JSON.
> We wanted to sessionize this data so that we had a collection of all the
> events for a given ID as we have to do more processing based on what we
> find.
>
> We tried doing this using Spark SQL and then converting to a JavaPairRDD
> using DataFrame.javaRdd.groupByKey.
>
> The schema inference worked great but what was frustrating was that the
> result of groupByKey is <String, Iterable<Row>>. Rows only have get(int)
> methods and don't take notice of the schema stuff so they ended up being
> something we didn't want to work with.
>
> We are currently solving this problem by ignoring Spark SQL and
> deserializing the event JSON into a POJO for further processing.
>
> Are there better approaches to this?
> Perhaps Spark should have a DataFrame.groupByKey that returns Rows that
> can be used with the schema stuff?
>
> Thanks!
> Edward
>

Reply via email to