Perhaps you are looking for GROUP BY and collect_set, which would allow you to stay in SQL. I'll add that in Spark 1.4 you can get access to items of a row by name.
On Fri, May 15, 2015 at 10:48 AM, Edward Sargisson <[email protected]> wrote: > Hi all, > This might be a question to be answered or feedback for a possibly new > feature depending: > > We have source data which is events about the state changes of an entity > (identified by an ID) represented as nested JSON. > We wanted to sessionize this data so that we had a collection of all the > events for a given ID as we have to do more processing based on what we > find. > > We tried doing this using Spark SQL and then converting to a JavaPairRDD > using DataFrame.javaRdd.groupByKey. > > The schema inference worked great but what was frustrating was that the > result of groupByKey is <String, Iterable<Row>>. Rows only have get(int) > methods and don't take notice of the schema stuff so they ended up being > something we didn't want to work with. > > We are currently solving this problem by ignoring Spark SQL and > deserializing the event JSON into a POJO for further processing. > > Are there better approaches to this? > Perhaps Spark should have a DataFrame.groupByKey that returns Rows that > can be used with the schema stuff? > > Thanks! > Edward >
