Hi all, this may be either a question or a feature request, depending on the answer:
Our source data consists of events describing the state changes of an entity (identified by an ID), represented as nested JSON. We want to sessionize this data, that is, collect all the events for a given ID, because the further processing we need to do depends on what those events contain.

We tried doing this with Spark SQL, converting the result to a JavaPairRDD via DataFrame.javaRDD() and groupByKey. The schema inference worked great, but frustratingly the result of groupByKey is <String, Iterable<Row>>. Row only exposes positional accessors such as get(int) and makes no use of the inferred schema, so the grouped rows were not something we wanted to work with.

We are currently working around this by bypassing Spark SQL and deserializing the event JSON into a POJO for further processing.

Are there better approaches to this? Perhaps Spark should have a DataFrame.groupByKey that returns Rows that can be accessed through the schema?

Thanks!
Edward
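To make the ask more concrete, here is a rough, Spark-free sketch of the kind of access we were hoping for after grouping: rows that can be read by field name instead of by position. Everything in it is hypothetical and only for illustration: `PositionalRow` stands in for a row with index-only access, `NamedRow`, `indexFields`, and `groupById` are made-up helpers, and the field names "id" and "state" are assumed. None of this is Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SessionizeSketch {

    /** Hypothetical stand-in for a row whose values are reachable only by index. */
    static final class PositionalRow {
        private final Object[] values;

        PositionalRow(Object... values) {
            this.values = values;
        }

        Object get(int i) {
            return values[i];
        }
    }

    /** Hypothetical wrapper that pairs a row with a name-to-index map (the "schema"). */
    static final class NamedRow {
        private final Map<String, Integer> indexByName;
        private final PositionalRow row;

        NamedRow(Map<String, Integer> indexByName, PositionalRow row) {
            this.indexByName = indexByName;
            this.row = row;
        }

        /** Looks a field up by name; fails loudly on an unknown name. */
        Object get(String name) {
            Integer i = indexByName.get(name);
            if (i == null) {
                throw new IllegalArgumentException("Unknown field: " + name);
            }
            return row.get(i);
        }
    }

    /** Builds the name-to-index map from field names given in column order. */
    static Map<String, Integer> indexFields(String... fieldNames) {
        Map<String, Integer> indexByName = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            indexByName.put(fieldNames[i], i);
        }
        return indexByName;
    }

    /** Groups rows by the value of idField, keeping name-based access on each row. */
    static Map<String, List<NamedRow>> groupById(
            List<PositionalRow> rows, Map<String, Integer> schema, String idField) {
        Map<String, List<NamedRow>> sessions = new LinkedHashMap<>();
        for (PositionalRow row : rows) {
            NamedRow named = new NamedRow(schema, row);
            String id = String.valueOf(named.get(idField));
            sessions.computeIfAbsent(id, k -> new ArrayList<>()).add(named);
        }
        return sessions;
    }

    public static void main(String[] args) {
        Map<String, Integer> schema = indexFields("id", "state"); // assumed field names
        List<PositionalRow> rows = Arrays.asList(
                new PositionalRow("a", "CREATED"),
                new PositionalRow("b", "CREATED"),
                new PositionalRow("a", "CLOSED"));

        // Each session is the list of events for one ID, readable by field name.
        Map<String, List<NamedRow>> sessions = groupById(rows, schema, "id");
        for (Map.Entry<String, List<NamedRow>> session : sessions.entrySet()) {
            for (NamedRow event : session.getValue()) {
                System.out.println(session.getKey() + " -> " + event.get("state"));
            }
        }
    }
}
```

The point is only that the grouped values keep a reference to the field names, so downstream code can write get("state") rather than get(1). Whether something like this belongs in Spark itself or in user code is exactly the question.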
