Hey Reynold,

Thanks for the suggestion. Perhaps a better way to describe what I mean by a
"recursive" data structure is something that resembles (in Scala) the type
Map[String, Any]. With a type like this, the keys are well-defined as strings
(since this is JSON), but the values can be essentially anything, including
another Map[String, Any].
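Roughly, each record ends up looking something like this in Scala (the values
here are made up, just mirroring the JSON below):

val record: Map[String, Any] = Map(
  "timestamp" -> "2015-01-01T08:00:00Z",
  "data" -> Map(
    "event" -> "purchase",
    "sku" -> "123456789",
    "quantity" -> 1,
    "params" -> Map(
      "arbitrary-param-1" -> "blah",
      "arbitrary-param-2" -> 123456)))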
For example, in the below "stream" of JSON records:

{
  "timestamp": "2015-01-01T00:00:00Z",
  "data": {
    "event": "click",
    "url": "http://mywebsite.com"
  }
}
...
{
  "timestamp": "2015-01-01T08:00:00Z",
  "data": {
    "event": "purchase",
    "sku": "123456789",
    "quantity": 1,
    "params": {
      "arbitrary-param-1": "blah",
      "arbitrary-param-2": 123456
    }
  }
}

I am trying to figure out a way to run Spark SQL over the above JSON records.
My inclination would be to define the "timestamp" field as a well-defined
DateType, but the "data" field is far more free-form.

Also, any pointers on where to look for how data types are evaluated and
serialized/deserialized would be super helpful.

Thanks

On Thu, May 28, 2015 at 12:30 AM Reynold Xin <r...@databricks.com> wrote:

> I think it is fairly hard to support recursive data types. What I've seen
> in one other proprietary system in the past is to let the user define the
> depth of the nested data types, and then just expand the struct/map/list
> definition to that maximum depth.
>
> Would this solve your problem?
>
> On Wed, May 20, 2015 at 6:07 PM, Jeremy Lucas <jeremyalu...@gmail.com>
> wrote:
>
>> Hey Rakesh,
>>
>> To clarify, what I was referring to is when doing something like this:
>>
>> sqlContext.applySchema(rdd, mySchema)
>>
>> mySchema must be a well-defined StructType, which presently does not
>> allow for a recursive type.
>>
>> On Wed, May 20, 2015 at 5:39 PM Rakesh Chalasani <vnit.rak...@gmail.com>
>> wrote:
>>
>>> Hi Jeremy:
>>>
>>> Row is a collection of 'Any', so it can be used as a recursive data
>>> type. Is this what you were looking for?
>>>
>>> Example:
>>> val x = sc.parallelize(Array.range(0, 10)).map(x =>
>>>   Row(Row(x), Row(x.toString)))
>>>
>>> Rakesh
>>>
>>> On Wed, May 20, 2015 at 7:23 PM Jeremy Lucas <jeremyalu...@gmail.com>
>>> wrote:
>>>
>>>> Spark SQL has proven to be quite useful in applying a partial schema
>>>> to large JSON logs and being able to write plain SQL to perform a wide
>>>> variety of operations over this data. However, one small thing that
>>>> keeps coming back to haunt me is the lack of support for recursive data
>>>> types, whereby a member of a complex/struct value can be of the same
>>>> type as the complex/struct value itself.
>>>>
>>>> I am hoping someone may be able to point me in the right direction of
>>>> where to start to build out such capabilities, as I'd be happy to
>>>> contribute, but am very new to this particular component of the Spark
>>>> project.
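P.S. In case it helps make the question concrete, below is a minimal sketch of
one interim approach (distinct from the depth-expansion idea above): keep the
free-form "data" subtree as a plain JSON string in the schema and reach into it
at query time. It assumes sqlContext is a HiveContext (so the Hive
get_json_object UDF is available) and uses made-up inline rows:

import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch only: "data" is kept as a raw JSON string, since an
// arbitrarily-nested / recursive struct can't be expressed in a StructType.
val mySchema = StructType(Seq(
  StructField("timestamp", TimestampType, nullable = false),
  StructField("data", StringType, nullable = true)))

// Made-up RDD[Row]; in practice this would come from parsing each log line
// and re-serializing everything under "data" back into a JSON string.
val rows = sc.parallelize(Seq(
  Row(Timestamp.valueOf("2015-01-01 00:00:00"),
      """{"event":"click","url":"http://mywebsite.com"}""")))

val events = sqlContext.applySchema(rows, mySchema)
events.registerTempTable("events")

// With a HiveContext, the Hive get_json_object UDF can pull fields
// back out of the raw string at query time.
sqlContext.sql(
  "SELECT `timestamp`, get_json_object(data, '$.event') AS event FROM events")

The obvious downside is that everything under "data" loses its typing, which is
exactly why first-class support for a more open-ended value type would be so
useful.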