Hey Reynold,

Thanks for the suggestion. Perhaps a better way to describe what I mean by
a "recursive" data structure is something resembling (in Scala) the type
Map[String, Any]. With a type like this, the keys are well-defined as
strings (since this is JSON), but the values can be essentially arbitrary,
including another Map[String, Any].
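
To make that concrete, a single record might deserialize to something
roughly like this (just a sketch, with made-up values):

val record: Map[String, Any] = Map(
  "timestamp" -> "2015-01-01T00:00:00Z",
  "data" -> Map(  // another Map[String, Any] nested inside
    "event" -> "purchase",
    "params" -> Map("arbitrary-param-1" -> "blah")
  )
)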

For example, in the below "stream" of JSON records:

{
  "timestamp": "2015-01-01T00:00:00Z",
  "data": {
    "event": "click",
    "url": "http://mywebsite.com";
  }
}
...
{
  "timestamp": "2015-01-01T08:00:00Z",
  "data": {
    "event": "purchase",
    "sku": "123456789",
    "quantity": 1,
    "params": {
      "arbitrary-param-1": "blah",
      "arbitrary-param-2": 123456
    }
  }
}

I am trying to figure out a way to run Spark SQL over the above JSON
records. My inclination would be to define the "timestamp" field as a
well-defined DateType, but the "data" field is much more free-form.
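
As a concrete example of where I get stuck, the closest schema I can define
today looks roughly like this (a sketch, assuming the RDD[Row] is built
elsewhere): "data" has to bottom out in some concrete type, e.g. a flat
string-to-string map, which loses the nested "params" structure:

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  // the well-defined field (TimestampType here, or DateType)
  StructField("timestamp", TimestampType, nullable = false),
  // the free-form field, forced into a flat map of strings
  StructField("data", MapType(StringType, StringType), nullable = true)
))
// sqlContext.applySchema(rowRdd, schema)  // rowRdd: RDD[Row]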

Also, any pointers on where to look for how data types are evaluated and
serialized/deserialized would be super helpful as well.
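
For what it's worth, if I understand the fixed-depth expansion you
described, it would look roughly like the below (untested sketch; the
per-level shape is just a guess on my part):

import org.apache.spark.sql.types._

// Expand the recursive "a value is either a scalar or another map" shape
// down to a fixed depth, bottoming out in a flat string-to-string map.
def valueType(depth: Int): DataType =
  if (depth <= 0) MapType(StringType, StringType)
  else MapType(StringType, StructType(Seq(
    StructField("scalar", StringType, nullable = true),
    StructField("nested", valueType(depth - 1), nullable = true)
  )))

val dataSchema = valueType(3)  // user-chosen maximum depth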

Thanks



On Thu, May 28, 2015 at 12:30 AM Reynold Xin <r...@databricks.com> wrote:

> I think it is fairly hard to support recursive data types. What I've seen
> in one other proprietary system in the past is to let the user define the
> depth of the nested data types, and then just expand the struct/map/list
> definition to the maximum level of depth.
>
> Would this solve your problem?
>
>
>
>
> On Wed, May 20, 2015 at 6:07 PM, Jeremy Lucas <jeremyalu...@gmail.com>
> wrote:
>
>> Hey Rakesh,
>>
>> To clarify, what I was referring to is when doing something like this:
>>
>> sqlContext.applySchema(rdd, mySchema)
>>
>> mySchema must be a well-defined StructType, which presently does not
>> allow for a recursive type.
>>
>>
>> On Wed, May 20, 2015 at 5:39 PM Rakesh Chalasani <vnit.rak...@gmail.com>
>> wrote:
>>
>>> Hi Jeremy:
>>>
>>> Row is a collection of 'Any', so it can be used as a recursive data type.
>>> Is this what you were looking for?
>>>
>>> Example:
>>> val x = sc.parallelize(Array.range(0,10)).map(x => Row(Row(x),
>>> Row(x.toString)))
>>>
>>> Rakesh
>>>
>>>
>>>
>>> On Wed, May 20, 2015 at 7:23 PM Jeremy Lucas <jeremyalu...@gmail.com>
>>> wrote:
>>>
>>>> Spark SQL has proven to be quite useful in applying a partial schema to
>>>> large JSON logs and being able to write plain SQL to perform a wide variety
>>>> of operations over this data. However, one small thing that keeps coming
>>>> back to haunt me is the lack of support for recursive data types, whereby a
>>>> member of a complex/struct value can be of the same type as the
>>>> complex/struct value itself.
>>>>
>>>> I am hoping someone may be able to point me in the right direction of
>>>> where to start to build out such capabilities, as I'd be happy to
>>>> contribute, but am very new to this particular component of the Spark
>>>> project.
>>>>
>>>
>
