DSE, with Solr integration, does provide “field input transformers” so that you 
can parse a column in JSON or any other format and then split it into any 
number of Solr fields, including dynamic fields, which would then let you query 
elements of that JSON.

-- Jack Krupansky

From: Alain RODRIGUEZ 
Sent: Tuesday, July 22, 2014 11:29 AM
To: user@cassandra.apache.org 
Subject: Re: JSON to Cassandra ?

Hi, 

This seems to fit, although I would need to look at how these fields can be 
queried and indexed. I would also need to see whether those UDTs can be 
modified once created and how they behave in this use case.

Yet 2.1 is currently in beta, and we won't switch to this version immediately 
(even if we could benefit from this and from the improved counters too), since 
we are using C* 1.2 and are giving DSE 4.5 a try. In both cases, we are far 
from using 2.1. How do people usually do this without UDTs?

Thanks for the pointer though, will probably help someday :-).



2014-07-22 16:30 GMT+02:00 Jack Krupansky <j...@basetechnology.com>:

  Sounds like user-defined types (UDT) in Cassandra 2.1:
  https://issues.apache.org/jira/browse/CASSANDRA-5590

  But... be careful to make sure that you aren’t using this powerful (and 
dangerous) feature as a crutch merely to avoid disciplined data modeling.

  -- Jack Krupansky

  From: Alain RODRIGUEZ 
  Sent: Tuesday, July 22, 2014 9:56 AM
  To: user@cassandra.apache.org 
  Subject: JSON to Cassandra ?

  Hi guys, I know this topic has already been discussed many times, and I have 
read a lot of these discussions. 

  Yet, I have not been able to find a good way to do what I want.

  We are receiving messages from our app, each of which is a complex, dynamic, 
nested JSON document (it can have a few attributes or thousands). The JSON is 
variable and can contain nested arrays or sub-JSONs.

  Please, consider this example:

  JSON

  {
      "struct-id": 141241321,
      "nested-1-1": {
          "value-1-1-1": "36d1f74d-1663-418d-8b1b-665bbb2d9ecb",
          "value-1-1-2": 5,
          "value-1-1-3": 0.5,
          "value-1-1-4": ["foo", "bar", "foobar"],
          "nested-2-1": {
              "test-2-1-1": "whatever",
              "test-2-1-2": 42
          }
      },
      "nested-1-2": {
          "value-1-2-1": [{
              "id": 1,
              "deeply-nested": {
                  "data-1": "test",
                  "data-2": 4023
              }
          },
          {
              "id": 2,
              "data-3": "that's enough data"
          }]
      }
  }

  We would like to store those messages in Cassandra and then run Spark jobs 
over them. Storing each message as text (the full JSON in one column) would 
work but would not be optimal: to count how many times "value-1-1-3" is 
greater than or equal to 1, I would have to read and parse every JSON document 
before answering. I have read a lot about people using composite columns and 
dynamic composite columns, but found no precise example. I am also aware of 
collections support, yet nested collections are not currently supported.
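
  To make the cost of the full-JSON-in-one-column approach concrete, here is a 
minimal Python sketch. It assumes the raw messages are already available as a 
list of JSON strings; the helper names get_path and count_at_least are 
hypothetical and only serve the illustration. Every message has to be parsed in 
full just to inspect one attribute.

```python
import json


def get_path(doc, path):
    """Follow a list of keys into nested dicts; return None if a key is missing."""
    node = doc
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def count_at_least(raw_messages, path, threshold):
    """Count messages whose numeric value at `path` is >= threshold.

    Each message is parsed completely, even though only one attribute is used.
    """
    count = 0
    for raw in raw_messages:
        value = get_path(json.loads(raw), path)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value >= threshold:
            count += 1
    return count
```

  With one typed column per attribute, the same question would only need to 
read that single column instead of every document.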

  I would like to have:

  - 1 column per attribute
  - typed values
  - something that would be able to parse and store any valid JSON (with nested 
arrays of JSON or whatever).
  - The most efficient model to use alongside Spark to query anything 
inside.

  What would be the possible CQL schemas to create such a data structure?

  What are the drawbacks of the following schema?

  Cassandra

  CREATE TABLE test-schema (
      struct-id int,
      nested-1-1#value-1-1-1 string,
      nested-1-1#value-1-1-2 int,
      nested-1-1#value-1-1-3 float,
      nested-1-1#value-1-1-4#array0 string,
      nested-1-1#value-1-1-4#array1 string,
      nested-1-1#value-1-1-4#array2 string,
      nested-1-1#nested-2-1#test-2-1-1 string,
      nested-1-1#nested-2-1#test-2-1-2 int,
      nested-1-2#value-1-2-1#array0#id int,
      nested-1-2#value-1-2-1#array0#deeply-nested#data-1 string,
      nested-1-2#value-1-2-1#array0#deeply-nested#data-2 int,
      nested-1-2#value-1-2-1#array1#id int,
      nested-1-2#value-1-2-1#array1#data-3 string,
      PRIMARY KEY (struct-id)
  )
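
  The column names in this schema follow a mechanical rule: join the path to 
each leaf with '#', and add an 'arrayN' segment for list elements. A rough 
Python sketch of that rule is below. The function names flatten and cql_type 
are hypothetical, and the type mapping (text instead of "string", int, float, 
boolean) is an assumption on my side, not something Cassandra derives itself.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into {column_name: scalar}.

    Path segments are joined with '#'; list elements get an 'array<index>'
    segment, mirroring the naming used in the schema above.
    """
    columns = {}
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}#{key}" if prefix else key
            columns.update(flatten(child, name))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            name = f"{prefix}#array{index}" if prefix else f"array{index}"
            columns.update(flatten(child, name))
    else:
        columns[prefix] = value
    return columns


def cql_type(value):
    """Guess a CQL type name for a JSON scalar (assumed mapping)."""
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "text"  # strings, and anything else such as null, fall back to text
```

  Note that names containing '-' or '#' would have to be double-quoted to be 
valid CQL identifiers.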

  I could use:

      nested-1-1#value-1-1-4 list<string>,


  instead of:

      nested-1-1#value-1-1-4#array0 string,
      nested-1-1#value-1-1-4#array1 string,
      nested-1-1#value-1-1-4#array2 string,

  yet it wouldn't work here:

      nested-1-2#value-1-2-1#array0#deeply-nested#data-1 string,
      nested-1-2#value-1-2-1#array0#deeply-nested#data-2 int,
      nested-1-2#value-1-2-1#array1#id int,
      nested-1-2#value-1-2-1#array1#data-3 string,

  since this is a nested structure inside the list.



  To create this schema, could the app logging these messages try to write to 
the corresponding column for each JSON attribute and, if the column is 
missing, catch the error, create the column and retry the write?

  This exception would occur only once for each new field, and each occurrence 
would modify the schema.
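
  The write, catch, create and retry loop could look roughly like the Python 
sketch below. It deliberately does not use a real Cassandra driver: 
InMemoryTable and UnknownColumn are hypothetical stand-ins so that only the 
control flow is shown. Against a real cluster, adding the column would be an 
ALTER TABLE ... ADD statement, and the error raised for an unknown column would 
depend on the driver.

```python
class UnknownColumn(Exception):
    """Raised by the stand-in table when a row names a column it does not have."""

    def __init__(self, column):
        super().__init__(column)
        self.column = column


class InMemoryTable:
    """Hypothetical stand-in for a Cassandra table; not a real driver API."""

    def __init__(self, columns):
        self.columns = set(columns)
        self.rows = []
        self.schema_changes = 0

    def add_column(self, column):
        self.columns.add(column)
        self.schema_changes += 1

    def insert(self, row):
        for column in row:
            if column not in self.columns:
                raise UnknownColumn(column)
        self.rows.append(dict(row))


def write_with_schema_evolution(table, row):
    """Insert `row`, adding one missing column per failed attempt, then retry."""
    # At most len(row) columns can be missing, so the loop is bounded.
    for _ in range(len(row) + 1):
        try:
            table.insert(row)
            return
        except UnknownColumn as err:
            table.add_column(err.column)
    raise RuntimeError("insert still failing after adding every column")
```

  In this sketch a schema change only happens the first time a field is seen; 
later writes with the same fields go straight through. With several writers 
hitting a real cluster, concurrent schema changes would need extra care.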

  Any thoughts that would help us (and probably other people)?

  Alain
