I have a question about the best way to store this data in my schema.

The data
I have millions of nodes, each at a distinct Cartesian coordinate.  The keys
for the nodes are hashed from the coordinate.

My search is a proximity search.  I'd like to find all the nodes within a given
distance of a particular node.  I can define a grouping that buckets an
arbitrary number of nodes together, based on proximity…

e.g.
 group 0 contains all points from (0,0) to (10,10)
 group 1 contains all points from (10,0) to (20,10)
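For concreteness, a minimal Python sketch of that mapping (the 10-unit cell
size comes from the example above; cells_per_row is a made-up layout
parameter, not something I've settled on):

    CELL = 10

    def group_id(x, y, cell=CELL, cells_per_row=1000):
        # row-major cell numbering: group 0 is (0,0)-(10,10),
        # group 1 is (10,0)-(20,10), matching the example above
        return (int(y) // cell) * cells_per_row + (int(x) // cell)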

For each coordinate, I store various metadata:
 8 columns: 4 UTF8Type (~20 bytes each), 4 DoubleType
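In Python terms, the per-row payload is roughly this (field names are
placeholders, not my real columns):

    from dataclasses import dataclass

    @dataclass
    class NodeMeta:
        # 4 UTF8Type columns, ~20 bytes each
        tag1: str
        tag2: str
        tag3: str
        tag4: str
        # 4 DoubleType columns
        v1: float
        v2: float
        v3: float
        v4: float

So each row is on the order of ~110 bytes of payload, before per-column
overhead.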

The query
I need a proximity search that returns all data within a given range of a
selected node.  The typical read is ~100 distinct rows (e.g. a 10x10 grid
around the selected node).  Since it's on a coordinate system, I know ahead of
time exactly which 100 rows I need.
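Enumerating those keys is trivial; a sketch, assuming the row key is the
string 'x,y' as in the examples below:

    def keys_around(x, y, radius=5):
        # the 100 row keys of the 10x10 grid centred on (x, y)
        return ['%d,%d' % (x + dx, y + dy)
                for dy in range(-radius, radius)
                for dx in range(-radius, radius)]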

The modeling options

Option 1:
 - single column family, with key being the coordinate hash

e.g.
'0,0' : { meta }
'0,1' : { meta }
…
'10, 20' : { meta }

 - query for the 100 rows in parallel (rough sketch below)

 - I think this option sucks because it's essentially 100 non-sequential reads?
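By "in parallel" I mean roughly this (fetch_row is a hypothetical stand-in
for whatever single-row read the client library provides):

    from concurrent.futures import ThreadPoolExecutor

    def fetch_row(key):
        # hypothetical stand-in for one row read via the client;
        # replace with the real call
        raise NotImplementedError

    def fetch_rows_parallel(keys, workers=16):
        # issue one read per key concurrently and collect the results
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(zip(keys, pool.map(fetch_row, keys)))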

Option 2:
 - group my data into super columns, with key being the grouping

e.g.
'0' : {
  '0, 0'   : { meta }
  ...
  '10, 10' : { meta }
}
'1' : {
  '10, 0'  : { meta }
  ...
  '20, 10' : { meta }
}


 - query by the appropriate grouping
 - since I can't guarantee the query won't fall near the boundary of a
grouping, I'm looking at querying up to 4 different super column rows per
query (sketched below)
 - this seems reasonable, since I'm doing bulk sequential reads, but there is
some overhead from pre-filtering and post-filtering
 - sucks in terms of flexibility: changing the size of the proximity search
means regrouping the data

Option 3:
 - create a secondary index based on the grouping

e.g.
'0,0' : { meta, group='0' }
'0,1' : { meta, group='0' }
…
'10, 20' : { meta, group='1' }

 - query by the secondary index
 - same as above: this will return some extra data, and I'll need to do
post-filtering (sketch below)
 - no idea how Cassandra stores this data internally, but will the data access
here be sequential?
 - a little more flexible in terms of proximity search: I can create multiple
grouping types based on the size of the search
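The post-filtering itself should be cheap either way; a sketch, recovering
each point from its 'x,y' key:

    import math

    def within_distance(rows, cx, cy, max_dist):
        # rows: dict of key -> meta; keep only points within
        # max_dist of the centre (cx, cy)
        kept = {}
        for key, meta in rows.items():
            x, y = (float(p) for p in key.split(','))
            if math.hypot(x - cx, y - cy) <= max_dist:
                kept[key] = meta
        return kept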

Option 4:
 - composite queries?
 - I haven't had time to read up much on this, so I'm not sure whether it
would help my use case (my rough understanding is sketched below).
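My rough understanding of the idea (an assumption on my part, not something
I've verified): one row per group, with the coordinate encoded as a composite
column name, so a box query becomes a handful of contiguous column slices.
Emulating just the ordering in Python:

    from bisect import bisect_left, bisect_right

    def slice_columns(columns, start, end):
        # columns: list of ((x, y), meta) sorted by the composite
        # name (x, y); return the contiguous run between start and
        # end inclusive, the way a column slice would
        names = [name for name, _ in columns]
        return columns[bisect_left(names, start):bisect_right(names, end)]

One caveat I can see: a single slice from (x0, y0) to (x1, y1) includes every
y for the x values in between, so an exact box would take one short slice per
x value rather than one slice total.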

Questions
 - I know there are pros and cons to each approach w.r.t. flexibility of the
search size, but assuming my search proximity is fixed, which method gives the
best performance?
 - I guess the main question is: will querying by secondary index be efficient
enough, or is it worth grouping the data into super columns?
 - Is there a better way to model this data that I haven't thought of?

