Just read up on composite keys and what looks like future deprecation of super column families.
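If I'm reading it right, the grouping becomes the partition key and the coordinate becomes the clustering part of the composite. A rough CQL3 sketch of what I mean (untested; the table and column names are made up, and I'm using 'grp' since GROUP is reserved):

CREATE TABLE node_meta (
    grp  int,     -- grid cell id (partition key, replaces the super column row key)
    x    double,  -- coordinate, first clustering component
    y    double,  -- coordinate, second clustering component
    s1   text,    -- the 4 UTF8Type meta columns, ~20 bytes each
    s2   text,
    s3   text,
    s4   text,
    d1   double,  -- the 4 DoubleType meta columns
    d2   double,
    d3   double,
    d4   double,
    PRIMARY KEY (grp, x, y)  -- rows within a cell stored contiguously, sorted by (x, y)
);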
I guess Option 2 would now be:
- column family with composite key from grouping and location

e.g.
'0:0,0'   : { meta }
...
'0:10,10' : { meta }
'1:10,0'  : { meta }
...
'1:20,10' : { meta }

The reads near a cell boundary would then hit up to 4 of these partitions (rough CQL sketch at the bottom, after the quoted mail).

On Jun 29, 2012, at 5:13 PM, Peter Hsu wrote:

> I have a question on what the best way is to store the data in my schema.
>
> The data
> I have millions of nodes, each with a different cartesian coordinate. The
> keys for the nodes are hashed based on the coordinate.
>
> My search is a proximity search. I'd like to find all the nodes within a
> given distance from a particular node. I can create an arbitrary grouping
> that groups an arbitrary number of nodes together, based on proximity...
>
> e.g.
> group 0 contains all points from (0,0) to (10,10)
> group 1 contains all points from (10,0) to (20,10).
>
> For each coordinate, I store various meta data:
> 8 columns: 4 UTF8Type (~20 bytes each), 4 DoubleType
>
> The query
> I need a proximity search to return all data within a range from a selected
> node. The typical read size is ~100 distinct rows (e.g. a 10x10 grid around
> the selected node). Since it's on a coordinate system, I know ahead of time
> exactly which 100 rows I need.
>
> The modeling options
>
> Option 1:
> - single column family, with key being the coordinate hash
>
> e.g.
> '0,0' : { meta }
> '0,1' : { meta }
> ...
> '10,20' : { meta }
>
> - query for 100 rows in parallel
> - I think this option sucks because it's essentially 100 non-sequential
> reads??
>
> Option 2:
> - group my data into super columns, with key being the grouping
>
> e.g.
> '0' {
>   '0,0' : { meta }
>   ...
>   '10,10' : { meta }
> }
> '1' {
>   '10,0' : { meta }
>   ...
>   '20,10' : { meta }
> }
>
> - query by the appropriate grouping
> - since I can't guarantee the query won't fall near the boundary of a
> grouping, I'm looking at querying up to 4 different super column rows for
> each query
> - this seems reasonable, since I'm doing bulk sequential reads, but have
> some overhead in terms of pre-filtering and post-filtering
> - sucks in terms of flexibility for modifying the size of the proximity
> search
>
> Option 3:
> - create a secondary index based on the grouping
>
> e.g.
> '0,0' : { meta, group='0' }
> '0,1' : { meta, group='0' }
> ...
> '10,20' : { meta, group='1' }
>
> - query by secondary index
> - same as above, will return some extra data, and will need to do filtering
> - no idea how Cassandra stores this data internally, but will the data
> access here be sequential?
> - a little more flexible in terms of proximity search - can create multiple
> grouping types based on the size of the search
>
> Option 4:
> - composite queries??
> - I haven't had time to read up too much on this, so I'm not sure if it
> would help for my use case or not.
>
> Questions
> - I know there are pros and cons to each approach wrt flexibility of my
> search size, but assuming my search proximity size is fixed, which method
> provides the optimal performance?
> - I guess the main question is: will querying by secondary index be
> efficient enough, or is it worth it to group the data into super columns?
> - Is there a better way I haven't thought of to model the data?
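And the read side with that layout, for completeness (again an untested sketch with the same made-up names; the cell ids assume a row-major numbering of the grid, and a cell width matching the 10x10 search box):

-- one round trip for the up-to-4 cells that can overlap the search box
SELECT * FROM node_meta WHERE grp IN (0, 1, 10, 11);

-- or per cell, narrowing on the first clustering column;
-- the y bound still has to be filtered client-side
SELECT * FROM node_meta WHERE grp = 0 AND x >= 5 AND x <= 10;

Each cell's rows are contiguous on disk, so this should keep the bulk-sequential-read property of the super column version without the extra filtering overhead of a secondary index.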