Just read up on composite keys and what looks like future deprecation of super column families.
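If I'm reading it right, the grouping becomes the partition key and the coordinate becomes the clustering part of the composite. A rough CQL3 sketch of what I mean (untested; the table and column names are made up, and I'm using 'grp' since GROUP is reserved):

CREATE TABLE node_meta (
    grp  int,     -- grid cell id (partition key, replaces the super column row key)
    x    double,  -- coordinate, first clustering component
    y    double,  -- coordinate, second clustering component
    s1   text,    -- the 4 UTF8Type meta columns, ~20 bytes each
    s2   text,
    s3   text,
    s4   text,
    d1   double,  -- the 4 DoubleType meta columns
    d2   double,
    d3   double,
    d4   double,
    PRIMARY KEY (grp, x, y)  -- rows within a cell stored contiguously, sorted by (x, y)
);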
I guess Option 2 would now be:
- column family with composite key from grouping and location

e.g.
'0:0,0'   : { meta }
...
'0:10,10' : { meta }
'1:10,0'  : { meta }
...
'1:20,10' : { meta }

The reads near a cell boundary would then hit up to 4 of these partitions (rough CQL sketch at the bottom, after the quoted mail).

On Jun 29, 2012, at 5:13 PM, Peter Hsu wrote:

> I have a question on what the best way is to store the data in my schema.
>
> The data
> I have millions of nodes, each with a different cartesian coordinate. The
> keys for the nodes are hashed based on the coordinate.
>
> My search is a proximity search. I'd like to find all the nodes within a
> given distance from a particular node. I can create an arbitrary grouping
> that groups an arbitrary number of nodes together, based on proximity...
>
> e.g.
> group 0 contains all points from (0,0) to (10,10)
> group 1 contains all points from (10,0) to (20,10).
>
> For each coordinate, I store various meta data:
> 8 columns: 4 UTF8Type (~20 bytes each), 4 DoubleType
>
> The query
> I need a proximity search to return all data within a range from a selected
> node. The typical read size is ~100 distinct rows (e.g. a 10x10 grid around
> the selected node). Since it's on a coordinate system, I know ahead of time
> exactly which 100 rows I need.
>
> The modeling options
>
> Option 1:
> - single column family, with key being the coordinate hash
>
> e.g.
> '0,0' : { meta }
> '0,1' : { meta }
> ...
> '10,20' : { meta }
>
> - query for 100 rows in parallel
> - I think this option sucks because it's essentially 100 non-sequential
> reads??
>
> Option 2:
> - group my data into super columns, with key being the grouping
>
> e.g.
> '0' {
>   '0,0' : { meta }
>   ...
>   '10,10' : { meta }
> }
> '1' {
>   '10,0' : { meta }
>   ...
>   '20,10' : { meta }
> }
>
> - query by the appropriate grouping
> - since I can't guarantee the query won't fall near the boundary of a
> grouping, I'm looking at querying up to 4 different super column rows for
> each query
> - this seems reasonable, since I'm doing bulk sequential reads, but have
> some overhead in terms of pre-filtering and post-filtering
> - sucks in terms of flexibility for modifying the size of the proximity
> search
>
> Option 3:
> - create a secondary index based on the grouping
>
> e.g.
> '0,0' : { meta, group='0' }
> '0,1' : { meta, group='0' }
> ...
> '10,20' : { meta, group='1' }
>
> - query by secondary index
> - same as above, will return some extra data, and will need to do filtering
> - no idea how Cassandra stores this data internally, but will the data
> access here be sequential?
> - a little more flexible in terms of proximity search - can create multiple
> grouping types based on the size of the search
>
> Option 4:
> - composite queries??
> - I haven't had time to read up too much on this, so I'm not sure if it
> would help for my use case or not.
>
> Questions
> - I know there are pros and cons to each approach wrt flexibility of my
> search size, but assuming my search proximity size is fixed, which method
> provides the optimal performance?
> - I guess the main question is: will querying by secondary index be
> efficient enough, or is it worth it to group the data into super columns?
> - Is there a better way I haven't thought of to model the data?
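And the read side with that layout, for completeness (again an untested sketch with the same made-up names; the cell ids assume a row-major numbering of the grid, and a cell width matching the 10x10 search box):

-- one round trip for the up-to-4 cells that can overlap the search box
SELECT * FROM node_meta WHERE grp IN (0, 1, 10, 11);

-- or per cell, narrowing on the first clustering column;
-- the y bound still has to be filtered client-side
SELECT * FROM node_meta WHERE grp = 0 AND x >= 5 AND x <= 10;

Each cell's rows are contiguous on disk, so this should keep the bulk-sequential-read property of the super column version without the extra filtering overhead of a secondary index.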