I have a question about the best way to store my data in my schema. The data is millions of nodes, each with a distinct cartesian coordinate. The row keys for the nodes are hashed from the coordinates.

My search is a proximity search: I'd like to find all the nodes within a given distance of a particular node. I can create an arbitrary grouping that puts an arbitrary number of nodes together based on proximity, e.g. group 0 contains all points from (0,0) to (10,10), group 1 contains all points from (10,0) to (20,10). For each coordinate I store metadata in 8 columns: 4 UTF8Type (~20 bytes each) and 4 DoubleType.
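To make that concrete, here is a minimal sketch of the key scheme I have in mind. The helper names, the 'gx:gy' group numbering, and the 10x10 cell size are just illustrations for this post, not actual code:

    GROUP_SIZE = 10  # cell edge length; illustrative

    def row_key(x, y):
        # One row per node, keyed by its coordinate, e.g. '0,0' or '10,20'.
        return '%d,%d' % (x, y)

    def group_key(x, y):
        # One group per 10x10 cell. Above I wrote group keys as '0' and '1';
        # 'gx:gy' is just a hypothetical way to number the cells in 2D.
        return '%d:%d' % (x // GROUP_SIZE, y // GROUP_SIZE)

    def groups_for_window(x0, y0, x1, y1):
        # All group keys overlapping the inclusive window (x0,y0)-(x1,y1).
        # A 10x10 window straddling cell boundaries touches up to 4 cells.
        keys = set()
        for gx in range(x0 // GROUP_SIZE, x1 // GROUP_SIZE + 1):
            for gy in range(y0 // GROUP_SIZE, y1 // GROUP_SIZE + 1):
                keys.add('%d:%d' % (gx, gy))
        return keys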
The query

I need a proximity search that returns all data within a range of a selected node. The typical read is ~100 distinct rows (e.g. a 10x10 grid around the selected node). Since it's on a coordinate system, I know ahead of time exactly which 100 rows I need.

The modeling options

Option 1:
- a single column family, with the key being the coordinate hash, e.g.
  '0,0' : { meta }
  '0,1' : { meta }
  ...
  '10,20' : { meta }
- query for the 100 rows in parallel (sketched at the end of this post)
- I think this option sucks because it's essentially 100 non-sequential reads??

Option 2:
- group my data into super columns, with the key being the grouping, e.g.
  '0' { '0,0' : { meta } ... '10,10' : { meta } }
  '1' { '10,0' : { meta } ... '20,10' : { meta } }
- query by the appropriate grouping (second sketch at the end of this post)
- since I can't guarantee the query won't fall near the boundary of a grouping, I'm looking at querying up to 4 different super column rows per query
- this seems reasonable, since I'm doing bulk sequential reads, but there is some overhead in pre-filtering and post-filtering
- sucks in terms of flexibility for changing the size of the proximity search

Option 3:
- create a secondary index based on the grouping, e.g.
  '0,0' : { meta, group='0' }
  '0,1' : { meta, group='0' }
  ...
  '10,20' : { meta, group='1' }
- query by the secondary index
- same as above: it will return some extra data, and I will need to do filtering
- I have no idea how Cassandra stores this data internally, but will the data access here be sequential?
- a little more flexible in terms of the proximity search: I can create multiple grouping types based on the size of the search

Option 4:
- composite queries?? I haven't had time to read up much on these, so I'm not sure whether they would help my use case or not.

Questions

- I know there are pros and cons to each approach with respect to the flexibility of my search size, but assuming my search proximity size is fixed, which method provides the optimal performance?
- I guess the main question is: will querying by secondary index be efficient enough, or is it worth it to group the data into super columns?
- Is there a better way I haven't thought of to model the data?
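For reference, here's roughly what I imagine the Option 1 read path looking like. This is a sketch using pycassa; the keyspace 'MyKeyspace' and column family 'Nodes' are made-up names, and the key format is the one from my examples above:

    import pycassa

    # Assumed names: keyspace 'MyKeyspace', column family 'Nodes' holding
    # one row per coordinate, keyed 'x,y' as in Option 1 above.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    nodes = pycassa.ColumnFamily(pool, 'Nodes')

    def proximity_read(cx, cy, radius=5):
        # Fetch the (2*radius) x (2*radius) grid around (cx, cy), ~100 rows.
        # multiget batches the lookups, but with a random partitioner each
        # key still lands at an arbitrary position, i.e. non-sequential reads.
        keys = ['%d,%d' % (x, y)
                for x in range(cx - radius, cx + radius)
                for y in range(cy - radius, cy + radius)]
        return nodes.multiget(keys)  # maps key -> {column: value}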
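And the Option 2 read path as I picture it: fetch the 1-4 group rows overlapping the window (using the hypothetical groups_for_window() from the sketch near the top of this post), then post-filter the subcolumns down to the coordinates actually inside the window. Again, 'NodeGroups' is a made-up name:

    # Assumed super column family 'NodeGroups': row key = group key,
    # super column name = 'x,y', subcolumns = the 8 metadata columns.
    groups = pycassa.ColumnFamily(pool, 'NodeGroups')

    def proximity_read_grouped(cx, cy, radius=5):
        x0, y0 = cx - radius, cy - radius
        x1, y1 = cx + radius - 1, cy + radius - 1
        rows = groups.multiget(list(groups_for_window(x0, y0, x1, y1)))
        # Post-filter: each group row is one bulk read, but it can contain
        # coordinates that fall outside the query window.
        result = {}
        for row in rows.values():
            for coord, meta in row.items():
                x, y = map(int, coord.split(','))
                if x0 <= x <= x1 and y0 <= y <= y1:
                    result[coord] = meta
        return result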