I'm trying to figure out the best way to store items for query based on multiple dimensions. I've got a large volume (many 100s of millions per day) of time-ordered objects with 10+ properties each that I need to support arbitrary query expressions on. So I may need to support a query based on a segment of time plus an expression like "A == 'foo' and B == 'bar' and C == 'baz'" etc... Any pointers?
For simple time-ordered retrieval I was going to have a set of time buckets used as row keys, something like YYYY-MM-DD-HH, and with an extra character or two to reduce hotspots (probably take a hash of the object and use the first byte of the hash in hex), so a row key might look like: 2012-02-16-09:a This way I'm spreading writes for that hour across 16 rows. Then the column name would be a TimeUUID or some time-based value, and the column value would be the object. This lets me easily slice out segments of time, and lets me write data really well. However if I need to satisfy a query for items matching some expression during the day, I have to scan a *lot* of records. I can require some property to always be present in the query, and I can base the above extra byte in the row key, so when I scan records I can cut down the number of row keys read by a factor of 16, but that's still a huge amount of data to just scan through. One obvious choice here are secondary indexes, but that implies "short" rows that can't be time sliced as easily, and I don't know that have a bunch of secondary indexes will scale very well (or support range queries). Any ideas on a way to structure data for easy queries like this? Thanks, -nate Nate Sammons | Sr. Technical Specialist | FTEN, A NASDAQ OMX Company Office: +1.720.889.5141 | Email: nsamm...@ften.com<mailto:nsamm...@ften.com> Aggregation. Transparency. Control. (tm) | www.FTEN.com<http://ften.com/>