I'm trying to figure out the best way to store items for query based on 
multiple dimensions.  I've got a large volume (many 100s of millions per day) 
of time-ordered objects with 10+ properties each that I need to support 
arbitrary query expressions on.  So I may need to support a query based on a 
segment of time plus an expression like "A == 'foo' and B == 'bar' and C == 
'baz'" etc...  Any pointers?

For simple time-ordered retrieval I was going to have a set of time buckets 
used as row keys, something like YYYY-MM-DD-HH, and with an extra character or 
two to reduce hotspots (probably take a hash of the object and use the first 
byte of the hash in hex), so a row key might look like:

   2012-02-16-09:a

This way I'm spreading writes for that hour across 16 rows.  Then the column 
name would be a TimeUUID or some time-based value, and the column value would 
be the object.  This lets me easily slice out segments of time, and lets me 
write data really well.

However if I need to satisfy a query for items matching some expression during 
the day, I have to scan a *lot* of records.  I can require some property to 
always be present in the query, and I can base the above extra byte in the row 
key, so when I scan records I can cut down the number of row keys read by a 
factor of 16, but that's still a huge amount of data to just scan through.

One obvious choice here are secondary indexes, but that implies "short" rows 
that can't be time sliced as easily, and I don't know that have a bunch of 
secondary indexes will scale very well (or support range queries).


Any ideas on a way to structure data for easy queries like this?


Thanks,

-nate



Nate Sammons | Sr. Technical Specialist | FTEN, A NASDAQ OMX Company
Office: +1.720.889.5141 | Email: nsamm...@ften.com<mailto:nsamm...@ften.com>
Aggregation.  Transparency.  Control. (tm)  | www.FTEN.com<http://ften.com/>

Reply via email to