Yep, I'll probably try both. I don't think there is anything out there that can beat an in-memory DB in terms of bulk throughput (e.g. http://cs.nyu.edu/cs/faculty/shasha/papers/sigmodpap.pdf), but we'll see how far we can get with open source tools and a combination of persistent storage and caching/pre-fetching. When you have half a terabyte of RAM on a 20-node cluster, there must be a way to utilize it.
I'm also evaluating HBase, so hopefully I will have some benchmark results to show. If anyone on this list is using Cassandra or HBase for time series indexing, I'd be happy to hear from you and share our findings.

Thanks
Alex
http://www.linkedin.com/in/alexkamil

On Wed, Mar 17, 2010 at 9:52 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> I guess if you are going to read the full 5MB at once then that makes
> more sense.
>
> But if you are going to slice it or access parts by column name then
> the other does.
>
> On Tue, Mar 16, 2010 at 12:15 PM, alex kamil <alex.ka...@gmail.com> wrote:
> > which index structure would fit Cassandra more naturally and perform better:
> > 1) a sparse index where in each row there are 100 columns each containing a
> > 5MB data block (under a single column family)
> > or
> > 2) a dense index where each row contains 100 columns with a single 6bytes
> > value (under a single column family)
> >
> > - assuming the total data size is 30-50TB, 500GB appends per day
> > - the data is time series (output from a multichannel EEG sensor)
> > the key performance metric for us is read throughput (random reads/sec,
> > range queries, sequential scans)
> >
> > Thanks
> > Alex
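
For a rough sense of scale, the two layouts in the quoted question can be compared with some back-of-the-envelope arithmetic. This is only a sketch: it assumes decimal units (1 MB = 10^6 bytes), counts value payload only (no column names, timestamps, or storage overhead), and the rows-per-day figure assumes the whole 500GB daily append is written in the given layout, which the thread does not actually state. The function name `row_stats` is made up for illustration.

```python
# Back-of-the-envelope sizing for the two row layouts from the question.
# Assumptions (not from the thread): decimal units, payload bytes only.
MB = 10**6
GB = 10**9

COLUMNS_PER_ROW = 100          # both options use 100 columns per row
DAILY_APPEND_BYTES = 500 * GB  # "500GB appends per day"


def row_stats(value_bytes, daily_append_bytes=DAILY_APPEND_BYTES):
    """Return (payload bytes per row, whole rows needed per day)."""
    row_bytes = COLUMNS_PER_ROW * value_bytes
    rows_per_day = daily_append_bytes // row_bytes
    return row_bytes, rows_per_day


# Option 1: 100 columns, each holding a 5MB block.
sparse_row_bytes, sparse_rows_per_day = row_stats(5 * MB)

# Option 2: 100 columns, each holding a 6-byte value.
dense_row_bytes, dense_rows_per_day = row_stats(6)

print("option 1:", sparse_row_bytes // MB, "MB per row,", sparse_rows_per_day, "rows/day")
print("option 2:", dense_row_bytes, "bytes per row,", dense_rows_per_day, "rows/day")
```

Under these assumptions option 1 yields rows of about 500MB each, which is relevant to the point above about reading a full block at once versus slicing by column name.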