Thanks Aaron.

>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources
>> and putting it together). I'd like to know if there are better methods to
>> improve the write efficiency, since it's just about the same speed as
>> MySQL when writing sequentially. It seems like the commitlog requires a
>> huge amount of disk IO, more than my test machine can afford.

> Have a look at http://www.datastax.com/dev/blog/bulk-loading

This looks like a great tool for me. I'll try it, since it should cost much less bandwidth and disk IO.
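What I have in mind is roughly what that post describes: write the SSTables offline with SSTableSimpleUnsortedWriter and then stream them in with sstableloader, so the commitlog is skipped entirely. A rough sketch of my plan -- the keyspace/column family names and the CSV layout are just placeholders for my real schema, and the writer's constructor arguments may differ slightly between 0.8.x releases, so check the post for the exact signature on your version:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class OfflineBulkWriter
{
    public static void main(String[] args) throws Exception
    {
        // sstableloader expects a directory named after the keyspace.
        File directory = new File("MyKeyspace");
        if (!directory.exists())
            directory.mkdir();

        // "MyKeyspace" / "Metrics" are placeholders; the last argument is
        // how many MB to buffer in memory before an SSTable is flushed.
        // (Some 0.8.x builds also take an IPartitioner argument here.)
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                directory, "MyKeyspace", "Metrics",
                AsciiType.instance, null, 64);

        long timestamp = System.currentTimeMillis() * 1000; // microseconds

        // args[0] = path to a hypothetical CSV dump: rowKey,columnName,value
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null)
        {
            String[] fields = line.split(",");
            writer.newRow(bytes(fields[0]));
            writer.addColumn(bytes(fields[1]), bytes(fields[2]), timestamp);
        }
        reader.close();
        writer.close(); // flushes the last buffered SSTable

        // Then, from a machine that can reach the cluster:
        //   bin/sstableloader MyKeyspace/
    }
}

As I understand it, the main win is that these writes never go through the commitlog or the memtables, which is exactly where my disk IO is going right now.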
>
>> 3)
>> In my case, each row is read randomly with the same chance. I have around
>> 0.5M rows in total. Can you provide some practical advice on optimizing
>> the row cache and key cache? I can use up to 8 gig of memory on the test
>> machines.

> Is your data set small enough to fit in memory? You may also be interested
> in the row_cache_provider setting for column families, see the CLI help for
> create column family and the IRowCacheProvider interface. You can replace the
> caching strategy if you want to.

The dataset is about 150 gig stored as CSV and an estimated 1.3T stored as
SSTables, so I don't think it can fit into memory. I'll try the caching
strategy (see the CLI sketch at the bottom of this mail), but I expect it
will only improve my case a little.

I'm now looking into native compression of SSTables. I just patched in
CASSANDRA-47 and found there is a huge performance penalty in my use case;
I haven't figured out the reason yet. I suppose CASSANDRA-647 will solve it
better, but I see there are a number of tickets working on a similar issue,
including CASSANDRA-1608 etc. Is that because Cassandra really does cost
this much disk space? Well, my target is simply to get the 1.3T compressed
down to 700 gig so that I can fit it onto a single server, while keeping
the same level of performance.

Best,

Steve

On Aug 16, 2011, at 2:27 PM, aaron morton wrote:

> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
>
>> Dear all,
>>
>> I want to report my use case and have a discussion with you guys.
>>
>> I'm currently working on my second Cassandra project, and I've run into a
>> somewhat unique use case: storing a traditional, relational data set in the
>> Cassandra datastore. It's a dataset of int and float numbers, no strings or
>> other data, and the column names are much longer than the values themselves.
>> Besides, the row key is a version-3 (MD5) UUID of some other data.
>>
>> 1)
>> I did some workarounds to save disk space; however, it still takes
>> approximately 12-15x more disk space than MySQL. I looked into the Cassandra
>> SSTable internals, did some optimizing by selecting a better data serializer,
>> and also hashed each column name down to one byte. That brought the current
>> database to ~6x disk space overhead compared with MySQL, which I think might
>> be acceptable.
>>
>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47
>> in the coming days. I'll keep you updated on my testing, but I'm keen to
>> hear your ideas on saving disk space.
>>
>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources
>> and putting it together). I'd like to know if there are better methods to
>> improve the write efficiency, since it's just about the same speed as
>> MySQL when writing sequentially. It seems like the commitlog requires a
>> huge amount of disk IO, more than my test machine can afford.
>>
>> 3)
>> In my case, each row is read randomly with the same chance. I have around
>> 0.5M rows in total. Can you provide some practical advice on optimizing
>> the row cache and key cache? I can use up to 8 gig of memory on the test
>> machines.
>>
>> Thanks for your help.
>>
>>
>> Best,
>>
>> Steve
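P.S. This is the CLI sketch I referred to above for the per-column-family cache settings Aaron mentions. The keyspace/column family names are placeholders, and the attribute names are what I understand from the 0.8 CLI help, so they should be double-checked against the help output on your own version:

use MyKeyspace;

update column family Metrics with
    keys_cached = 500000 and
    rows_cached = 20000 and
    row_cache_provider = 'SerializingCacheProvider';

With ~0.5M rows, caching all of the keys should be cheap; the row cache is the part I expect to have to tune, and SerializingCacheProvider (off-heap, and as I understand it requires JNA) is the kind of strategy swap Aaron points at via IRowCacheProvider.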