> Is that because Cassandra really costs a lot of disk space?
The general design approach is / has been that storage space is cheap and plentiful.

> Well, my target is simply to get the 1.3T compressed down to 700 Gig so that I can
> fit it into a single server, while keeping the same level of performance.
Not sure it's going to be possible to get the same performance from one machine as you would from several.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2011, at 10:24 AM, Yi Yang wrote:

>
> Thanks Aaron.
>
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple
>>> resources and putting them together). I'd like to know if there are better
>>> methods to improve write efficiency, since it's just about the same
>>> speed as MySQL when writing sequentially. It seems the commitlog
>>> requires a huge amount of disk IO compared with what my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try it, since it should require
> much less bandwidth and disk IO.
>
>>
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around
>>> 0.5M rows in total. Can you provide some practical advice on optimizing
>>> the row cache and key cache? I can use up to 8 gig of memory on the test
>>> machines.
>> Is your data set small enough to fit in memory? You may also be
>> interested in the row_cache_provider setting for column families; see the
>> CLI help for create column family and the IRowCacheProvider interface. You
>> can replace the caching strategy if you want to.
> The dataset is about 150 Gig stored as CSV and estimated at 1.3T stored as
> SSTables, hence I don't think it can fit into memory. I'll try the
> caching strategy, but I think it can only improve my case a little bit.
>
> I'm now looking into native compression of SSTables. I just applied the
> CASSANDRA-47 patch and found there is a huge performance penalty in my use
> case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will
> solve it better; however, I see there are a number of tickets working on a
> similar issue, including CASSANDRA-1608 etc. Is that because Cassandra
> really costs a lot of disk space?
>
> Well, my target is simply to get the 1.3T compressed down to 700 Gig so that I can
> fit it into a single server, while keeping the same level of performance.
>
> Best,
> Steve
>
>
> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:
>
>>>
>>
>> Hope that helps.
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
>>
>>> Dear all,
>>>
>>> I want to report my use case and have a discussion with you guys.
>>>
>>> I'm currently working on my second Cassandra project, and I've got a somewhat
>>> unique use case: storing a traditional, relational data set in the Cassandra
>>> datastore. It's a dataset of int and float numbers, no strings and no other
>>> data, and the column names are much longer than the values themselves.
>>> Besides, the row key is a version 3 (MD5-based) UUID of some other data.
>>>
>>> 1)
>>> I did some workarounds to save disk space; however, it still takes
>>> approximately 12-15x more disk space than MySQL. I looked into the
>>> Cassandra SSTable internals, did some optimizing by selecting a better data
>>> serializer, and also hashed each column name down to one byte. That leaves
>>> the current database with ~6x disk space overhead compared with MySQL,
>>> which I think might be acceptable.
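
As an aside, here is a rough sketch of the column-name shortening in (1), just to
make the idea concrete. It is only illustrative: the class, the column names and
the lookup table are made up, and it uses a fixed dictionary rather than the hash
Steve describes, but the effect on the stored bytes is the same. Plain Java, no
Cassandra dependencies:

    import java.nio.ByteBuffer;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: map verbose column names to fixed one-byte IDs and
    // store numeric values as raw 4-byte encodings instead of strings.
    public class CompactColumns
    {
        // Hypothetical dictionary; the mapping has to be stable and shared by
        // every writer and reader (kept in code, or in a small lookup CF).
        private static final Map<String, Byte> COLUMN_IDS = new LinkedHashMap<String, Byte>();
        static
        {
            COLUMN_IDS.put("daily_average_closing_price", (byte) 0x01);
            COLUMN_IDS.put("daily_total_traded_volume",   (byte) 0x02);
        }

        // One byte of column name instead of a 25+ character string.
        public static byte[] columnName(String verboseName)
        {
            Byte id = COLUMN_IDS.get(verboseName);
            if (id == null)
                throw new IllegalArgumentException("unknown column: " + verboseName);
            return new byte[] { id };
        }

        // Four raw bytes per float instead of its string form.
        public static byte[] floatValue(float v)
        {
            return ByteBuffer.allocate(4).putFloat(v).array();
        }

        public static void main(String[] args)
        {
            byte[] name  = columnName("daily_average_closing_price");
            byte[] value = floatValue(42.5f);
            System.out.println(name.length + " name byte(s), " + value.length + " value bytes");
        }
    }

Even with 1-byte names and 4-byte values, each column still carries its own
timestamp and length fields on disk, so some multiple of the MySQL footprint is
still to be expected.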
>>>
>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47
>>> in the coming days. I'll keep you updated on my testing, but I'd be glad
>>> to hear your ideas on saving disk space.
>>>
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple
>>> resources and putting them together). I'd like to know if there are better
>>> methods to improve write efficiency, since it's just about the same
>>> speed as MySQL when writing sequentially. It seems the commitlog
>>> requires a huge amount of disk IO compared with what my test machine can afford.
>>>
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around
>>> 0.5M rows in total. Can you provide some practical advice on optimizing
>>> the row cache and key cache? I can use up to 8 gig of memory on the test
>>> machines.
>>>
>>> Thanks for your help.
>>>
>>>
>>> Best,
>>>
>>> Steve
>>>
>>>
>>
>
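
On the batch-write question in (2): the bulk loading post linked above avoids the
commitlog entirely by building SSTables locally with SSTableSimpleUnsortedWriter
and then streaming them into the cluster with sstableloader. A rough sketch in
that style is below; the keyspace, column family, path and data are placeholders,
and the constructor arguments follow the 0.8-era API described in the post, so
check them against the exact Cassandra version in use:

    import java.io.File;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

    public class MetricsBulkLoader
    {
        public static void main(String[] args) throws Exception
        {
            // SSTables are written into this local directory (named after the keyspace)
            // and streamed in afterwards, so the cluster sees no commitlog writes.
            File directory = new File("/tmp/bulkload/MyKeyspace");
            directory.mkdirs();

            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    directory,
                    "MyKeyspace",        // keyspace (placeholder)
                    "Metrics",           // column family (placeholder)
                    AsciiType.instance,  // column name comparator
                    null,                // no sub-comparator (not a super column family)
                    64);                 // MB buffered in memory before an SSTable is flushed

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            // In the real data set the row key would be the UUID derived from the
            // source data, and this would sit inside a loop over the ~0.5M records.
            writer.newRow(bytes("example-row-key"));

            ByteBuffer value = ByteBuffer.allocate(4);
            value.putFloat(42.5f);
            value.flip();                                    // 4 raw bytes rather than a string
            writer.addColumn(bytes("c1"), value, timestamp); // "c1" = shortened column name

            writer.close(); // flush whatever is still buffered
        }
    }

The generated files are then pushed to the ring with bin/sstableloader pointed at
that directory, as described in the post; that skips the commitlog path on the
live nodes and shifts the preparation work onto the machine doing the bulk load.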