Dear all, I'd like to share my use case and discuss a few questions with you.
I'm currently working on my second Cassandra project and have ended up with a somewhat unusual use case: storing a traditional relational data set in Cassandra. The data set contains only int and float values, no strings or other types, and the column names are much longer than the values themselves. The row key is an MD5-based (version 3) UUID derived from some other data.

1) I tried some workarounds to save disk space, but the data still took roughly 12-15x more disk space than in MySQL. After digging into the SSTable internals, switching to a more compact data serializer, and hashing each column name down to a single byte, the overhead is now about 6x compared with MySQL, which I think might be acceptable. I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days; I'll keep you updated on my results. In the meantime, I'd be glad to hear your ideas on saving disk space.

2) I'm doing batch writes to the database (pulling data from multiple sources and writing them together). Is there a better way to improve write throughput? Sequential writes are currently about the same speed as MySQL, and the commit log seems to require far more disk I/O than my test machine can provide.

3) In my case every row is read randomly with equal probability, and I have about 0.5M rows in total. Can you give some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.

To make the above more concrete, I've added a few simplified sketches of my setup below. Thanks for your help.
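For 1), here is roughly the idea behind the key and column-name encoding, sketched in Java for illustration (the column name and the exact one-byte hash shown here are examples, not my real schema or code):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.UUID;

    // Illustration only: shrink long column names to a single byte and derive
    // the row key as an MD5-based (version 3) UUID. The hash choice and names
    // below are placeholders, not my production code.
    public class EncodingSketch {

        // Column name as stored in Cassandra: one byte of the MD5 digest of
        // the full name. With a small, fixed set of column names, collisions
        // are checked once against the schema when the application starts.
        static byte[] encodeColumnName(String longName) {
            try {
                byte[] digest = MessageDigest.getInstance("MD5")
                        .digest(longName.getBytes(StandardCharsets.UTF_8));
                return new byte[] { digest[0] };
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // MD5 is always available in the JRE
            }
        }

        // Row key: name-based (version 3, MD5) UUID derived from the source data.
        static String rowKeyFor(String sourceIdentifier) {
            UUID key = UUID.nameUUIDFromBytes(
                    sourceIdentifier.getBytes(StandardCharsets.UTF_8));
            return key.toString();
        }

        public static void main(String[] args) {
            System.out.printf("column byte: 0x%02x%n",
                    encodeColumnName("some_very_long_descriptive_column_name")[0]);
            System.out.println("row key: " + rowKeyFor("source-record-identifier"));
        }
    }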
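For 2), the write path is essentially the loop below, heavily simplified. "BatchSink" is only a stand-in for the real client wrapper I use, not an actual Cassandra API; the point is just that columns from the different sources are grouped per row key and flushed in fixed-size batches rather than one insert per column:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Simplified sketch of the batch-writing loop.
    public class BatchWriterSketch {

        interface BatchSink {
            // One call per batch: row key -> (column name -> value)
            void writeBatch(Map<String, Map<ByteBuffer, byte[]>> rows);
        }

        private final BatchSink sink;
        private final int rowsPerBatch;
        private final Map<String, Map<ByteBuffer, byte[]>> buffer =
                new HashMap<String, Map<ByteBuffer, byte[]>>();

        BatchWriterSketch(BatchSink sink, int rowsPerBatch) {
            this.sink = sink;
            this.rowsPerBatch = rowsPerBatch;
        }

        // Called once per (row, column, value) triple coming from the sources.
        void add(String rowKey, byte[] columnName, byte[] value) {
            Map<ByteBuffer, byte[]> row = buffer.get(rowKey);
            if (row == null) {
                row = new HashMap<ByteBuffer, byte[]>();
                buffer.put(rowKey, row);
            }
            row.put(ByteBuffer.wrap(columnName), value);
            if (buffer.size() >= rowsPerBatch) {
                flush();
            }
        }

        void flush() {
            if (!buffer.isEmpty()) {
                sink.writeBatch(new HashMap<String, Map<ByteBuffer, byte[]>>(buffer));
                buffer.clear();
            }
        }
    }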
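For 3), the knobs I've found so far are the per-ColumnFamily cache attributes. If I'm reading the 0.6-style storage-conf.xml format right, I would set something like the fragment below (the column family name and sizes are only examples); with ~0.5M rows I should be able to keep every key, and possibly a large share of the rows, cached within 8 GB:

    <!-- Example only: keep all keys and most rows cached for a read-mostly,
         uniformly random workload. -->
    <ColumnFamily Name="Measurements"
                  CompareWith="BytesType"
                  KeysCached="100%"
                  RowsCached="500000"/>

Best,
Steve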