Dear all, I'd like to share my use case and discuss a few questions with you.
I'm currently working on my second Cassandra project and have ended up with a somewhat unusual use case: storing a traditional relational data set in Cassandra. The data set contains only int and float values, no strings or other types, and the column names are much longer than the values themselves. The row key is an MD5-based (version 3) UUID derived from some other data.

1) I tried some workarounds to save disk space, but the data still took roughly 12-15x more disk space than in MySQL. After digging into the SSTable internals, switching to a more compact data serializer, and hashing each column name down to a single byte, the overhead is now about 6x compared with MySQL, which I think might be acceptable. I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in the coming days; I'll keep you updated on my results. In the meantime, I'd be glad to hear your ideas on saving disk space.

2) I'm doing batch writes to the database (pulling data from multiple sources and writing them together). Is there a better way to improve write throughput? Sequential writes are currently about the same speed as MySQL, and the commit log seems to require far more disk I/O than my test machine can provide.

3) In my case every row is read randomly with equal probability, and I have about 0.5M rows in total. Can you give some practical advice on tuning the row cache and key cache? I can use up to 8 GB of memory on the test machines.

To make the above more concrete, I've added a few simplified sketches of my setup below. Thanks for your help.
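For 1), here is roughly the idea behind the key and column-name encoding, sketched in Java for illustration (the column name and the exact one-byte hash shown here are examples, not my real schema or code):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.UUID;

    // Illustration only: shrink long column names to a single byte and derive
    // the row key as an MD5-based (version 3) UUID. The hash choice and names
    // below are placeholders, not my production code.
    public class EncodingSketch {

        // Column name as stored in Cassandra: one byte of the MD5 digest of
        // the full name. With a small, fixed set of column names, collisions
        // are checked once against the schema when the application starts.
        static byte[] encodeColumnName(String longName) {
            try {
                byte[] digest = MessageDigest.getInstance("MD5")
                        .digest(longName.getBytes(StandardCharsets.UTF_8));
                return new byte[] { digest[0] };
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // MD5 is always available in the JRE
            }
        }

        // Row key: name-based (version 3, MD5) UUID derived from the source data.
        static String rowKeyFor(String sourceIdentifier) {
            UUID key = UUID.nameUUIDFromBytes(
                    sourceIdentifier.getBytes(StandardCharsets.UTF_8));
            return key.toString();
        }

        public static void main(String[] args) {
            System.out.printf("column byte: 0x%02x%n",
                    encodeColumnName("some_very_long_descriptive_column_name")[0]);
            System.out.println("row key: " + rowKeyFor("source-record-identifier"));
        }
    }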
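For 2), the write path is essentially the loop below, heavily simplified. "BatchSink" is only a stand-in for the real client wrapper I use, not an actual Cassandra API; the point is just that columns from the different sources are grouped per row key and flushed in fixed-size batches rather than one insert per column:

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    // Simplified sketch of the batch-writing loop.
    public class BatchWriterSketch {

        interface BatchSink {
            // One call per batch: row key -> (column name -> value)
            void writeBatch(Map<String, Map<ByteBuffer, byte[]>> rows);
        }

        private final BatchSink sink;
        private final int rowsPerBatch;
        private final Map<String, Map<ByteBuffer, byte[]>> buffer =
                new HashMap<String, Map<ByteBuffer, byte[]>>();

        BatchWriterSketch(BatchSink sink, int rowsPerBatch) {
            this.sink = sink;
            this.rowsPerBatch = rowsPerBatch;
        }

        // Called once per (row, column, value) triple coming from the sources.
        void add(String rowKey, byte[] columnName, byte[] value) {
            Map<ByteBuffer, byte[]> row = buffer.get(rowKey);
            if (row == null) {
                row = new HashMap<ByteBuffer, byte[]>();
                buffer.put(rowKey, row);
            }
            row.put(ByteBuffer.wrap(columnName), value);
            if (buffer.size() >= rowsPerBatch) {
                flush();
            }
        }

        void flush() {
            if (!buffer.isEmpty()) {
                sink.writeBatch(new HashMap<String, Map<ByteBuffer, byte[]>>(buffer));
                buffer.clear();
            }
        }
    }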
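For 3), the knobs I've found so far are the per-ColumnFamily cache attributes. If I'm reading the 0.6-style storage-conf.xml format right, I would set something like the fragment below (the column family name and sizes are only examples); with ~0.5M rows I should be able to keep every key, and possibly a large share of the rows, cached within 8 GB:

    <!-- Example only: keep all keys and most rows cached for a read-mostly,
         uniformly random workload. -->
    <ColumnFamily Name="Measurements"
                  CompareWith="BytesType"
                  KeysCached="100%"
                  RowsCached="500000"/>

Best,
Steve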