Thanks Aaron.

>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources 
>> and putting them together).   I'd like to know if there are better ways to 
>> improve write efficiency, since it's about the same speed as MySQL when 
>> writing sequentially.   It seems the commitlog requires a huge amount of 
>> disk IO, more than my test machine can afford.
> Have a look at http://www.datastax.com/dev/blog/bulk-loading
This is a great tool for me.   I'll try it, since it should require much less 
bandwidth and disk IO.
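
For my own notes, here is roughly what the offline writer from that blog post 
looks like.   This is only a sketch based on the 0.8-era API 
(SSTableSimpleUnsortedWriter); the keyspace, column family and output path are 
made up, and constructor details may differ between Cassandra versions:

// Sketch of the offline writer described in the bulk-loading blog post.  The
// keyspace, column family and output path below are hypothetical, and the
// constructor arguments follow my reading of the 0.8-era API.
import java.io.File;
import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class BulkWriter {
    public static void main(String[] args) throws Exception {
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/MyKeyspace"),   // output dir (named after the keyspace)
                "MyKeyspace", "MyCF",          // hypothetical keyspace / column family
                AsciiType.instance,            // column name comparator
                null,                          // no super columns
                64);                           // MB to buffer before flushing an SSTable

        long ts = System.currentTimeMillis() * 1000;  // microsecond timestamps
        writer.newRow(bytes("some-row-key"));
        writer.addColumn(bytes("c1"), bytes("42"), ts);
        writer.close();                        // flushes the last SSTable to disk
    }
}

The generated files can then be streamed in with bin/sstableloader, which 
should sidestep the commitlog write path that's hurting me now.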

> 
>> 3)
>> In my case, each row is read randomly with equal probability.   I have around 
>> 0.5M rows in total.   Can you provide some practical advice on optimizing 
>> the row cache and key cache?   I can use up to 8 gigs of memory on the test 
>> machines.
> Is your data set small enough to fit in memory?   You may also be interested 
> in the row_cache_provider setting for column families; see the CLI help for 
> create column family and the IRowCacheProvider interface. You can replace the 
> caching strategy if you want to.  
The dataset is about 150 GB stored as CSV and an estimated 1.3 TB stored as 
SSTables, so I don't think it can fit into memory.   I'll experiment with the 
caching strategy, but I expect it will only improve my case a little.
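
A rough back-of-envelope with the numbers above suggests why: with uniformly 
random reads the row cache can only hold a tiny fraction of the data, while 
caching every key is cheap.   (The on-disk row size here is only a crude proxy 
for what the row cache would actually hold in memory.)

// Back-of-envelope cache sizing for the numbers in this thread: 0.5M rows,
// ~1.3 TB of SSTables, up to 8 GB of RAM, uniformly random reads.
public class CacheSizing {
    public static void main(String[] args) {
        double rows      = 500000.0;
        double dataBytes = 1.3e12;
        double ramBytes  = 8e9;

        double avgRowBytes = dataBytes / rows;        // ~2.6 MB per row on disk
        double rowsInCache = ramBytes / avgRowBytes;  // ~3,000 rows fit in 8 GB
        double rowCacheHit = rowsInCache / rows;      // ~0.6% hit rate for uniform reads

        System.out.printf("avg row ~%.1f MB, row cache hit rate ~%.2f%%%n",
                          avgRowBytes / 1e6, rowCacheHit * 100);

        // The key cache is a different story: even caching all 500K keys should
        // only cost on the order of tens of MB, so keys_cached can cover them all.
    }
}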

I'm now looking into native compression of SSTables.   I just applied the 
CASSANDRA-47 patch and found a huge performance penalty in my use case, and I 
haven't figured out the reason yet.   I suppose CASSANDRA-674 will solve this 
better; I also see there are a number of tickets working on similar issues, 
including CASSANDRA-1608.   Is that because Cassandra really does use a lot of 
disk space?

Well, my target is simply to get the 1.3 TB compressed down to 700 GB so that 
I can fit it on a single server, while keeping the same level of performance.
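
That works out to a needed compression ratio of roughly 1300 / 700, about 
1.9x.   A quick way to gauge whether that's plausible for this data is a 
generic DEFLATE pass over a sample of raw SSTable bytes, something like the 
sketch below (plain java.util.zip, not the codec any of those patches 
actually use):

// Generic probe of how compressible the raw SSTable bytes are, using plain
// java.util.zip DEFLATE.  A result >= ~1.9x would make the 700 GB target plausible.
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class RatioProbe {
    public static double ratio(byte[] sample) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(sample);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished())
            out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return (double) sample.length / out.size();  // original size / compressed size
    }
}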

Best,
Steve


On Aug 16, 2011, at 2:27 PM, aaron morton wrote:

>> 
> 
> Hope that helps. 
> 
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
> 
>> Dear all,
>> 
>> I want to report my use case and have a discussion with you guys.
>> 
>> I'm currently working on my second Cassandra project.   I've ended up with a 
>> somewhat unique use case: storing a traditional, relational data set in the 
>> Cassandra datastore.   It's a dataset of int and float numbers, with no 
>> strings or other data types, and the column names are much longer than the 
>> values themselves.   Besides, the row key is an MD5-based version 3 UUID of 
>> some other data.
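
For clarity, that key scheme is simply a name-based UUID.   A minimal 
illustration, with a made-up input string:

// java.util.UUID.nameUUIDFromBytes produces a name-based type 3 (MD5) UUID.
import java.nio.charset.Charset;
import java.util.UUID;

public class RowKeys {
    public static UUID rowKey(String sourceId) {
        return UUID.nameUUIDFromBytes(sourceId.getBytes(Charset.forName("UTF-8")));
    }

    public static void main(String[] args) {
        System.out.println(rowKey("some-source-record-id"));  // hypothetical input
    }
}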
>> 
>> 1)
>> I did some workaround to make it save some disk space however it still takes 
>> approximately 12-15x more disk space than MySQL.   I looked into Cassandra 
>> SSTable internal, did some optimizing on selecting better data serializer 
>> and also hashed the column name into one byte.   That made the current 
>> database having ~6x overhead on disk space comparing with MySQL, which I 
>> think it might be acceptable.
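
The column-name shortening is essentially a one-byte dictionary, something 
along these lines (a fixed lookup table rather than an actual hash, since a 
hash truncated to one byte risks collisions; the class and names here are 
illustrative only, not what I actually run):

// Sketch of a one-byte column-name dictionary.
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class ColumnNameCodec {
    private final Map<String, Byte> nameToId = new HashMap<String, Byte>();
    private final Map<Byte, String> idToName = new HashMap<Byte, String>();

    public ColumnNameCodec(String... names) {
        if (names.length > 256)
            throw new IllegalArgumentException("one byte only addresses 256 names");
        for (int i = 0; i < names.length; i++) {
            nameToId.put(names[i], (byte) i);
            idToName.put((byte) i, names[i]);
        }
    }

    public ByteBuffer encode(String name) {   // one-byte column name as stored on disk
        return ByteBuffer.wrap(new byte[] { nameToId.get(name) });
    }

    public String decode(ByteBuffer id) {     // back to the human-readable name
        return idToName.get(id.get(id.position()));
    }
}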
>> 
>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47 in 
>> the coming days.   I'll keep you updated on my testing, but I'd be glad to 
>> hear your ideas on saving disk space.
>> 
>> 2)
>> I'm doing batch writes to the database (pulling data from multiple sources 
>> and putting them together).   I'd like to know if there are better ways to 
>> improve write efficiency, since it's about the same speed as MySQL when 
>> writing sequentially.   It seems the commitlog requires a huge amount of 
>> disk IO, more than my test machine can afford.
>> 
>> 3)
>> In my case, each row is read randomly with equal probability.   I have around 
>> 0.5M rows in total.   Can you provide some practical advice on optimizing 
>> the row cache and key cache?   I can use up to 8 gigs of memory on the test 
>> machines.
>> 
>> Thanks for your help.
>> 
>> 
>> Best,
>> 
>> Steve
>> 
>> 
> 
