> Is that because Cassandra really costs a lot of disk space?
The general design approach is / has been that storage space is cheap and plentiful.

> Well, my target is simply to get the 1.3T compressed down to 700 Gig so that I can
> fit it into a single server, while keeping the same level of performance.
Not sure it's going to be possible to get the same performance from one machine as you would from several.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 17/08/2011, at 10:24 AM, Yi Yang wrote:

>
> Thanks Aaron.
>
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple
>>> resources and putting them together). I'd like to know if there are better
>>> methods to improve write efficiency, since it's just about the same
>>> speed as MySQL when writing sequentially. It seems the commitlog
>>> requires a huge amount of disk IO compared with what my test machine can afford.
>> Have a look at http://www.datastax.com/dev/blog/bulk-loading
> This is a great tool for me. I'll try it, since it should require
> much less bandwidth and disk IO.
>
>>
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around
>>> 0.5M rows in total. Can you provide some practical advice on optimizing
>>> the row cache and key cache? I can use up to 8 gig of memory on the test
>>> machines.
>> Is your data set small enough to fit in memory? You may also be
>> interested in the row_cache_provider setting for column families; see the
>> CLI help for create column family and the IRowCacheProvider interface. You
>> can replace the caching strategy if you want to.
> The dataset is about 150 Gig stored as CSV and estimated at 1.3T stored as
> SSTables, hence I don't think it can fit into memory. I'll try the
> caching strategy, but I think it can only improve my case a little bit.
>
> I'm now looking into native compression of SSTables. I just applied the
> CASSANDRA-47 patch and found there is a huge performance penalty in my use
> case, and I haven't figured out the reason yet. I suppose CASSANDRA-647 will
> solve it better; however, I see there are a number of tickets working on a
> similar issue, including CASSANDRA-1608 etc. Is that because Cassandra
> really costs a lot of disk space?
>
> Well, my target is simply to get the 1.3T compressed down to 700 Gig so that I can
> fit it into a single server, while keeping the same level of performance.
>
> Best,
> Steve
>
>
> On Aug 16, 2011, at 2:27 PM, aaron morton wrote:
>
>>>
>>
>> Hope that helps.
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 16/08/2011, at 12:44 PM, Yi Yang wrote:
>>
>>> Dear all,
>>>
>>> I want to report my use case and have a discussion with you guys.
>>>
>>> I'm currently working on my second Cassandra project, and I've got a somewhat
>>> unique use case: storing a traditional, relational data set in the Cassandra
>>> datastore. It's a dataset of int and float numbers, no strings and no other
>>> data, and the column names are much longer than the values themselves.
>>> Besides, the row key is a version 3 (MD5-based) UUID of some other data.
>>>
>>> 1)
>>> I did some workarounds to save disk space; however, it still takes
>>> approximately 12-15x more disk space than MySQL. I looked into the
>>> Cassandra SSTable internals, did some optimizing by selecting a better data
>>> serializer, and also hashed each column name down to one byte. That leaves
>>> the current database with ~6x disk space overhead compared with MySQL,
>>> which I think might be acceptable.
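
As an aside, here is a rough sketch of the column-name shortening in (1), just to
make the idea concrete. It is only illustrative: the class, the column names and
the lookup table are made up, and it uses a fixed dictionary rather than the hash
Steve describes, but the effect on the stored bytes is the same. Plain Java, no
Cassandra dependencies:

    import java.nio.ByteBuffer;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: map verbose column names to fixed one-byte IDs and
    // store numeric values as raw 4-byte encodings instead of strings.
    public class CompactColumns
    {
        // Hypothetical dictionary; the mapping has to be stable and shared by
        // every writer and reader (kept in code, or in a small lookup CF).
        private static final Map<String, Byte> COLUMN_IDS = new LinkedHashMap<String, Byte>();
        static
        {
            COLUMN_IDS.put("daily_average_closing_price", (byte) 0x01);
            COLUMN_IDS.put("daily_total_traded_volume",   (byte) 0x02);
        }

        // One byte of column name instead of a 25+ character string.
        public static byte[] columnName(String verboseName)
        {
            Byte id = COLUMN_IDS.get(verboseName);
            if (id == null)
                throw new IllegalArgumentException("unknown column: " + verboseName);
            return new byte[] { id };
        }

        // Four raw bytes per float instead of its string form.
        public static byte[] floatValue(float v)
        {
            return ByteBuffer.allocate(4).putFloat(v).array();
        }

        public static void main(String[] args)
        {
            byte[] name  = columnName("daily_average_closing_price");
            byte[] value = floatValue(42.5f);
            System.out.println(name.length + " name byte(s), " + value.length + " value bytes");
        }
    }

Even with 1-byte names and 4-byte values, each column still carries its own
timestamp and length fields on disk, so some multiple of the MySQL footprint is
still to be expected.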
>>>
>>> I'm currently interested in CASSANDRA-674 and will also test CASSANDRA-47
>>> in the coming days. I'll keep you updated on my testing, but I'd be glad
>>> to hear your ideas on saving disk space.
>>>
>>> 2)
>>> I'm doing batch writes to the database (pulling data from multiple
>>> resources and putting them together). I'd like to know if there are better
>>> methods to improve write efficiency, since it's just about the same
>>> speed as MySQL when writing sequentially. It seems the commitlog
>>> requires a huge amount of disk IO compared with what my test machine can afford.
>>>
>>> 3)
>>> In my case, each row is read randomly with the same chance. I have around
>>> 0.5M rows in total. Can you provide some practical advice on optimizing
>>> the row cache and key cache? I can use up to 8 gig of memory on the test
>>> machines.
>>>
>>> Thanks for your help.
>>>
>>>
>>> Best,
>>>
>>> Steve
>>>
>>>
>>
>
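
On the batch-write question in (2): the bulk loading post linked above avoids the
commitlog entirely by building SSTables locally with SSTableSimpleUnsortedWriter
and then streaming them into the cluster with sstableloader. A rough sketch in
that style is below; the keyspace, column family, path and data are placeholders,
and the constructor arguments follow the 0.8-era API described in the post, so
check them against the exact Cassandra version in use:

    import java.io.File;
    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;

    import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

    public class MetricsBulkLoader
    {
        public static void main(String[] args) throws Exception
        {
            // SSTables are written into this local directory (named after the keyspace)
            // and streamed in afterwards, so the cluster sees no commitlog writes.
            File directory = new File("/tmp/bulkload/MyKeyspace");
            directory.mkdirs();

            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    directory,
                    "MyKeyspace",        // keyspace (placeholder)
                    "Metrics",           // column family (placeholder)
                    AsciiType.instance,  // column name comparator
                    null,                // no sub-comparator (not a super column family)
                    64);                 // MB buffered in memory before an SSTable is flushed

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            // In the real data set the row key would be the UUID derived from the
            // source data, and this would sit inside a loop over the ~0.5M records.
            writer.newRow(bytes("example-row-key"));

            ByteBuffer value = ByteBuffer.allocate(4);
            value.putFloat(42.5f);
            value.flip();                                    // 4 raw bytes rather than a string
            writer.addColumn(bytes("c1"), value, timestamp); // "c1" = shortened column name

            writer.close(); // flush whatever is still buffered
        }
    }

The generated files are then pushed to the ring with bin/sstableloader pointed at
that directory, as described in the post; that skips the commitlog path on the
live nodes and shifts the preparation work onto the machine doing the bulk load.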