> Will adding a few tens of wide rows like this every day cause me problems on 
> the long term? Should I consider lowering the time bucket?
IMHO yeah, yup, ya and yes.
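
If you do lower the bucket, something like this untested sketch is the sort 
of thing I mean - the key format, the 4 hr bucket and stream_id are just 
assumptions, not anything your schema has to use:

    BUCKET_HOURS = 4   # assumption: 4 hr buckets instead of the current 28 hrs

    def row_key(stream_id, unix_ts, bucket_hours=BUCKET_HOURS):
        """Row key = <stream id>:<start of the time bucket, in epoch hours>."""
        epoch_hours = unix_ts // 3600
        bucket_start = (epoch_hours // bucket_hours) * bucket_hours
        return "%s:%d" % (stream_id, bucket_start)

    # e.g. 22/03/2012 11:43 UTC -> "stream-42:370112"
    print(row_key("stream-42", 1332416580))

With a 4 hr bucket the 6.4 million column / ~400MB rows would come out at 
roughly 900K columns / ~60MB each, which is a lot kinder to compaction and 
repair.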


> From experience I am a bit reluctant to create too many rows because I see 
> that reading across multiple rows seriously affects performance. Of course I 
> will use map-reduce as well... will it be significantly affected by many rows?
I don't think it would make too much difference. The range slice used by 
map-reduce finds the first row in each batch and then steps through the rest.
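
To be concrete, the scan behaves roughly like this - fetch_rows() is a made 
up stand-in for the underlying range slice call, assumed to return rows with 
keys after start_key:

    def scan_rows(fetch_rows, batch_size=100):
        """Walk every row in the range, one batch of keys at a time."""
        start_key = ""
        while True:
            batch = fetch_rows(start_key, batch_size)  # [(row_key, columns), ...]
            for row_key, columns in batch:
                yield row_key, columns
            if len(batch) < batch_size:
                return                    # ran off the end of the range
            start_key = batch[-1][0]      # next batch steps on from the last key

So more, smaller rows mostly means more steps through the batches rather than 
a different access pattern.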

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:

> Hi guys,
> 
> Based on what you are saying there seems to be a tradeoff that developers 
> have to handle between: 
> 
>                                "keep your rows under a certain size" vs "keep 
> data that's queried together, on disk together"
> 
> How would you handle this tradeoff in my case: 
> 
> I monitor about 40,000 independent timeseries streams of data. The streams 
> have highly variable rates. Each stream has its own row and I go to a new row 
> every 28 hrs. With this scheme, I see several tens of rows reaching sizes in 
> the millions of columns within this time bucket (the largest I saw was 6.4 
> million). The sizes of these wide rows are around 400MBytes (considerably 
> larger than 60MB).
> 
> Will adding a few tens of wide rows like this every day cause me problems on 
> the long term? Should I consider lowering the time bucket?
> 
> From experience I am a bit reluctant to create too many rows because I see 
> that reading across multiple rows seriously affects performance. Of course I 
> will use map-reduce as well... will it be significantly affected by many rows?
> 
> Cheers,
> Alex
> 
> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> The reads are only fetching slices of 20 to 100 columns max at a time from 
>> the row but if the key is planted on one node in the cluster I am concerned 
>> about that node getting the brunt of traffic.
> What RF are you using, how many nodes are in the cluster, and what CL do you 
> read at?
> 
> If you have lots of nodes in different racks the NetworkTopologyStrategy 
> will do a better job of distributing read load than the SimpleStrategy. The 
> DynamicSnitch can also help distribute read load; see cassandra.yaml for its 
> configuration. 
> 
>> I thought about breaking the column data into multiple different row keys to 
>> help distribute throughout the cluster but it's so darn handy having all the 
>> columns in one key!!
> If you have a row that will continually grow, it is a good idea to partition 
> it in some way. Large rows can slow things like compaction and repair down; 
> anything above 60MB starts to slow things down. Can you partition by a date 
> range, such as by month?
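> 
> An untested sketch of what I mean (the key format and helper name are made 
> up, adjust to whatever fits your model):
> 
>     def monthly_key(base_key, year, month):
>         """e.g. ("popular-area", 2012, 3) -> "popular-area:2012-03"."""
>         return "%s:%04d-%02d" % (base_key, year, month)
> 
> Reads for the current month hit one bounded row, and older months stay a 
> manageable size for compaction and repair.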
> 
> Large rows are also a little slower to query from
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
> 
> If most reads are only pulling 20 to 100 columns at a time, are there two 
> workloads? Is it possible to store just these columns in a separate row? If 
> you understand how big a row may get, you may be able to use the row cache to 
> improve performance.  
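> 
> For example (names made up, illustrative only) each write could mirror the 
> hot columns into a second small row, and reads of the popular slice go to 
> that row instead:
> 
>     def mutations_for(base_key, column, value, hot_columns):
>         """Every column goes to the wide row; the hot ones also go to a
>         small row that is cheap to read and a good row cache candidate."""
>         rows = [(base_key, column, value)]
>         if column in hot_columns:
>             rows.append((base_key + ":hot", column, value))
>         return rows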
> 
> Cheers
> 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
> 
>> I have a row key which is now up to 125,000 columns (and anticipated to 
>> grow). I know this is a far cry from the 2 billion columns a single row key 
>> can store in Cassandra but my concern is the amount of reads that this 
>> specific row key may get compared to other row keys. This particular row key 
>> houses column data associated with one of the more popular areas of the 
>> site. The reads are only fetching slices of 20 to 100 columns max at a time 
>> from the row but if the key is planted on one node in the cluster I am 
>> concerned about that node getting the brunt of traffic.
>> 
>> I thought about breaking the column data into multiple different row keys to 
>> help distribute throughout the cluster but it's so darn handy having all the 
>> columns in one key!!
>> 
>> key_cache is enabled but row cache is disabled on the column family.
>> 
>> Should I be concerned going forward? Any particular advice on large wide 
>> rows?
>> 
>> Thanks!
> 
> 
