Thanks Aaron, I'll lower the time bucket and see how it goes.

Cheers,
Alex
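
A rough sketch of what lowering the time bucket could look like, in Python. It assumes row size grows roughly linearly with the bucket length, takes the 60MB figure from Aaron's earlier reply as the target, and uses a hypothetical "<stream_id>:<bucket index>" row key layout. None of the function names come from Cassandra or any client library.

```python
from datetime import datetime, timedelta, timezone


def max_bucket_hours(current_bucket_hours, observed_row_mb, target_row_mb=60.0):
    """Scale the bucket length so the widest observed row would fit the target.

    Assumes (hypothetically) that row size is proportional to bucket length.
    """
    return current_bucket_hours * target_row_mb / observed_row_mb


def row_key(stream_id, ts, bucket_hours):
    """Build a row key for the bucket that contains timestamp `ts`.

    The '<stream_id>:<bucket index>' layout is an assumption for illustration.
    """
    bucket_seconds = int(bucket_hours * 3600)
    bucket_index = int(ts.timestamp()) // bucket_seconds
    return f"{stream_id}:{bucket_index}"


def row_keys_for_range(stream_id, start, end, bucket_hours):
    """List every row key a read from `start` to `end` (inclusive) must touch."""
    bucket_seconds = int(bucket_hours * 3600)
    first = int(start.timestamp()) // bucket_seconds
    last = int(end.timestamp()) // bucket_seconds
    return [f"{stream_id}:{i}" for i in range(first, last + 1)]
```

With the numbers from this thread (28 hr bucket, widest row around 400MB), max_bucket_hours(28, 400) comes out at about 4.2 hours, so a 4 hr bucket would be one candidate. A read covering 28 hours would then touch 7 or 8 row keys per stream, depending on how the range lines up with the bucket boundaries.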


On Thu, Mar 22, 2012 at 10:07 PM, aaron morton <aa...@thelastpickle.com> wrote:

> Will adding a few tens of wide rows like this every day cause me problems
> in the long term? Should I consider lowering the time bucket?
>
> IMHO yeah, yup, ya and yes.
>
>
> From experience I am a bit reluctant to create too many rows because I see
> that reading across multiple rows seriously affects performance. Of course
> I will use map-reduce as well ...will it be significantly affected by many
> rows?
>
> I don't think it would make too much difference. The range slice used by
> map-reduce will find the first row in the batch and then step through the
> rows that follow.
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>
> Hi guys,
>
> Based on what you are saying there seems to be a tradeoff that developers
> have to handle between:
>
>     "keep your rows under a certain size" vs
>     "keep data that's queried together, on disk together"
>
> How would you handle this tradeoff in my case:
>
> I monitor about 40,000 independent timeseries streams of data. The streams
> have highly variable rates. Each stream has its own row and I go to a new
> row every 28 hrs. With this scheme, I see several tens of rows reaching
> sizes in the millions of columns within this time bucket (the largest I saw
> was 6.4 million). The sizes of these wide rows are around 400 MB
> (considerably more than 60 MB).
>
> Will adding a few tens of wide rows like this every day cause me problems
> in the long term? Should I consider lowering the time bucket?
>
> From experience I am a bit reluctant to create too many rows because I see
> that reading across multiple rows seriously affects performance. Of course
> I will use map-reduce as well ...will it be significantly affected by many
> rows?
>
> Cheers,
> Alex
>
> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> The reads are only fetching slices of 20 to 100 columns max at a time
>> from the row but if the key is planted on one node in the cluster I am
>> concerned about that node getting the brunt of traffic.
>>
>> What RF are you using, how many nodes are in the cluster, what CL do you
>> read at ?
>>
>> If you have lots of nodes that are in different racks the
>> NetworkTopologyStrategy will do a better job of distributing read load than
>> the SimpleStrategy. The DynamicSnitch can also help distribute load; see
>> cassandra.yaml for its configuration.
>>
>> I thought about breaking the column data into multiple different row keys
>> to help distribute throughout the cluster but it's so darn handy having all
>> the columns in one key!!
>>
>> If you have a row that will continually grow it is a good idea to
>> partition it in some way. Large rows can slow things like compaction and
>> repair down. If you have something above 60MB it's starting to slow things
>> down. Can you partition by a date range such as month ?
>>
>> Large rows are also a little slower to query; see
>> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>
>> If most reads are only pulling 20 to 100 columns at a time, are there two
>> workloads? Is it possible to store just these columns in a separate row? If
>> you understand how big a row may get, you may be able to use the row cache
>> to improve performance.
>>
>> Cheers
>>
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>>
>> I have a row key which is now up to 125,000 columns (and anticipated to
>> grow). I know this is a far cry from the 2 billion columns a single row key
>> can store in Cassandra, but my concern is the number of reads that this
>> specific row key may get compared to other row keys. This particular row
>> key houses column data associated with one of the more popular areas of the
>> site. The reads are only fetching slices of 20 to 100 columns max at a time
>> from the row but if the key is planted on one node in the cluster I am
>> concerned about that node getting the brunt of traffic.
>>
>> I thought about breaking the column data into multiple different row keys
>> to help distribute throughout the cluster but it's so darn handy having all
>> the columns in one key!!
>>
>> key_cache is enabled but row cache is disabled on the column family.
>>
>> Should I be concerned going forward? Any particular advice on large wide
>> rows?
>>
>> Thanks!
>>
>>
>>
>
>
