Hi guys,

Based on what you are saying, there seems to be a tradeoff that developers
have to handle between:

    "keep your rows under a certain size" vs.
    "keep data that's queried together, on disk together"

How would you handle this tradeoff in my case:

I monitor about 40,000 independent time series streams of data. The streams
have highly variable rates. Each stream has its own row and I go to a new
row every 28 hrs. With this scheme, I see several tens of rows reaching
sizes in the millions of columns within this time bucket (the largest I saw
was 6.4 million). These wide rows are around 400 MB each (considerably
larger than the 60MB you mention).
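
To make that concrete, the row keys are basically "stream id + time bucket
number" (minimal pure-Python sketch below; the key format, the BUCKET_HOURS
constant and the ~60 columns/sec rate are only illustrative, not my exact
code):

    import time

    BUCKET_HOURS = 28                      # current time bucket
    BUCKET_SECONDS = BUCKET_HOURS * 3600

    def row_key(stream_id, event_ts):
        # Row key = stream id plus the bucket this event's timestamp falls in.
        bucket = int(event_ts // BUCKET_SECONDS)
        return "%s:%d" % (stream_id, bucket)

    # A stream writing ~60 columns/sec fills one 28 hr row with ~6 million
    # columns, which is the ballpark of the widest rows I am seeing:
    print(60 * BUCKET_SECONDS)             # 6,048,000 columns per bucket
    print(row_key("stream-00042", time.time()))

So halving the bucket (28 hrs -> 14 hrs) would roughly halve the worst-case
row width for the hottest streams.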

Will adding a few tens of wide rows like this every day cause me problems
in the long term? Should I consider lowering the time bucket?

From experience I am a bit reluctant to create too many rows, because I see
that reading across multiple rows seriously affects performance. Of course
I will use map-reduce as well... will it be significantly affected by many
rows?
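
To show what I mean about multiple rows: a range read over one stream has to
hit one row per bucket the time range spans, so a smaller bucket means more
keys per multiget and more input rows for map-reduce (rough sketch, same
hypothetical key format as above):

    def row_keys_for_range(stream_id, start_ts, end_ts, bucket_seconds):
        # One row key per time bucket that the [start_ts, end_ts] range touches.
        first = int(start_ts // bucket_seconds)
        last = int(end_ts // bucket_seconds)
        return ["%s:%d" % (stream_id, b) for b in range(first, last + 1)]

    # A one-week scan of a single stream:
    week = 7 * 24 * 3600
    print(len(row_keys_for_range("s42", 0, week - 1, 28 * 3600)))  # 6 rows
    print(len(row_keys_for_range("s42", 0, week - 1, 14 * 3600)))  # 12 rows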

Cheers,
Alex

On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:

> The reads are only fetching slices of 20 to 100 columns max at a time from
> the row, but if the key is planted on one node in the cluster I am concerned
> about that node getting the brunt of traffic.
>
> What RF are you using, how many nodes are in the cluster, what CL do you
> read at ?
>
> If you have lots of nodes that are in different racks the
> NetworkTopologyStrategy will do a better job of distributing read load than
> the SimpleStrategy. The DynamicSnitch can also help distribute load; see
> cassandra.yaml for its configuration.
>
> I thought about breaking the column data into multiple different row keys
> to help distribute it throughout the cluster, but it's so darn handy having
> all the columns in one key!!
>
> If you have a row that will continually grow, it is a good idea to
> partition it in some way. Large rows can slow things like compaction and
> repair down; anything above 60MB starts to slow things down. Can you
> partition by a date range, such as month?
>
> Large rows are also a little slower to query; see
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>
> If most reads are only pulling 20 to 100 columns at a time, are there two
> workloads? Is it possible to store just these columns in a separate row? If
> you understand how big a row may get, you may be able to use the row cache
> to improve performance.
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>
> I have a row key which is now up to 125,000 columns (and anticipated to
> grow). I know this is a far cry from the 2 billion columns a single row key
> can store in Cassandra, but my concern is the number of reads that this
> specific row key may get compared to other row keys. This particular row
> key houses column data associated with one of the more popular areas of the
> site. The reads are only fetching slices of 20 to 100 columns max at a time
> from the row, but if the key is planted on one node in the cluster I am
> concerned about that node getting the brunt of traffic.
>
> I thought about breaking the column data into multiple different row keys
> to help distribute it throughout the cluster, but it's so darn handy having
> all the columns in one key!!
>
> key_cache is enabled but row cache is disabled on the column family.
>
> Should I be concerned going forward? Any particular advice on large wide
> rows?
>
> Thanks!
>
>
>
