Thanks Aaron, I'll lower the time bucket and see how it goes.

Cheers,
Alex
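To make the "lower the time bucket" idea concrete, here is a minimal sketch in plain Java (no Cassandra client calls; the names streamId and bucketHours are illustrative, not from this thread) of how a row key can be derived from a stream id and a configurable bucket width, so that shrinking the bucket directly caps how many columns land in any one row:

    import java.util.concurrent.TimeUnit;

    // Illustrative sketch only: derives a time-bucketed row key for a
    // timeseries stream. Lowering bucketHours (e.g. from 28 to 7) creates
    // more, smaller rows, keeping each row well under the size where
    // compaction and repair start to slow down.
    public final class TimeBucketKey {

        private TimeBucketKey() {}

        // Returns a row key of the form "<streamId>:<bucketStartMillis>".
        public static String rowKey(String streamId, long timestampMillis, long bucketHours) {
            long bucketMillis = TimeUnit.HOURS.toMillis(bucketHours);
            long bucketStart = (timestampMillis / bucketMillis) * bucketMillis;
            return streamId + ":" + bucketStart;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            System.out.println(rowKey("stream-42", now, 28)); // current 28 h scheme
            System.out.println(rowKey("stream-42", now, 7));  // a lower time bucket
        }
    }

With 28 h buckets the busiest streams reached ~6.4 million columns; shrinking bucketHours proportionally shrinks the worst-case row, at the cost of reads spanning more rows.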
On Thu, Mar 22, 2012 at 10:07 PM, aaron morton <aa...@thelastpickle.com> wrote:

> Will adding a few tens of wide rows like this every day cause me problems
> in the long term? Should I consider lowering the time bucket?
>
> IMHO yeah, yup, ya and yes.
>
> From experience I am a bit reluctant to create too many rows because I see
> that reading across multiple rows seriously affects performance. Of course
> I will use map-reduce as well... will it be significantly affected by many
> rows?
>
> Don't think it would make too much difference. The range slice used by
> map-reduce will find the first row in the batch and then step through them.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>
> Hi guys,
>
> Based on what you are saying there seems to be a tradeoff that developers
> have to handle between:
>
> "keep your rows under a certain size" vs
> "keep data that's queried together, on disk together"
>
> How would you handle this tradeoff in my case:
>
> I monitor about 40,000 independent timeseries streams of data. The streams
> have highly variable rates. Each stream has its own row and I go to a new
> row every 28 hrs. With this scheme, I see several tens of rows reaching
> sizes in the millions of columns within this time bucket (the largest I saw
> was 6.4 million). The sizes of these wide rows are around 400 MB
> (considerably larger than 60 MB).
>
> Will adding a few tens of wide rows like this every day cause me problems
> in the long term? Should I consider lowering the time bucket?
>
> From experience I am a bit reluctant to create too many rows because I see
> that reading across multiple rows seriously affects performance. Of course
> I will use map-reduce as well... will it be significantly affected by many
> rows?
>
> Cheers,
> Alex
>
> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> The reads are only fetching slices of 20 to 100 columns max at a time
>> from the row but if the key is planted on one node in the cluster I am
>> concerned about that node getting the brunt of traffic.
>>
>> What RF are you using, how many nodes are in the cluster, what CL do you
>> read at?
>>
>> If you have lots of nodes that are in different racks the
>> NetworkTopologyStrategy will do a better job of distributing read load
>> than the SimpleStrategy. The DynamicSnitch can also help distribute load;
>> see cassandra.yaml for its configuration.
>>
>> I thought about breaking the column data into multiple different row keys
>> to help distribute throughout the cluster but it's so darn handy having
>> all the columns in one key!!
>>
>> If you have a row that will continually grow it is a good idea to
>> partition it in some way. Large rows can slow things like compaction and
>> repair down. If you have something above 60MB it's starting to slow things
>> down. Can you partition by a date range such as month?
>>
>> Large rows are also a little slower to query, see
>> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>
>> If most reads are only pulling 20 to 100 columns at a time, are there two
>> workloads? Is it possible to store just these columns in a separate row?
>> If you understand how big a row may get, you may be able to use the row
>> cache to improve performance.
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>>
>> I have a row key which is now up to 125,000 columns (and anticipated to
>> grow). I know this is a far cry from the 2 billion columns a single row
>> key can store in Cassandra, but my concern is the amount of reads that
>> this specific row key may get compared to other row keys. This particular
>> row key houses column data associated with one of the more popular areas
>> of the site. The reads are only fetching slices of 20 to 100 columns max
>> at a time from the row, but if the key is planted on one node in the
>> cluster I am concerned about that node getting the brunt of traffic.
>>
>> I thought about breaking the column data into multiple different row keys
>> to help distribute throughout the cluster, but it's so darn handy having
>> all the columns in one key!!
>>
>> key_cache is enabled but row cache is disabled on the column family.
>>
>> Should I be concerned going forward? Any particular advice on large wide
>> rows?
>>
>> Thanks!
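On the question of splitting one hot wide row across several keys, a common pattern is to shard the logical row over a fixed number of physical row keys so the traffic spreads across more replica sets, at the cost of every read slicing all shards and merging the results. A minimal sketch, again in plain Java with illustrative names (shardKey, shardCount) rather than anything from the thread:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only: spreads one logical wide row across a fixed
    // number of physical row keys so that no single replica set takes all of
    // the traffic for a popular row.
    public final class ShardedRowKey {

        private ShardedRowKey() {}

        // Deterministically maps a column name to one of shardCount row keys.
        public static String shardKey(String logicalKey, String columnName, int shardCount) {
            int shard = Math.floorMod(columnName.hashCode(), shardCount);
            return logicalKey + ":" + shard;
        }

        // All physical row keys that must be sliced to reassemble the logical row.
        public static List<String> allShardKeys(String logicalKey, int shardCount) {
            List<String> keys = new ArrayList<>();
            for (int i = 0; i < shardCount; i++) {
                keys.add(logicalKey + ":" + i);
            }
            return keys;
        }

        public static void main(String[] args) {
            System.out.println(shardKey("popular-area", "item-9001", 8));
            System.out.println(allShardKeys("popular-area", 8));
        }
    }

The trade-off is the one discussed above: each row stays a manageable size and the load spreads out, but you give up the convenience of having all the columns under one key, and every read fans out to shardCount rows.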