Hi guys,

Based on what you are saying, there seems to be a tradeoff that developers have to handle between:

"keep your rows under a certain size" vs "keep data that's queried together on disk together"

How would you handle this tradeoff in my case? I monitor about 40,000 independent time-series streams of data. The streams have highly variable rates. Each stream has its own row, and I start a new row every 28 hours. With this scheme, I see several tens of rows reaching millions of columns within this time bucket (the largest I saw was 6.4 million). These wide rows are around 400 MB each, considerably larger than the 60MB you mention.

Will adding a few tens of wide rows like this every day cause me problems in the long term? Should I consider lowering the time bucket? From experience I am a bit reluctant to create too many rows, because I see that reading across multiple rows seriously affects performance. Of course I will use map-reduce as well... will it be significantly affected by many rows?

Cheers,
Alex

On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:

> The reads are only fetching slices of 20 to 100 columns max at a time from
> the row but if the key is planted on one node in the cluster I am concerned
> about that node getting the brunt of traffic.
>
> What RF are you using, how many nodes are in the cluster, and what CL do you
> read at?
>
> If you have lots of nodes in different racks, the NetworkTopologyStrategy
> will do a better job of distributing read load than the SimpleStrategy. The
> DynamicSnitch can also help distribute load; see cassandra.yaml for its
> configuration.
>
> I thought about breaking the column data into multiple different row keys
> to help distribute throughout the cluster but it's so darn handy having all
> the columns in one key!!
>
> If you have a row that will continually grow, it is a good idea to
> partition it in some way. Large rows can slow things like compaction and
> repair down. If you have something above 60MB, it's starting to slow things
> down. Can you partition by a date range such as month?
>
> Large rows are also a little slower to query, see
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>
> If most reads are only pulling 20 to 100 columns at a time, are there two
> workloads? Is it possible to store just these columns in a separate row? If
> you understand how big a row may get, you may be able to use the row cache
> to improve performance.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>
> I have a row key which is now up to 125,000 columns (and anticipated to
> grow). I know this is a far cry from the 2 billion columns a single row key
> can store in Cassandra, but my concern is the amount of reads that this
> specific row key may get compared to other row keys. This particular row
> key houses column data associated with one of the more popular areas of the
> site. The reads are only fetching slices of 20 to 100 columns max at a time
> from the row, but if the key is planted on one node in the cluster I am
> concerned about that node getting the brunt of traffic.
>
> I thought about breaking the column data into multiple different row keys
> to help distribute throughout the cluster, but it's so darn handy having all
> the columns in one key!!
>
> key_cache is enabled but row cache is disabled on the column family.
>
> Should I be concerned going forward? Any particular advice on large wide
> rows?
>
> Thanks!