Hi Jim, it seems we share a very similar use case, with highly variable rates in the time series data sources we archive. When I first started I was preoccupied with this very big difference in row lengths. I was using a schema similar to the one Aaron mentioned: for each data source I had a row with row key = <source:timestamp> and column name = <timestamp>.
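Roughly, the layout looked like this (just a sketch of the key and column shapes from memory, not the actual code; the source name is made up):

    from datetime import datetime

    def row_key(source, row_start):
        # row key = <source:timestamp>; the timestamp marks where this row starts
        return "%s:%s" % (source, row_start.strftime("%Y-%m-%dT%H:%M"))

    def column_name(sample_time):
        # column name = <timestamp> of the individual sample, ISO formatted so columns sort by time
        return sample_time.strftime("%Y-%m-%dT%H:%M:%S.%f")

    print(row_key("dcs42.temperature", datetime(2012, 3, 26)))       # dcs42.temperature:2012-03-26T00:00
    print(column_name(datetime(2012, 3, 26, 13, 42, 7, 250000)))     # 2012-03-26T13:42:07.250000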
At the time I was using 0.7, which did not have counters (or at least I was not aware of them). I used to count the number of columns in every row on the inserting client side, and when a fixed threshold was reached for a certain data source (row key) I would generate a new row key for that data source with the same <source:timestamp> structure, where timestamp = the timestamp of the last value added to the old row (this is the minimum amount of information needed to reconstruct a temporal query across multiple rows). At that point I would reset the counter for the data source to zero and start again. Of course I had to keep track of the row keys in a CF and also flush the counters to another CF whenever the client went down, so I could rebuild a cache of counters when the client came back up. I can say this approach was a pain, and I eventually replaced it with a bucketing scheme similar to what Aaron described, with a fixed bucket across all rows. As you can see, unfortunately, I am still trying to choose a bucket size that is the best compromise for all rows. But it is indeed a lot easier if you can generate all the possible keys for a certain data source on the retrieving client side. If you want more details of how I do this, let me know.

So, as I see from Aaron's suggestion, he's more in favour of pure uniform time bucketing. On Wednesday I'm going to attend http://www.cassandra-eu.org/ and hopefully I will get more opinions there. I'll follow up on this thread if something interesting comes up!

Cheers,
Alex

On Mon, Mar 26, 2012 at 4:10 AM, aaron morton <aa...@thelastpickle.com> wrote:

> There is a great deal of utility in being able to derive the set of possible row keys for a date range on the client side. So I would try to carve up the time slices with respect to the time rather than the amount of data in them. This may not be practical, but I think it's very useful.
>
> Say you are storing the raw time series facts in the Fact CF, the row key is something like <source:datetime> (you may want to add a bucket size, see below) and the column name is the <isotimestamp>. The data source also has a bucket size stored somewhere, such as hourly, daily or monthly.
>
> For an hourly bucket source, the datetime in the row keys is something like "2012-01-02T13:00" (one for each hour); for a daily source it's something like "2012-01-02T00:00". You can then work out the set of possible keys in a date range and perform multi selects against those keys until you have all the data.
>
> If you change the bucketing scheme for a data source you need to keep a history so you can work out which keys may exist. That may be a huge pain. As an alternative, create a custom secondary index, as you discussed, of all the row keys for the data source. But continue to use a consistent time-based method for partitioning time ranges if possible.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
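To make the key derivation Aaron describes above a bit more concrete, here is roughly what I do on the retrieving side. The bucket granularities, key format and source name below are simplified illustrations, not my actual code:

    from datetime import datetime, timedelta

    BUCKET_STEP = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}

    def bucket_start(ts, bucket):
        # truncate a timestamp down to the start of its bucket
        if bucket == "hourly":
            return ts.replace(minute=0, second=0, microsecond=0)
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)  # daily

    def candidate_row_keys(source, start, end, bucket="hourly"):
        # every row key that could hold columns for this source in [start, end]
        keys, current = [], bucket_start(start, bucket)
        while current <= end:
            keys.append("%s:%s" % (source, current.strftime("%Y-%m-%dT%H:%M")))
            current += BUCKET_STEP[bucket]
        return keys

    # multiget these keys, then slice each row on column names within [start, end]
    print(candidate_row_keys("dcs42.temperature",
                             datetime(2012, 1, 2, 13, 20),
                             datetime(2012, 1, 2, 16, 5)))
    # ['dcs42.temperature:2012-01-02T13:00', ..., 'dcs42.temperature:2012-01-02T16:00']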
> On 24/03/2012, at 3:22 AM, Jim Ancona wrote:
>
> I'm dealing with a similar issue, with an additional complication. We are collecting time series data, and the amount of data per time period varies greatly. We will collect and query event data by account, but the biggest account will accumulate about 10,000 times as much data per time period as the median account. So for the median account I could put multiple years of data in one row, while for the largest accounts I don't want to put more than one day's worth in a row. If I use a uniform bucket size of one day (to accommodate the largest accounts) it will make for rows that are too short for the vast majority of accounts: we would have to read thirty rows to get a month's worth of data. One obvious approach is to set a maximum row size, that is, write data in a row until it reaches a maximum length, then start a new one. There are two things that make that harder than it sounds:
>
> 1. There's no efficient way to count columns in a Cassandra row in order to find out when to start a new one.
> 2. Row keys aren't searchable, so I need to be able to construct or look up the key to each row that contains an account's data. (Our data will be in reverse date order.)
>
> Possible solutions:
>
> 1. Cassandra counter columns are an efficient way to keep counts.
> 2. I could have a "directory" row that contains pointers to the rows that contain an account's data.
>
> (I could probably combine the row directory and the column counter into a single counter column family, where the column name is the row key and the value is the counter.) A naive solution would require reading the directory before every read and the counter before every write; caching could probably help with that. So this approach would probably lead to a reasonable solution, but it's liable to be somewhat complex. Before I go much further down this path, I thought I'd run it by this group in case someone can point out a more clever solution.
>
> Thanks,
>
> Jim
>
> On Thu, Mar 22, 2012 at 5:36 PM, Alexandru Sicoe <adsi...@gmail.com> wrote:
>
>> Thanks Aaron, I'll lower the time bucket and see how it goes.
>>
>> Cheers,
>> Alex
>>
>> On Thu, Mar 22, 2012 at 10:07 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>
>>> Will adding a few tens of wide rows like this every day cause me problems in the long term? Should I consider lowering the time bucket?
>>>
>>> IMHO yeah, yup, ya and yes.
>>>
>>> From experience I am a bit reluctant to create too many rows because I see that reading across multiple rows seriously affects performance. Of course I will use map-reduce as well... will it be significantly affected by many rows?
>>>
>>> Don't think it would make too much difference. The range slice used by map-reduce will find the first row in the batch and then step through them.
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>>>
>>> Hi guys,
>>>
>>> Based on what you are saying, there seems to be a tradeoff that developers have to handle between "keep your rows under a certain size" and "keep data that's queried together on disk together".
>>>
>>> How would you handle this tradeoff in my case?
>>>
>>> I monitor about 40,000 independent time series streams of data. The streams have highly variable rates. Each stream has its own row and I go to a new row every 28 hrs. With this scheme, I see several tens of rows reaching sizes in the millions of columns within this time bucket (the largest I saw was 6.4 million). The sizes of these wide rows are around 400 MB (considerably more than 60 MB).
>>>
>>> Will adding a few tens of wide rows like this every day cause me problems in the long term? Should I consider lowering the time bucket?
>>> From experience I am a bit reluctant to create too many rows because I see that reading across multiple rows seriously affects performance. Of course I will use map-reduce as well... will it be significantly affected by many rows?
>>>
>>> Cheers,
>>> Alex
>>>
>>> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>>>
>>>> The reads are only fetching slices of 20 to 100 columns max at a time from the row, but if the key is planted on one node in the cluster I am concerned about that node getting the brunt of traffic.
>>>>
>>>> What RF are you using, how many nodes are in the cluster, and what CL do you read at?
>>>>
>>>> If you have lots of nodes in different racks, the NetworkTopologyStrategy will do a better job of distributing read load than the SimpleStrategy. The DynamicSnitch can also help distribute load; see cassandra.yaml for its configuration.
>>>>
>>>> I thought about breaking the column data into multiple different row keys to help distribute throughout the cluster, but it's so darn handy having all the columns in one key!!
>>>>
>>>> If you have a row that will continually grow it is a good idea to partition it in some way. Large rows can slow things like compaction and repair down; if you have something above 60MB it's starting to slow things down. Can you partition by a date range, such as month?
>>>>
>>>> Large rows are also a little slower to query; see http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>>>
>>>> If most reads are only pulling 20 to 100 columns at a time, are there two workloads? Is it possible to store just these columns in a separate row? If you understand how big a row may get, you may be able to use the row cache to improve performance.
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>>>>
>>>> I have a row key which is now up to 125,000 columns (and anticipated to grow). I know this is a far cry from the 2 billion columns a single row key can store in Cassandra, but my concern is the amount of reads that this specific row key may get compared to other row keys. This particular row key houses column data associated with one of the more popular areas of the site. The reads are only fetching slices of 20 to 100 columns max at a time from the row, but if the key is planted on one node in the cluster I am concerned about that node getting the brunt of traffic.
>>>>
>>>> I thought about breaking the column data into multiple different row keys to help distribute throughout the cluster, but it's so darn handy having all the columns in one key!!
>>>>
>>>> key_cache is enabled but row cache is disabled on the column family.
>>>>
>>>> Should I be concerned going forward? Any particular advice on large wide rows?
>>>>
>>>> Thanks!
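Coming back to Jim's idea above of combining the row directory and the column counter into a single counter CF, here is a rough sketch of the client-side bookkeeping, just to make the moving parts explicit. The in-memory dict stands in for the counter column family, and all the names and the threshold are made up for illustration:

    from datetime import datetime

    MAX_COLUMNS_PER_ROW = 100000   # made-up threshold, tune per workload

    # stand-in for the combined directory + counter CF:
    # row key = account, column name = fact row key, column value = column count
    directory = {}

    def current_row_key(account, now):
        # pick the newest row for this account, or start a new one if it is full (or missing)
        rows = directory.setdefault(account, {})
        newest = max(rows) if rows else None
        if newest is None or rows[newest] >= MAX_COLUMNS_PER_ROW:
            newest = "%s:%s" % (account, now.strftime("%Y-%m-%dT%H:%M:%S"))
            rows[newest] = 0
        return newest

    def record_write(account, row_key, columns_written=1):
        # in the real system this would be a counter column increment, not a dict update
        directory[account][row_key] += columns_written

    def row_keys_newest_first(account):
        # reader side: list the directory to find the rows holding an account's data
        return sorted(directory.get(account, {}), reverse=True)

    key = current_row_key("acct-42", datetime(2012, 3, 26, 13, 0, 0))
    record_write("acct-42", key)
    print(row_keys_newest_first("acct-42"))   # ['acct-42:2012-03-26T13:00:00']

As Jim notes, the catch is that the writer has to consult the counter and the reader has to consult the directory, so caching both on the client is pretty much required.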