Hi,

Jim, it seems we share a very similar use case, with highly variable rates
in the timeseries data sources we archive. When I first started I was
preoccupied with this very big difference in row lengths. I was using a
schema similar to the one Aaron mentioned: for each data source I had a
row with row key = <source:timestamp> and col name = <timestamp>.

At the time I was using 0.7, which did not have counters (or at least I was
not aware of them). I counted the number of columns in every row on the
inserting client side, and when a fixed threshold was reached for a certain
data source (row key) I would generate a new row key for that data source
with the structure <source:timestamp>, where timestamp = the timestamp of
the last value added to the old row (this is the minimum amount of info
needed to reconstruct a temporal query across multiple rows). At that point
I would reset the counter for the data source to zero and start again. Of
course, I had to keep track of the row keys in a CF and also flush the
counters to another CF whenever the client went down, so I could rebuild
the counter cache when the client came back up.

I can say this approach was a pain, and I eventually replaced it with a
bucketing scheme similar to what Aaron described, with a fixed bucket size
across all rows. As you can see, unfortunately, I am still trying to choose
a bucket size that is the best compromise for all rows. But it is indeed a
lot easier if you can generate all the possible keys for a certain data
source on the retrieving client side. If you want more details of how I do
this let me know; a rough sketch of the key generation is below.
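
The gist of it, with a made-up alignment epoch and my current 28 hr bucket
(the resulting keys then go into a multiget, with a column slice between
the two timestamps in each row):

    from datetime import datetime, timedelta

    BUCKET = timedelta(hours=28)    # my current fixed bucket across all rows
    EPOCH = datetime(2012, 1, 1)    # arbitrary fixed alignment point

    def bucket_start(ts):
        # snap a timestamp down to the start of its bucket
        n = int((ts - EPOCH).total_seconds() // BUCKET.total_seconds())
        return EPOCH + n * BUCKET

    def keys_for_range(source, start, end):
        # all row keys that could hold data for `source` in [start, end)
        keys = []
        t = bucket_start(start)
        while t < end:
            keys.append("%s:%s" % (source, t.isoformat()))
            t += BUCKET
        return keys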

So, as I see from Aaron's suggestion, he's more in favour of pure uniform
time bucketing. On Wednesday I'm going to attend
http://www.cassandra-eu.org/ and hopefully I will get more opinions there.
I'll follow up on this thread if something interesting comes up!

Cheers,
Alex



On Mon, Mar 26, 2012 at 4:10 AM, aaron morton <aa...@thelastpickle.com> wrote:

> There is a great deal of utility in being able to derive the set of
> possible row keys for a date range on the client side. So I would try to
> carve up the time slices with respect to the time rather than the amount
> of data in them. This may not be practical, but I think it's very useful.
>
> Say you are storing the raw time series facts in the Fact CF, and the row
> key is something like <source:datetime> (you may want to add a bucket
> size, see below) and the column name is the <isotimestamp>. The data
> source also has a bucket size stored somewhere, such as hourly, daily, or
> monthly.
>
> For an hourly bucket source, the datetime in the row keys is something
> like "2012-01-02T13:00" (one for each hour); for a daily bucket it's
> something like "2012-01-02T00:00". You can then work out the set of
> possible keys in a date range and perform multi selects against those
> keys until you have all the data.
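
Just so I'm sure I follow, the client-side key derivation Aaron describes
could look roughly like the sketch below. The per-source bucket table and
all names are placeholders I made up; the resulting keys would then feed a
multiget, slicing columns by <isotimestamp> within each row:

    from datetime import datetime, timedelta

    # per-source bucket size, stored somewhere (a small config CF, say)
    bucket_for_source = {"source42": "hourly", "source7": "daily"}

    def truncate(ts, bucket):
        if bucket == "hourly":
            return ts.replace(minute=0, second=0, microsecond=0)
        if bucket == "daily":
            return ts.replace(hour=0, minute=0, second=0, microsecond=0)
        # monthly
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

    def step(ts, bucket):
        if bucket == "hourly":
            return ts + timedelta(hours=1)
        if bucket == "daily":
            return ts + timedelta(days=1)
        # crude first-of-next-month step, good enough for key generation
        return (ts.replace(day=28) + timedelta(days=4)).replace(day=1)

    def fact_keys(source, start, end):
        # all <source:datetime> row keys that could hold data in [start, end)
        bucket = bucket_for_source[source]
        keys, t = [], truncate(start, bucket)
        while t < end:
            keys.append("%s:%s" % (source, t.strftime("%Y-%m-%dT%H:%M")))
            t = step(t, bucket)
        return keys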
>
> If you change the bucketing scheme for a data source you need to keep a
> history so you can work out which keys may exist. That may be a huge pain.
> As an alternative, create a custom secondary index, as you discussed, of
> all the row keys for the data source. But continue to use a consistent
> time-based method for partitioning time ranges if possible.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/03/2012, at 3:22 AM, Jim Ancona wrote:
>
> I'm dealing with a similar issue, with an additional complication. We are
> collecting time series data, and the amount of data per time period varies
> greatly. We will collect and query event data by account, but the biggest
> account will accumulate about 10,000 times as much data per time period as
> the median account. So for the median account I could put multiple years
> of data in one row, while for the largest accounts I don't want to put
> more than one day's worth in a row. If I use a uniform bucket size of one
> day (to accommodate the largest accounts) it will make for rows that are
> too short for the vast majority of accounts--we would have to read thirty
> rows to get a month's worth of data. One obvious approach is to set a
> maximum row size, that is, write data in a row until it reaches a maximum
> length, then start a new one. There are two things that make that harder
> than it sounds:
>
>    1. There's no efficient way to count columns in a Cassandra row in
>    order to find out when to start a new one.
>    2. Row keys aren't searchable. So I need to be able to construct or
>    look up the key to each row that contains an account's data. (Our data
>    will be in reverse date order.)
>
> Possible solutions:
>
>    1. Cassandra counter columns are an efficient way to keep counts
>    2. I could have a "directory" row that contains pointers to the rows
>    that contain an account's data
>
> (I could probably combine the row directory and the column counter into a
> single counter column family, where the column name is the row key and the
> value is the counter.) A naive solution would require reading the directory
> before every read and the counter before every write--caching could
> probably help with that. So this approach would probably lead to a
> reasonable solution, but it's liable to be somewhat complex. Before I go
> much further down this path, I thought I'd run it by this group in case
> someone can point out a more clever solution.
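
Jim, for what it's worth, that combined directory/counter CF is close to
what my old threshold scheme boiled down to. A toy sketch of the idea, with
made-up names, a made-up cap, and plain dicts standing in for the two
column families (in reality these would be reads and writes through the
client):

    MAX_COLUMNS = 500000   # made-up row-size cap

    directory = {}  # account -> {data_row_key: column_count}  ("RowDirectory")
    events = {}     # data_row_key -> {timestamp: value}       ("Events")

    def write_event(account, ts, value):
        rows = directory.setdefault(account, {})
        # the newest data row sorts last because its key ends in a timestamp
        row_key = max(rows) if rows else None
        if row_key is None or rows[row_key] >= MAX_COLUMNS:
            row_key = "%s:%s" % (account, ts)       # start a new data row
            rows[row_key] = 0
        events.setdefault(row_key, {})[ts] = value  # insert into Events
        rows[row_key] += 1                          # counter add on RowDirectory

    def read_range(account, start, end):
        # the directory row lists every data row for the account, so a read
        # is a multiget over those keys with a column slice from start to end
        out = {}
        for row_key in directory.get(account, {}):
            for ts, v in events.get(row_key, {}).items():
                if start <= ts <= end:
                    out[ts] = v
        return out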
>
> Thanks,
>
> Jim
> On Thu, Mar 22, 2012 at 5:36 PM, Alexandru Sicoe <adsi...@gmail.com> wrote:
>
>> Thanks Aaron, I'll lower the time bucket, see how it goes.
>>
>> Cheers,
>> Alex
>>
>>
>> On Thu, Mar 22, 2012 at 10:07 PM, aaron morton
>> <aa...@thelastpickle.com> wrote:
>>
>>> Will adding a few tens of wide rows like this every day cause me
>>> problems in the long term? Should I consider lowering the time bucket?
>>>
>>> IMHO yeah, yup, ya and yes.
>>>
>>>
>>> From experience I am a bit reluctant to create too many rows because I
>>> see that reading across multiple rows seriously affects performance. Of
>>> course I will use map-reduce as well... will it be significantly
>>> affected by many rows?
>>>
>>> Don't think it would make too much difference. The range slice used by
>>> map-reduce will find the first row in the batch and then step through
>>> them.
>>>
>>> Cheers
>>>
>>>
>>>   -----------------
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/03/2012, at 11:43 PM, Alexandru Sicoe wrote:
>>>
>>> Hi guys,
>>>
>>> Based on what you are saying there seems to be a tradeoff that
>>> developers have to handle between:
>>>
>>>                                "keep your rows under a certain size" vs
>>> "keep data that's queried together, on disk together"
>>>
>>> How would you handle this tradeoff in my case:
>>>
>>> I monitor about 40,000 independent timeseries streams of data. The
>>> streams have highly variable rates. Each stream has its own row and I go
>>> to a new row every 28 hrs. With this scheme, I see several tens of rows
>>> reaching sizes in the millions of columns within this time bucket (the
>>> largest I saw was 6.4 million). The sizes of these wide rows are around
>>> 400 MBytes (considerably larger than 60MB).
>>>
>>> Will adding a few tens of wide rows like this every day cause me
>>> problems in the long term? Should I consider lowering the time bucket?
>>>
>>> From experience I am a bit reluctant to create too many rows because I
>>> see that reading across multiple rows seriously affects performance. Of
>>> course I will use map-reduce as well... will it be significantly
>>> affected by many rows?
>>>
>>> Cheers,
>>> Alex
>>>
>>> On Tue, Mar 20, 2012 at 6:37 PM, aaron morton
>>> <aa...@thelastpickle.com> wrote:
>>>
>>>> The reads are only fetching slices of 20 to 100 columns max at a time
>>>> from the row but if the key is planted on one node in the cluster I am
>>>> concerned about that node getting the brunt of traffic.
>>>>
>>>> What RF are you using, how many nodes are in the cluster, what CL do
>>>> you read at?
>>>>
>>>> If you have lots of nodes that are in different racks the
>>>> NetworkTopologyStrategy will do a better job of distributing read load
>>>> than the SimpleStrategy. The DynamicSnitch can also help distribute
>>>> read load; see cassandra.yaml for its configuration.
>>>>
>>>> I thought about breaking the column data into multiple different row
>>>> keys to help distribute throughout the cluster but it's so darn handy
>>>> having all the columns in one key!!
>>>>
>>>> If you have a row that will continually grow it is a good idea to
>>>> partition it in some way. Large rows can slow things like compaction
>>>> and repair down. If you have something above 60MB it's starting to slow
>>>> things down. Can you partition by a date range, such as by month?
>>>>
>>>> Large rows are also a little slower to query, see
>>>> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>>>>
>>>> If most reads are only pulling 20 to 100 columns at a time, are there
>>>> two workloads? Is it possible to store just these columns in a separate
>>>> row? If you understand how big a row may get, you may be able to use
>>>> the row cache to improve performance.
>>>>
>>>> Cheers
>>>>
>>>>
>>>>   -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/03/2012, at 2:05 PM, Blake Starkenburg wrote:
>>>>
>>>> I have a row key which is now up to 125,000 columns (and anticipated to
>>>> grow). I know this is a far cry from the 2 billion columns a single row
>>>> key can store in Cassandra, but my concern is the amount of reads that
>>>> this specific row key may get compared to other row keys. This
>>>> particular row key houses column data associated with one of the more
>>>> popular areas of the site. The reads are only fetching slices of 20 to
>>>> 100 columns max at a time from the row but if the key is planted on one
>>>> node in the cluster I am concerned about that node getting the brunt of
>>>> traffic.
>>>>
>>>> I thought about breaking the column data into multiple different row
>>>> keys to help distribute throughout the cluster but it's so darn handy
>>>> having all the columns in one key!!
>>>>
>>>> key_cache is enabled but row cache is disabled on the column family.
>>>>
>>>> Should I be concerned going forward? Any particular advice on large
>>>> wide rows?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
