Well, I am now thinking of adding a "virtual CF" capability to PlayOrm, which we currently use, to allow grouping entities into one column family. Right now CF creation is driven by a single entity, so this would change for those entities that declare they belong to a single CF group. This should not be a very hard change if we decide to do it.
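
To make the idea concrete, here is a rough sketch of how several logical CFs could share one physical CF by prefixing each row key with the logical CF name. This is not PlayOrm code, and it does not talk to Cassandra at all: the separator, the function names, and the in-memory SharedColumnFamily class are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sketch: many "virtual" CFs stored in one physical CF by
# prefixing each row key with the virtual CF name. Not PlayOrm's API.

SEPARATOR = ":"  # assumed delimiter; virtual CF names must not contain it


def to_physical_key(virtual_cf: str, row_key: str) -> str:
    """Build the row key actually stored in the shared physical CF."""
    if SEPARATOR in virtual_cf:
        raise ValueError("virtual CF name must not contain the separator")
    return f"{virtual_cf}{SEPARATOR}{row_key}"


def from_physical_key(physical_key: str) -> Tuple[str, str]:
    """Split a stored key back into (virtual CF name, original row key)."""
    virtual_cf, _, row_key = physical_key.partition(SEPARATOR)
    return virtual_cf, row_key


class SharedColumnFamily:
    """In-memory stand-in for one physical CF holding many virtual CFs."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, object]] = {}

    def put(self, virtual_cf: str, row_key: str, columns: Dict[str, object]) -> None:
        key = to_physical_key(virtual_cf, row_key)
        self._rows.setdefault(key, {}).update(columns)

    def get(self, virtual_cf: str, row_key: str) -> Dict[str, object]:
        return dict(self._rows.get(to_physical_key(virtual_cf, row_key), {}))

    def row_keys(self, virtual_cf: str) -> List[str]:
        # Full scan of the stand-in store; fine for a sketch only.
        prefix = virtual_cf + SEPARATOR
        return sorted(k[len(prefix):] for k in self._rows if k.startswith(prefix))
```

Two devices with unrelated columns would then live side by side, e.g. put("deviceA", "t1", {"temp": 21.5}) and put("deviceB", "t1", {"co2": 0.04}). Note that row_keys() scans everything here; listing the rows of one virtual CF on a real cluster would likely need an index row or something similar, which this sketch does not address.
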
This makes us rely even more on PlayOrm's command line tool (instead of cassandra-cli), as I can't stand reading hex all the time, nor do I like switching my "assume validator" to utf8, to decimal, to integer just so I can read stuff.

Later,
Dean

On 10/1/12 9:22 AM, "Brian O'Neill" <b...@alumni.brown.edu> wrote:

>Dean,
>
>We have the same question...
>
>We have thousands of separate feeds of data as well (20,000+). To
>date, we've been using a CF per feed strategy, but as we scale this
>thing out to accommodate all of those feeds, we're trying to figure
>out if we're going to blow out the memory.
>
>The initial documentation for heap sizing had column families in the
>equation:
>http://www.datastax.com/docs/0.7/operations/tuning#heap-sizing
>
>But in the more recent documentation, it looks like they removed the
>column family variable with the introduction of the universal
>key_cache_size.
>http://www.datastax.com/docs/1.0/operations/tuning#tuning-java-heap-size
>
>We haven't committed either way yet, but given Ed Anuff's presentation
>on virtual keyspaces, we were leaning towards a single column family
>approach:
>http://blog.apigee.com/detail/building_a_mobile_data_platform_with_cassandra_-_apigee_under_the_hood/?
>
>Definitely let us know what you decide.
>
>-brian
>
>On Fri, Sep 28, 2012 at 11:48 AM, Flavio Baronti
><f.baro...@list-group.com> wrote:
>> We had some serious trouble with dynamically adding CFs, although last
>> time we tried we were using version 0.7, so maybe that's not an issue
>> any more.
>> Our problems were two:
>> - You are (were?) not supposed to add CFs concurrently. Since we had
>>   more servers talking to the same Cassandra cluster, we had to use
>>   distributed locks (Hazelcast) to avoid concurrency.
>> - You must be very careful to add new CFs to different Cassandra nodes.
>>   If you do that fast enough, and the clocks of the two servers are
>>   skewed, you will severely compromise your schema (Cassandra will not
>>   understand in which order the updates must be applied).
>>
>> As I said, this applied to version 0.7, maybe current versions solved
>> these problems.
>>
>> Flavio
>>
>>
>> On 2012/09/27 16:11 PM, Hiller, Dean wrote:
>>> We have 1000's of different building devices and we stream data from
>>> these devices. The format and data from each one varies so one device
>>> has temperature at timeX with some other variables, another device has
>>> CO2 percentage and other variables. Every device is unique and streams
>>> its own data. We dynamically discover devices and register them.
>>> Basically, one CF or table per thing really makes sense in this
>>> environment. While we could try to find out which devices "are"
>>> similar, this would really be a pain and some devices add some new
>>> variable into the equation. NOT only that but researchers can register
>>> new datasets and upload them as well and each dataset they have they
>>> do NOT want to share with other researchers necessarily so we have
>>> security groups and each CF belongs to security groups. We dynamically
>>> create CF's on the fly as people register new datasets.
>>>
>>> On top of that, when the data sets get too large, we probably want to
>>> partition a single CF into time partitions. We could create one CF and
>>> put all the data and have a partition per device, but then a time
>>> partition will contain "multiple" devices of data meaning we need to
>>> shrink our time partition size where if we have CF per device, the
>>> time partition can be larger as it is only for that one device.
>>>
>>> THEN, on top of that, we have a meta CF for these devices so some
>>> people want to query for streams that match criteria AND which returns
>>> a CF name and they query that CF name so we almost need a query with
>>> variables like select cfName from Meta where x = y and then
>>> select * from cfName where xxxxx. Which we can do today.
>>>
>>> Dean
>>>
>>> From: Marcelo Elias Del Valle <mvall...@gmail.com>
>>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Date: Thursday, September 27, 2012 8:01 AM
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Re: 1000's of column families
>>>
>>> Out of curiosity, is it really necessary to have that amount of CFs?
>>> I am probably still used to relational databases, where you would use
>>> a new table just in case you need to store different kinds of data. As
>>> Cassandra stores anything in each CF, it might probably make sense to
>>> have a lot of CFs to store your data...
>>> But why wouldn't you use a single CF with partitions in this case?
>>> Wouldn't it be the same thing? I am asking because I might learn a new
>>> modeling technique with the answer.
>>>
>>> []s
>>>
>>> 2012/9/26 Hiller, Dean <dean.hil...@nrel.gov>
>>> We are streaming data with 1 stream per 1 CF and we have 1000's of CF.
>>> When using the tools they are all geared to analyzing ONE column
>>> family at a time :(. If I remember correctly, Cassandra supports as
>>> many CF's as you want, correct? Even though I am going to have tons of
>>> fun with limitations on the tools, correct?
>>>
>>> (I may end up wrapping the node tool with my own aggregate calls if
>>> needed to sum up multiple column families and such).
>>>
>>> Thanks,
>>> Dean
>>>
>>> --
>>> Marcelo Elias Del Valle
>>> http://mvalle.com - @mvallebr
>>>
>>
>
>--
>Brian ONeill
>Lead Architect, Health Market Science (http://healthmarketscience.com)
>Apache Cassandra MVP
>mobile:215.588.6024
>blog: http://brianoneill.blogspot.com/
>twitter: @boneill42