On Tue, Feb 7, 2012 at 6:39 AM, aaron morton <aa...@thelastpickle.com> wrote:
> Sounds like a good start. Super columns are not a great fit for modeling
> time series data for a few reasons, here is one:
> http://wiki.apache.org/cassandra/CassandraLimitations

None of those jump out at me as horrible for my case. If I modelled with
Super Columns I would have less than 10,000 Super Columns with an average
of 50 columns each - big, but not insane?

> It's also a good idea to partition time series data so that the rows do
> not grow too big. You can have 2 billion columns in a row, but big rows
> have operational down sides.
>
> You could go with either:
>
> rows: <entity_id:date>
> column: <property_name>
>
> Which would mean each time you query for a date range you need to query
> multiple rows. But it is possible to get a range of columns / properties.
>
> Or
>
> rows: <entity_id:time_partition>
> column: <date:property_name>

That's an interesting idea - I'll talk to the data experts to see if we
have a sensible range.

> Where time_partition is something that makes sense in your problem domain,
> e.g. a calendar month. If you often query for days in a month you can then
> get all the columns for the days you are interested in (using a column
> range). If you only want to get a subset of the entity properties you will
> need to get them all and filter them client side; depending on the number
> and size of the properties this may be more efficient than multiple calls.

I'm fine with doing work on the client side - I have a bias in that
direction as it tends to scale better.

> One word of warning: avoid sending read requests for lots (i.e. 100's) of
> rows at once, it will reduce overall query throughput. Some clients like
> pycassa take care of this for you.

Because of request overhead? I'm currently using the batch interface of
pycassa to do bulk reads. Is the same problem going to bite me if I have
many clients reading (using bulk reads)? In production we will have ~50
clients. I've put a couple of rough sketches below to show what I mean.
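Here's a rough pycassa sketch of the second layout, mostly to check I've
understood it - the keyspace, column family name and the ':' separator are
placeholders I've made up:

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = ColumnFamily(pool, 'EntityProperties')

    # One row per entity per calendar month, one column per (day, property)
    row_key = 'entity123:2012-02'
    cf.insert(row_key, {'2012-02-07:key1': 'value1',
                        '2012-02-07:key2': 'value2'})

    # All properties for Feb 1-7 in a single column slice; '2012-02-08'
    # sorts before '2012-02-08:<property>', so day 8 stays out of the range
    props = cf.get(row_key,
                   column_start='2012-02-01',
                   column_finish='2012-02-08',
                   column_count=1000)

    # Filter down to the properties I actually want, client side
    wanted = dict((name, value) for name, value in props.items()
                  if name.split(':', 1)[1] in ('key1', 'key3'))

If the 50-100 properties hold for every day, a week's slice is at most
~700 columns per row, which looks comfortably inside the limits you
pointed at.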
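And for the batch reads, this is roughly what I'm doing today against my
current <entity_id:date> rows - entity_days and process() are stand-ins
for the real code:

    # Build the row keys for the entities/days in this query
    keys = ['%s:%s' % (entity, day) for entity, day in entity_days]

    # My understanding is that multiget fetches the keys in buffer_size
    # chunks rather than one huge request, which I take to be the
    # "taking care of it" you mention
    rows = cf.multiget(keys,
                       columns=['key1', 'key2', 'key3'],
                       buffer_size=64)

    for key, properties in rows.items():
        process(properties)

The buffer_size of 64 is just a number I picked for the sketch, not
something I've measured.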
thanks

> Good luck.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
>
> Hi,
>
> I'm pretty new to Cassandra and am currently doing a proof of concept, and
> thought it would be a good idea to ask if my data model is sane . . .
>
> The data I have, and need to query, is reasonably simple. It consists of
> about 10 million entities, each of which has a set of key/value properties
> for each day for about 10 years. The number of keys is in the 50-100 range
> and there will be a lot of overlap for keys in <entity,days>.
>
> The queries I need to make are for sets of key/value properties for an
> entity on a day, e.g. key1,key2,key3 for 10 entities on 20 days. The
> number of entities and/or days in the query could be either very small or
> very large.
>
> I've modelled this with a simple column family for the keys, with the row
> key being the concatenation of the entity and date. My first go used only
> the entity as the row key and then a supercolumn for each date. I decided
> against this mostly because it seemed more complex for a gain I didn't
> really understand.
>
> Does this seem sensible?
>
> thanks
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215

--
*Franc Carter* | Systems architect | Sirca Ltd
franc.car...@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215