On Wed, Feb 8, 2012 at 6:05 AM, aaron morton <aa...@thelastpickle.com> wrote:
> None of those jump out at me as horrible for my case. If I modelled with
> Super Columns I would have less than 10,000 Super Columns with an average
> of 50 columns - big but not insane?
>
> I would still try to do it without super columns. The common belief is
> they are about 10% slower, and they are a lot clunkier. There are some
> query and delete cases where they do things composite columns cannot, but
> in general I try to model things without using them first.

Ok - it seems cleaner to model without them to me as well.

> Because of request overhead? I'm currently using the batch interface of
> pycassa to do bulk reads. Is the same problem going to bite me if I have
> many clients reading (using bulk reads)? In production we will have ~50
> clients.
>
> pycassa has support for chunking requests to the server:
> https://github.com/pycassa/pycassa/blob/master/pycassa/columnfamily.py#L633
>
> It's because each row requested becomes a read task on the server and is
> placed into the read thread pool. There are only 32 (default) read threads
> in the pool. If one query comes along and requests 100 rows, it places 100
> tasks in the thread pool, where only 32 can be processed at a time. Some
> will back up as pending tasks and eventually be processed. If a row read
> takes 1ms (just to pick a number, it may be better), 100 tasks across 32
> threads means three or four waves of reads, so we are talking about 3 or
> 4ms for that query. During that time any read requests received will have
> to wait for read threads.
>
> To that client this is excellent, it has a high row throughput. To the
> other clients this is not, and overall query throughput will drop. More is
> not always better. Note that as the number of nodes increases this effect
> may be reduced, as reading 100 rows may result in the coordinator sending
> 25 row requests to each of 4 nodes.
>
> And there is also overhead involved in very big requests, see…
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Read-Latency-td5636553.html#a5652476

thanks

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
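A minimal sketch of the chunked bulk read being discussed. The keyspace,
column family, and row keys below are invented for illustration; multiget()
and its buffer_size argument are the pycassa code linked above.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Invented names: a keyspace "Keyspace1" with a column family holding
    # one row per <entity:date>.
    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = ColumnFamily(pool, 'DailyProperties')

    row_keys = ['entity1:2012-02-08', 'entity2:2012-02-08']

    # multiget() sends the keys to the server in chunks of buffer_size,
    # so one bulk read does not dump hundreds of row-read tasks into the
    # server's read thread pool at once.
    rows = cf.multiget(row_keys,
                       columns=['key1', 'key2', 'key3'],
                       buffer_size=32)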
> On 7/02/2012, at 2:28 PM, Franc Carter wrote:
>
> On Tue, Feb 7, 2012 at 6:39 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Sounds like a good start. Super columns are not a great fit for modeling
>> time series data for a few reasons, here is one:
>> http://wiki.apache.org/cassandra/CassandraLimitations
>
> None of those jump out at me as horrible for my case. If I modelled with
> Super Columns I would have less than 10,000 Super Columns with an average
> of 50 columns - big but not insane?
>
>> It's also a good idea to partition time series data so that the rows do
>> not grow too big. You can have 2 billion columns in a row, but big rows
>> have operational down sides.
>>
>> You could go with either:
>>
>> rows: <entity_id:date>
>> column: <property_name>
>>
>> Which would mean each time you query for a date range you need to query
>> multiple rows. But it is possible to get a range of columns / properties.
>>
>> Or:
>>
>> rows: <entity_id:time_partition>
>> column: <date:property_name>
>
> That's an interesting idea - I'll talk to the data experts to see if we
> have a sensible range.
>
>> Where time_partition is something that makes sense in your problem
>> domain, e.g. a calendar month. If you often query for days in a month you
>> can then get all the columns for the days you are interested in (using a
>> column range). If you only want to get a subset of the entity properties
>> you will need to get them all and filter them client side; depending on
>> the number and size of the properties this may be more efficient than
>> multiple calls.
>
> I'm fine with doing work on the client side - I have a bias in that
> direction as it tends to scale better.
>
>> One word of warning: avoid sending read requests for lots (i.e. 100's) of
>> rows at once, as it will reduce overall query throughput. Some clients,
>> like pycassa, take care of this for you.
>
> Because of request overhead? I'm currently using the batch interface of
> pycassa to do bulk reads. Is the same problem going to bite me if I have
> many clients reading (using bulk reads)? In production we will have ~50
> clients.
>
> thanks
>
>> Good luck.
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
>>
>> Hi,
>>
>> I'm pretty new to Cassandra and am currently doing a proof of concept,
>> and thought it would be a good idea to ask if my data model is sane . . .
>>
>> The data I have, and need to query, is reasonably simple. It consists of
>> about 10 million entities, each of which has a set of key/value
>> properties for each day for about 10 years. The number of keys is in the
>> 50-100 range and there will be a lot of overlap for keys across
>> <entity,day> pairs.
>>
>> The queries I need to make are for sets of key/value properties for an
>> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The
>> number of entities and/or days in the query could be either very small or
>> very large.
>>
>> I've modelled this with a simple column family for the keys, with the row
>> key being the concatenation of the entity and date. My first go used only
>> the entity as the row key and then used a supercolumn for each date. I
>> decided against this mostly because it seemed more complex for a gain I
>> didn't really understand.
>>
>> Does this seem sensible?
>>
>> thanks
>>
>> --
>> *Franc Carter* | Systems architect | Sirca Ltd
>> <marc.zianideferra...@sirca.org.au>
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 9236 9118
>> Level 9, 80 Clarence St, Sydney NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
> <marc.zianideferra...@sirca.org.au>
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215

--
*Franc Carter* | Systems architect | Sirca Ltd
<marc.zianideferra...@sirca.org.au>
franc.car...@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
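A minimal sketch of the time-partitioned model discussed above: row key
<entity_id:time_partition>, column name <date:property_name>. The entity and
property names, the calendar-month partition, and the assumption of an
ASCII column comparator are all illustrative, not something settled in the
thread.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = ColumnFamily(pool, 'EntityTimeSeries')

    # One row per entity per calendar month, one column per (day, property).
    cf.insert('entity42:2012-02', {
        '2012-02-07:key1': 'v1',
        '2012-02-07:key2': 'v2',
        '2012-02-08:key1': 'v3',
    })

    # A single column-range read returns every property for a span of days;
    # '~' sorts after the property names under an ASCII comparator. Unwanted
    # properties are then filtered client side, as suggested in the thread.
    cols = cf.get('entity42:2012-02',
                  column_start='2012-02-07:',
                  column_finish='2012-02-08:~',
                  column_count=10000)
    wanted = dict((c, v) for c, v in cols.items()
                  if c.split(':', 1)[1] in ('key1', 'key3'))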