On Wed, Feb 8, 2012 at 6:05 AM, aaron morton <aa...@thelastpickle.com> wrote:
> None of those jump out at me as horrible for my case. If I modelled with
> Super Columns I would have less than 10,000 Super Columns with an average
> of 50 columns - big but not insane?
>
> I would still try to do it without super columns. The common belief is
> they are about 10% slower, and they are a lot clunkier. There are some
> query and delete cases where they do things composite columns cannot, but
> in general I try to model things without using them first.

Ok - it seems cleaner to model without them to me as well.

> Because of request overhead? I'm currently using the batch interface of
> pycassa to do bulk reads. Is the same problem going to bite me if I have
> many clients reading (using bulk reads)? In production we will have ~50
> clients.
>
> pycassa has support for chunking requests to the server:
> https://github.com/pycassa/pycassa/blob/master/pycassa/columnfamily.py#L633
>
> It's because each row requested becomes a read task on the server and is
> placed into the read thread pool. There are only 32 (default) read threads
> in the pool. If one query comes along and requests 100 rows, it places 100
> tasks in the thread pool, where only 32 can be processed at a time. Some
> will back up as pending tasks and eventually be processed. If a row read
> takes 1ms (just to pick a number, it may be better), 100 tasks across 32
> threads means three or four waves of reads, so we are talking about 3 or
> 4ms for that query. During that time any read requests received will have
> to wait for read threads.
>
> To that client this is excellent, it has a high row throughput. To the
> other clients this is not, and overall query throughput will drop. More is
> not always better. Note that as the number of nodes increases this effect
> may be reduced, as reading 100 rows may result in the coordinator sending
> 25 row requests to each of 4 nodes.
>
> And there is also overhead involved in very big requests, see…
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Read-Latency-td5636553.html#a5652476

thanks

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
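A minimal sketch of the chunked bulk read being discussed. The keyspace,
column family, and row keys below are invented for illustration; multiget()
and its buffer_size argument are the pycassa code linked above.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    # Invented names: a keyspace "Keyspace1" with a column family holding
    # one row per <entity:date>.
    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = ColumnFamily(pool, 'DailyProperties')

    row_keys = ['entity1:2012-02-08', 'entity2:2012-02-08']

    # multiget() sends the keys to the server in chunks of buffer_size,
    # so one bulk read does not dump hundreds of row-read tasks into the
    # server's read thread pool at once.
    rows = cf.multiget(row_keys,
                       columns=['key1', 'key2', 'key3'],
                       buffer_size=32)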
> On 7/02/2012, at 2:28 PM, Franc Carter wrote:
>
> On Tue, Feb 7, 2012 at 6:39 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Sounds like a good start. Super columns are not a great fit for modeling
>> time series data for a few reasons, here is one:
>> http://wiki.apache.org/cassandra/CassandraLimitations
>
> None of those jump out at me as horrible for my case. If I modelled with
> Super Columns I would have less than 10,000 Super Columns with an average
> of 50 columns - big but not insane?
>
>> It's also a good idea to partition time series data so that the rows do
>> not grow too big. You can have 2 billion columns in a row, but big rows
>> have operational down sides.
>>
>> You could go with either:
>>
>> rows: <entity_id:date>
>> column: <property_name>
>>
>> Which would mean each time you query for a date range you need to query
>> multiple rows. But it is possible to get a range of columns / properties.
>>
>> Or:
>>
>> rows: <entity_id:time_partition>
>> column: <date:property_name>
>
> That's an interesting idea - I'll talk to the data experts to see if we
> have a sensible range.
>
>> Where time_partition is something that makes sense in your problem
>> domain, e.g. a calendar month. If you often query for days in a month you
>> can then get all the columns for the days you are interested in (using a
>> column range). If you only want to get a subset of the entity properties
>> you will need to get them all and filter them client side; depending on
>> the number and size of the properties this may be more efficient than
>> multiple calls.
>
> I'm fine with doing work on the client side - I have a bias in that
> direction as it tends to scale better.
>
>> One word of warning: avoid sending read requests for lots (i.e. 100's) of
>> rows at once, as it will reduce overall query throughput. Some clients,
>> like pycassa, take care of this for you.
>
> Because of request overhead? I'm currently using the batch interface of
> pycassa to do bulk reads. Is the same problem going to bite me if I have
> many clients reading (using bulk reads)? In production we will have ~50
> clients.
>
> thanks
>
>> Good luck.
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
>>
>> Hi,
>>
>> I'm pretty new to Cassandra and am currently doing a proof of concept,
>> and thought it would be a good idea to ask if my data model is sane . . .
>>
>> The data I have, and need to query, is reasonably simple. It consists of
>> about 10 million entities, each of which has a set of key/value
>> properties for each day for about 10 years. The number of keys is in the
>> 50-100 range and there will be a lot of overlap for keys across
>> <entity,day> pairs.
>>
>> The queries I need to make are for sets of key/value properties for an
>> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The
>> number of entities and/or days in the query could be either very small or
>> very large.
>>
>> I've modelled this with a simple column family for the keys, with the row
>> key being the concatenation of the entity and date. My first go used only
>> the entity as the row key and then used a supercolumn for each date. I
>> decided against this mostly because it seemed more complex for a gain I
>> didn't really understand.
>>
>> Does this seem sensible?
>>
>> thanks
>>
>> --
>> *Franc Carter* | Systems architect | Sirca Ltd
>> <marc.zianideferra...@sirca.org.au>
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 9236 9118
>> Level 9, 80 Clarence St, Sydney NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
> <marc.zianideferra...@sirca.org.au>
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215

--
*Franc Carter* | Systems architect | Sirca Ltd
<marc.zianideferra...@sirca.org.au>
franc.car...@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215
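A minimal sketch of the time-partitioned model discussed above: row key
<entity_id:time_partition>, column name <date:property_name>. The entity and
property names, the calendar-month partition, and the assumption of an
ASCII column comparator are all illustrative, not something settled in the
thread.

    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('Keyspace1', ['localhost:9160'])
    cf = ColumnFamily(pool, 'EntityTimeSeries')

    # One row per entity per calendar month, one column per (day, property).
    cf.insert('entity42:2012-02', {
        '2012-02-07:key1': 'v1',
        '2012-02-07:key2': 'v2',
        '2012-02-08:key1': 'v3',
    })

    # A single column-range read returns every property for a span of days;
    # '~' sorts after the property names under an ASCII comparator. Unwanted
    # properties are then filtered client side, as suggested in the thread.
    cols = cf.get('entity42:2012-02',
                  column_start='2012-02-07:',
                  column_finish='2012-02-08:~',
                  column_count=10000)
    wanted = dict((c, v) for c, v in cols.items()
                  if c.split(':', 1)[1] in ('key1', 'key3'))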