> None of those jump out at me as horrible for my case. If I modelled with 
> Super Columns I would have less than 10,000 Super Columns with an average of 
> 50 columns - big but not insane?
I would still try to do it without super columns. The common belief is that they 
are about 10% slower, and they are a lot clunkier. There are some query and delete 
cases where they can do things composite columns cannot, but in general I try to 
model things without them first.
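
If it helps, here is a rough pycassa sketch of the composite column approach. The 
keyspace, column family and property names are made up, and it assumes the CF was 
created with a CompositeType(DateType, UTF8Type) comparator so that column names 
are (date, property) tuples:

    # Sketch only - names are illustrative, not from your schema.
    import datetime
    import pycassa

    pool = pycassa.ConnectionPool('my_keyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'entity_properties')

    day = datetime.datetime(2012, 2, 7)

    # One row per entity (or entity:time_partition), one composite column
    # per (date, property) pair - no super column needed.
    cf.insert('entity_42', {(day, 'key1'): 'value1',
                            (day, 'key2'): 'value2'})

    # Pull back specific properties for that day by exact composite name.
    props = cf.get('entity_42', columns=[(day, 'key1'), (day, 'key2')])

That is the general shape; the same layout works with the time partitioned row 
keys discussed further down.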

> Because of request overhead ? I'm currently using the batch interface of 
> pycassa to do bulk reads. Is the same problem going to bite me if I have many 
> clients reading (using bulk reads) ? In production we will have ~50 clients. 
pycassa has support for chunking requests to the server
https://github.com/pycassa/pycassa/blob/master/pycassa/columnfamily.py#L633
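
Something like this, where the column family and key names are made up and the 
buffer_size value is just an example:

    # Sketch only - pycassa splits the multiget into chunks of buffer_size
    # keys per server request instead of sending all the keys in one hit.
    import pycassa

    pool = pycassa.ConnectionPool('my_keyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'day_properties')

    keys = ['entity_%d:2012-02-07' % i for i in range(200)]
    rows = cf.multiget(keys, columns=['key1', 'key2', 'key3'], buffer_size=64)

That splits one big client call into several smaller server requests.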

It's because each row requested becomes a read task on the server and is placed 
into the read thread pool. There are only 32 (by default) read threads in the pool. 
If one query comes along and requests 100 rows, it places 100 tasks in the 
thread pool, where only 32 can be processed at a time. Some will back up as 
pending tasks and eventually be processed. If a single row read takes 1ms (just 
to pick a number, it may be better), reading 100 rows works out to 3 or 4ms 
for that query. During that time any other read requests received will have to 
wait for read threads.

To that client this is excellent: it has a high row throughput. To the other 
clients it is not, and overall query throughput will drop. More is not always 
better. Note that as the number of nodes increases this effect may be reduced, 
since reading 100 rows may result in the coordinator sending 25 row requests to 
each of 4 nodes. 
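
Back of the envelope version of that, just to make the arithmetic explicit (the 
1ms per row and 32 threads are the same illustrative numbers as above, not 
measurements):

    import math

    read_threads = 32      # default size of the read stage thread pool
    rows_requested = 100   # one client's multiget
    ms_per_row = 1.0       # assumed per-row read latency, not measured

    # 100 tasks drain through 32 threads in ceil(100 / 32) waves of ~1ms each.
    waves = math.ceil(rows_requested / float(read_threads))
    print('big query takes roughly %dms' % (waves * ms_per_row))   # ~4ms

    # While those waves run, everyone else's reads sit in the pending queue,
    # so their latency goes up even though this client sees great throughput.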

And there is also overhead involved in very big requests, see:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Read-Latency-td5636553.html#a5652476

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 7/02/2012, at 2:28 PM, Franc Carter wrote:

> On Tue, Feb 7, 2012 at 6:39 AM, aaron morton <aa...@thelastpickle.com> wrote:
> Sounds like a good start. Super columns are not a great fit for modeling time 
> series data, for a few reasons; here is one: 
> http://wiki.apache.org/cassandra/CassandraLimitations
> 
> 
> None of those jump out at me as horrible for my case. If I modelled with 
> Super Columns I would have less than 10,000 Super Columns with an average of 
> 50 columns - big but not insane?
>  
> 
> It's also a good idea to partition time series data so that the rows do not 
> grow too big. You can have 2 billion columns in a row, but big rows have 
> operational downsides.
> 
> You could go with either:
> 
> rows: <entity_id:date>
> column: <property_name>
> 
> Which would mean each time you query for a date range you need to query 
> multiple rows. But it is possible to get a range of columns / properties.
> 
> Or
> 
> rows: <entity_id:time_partition>
> column: <date:property_name>
> 
> That's an interesting idea - I'll talk to the data experts to see if we have 
> a sensible range.
>  
> 
> Where time_partition is something that makes sense in your problem domain, 
> e.g. a calendar month. If you often query for days in a month you can then 
> get all the columns for the days you are interested in (using a column 
> range). If you only want to get a subset of the entity properties you will 
> need to get them all and filter them client side; depending on the number and 
> size of the properties this may be more efficient than multiple calls. 
> 
> I'm fine with doing work on the client side - I have a bias in that direction 
> as it tends to scale better.
>  
> 
> One word of warning: avoid sending read requests for lots (i.e. 100's) of 
> rows at once, as it will reduce overall query throughput. Some clients like 
> pycassa take care of this for you.
> 
> Because of request overhead ? I'm currently using the batch interface of 
> pycassa to do bulk reads. Is the same problem going to bite me if I have many 
> clients reading (using bulk reads) ? In production we will have ~50 clients. 
>  
> thanks
> 
> 
> Good luck. 
>  
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 5/02/2012, at 12:12 AM, Franc Carter wrote:
> 
>> 
>> Hi,
>> 
>> I'm pretty new to Cassandra and am currently doing a proof of concept, and 
>> thought it would be a good idea to ask if my data model is sane . . . 
>> 
>> The data I have, and need to query, is reasonably simple. It consists of 
>> about 10 million entities, each of which has a set of key/value properties 
>> for each day for about 10 years. The number of keys is in the 50-100 range 
>> and there will be a lot of overlap for keys in <entity,days>.
>> 
>> The queries I need to make are for sets of key/value properties for an 
>> entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number 
>> of entities and/or days in the query could be either very small or very 
>> large.
>> 
>> I've modeled this with a simple column family for the keys, with the row key 
>> being the concatenation of the entity and date. My first go used only the 
>> entity as the row key and then used a supercolumn for each date. I decided 
>> against this mostly because it seemed more complex for a gain I didn't 
>> really understand.
>> 
>> Does this seem sensible ?
>> 
>> thanks
>> 
>> -- 
>> Franc Carter | Systems architect | Sirca Ltd
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 9236 9118 
>> Level 9, 80 Clarence St, Sydney NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>> 
> 
> 
> 
> 
> -- 
> Franc Carter | Systems architect | Sirca Ltd
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118 
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
> 
