Ok, so here's how I'd think about this:

The writes don't matter. (There's a tiny bit of nuance in the one-table model, 
where writes can contend when appending to the memtable, but even the best 
Cassandra engineers on earth probably won't notice that unless you have really 
hot partitions, so ignore the write path.)

The reads are where the two models differ.

In both models, you'll use the partition index to seek to where the partition 
starts.
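
For concreteness, here's a hypothetical CQL sketch of the two models, assuming 
col1-col5 are clustering columns after CK (that's the reading that makes the 
ck+colN access patterns below work; the names and int types are just 
placeholders from your mail):

  -- Model 1: one combined table, all five columns in the clustering key
  CREATE TABLE combined (
    pk int, ck int,
    col1 int, col2 int, col3 int, col4 int, col5 int,
    PRIMARY KEY ((pk), ck, col1, col2, col3, col4, col5)
  );

  -- Model 2: two separate tables sharing pk and ck
  CREATE TABLE table1 (
    pk int, ck int, col1 int, col2 int, col3 int,
    PRIMARY KEY ((pk), ck, col1, col2, col3)
  );
  CREATE TABLE table2 (
    pk int, ck int, col3 int, col4 int, col5 int,
    PRIMARY KEY ((pk), ck, col3, col4, col5)
  );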

In Model 2's table 1, if you query with ck+col1+…, the read will load the 
column index and use it to jump to within 64KB of the col1 value you specify.

In Model 2's table 2, if you query with ck+col3+…, it's the same thing: the 
column index can jump to within 64KB.
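
A minimal sketch of those two slice reads, against the assumed schemas above:

  -- Model 2, table 1: col1 is the next clustering column after ck,
  -- so the column index can jump close to the matching rows
  SELECT * FROM table1 WHERE pk = ? AND ck = ? AND col1 = ?;

  -- Model 2, table 2: same thing, with col3 as the next clustering column
  SELECT * FROM table2 WHERE pk = ? AND ck = ? AND col3 = ?;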

What you give up in Model 1 is the granularity of that jump. If you use Model 
1 and query by col3 instead of col1, the read will have to scan the partition. 
In your case, with 80 rows, the whole partition may still fit within that 64KB 
block - you may not get more granular than that anyway. And even if it's 
slightly larger, you're probably compressing in 64KB chunks - maybe you have 
to decompress one extra chunk on read if your 1000 rows spill past 64KB, but 
you likely won't actually notice. You're technically asking the server to read 
and skip data it doesn't need to return - it's not the most efficient, but at 
that partition size it's noise. You could also just return all 80-100 rows, 
let the server do slightly less work, and filter client side - also valid, 
probably slightly worse than the server-side filter.
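
To make both options concrete, a sketch against the assumed combined table 
(ALLOW FILTERING is needed here because col3 isn't the next clustering column 
after ck):

  -- Option A: server-side filter - scans the partition and skips
  -- rows whose col3 doesn't match before returning results
  SELECT * FROM combined
  WHERE pk = ? AND ck = ? AND col3 = ? ALLOW FILTERING;

  -- Option B: read all 80-100 rows of the partition
  -- and filter on col3 client side
  SELECT * FROM combined WHERE pk = ? AND ck = ?;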

Having one table instead of two, though, probably saves you a ton of disk 
space ($€£), and the smaller footprint may also mean the data stays in page 
cache, so the extra read may not even go to disk anyway.

So with your actual data shape, I imagine you won't really notice the nominal 
inefficiency of the first model, and I'd be inclined to use it until you 
demonstrate it won't work (I bet it works fine for a long, long time).

> On Jun 22, 2022, at 7:11 PM, MyWorld <timeplus.1...@gmail.com> wrote:
> 
> 
> Hi Jeff,
> Let me know what impact the number of rows has here.
> Maybe today I have 80-100 rows per partition, but what if I start storing 
> 2-4k rows per partition? The total partition size would still be under 100 MB.
> 
>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa <jji...@gmail.com> wrote:
>> How many rows per partition in each model?
>> 
>> 
>> > On Jun 22, 2022, at 6:38 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>> > 
>> > 
>> > Hi all,
>> > 
>> > Just a small query around data modelling.
>> > Suppose we have to design the data model for 2 different use cases which 
>> > will query the data on the same set of (partition + clustering key). Should 
>> > we maintain a separate table for each, or a single table?
>> > 
>> > Model1 - Combined table
>> > Table(Pk, CK, col1, col2, col3, col4, col5)
>> > 
>> > Model2 - Seperate tables
>> > Table1(Pk, CK, col1, col2, col3)
>> > Table2(Pk, CK, col3, col4, col5)
>> > 
>> > So here the partition and clustering keys are the same. Also note that 
>> > column col3 is required in both use cases.
>> > 
>> > My thought is that in Model 2 the partition size would be smaller. There 
>> > would be fewer SSTables, and with levelled compaction they could be easily 
>> > maintained, so read performance should be better.
>> > 
>> > Please help me highlight the drawbacks and advantages of each data model. 
>> > We have a mixed read/write workload here.
