This assumes each row is like … I dunno, 10-1000 bytes. If you’re storing 
a huge 1 MB blob per row, use two tables for sure.  
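To make the 64 KB chunk reasoning in the quoted reply below a bit more concrete, here’s a rough back-of-envelope sketch (plain Python; the row counts and row sizes are illustrative assumptions from this thread, not measurements of any real table):

```python
# Rough estimate of how many 64 KiB compression chunks a single
# partition spans, assuming Cassandra's default chunk size.
# Row sizes below are illustrative guesses, not measurements.

CHUNK_SIZE = 64 * 1024  # default chunk_length_in_kb = 64


def chunks_spanned(rows_per_partition: int, bytes_per_row: int) -> int:
    """Number of 64 KiB chunks a partition of this size touches."""
    partition_bytes = rows_per_partition * bytes_per_row
    # Ceiling division: even a 1-byte partition still touches one chunk.
    return -(-partition_bytes // CHUNK_SIZE)


# 80-100 rows of ~100 bytes: well inside a single chunk, so scanning
# the whole partition decompresses no extra chunks at all.
print(chunks_spanned(100, 100))    # → 1

# 2-4k rows of ~1000 bytes: a few dozen chunks. Still small, but a
# full-partition scan now decompresses more than a column-index seek.
print(chunks_spanned(4000, 1000))  # → 62
```

This is why the 80-100 row case is effectively free to scan, while the 2-4k row case starts to pay a (still modest) decompression cost.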

> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> 
> 
> 
> Ok so here’s how I would think about this
> 
> The writes don’t matter. (There’s a tiny bit of nuance in the one-table model, 
> where you can contend on adding to the memtable, but even the best Cassandra 
> engineers on earth probably won’t notice that unless you have really super hot 
> partitions, so ignore the write path.)
> 
> The reads are where the two models differ.
> 
> In both models/cases, you’ll use the partition index to seek to where the 
> partition starts. 
> 
> In model 2, table 1, if you query by ck+col1+…, the read will load the column 
> index and use it to jump to within 64 KB of the col1 value you specify. 
> 
> In model 2, table 2, if you query by ck+col3+…, same thing - the column index 
> can jump to within 64 KB.
> 
> What you give up in model 1 is the granularity of that jump. If you use 
> model 1 and col3 instead of col1, the read will have to scan the partition. 
> In your case, with 80 rows, that may still be within that 64 KB block - you 
> may not get more granular than that anyway. And even if it’s slightly larger, 
> you’re probably going to be compressing 64 KB chunks - maybe you have to 
> decompress one extra chunk on read if your 1000 rows go past 64 KB, but you 
> likely won’t actually notice. You’re technically asking the server to read 
> and skip data it doesn’t need to return - it’s not the most efficient, 
> but at that partition size it’s noise. You could also just return all 80-100 
> rows, let the server do slightly less work, and filter client side - also 
> valid, probably slightly worse than the server-side filter. 
> 
> Having one table instead of two, though, probably saves you a ton of disk 
> space ($€£), and the lower disk space may also mean that data stays in page 
> cache, so the extra read may not even go to disk anyway.
> 
> So with your actual data shape, I imagine you won’t really notice the nominal 
> inefficiency of the first model, and I’d be inclined to do that until you 
> demonstrate it won’t work (I bet it works fine for a long long time). 
> 
>> On Jun 22, 2022, at 7:11 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>> 
>> Hi Jeff,
>> Let me know how the number of rows has an impact here.
>> Maybe today I have 80-100 rows per partition, but what if I start storing 
>> 2-4k rows per partition? The total partition size would still be under 100 MB. 
>> 
>>> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>> How many rows per partition in each model?
>>> 
>>> 
>>> > On Jun 22, 2022, at 6:38 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>>> > 
>>> > 
>>> > Hi all,
>>> > 
>>> > Just a small query around data modelling.
>>> > Suppose we have to design the data model for 2 different use cases which 
>>> > will query the data on the same set of (partition + clustering) keys. Should 
>>> > we maintain a separate table for each, or a single table? 
>>> > 
>>> > Model 1 - Combined table
>>> > Table(Pk, CK, col1, col2, col3, col4, col5)
>>> > 
>>> > Model 2 - Separate tables
>>> > Table1(Pk, CK, col1, col2, col3)
>>> > Table2(Pk, CK, col3, col4, col5)
>>> > 
>>> > Here the partition and clustering keys are the same. Also note that 
>>> > column col3 is required in both use cases.
>>> > 
>>> > My thought is that in Model 2 the partition size would be smaller. There 
>>> > would be fewer SSTables, and with leveled compaction they could be easily 
>>> > maintained, so read performance should be better.
>>> > 
>>> > Please help me highlight the drawbacks and advantages of each data 
>>> > model. We have a mixed read/write workload.
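
For reference, the two models in the original question might look like this in CQL (a sketch only; the column types and exact key layout are assumptions, since the post doesn’t give a full schema):

```sql
-- Model 1: one combined table (text types and key layout are assumed).
CREATE TABLE combined (
    pk   text,
    ck   text,
    col1 text,
    col2 text,
    col3 text,
    col4 text,
    col5 text,
    PRIMARY KEY (pk, ck)
);

-- Model 2: two tables sharing the same partition/clustering keys,
-- with col3 duplicated into both.
CREATE TABLE use_case_a (
    pk   text,
    ck   text,
    col1 text,
    col2 text,
    col3 text,
    PRIMARY KEY (pk, ck)
);

CREATE TABLE use_case_b (
    pk   text,
    ck   text,
    col3 text,
    col4 text,
    col5 text,
    PRIMARY KEY (pk, ck)
);
```

Model 2 duplicates col3 (and every write) across two tables, which is the disk-space and write-amplification cost the thread weighs against Model 1’s slightly coarser reads.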
