Thanks a lot, Jeff, Michiel and Manish, for your replies. Really helpful.

On Thu, Jun 23, 2022, 9:50 AM Jeff Jirsa <jji...@gmail.com> wrote:
> This is assuming each row is like … I dunno, 10-1000 bytes. If you're
> storing like a huge 1 MB blob, use two tables for sure.
>
> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> Ok, so here's how I would think about this.
>
> The writes don't matter. (There's a tiny, tiny bit of nuance in the
> one-table model where you can contend adding to the memtable, but the
> best Cassandra engineers on earth probably won't notice that unless you
> have really super hot partitions, so ignore the write path.)
>
> The reads are where it changes.
>
> In both models, you'll use the partition index to seek to where the
> partition starts.
>
> In model 2, table 1, if you query by ck+col1+… the read will load the
> column index and use that to jump to within 64 KB of the col1 value you
> specify.
>
> In model 2, table 2, if you query by ck+col3+…, same thing - the column
> index can jump to within 64 KB.
>
> What you give up in model 1 is the granularity of that jump. If you use
> model 1 and col3 instead of col1, the read will have to scan the
> partition. In your case, with 80 rows, that may still be within that one
> 64 KB block - you may not get more granular than that anyway. And even
> if it's slightly larger, you're probably going to be compressing 64 KB
> chunks - maybe you have to decompress one extra chunk on read if your
> 1000 rows go past 64 KB, but you likely won't actually notice. You're
> technically asking the server to read and skip data it doesn't need to
> return - it's not really the most efficient, but at that partition size
> it's noise. You could also just return all 80-100 rows, let the server
> do slightly less work, and filter client side - also valid, probably
> slightly worse than the server-side filter.
>
> Having one table instead of two, though, probably saves you a ton of
> disk space ($€£), and the lower disk space may also mean that data stays
> in page cache, so the extra read may not even go to disk anyway.
>
> So with your actual data shape, I imagine you won't really notice the
> nominal inefficiency of the first model, and I'd be inclined to do that
> until you demonstrate it won't work (I bet it works fine for a long,
> long time).
>
> On Jun 22, 2022, at 7:11 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>
> Hi Jeff,
> Let me know what impact the number of rows has here. Maybe today I have
> 80-100 rows per partition, but what if I started storing 2-4k rows per
> partition, with the total partition size still under 100 MB?
>
> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> How many rows per partition in each model?
>>
>> > On Jun 22, 2022, at 6:38 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > Just a small query around data modelling.
>> > Suppose we have to design the data model for 2 different use cases
>> > which will query the data on the same set of (partition + clustering)
>> > keys. Should we maintain a separate table for each, or a single table?
>> >
>> > Model 1 - combined table:
>> > Table(Pk, CK, col1, col2, col3, col4, col5)
>> >
>> > Model 2 - separate tables:
>> > Table1(Pk, CK, col1, col2, col3)
>> > Table2(Pk, CK, col3, col4, col5)
>> >
>> > Here the partition and clustering keys are the same, and note that
>> > column col3 is required in both use cases.
>> >
>> > My thought is that in Model 2 the partition size would be smaller and
>> > there would be fewer SSTables, and with leveled compaction it could
>> > be easily maintained, so read performance should be better.
>> >
>> > Please help me highlight the drawbacks and advantages of each data
>> > model.
>> > Here we have a mixed kind of workload (read/write).
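
For anyone following along, here is a minimal CQL sketch of the two models
from the original question. The names (pk, ck, col1…col5) are the
question's placeholders, and the text types are assumptions chosen only
for illustration:

    -- Model 1: one combined table serving both use cases
    CREATE TABLE combined (
        pk   text,  -- partition key (placeholder name)
        ck   text,  -- clustering key (placeholder name)
        col1 text,
        col2 text,
        col3 text,  -- needed by both use cases
        col4 text,
        col5 text,
        PRIMARY KEY (pk, ck)
    );

    -- Model 2: one table per use case; pk, ck and col3 are duplicated
    CREATE TABLE use_case_1 (
        pk   text,
        ck   text,
        col1 text,
        col2 text,
        col3 text,
        PRIMARY KEY (pk, ck)
    );

    CREATE TABLE use_case_2 (
        pk   text,
        ck   text,
        col3 text,
        col4 text,
        col5 text,
        PRIMARY KEY (pk, ck)
    );

The duplicated pk, ck and col3 in Model 2 are where the extra disk space
Jeff mentions comes from.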
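
The two read strategies Jeff compares (a targeted server-side read vs.
fetching the whole partition and filtering client side) would look roughly
like this against the Model 1 combined table; again a sketch with
placeholder names and values, not the poster's actual queries:

    -- Targeted read: the partition index seeks to the start of the
    -- partition, then the column index jumps to within one ~64 KB block
    -- of the requested clustering value
    SELECT col1, col2, col3 FROM combined WHERE pk = 'p1' AND ck = 'c1';

    -- Whole-partition read: return all 80-100 rows and filter client
    -- side; slightly less work for the server, a bit more data returned
    SELECT col1, col2, col3 FROM combined WHERE pk = 'p1';

With 80-100 rows per partition, both reads will often touch the same
single compression chunk, which is why Jeff expects the difference to be
noise.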