Thanks a lot Jeff, Michiel and Manish for your replies. Really helpful.

On Thu, Jun 23, 2022, 9:50 AM Jeff Jirsa <jji...@gmail.com> wrote:

> This is assuming each row is, like … I dunno, 10-1000 bytes. If you’re
> storing a huge 1MB blob, use two tables for sure.
>
> On Jun 22, 2022, at 9:06 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>
> 
>
> Ok so here’s how I would think about this
>
> The writes don’t matter. (There’s a tiny bit of nuance in one table, where
> you can contend on adding to the memtable, but even the best Cassandra
> engineers on earth probably won’t notice that unless you have really super
> hot partitions - so ignore the write path.)
>
> The reads are where it changes.
>
> In both models/cases, you’ll use the partition index to seek to where the
> partition starts.
>
> In model 2, table 1, if you use ck+col1+…, the read will load the column
> index and use that to jump to within 64KB of the col1 value you specify.
>
> In model 2, table 2, if you use ck+col3+…, same thing - the column index
> can jump to within 64KB.
>
> What you give up in model 1 is the granularity of that jump. If you use
> model 1 and col3 instead of col1, the read will have to scan the partition.
> In your case, with 80 rows, that may still be within one 64KB block - you
> may not get more granular than that anyway. And even if it’s slightly
> larger, you’re probably compressing in 64KB chunks - maybe you have to
> decompress one extra chunk on read if your 1000 rows go past 64KB, but you
> likely won’t actually notice. You’re technically asking the server to read
> and skip data it doesn’t need to return - it’s not really the most
> efficient, but at that partition size it’s noise. You could also just
> return all 80-100 rows, let the server do slightly less work, and filter
> client side - also valid, probably slightly worse than the server-side
> filter.
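>
> For illustration, the reads above might look like this in CQL (table and
> column names are hypothetical, assuming col1/col3 are clustering columns
> after ck):
>
> ```cql
> -- Model 2, table 1: restricting on the leading clustering column col1
> -- lets the column index jump to within one 64KB chunk.
> SELECT * FROM table1 WHERE pk = ? AND ck = ? AND col1 = ?;
>
> -- Model 1, restricting on col3: the server scans the partition and
> -- filters server-side (ALLOW FILTERING, since col1/col2 are skipped).
> SELECT * FROM combined WHERE pk = ? AND ck = ? AND col3 = ? ALLOW FILTERING;
>
> -- Or fetch the whole (small) partition and filter client-side.
> SELECT * FROM combined WHERE pk = ? AND ck = ?;
> ```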
>
> Having one table instead of two, though, probably saves you a ton of disk
> space ($€£), and the lower disk space may also mean that data stays in page
> cache, so the extra read may not even go to disk anyway.
>
> So with your actual data shape, I imagine you won’t really notice the
> nominal inefficiency of the first model, and I’d be inclined to go with it
> until you demonstrate it won’t work (I bet it works fine for a long, long
> time).
>
> On Jun 22, 2022, at 7:11 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>
> 
> Hi Jeff,
> Could you let me know what impact the number of rows has here?
> Today I might have 80-100 rows per partition, but what if I start storing
> 2-4k rows per partition? The total partition size would still be under
> 100 MB.
>
> On Thu, Jun 23, 2022, 7:18 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> How many rows per partition in each model?
>>
>>
>> > On Jun 22, 2022, at 6:38 PM, MyWorld <timeplus.1...@gmail.com> wrote:
>> >
>> > 
>> > Hi all,
>> >
>> > Just a small query around data modelling.
>> > Suppose we have to design the data model for 2 different use cases
>> > which will query the data on the same set of (partition + clustering
>> > key). Should we maintain a separate table for each, or a single table?
>> >
>> > Model 1 - Combined table
>> > Table(Pk, CK, col1, col2, col3, col4, col5)
>> >
>> > Model 2 - Separate tables
>> > Table1(Pk, CK, col1, col2, col3)
>> > Table2(Pk, CK, col3, col4, col5)
>> >
>> > So here the partition and clustering keys are the same. Also note that
>> > column col3 is required in both use cases.
>> >
>> > As per my thinking, in Model 2 the partition size would be smaller.
>> > There would be fewer SSTables, and with levelled compaction they could
>> > be easily maintained, so read performance should be better.
>> >
>> > Please help me highlight the drawbacks and advantages of each data
>> > model. We have a mixed read/write workload.
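>> >
>> > As a CQL sketch of the two models (column types and the clustering
>> > order are my assumptions, since the notation above doesn't specify
>> > them):
>> >
>> > ```cql
>> > -- Model 1: one combined table
>> > CREATE TABLE combined (
>> >     pk text, ck text, col1 text, col2 text, col3 text, col4 text, col5 text,
>> >     PRIMARY KEY ((pk), ck, col1, col2, col3, col4, col5)
>> > );
>> >
>> > -- Model 2: two tables sharing the same pk/ck, with col3 duplicated
>> > CREATE TABLE table1 (
>> >     pk text, ck text, col1 text, col2 text, col3 text,
>> >     PRIMARY KEY ((pk), ck, col1, col2, col3)
>> > );
>> > CREATE TABLE table2 (
>> >     pk text, ck text, col3 text, col4 text, col5 text,
>> >     PRIMARY KEY ((pk), ck, col3, col4, col5)
>> > );
>> > ```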
>>
>
