I over-simplified the original example. In the real model I cannot just merge the row types. Suppose
    create table t1(
        p int,
        q1 int,
        c1 int,
        primary key( p, q1 )
    )
    create table t2(
        p int,
        q2 uuid,
        c2 text,
        primary key( p, q2 )
    )

Merging the tables will be slightly ugly and waste some storage in the clustering columns:
    create table t(
        p int,
        rowtype tinyint, // t1 or t2
        q1 int, q2 uuid, // depending on rowtype, either q1 or q2 is unused
        c1 int, c2 text, // depending on rowtype, either c1 or c2 is null
        primary key( p, rowtype, q1, q2)
    )

Nevertheless, putting them into one table seems faster as we only need one query to get both types, and have better cache locality. Am I correct?


On Saturday, January 09, 2016 06:47 AM, Jack Krupansky wrote:
A simple denormalization is probably all that is called for - just merge the two tables into one (their union.) No need for this row type.


-- Jack Krupansky

On Fri, Jan 8, 2016 at 9:30 AM, Jeff Jirsa <jeff.ji...@crowdstrike.com <mailto:jeff.ji...@crowdstrike.com>> wrote:

    You’ll see better performance using a slice (which is effectively
    what will happen if you put them into the same table and use
    query-1table-b), as each node will only need to merge
    cells/results once. It may not be twice as fast, but it’ll be fast
    enough to make it worthwhile.



    On 1/8/16, 12:13 AM, "Bamoqi" <bam...@gmail.com
    <mailto:bam...@gmail.com>> wrote:

    >[Correction of the original message which contains typos in code.]
    >
    >Is it good for performance to put rows that are of different
    types but
    >are always queried together in the same table partition?
    >
    >My consideration is that whether doing so will result in better
    >memory/disk cache locality.
    >
    >Suppose I need to query for 2 different types of rows for a frequent
    >user request, I can use 2 tables or 1 table:
    >
    >2 tables:
    >
    >   create table t1(
    >     partitionkey int primary key,
    >     col1 int, col2 int, ...
    >   )
    >   create table t2(
    >     partitionkey int primary key,
    >     col3 int, col4 int, ...
    >   )
    >
    >query-2table:
    >   select col1,col2 from t1 where partitionkey = ?
    >   select col3,col4 from t2 where partitionkey = ?
    >
    >1 table:
    >
    >   create table t(
    >     partitionkey int,
    >     rowtype tinyint,
    >     col1 int, col2 int, ...
    >     col3 int, col4 int, ...
    >     primary key( partitionkey, rowtype )
    >   )
    >
    >query-1table-a:
    >   select col1,col2 from t where partitionkey = ? and rowtype = 1
    >   select col3,col4 from t where partitionkey = ? and rowtype = 2
    >
    >or alternatively, query-1table-b:
    >   select rowtype,col1,col2,col3,col4 from t where partitionkey = ?
    >   // Used columns are `null`. Switch on `rowtype` in the app code
    >
    >Is there significant performance difference in query-2table,
    >query-1table-a, query-1table-b?
    >Is the cassandra client/coordinator smart enough to direct subsequent
    >queries of the same (table, partitionkey) to the same node so
    they can
    >reuse a cached page?
    >
    >Regards & Thanks



Reply via email to