You can use static columns and keep everything in just one table:

CREATE TABLE documents (
    doc_id uuid,
    element_id uuid,
    description text static,
    doc_title text static,
    element_title text,
    PRIMARY KEY (doc_id, element_id)
);


The static columns are stored once per partition, i.e. once per unique doc_id.
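
For example (untested sketch with the DataStax Java driver 3.x; the
contact point and keyspace name are placeholders), the two access
patterns from the original question (load a whole document, update one
element) each stay single-partition against this table:

import java.util.UUID;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class SingleTableExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1") // placeholder
                .build();
        Session session = cluster.connect("my_keyspace"); // placeholder

        // (a) One partition read returns the static doc_title and
        //     description (repeated on every row, stored once) plus
        //     every element of the document.
        PreparedStatement loadDoc = session.prepare(
                "SELECT doc_title, description, element_id, element_title "
                + "FROM documents WHERE doc_id = ?");

        // (b) Updating a single element touches a single row.
        PreparedStatement updateElement = session.prepare(
                "UPDATE documents SET element_title = ? "
                + "WHERE doc_id = ? AND element_id = ?");

        UUID docId = UUID.randomUUID();
        UUID elementId = UUID.randomUUID();
        session.execute(loadDoc.bind(docId));
        session.execute(updateElement.bind("Hello", docId, elementId));

        cluster.close();
    }
}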


On 03/27/2017 01:08 PM, Zoltan Lorincz wrote:
Hi Alexander,

thank you for your help! I think we found the answer:

CREATE TABLE documents (
    doc_id uuid,
    description text,
    title text,
    PRIMARY KEY (doc_id)
);

CREATE TABLE elements (
    doc_id uuid,
    element_id uuid,
    title text,
    PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements of a document, and update a single element, with the following queries:

 SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
 UPDATE elements SET title='Hello' WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613 AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;

Zoltan.


On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

    Hi Zoltan,

    you should try to avoid multi-partition queries as much as possible.
    Instead, use asynchronous queries to grab several partitions
    concurrently.
    Try to send no more than ~100 queries at the same time to avoid
    DDoS-ing your cluster.
    This would leave you with roughly 1,000+ groups of async queries to
    run. Performance will really depend on your hardware, consistency
    level, load balancing policy, partition fragmentation (how many
    updates you'll run on each element over time) and the SLA you're
    expecting.
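
    As a rough sketch with the 3.x Java driver (the elements table, the
    query, and the 100-query window are assumptions; Guava's Lists and
    Futures ship with the driver):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.collect.Lists;
    import com.google.common.util.concurrent.Futures;

    public class AsyncLoader {
        // Read many partitions concurrently, keeping at most ~100
        // queries in flight at any time.
        static List<ResultSet> loadElements(Session session, List<UUID> ids)
                throws Exception {
            PreparedStatement ps = session.prepare(
                    "SELECT * FROM elements WHERE element_id = ?");
            List<ResultSet> results = new ArrayList<>();
            for (List<UUID> window : Lists.partition(ids, 100)) {
                List<ResultSetFuture> futures = new ArrayList<>();
                for (UUID id : window) {
                    futures.add(session.executeAsync(ps.bind(id)));
                }
                // Wait for this window to finish before firing the next.
                results.addAll(Futures.allAsList(futures).get());
            }
            return results;
        }
    }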

    If that approach doesn't meet your SLA requirements, you can try
    to use wide partitions and group elements under buckets:

    CREATE TABLE elements (
        doc_id bigint,
        bucket bigint,
        element_id bigint,
        element_content text,
        PRIMARY KEY ((doc_id, bucket), element_id)
    );

    The bucket here could be a modulus of the element_id (or of the
    hash of the element_id if it is not a numerical value). This way you
    can spread elements over the cluster and still access them directly
    when you have the doc_id and the element_id to perform updates.
    You'll run fewer queries concurrently, but they'll take more time
    than the individual ones in the first scenario (one partition per
    element). You should benchmark both solutions to see which one
    gives the best performance.
    Bucket your elements so that your partitions don't grow over
    100MB. Large partitions are silent cluster killers (1GB+
    partitions are a direct threat to cluster stability)...
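
    A rough sketch of the bucketing (NUM_BUCKETS is an assumption you'd
    size from your own data):

    public class Buckets {
        // Assumption: 64 buckets per document; pick this so each
        // (doc_id, bucket) partition stays well under 100MB.
        static final long NUM_BUCKETS = 64;

        // Numeric element ids: a plain modulus (floorMod keeps the
        // result non-negative even for negative ids).
        static long bucketFor(long elementId) {
            return Math.floorMod(elementId, NUM_BUCKETS);
        }

        // Non-numeric ids (e.g. uuid): hash first, then reduce.
        static long bucketFor(java.util.UUID elementId) {
            return Math.floorMod(elementId.hashCode(), NUM_BUCKETS);
        }
    }

    An update can then go straight to (doc_id, bucketFor(element_id),
    element_id), while loading a whole document means reading all
    NUM_BUCKETS partitions, which you can do asynchronously as above.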

    To ensure best performance, use prepared statements along with the
    TokenAwarePolicy
    <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
    to avoid unnecessary coordination.
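
    For instance (3.x driver; the contact point and keyspace are
    placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class ClusterSetup {
        static Session connect() {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1") // placeholder
                    // Token awareness routes each bound statement to a
                    // replica that owns the partition, skipping the
                    // extra coordinator hop.
                    .withLoadBalancingPolicy(new TokenAwarePolicy(
                            DCAwareRoundRobinPolicy.builder().build()))
                    .build();
            return cluster.connect("my_keyspace"); // placeholder
        }
    }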

    Cheers,


    On Mon, Mar 27, 2017 at 4:40 AM, Zoltan Lorincz <zol...@gmail.com> wrote:

        Querying by (doc_id and element_id) OR just by (element_id)
        is fine, but the real question is: will it be efficient to
        query 100k+ primary keys in the elements table?
        e.g.

        SELECT * FROM elements WHERE element_id IN (element_id1,
        element_id2, element_id3, .... element_id100K+)  ?

        The element_id is the primary key.

        Thank you!


        On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com> wrote:

            Have one table hold the document metadata (doc_id, title,
            description, ...) and have another table, elements, where the
            partition key is doc_id and the clustering key is element_id.
            The only problem here is if you need to query and/or update
            an element just by element_id, but I don't know your queries
            up front.

            On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com> wrote:

                Dear cassandra users,

                We have the following structure in MySql:

                documents->[doc_id(primary key), title, description]
                elements->[element_id(primary key), doc_id(index),
                title, description]

                Notation: table name->[column1(key or index), column2, …]

                We want to transfer the data to Cassandra.

                Each document can contain a large number of elements
                (between 1 and 100k+)

                We have two requirements:
                a) Load all elements for a given doc_id quickly
                b) Update the value of one individual element quickly


                We were thinking of the following Cassandra
                configurations:

                Option A

                documents->[doc_id(primary key), title, description,
                elements] (elements could be a SET or a TEXT column;
                each time new elements are added (they are never
                removed) we would append them to this column)
                elements->[element_id(primary key), title, description]

                Loading a document:

                 a) Load the document with the given <doc_id> and get
                all element ids:
                  SELECT * FROM documents WHERE doc_id='id'

                 b) Load all elements with the given ids:
                  SELECT * FROM elements WHERE element_id IN (ids
                  loaded from query a)


                Option B

                documents->[doc_id(primary key), title, description]
                elements->[element_id(primary key), doc_id(secondary
                index), title, description]

                Loading a document:
                 a) SELECT * FROM elements WHERE doc_id='id'


                Neither solution seems good: in Option A, even though
                we are querying by primary keys, the second query will
                have 100k+ primary key ids in the WHERE clause, and
                Option B looks like an anti-pattern in Cassandra.

                Could anyone give any advice how would we create a
                model for our use case?

                Thank you in advance,
                Zoltan.



-- -----------------
    Alexander Dejanovski
    France
    @alexanderdeja

    Consultant
    Apache Cassandra Consulting
    http://www.thelastpickle.com


