That's exactly what I described. IN queries can be used sometimes, but I
usually run parallel async queries, as Alexander explained.
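
For reference, a minimal sketch of that parallel-async pattern with the
DataStax Java driver 3.x (the connected session, doc id and element id list
are placeholder assumptions, and exception handling is elided):

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.Futures;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Prepare once, bind per element (assumes an already-connected Session).
PreparedStatement ps = session.prepare(
    "SELECT * FROM elements WHERE doc_id = ? AND element_id = ?");

List<ResultSetFuture> futures = new ArrayList<>();
for (UUID elementId : elementIds) {  // elementIds: ids loaded elsewhere
    futures.add(session.executeAsync(ps.bind(docId, elementId)));
}

// Gather all results; each future yields one single-row ResultSet.
List<ResultSet> results = Futures.allAsList(futures).get();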

On Mon, Mar 27, 2017 at 12:08 PM, Zoltan Lorincz <zol...@gmail.com> wrote:

> Hi Alexander,
>
> thank you for your help! I think we found the answer:
>
> CREATE TABLE documents (
>     doc_id uuid,
>     description text,
>     title text,
>     PRIMARY KEY (doc_id)
>  );
>
> CREATE TABLE elements (
>     doc_id uuid,
>     element_id uuid,
>     title text,
>     PRIMARY KEY (doc_id, element_id)
> );
>
> We can retrieve all elements for a document with:
>
> SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
>
> and update a single element with:
>
> UPDATE elements SET title='Hello'
> WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>
> Zoltan.
>
>
> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Zoltan,
>>
>> you must try to avoid multi-partition queries as much as possible.
>> Instead, use asynchronous queries to grab several partitions concurrently.
>> Try to send no more than ~100 queries at a time to avoid DDoS-ing your
>> cluster; with 100k+ elements, that leaves you with roughly 1,000+ groups
>> of async queries to run. Performance will really depend on your hardware,
>> consistency level, load balancing policy, partition fragmentation (how
>> many updates you'll run on each element over time) and the SLA you're
>> expecting.
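>>
>> A rough sketch of that throttling with the 3.x Java driver (the semaphore
>> cap, session and id list are illustrative assumptions):
>>
>> import com.datastax.driver.core.*;
>> import com.google.common.util.concurrent.FutureCallback;
>> import com.google.common.util.concurrent.Futures;
>> import java.util.UUID;
>> import java.util.concurrent.Semaphore;
>>
>> // Allow at most ~100 queries in flight at any moment.
>> final Semaphore permits = new Semaphore(100);
>>
>> PreparedStatement ps = session.prepare(
>>     "SELECT * FROM elements WHERE doc_id = ? AND element_id = ?");
>>
>> for (UUID elementId : elementIds) {
>>     permits.acquireUninterruptibly();  // blocks once 100 are in flight
>>     ResultSetFuture f = session.executeAsync(ps.bind(docId, elementId));
>>     Futures.addCallback(f, new FutureCallback<ResultSet>() {
>>         public void onSuccess(ResultSet rs) { permits.release(); /* consume rows */ }
>>         public void onFailure(Throwable t) { permits.release(); /* log / retry */ }
>>     });
>> }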
>>
>> If that approach doesn't meet your SLA requirements, you can try to use
>> wide partitions and group elements under buckets:
>>
>> CREATE TABLE elements (
>>     doc_id bigint,
>>     bucket bigint,
>>     element_id bigint,
>>     element_content text,
>>     PRIMARY KEY ((doc_id, bucket), element_id)
>> );
>>
>> The bucket here could be a modulus of the element_id (or of the hash of
>> element_id if it is not a numerical value). This way you can spread
>> elements over the cluster and access them directly if you have the doc_id
>> and the element_id to perform updates.
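>>
>> For example (an illustrative sketch, assuming a fixed count of 10 buckets
>> per document):
>>
>> // Derive the bucket from the element id so that each document is
>> // split into a bounded number of partitions.
>> static long bucketFor(long elementId) {
>>     return elementId % 10;
>> }
>>
>> // For non-numeric ids, hash first; floorMod keeps the result non-negative.
>> static long bucketFor(String elementId) {
>>     return Math.floorMod(elementId.hashCode(), 10);
>> }
>>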
>> You'll run fewer queries concurrently, but each will take longer than the
>> individual ones in the first scenario (one partition per element). You
>> should benchmark both solutions to see which one performs best.
>> Bucket your elements so that your partitions don't grow over 100MB. Large
>> partitions are silent cluster killers (1GB+ partitions are a direct threat
>> to cluster stability)...
>>
>> To ensure best performance, use prepared statements along with the
>> TokenAwarePolicy
>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>> to avoid unnecessary coordination.
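>>
>> For instance, when building the cluster (the contact point is a
>> placeholder):
>>
>> import com.datastax.driver.core.Cluster;
>> import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
>> import com.datastax.driver.core.policies.TokenAwarePolicy;
>>
>> // Token awareness routes each bound statement straight to a replica
>> // that owns the partition, skipping an extra coordinator hop.
>> Cluster cluster = Cluster.builder()
>>     .addContactPoint("127.0.0.1")
>>     .withLoadBalancingPolicy(new TokenAwarePolicy(
>>         DCAwareRoundRobinPolicy.builder().build()))
>>     .build();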
>>
>> Cheers,
>>
>>
>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>
>>> Querying by (doc_id and element_id) OR just by (element_id) is fine,
>>> but the real question is, will it be efficient to query 100k+ primary keys
>>> in the elements table?
>>> e.g.
>>>
>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>> element_id3, .... element_id100K+)  ?
>>>
>>> The element_id is a primary key.
>>>
>>> Thank you!
>>>
>>>
>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>>> wrote:
>>>
>>> Have one table hold document metadata (doc_id, title, description, ...)
>>> and have another table elements where partition key is doc_id and
>>> clustering key is element_id.
>>> The only problem here is if you need to query and/or update an element
>>> just by element_id, but I don't know your queries up front.
>>>
>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>>> wrote:
>>>
>>> Dear cassandra users,
>>>
>>> We have the following structure in MySql:
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>
>>> Notation: table name->[column1(key or index), column2, …]
>>>
>>> We want to transfer the data to Cassandra.
>>>
>>> Each document can contain a large number of elements (between 1 and
>>> 100k+)
>>>
>>> We have two requirements:
>>> a) Load all elements for a given doc_id quickly
>>> b) Update the value of one individual element quickly
>>>
>>>
>>> We were thinking on the following cassandra configurations:
>>>
>>> Option A
>>>
>>> documents->[doc_id(primary key), title, description, elements] (elements
>>> could be a SET or a TEXT column; each time new elements are added (they
>>> are never removed) we would append them to this column)
>>> elements->[element_id(primary key), title, description]
>>>
>>> Loading a document:
>>>
>>>  a) Load document with given <doc_id> and get all element ids
>>>     SELECT * from documents where doc_id=‘id’
>>>
>>>  b) Load all elements with the given ids
>>>     SELECT * FROM elements where element_id IN (ids loaded from query a)
>>>
>>>
>>> Option B
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(secondary index), title,
>>> description]
>>>
>>> Loading a document:
>>>  a) SELECT * from elements where doc_id=‘id’
>>>
>>>
>>> Neither solution seems good: in Option A, even if we are querying by
>>> primary keys, the second query will have 100k+ primary key ids in the
>>> WHERE clause, and Option B (querying by a secondary index) looks like an
>>> anti-pattern in Cassandra.
>>>
>>> Could anyone give any advice how would we create a model for our use
>>> case?
>>>
>>> Thank you in advance,
>>> Zoltan.
>>>
>>>
>>>
>>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
