Thank you Matija! Being a newbie, it was not clear to me that I am
able to query by the partition key alone (without providing the clustering
key). Sorry about that!
Zoltan.

On Mon, Mar 27, 2017 at 1:54 PM, Matija Gobec <matija0...@gmail.com> wrote:

> That's exactly what I described. IN queries can sometimes be used, but I
> usually run parallel async queries as Alexander explained.
>
> On Mon, Mar 27, 2017 at 12:08 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>
>> Hi Alexander,
>>
>> thank you for your help! I think we found the answer:
>>
>> CREATE TABLE documents (
>>     doc_id uuid,
>>     description text,
>>     title text,
>>     PRIMARY KEY (doc_id)
>>  );
>>
>> CREATE TABLE elements (
>>     doc_id uuid,
>>     element_id uuid,
>>     title text,
>>     PRIMARY KEY (doc_id, element_id)
>> );
>>
>> We can retrieve all elements of a document with the following query:
>>
>>  SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
>>
>> and update a single element with:
>>
>>  UPDATE elements SET title='Hello'
>>  WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
>>  AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>>
>> Zoltan.
>>
>>
>> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi Zoltan,
>>>
>>> you should try to avoid multi-partition queries as much as possible.
>>> Instead, use asynchronous queries to grab several partitions concurrently.
>>> Try to send no more than ~100 queries at the same time, to avoid
>>> DDoS-ing your cluster.
>>> This would leave you with roughly 1,000+ groups of async queries to run.
>>> Performance will really depend on your hardware, consistency level, load
>>> balancing policy, partition fragmentation (how many updates you'll run on
>>> each element over time) and the SLA you're expecting.
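The "parallel async" approach above can be sketched in Python with asyncio. This is a minimal illustration, not driver code: fetch_element is a hypothetical stand-in for a real single-partition read (e.g. the driver's execute_async); the point is only the semaphore capping in-flight queries at ~100:

```python
import asyncio

MAX_IN_FLIGHT = 100  # cap from the advice above, to avoid DDoS-ing the cluster

async def fetch_element(element_id, semaphore):
    # Hypothetical stand-in for an async single-partition read;
    # a real version would await the driver's response future.
    async with semaphore:
        await asyncio.sleep(0)  # placeholder for the network round trip
        return {"element_id": element_id}

async def fetch_all(element_ids):
    # At most MAX_IN_FLIGHT coroutines pass the semaphore at once, so a
    # 100k+ element document is read in roughly 1,000 waves of 100.
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(*(fetch_element(e, sem) for e in element_ids))

rows = asyncio.run(fetch_all(range(250)))
```

gather preserves input order, so rows[i] corresponds to the i-th requested element.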
>>>
>>> If that approach doesn't meet your SLA requirements, you can try to use
>>> wide partitions and group elements under buckets :
>>>
>>> CREATE TABLE elements (
>>>     doc_id bigint,
>>>     bucket bigint,
>>>     element_id bigint,
>>>     element_content text,
>>>     PRIMARY KEY ((doc_id, bucket), element_id)
>>> );
>>>
>>> The bucket here could be a modulus of the element_id (or of the hash of
>>> element_id if it is not a numerical value). This way you can spread
>>> elements over the cluster and access them directly if you have the doc_id
>>> and the element_id to perform updates.
>>> You'll get to run fewer queries concurrently, but each will take more time
>>> than the individual ones in the first scenario (one partition per element).
>>> You should benchmark both solutions to see which one performs best.
>>> Bucket your elements so that your partitions don't grow over 100MB.
>>> Large partitions are silent cluster killers (1GB+ partitions are a direct
>>> threat to cluster stability)...
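The bucket derivation described above can be sketched like this (illustrative only: NUM_BUCKETS and the choice of md5 are assumptions, not something from the thread; any stable hash plus a modulus works):

```python
import hashlib

NUM_BUCKETS = 16  # assumed value; size it so partitions stay well under ~100MB

def bucket_for(element_id):
    # Numeric ids can use a plain modulus; anything else is hashed first.
    # md5 is used rather than Python's built-in hash() so the result is
    # stable across processes -- the bucket must be recomputable at read time.
    if isinstance(element_id, int):
        return element_id % NUM_BUCKETS
    digest = hashlib.md5(str(element_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS
```

Both halves of the partition key are then required on reads and writes, e.g. WHERE doc_id=? AND bucket=? (plus AND element_id=? to hit a single element).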
>>>
>>> To ensure best performance, use prepared statements along with the
>>> TokenAwarePolicy
>>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>>>  to
>>> avoid unnecessary coordination.
>>>
>>> Cheers,
>>>
>>>
>>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>>
>>>> Querying by (doc_id and element_id) or just by (element_id) is fine,
>>>> but the real question is: will it be efficient to query 100k+ primary keys
>>>> in the elements table?
>>>> e.g.
>>>>
>>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>>> element_id3, .... element_id100K+)  ?
>>>>
>>>> The element_id is the primary key.
>>>>
>>>> Thank you!
>>>>
>>>>
>>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>>>> wrote:
>>>>
>>>> Have one table hold document metadata (doc_id, title, description, ...)
>>>> and have another table elements where partition key is doc_id and
>>>> clustering key is element_id.
>>>> The only problem here is if you need to query and/or update an element
>>>> just by element_id, but I don't know your queries up front.
>>>>
>>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>>>> wrote:
>>>>
>>>> Dear cassandra users,
>>>>
>>>> We have the following structure in MySql:
>>>>
>>>> documents->[doc_id(primary key), title, description]
>>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>>
>>>> Notation: table name->[column1(key or index), column2, …]
>>>>
>>>> We want to transfer the data to Cassandra.
>>>>
>>>> Each document can contain a large number of elements (between 1 and
>>>> 100k+)
>>>>
>>>> We have two requirements:
>>>> a) Load all elements for a given doc_id quickly
>>>> b) Update the value of one individual element quickly
>>>>
>>>>
>>>> We were thinking on the following cassandra configurations:
>>>>
>>>> Option A
>>>>
>>>> documents->[doc_id(primary key), title, description, elements]
>>>> (elements could be a SET or a TEXT column; each time new elements are added
>>>> (they are never removed), we would append them to this column)
>>>> elements->[element_id(primary key), title, description]
>>>>
>>>> Loading a document:
>>>>
>>>>  a) Load document with given <doc_id> and get all element ids
>>>>     SELECT * from documents where doc_id=‘id’
>>>>
>>>>  b) Load all elements with the given ids
>>>>     SELECT * FROM elements where element_id IN (ids loaded from query a)
>>>>
>>>>
>>>> Option B
>>>>
>>>> documents->[doc_id(primary key), title, description]
>>>> elements->[element_id(primary key), doc_id(secondary index), title,
>>>> description]
>>>>
>>>> Loading a document:
>>>>  a) SELECT * from elements where doc_id=‘id’
>>>>
>>>>
>>>> Neither solution seems good: in Option A, even though we are querying
>>>> by primary keys, the second query will have 100k+ primary key ids in
>>>> the WHERE clause, and Option B looks like an anti-pattern in
>>>> Cassandra.
>>>>
>>>> Could anyone give any advice how would we create a model for our use
>>>> case?
>>>>
>>>> Thank you in advance,
>>>> Zoltan.
>>>>
>>>>
>>>>
>>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>>
>
