Thank you Matija! Because I am a newbie, it was not clear to me that I am able to query by the partition key alone (without providing the clustering key). Sorry about that!

Zoltan.
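For reference, a minimal sketch of those two access patterns with the DataStax Java driver 3.x, assuming the documents/elements schema agreed on below; the class name and the Session wiring are illustrative assumptions only:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import java.util.UUID;

public class ElementQueries {

    private final Session session;
    private final PreparedStatement selectByDoc;
    private final PreparedStatement updateTitle;

    public ElementQueries(Session session) {
        this.session = session;
        // The partition key alone is enough to read the whole partition...
        this.selectByDoc = session.prepare(
                "SELECT element_id, title FROM elements WHERE doc_id = ?");
        // ...while updating a single row needs the full primary key
        // (partition key + clustering key).
        this.updateTitle = session.prepare(
                "UPDATE elements SET title = ? WHERE doc_id = ? AND element_id = ?");
    }

    public ResultSet loadElements(UUID docId) {
        return session.execute(selectByDoc.bind(docId));
    }

    public void renameElement(UUID docId, UUID elementId, String newTitle) {
        session.execute(updateTitle.bind(newTitle, docId, elementId));
    }
}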
On Mon, Mar 27, 2017 at 1:54 PM, Matija Gobec <matija0...@gmail.com> wrote:

> That's exactly what I described. IN queries can be used sometimes, but I usually run parallel async queries as Alexander explained.
>
> On Mon, Mar 27, 2017 at 12:08 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>
>> Hi Alexander,
>>
>> thank you for your help! I think we found the answer:
>>
>> CREATE TABLE documents (
>>     doc_id uuid,
>>     description text,
>>     title text,
>>     PRIMARY KEY (doc_id)
>> );
>>
>> CREATE TABLE elements (
>>     doc_id uuid,
>>     element_id uuid,
>>     title text,
>>     PRIMARY KEY (doc_id, element_id)
>> );
>>
>> We can retrieve all elements of a document with:
>> SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
>>
>> and update an individual element with:
>> UPDATE elements SET title='Hello'
>> WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
>> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>>
>> Zoltan.
>>
>> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>
>>> Hi Zoltan,
>>>
>>> You must try to avoid multi-partition queries as much as possible. Instead, use asynchronous queries to grab several partitions concurrently.
>>> Try to send no more than ~100 queries at the same time to avoid DDoS-ing your cluster. This would leave you with roughly 1000+ groups of async queries to run.
>>> Performance will really depend on your hardware, consistency level, load balancing policy, partition fragmentation (how many updates you'll run on each element over time) and the SLA you're expecting.
>>>
>>> If that approach doesn't meet your SLA requirements, you can try to use wide partitions and group elements under buckets:
>>>
>>> CREATE TABLE elements (
>>>     doc_id bigint,
>>>     bucket bigint,
>>>     element_id bigint,
>>>     element_content text,
>>>     PRIMARY KEY ((doc_id, bucket), element_id)
>>> );
>>>
>>> The bucket here could be a modulus of the element_id (or of the hash of element_id if it is not a numerical value). This way you can spread elements over the cluster and still access them directly if you have the doc_id and the element_id to perform updates.
>>> You'll get to run fewer queries concurrently, but they'll take more time than the individual ones in the first scenario (1 partition per element). You should benchmark both solutions to see which one gives the best performance.
>>> Bucket your elements so that your partitions don't grow over 100MB. Large partitions are silent cluster killers (1GB+ partitions are a direct threat to cluster stability)...
>>>
>>> To ensure best performance, use prepared statements along with the TokenAwarePolicy
>>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>>> to avoid unnecessary coordination.
>>>
>>> Cheers,
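A minimal sketch of the throttled parallel async reads described above, using the DataStax Java driver 3.x with prepared statements and the TokenAwarePolicy. The contact point, keyspace name, and the source of element ids are hypothetical, and the table is assumed to use one partition per element (PRIMARY KEY (element_id)):

import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class ParallelElementReads {

    // Keep roughly 100 requests in flight at a time, as suggested above.
    private static final int MAX_IN_FLIGHT = 100;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")                      // hypothetical contact point
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
        Session session = cluster.connect("docs");                 // hypothetical keyspace

        // A prepared statement plus token awareness lets the driver route each
        // request straight to a replica, avoiding extra coordination.
        PreparedStatement select = session.prepare(
                "SELECT element_id, title FROM elements WHERE element_id = ?");

        List<UUID> elementIds = fetchElementIdsForDocument();      // hypothetical id source
        List<Row> rows = new ArrayList<>();
        List<ResultSetFuture> inFlight = new ArrayList<>();

        for (UUID elementId : elementIds) {
            inFlight.add(session.executeAsync(select.bind(elementId)));
            if (inFlight.size() >= MAX_IN_FLIGHT) {
                drain(inFlight, rows);                             // wait for the current batch
            }
        }
        drain(inFlight, rows);                                     // wait for the remainder

        System.out.println("Fetched " + rows.size() + " elements");
        cluster.close();
    }

    // Block until every outstanding future has completed, then collect the rows.
    private static void drain(List<ResultSetFuture> futures, List<Row> out) {
        for (ResultSetFuture f : futures) {
            out.addAll(f.getUninterruptibly().all());
        }
        futures.clear();
    }

    private static List<UUID> fetchElementIdsForDocument() {
        return new ArrayList<>();                                  // placeholder
    }
}

The same loop works for updates; only the prepared statement changes.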
>>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>>
>>>> Querying by (doc_id and element_id) OR just by (element_id) is fine, but the real question is: will it be efficient to query 100k+ primary keys in the elements table?
>>>> e.g.
>>>>
>>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2, element_id3, .... element_id100K+) ?
>>>>
>>>> The element_id is the primary key.
>>>>
>>>> Thank you!
>>>>
>>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com> wrote:
>>>>
>>>> Have one table hold the document metadata (doc_id, title, description, ...) and have another table, elements, where the partition key is doc_id and the clustering key is element_id.
>>>> The only problem here is if you need to query and/or update an element just by element_id, but I don't know your queries up front.
>>>>
>>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>>>>
>>>> Dear Cassandra users,
>>>>
>>>> We have the following structure in MySQL:
>>>>
>>>> documents->[doc_id(primary key), title, description]
>>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>>
>>>> Notation: table name->[column1(key or index), column2, …]
>>>>
>>>> We want to transfer the data to Cassandra.
>>>>
>>>> Each document can contain a large number of elements (between 1 and 100k+).
>>>>
>>>> We have two requirements:
>>>> a) Load all elements for a given doc_id quickly
>>>> b) Update the value of one individual element quickly
>>>>
>>>> We were thinking of the following Cassandra models:
>>>>
>>>> Option A
>>>>
>>>> documents->[doc_id(primary key), title, description, elements]
>>>> (elements could be a SET or a TEXT column; each time new elements are added (they are never removed) we would append them to this column)
>>>> elements->[element_id(primary key), title, description]
>>>>
>>>> Loading a document:
>>>>
>>>> a) Load the document with the given <doc_id> and get all element ids
>>>> SELECT * FROM documents WHERE doc_id='id'
>>>>
>>>> b) Load all elements with the given ids
>>>> SELECT * FROM elements WHERE element_id IN (ids loaded from query a)
>>>>
>>>> Option B
>>>>
>>>> documents->[doc_id(primary key), title, description]
>>>> elements->[element_id(primary key), doc_id(secondary index), title, description]
>>>>
>>>> Loading a document:
>>>> a) SELECT * FROM elements WHERE doc_id='id'
>>>>
>>>> Neither solution seems good: in Option A, even though we are querying by primary keys, the second query will have 100k+ primary key ids in the WHERE clause, and Option B looks like an anti-pattern in Cassandra.
>>>>
>>>> Could anyone give us some advice on how to model our use case?
>>>>
>>>> Thank you in advance,
>>>> Zoltan.
>>>>
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
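For completeness, a minimal sketch of the bucketing approach Alexander describes above, assuming bigint ids and a hypothetical fixed bucket count; in practice the bucket count should be chosen so partitions stay well under 100MB:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class BucketedElements {

    // Hypothetical bucket count; tune it so partitions stay well under ~100MB.
    private static final long BUCKETS = 16;

    private final Session session;
    private final PreparedStatement insert;
    private final PreparedStatement selectBucket;

    public BucketedElements(Session session) {
        this.session = session;
        // Table assumed as above: PRIMARY KEY ((doc_id, bucket), element_id)
        this.insert = session.prepare(
                "INSERT INTO elements (doc_id, bucket, element_id, element_content) "
                        + "VALUES (?, ?, ?, ?)");
        this.selectBucket = session.prepare(
                "SELECT element_id, element_content FROM elements "
                        + "WHERE doc_id = ? AND bucket = ?");
    }

    // The bucket is a modulus of the element id, so a writer that knows
    // doc_id and element_id can always recompute the full partition key.
    static long bucketOf(long elementId) {
        return Math.floorMod(elementId, BUCKETS);
    }

    public void upsertElement(long docId, long elementId, String content) {
        session.execute(insert.bind(docId, bucketOf(elementId), elementId, content));
    }

    // Reading one bucket of a document; loading the whole document means one
    // query per bucket, ideally issued with the parallel async pattern above.
    public ResultSet loadBucket(long docId, long bucket) {
        return session.execute(selectBucket.bind(docId, bucket));
    }
}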