Great suggestion! Thanks Avi!

On Mon, Mar 27, 2017 at 3:47 PM, Avi Kivity <a...@scylladb.com> wrote:

> You can use static columns and just one table:
>
> CREATE TABLE documents (
>     doc_id uuid,
>     element_id uuid,
>     description text static,
>     doc_title text static,
>     element_title text,
>     PRIMARY KEY (doc_id, element_id)
> );
>
> The static columns are present once per unique doc_id.
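>
> For example (an illustrative sketch; the uuid literals are the placeholder
> values already used elsewhere in this thread):
>
> INSERT INTO documents (doc_id, doc_title, description, element_id, element_title)
> VALUES (131cfa55-181e-431e-7956-fe449139d613, 'Doc title', 'Doc description',
>         a5e41c5d-fd69-45d1-959b-2fe7a1578949, 'Element title');
>
> -- Updating a static column once rewrites it for every row of the partition,
> -- so only the partition key is needed in the WHERE clause:
> UPDATE documents SET doc_title = 'New title'
> WHERE doc_id = 131cfa55-181e-431e-7956-fe449139d613;
>
> -- Loading a document returns the static columns repeated on each element row:
> SELECT doc_title, element_id, element_title FROM documents
> WHERE doc_id = 131cfa55-181e-431e-7956-fe449139d613;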
>
> On 03/27/2017 01:08 PM, Zoltan Lorincz wrote:
>
> Hi Alexander,
>
> thank you for your help! I think we found the answer:
>
> CREATE TABLE documents (
>     doc_id uuid,
>     description text,
>     title text,
>     PRIMARY KEY (doc_id)
> );
>
> CREATE TABLE nodes (
>     doc_id uuid,
>     element_id uuid,
>     title text,
>     PRIMARY KEY (doc_id, element_id)
> );
>
> We can retrieve all elements with the following query:
>
> SELECT * FROM nodes WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;
>
> UPDATE nodes SET title='Hello'
> WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613
> AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;
>
> Zoltan.
>
> On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>
>> Hi Zoltan,
>>
>> you must try to avoid multi-partition queries as much as possible.
>> Instead, use asynchronous queries to grab several partitions concurrently.
>> Try to send no more than ~100 queries at the same time to avoid DDoS-ing
>> your cluster. This would leave you with roughly 1,000+ groups of async
>> queries to run. Performance will really depend on your hardware,
>> consistency level, load balancing policy, partition fragmentation (how
>> many updates you'll run on each element over time) and the SLA you're
>> expecting.
>>
>> If that approach doesn't meet your SLA requirements, you can try to use
>> wide partitions and group elements under buckets:
>>
>> CREATE TABLE elements (
>>     doc_id bigint,
>>     bucket bigint,
>>     element_id bigint,
>>     element_content text,
>>     PRIMARY KEY ((doc_id, bucket), element_id)
>> );
>>
>> The bucket here could be a modulus of the element_id (or of the hash of
>> element_id if it is not a numerical value). This way you can spread
>> elements over the cluster and still access them directly if you have the
>> doc_id and the element_id to perform updates.
>> You'll run fewer queries concurrently, but each will take more time than
>> the individual ones in the first scenario (one partition per element). You
>> should benchmark both solutions to see which one gives the best performance.
>> Bucket your elements so that your partitions don't grow over 100MB. Large
>> partitions are silent cluster killers (1GB+ partitions are a direct threat
>> to cluster stability)...
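>>
>> For instance, with numeric element ids and 16 buckets (an arbitrary count,
>> chosen here purely for illustration), the client computes
>> bucket = element_id % 16 before each query:
>>
>> -- updating one element directly (103 % 16 = 7):
>> UPDATE elements SET element_content = 'new content'
>> WHERE doc_id = 42 AND bucket = 7 AND element_id = 103;
>>
>> -- loading a whole document means reading every bucket, ideally async:
>> SELECT * FROM elements WHERE doc_id = 42 AND bucket = 0;
>> SELECT * FROM elements WHERE doc_id = 42 AND bucket = 1;
>> -- ... and so on through bucket 15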
>>
>> To ensure best performance, use prepared statements along with the
>> TokenAwarePolicy
>> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
>> to avoid unnecessary coordination.
>>
>> Cheers,
>>
>> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>>
>>> Querying by (doc_id and element_id) OR just by (element_id) is fine,
>>> but the real question is, will it be efficient to query 100k+ primary
>>> keys in the elements table?
>>> e.g.
>>>
>>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>>> element_id3, ... element_id100K+) ?
>>>
>>> The element_id is a primary key.
>>>
>>> Thank you!
>>>
>>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com> wrote:
>>>
>>> Have one table hold the document metadata (doc_id, title, description, ...)
>>> and another table, elements, where the partition key is doc_id and the
>>> clustering key is element_id.
>>> The only problem here is if you need to query and/or update an element
>>> just by element_id, but I don't know your queries up front.
>>>
>>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>>>
>>> Dear cassandra users,
>>>
>>> We have the following structure in MySQL:
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>
>>> Notation: table name->[column1(key or index), column2, ...]
>>>
>>> We want to transfer the data to Cassandra.
>>>
>>> Each document can contain a large number of elements (between 1 and 100k+).
>>>
>>> We have two requirements:
>>> a) Load all elements for a given doc_id quickly
>>> b) Update the value of one individual element quickly
>>>
>>> We were thinking of the following Cassandra models:
>>>
>>> Option A
>>>
>>> documents->[doc_id(primary key), title, description, elements] (elements
>>> could be a SET or a TEXT column; each time new elements are added (they
>>> are never removed) we would append their ids to this column)
>>> elements->[element_id(primary key), title, description]
>>>
>>> Loading a document:
>>>
>>> a) Load the document with the given <doc_id> and get all element ids:
>>> SELECT * FROM documents WHERE doc_id='id'
>>>
>>> b) Load all elements with the given ids:
>>> SELECT * FROM elements WHERE element_id IN (ids loaded from query a)
>>>
>>> Option B
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(secondary index), title,
>>> description]
>>>
>>> Loading a document:
>>> a) SELECT * FROM elements WHERE doc_id='id'
>>>
>>> Neither solution seems good: in Option A, even though we are querying by
>>> primary keys, the second query will have 100k+ primary key ids in the
>>> WHERE clause, and Option B looks like an anti-pattern in Cassandra.
>>>
>>> Could anyone give us advice on how to model our use case?
>>>
>>> Thank you in advance,
>>> Zoltan.
>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com