Hi Alexander, thank you for your help! I think we found the answer:
CREATE TABLE documents (
    doc_id uuid,
    description text,
    title text,
    PRIMARY KEY (doc_id)
);

CREATE TABLE elements (
    doc_id uuid,
    element_id uuid,
    title text,
    PRIMARY KEY (doc_id, element_id)
);

We can retrieve all elements of a document with the following query:

SELECT * FROM elements WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613;

and update an individual element with:

UPDATE elements SET title='Hello' WHERE doc_id=131cfa55-181e-431e-7956-fe449139d613 AND element_id=a5e41c5d-fd69-45d1-959b-2fe7a1578949;

Zoltan.

On Mon, Mar 27, 2017 at 9:47 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

> Hi Zoltan,
>
> You must try to avoid multi-partition queries as much as possible.
> Instead, use asynchronous queries to grab several partitions
> concurrently. Try to send no more than ~100 queries at the same time
> to avoid DDoS-ing your cluster. That would leave you with roughly
> 1000+ groups of async queries to run. Performance will really depend
> on your hardware, consistency level, load balancing policy, partition
> fragmentation (how many updates you'll run on each element over time)
> and the SLA you're expecting.
>
> If that approach doesn't meet your SLA requirements, you can try to
> use wide partitions and group elements under buckets:
>
> CREATE TABLE elements (
>     doc_id bigint,
>     bucket bigint,
>     element_id bigint,
>     element_content text,
>     PRIMARY KEY ((doc_id, bucket), element_id)
> );
>
> The bucket here could be a modulus of the element_id (or of the hash
> of element_id if it is not a numerical value). This way you can spread
> elements over the cluster and still access them directly for updates,
> as long as you have the doc_id and the element_id.
> You'll get to run fewer queries concurrently, but each will take more
> time than the individual ones in the first scenario (one partition per
> element). You should benchmark both solutions to see which one gives
> the best performance.
> Bucket your elements so that your partitions don't grow over 100MB.
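The bucketing scheme Alexander describes can be sketched in Python (a minimal illustration, not from the thread; the bucket count of 16 and the use of crc32 as the hash for non-numeric ids are assumptions — size the bucket count so partitions stay under the ~100MB guideline):

```python
# Sketch: derive a bucket from the element id so that a document's
# elements are spread over several (doc_id, bucket) partitions.
import zlib

NUM_BUCKETS = 16  # assumed value; tune so partitions stay < ~100MB


def bucket_for(element_id):
    """Return the bucket for an element id.

    Numeric ids are used directly (modulus); non-numeric ids are
    hashed first, as suggested in the thread. crc32 is a stable
    stand-in for "the hash of element_id".
    """
    if isinstance(element_id, int):
        return element_id % NUM_BUCKETS
    return zlib.crc32(str(element_id).encode("utf-8")) % NUM_BUCKETS


# An element is then read or written with the full partition key:
#   SELECT ... WHERE doc_id=? AND bucket=? AND element_id=?
```

Because the bucket is recomputable from the element id, a single element can still be addressed directly for updates without scanning the whole document.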
> Large partitions are silent cluster killers (1GB+ partitions are a
> direct threat to cluster stability)...
>
> To ensure best performance, use prepared statements along with the
> TokenAwarePolicy
> <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
> to avoid unnecessary coordination.
>
> Cheers,
>
> On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:
>
>> Querying by (doc_id and element_id) OR just by (element_id) is fine,
>> but the real question is: will it be efficient to query 100k+ primary
>> keys in the elements table?
>> e.g.
>>
>> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
>> element_id3, .... element_id100K+) ?
>>
>> The element_id is the primary key.
>>
>> Thank you!
>>
>> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
>> wrote:
>>
>> Have one table hold the document metadata (doc_id, title,
>> description, ...) and have another table, elements, where the
>> partition key is doc_id and the clustering key is element_id.
>> The only problem here is if you need to query and/or update an
>> element just by element_id, but I don't know your queries up front.
>>
>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>> wrote:
>>
>> Dear cassandra users,
>>
>> We have the following structure in MySQL:
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(index), title, description]
>>
>> Notation: table name->[column1(key or index), column2, ...]
>>
>> We want to transfer the data to Cassandra.
>>
>> Each document can contain a large number of elements (between 1 and
>> 100k+).
>>
>> We have two requirements:
>> a) Load all elements for a given doc_id quickly
>> b) Update the value of one individual element quickly
>>
>> We were thinking of the following Cassandra configurations:
>>
>> Option A
>>
>> documents->[doc_id(primary key), title, description, elements]
>> (elements could be a SET or a TEXT; each time new elements are added
>> (they are never removed) we would append them to this column)
>> elements->[element_id(primary key), title, description]
>>
>> Loading a document:
>>
>> a) Load the document with the given <doc_id> and get all element ids:
>> SELECT * FROM documents WHERE doc_id='id'
>>
>> b) Load all elements with the given ids:
>> SELECT * FROM elements WHERE element_id IN (ids loaded from query a)
>>
>> Option B
>>
>> documents->[doc_id(primary key), title, description]
>> elements->[element_id(primary key), doc_id(secondary index), title,
>> description]
>>
>> Loading a document:
>> a) SELECT * FROM elements WHERE doc_id='id'
>>
>> Neither solution seems good: in Option A, even though we are querying
>> by primary keys, the second query will have 100k+ primary key ids in
>> the WHERE clause, and Option B looks like an anti-pattern in
>> Cassandra.
>>
>> Could anyone give any advice on how we could model our use case?
>>
>> Thank you in advance,
>> Zoltan.
>>
>> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
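For future readers: the "~100 concurrent queries" advice from the thread amounts to walking the 100k+ element keys in bounded groups. A minimal, driver-agnostic sketch (the helper names and group size are illustrative; a real implementation would call the driver's async execute for each key in a group and wait on the returned futures):

```python
# Sketch of running many single-partition queries in bounded groups,
# per the ~100-at-a-time advice in the thread. `run_query` stands in
# for the driver's async execute so the grouping logic can be shown
# on its own.
GROUP_SIZE = 100  # keep roughly this many queries in flight at once


def grouped(keys, size=GROUP_SIZE):
    """Yield keys in groups of at most `size`."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


def fetch_all(keys, run_query):
    """Run `run_query` for every key, one bounded group at a time."""
    results = []
    for group in grouped(keys):
        # With a real driver, each group would be:
        #   futures = [session.execute_async(stmt, (k,)) for k in group]
        #   results.extend(f.result() for f in futures)
        results.extend(run_query(k) for k in group)
    return results
```

With 100k+ element keys this yields on the order of 1000 groups, matching the "1000+ groups of async queries" estimate in the thread.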