Hi Zoltan,

you must try to avoid multi-partition queries as much as possible. Instead, use asynchronous queries to fetch several partitions concurrently. Send no more than ~100 queries at the same time to avoid DDoS-ing your own cluster; with 100k+ elements, that still leaves you with roughly 1,000+ groups of async queries to run. Performance will really depend on your hardware, consistency level, load balancing policy, partition fragmentation (how many updates you'll run on each element over time) and the SLA you're expecting.
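Here's a rough sketch of that pattern with the DataStax Java driver 3.x. The table layout (one partition per element, keyed by element_id), the class/method names and the keyspace are just illustrative assumptions:

    import com.datastax.driver.core.*;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Semaphore;

    public class AsyncLoader {
        // Fetch many single-partition rows concurrently, capped at ~100 in flight.
        public static List<Row> loadElements(Session session, List<Long> elementIds) {
            final Semaphore inFlight = new Semaphore(100);   // max concurrent queries
            PreparedStatement ps = session.prepare(
                "SELECT * FROM elements WHERE element_id = ?");
            List<ResultSetFuture> futures = new ArrayList<>();
            for (Long id : elementIds) {
                inFlight.acquireUninterruptibly();           // wait for a free slot
                ResultSetFuture f = session.executeAsync(ps.bind(id));
                Futures.addCallback(f, new FutureCallback<ResultSet>() {
                    public void onSuccess(ResultSet rs) { inFlight.release(); }
                    public void onFailure(Throwable t) { inFlight.release(); }
                });
                futures.add(f);
            }
            List<Row> rows = new ArrayList<>();
            for (ResultSetFuture f : futures) {
                rows.add(f.getUninterruptibly().one());      // one row per element id
            }
            return rows;
        }
    }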
If that approach doesn't meet your SLA requirements, you can try to use wide partitions and group elements under buckets:

    CREATE TABLE elements (
        doc_id bigint,
        bucket bigint,
        element_id bigint,
        element_content text,
        PRIMARY KEY ((doc_id, bucket), element_id)
    );

The bucket here could be a modulus of the element_id (or of a hash of the element_id if it is not a numerical value). This way you spread elements over the cluster, yet you can still access a single element directly for updates as long as you have its doc_id and element_id. You'll run fewer queries concurrently, but each will take more time than the individual single-partition queries of the first scenario (one partition per element). You should benchmark both solutions to see which one gives the best performance.

Bucket your elements so that your partitions don't grow over 100MB. Large partitions are silent cluster killers (1GB+ partitions are a direct threat to cluster stability)...

To ensure best performance, use prepared statements along with the TokenAwarePolicy <http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html> to avoid unnecessary coordination.
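For instance, something along these lines. The contact point, keyspace name, bucket count of 16 and the sample ids are all arbitrary assumptions; pick a bucket count that keeps your partitions under ~100MB:

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class BucketedElements {
        // Assumption: 16 buckets per doc; tune so partitions stay under ~100MB.
        static final long NUM_BUCKETS = 16;

        public static void main(String[] args) {
            // Token-aware routing sends each request straight to a replica that
            // owns the partition, avoiding an extra coordinator hop.
            Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")                  // assumption: node address
                .withLoadBalancingPolicy(
                    new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
            Session session = cluster.connect("my_keyspace");  // assumption: keyspace

            // Prepare once, reuse for every update.
            PreparedStatement update = session.prepare(
                "UPDATE elements SET element_content = ? " +
                "WHERE doc_id = ? AND bucket = ? AND element_id = ?");

            long docId = 42L, elementId = 1337L;               // hypothetical ids
            long bucket = elementId % NUM_BUCKETS;             // derive bucket from element_id
            session.execute(update.bind("new content", docId, bucket, elementId));

            cluster.close();
        }
    }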
Cheers,

On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:

> Querying by (doc_id and element_id) OR just by (element_id) is fine, but
> the real question is: will it be efficient to query 100k+ primary keys in
> the elements table?
> e.g.
>
> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
> element_id3, .... element_id100K+) ?
>
> The element_id is a primary key.
>
> Thank you!
>
> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
> wrote:
>
>> Have one table hold the document metadata (doc_id, title, description, ...)
>> and have another table, elements, where the partition key is doc_id and
>> the clustering key is element_id.
>> The only problem here is if you need to query and/or update an element
>> just by element_id, but I don't know your queries up front.
>>
>> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com>
>> wrote:
>>
>>> Dear cassandra users,
>>>
>>> We have the following structure in MySQL:
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(index), title, description]
>>>
>>> Notation: table name->[column1(key or index), column2, …]
>>>
>>> We want to transfer the data to Cassandra.
>>>
>>> Each document can contain a large number of elements (between 1 and
>>> 100k+).
>>>
>>> We have two requirements:
>>> a) Load all elements for a given doc_id quickly
>>> b) Update the value of one individual element quickly
>>>
>>> We were thinking of the following Cassandra configurations:
>>>
>>> Option A
>>>
>>> documents->[doc_id(primary key), title, description, elements] (elements
>>> could be a SET or a TEXT; each time new elements are added (they are
>>> never removed) we would append them to this column)
>>> elements->[element_id(primary key), title, description]
>>>
>>> Loading a document:
>>>
>>> a) Load the document with the given <doc_id> and get all element ids:
>>> SELECT * FROM documents WHERE doc_id='id'
>>>
>>> b) Load all elements with the given ids:
>>> SELECT * FROM elements WHERE element_id IN (ids loaded from query a)
>>>
>>> Option B
>>>
>>> documents->[doc_id(primary key), title, description]
>>> elements->[element_id(primary key), doc_id(secondary index), title,
>>> description]
>>>
>>> Loading a document:
>>> a) SELECT * FROM elements WHERE doc_id='id'
>>>
>>> Neither solution seems good: in Option A, even though we are querying by
>>> primary keys, the second query will have 100k+ primary key ids in the
>>> WHERE clause, and the second solution looks like an anti-pattern in
>>> Cassandra.
>>>
>>> Could anyone give any advice on how we should model our use case?
>>>
>>> Thank you in advance,
>>> Zoltan.

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com