I would avoid secondary indexes unless absolutely necessary. It is much
better to have an efficient partition key + clustering key: for any given
partition, C* knows exactly which nodes to contact, whereas with a
secondary index it generally does not, and will likely still need to talk
to every node. Indexes also carry storage and performance overhead that
you avoid entirely by putting the time bucket in the partition key.

The time bucket in the partition key is a well-utilised pattern and works
very well with time window compaction strategy; the same cannot be said
for secondary indexes.
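For illustration, a table using that pattern might look something like
this (table and column names are only placeholders):

    -- The bucket is part of the partition key, so every read names its
    -- partitions up front, and TWCS can drop whole SSTables as buckets
    -- expire.
    CREATE TABLE readings (
        source      text,
        time_bucket text,      -- e.g. '2021-03-15', one bucket per day
        ts          timestamp,
        value       text,
        PRIMARY KEY ((source, time_bucket), ts)
    ) WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 1
    };

    -- Reads always supply the full partition key:
    SELECT ts, value FROM readings
    WHERE source = 'PRIME SOURCE' AND time_bucket = '2021-03-15';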
raft.so - Cassandra consulting, support, and managed services

On Thu, Mar 18, 2021 at 3:12 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

> Thank you for this.
> What about using a UUID for every row as the partition key and then a
> secondary index for your time buckets instead of making them part of the
> partition key?
> Example - say your buckets are 2021-03-15, 2021-03-16 etc... Your table:
>
> create table whatever (uuid text, time_bucket text, primary key (uuid));
> create index bucket_idx on whatever (time_bucket);
>
> If I understand secondary indexes, they are OK to use as long as what
> you are indexing on doesn't have a huge number of distinct values. Even
> after 10 years, your secondary index only has 3650 distinct values.
> Bad idea?
>
> -Joe
>
> On 3/16/2021 9:59 AM, Durity, Sean R wrote:
>
> Sometimes time bucketing can be used to create manageable partition
> sizes. How much data is attached to a day, week, or minute? Could you
> use a partition and clustering key like: ((source, time_bucket),
> timestamp)?
>
> Then your application logic can iterate through time buckets to pull
> out the data in scalable chunks:
>
> Select column1, column2 from my_table where source = 'PRIME SOURCE' and
> time_bucket = '2021-03-15';
> Select column1, column2 from my_table where source = 'PRIME SOURCE' and
> time_bucket = '2021-03-16';
> ...
>
> Also, there are implementations of Spark that will create the proper,
> single-partition queries for large data sets. DataStax Analytics is one
> example (Spark runs on each node).
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
> From: Bowen Song <bo...@bso.ng.INVALID>
> Sent: Monday, March 15, 2021 5:27 PM
> To: user@cassandra.apache.org
> Subject: [EXTERNAL] Re: No node was available to execute query error
>
> There are different approaches, depending on the application's logic.
> Roughly speaking, there are two distinct scenarios:
>
> 1. Your application knows all the partition keys of the required data
> in advance, either by reading them from another data source (e.g.
> another Cassandra table, another database, a file, or an API), or
> because it can reconstruct the partition keys from other known
> information (e.g. sequential numbers, dates/times in a known range,
> etc.).
> 2. Your application needs all (or nearly all) rows from a given table,
> so you can use range requests to read everything out of that table.
>
> However, before you choose the second option and create a table for
> each "source" value, I must warn you that creating hundreds of tables
> in Cassandra is a bad idea.
>
> Ask yourself a question: what is really required to 'do something'? Do
> you really need *all* data each time? Is it possible to make 'do
> something' incremental, so you'll only need *some* data each time?
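> To sketch the second scenario: a full-table read is normally driven by
> token ranges rather than by a secondary index. Using the "whatever"
> table from above (the two halves shown here are illustrative; in
> practice you would split the ring into many smaller sub-ranges and
> page through each one):
>
>     SELECT uuid, time_bucket FROM whatever
>     WHERE token(uuid) >= -9223372036854775808 AND token(uuid) < 0;
>
>     SELECT uuid, time_bucket FROM whatever
>     WHERE token(uuid) >= 0 AND token(uuid) <= 9223372036854775807;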
> On 15/03/2021 19:33, Joe Obernberger wrote:
>
> Thank you.
> What is the best way to iterate over a very large number of rows in
> Cassandra? I know the DataStax driver lets Java fetch blocks of n
> records, but is that the best way?
>
> -joe
>
> On 3/15/2021 1:42 PM, Bowen Song wrote:
>
> I personally try to avoid using secondary indexes, especially in large
> clusters.
>
> SI is not scalable: because an SI query doesn't carry the partition key
> information, Cassandra must send it to nearly all nodes in a DC to get
> the answer. Thus, the more nodes you have in a cluster, the slower and
> more expensive an SI query becomes. Creating an SI on a table can also
> indirectly create large partitions in the index tables.
>
> On 15/03/2021 17:27, Joe Obernberger wrote:
>
> Great stuff - thank you. I've spent the morning here redesigning with
> smaller partitions.
>
> If I have a large number of unique IDs that I want to regularly 'do
> something' with, would it make sense to have a table where a UUID is
> the partition key, and create a secondary index on a field (call it
> source) that I want to select on, where the number of UUIDs per source
> might be very large (billions)?
> So: select * from table where source=?
> The number of unique source values is small - maybe 1000 - whereas each
> source may have billions of UUIDs.
>
> -Joe
>
> On 3/15/2021 11:18 AM, Bowen Song wrote:
>
> To be clear, this
>
> CREATE TABLE ... PRIMARY KEY (k1, k2);
>
> is the same as:
>
> CREATE TABLE ... PRIMARY KEY ((k1), k2);
>
> but they are NOT the same as:
>
> CREATE TABLE ... PRIMARY KEY ((k1, k2));
>
> The first two statements create a table with partition key k1 and
> clustering key k2. The third statement creates a composite partition
> key from k1 and k2, so k1 and k2 together form the partition key of
> the table.
>
> Your example "create table xyz (uuid text, source text, primary key
> (source, uuid));" uses the same syntax as the first statement, which
> creates the table xyz with partition key source and clustering key
> uuid (which, BTW, is a non-reserved keyword).
>
> A partition in Cassandra is solely determined by the partition key(s);
> the clustering key(s) have nothing to do with it. The size of a
> compacted partition is determined by the number of rows in the
> partition and the size of each row. If the table doesn't have a
> clustering key, each partition will have at most one row. The row size
> is the serialized size of all data in that row, including tombstones.
>
> You can reduce the partition size of a table either by reducing the
> serialized data size or by adding more columns to the (composite)
> partition key. But please be aware: you will have to provide ALL
> partition key values when you read from or write to this table (other
> than range, SI or MV queries), so you need to consider the queries
> before designing the table schema. For scalability, you need a
> predictable partition size that does not grow over time, or an
> actionable plan to re-partition the table when the partition size
> exceeds a certain threshold. Picking the threshold is more of an art
> than a science; generally speaking it should stay below a few hundred
> MB, and often no more than 100 MB.
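> As a concrete sketch of that last point (the bucket column and its
> granularity are purely illustrative), compare the xyz table above with
> a re-partitioned variant:
>
>     -- One partition per source: grows without bound as uuids arrive.
>     create table xyz (uuid text, source text,
>         primary key (source, uuid));
>
>     -- One partition per (source, bucket): bounded growth, but every
>     -- read and write must now supply both source and bucket.
>     create table xyz2 (uuid text, source text, bucket text,
>         primary key ((source, bucket), uuid));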
> On 15/03/2021 14:36, Joe Obernberger wrote:
>
> Thank you Bowen - I'm redesigning the tables now. When you give
> Cassandra a two-part primary key like
>
> create table xyz (uuid text, source text, primary key (source, uuid));
>
> how is the second part of the primary key used to determine partition
> size?
>
> -Joe
>
> On 3/12/2021 5:27 PM, Bowen Song wrote:
>
> The partition size min/avg/max of 8409008/15096925/25109160 bytes looks
> fine for the fieldcounts table, but the number of partitions is a bit
> worrying. Only 3 partitions? Are you expecting the partition size
> (instead of the number of partitions) to grow in the future? That can
> lead to a lot of headaches.
>
> Forget about the fieldcounts table for now; the doc table looks really
> bad. It has min/avg/max partition sizes of 24602/7052951452/63771372175
> bytes - the partition sizes are severely unevenly distributed, and the
> over-60GB maximum partition is way too big.
>
> You really need to redesign your table schemas, and avoid creating
> large or uneven partitions.
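> (To spot this kind of skew quickly on a live cluster, the per-table
> histograms are handy - on recent versions:
>
>     nodetool tablehistograms doc doc
>
> prints the partition-size percentiles for the doc table, so a huge gap
> between the 50th percentile and the maximum stands out immediately.)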
> On 12/03/2021 18:52, Joe Obernberger wrote:
>
> Thank you very much for helping me out on this! The fieldcounts table
> is currently pretty small - 6.4 million rows.
>
> cfstats are:
>
> Total number of tables: 81
> ----------------
> Keyspace : doc
>     Read Count: 3713134
>     Read Latency: 0.2664131157130338 ms
>     Write Count: 47513045
>     Write Latency: 1.0725477948634947 ms
>     Pending Flushes: 0
>         Table: fieldcounts
>         SSTable count: 3
>         Space used (live): 16010248
>         Space used (total): 16010248
>         Space used by snapshots (total): 0
>         Off heap memory used (total): 4947
>         SSTable Compression Ratio: 0.3994304032360534
>         Number of partitions (estimate): 3
>         Memtable cell count: 0
>         Memtable data size: 0
>         Memtable off heap memory used: 0
>         Memtable switch count: 0
>         Local read count: 379
>         Local read latency: NaN ms
>         Local write count: 0
>         Local write latency: NaN ms
>         Pending flushes: 0
>         Percent repaired: 100.0
>         Bloom filter false positives: 0
>         Bloom filter false ratio: 0.00000
>         Bloom filter space used: 48
>         Bloom filter off heap memory used: 24
>         Index summary off heap memory used: 51
>         Compression metadata off heap memory used: 4872
>         Compacted partition minimum bytes: 8409008
>         Compacted partition maximum bytes: 25109160
>         Compacted partition mean bytes: 15096925
>         Average live cells per slice (last five minutes): NaN
>         Maximum live cells per slice (last five minutes): 0
>         Average tombstones per slice (last five minutes): NaN
>         Maximum tombstones per slice (last five minutes): 0
>         Dropped Mutations: 0
>
> Commitlog is on a separate spindle on the 7-node cluster. All disks are
> SATA (spinning rust, as they say!). This is an R&D platform, but I will
> switch to NetworkTopologyStrategy. I'm using Prometheus and Grafana to
> monitor Cassandra, and the CPU load is typically 100 to 200% on most of
> the nodes. Disk IO is typically pretty low.
>
> Performance - in general, async is about 10x faster.
>
> ExecuteAsync:
> 35 mSec for 364 rows.
> 8120 mSec for 205001 rows.
> 14788 mSec for 345001 rows.
> 4117 mSec for 86400 rows.
> 23,330 rows per second on average
>
> Execute:
> 232 mSec for 364 rows.
> 584869 mSec for 1263283 rows.
> 46290 mSec for 86400 rows.
> 2,160 rows per second on average
>
> Curious - our largest table (doc) has the following stats - is it not
> partitioned well?
>
> Total number of tables: 81
> ----------------
> Keyspace : doc
>     Read Count: 3713134
>     Read Latency: 0.2664131157130338 ms
>     Write Count: 47513045
>     Write Latency: 1.0725477948634947 ms
>     Pending Flushes: 0
>         Table: doc
>         SSTable count: 26
>         Space used (live): 57124641753
>         Space used (total): 57124641753
>         Space used by snapshots (total): 113012646218
>         Off heap memory used (total): 27331913
>         SSTable Compression Ratio: 0.2531585373184219
>         Number of partitions (estimate): 12
>         Memtable cell count: 0
>         Memtable data size: 0
>         Memtable off heap memory used: 0
>         Memtable switch count: 0
>         Local read count: 27169
>         Local read latency: NaN ms
>         Local write count: 0
>         Local write latency: NaN ms
>         Pending flushes: 0
>         Percent repaired: 0.0
>         Bloom filter false positives: 0
>         Bloom filter false ratio: 0.00000
>         Bloom filter space used: 576
>         Bloom filter off heap memory used: 368
>         Index summary off heap memory used: 425
>         Compression metadata off heap memory used: 27331120
>         Compacted partition minimum bytes: 24602
>         Compacted partition maximum bytes: 63771372175
>         Compacted partition mean bytes: 7052951452
>         Average live cells per slice (last five minutes): NaN
>         Maximum live cells per slice (last five minutes): 0
>         Average tombstones per slice (last five minutes): NaN
>         Maximum tombstones per slice (last five minutes): 0
>         Dropped Mutations: 0
>
> Thanks again!
>
> -Joe
>
> On 3/12/2021 11:01 AM, Bowen Song wrote:
>
> The fact that sleep-then-retry works is just another indicator that
> this is likely a GC-pause-related issue. I'd recommend you check your
> Cassandra servers' GC logs first.
>
> Do you know the maximum partition size for the doc.fieldcounts table?
> (Try the "nodetool cfstats doc.fieldcounts" command.) I suspect this
> table has large partitions, which usually leads to GC issues.
>
> As for your failed executeAsync() insert issue: do you know how many
> concurrent in-flight queries you have? The driver limits the number of
> in-flight requests, and new executeAsync() calls will fail when that
> limit is reached.
>
> I'm also a bit concerned about your "significantly" slower inserts.
> Inserts (excluding "INSERT IF NOT EXISTS") should be very fast in
> Cassandra. How slow are they? Are they always slow like that, or
> usually fast but with some much slower than others? What do the CPU
> usage and disk IO look like on the Cassandra server? Do you have the
> commitlog on the same disk as the data? Is it a spinning disk, SATA
> SSD or NVMe?
>
> BTW, you really shouldn't use SimpleStrategy for production
> environments.
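> To illustrate, one simple way to stay under that in-flight limit is to
> gate executeAsync() behind a semaphore (a sketch against the 4.x Java
> driver; the limit of 512 is an arbitrary example, not a recommended
> value):
>
>     import java.util.concurrent.Semaphore;
>     import com.datastax.oss.driver.api.core.CqlSession;
>     import com.datastax.oss.driver.api.core.cql.BoundStatement;
>
>     // e.g. a field in the class doing the inserts:
>     final Semaphore inFlight = new Semaphore(512);
>
>     void boundedInsert(CqlSession session, BoundStatement bound)
>             throws InterruptedException {
>         inFlight.acquire(); // blocks once 512 inserts are outstanding
>         session.executeAsync(bound)
>                .whenComplete((rs, err) -> inFlight.release());
>     }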
> On 12/03/2021 15:18, Joe Obernberger wrote:
>
> The queries that are failing are:
>
> select fieldvalue, count from doc.ordered_fieldcounts where source=?
> and fieldname=? limit 10
>
> Created with:
>
> CREATE TABLE doc.ordered_fieldcounts (
>     source text,
>     fieldname text,
>     count bigint,
>     fieldvalue text,
>     PRIMARY KEY ((source, fieldname), count, fieldvalue)
> ) WITH CLUSTERING ORDER BY (count DESC, fieldvalue ASC)
>
> and:
>
> select fieldvalue, count from doc.fieldcounts where source=? and
> fieldname=?
>
> Created with:
>
> CREATE TABLE doc.fieldcounts (
>     source text,
>     fieldname text,
>     fieldvalue text,
>     count bigint,
>     PRIMARY KEY (source, fieldname, fieldvalue)
> )
>
> This really seems like a driver issue. I put retry logic around the
> calls and now those queries work. Basically, if it throws an exception,
> I Thread.sleep(500) and then retry. This seems to be a continuing theme
> with Cassandra in general. Is this common practice?
>
> After adding this retry logic, an insert statement started failing with
> an IllegalStateException when I retried it (which makes sense). This
> insert was using session.executeAsync(boundStatement). I changed that
> to plain execute (instead of async) and now I get no errors and no
> retries anywhere. The insert is *significantly* slower when running
> execute vs executeAsync. When using executeAsync:
>
> com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was
> available to execute the query
>     at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
>     at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
>     at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.maybeMoveToNextPage(MultiPageResultSet.java:99)
>     at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.computeNext(MultiPageResultSet.java:91)
>     at com.datastax.oss.driver.internal.core.cql.MultiPageResultSet$RowIterator.computeNext(MultiPageResultSet.java:79)
>     at com.datastax.oss.driver.internal.core.util.CountingIterator.tryToComputeNext(CountingIterator.java:91)
>     at com.datastax.oss.driver.internal.core.util.CountingIterator.hasNext(CountingIterator.java:86)
>     at com.ngc.helios.fieldanalyzer.FTAProcess.handleOrderedFieldCounts(FTAProcess.java:684)
>     at com.ngc.helios.fieldanalyzer.FTAProcess.storeResults(FTAProcess.java:214)
>     at com.ngc.helios.fieldanalyzer.FTAProcess.startProcess(FTAProcess.java:190)
>     at com.ngc.helios.fieldanalyzer.Main.main(Main.java:20)
>
> The interesting part here is that the line that is now failing (line
> 684 in FTAProcess) is:
>
> if (itRs.hasNext())
>
> where itRs is an Iterator<Row> over a select query from another table.
> I'm iterating over a result set from a select and inserting those
> results via executeAsync.
>
> -Joe
>
> On 3/12/2021 9:07 AM, Bowen Song wrote:
>
> Millions of rows in a single query? That sounds like a bad idea to me.
> Your "NoNodeAvailableException" could be caused by stop-the-world GC
> pauses, and the GC pauses are likely caused by the query itself.
> On 12/03/2021 13:39, Joe Obernberger wrote:
>
> Thank you Paul and Erick. The keyspace is defined like this:
>
> CREATE KEYSPACE doc WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'} AND durable_writes = true;
>
> Would that cause this?
>
> The program that is having the problem selects data, calculates stuff,
> and inserts. It works with smaller selects, but when the number of rows
> is in the millions, I start to get this error. Since it works with
> smaller sets, I don't believe it to be a network error. All the nodes
> are definitely up, as other processes are working OK; it's just this
> one program that fails.
>
> The full stack trace:
>
> Error: com.datastax.oss.driver.api.core.NoNodeAvailableException: No
> node was available to execute the query
> com.datastax.oss.driver.api.core.NoNodeAvailableException: No node was
> available to execute the query
>     at com.datastax.oss.driver.api.core.NoNodeAvailableException.copy(NoNodeAvailableException.java:40)
>     at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
>     at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:53)
>     at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30)
>     at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230)
>     at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54)
>     at com.abc.xxxx.fieldanalyzer.FTAProcess.udpateCassandraFTAMetrics(FTAProcess.java:275)
>     at com.abc.xxxx.fieldanalyzer.FTAProcess.storeResults(FTAProcess.java:216)
>     at com.abc.xxxx.fieldanalyzer.FTAProcess.startProcess(FTAProcess.java:199)
>     at com.abc.xxxx.fieldanalyzer.Main.main(Main.java:20)
>
> FTAProcess line 275 is:
>
> ResultSet rs = session.execute(getFieldCounts.bind().setString(0,
>     rb.getSource()).setString(1, rb.getFieldName()));
>
> -Joe
>
> On 3/12/2021 8:30 AM, Paul Chandler wrote:
>
> Hi Joe,
>
> This could also be caused by the replication factor of the keyspace: if
> you have NetworkTopologyStrategy and it doesn't list a replication
> factor for the datacenter datacenter1, then you will get this error
> message too.
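> For example, a definition that names the datacenter explicitly would
> look something like this (the datacenter name and replication factor
> are illustrative and must match your cluster's topology):
>
>     ALTER KEYSPACE doc WITH replication = {
>         'class': 'NetworkTopologyStrategy',
>         'datacenter1': '3'
>     };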
> Paul
>
> On 12 Mar 2021, at 13:07, Erick Ramirez <erick.rami...@datastax.com>
> wrote:
>
> Does it get returned by the driver every single time? The
> NoNodeAvailableException gets thrown when (1) all nodes are down, or
> (2) all the contact points are invalid from the driver's perspective.
>
> Is it possible there's no route/connectivity from your app server(s)
> to the 172.16.x.x network? If you post the full error message + full
> stack trace, it might provide clues. Cheers!