Re: [EXTERNAL] Writes and Reads with high latency
- How many event_datetime records can you have per pkey?

During a day of work I can have fewer than 10 event_datetime records per pkey. Every day I keep at most 3 of them, so each new event_datetime for a pkey results in a delete and an insert into Cassandra.

- How many pkeys (roughly) do you have?

A few million, but the number is going to grow.

- In general, you only want to have at most 100 MB of data per partition (pkey). If it is larger than that, I would expect some timeouts. I suspect you either have very wide rows or lots of tombstones.

I ran some nodetool commands in order to give you more data.

CFSTATS output (nodetool cfstats my_keyspace.my_table -H):

Total number of tables: 52
Keyspace : my_keyspace
    Read Count: 2441795
    Read Latency: 400.53986035478 ms
    Write Count: 5097368
    Write Latency: 6.494159368913525 ms
    Pending Flushes: 0
        Table: my_table
        SSTable count: 13
        Space used (live): 185.45 GiB
        Space used (total): 185.45 GiB
        Space used by snapshots (total): 0 bytes
        Off heap memory used (total): 80.66 MiB
        SSTable Compression Ratio: 0.2973552755387901
        Number of partitions (estimate): 762039
        Memtable cell count: 915
        Memtable data size: 43.75 MiB
        Memtable off heap memory used: 0 bytes
        Memtable switch count: 598
        Local read count: 2441795
        Local read latency: 93.186 ms
        Local write count: 5097368
        Local write latency: 3.189 ms
        Pending flushes: 0
        Percent repaired: 0.0
        Bloom filter false positives: 5719
        Bloom filter false ratio: 0.0
        Bloom filter space used: 1.65 MiB
        Bloom filter off heap memory used: 1.65 MiB
        Index summary off heap memory used: 1.17 MiB
        Compression metadata off heap memory used: 77.83 MiB
        Compacted partition minimum bytes: 104
        Compacted partition maximum bytes: 20924300
        Compacted partition mean bytes: 529420
        Average live cells per slice (last five minutes): 2.0
        Maximum live cells per slice (last five minutes): 3
        Average tombstones per slice (last five minutes): 7.423841059602649
        Maximum tombstones per slice (last five minutes): 50
        Dropped Mutations: 0 bytes

CFHISTOGRAMS output (nodetool cfhistograms my_keyspace my_table):

my_keyspace/my_table histograms

Percentile  SSTables   Write Latency   Read Latency   Partition Size   Cell Count
                            (micros)       (micros)          (bytes)
50%            10.00          379.02        1955.67           379022            8
75%            12.00          654.95      186563.16           654949           17
95%            12.00        20924.30      268650.95          1629722           35
98%            12.00        20924.30      322381.14          2346799           42
99%            12.00        20924.30      386857.37          3379391           50
Min             0.00            6.87          88.15              104            0
Max            12.00        25109.16      464228.84         20924300          179

I tried enabling 'tracing on' in the cqlsh CLI and ran some queries in order to find out whether tombstones are being scanned frequently, but in my small sample of queries I got answers that all look similar to the following:

Preparing statement [Native-Transport-Requests-1]
Executing single-partition query on my_table [ReadStage-2]
Acquiring sstable references [ReadStage-2]
Bloom filter allows skipping sstable 2581 [ReadStage-2]
Bloom filter allows skipping sstable 2580 [ReadStage-2]
Bloom filter allows skipping sstable 2575 [ReadStage-2]
Partition index with 2 entries found for sstable 2570 [ReadStage-2]
Bloom filter allows skipping sstable 2548 [ReadStage-2]
Bloom filter allows skipping sstable 2463 [ReadStage-2]
Bloom filter allows skipping sstable 2416 [ReadStage-2]
Partition index with 3 entries found for sstable 2354 [ReadStage-2]
Bloom filter allows skipping sstable 1784 [ReadStage-2]
Partition index with 5 entries found for sstable 1296 [ReadStage-2]
Partition index with 3 entries found for sstable 1002 [ReadStage-2]
Partition index with 3 entries found for sstable 372 [ReadStage-2]
Skipped 0/12 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2]
Merged data from memtables and 5 sstables [ReadStage-2]
Read 3 live rows and 0 tombstone cells [ReadStage-2]
Request complete

- Since you mention lots of deletes, I am thinking it could be tombstones. Are you getting any tombstone warnings or errors in your system.log?

For each pkey, a new event_datetime makes me delete one of the (at most) 3 previously saved records in Cassandra. If a pkey doesn't exist in Cassandra yet, I store it with its event_datetime without deleting anything. In Cassandra's logs I don't have any tombstone warning or error.

- When you delete, are you deleting a full partition?

This is the query I use for deletes:

delete from my_keyspace.my_table where pkey = ? and event_datetime = ? IF EXISTS;

- [..] And because only one node has the data, a single timeout means you won't get any data.

I will try to increase RF from 1 to 3.

I hope I have answered all of your questions.
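PS: the RF change I have in mind is roughly the following. This is just a sketch, assuming the keyspace uses NetworkTopologyStrategy with a single data center; the data center name below is a placeholder for ours:

    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};

    -- As far as I understand, altering the keyspace does not copy
    -- existing data to the new replicas by itself, so a full repair
    -- is needed afterwards, e.g. nodetool repair --full my_keyspace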
Re: Is there any chance the bootstrapping lost data?
> On Dec 28, 2018, at 2:17 AM, Jinhua Luo wrote:
>
> Hi All,
>
> While the pending node gets streaming of token ranges from other nodes,
> all coordinators would send new writes to it so that it would not miss
> any new data, correct?
>
> I have two (maybe silly) questions here.
> Given the CL is ONE:
> a) what if the coordinator hasn't met the pending node via gossip,
> and only sends the mutation to the main replica (the replica that would
> be replaced by the pending node)?

There is a delay between joining in gossip and calculating the bootstrap streaming plan to mitigate this. There are also protections added in recent versions to avoid ack'ing a write (or read or stream) on a node that doesn't properly own the range, so topology/gossip disagreements shouldn't result in consistency violations.

> b) what if the coordinator fails to send the mutation to the pending node?

The coordinator increases the consistency level / blockFor by one for each pending node, so the pending node may not get the write if your RF is > 1, but enough nodes will that you'll still meet your consistency guarantee. (For example, with RF=3 and CL=ONE, the coordinator waits for 2 acks instead of 1 while one replica is pending.)

> In both cases, when the pending node finishes streaming and eventually
> joins the ring, would the mutation mentioned above be lost?
>
> Regards,
> Jinhua Luo
Re: Sporadic high IO bandwidth and Linux OOM killer
On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin <oleksandr.shul...@zalando.de> wrote:

> After a fresh JVM start the memory allocation looks roughly like this:
>
>              total       used       free     shared    buffers     cached
> Mem:           14G        14G       173M       1.1M        12M       3.2G
> -/+ buffers/cache:         11G       3.4G
> Swap:           0B         0B        0B
>
> Then, within a number of days, the allocated disk cache shrinks all the
> way down to unreasonable numbers like only 150M. At the same time "free"
> stays at the original level and "used" grows all the way up to 14G.
> Shortly after that the node becomes unavailable because of the IO and
> ultimately after some time the JVM gets killed.
>
> Most importantly, the resident size of the JVM process stays at around
> 11-12G the whole time, like it was shortly after the start. How can we
> find where the rest of the memory gets allocated? Is it just some sort
> of malloc fragmentation?

For the ones following along at home, here's what we ended up with so far:

0. Switched to the next biggest EC2 instance type, r4.xlarge, and the symptoms are gone. Our bill is dominated by the price of EBS storage, so this is much less than a 2x increase in total.

1. We've noticed that the increased memory usage correlates with the number of SSTables on disk: when the number of files on disk decreases, available memory increases. This leads us to think that the extra memory allocation is indeed due to the use of mmap. It is not clear how we could account for that.

2. Improved our monitoring to include the number of files (via total minus free inodes).

Given the cluster's resource utilization, it still feels like r4.large would be a good fit, if only we could figure out those few "missing" GB of RAM. ;-)

Cheers!
--
Alex
Re: Sporadic high IO bandwidth and Linux OOM killer
I've lost some context, but there are two direct memory allocations per SSTable: the compression offsets and the bloom filter. Both of those get built during SSTable creation, and the bloom filter's size is aggressively allocated up front, so you'll see a big chunk of memory disappear as compaction kicks off, based on the estimated number of keys. Are you sure that's not what you're seeing?

If it is, raising the bloom filter FP chance (which makes the filter smaller) or increasing the compression chunk size may help. Increasing the chunk size probably also saves you some disk, since you'll get better compression ratios, at the cost of slightly slower reads. (A rough CQL sketch is at the bottom of this mail, below the quoted text.)

--
Jeff Jirsa


> On Dec 28, 2018, at 1:22 PM, Oleksandr Shulgin wrote:
>
>> On Fri, Dec 7, 2018 at 12:43 PM Oleksandr Shulgin wrote:
>>
>> After a fresh JVM start the memory allocation looks roughly like this:
>>
>>              total       used       free     shared    buffers     cached
>> Mem:           14G        14G       173M       1.1M        12M       3.2G
>> -/+ buffers/cache:         11G       3.4G
>> Swap:           0B         0B        0B
>>
>> Then, within a number of days, the allocated disk cache shrinks all the
>> way down to unreasonable numbers like only 150M. At the same time "free"
>> stays at the original level and "used" grows all the way up to 14G.
>> Shortly after that the node becomes unavailable because of the IO and
>> ultimately after some time the JVM gets killed.
>>
>> Most importantly, the resident size of the JVM process stays at around
>> 11-12G the whole time, like it was shortly after the start. How can we
>> find where the rest of the memory gets allocated? Is it just some sort
>> of malloc fragmentation?
>
> For the ones following along at home, here's what we ended up with so far:
>
> 0. Switched to the next biggest EC2 instance type, r4.xlarge, and the
> symptoms are gone. Our bill is dominated by the price of EBS storage, so
> this is much less than a 2x increase in total.
>
> 1. We've noticed that the increased memory usage correlates with the
> number of SSTables on disk: when the number of files on disk decreases,
> available memory increases. This leads us to think that the extra memory
> allocation is indeed due to the use of mmap. It is not clear how we could
> account for that.
>
> 2. Improved our monitoring to include the number of files (via total
> minus free inodes).
>
> Given the cluster's resource utilization, it still feels like r4.large
> would be a good fit, if only we could figure out those few "missing" GB
> of RAM. ;-)
>
> Cheers!
> --
> Alex
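The sketch mentioned above: both knobs are table properties, so they can be changed with ALTER TABLE. The table name and the specific values here are placeholders to show the syntax, not tuned recommendations for your workload:

    ALTER TABLE my_keyspace.my_table
      WITH bloom_filter_fp_chance = 0.1
      AND compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 128};

    -- Only SSTables written after this change pick up the new settings;
    -- existing ones keep theirs until they are rewritten by compaction
    -- or by nodetool upgradesstables -a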