Don't forget to run "nodetool upgradesstables -a" after running the ALTER statement so that all SSTables get rewritten with the new compression settings.
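For example, assuming you settle on a 16 KB chunk length for the ks.xyz table from your schema below (adjust the value to whatever you end up choosing), that would look something like:

ALTER TABLE ks.xyz WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '16'};
nodetool upgradesstables -a ks xyz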
Since you have a lot of tables in your cluster, be aware that lowering the chunk length will increase Cassandra's off-heap memory usage. You can find more information here: http://thelastpickle.com/blog/2018/08/08/compression_performance.html

You should also check your readahead settings, as they may be set too high:

sudo blockdev --report

The default is usually 256, but Cassandra favors low readahead values to get more IOPS rather than more throughput (and readahead is usually not that useful for Cassandra anyway). A conservative setting is 64, e.g. "sudo blockdev --setra 64 /dev/sdb", assuming /dev/sdb is your data disk as your df output below suggests (you can go down to 8 and see how Cassandra performs then). Do note that changing the readahead setting requires restarting Cassandra, as it is only read once by the JVM during startup.

Cheers,

On Wed, Sep 5, 2018 at 7:27 AM CPC <acha...@gmail.com> wrote:

> Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.
>
> On Wed, Sep 5, 2018, 5:51 AM wxn...@zjqunshuo.com <wxn...@zjqunshuo.com> wrote:
>
>> How large is your row? You may be hitting a wide row read problem.
>>
>> -Simon
>>
>> *From:* Laxmikant Upadhyay <laxmikant....@gmail.com>
>> *Date:* 2018-09-05 01:01
>> *To:* user <user@cassandra.apache.org>
>> *Subject:* High IO and poor read performance on 3.11.2 cassandra cluster
>>
>> We have a 3 node cassandra cluster (3.11.2) in a single DC.
>>
>> We have written 450 million records on the table with LCS. The write latency is fine. After the writes we perform read and update operations.
>>
>> When we run read+update operations on newly inserted 1 million records (on top of the 450 M records), the read latency and IO usage are under control. However, when we perform read+update on 1 million old records which are part of the 450 million records, we observe high read latency (the performance goes down by 4 times compared to the first case). We have not observed major GC pauses.
>>
>> *system information:*
>> *cpu cores:* 24
>> *disk type:* ssd
>> We are using RAID with the deadline scheduler.
>> *disk space:*
>> df -h :
>> Filesystem      Size  Used  Avail  Use%  Mounted on
>> /dev/sdb1       1.9T  393G   1.5T   22%  /var/lib/cassandra
>> *memory:*
>> free -g
>>               total   used   free   shared   buff/cache   available
>> Mem:             62     30      0        0           32          31
>> Swap:             8      0      8
>>
>> ==========================================
>>
>> *schema*
>>
>> desc table ks.xyz;
>>
>> CREATE TABLE ks.xyz (
>>     key text,
>>     column1 text,
>>     value text,
>>     PRIMARY KEY (key, column1)
>> ) WITH COMPACT STORAGE
>>     AND CLUSTERING ORDER BY (column1 ASC)
>>     AND bloom_filter_fp_chance = 0.1
>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>     AND comment = ''
>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>     AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>     AND crc_check_chance = 1.0
>>     AND dclocal_read_repair_chance = 0.0
>>     AND default_time_to_live = 0
>>     AND gc_grace_seconds = 864000
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.0
>>     AND speculative_retry = '99PERCENTILE';
>>
>> ==============================================================================
>> Below is a snippet of some system stats taken while read operations were running:
>>
>> *iotop -o* : Observation : the disk read goes up to 5.5 G/s
>>
>> Total DISK READ :  *3.86 G/s* | Total DISK WRITE :  1252.88 K/s
>> Actual DISK READ:  *3.92 G/s* | Actual DISK WRITE:  0.00 B/s
>>   TID  PRIO  USER      DISK READ    DISK WRITE  SWAPIN      IO>  COMMAND
>> 10715  be/4  cassandr  375.89 M/s    99.79 K/s  0.00 %  29.15 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10714  be/4  cassandr  358.56 M/s   107.18 K/s  0.00 %  27.06 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10712  be/4  cassandr  351.86 M/s   147.83 K/s  0.00 %  25.02 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10718  be/4  cassandr  359.82 M/s   110.87 K/s  0.00 %  24.49 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10711  be/4  cassandr  333.03 M/s   125.66 K/s  0.00 %  23.37 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10716  be/4  cassandr  330.80 M/s   103.48 K/s  0.00 %  23.02 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10717  be/4  cassandr  319.49 M/s   118.27 K/s  0.00 %  22.11 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10713  be/4  cassandr  300.62 M/s   118.27 K/s  0.00 %  21.65 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10719  be/4  cassandr  294.98 M/s    81.31 K/s  0.00 %  21.60 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10720  be/4  cassandr  289.00 M/s    73.92 K/s  0.00 %  21.45 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10742  be/4  cassandr  240.98 M/s    81.31 K/s  0.00 %  17.68 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10743  be/4  cassandr  224.43 M/s    36.96 K/s  0.00 %  17.57 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10744  be/4  cassandr  113.29 M/s    14.78 K/s  0.00 %  10.22 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>> 10745  be/4  cassandr   61.63 M/s    33.26 K/s  0.00 %   4.20 %  java -Dorg.xerial.snappy.tempdir=/var/tmp -Dja~etrics-core-3.1.0.jar:/usr/share/cassandra/lib/
>>
>> ==================================================================================================
>> *iostat -x 5*
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>            2.35   0.00     7.09  *11.26*    0.00  79.30
>>
>> Device:  rrqm/s  wrqm/s      r/s   w/s       rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> sdb      205.60    0.00  5304.60  3.00  3707651.20  2450.40   1398.03     11.37   2.14     2.14     0.67   0.18  93.60
>>
>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>            2.99   0.00     7.49  *10.95*    0.00  78.56
>>
>> Device:  rrqm/s  wrqm/s      r/s   w/s       rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> sdb      228.00    0.00  5639.40  4.80  3947922.40  1348.00   1399.41     12.65   2.24     2.25     0.46   0.17  94.00
>>
>> ==================================================================================================
>> *nodetool info*
>> ID                      : ddab003d-41d1-413e-ace2-87ccd34229d5
>> Gossip active           : true
>> Thrift active           : false
>> Native Transport active : true
>> Load                    : 391.58 GiB
>> Generation No           : 1536065356
>> Uptime (seconds)        : 4714
>> Heap Memory (MB)        : 5172.52 / 13312.00
>> Off Heap Memory (MB)    : 2288.28
>> Data Center             : DC1
>> Rack                    : RAC1
>> Exceptions              : 0
>> Key Cache               : entries 1604008, size 167.04 MiB, capacity 256 MiB, 3615 hits, 1801574 requests, 0.002 recent hit rate, 14400 save period in seconds
>> Row Cache               : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
>> Counter Cache           : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
>> Chunk Cache             : entries 7680, size 480 MiB, capacity 480 MiB, 1399440 misses, 2338352 requests, 0.402 recent hit rate, 2760.976 microseconds miss latency
>> Percent Repaired        : 1.6357850897532832E-8%
>> Token                   : (invoke with -T/--tokens to see all 256 tokens)
>>
>> =============================================================================
>>
>> *nodetool tablehistograms ks xyz* : Observation : not a wide row
>> ks/xyz histograms
>> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>>                            (micros)      (micros)         (bytes)
>> 50%             1.00          17.08       5839.59            1109          17
>> 75%             1.00          20.50     *7007.51*            1331          24
>> 95%             2.00          29.52    *12108.97*            1331          24
>> 98%             2.00          35.43      14530.76            1331          24
>> 99%             2.00          42.51      17436.92            1331          24
>> Min             1.00           5.72        105.78             150           2
>> Max             2.00         315.85     464228.84            1916          35
>>
>> *nodetool tablehistograms ks xyz*
>> ks/xyz histograms
>> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>>                            (micros)      (micros)         (bytes)
>> 50%             1.00          17.08       5839.59            1109          17
>> 75%             1.00          20.50       7007.51            1331          24
>> 95%             2.00          29.52      12108.97            1331          24
>> 98%             2.00          35.43      14530.76            1331          24
>> 99%             2.00          42.51      17436.92            1331          24
>> Min             1.00           4.77        105.78             150           2
>> Max             2.00         315.85     464228.84            1916          35
>>
>> ======================================================================
>>
>> *nodetool tpstats*
>>
>> Pool Name                     Active  Pending  Completed  Blocked  All time blocked
>> ReadStage                          3        1    1603986        0                 0
>> MiscStage                          0        0          0        0                 0
>> CompactionExecutor                 0        0       6670        0                 0
>> MutationStage                      0        0    5044371        0                 0
>> MemtableReclaimMemory              0        0         57        0                 0
>> PendingRangeCalculator             0        0          3        0                 0
>> GossipStage                        0        0      14299        0                 0
>> SecondaryIndexManagement           0        0          0        0                 0
>> HintsDispatcher                    0        0          0        0                 0
>> RequestResponseStage               0        1    3867575        0                 0
>> Native-Transport-Requests          3        0    2319802        0                 0
>> ReadRepairStage                    0        0      13281        0                 0
>> CounterMutationStage               0        0          0        0                 0
>> MigrationStage                     0        0          2        0                 0
>> MemtablePostFlush                  0        0         65        0                 0
>> PerDiskMemtableFlushWriter_0       0        0         57        0                 0
>> ValidationExecutor                 0        0          0        0                 0
>> Sampler                            0        0          0        0                 0
>> MemtableFlushWriter                0        0         57        0                 0
>> InternalResponseStage              0        0          0        0                 0
>> ViewMutationStage                  0        0          0        0                 0
>> AntiEntropyStage                   0        0          0        0                 0
>> CacheCleanupExecutor               0        0          0        0                 0
>>
>> Message type        Dropped
>> READ                      0
>> RANGE_SLICE               0
>> _TRACE                    0
>> HINT                      0
>> MUTATION                  0
>> COUNTER_MUTATION          0
>> BATCH_STORE               0
>> BATCH_REMOVE              0
>> REQUEST_RESPONSE          0
>> PAGED_RANGE               0
>> READ_REPAIR               0
>>
>> =====================================================================
>>
>> *nodetool compactionstats*
>> pending tasks: 0
>>
>> ==============================================================================
>> *debug.log snippet*
>>
>> WARN [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057 NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s with average duration of 15.17ms, 2 have exceeded the configured commit interval by an average of 20.32ms
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system', ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659 ColumnFamilyStore.java:918 - Enqueuing flush of batches: 128.567MiB (13%) on-heap, 0.000KiB (0%) off-heap
>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685 Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range = (null, null]
>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695 Memtable.java:485 - Completed flushing /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db (0.000KiB) for commitlog position CommitLogPosition(segmentId=1536065319618, position=7958044)
>> DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695 ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest 0.000KiB, smallest 8589934592.000GiB
>> DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173 - 2 operations were slow in the last 4999 msecs:
>> <SELECT * FROM ks1.abc WHERE key = 12345 LIMIT 5000>, time 575 msec - slow timeout 500 msec/cross-node
>> <SELECT * FROM ks.xyz WHERE key = key1 LIMIT 5000>, time 645 msec - slow timeout 500 msec/cross-node
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976 ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks', ColumnFamily='xyz') to free up room.
>> Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977 ColumnFamilyStore.java:918 - Enqueuing flush of xyz: 121.374MiB (12%) on-heap, 0.000KiB (0%) off-heap
>>
>> *Observation:* frequent "operations were slow in the last ..." messages for select queries, and "WARN ... commit log syncs over the past ..." messages
>> ===================================================
>>
>> *nodetool tablestats -H ks.xyz*
>> Total number of tables: 89
>> ----------------
>> Keyspace : ks
>>     Read Count: 1439722
>>     Read Latency: 1.8982509581710914 ms
>>     Write Count: 4222811
>>     Write Latency: 0.016324778684151386 ms
>>     Pending Flushes: 0
>>         Table: xyz
>>         SSTable count: 1036
>>         SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0, 0]
>>         Space used (live): 187.09 GiB
>>         Space used (total): 187.09 GiB
>>         Space used by snapshots (total): 0 bytes
>>         Off heap memory used (total): 783.93 MiB
>>         SSTable Compression Ratio: 0.3238726404414842
>>         Number of partitions (estimate): 447095605
>>         Memtable cell count: 306194
>>         Memtable data size: 20.59 MiB
>>         Memtable off heap memory used: 0 bytes
>>         Memtable switch count: 7
>>         Local read count: 1440322
>>         Local read latency: 6.785 ms
>>         Local write count: 1408204
>>         Local write latency: 0.021 ms
>>         Pending flushes: 0
>>         Percent repaired: 0.0
>>         Bloom filter false positives: 19
>>         Bloom filter false ratio: 0.00003
>>         Bloom filter space used: 418.2 MiB
>>         Bloom filter off heap memory used: 418.19 MiB
>>         Index summary off heap memory used: 307.75 MiB
>>         Compression metadata off heap memory used: 57.99 MiB
>>         Compacted partition minimum bytes: 150
>>         Compacted partition maximum bytes: 1916
>>         Compacted partition mean bytes: 1003
>>         Average live cells per slice (last five minutes): 20.0
>>         Maximum live cells per slice (last five minutes): 20
>>         Average tombstones per slice (last five minutes): 1.0
>>         Maximum tombstones per slice (last five minutes): 1
>>         Dropped Mutations: 0 bytes
>>
>> --
>>
>> regards,
>> Laxmikant Upadhyay

--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com