It seems this problem is back and I am unsure how to solve it. I have a test setup like this: 4 machines run 8 processes each. Each process has 2 threads, one writing 100 000 rows and one reading another 100 000 rows. Each machine (and each process) reads and writes the exact same rows, so it is essentially the same 200 000 rows being read/written. The Cassandra cluster is a single-node cluster. The first 10-20 runs of the test described above go smoothly; after that, runs take an increasingly long time, with GC happening almost all the time.
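For concreteness, each process does roughly the following (a minimal sketch assuming a Hector-style client, since we are on 1.1; the cluster name, host, keyspace, and key format are placeholders, not our exact harness):

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.ColumnQuery;

    public class FreezeTest {
        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "cassandra-host:9160");
            // Writes use Hector's default consistency here; see the CL.ONE setup
            // further down in this mail.
            final Keyspace ks = HFactory.createKeyspace("testKeyspace", cluster);
            // ~1 kB payload, matching the row size described above
            final String payload = new String(new char[1024]).replace('\0', 'x');

            Thread writer = new Thread(new Runnable() {
                public void run() {
                    Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
                    for (int i = 0; i < 100000; i++) {
                        // every machine/process writes the same keys
                        m.insert("row-" + i, "rawData",
                                 HFactory.createStringColumn("data", payload));
                    }
                }
            });
            Thread reader = new Thread(new Runnable() {
                public void run() {
                    for (int i = 100000; i < 200000; i++) {
                        // reads a different (but equally shared) set of keys
                        ColumnQuery<String, String, String> q =
                            HFactory.createStringColumnQuery(ks);
                        q.setColumnFamily("rawData").setKey("row-" + i).setName("data");
                        q.execute();
                    }
                }
            });
            writer.start();
            reader.start();
            writer.join();
            reader.join();
        }
    }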
Here are my CASS-FREEZE-001 form answers:

How big is your JVM heap?
2 GB.

How many CPUs?
A virtual environment, so I can't be perfectly sure, but according to their specification, "8 cores".

Garbage collection taking long? (look for log lines from GCInspector)
Yes, these are a few lines seen during one test run:

INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is 2038693888

Running out of heap? ("heap is .. full" log lines)
Yes. Same run as above:

WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line 139) Heap is 0.8596674853032178 full. You may need to reduce memtable and/or cache sizes. Cassandra is now reducing cache sizes to free up memory. Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line 145) Heap is 0.8596674853032178 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically

Any tasks backing up / being dropped? (nodetool tpstats and ".. dropped in last .. ms" log lines)
Yes. Same run as above:

INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java (line 673) 31 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java (line 673) 8 READ messages dropped in last 5000ms
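By the way, my reading of those "Heap is ... full" warnings (an assumption based on the log wording, I have not checked the Cassandra source) is that the figure is simply used/max heap compared against the flush_largest_memtables_at and reduce_cache_sizes_at thresholds from cassandra.yaml, whose 1.1 defaults I believe are 0.75 and 0.85. A minimal sketch of that check, under those assumptions:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapCheck {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            // e.g. 0.8596674853032178 in the log lines above
            double full = (double) heap.getUsed() / heap.getMax();
            System.out.printf("Heap is %s full%n", full);
            if (full > 0.85) {
                System.out.println("above reduce_cache_sizes_at: Cassandra shrinks its caches");
            } else if (full > 0.75) {
                System.out.println("above flush_largest_memtables_at: Cassandra flushes the largest memtables");
            }
        }
    }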
Are writes really slow? (nodetool cfhistograms Keyspace ColumnFamily)
Not sure how to interpret the output of nodetool cfhistograms, but here it is, reformatted into its six columns (see also the sketch after these answers):

Offset     SSTables  Write Latency  Read Latency  Row Size  Column Count
1          38162520  0              0             0         200000
2          0         22             0             0         0
3          0         1629           0             0         0
4          0         9990           0             0         0
5          0         40169          0             0         0
6          0         161538         0             0         0
7          0         487266         0             0         0
8          0         1096601        0             0         0
10         0         4842978        0             0         0
12         0         7976003        0             0         0
14         0         8673230        0             0         0
17         0         9805730        0             0         0
20         0         5083707        0             0         0
24         0         2541157        0             0         0
29         0         768916         0             0         0
35         0         220440         0             0         0
42         0         112915         0             0         0
50         0         71469          0             0         0
60         0         48909          0             0         0
72         0         50714          0             0         0
86         0         45390          0             0         0
103        0         41975          0             0         0
124        0         40371          0             0         0
149        0         37103          0             0         0
179        0         44631          0             0         0
215        0         43957          0             0         0
258        0         32499          1             0         0
310        0         18446          23056779      0         0
372        0         13113          12580639      0         0
446        0         9862           1017347       0         0
535        0         7480           784506        0         0
642        0         5473           274274        0         0
770        0         4084           56379         0         0
924        0         3046           27979         0         0
1109       0         2205           20206         200000    0
1331       0         1658           16947         0         0
1597       0         1228           16969         0         0
1916       0         896            15848         0         0
2299       0         542            13928         0         0
2759       0         379            11782         0         0
3311       0         326            9761          0         0
3973       0         540            8997          0         0
4768       0         450            7938          0         0
5722       0         270            6552          0         0
6866       0         170            6022          0         0
8239       0         146            6474          0         0
9887       0         166            7969          0         0
11864      0         176            53725         0         0
14237      0         203            10260         0         0
17084      0         255            6827          0         0
20501      0         312            27462         0         0
24601      0         445            11523         0         0
29521      0         736            9904          0         0
35425      0         909            20539         0         0
42510      0         896            14280         0         0
51012      0         904            12443         0         0
61214      0         715            11956         0         0
73457      0         652            10040         0         0
88148      0         474            7992          0         0
105778     0         256            5043          0         0
126934     0         113            2370          0         0
152321     0         75             1189          0         0
182785     0         39             690           0         0
219342     0         44             550           0         0
263210     0         69             551           0         0
315852     0         35             419           0         0
379022     0         35             564           0         0
454826     0         52             504           0         0
545791     0         79             749           0         0
654949     0         61             737           0         0
785939     0         30             399           0         0
943127     0         57             611           0         0
1131752    0         78             488           0         0
1358102    0         23             302           0         0
1629722    0         28             240           0         0
1955666    0         48             294           0         0
2346799    0         28             306           0         0
2816159    0         19             224           0         0
3379391    0         37             212           0         0
4055269    0         24             237           0         0
4866323    0         13             137           0         0
5839588    0         11             99            0         0
7007506    0         4              115           0         0
8409007    0         16             194           0         0
10090808   0         12             156           0         0
12108970   0         12             54            0         0
14530764   0         24             147           0         0
17436917   0         10             114           0         0
20924300   0         3              66            0         0
25109160   0         22             100           0         0

Some of the write latencies look really bad, but since the Column Count column is 0 for most of those rows, I am not sure what to make of it.

How much is lots of data?
"Lots of data" might have been an exaggeration, but the test is as described above. Each row read or written is about 1 kB in size, so each test run generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000 (rows) * 1 kB (row size) = 6250 MB read or written (half of each).

Wide or skinny rows?
Skinny rows; only a single column is used for each row.

Mutations/sec?
The test, when run against a freshly rebooted Cassandra, takes around 390 seconds, and 6 400 000 rows are read/written during this period, so around 16 410 mutations/second (unless I misunderstood the word mutation).
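About the sketch mentioned above: my assumption (not verified against the source) is that each column of cfhistograms is an independent histogram whose Offset is the bucket's upper bound - microseconds for the latency columns, bytes for Row Size, columns per row for Column Count - so the 23056779 at offset 310 would mean roughly 23 million reads completed in 258-310 us. Under that assumption, a rough percentile can be estimated like this (bucket values copied from the Read Latency column above):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class HistogramPercentile {
        // Returns the offset (bucket upper bound) at which the running total
        // first reaches fraction p of all observations.
        static long percentile(Map<Long, Long> buckets, double p) {
            long total = 0;
            for (long count : buckets.values()) total += count;
            long seen = 0;
            for (Map.Entry<Long, Long> e : buckets.entrySet()) {
                seen += e.getValue();
                if (seen >= p * total) return e.getKey();
            }
            return -1;
        }

        public static void main(String[] args) {
            // First few buckets of the Read Latency column above (offsets in us)
            Map<Long, Long> readLatency = new LinkedHashMap<Long, Long>();
            readLatency.put(258L, 1L);
            readLatency.put(310L, 23056779L);
            readLatency.put(372L, 12580639L);
            readLatency.put(446L, 1017347L);
            // ... remaining buckets omitted for brevity ...
            System.out.println("p50 <= " + percentile(readLatency, 0.5) + " us");
        }
    }

With just these first buckets it prints "p50 <= 310 us", which would mean the typical read is fast and the pain is all in the tail.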
Which Compaction Strategy are you using?
SizeTieredCompactionStrategy.

Output of show schema (cassandra-cli) for the relevant Keyspace/CF might help as well:

create column family rawData
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and column_metadata = [
    {column_name : 'created', validation_class : LongType},
    {column_name : 'socketHash', validation_class : Int32Type},
    {column_name : 'data', validation_class : UTF8Type},
    {column_name : 'guid', validation_class : UTF8Type},
    {column_name : 'evaluated', validation_class : Int32Type,
     index_name : 'rawData_evaluated_idx_1', index_type : 0}]
  and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

Only the "data" column is used during the test.

What consistency are you doing your writes with?
I am writing with consistency level ONE; the snippet below shows roughly how the client is set up to do that.
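(A sketch using Hector's ConfigurableConsistencyLevel; the cluster and keyspace names are placeholders, and our actual client setup may differ in details:)

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    public class ConsistencySetup {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "cassandra-host:9160");
            ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
            policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
            policy.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
            // All mutators and queries created from this keyspace now use CL.ONE
            Keyspace ks = HFactory.createKeyspace("testKeyspace", cluster, policy);
            System.out.println("keyspace ready: " + ks.getKeyspaceName());
        }
    }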
What are the values for these settings in cassandra.yaml?
memtable_total_space_in_mb: no value set in cassandra.yaml, so 1/3 of the heap according to the documentation (2 GB / 3)
memtable_flush_writers: no value set in cassandra.yaml, but there is only one data directory, so I assume it is 1
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32

Which version of Cassandra?
1.1.8

Hope this helps you help me :)

Best regards,
Joel Samuelsson

2013/3/22 Joel Samuelsson <samuelsson.j...@gmail.com>

> Thanks for the GC suggestion. It seems we didn't have enough CPU power to
> handle both the data and GC. Increasing the number of CPU cores made
> everything run smoothly at the same load.
>
> 2013/3/21 Andras Szerdahelyi <andras.szerdahe...@ignitionone.com>
>
>> Neat!
>> Thanks.
>>
>> From: Sylvain Lebresne <sylv...@datastax.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Thursday 21 March 2013 10:10
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Cassandra freezes
>>
>> Prior to 1.2 the index summaries were not saved on disk, and were thus
>> computed on startup while the sstable was loaded. In 1.2 they are now
>> saved on disk to make startup faster
>> (https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said,
>> if the index_interval value used by a saved summary doesn't match the
>> current one while the sstable is loaded, the summary is recomputed
>> anyway, so restarting a node should always take a new index_interval
>> setting into account.
>>
>> --
>> Sylvain
>>
>> On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <
>> andras.szerdahe...@ignitionone.com> wrote:
>>
>>> I cannot find the reference that notes having to run upgradesstables
>>> when you change this. I really hope such complex assumptions are not
>>> forming in my head just on their own and there actually exists some
>>> kind of reliable reference that clears this up :-) but,
>>>
>>> # index_interval controls the sampling of entries from the primary
>>> # row index in terms of space versus time. The larger the interval,
>>> # the smaller and less effective the sampling will be. In technical
>>> # terms, the interval corresponds to the number of index entries that
>>> # are skipped between taking each sample. All the sampled entries
>>> # must fit in memory. Generally, a value between 128 and 512 here
>>> # coupled with a large key cache size on CFs results in the best trade
>>> # offs. This value is not often changed, however if you have many
>>> # very small rows (many to an OS page), then increasing this will
>>> # often lower memory usage without an impact on performance.
>>>
>>> It is (very) safe to assume the row index is re-built/updated when new
>>> sstables are built. Obviously the sample of this index will have to
>>> follow this process very closely.
>>>
>>> It is possible, however, that the sample itself is not persisted and is
>>> built at startup, as opposed to *only* when the index changes (which is
>>> what I thought was happening).
>>> It shouldn't be too difficult to verify this, but I'd appreciate it if
>>> someone who has looked at this before could confirm whether this is the
>>> case.
>>>
>>> Thanks,
>>> Andras
>>>
>>> On 21/03/13 09:13, "Michal Michalski" <mich...@opera.com> wrote:
>>>
>>> >About index_interval:
>>> >
>>> >> 1) you have to rebuild sstables ( not an issue if you are evaluating,
>>> >> doing test writes.. etc, not so much in production )
>>> >
>>> >Are you sure of this? As I understand indexes, it's not required,
>>> >because this parameter defines the interval of the in-memory index
>>> >sample, which is created during C* startup based on the primary
>>> >on-disk index file. The fact that heap usage is reduced immediately
>>> >after a C* restart seems to confirm this, but maybe I'm missing
>>> >something?
>>> >
>>> >M.
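P.S. To make the space/time trade-off in the quoted index_interval comment concrete, here is a back-of-the-envelope sketch (the 128 default and our 200 000-row count are the only real inputs; everything else is illustrative, not measured):

    public class IndexIntervalMath {
        public static void main(String[] args) {
            long rowsPerNode = 200000;   // rows in our test CF
            int indexInterval = 128;     // cassandra.yaml default
            // index entries kept in memory: one sample every indexInterval entries
            long sampled = rowsPerNode / indexInterval;
            System.out.println("sampled index entries in memory: " + sampled); // ~1562
            // Doubling index_interval halves the in-memory sample, at the cost
            // of scanning up to twice as many on-disk index entries per lookup.
        }
    }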