> Regarding memory usage after a repair ... Are the merkle trees kept around?
They should not be.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss <bto...@gmail.com> wrote:

> Regarding memory usage after a repair ... Are the merkle trees kept around?
>
> On Oct 23, 2012 3:00 PM, "Bryan Talbot" <btal...@aeriagames.com> wrote:
> On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> The GC was ongoing even when the nodes were not compacting or running a
>> heavy application load -- even when the main app was paused, the constant
>> GC continued.
> If you restart a node, is the onset of GC activity correlated to some event?
>
> Yes and no. When the nodes were generally under the .75 occupancy threshold,
> a weekly "repair -pr" job would cause them to go over the threshold and then
> stay there even after the repair had completed and there were no ongoing
> compactions. It acts as though at least some substantial amount of memory
> used during repair was never released once the repair was complete.
>
> Once one CF in particular grew larger, the constant GC would start up pretty
> soon (less than 90 minutes) after a node restart, even without a repair.
>
>> As a test we dropped the largest CF and the memory usage immediately
>> dropped to acceptable levels and the constant GC stopped. So it's
>> definitely related to data load. Memtable size is 1 GB, row cache is
>> disabled, and key cache is small (default).
> How many keys did the CF have per node?
> I dismissed the memory used to hold bloom filters and index sampling. That
> memory is not considered part of the memtable size and will end up in the
> tenured heap. It is generally only a problem with very large key counts per
> node.
>
> I've changed the app to retain less data for that CF, but I think it was
> about 400M rows per node. Row keys are a TimeUUID. All of the rows are
> write-once, never updated, and rarely read. There are no secondary indexes
> for this particular CF.
>
>> They were 2+ GB (as reported by nodetool cfstats anyway). It looks like
>> bloom_filter_fp_chance defaults to 0.0
> The default should be 0.000744.
>
> If the chance is zero or null, this code should run when a new SSTable is
> written:
>
>     // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
>     logger.error("Bloom filter FP chance of zero isn't supposed to happen");
>
> Were the CFs migrated from an old version?
>
> Yes, the CFs were created in 1.0.9, then migrated to 1.0.11 and finally to
> 1.1.5, with an "upgradesstables" run at each upgrade along the way.
>
> I could not find a way to view the current bloom_filter_fp_chance settings
> when they are at the default value. JMX reports the actual fp rate, and if a
> specific rate is set for a CF it shows up in "describe table", but I
> couldn't find a way to tell what the default was. I didn't inspect the
> source.
>
>> Is there any way to predict how much memory the bloom filters will consume
>> if the size of the row keys, number of rows, and fp chance are known?
>
> See o.a.c.utils.BloomFilter.getFilter() in the code.
> This http://hur.st/bloomfilter appears to give similar results.
>
> Ahh, very helpful. This indicates that 714MB would be used for the bloom
> filter for that one CF.
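For reference, that figure is consistent with the textbook Bloom filter sizing
formula, bits = -n * ln(p) / (ln 2)^2. A minimal Java sketch of the arithmetic,
assuming ~400M keys and the 0.000744 default fp chance mentioned above (this is
just the formula, not a call into Cassandra's o.a.c.utils.BloomFilter):

    // Textbook Bloom filter sizing: bits = -n * ln(p) / (ln 2)^2.
    // Assumes ~400M keys per node and the 0.000744 default fp chance
    // discussed above; an estimate only, not Cassandra's exact allocation.
    public class BloomFilterEstimate
    {
        static long bloomFilterBits(long keys, double fpChance)
        {
            return (long) Math.ceil(-keys * Math.log(fpChance)
                                    / (Math.log(2) * Math.log(2)));
        }

        public static void main(String[] args)
        {
            long bits = bloomFilterBits(400000000L, 0.000744);
            System.out.printf("~%.0f MB%n", bits / 8.0 / (1024 * 1024));
            // prints ~715 MB, in line with the 714MB figure above
        }
    }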
> JMX / cfstats reports "Bloom Filter Space Used", but the MBean method name
> (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If the
> on-disk and in-memory space used are similar, then summing up all the
> "Bloom Filter Space Used" values says they're currently consuming 1-2 GB of
> the heap, which is substantial.
>
> If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0? It
> just means more trips to the SSTable indexes for a read, correct? Trade RAM
> for time (disk I/O).
>
> -Bryan
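For a rough sense of the RAM side of that trade-off, here is the same textbook
estimate evaluated at a few bloom_filter_fp_chance values, again assuming
~400M keys per node (formula output only, not what Cassandra actually
allocates):

    // Same textbook estimate (bits = -n * ln(p) / (ln 2)^2) at a few
    // fp chances, assuming ~400M keys per node. Estimates only.
    public class FpChanceTradeoff
    {
        public static void main(String[] args)
        {
            long keys = 400000000L;
            double[] fpChances = { 0.000744, 0.01, 0.1 };
            for (double p : fpChances)
            {
                double bits = -keys * Math.log(p) / (Math.log(2) * Math.log(2));
                System.out.printf("fp_chance %.6f -> ~%.0f MB%n",
                                  p, bits / 8 / (1024 * 1024));
            }
            // Roughly 715 MB, 457 MB, and 229 MB respectively. At a chance
            // of 1.0 the formula gives 0 bits (ln(1.0) == 0), i.e. no filter,
            // so every read of an absent key falls through to the index/disk.
        }
    }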