Hi,

Looks like that is my primary problem - the sstable count for the daily_challenges column family is >5k.

Azure had a scheduled maintenance window on Saturday. All the VMs got rebooted one by one - including the current cassandra one - and it's taking forever to bring cassandra back up online.
Is there any way I can re-organize my existing data so that I can bring down that count? I don't want to lose that data. If possible, can I do that while cassandra is down? As I mentioned, it's taking forever to get the service up - it's stuck reading those 5k sstable files (plus another 5k corresponding secondary index files). :(

Oh, did I mention I'm new to cassandra?

Thanks,
Kunal

On 11 July 2015 at 03:29, Sebastian Estevez <sebastian.este...@datastax.com> wrote:

> #1
>
>> There is one table - daily_challenges - which shows compacted partition
>> max bytes as ~460M and another one - daily_guest_logins - which shows
>> compacted partition max bytes as ~36M.
>
> 460M is high; I like to keep my partitions under 100MB when possible. I've
> seen worse, though. The fix is to add something else (maybe month or week or
> something) into your partition key:
>
> PRIMARY KEY ((segment_type, something_else), date, user_id, sess_id)
>
> #2 Looks like your jamm version is 3 per your env.sh, so you're probably
> okay to copy the env.sh over from the C* 3.0 link I shared once you
> uncomment and tweak the MAX_HEAP. If there's something wrong, your node
> won't come up; tail your logs.
>
> All the best,
>
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
> On Fri, Jul 10, 2015 at 2:44 PM, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>
>> And here is my cassandra-env.sh
>> https://gist.github.com/kunalg/2c092cb2450c62be9a20
>>
>> Kunal
>>
>> On 11 July 2015 at 00:04, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>
>>> From jhat output, the top 10 entries for "Instance Count for All Classes
>>> (excluding platform)" show:
>>>
>>> 2088223 instances of class org.apache.cassandra.db.BufferCell
>>> 1983245 instances of class org.apache.cassandra.db.composites.CompoundSparseCellName
>>> 1885974 instances of class org.apache.cassandra.db.composites.CompoundDenseCellName
>>> 630000 instances of class org.apache.cassandra.io.sstable.IndexHelper$IndexInfo
>>> 503687 instances of class org.apache.cassandra.db.BufferDeletedCell
>>> 378206 instances of class org.apache.cassandra.cql3.ColumnIdentifier
>>> 101800 instances of class org.apache.cassandra.utils.concurrent.Ref
>>> 101800 instances of class org.apache.cassandra.utils.concurrent.Ref$State
>>> 90704 instances of class org.apache.cassandra.utils.concurrent.Ref$GlobalState
>>> 71123 instances of class org.apache.cassandra.db.BufferDecoratedKey
>>>
>>> At the bottom of the page, it shows:
>>> Total of 8739510 instances occupying 193607512 bytes.
>>> JFYI.
>>>
>>> Kunal
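
For illustration, here is a minimal sketch of what Sebastian's #1 suggestion above could look like for daily_challenges, assuming a month-based bucket (the table name, the "month" column and the bucket granularity are illustrative assumptions, not part of his advice; the current schema is quoted further down the thread):

CREATE TABLE app_10001.daily_challenges_by_month (
    segment_type text,
    month text,        -- e.g. '2015-07'; caps each partition at one month of data
    date timestamp,
    user_id int,
    sess_id text,
    data text,
    deleted boolean,
    PRIMARY KEY ((segment_type, month), date, user_id, sess_id)
) WITH CLUSTERING ORDER BY (date DESC, user_id ASC, sess_id ASC);

Reads would then need to supply both segment_type and month, and existing rows would have to be copied into the new table (for example with a small script or cqlsh's COPY), since a table's partition key cannot be altered in place.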
>>> On 10 July 2015 at 23:49, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>
>>>> Thanks for the quick reply.
>>>>
>>>> 1. I don't know what thresholds I should look for. So, to save this
>>>> back-and-forth, I'm attaching the cfstats output for the keyspace.
>>>>
>>>> There is one table - daily_challenges - which shows compacted partition
>>>> max bytes as ~460M and another one - daily_guest_logins - which shows
>>>> compacted partition max bytes as ~36M.
>>>>
>>>> Can that be a problem?
>>>> Here is the CQL schema for the daily_challenges column family:
>>>>
>>>> CREATE TABLE app_10001.daily_challenges (
>>>>     segment_type text,
>>>>     date timestamp,
>>>>     user_id int,
>>>>     sess_id text,
>>>>     data text,
>>>>     deleted boolean,
>>>>     PRIMARY KEY (segment_type, date, user_id, sess_id)
>>>> ) WITH CLUSTERING ORDER BY (date DESC, user_id ASC, sess_id ASC)
>>>>     AND bloom_filter_fp_chance = 0.01
>>>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>>>     AND comment = ''
>>>>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>>>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>>     AND dclocal_read_repair_chance = 0.1
>>>>     AND default_time_to_live = 0
>>>>     AND gc_grace_seconds = 864000
>>>>     AND max_index_interval = 2048
>>>>     AND memtable_flush_period_in_ms = 0
>>>>     AND min_index_interval = 128
>>>>     AND read_repair_chance = 0.0
>>>>     AND speculative_retry = '99.0PERCENTILE';
>>>>
>>>> CREATE INDEX idx_deleted ON app_10001.daily_challenges (deleted);
>>>>
>>>> 2. I don't know - how do I check? As I mentioned, I just installed the
>>>> dsc21 update from datastax's debian repo (ver 2.1.7).
>>>>
>>>> Really appreciate your help.
>>>>
>>>> Thanks,
>>>> Kunal
>>>>
>>>> On 10 July 2015 at 23:33, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
>>>>
>>>>> 1. You want to look at the # of sstables in cfhistograms, and in cfstats look at:
>>>>>     Compacted partition maximum bytes
>>>>>     Maximum live cells per slice
>>>>>
>>>>> 2. No. Here's the env.sh from 3.0, which should work with some tweaks:
>>>>>
>>>>> https://github.com/tobert/cassandra/blob/0f70469985d62aeadc20b41dc9cdc9d72a035c64/conf/cassandra-env.sh
>>>>>
>>>>> You'll at least have to modify the jamm version to what's in yours. I
>>>>> think it's 2.5.
>>>>>
>>>>> All the best,
>>>>>
>>>>> Sebastián Estévez
>>>>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
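
As a quick sketch of the checks Sebastian describes above, using the keyspace and table from this thread (cfstats and cfhistograms are the 2.1-era names of these nodetool subcommands):

# Per-table stats: look for "SSTable count", "Compacted partition maximum bytes"
# and "Maximum live cells per slice" in the output.
nodetool cfstats app_10001.daily_challenges

# Per-table histograms: the SSTables column shows how many sstables a read
# typically has to touch.
nodetool cfhistograms app_10001 daily_challenges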
>>>>> On Fri, Jul 10, 2015 at 1:42 PM, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Sebastian.
>>>>>>
>>>>>> Couple of questions (I'm really new to cassandra):
>>>>>> 1. How do I interpret the output of 'nodetool cfstats' to figure out
>>>>>> the issues? Any documentation pointer on that would be helpful.
>>>>>>
>>>>>> 2. I'm primarily a python/c developer - so, totally clueless about the
>>>>>> JVM environment. So, please bear with me as I would need a lot of
>>>>>> hand-holding.
>>>>>> Should I just copy+paste the settings you gave and try to restart the
>>>>>> failing cassandra server?
>>>>>>
>>>>>> Thanks,
>>>>>> Kunal
>>>>>>
>>>>>> On 10 July 2015 at 22:35, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
>>>>>>
>>>>>>> #1 You need more information.
>>>>>>>
>>>>>>> a) Take a look at your .hprof file (the memory heap from the OOM) with
>>>>>>> an introspection tool like jhat, visualvm or java flight recorder and see
>>>>>>> what is using up your RAM.
>>>>>>>
>>>>>>> b) How big are your large rows? (Use nodetool cfstats on each node.)
>>>>>>> If your data model is bad, you are going to have to re-design it no
>>>>>>> matter what.
>>>>>>>
>>>>>>> #2 As a possible workaround, try using the G1GC collector with the
>>>>>>> settings from C* 3.0 instead of CMS. I've seen lots of success with it
>>>>>>> lately (tl;dr G1GC is much simpler than CMS and almost as good as a finely
>>>>>>> tuned CMS). *Note:* Use it with the latest Java 8 from Oracle. Do
>>>>>>> *not* set the newgen size; G1 sets it dynamically:
>>>>>>>
>>>>>>>> # min and max heap sizes should be set to the same value to avoid
>>>>>>>> # stop-the-world GC pauses during resize, and so that we can lock the
>>>>>>>> # heap in memory on startup to prevent any of it from being swapped out.
>>>>>>>> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
>>>>>>>> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>>>>>>>>
>>>>>>>> # Per-thread stack size.
>>>>>>>> JVM_OPTS="$JVM_OPTS -Xss256k"
>>>>>>>>
>>>>>>>> # Use the Hotspot garbage-first collector.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
>>>>>>>>
>>>>>>>> # Have the JVM do less remembered set work during STW, instead
>>>>>>>> # preferring concurrent GC. Reduces p99.9 latency.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
>>>>>>>>
>>>>>>>> # The JVM maximum is 8 PGC threads and 1/4 of that for ConcGC.
>>>>>>>> # Machines with > 10 cores may need additional threads.
>>>>>>>> # Increase to <= full cores (do not count HT cores).
>>>>>>>> #JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16"
>>>>>>>> #JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16"
>>>>>>>>
>>>>>>>> # Main G1GC tunable: lowering the pause target will lower throughput and vice versa.
>>>>>>>> # 200ms is the JVM default and lowest viable setting.
>>>>>>>> # 1000ms increases throughput. Keep it smaller than the timeouts in cassandra.yaml.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
>>>>>>>>
>>>>>>>> # Do reference processing in parallel GC.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+ParallelRefProcEnabled"
>>>>>>>>
>>>>>>>> # This may help eliminate STW.
>>>>>>>> # The default in Hotspot 8u40 is 40%.
>>>>>>>> #JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>>>>>>>>
>>>>>>>> # For workloads that do large allocations, increasing the region
>>>>>>>> # size may make things more efficient. Otherwise, let the JVM
>>>>>>>> # set this automatically.
>>>>>>>> #JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"
>>>>>>>>
>>>>>>>> # Make sure all memory is faulted and zeroed on startup.
>>>>>>>> # This helps prevent soft faults in containers and makes
>>>>>>>> # transparent hugepage allocation more effective.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+AlwaysPreTouch"
>>>>>>>>
>>>>>>>> # Biased locking does not benefit Cassandra.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>>>>>>>>
>>>>>>>> # Larger interned string table, for gossip's benefit (CASSANDRA-6410)
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:StringTableSize=1000003"
>>>>>>>>
>>>>>>>> # Enable thread-local allocation blocks and allow the JVM to automatically
>>>>>>>> # resize them at runtime.
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB -XX:+ResizeTLAB"
>>>>>>>>
>>>>>>>> # http://www.evanjones.ca/jvm-mmap-pause.html
>>>>>>>> JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
>>>>>>>
>>>>>>> All the best,
>>>>>>>
>>>>>>> Sebastián Estévez
>>>>>>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>>>>>>>
>>>>>>> On Fri, Jul 10, 2015 at 12:55 PM, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I upgraded my instance from 8GB to a 14GB one.
>>>>>>>> Allocated 8GB to the jvm heap in cassandra-env.sh.
>>>>>>>>
>>>>>>>> And now, it crashes even faster with an OOM..
>>>>>>>>
>>>>>>>> Earlier, with a 4GB heap, I could get up to ~90% replication completion
>>>>>>>> (as reported by nodetool netstats); now, with an 8GB heap, I cannot even get
>>>>>>>> there. I've already restarted the cassandra service 4 times with the 8GB heap.
>>>>>>>>
>>>>>>>> No clue what's going on.. :(
>>>>>>>>
>>>>>>>> Kunal
>>>>>>>>
>>>>>>>> On 10 July 2015 at 17:45, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> You, and only you, are responsible for knowing your data and data model.
>>>>>>>>>
>>>>>>>>> If columns per row or rows per partition can be large, then an 8GB
>>>>>>>>> system is probably too small. But the real issue is that you need to keep
>>>>>>>>> your partition size from getting too large.
>>>>>>>>>
>>>>>>>>> Generally, an 8GB system is okay, but only for reasonably-sized
>>>>>>>>> partitions, like under 10MB.
>>>>>>>>>
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2015 at 8:05 AM, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm new to cassandra.
>>>>>>>>>> How do I find those out? - mainly, the partition params that you
>>>>>>>>>> asked for. Others, I think I can figure out.
>>>>>>>>>>
>>>>>>>>>> We don't have any large objects/blobs in the column values - it's
>>>>>>>>>> all textual, date-time, numeric and uuid data.
>>>>>>>>>>
>>>>>>>>>> We use cassandra primarily to store segmentation data - with the
>>>>>>>>>> segment type as the partition key. That is again divided into two separate
>>>>>>>>>> column families, but they have a similar structure.
>>>>>>>>>>
>>>>>>>>>> Columns per row can be fairly large - each segment type is the
>>>>>>>>>> row key, with the associated user ids and timestamps as column values.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Kunal
>>>>>>>>>>
>>>>>>>>>> On 10 July 2015 at 16:36, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What does your data and data model look like - partition size,
>>>>>>>>>>> rows per partition, number of columns per row, any large values/blobs in
>>>>>>>>>>> column values?
>>>>>>>>>>>
>>>>>>>>>>> You could run fine on an 8GB system, but only if your rows and
>>>>>>>>>>> partitions are reasonably small. Any large partitions could blow you away.
>>>>>>>>>>>
>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jul 10, 2015 at 4:22 AM, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Attaching the stack dump captured from the last OOM.
>>>>>>>>>>>>
>>>>>>>>>>>> Kunal
>>>>>>>>>>>>
>>>>>>>>>>>> On 10 July 2015 at 13:32, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Forgot to mention: the data size is not that big - it's barely
>>>>>>>>>>>>> 10GB in all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kunal
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10 July 2015 at 13:29, Kunal Gangakhedkar <kgangakhed...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a 2-node setup on Azure (east us region) running
>>>>>>>>>>>>>> Ubuntu server 14.04LTS. Both nodes have 8GB RAM.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One of the nodes (the seed node) died with an OOM - so, I am trying
>>>>>>>>>>>>>> to add a replacement node with the same configuration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem is this new node also keeps dying with an OOM - I've
>>>>>>>>>>>>>> restarted the cassandra service like 8-10 times hoping that it would finish
>>>>>>>>>>>>>> the replication. But it didn't help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The one node that is still up is happily chugging along.
>>>>>>>>>>>>>> All nodes have a similar configuration - with libjna installed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cassandra is installed from datastax's debian repo - pkg: dsc21,
>>>>>>>>>>>>>> version 2.1.7.
>>>>>>>>>>>>>> I started off with the default configuration - i.e. the default
>>>>>>>>>>>>>> cassandra-env.sh - which calculates the heap size automatically
>>>>>>>>>>>>>> (1/4 * RAM = 2GB).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But that didn't help. So, I then tried to increase the heap
>>>>>>>>>>>>>> to 4GB manually and restarted. It still keeps crashing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any clue as to why it's happening?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Kunal
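
For reference, the manual heap override Kunal mentions above normally goes near the top of cassandra-env.sh; a minimal sketch, with the 4GB figure from this thread used purely as an illustration:

# cassandra-env.sh: uncomment and set both to pin the heap size
# (the stock 2.1 env.sh expects them to be set together).
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"    # young-gen size, used by the default CMS setup

With the G1-based env.sh Sebastian linked earlier, only MAX_HEAP_SIZE needs tweaking; G1 sizes the young generation dynamically.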