I've been trying to go through the logs, but I can't say I understand the details very well:
INFO [SlabPoolCleaner] 2014-10-30 19:20:18,446 ColumnFamilyStore.java:856 - Enqueuing flush of loc: 7977119 (1%) on-heap, 0 (0%) off-heap
DEBUG [SharedPool-Worker-22] 2014-10-30 19:20:18,446 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row 2c95cbbb61fb8ec3bd06d70058bfa236ccad5195e48fd00c056f7e1e3fdd4368 in ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815026000 !63072000,])
DEBUG [SharedPool-Worker-6] 2014-10-30 19:20:18,446 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row 41fc260427a88d2f084971702fdcb32756e0731c6042f93e9761e03db5197990 in ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815333000 !63072000,])
DEBUG [SharedPool-Worker-25] 2014-10-30 19:20:18,446 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row 2e8c4dab33faade0a4fc265e4126e43dc2e58fb72830f73d7e9b8e836101d413 in ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815335000 !63072000,])
DEBUG [SharedPool-Worker-26] 2014-10-30 19:20:18,446 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row 245bec68c5820364a72db093d5c9899b631e692006881c98f0abf4da5fbff4cd in ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815344000 !63072000,])
DEBUG [SharedPool-Worker-20] 2014-10-30 19:20:18,446 AbstractSimplePerColumnSecondaryIndex.java:124 - applying index row ea8dfb47177bd40f46aac4fe41d3cfea3316cf35451ace0825f46b6e0fa9e3ef in ColumnFamily(loc.loc_id_idx [66652e312e31332e3830:0:false:0@1414696815262000 !63072000,])

This is a sample of the Enqueuing flush events in the storm.

On Thu, Oct 30, 2014 at 12:20 PM, Maxime <maxim...@gmail.com> wrote:

> I will give adding the logging a shot.
>
> I've tried some experiments and I have no clue what could be happening anymore:
>
> I tried setting all nodes except one to a streamthroughput of 1, to see if somehow it was getting overloaded by too many streams coming in at once. Nope.
>
> I went through the source at ColumnFamilyStore.java:856, where the huge burst of "Enqueuing flush..." occurs, and it's clearly at the moment memtables get converted to SSTables on disk. So I started the bootstrap process and, using a bash script, triggered a 'nodetool flush' every minute during the process. At first it seemed to work, but again, after what seems to be a locally-triggered cue, the burst came (many, many thousands of "Enqueuing flush..."). Through my previous experiments, I am fairly certain it's not a question of the volume of data coming in (throughput) or the number of SSTables being streamed (dealing with at most 150 files per node).
>
> Does anyone know if such Enqueuing bursts are normal during bootstrap? I'd like to be able to say "it's because my nodes are underpowered", but at the moment I'm leaning towards a bug of some kind.
>
> On Wed, Oct 29, 2014 at 3:05 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> Some ideas:
>>
>> 1) Turn on DEBUG logging on the joining node to see what is going on in detail with the stream of 1500 files
>>
>> 2) Check the stream ID to see whether it's a new stream or an old one still pending
>>
>> On Wed, Oct 29, 2014 at 2:21 AM, Maxime <maxim...@gmail.com> wrote:
>>
>>> Doan, thanks for the tip, I just read about it this morning; I'm just waiting for the new version to pop up on the Debian DataStax repo.
>>>
>>> Michael, I do believe you are correct about the general running of the cluster, and I've reset everything.
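A minimal sketch of the kind of 'nodetool flush' loop mentioned a few messages up; the one-minute interval is the one described there, everything else (error handling, nodetool being on the PATH, targeting the local node) is an assumption:

    #!/usr/bin/env bash
    # Rough sketch: force a memtable flush once a minute while the new node
    # is bootstrapping, to keep memtable pressure down. Assumes nodetool is
    # on the PATH and acts on the local node.
    while true; do
        nodetool flush || echo "$(date) flush failed" >&2
        sleep 60
    done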
>>> So it took me a while to reply, but I finally got the SSTable count down, as seen in the OpsCenter graphs. I'm stumped, however, because when I bootstrap the new node I still see a very large number of files being streamed (~1500 for some nodes), and the bootstrap process is failing exactly as it did before, in a flurry of "Enqueuing flush of ..."
>>>
>>> Any ideas? I'm reaching the end of what I know I can do. OpsCenter says around 32 SSTables per CF, but it is still streaming tons of "files". :-/
>>>
>>> On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>>> "Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily."
>>>>
>>>> --> You can try the new DateTiered compaction strategy (https://issues.apache.org/jira/browse/CASSANDRA-6602), released in 2.1.1, to eliminate tombstones if you have a time-series data model
>>>>
>>>> On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael <michael.la...@nytimes.com> wrote:
>>>>
>>>>> Again, from our experience w 2.0.x:
>>>>>
>>>>> Revert to the defaults - you are manually setting heap way too high IMHO.
>>>>>
>>>>> On our small nodes we tried LCS - way too much compaction - so we switched all CFs to STCS.
>>>>>
>>>>> We do a major rolling compaction on our small nodes weekly during less busy hours - works great. Be sure you have enough disk.
>>>>>
>>>>> We never explicitly delete and only use TTLs or truncation. You can set GC grace to 0 in that case, so tombstones are more readily expunged. There are a couple of threads on the list that discuss this... also, normal rolling repair becomes optional, reducing load (still repair if something unusual happens, though...).
>>>>>
>>>>> In your current situation, you need to kickstart compaction - are there any CFs you can truncate, at least temporarily? Then try compacting a small CF, then another, etc.
>>>>>
>>>>> Hopefully you can get enough headroom to add a node.
>>>>>
>>>>> ml
>>>>>
>>>>> On Sun, Oct 26, 2014 at 6:24 PM, Maxime <maxim...@gmail.com> wrote:
>>>>>
>>>>>> Hmm, thanks for the reading.
>>>>>>
>>>>>> I initially followed some (perhaps too old) maintenance scripts, which included a weekly 'nodetool compact'. Is there a way for me to undo the damage? Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily.
>>>>>>
>>>>>> On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>
>>>>>>> "Should doing a major compaction on those nodes lead to a restructuring of the SSTables?" --> Beware of major compaction on SizeTiered: it will create 2 giant SSTables, and the expired/outdated/tombstoned columns in those big files will never be cleaned, since the SSTables will never get a chance to be compacted again.
>>>>>>>
>>>>>>> Essentially, to reduce the fragmentation into small SSTables you can stay with SizeTiered compaction and play around with the compaction properties (the thresholds) to make C* group a bunch of files each time it compacts, so that the file count shrinks to a reasonable number.
>>>>>>>
>>>>>>> Since you're using C* 2.1 and anti-compaction has been introduced, I hesitate to advise using Leveled compaction as a work-around to reduce the SSTable count.
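As a rough illustration of tuning those SizeTiered thresholds: the keyspace, table, and values below are placeholders and not a recommendation, just a sketch of what the change would look like.

    # Hypothetical example: raise max_threshold so SizeTiered groups more
    # SSTables per compaction run. 'mykeyspace' and 'mytable' are placeholders.
    cqlsh -e "ALTER TABLE mykeyspace.mytable
              WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                                 'min_threshold': 4,
                                 'max_threshold': 64};"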
>>>>>>> Things are a little bit more complicated because of the incremental repair process (I don't know whether you're using incremental repair or not in production). The Dev blog says that Leveled compaction is performed only on repaired SSTables; the un-repaired ones still use SizeTiered. More details here: http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>>
>>>>>>>> If the issue is related to I/O, you're going to want to determine if you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is.
>>>>>>>>
>>>>>>>> On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>>> > Hello Maxime
>>>>>>>> >
>>>>>>>> > Increasing the flush writers won't help if your disk I/O is not keeping up.
>>>>>>>> >
>>>>>>>> > I've had a look into the log file; below are some remarks:
>>>>>>>> >
>>>>>>>> > 1) There are a lot of SSTables on disk for some tables (events, for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).
>>>>>>>> >
>>>>>>>> > 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to an almost 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.
>>>>>>>> >
>>>>>>>> > 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:
>>>>>>>> >
>>>>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes)
>>>>>>>> >
>>>>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes)
>>>>>>>> >
>>>>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes)
>>>>>>>> >
>>>>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes)
>>>>>>>> >
>>>>>>>> > As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB.
>>>>>>>> > However, the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach 1315 files for xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217.
>>>>>>>> >
>>>>>>>> > The total data set size is roughly the same, but the file number is x10, which means that you'll have a bunch of tiny files.
>>>>>>>> >
>>>>>>>> > I guess that upon reception of those files there will be a massive flush to disk, explaining the behaviour you're facing (the flush storm).
>>>>>>>> >
>>>>>>>> > I would suggest looking at nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 to check the total SSTable count for each table, to confirm this intuition.
>>>>>>>> >
>>>>>>>> > Regards
>>>>>>>> >
>>>>>>>> > On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> I've emailed you a raw log file of an instance of this happening.
>>>>>>>> >>
>>>>>>>> >> I've been monitoring more closely the timing of events in tpstats and the logs, and I believe this is what is happening:
>>>>>>>> >>
>>>>>>>> >> - For some reason, C* decides to provoke a flush storm (I say "some reason"; I'm sure there is one, but I have had difficulty determining the behaviour changes between 1.* and more recent releases).
>>>>>>>> >> - So we see ~3000 flushes being enqueued.
>>>>>>>> >> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring.
>>>>>>>> >> - The last tpstats output that comes back before the node goes down indicates 20 active and 3000 pending, and the rest 0. It's by far the anomalous activity.
>>>>>>>> >>
>>>>>>>> >> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.
>>>>>>>> >>
>>>>>>>> >> Suggestions? Comments?
>>>>>>>> >>
>>>>>>>> >> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hello Maxime
>>>>>>>> >>>
>>>>>>>> >>> Can you put the complete logs and config somewhere? It would be interesting to know what the cause of the OOM is.
>>>>>>>> >>>
>>>>>>>> >>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.
>>>>>>>> >>>>
>>>>>>>> >>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.
>>>>>>>> >>>>
>>>>>>>> >>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>> Since no one else has stepped in...
>>>>>>>> >>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...
>>>>>>>> >>>>>
>>>>>>>> >>>>> And I ran into problems such as you observe...
>>>>>>>> >>>>>
>>>>>>>> >>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting to the default configuration, esp. for heap, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.
>>>>>>>> >>>>>
>>>>>>>> >>>>> However, we are running 2.0.x, not 2.1.0, so YMMV.
>>>>>>>> >>>>>
>>>>>>>> >>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.
>>>>>>>> >>>>>
>>>>>>>> >>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>>>>>>>> >>>>>
>>>>>>>> >>>>> ml
>>>>>>>> >>>>>
>>>>>>>> >>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> I started by adding a node similar to my current configuration, 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> leading to:
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>>>>>>>> >>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>>>>>>>> >>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Thinking it had to do with either compaction somehow or streaming, two activities I've had tremendous issues with in the past, I tried to slow down the setstreamthroughput to extremely low values, all the way down to 5. I also tried setting setcompactionthroughput to 0, and then, reading that in some cases it might be too fast, down to 8. Nothing worked; it merely vaguely changed the mean time to OOM, but not in a way indicating either was anywhere near a solution.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.
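The throttling mentioned two paragraphs up comes down to a couple of nodetool calls; a rough sketch with the values described there (the keyspace name in the last command is a placeholder):

    # Roughly the throttling described above (values in MB/s; 0 removes the
    # compaction limit entirely).
    nodetool setstreamthroughput 5        # slow down inter-node streaming
    nodetool setcompactionthroughput 8    # throttle compaction I/O
    # To check how many SSTables each table ends up with ('mykeyspace' is a
    # placeholder):
    nodetool cfstats mykeyspace | grep -i 'sstable count'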
>>>>>>>> >>>>>> After doing some exploration (I am considering writing some Cassandra ops documentation with lessons learned, since there seems to be little of it available in organized fashion), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8 GB + 4 core instance, which was anecdotally better.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> To my complete shock, the exact same issues are present, even after raising the heap to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes. About 10 CFs of varying sizes. Runtime writes are between 300 and 900 per second. Cassandra 2.1.0, nothing too wild.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine. From everything I read, Cassandra operations go very well on large (16 GB + 8 core) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines. Perhaps I can learn from others' experience?
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Full logs can be provided to anyone interested.
>>>>>>>> >>>>>>
>>>>>>>> >>>>>> Cheers
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jon Haddad
>>>>>>>> http://www.rustyrazorblade.com
>>>>>>>> twitter: rustyrazorblade