Doan, thanks for the tip. I read about it just this morning and am now waiting for the new version to show up in the DataStax Debian repo.
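For when 2.1.1 does land, the switch being discussed is a single schema change per time-series table. A minimal sketch only; the keyspace/table name is a placeholder, not one from this thread:

    # Sketch: move a time-series, TTL-heavy table to DateTieredCompactionStrategy
    # once the cluster is on 2.1.1. "mykeyspace.events" is a placeholder name.
    cqlsh -e "ALTER TABLE mykeyspace.events WITH compaction = {'class': 'DateTieredCompactionStrategy'};"

Existing SSTables are reorganized only gradually, as the new strategy compacts them.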
Michael, I do believe you are correct about the general running of the cluster, and I've reset everything, so it took me a while to reply. I finally got the SSTables down, as seen in the OpsCenter graphs. I'm stumped, however, because when I bootstrap the new node I still see a very large number of files being streamed (~1500 for some nodes), and the bootstrap process is failing exactly as it did before, in a flurry of "Enqueuing flush of ..." Any ideas? I'm reaching the end of what I know I can do; OpsCenter says around 32 SSTables per CF, but it is still streaming tons of "files". :-/

On Mon, Oct 27, 2014 at 1:12 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily."
>
> --> You can try the new DateTiered compaction strategy (https://issues.apache.org/jira/browse/CASSANDRA-6602), released in 2.1.1, to eliminate tombstones if you have a time-series data model.
>
> On Mon, Oct 27, 2014 at 5:47 PM, Laing, Michael <michael.la...@nytimes.com> wrote:
>
>> Again, from our experience w 2.0.x:
>>
>> Revert to the defaults - you are manually setting heap way too high IMHO.
>>
>> On our small nodes we tried LCS - way too much compaction - so we switched all CFs to STCS.
>>
>> We do a major rolling compaction on our small nodes weekly during less busy hours - works great. Be sure you have enough disk.
>>
>> We never explicitly delete and only use TTLs or truncation. You can set GC grace to 0 in that case, so tombstones are more readily expunged. There are a couple of threads in the list that discuss this... also normal rolling repair becomes optional, reducing load (still repair if something unusual happens tho...).
>>
>> In your current situation, you need to kickstart compaction - are there any CFs you can truncate, at least temporarily? Then try compacting a small CF, then another, etc.
>>
>> Hopefully you can get enough headroom to add a node.
>>
>> ml
>>
>> On Sun, Oct 26, 2014 at 6:24 PM, Maxime <maxim...@gmail.com> wrote:
>>
>>> Hmm, thanks for the reading.
>>>
>>> I initially followed some (perhaps too old) maintenance scripts, which included a weekly 'nodetool compact'. Is there a way for me to undo the damage? Tombstones will be a very important issue for me since the dataset is very much a rolling dataset using TTLs heavily.
>>>
>>> On Sun, Oct 26, 2014 at 6:04 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>
>>>> "Should doing a major compaction on those nodes lead to a restructuring of the SSTables?" --> Beware of major compaction on SizeTiered: it will create 2 giant SSTables, and the expired/outdated/tombstoned columns in this big file will never be cleaned, since the SSTable will never get a chance to be compacted again.
>>>>
>>>> Essentially, to reduce the fragmentation into small SSTables you can stay with SizeTiered compaction and play around with the compaction properties (the thresholds) to make C* group a bunch of files each time it compacts, so that the file count shrinks to a reasonable number.
>>>>
>>>> Since you're using C* 2.1 and anti-compaction has been introduced, I hesitate to advise you to use Leveled compaction as a work-around to reduce the SSTable count.
>>>>
>>>> Things are a little bit more complicated because of the incremental repair process (I don't know whether you're using incremental repair or not in production).
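As a concrete illustration of the two knobs discussed above (the STCS thresholds, and GC grace for tables that only ever expire data via TTL or truncation), here is a rough sketch; the keyspace/table name and the values shown are illustrative, not recommendations from the thread:

    # Sketch: tighten the SizeTiered thresholds so small SSTables get grouped sooner.
    # min_threshold/max_threshold are the STCS bucket limits (defaults 4 and 32);
    # lowering min_threshold merges small files earlier, at the cost of more compaction I/O.
    cqlsh -e "ALTER TABLE mykeyspace.events
              WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                                 'min_threshold': 2,
                                 'max_threshold': 32};"

    # Sketch: for a table with no explicit deletes (TTL/truncate only),
    # gc_grace_seconds can be lowered so expired cells are purged sooner.
    cqlsh -e "ALTER TABLE mykeyspace.events WITH gc_grace_seconds = 0;"

Note that gc_grace_seconds = 0 is only safe under the "TTL or truncate only" discipline Michael describes; with explicit deletes it risks resurrecting data if a node misses a delete.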
>>>> The Dev blog says that Leveled compaction is performed only on repaired SSTables; the un-repaired ones still use SizeTiered. More details here: http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1
>>>>
>>>> Regards
>>>>
>>>> On Sun, Oct 26, 2014 at 9:44 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>>> If the issue is related to I/O, you're going to want to determine whether you're saturated. Take a look at `iostat -dmx 1`; you'll see avgqu-sz (queue size) and svctm (service time). The higher those numbers are, the more overwhelmed your disk is.
>>>>>
>>>>> On Sun, Oct 26, 2014 at 12:01 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>> > Hello Maxime
>>>>> >
>>>>> > Increasing the flush writers won't help if your disk I/O is not keeping up.
>>>>> >
>>>>> > I've had a look at the log file; below are some remarks:
>>>>> >
>>>>> > 1) There are a lot of SSTables on disk for some tables (events, for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).
>>>>> >
>>>>> > 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of loc, which would lead to an almost 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.
>>>>> >
>>>>> > 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:
>>>>> >
>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 163 files(4 111 187 195 bytes), sending 0 files(0 bytes)
>>>>> >
>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 154 files(3 332 779 920 bytes), sending 0 files(0 bytes)
>>>>> >
>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1315 files(4 606 316 933 bytes), sending 0 files(0 bytes)
>>>>> >
>>>>> > INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving 1640 files(3 208 023 573 bytes), sending 0 files(0 bytes)
>>>>> >
>>>>> > As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB per node. However, the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach 1315 files for xxxx.xxxx.xxxx.71 and 1640 files for xxxx.xxxx.xxxx.217.
>>>>> >
>>>>> > The total data set size is roughly the same, but the file count is roughly 10x higher, which means you'll have a bunch of tiny files.
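One quick way to confirm that picture on the suspect nodes is to pull per-table SSTable counts straight from nodetool. A sketch; the keyspace name is a placeholder, and the exact label wording varies slightly between versions (2.0 prints "Column Family:", 2.1 prints "Table:"):

    # Run on each node (e.g. the .71 and .217 hosts) to list per-table SSTable counts.
    # "mykeyspace" is a placeholder; omit it to dump stats for every keyspace.
    nodetool cfstats mykeyspace | grep -E "Table:|Column Family:|SSTable count"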
>>>>> > I guess that upon reception of those files there will be a massive flush to disk, explaining the behaviour you're facing (a flush storm).
>>>>> >
>>>>> > I would suggest looking at nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 and checking the total SSTable count for each table to confirm this intuition.
>>>>> >
>>>>> > Regards
>>>>> >
>>>>> > On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:
>>>>> >>
>>>>> >> I've emailed you a raw log file of an instance of this happening.
>>>>> >>
>>>>> >> I've been monitoring more closely the timing of events in tpstats and the logs, and I believe this is what is happening:
>>>>> >>
>>>>> >> - For some reason, C* decides to provoke a flush storm (I say "some reason" because I'm sure there is one, but I have had difficulty determining how the behaviour changed between 1.* and more recent releases).
>>>>> >> - So we see ~3000 flushes being enqueued.
>>>>> >> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring.
>>>>> >> - The last tpstats output that comes back before the node goes down indicates 20 active and 3000 pending, and the rest 0. It's by far the most anomalous activity.
>>>>> >>
>>>>> >> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.
>>>>> >>
>>>>> >> Suggestions? Comments?
>>>>> >>
>>>>> >> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> Hello Maxime
>>>>> >>>
>>>>> >>> Can you put the complete logs and config somewhere? It would be interesting to know what the cause of the OOM is.
>>>>> >>>
>>>>> >>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.
>>>>> >>>>
>>>>> >>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.
>>>>> >>>>
>>>>> >>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:
>>>>> >>>>>
>>>>> >>>>> Since no one else has stepped in...
>>>>> >>>>>
>>>>> >>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...
>>>>> >>>>>
>>>>> >>>>> And I ran into problems such as you observe...
>>>>> >>>>>
>>>>> >>>>> Upgrading Java to the latest 1.7 and - most importantly - reverting to the default configuration, esp. for heap, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.
>>>>> >>>>>
>>>>> >>>>> However we are running 2.0.x, not 2.1.0, so YMMV.
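For anyone following along, "reverting to the defaults, esp. for heap" comes down to leaving the heap overrides alone in cassandra-env.sh. A sketch; the values shown are the examples shipped commented-out in that file, not recommendations:

    # In conf/cassandra-env.sh, leave both overrides commented out so the script
    # auto-sizes the heap from system RAM (roughly 1/4 of RAM, capped) instead of
    # pinning a large heap by hand:
    #MAX_HEAP_SIZE="4G"
    #HEAP_NEWSIZE="800M"

    # Then restart the node and confirm the heap actually in use:
    nodetool info | grep -i heap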
>>>>> >>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.
>>>>> >>>>>
>>>>> >>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>>>>> >>>>>
>>>>> >>>>> ml
>>>>> >>>>>
>>>>> >>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.
>>>>> >>>>>>
>>>>> >>>>>> I started by adding a node similar to my current configuration: 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:
>>>>> >>>>>>
>>>>> >>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap
>>>>> >>>>>>
>>>>> >>>>>> leading to:
>>>>> >>>>>>
>>>>> >>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>>>>> >>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>>>>> >>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>> >>>>>>
>>>>> >>>>>> Thinking it had to do with either compaction or streaming - two activities I've had tremendous issues with in the past - I tried slowing down setstreamthroughput to extremely low values, all the way to 5. I also tried setting setcompactionthroughput to 0, and then, after reading that in some cases it might be too fast, down to 8. Nothing worked; it merely changed the mean time to OOM slightly, but not in a way indicating that either was anywhere near a solution.
>>>>> >>>>>>
>>>>> >>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.
>>>>> >>>>>>
>>>>> >>>>>> After doing some exploration (I am considering writing up some Cassandra ops documentation with lessons learned, since there seems to be little of it in organized form), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8 GB + 4-core instance, which was anecdotally better.
>>>>> >>>>>>
>>>>> >>>>>> To my complete shock, the exact same issues are present, even after raising the heap to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.
>>>>> >>>>>>
>>>>> >>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes. About 10 CFs of varying sizes. Runtime writes are between 300 and 900 per second. Cassandra 2.1.0, nothing too wild.
>>>>> >>>>>>
>>>>> >>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine.
>>>>> >>>>>> From everything I read, Cassandra operations go very well on large (16 GB + 8 cores) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines; perhaps I can learn from others' experience?
>>>>> >>>>>>
>>>>> >>>>>> Full logs can be provided to anyone interested.
>>>>> >>>>>>
>>>>> >>>>>> Cheers
>>>>>
>>>>> --
>>>>> Jon Haddad
>>>>> http://www.rustyrazorblade.com
>>>>> twitter: rustyrazorblade
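For readers hitting the same wall, the two throttles and the thread-pool check mentioned throughout the thread look roughly like this. A sketch only; the numbers are the ones tried above, not tuned recommendations:

    # Throttles Maxime mentions trying (values as reported in the thread):
    nodetool setstreamthroughput 5        # inter-node streaming cap, in megabits/s
    nodetool setcompactionthroughput 8    # compaction cap, in MB/s; 0 removes the throttle

    # Watch flush pressure during the bootstrap; a steadily growing "Pending"
    # column on the flush pools is the flush storm described above.
    watch -n 5 'nodetool tpstats | grep -iE "pool name|flush"'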