Thank you very much for your reply. This is a deeper interpretation of the logs than I can manage myself at the moment.
Regarding 2), it's a good assumption on your part, but in this case, non-obviously, the loc table's primary key is actually not id; the schema changed historically, which has led to the odd naming of that field.

What you are describing makes me think it may be related to an odd state left behind by an experiment I ran a few days ago. I switched all tables from SizeTiered to Leveled compaction strategy (in an attempt to make better use of the limited disk space on the machines; compaction was starting to push nodes out of disk space). Afterwards I reverted a few of the more write-heavy tables to SizeTiered. The whole experiment seemed shaky. Should doing a major compaction on those nodes lead to a restructuring of the SSTables? I would think so. For reference, I've also appended at the very bottom of this mail, below the quoted thread, some sketches of the commands and settings discussed earlier in the thread (SSTable counts, flush writers, heap, throughput throttles).
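In case it helps, this is roughly what the experiment looked like; the keyspace name is a placeholder and events is just one of the tables I touched, so treat it as a sketch rather than my exact statements:

    ALTER TABLE mykeyspace.events
      WITH compaction = {'class': 'LeveledCompactionStrategy'};

    -- later reverted on the more write-heavy tables:
    ALTER TABLE mykeyspace.events
      WITH compaction = {'class': 'SizeTieredCompactionStrategy'};

and the major compaction I am considering running, node by node, would simply be:

    nodetool compact mykeyspace events

(I can also run DESCRIBE TABLE mykeyspace.loc; in cqlsh and share the actual index definition for loc_id_idx if that helps.)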
On Sunday, October 26, 2014, DuyHai Doan <doanduy...@gmail.com> wrote:

> Hello Maxime

> Increasing the flush writers won't help if your disk I/O is not keeping up.

> I've had a look into the log file; below are some remarks:

> 1) There are a lot of SSTables on disk for some tables (events for example, but not only). I've seen that some compactions are taking up to 32 SSTables (which corresponds to the default max value for SizeTiered compaction).

> 2) There is a secondary index that I found suspicious: loc.loc_id_idx. As its name implies, I have the impression that it's an index on the id of the loc, which would lead to almost a 1-1 relationship between the indexed value and the original loc. Such indexes should be avoided because they do not perform well. If it's not an index on the loc_id, please disregard my remark.

> 3) There is a clear imbalance of SSTable count on some nodes. In the log, I saw:

> INFO [STREAM-IN-/xxxx.xxxx.xxxx.20] 2014-10-25 02:21:43,360 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving *163* files(*4 111 187 195* bytes), sending 0 files(0 bytes)

> INFO [STREAM-IN-/xxxx.xxxx.xxxx.81] 2014-10-25 02:21:46,121 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving *154* files(*3 332 779 920* bytes), sending 0 files(0 bytes)

> INFO [STREAM-IN-/xxxx.xxxx.xxxx.71] 2014-10-25 02:21:50,494 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving *1315* files(*4 606 316 933* bytes), sending 0 files(0 bytes)

> INFO [STREAM-IN-/xxxx.xxxx.xxxx.217] 2014-10-25 02:21:51,036 StreamResultFuture.java:166 - [Stream #a6e54ea0-5bed-11e4-8df5-f357715e1a79 ID#0] Prepare completed. Receiving *1640* files(*3 208 023 573* bytes), sending 0 files(0 bytes)

> As you can see, the existing 4 nodes are streaming data to the new node, and on average the data set size is about 3.3 - 4.5 GB. However, the number of SSTables is around 150 files for nodes xxxx.xxxx.xxxx.20 and xxxx.xxxx.xxxx.81, but goes through the roof to reach *1315* files for xxxx.xxxx.xxxx.71 and *1640* files for xxxx.xxxx.xxxx.217.

> The total data set size is roughly the same but the file count is x10, which means that you'll have a bunch of tiny files.

> I guess that upon reception of those files, there will be a massive flush to disk, explaining the behaviour you're facing (a flush storm).

> I would suggest looking at nodes xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 to check the total SSTable count for each table, to confirm this intuition.

> Regards

> On Sun, Oct 26, 2014 at 4:58 PM, Maxime <maxim...@gmail.com> wrote:

>> I've emailed you a raw log file of an instance of this happening.

>> I've been monitoring the timing of events in tpstats and the logs more closely, and I believe this is what is happening:

>> - For some reason, C* decides to provoke a flush storm (I say "some reason"; I'm sure there is one, but I have had difficulty determining how the behaviour changed between 1.* and more recent releases).
>> - So we see ~3000 flushes being enqueued.
>> - This happens so suddenly that even boosting the number of flush writers to 20 does not suffice. I don't even see "all time blocked" numbers for it before C* stops responding. I suspect this is due to the sudden OOM and GC occurring.
>> - The last tpstats that comes back before the node goes down indicates 20 active, 3000 pending, and the rest 0. It's by far the anomalous activity.

>> Is there a way to throttle down this generation of flushes? C* complains if I set the queue_size to any value (deprecated now?), and boosting the threads does not seem to help, since even at 20 we're an order of magnitude off.

>> Suggestions? Comments?

>> On Sun, Oct 26, 2014 at 2:26 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

>>> Hello Maxime

>>> Can you put the complete logs and config somewhere? It would be interesting to know what is the cause of the OOM.

>>> On Sun, Oct 26, 2014 at 3:15 AM, Maxime <maxim...@gmail.com> wrote:

>>>> Thanks a lot, that is comforting. We are also small at the moment, so I can definitely relate to the idea of keeping things small and simple, at a level where it just works.

>>>> I see the new Apache version has a lot of fixes, so I will try to upgrade before I look into downgrading.

>>>> On Saturday, October 25, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:

>>>>> Since no one else has stepped in...

>>>>> We have run clusters with ridiculously small nodes - I have a production cluster in AWS with 4GB nodes, each with 1 CPU and disk-based instance storage. It works fine, but you can see those little puppies struggle...

>>>>> And I ran into problems such as you observe...

>>>>> Upgrading Java to the latest 1.7 and - most importantly - *reverting to the default configuration, esp. for heap*, seemed to settle things down completely. Also make sure that you are using the 'recommended production settings' from the docs on your boxen.

>>>>> However, we are running 2.0.x, not 2.1.0, so YMMV.

>>>>> And we are switching to 15GB nodes with 2 heftier CPUs each and SSD storage - still a 'small' machine, but much more reasonable for C*.

>>>>> However, I can't say I am an expert, since I deliberately keep things so simple that we do not encounter problems - it just works, so I dig into other stuff.
>>>>> ml

>>>>> On Sat, Oct 25, 2014 at 5:22 PM, Maxime <maxim...@gmail.com> wrote:

>>>>>> Hello, I've been trying to add a new node to my cluster (4 nodes) for a few days now.

>>>>>> I started by adding a node similar to my current configuration, 4 GB of RAM + 2 cores on DigitalOcean. However, every time I would end up getting OOM errors after many log entries of the type:

>>>>>> INFO [SlabPoolCleaner] 2014-10-25 13:44:57,240 ColumnFamilyStore.java:856 - Enqueuing flush of mycf: 5383 (0%) on-heap, 0 (0%) off-heap

>>>>>> leading to:

>>>>>> ka-120-Data.db (39291 bytes) for commitlog position ReplayPosition(segmentId=1414243978538, position=23699418)
>>>>>> WARN [SharedPool-Worker-13] 2014-10-25 13:48:18,032 AbstractTracingAwareExecutorService.java:167 - Uncaught exception on thread Thread[SharedPool-Worker-13,5,main]: {}
>>>>>> java.lang.OutOfMemoryError: Java heap space

>>>>>> Thinking it had to do with either compaction or streaming (two activities I've had tremendous issues with in the past), I tried slowing down setstreamthroughput to extremely low values, all the way to 5. I also tried setting setcompactionthroughput to 0, and then, after reading that in some cases it might be too fast, down to 8. Nothing worked; it merely vaguely changed the mean time to OOM, but not in a way indicating either was anywhere near a solution.

>>>>>> The nodes were configured with 2 GB of heap initially; I tried to crank it up to 3 GB, stressing the host memory to its limit.

>>>>>> After doing some exploration (I am considering writing a Cassandra Ops documentation with lessons learned, since there seems to be little of it in organized fashion), I read that some people had strange issues on lower-end boxes like that, so I bit the bullet and upgraded my new node to an 8 GB + 4 core instance, which was anecdotally better.

>>>>>> To my complete shock, the exact same issues are present, even after raising the heap memory to 6 GB. I figure it can't be a "normal" situation anymore, but must be a bug somehow.

>>>>>> My cluster is 4 nodes, RF of 2, about 160 GB of data across all nodes. About 10 CFs of varying sizes. Runtime writes are between 300 and 900 / second. Cassandra 2.1.0, nothing too wild.

>>>>>> Has anyone encountered these kinds of issues before? I would really enjoy hearing about the experiences of people trying to run small-sized clusters like mine. From everything I read, Cassandra operations go very well on large (16 GB + 8 core) machines, but I'm sad to report I've had nothing but trouble trying to run on smaller machines. Perhaps I can learn from others' experience?

>>>>>> Full logs can be provided to anyone interested.

>>>>>> Cheers
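P.S. As mentioned at the top, here are sketches of the commands and settings discussed in the thread; names and values are illustrative placeholders rather than copied from the cluster.

To check the per-table SSTable count on xxxx.xxxx.xxxx.71 and xxxx.xxxx.xxxx.217 as you suggest, I plan to run, per table (mykeyspace is a placeholder):

    nodetool cfstats mykeyspace.events | grep -i 'SSTable count'

and, as a cross-check against the data directory (assuming the default layout), count the Data.db components directly:

    ls /var/lib/cassandra/data/mykeyspace/events-*/ | grep -c -- '-Data.db'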
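On the flush storm question, the knobs I have been experimenting with live in cassandra.yaml; this is roughly my current state (the values are my experiments, not recommendations, and the commented lines are the 2.1 settings I understand to be relevant):

    # cassandra.yaml
    memtable_flush_writers: 20          # boosted from the default; did not help
    # memtable_heap_space_in_mb: 2048   # if unset, 2.1 uses 1/4 of the heap
    # memtable_cleanup_threshold: 0.11  # if unset, 1 / (memtable_flush_writers + 1)

(The queue_size I referred to is memtable_flush_queue_size, the 1.x/2.0-era setting that 2.1 complains about when I set it, hence my question about whether it is deprecated.)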
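On Michael's advice to revert to the default configuration for heap: as I understand it, concretely that means leaving the overrides in conf/cassandra-env.sh commented out so the script calculates them itself, e.g.:

    # conf/cassandra-env.sh
    # leave both commented out; cassandra-env.sh then derives the heap from
    # system memory (roughly max(min(1/2 RAM, 1 GB), min(1/4 RAM, 8 GB)))
    #MAX_HEAP_SIZE="4G"
    #HEAP_NEWSIZE="800M"

Please correct me if that is not what was meant.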
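And for completeness, the throttling I described trying on the new node was along these lines (the numbers are the values from my message above, attempts rather than recommendations):

    nodetool setstreamthroughput 5      # Mbit/s; the 2.1 default is 200
    nodetool setcompactionthroughput 0  # 0 disables the throttle entirely...
    nodetool setcompactionthroughput 8  # ...then throttled down to 8 MB/s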