I was able to get metrics, but nothing stands out. When the applications start up and a table is dropped, a subsequent write shortly thereafter fails with a NoHostAvailableException caused by an OperationTimedOutException. I am not 100% certain which write the timeout occurs on because there are multiple apps running, but it happens fairly consistently, almost immediately after the table is dropped. I don't see any indication of a server-side timeout or any dropped mutations reported in the log.
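To pin down which statement and host the timeout comes from, something like the sketch below should help (DataStax Java driver 3.x; the wrapper class and method names are just placeholders, not anything from our apps). NoHostAvailableException keeps the last error the driver saw per host, so logging those should show whether the root cause really is an OperationTimedOutException and on which write:

import java.net.InetSocketAddress;
import java.util.Map;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

public class TimeoutLoggingExecutor {

    // Executes a statement and, if no host could satisfy it, logs the last
    // error the driver recorded for each host before rethrowing. Class and
    // method names are placeholders for illustration only.
    public static ResultSet executeLogged(Session session, Statement statement) {
        try {
            return session.execute(statement);
        } catch (NoHostAvailableException e) {
            for (Map.Entry<InetSocketAddress, Throwable> error : e.getErrors().entrySet()) {
                System.err.println("Host " + error.getKey() + " failed statement "
                        + statement + ": " + error.getValue());
            }
            throw e;
        }
    }
}

If the per-host error is consistently an OperationTimedOutException right after the DROP TABLE, that would at least confirm it is the same pattern in every app rather than something specific to one of them.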
On Tue, Sep 20, 2016 at 11:07 PM, John Sanda <john.sa...@gmail.com> wrote:

> Thanks Nate. We do not have monitoring set up yet, but I should be able to
> get the deployment updated with a metrics reporter. I'll update the thread
> with my findings.
>
> On Tue, Sep 20, 2016 at 10:30 PM, Nate McCall <n...@thelastpickle.com> wrote:
>
>> If you can get to them in the test env. you want to look in
>> o.a.c.metrics.CommitLog for:
>> - TotalCommitlogSize: if this hovers near commitlog_total_space_in_mb and
>> never goes down, you are thrashing on segment allocation
>> - WaitingOnCommit: this is the time spent waiting on calls to sync and
>> will start to climb real fast if you can't sync within sync_interval
>> - WaitingOnSegmentAllocation: how long it took to allocate a new
>> commitlog segment; if it is all over the place, it is IO bound
>>
>> Try turning all the commit log settings way down for low-IO test
>> infrastructure like this. Maybe a total commit log size of around 32 MB
>> with 4 MB segments (or even lower, depending on test data volume) so they
>> basically flush constantly and don't try to hold any tables open. Also
>> lower concurrent_writes substantially while you are at it to add some
>> write throttling.
>>
>> On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <john.sa...@gmail.com> wrote:
>>
>>> I have seen in various threads on the list that 3.0.x is probably best
>>> for prod. Just wondering, though, if there is anything in particular in
>>> 3.7 to be wary of.
>>>
>>> I need to check with one of our QA engineers to get specifics on the
>>> storage. Here is what I do know. We have a blade center running lots of
>>> virtual machines for various testing. Some of those VMs are running
>>> Cassandra and the Java web apps I previously mentioned via Docker
>>> containers. The storage is shared. Beyond that I don't have any more
>>> specific details at the moment. I can also tell you that the storage can
>>> be quite slow.
>>>
>>> I have come across different threads that talk to one degree or another
>>> about the flush queue getting full. I have been looking at the code in
>>> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
>>> be interested in? It uses an unbounded queue, so I am not really sure
>>> what it means for it to get full. Is there anything I can check or look
>>> for to see if writes are getting blocked?
>>>
>>> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> If you haven't yet deployed to prod I strongly recommend *not* using
>>>> 3.7.
>>>>
>>>> What network storage are you using? Outside of a handful of highly
>>>> experienced experts using EBS in very specific ways, it usually ends in
>>>> failure.
>>>>
>>>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <john.sa...@gmail.com> wrote:
>>>>
>>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>>> instance. Each app creates its own schema at start up. One of the
>>>>> schema changes involves dropping a table. I am seeing frequent
>>>>> client-side timeouts reported by the DataStax driver after the DROP
>>>>> TABLE statement is executed. I don't see this behavior in all
>>>>> environments. I do see it consistently in a QA environment in which
>>>>> Cassandra is running in Docker with network storage, so writes are
>>>>> pretty slow from the get-go. In my logs I see a lot of tables getting
>>>>> flushed, which I guess are all of the dirty column families in the
>>>>> respective commit log segment.
>>>>> Then I saw a whole bunch of flushes getting queued up. Can I reach a
>>>>> point at which so many table flushes get queued that writes would be
>>>>> blocked?
>>>>>
>>>>> --
>>>>>
>>>>> - John
>>>>
>>>
>>> --
>>>
>>> - John
>>
>> --
>> -----------------
>> Nate McCall
>> Wellington, NZ
>> @zznate
>>
>> CTO
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>
> --
>
> - John

--

- John
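PS: For anyone who wants to spot-check the same CommitLog metrics Nate mentioned without wiring up a full metrics reporter, a minimal JMX sketch along these lines should work. The MBean names are my reading of o.a.c.metrics.CommitLog and the default JMX port 7199 is assumed; verify both in jconsole if they don't resolve.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CommitLogMetricsProbe {

    public static void main(String[] args) throws Exception {
        // Assumes Cassandra's default JMX port (7199) with no authentication;
        // adjust the host, port, and credentials for your deployment.
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // MBean names below are my best reading of o.a.c.metrics.CommitLog;
            // confirm the exact names in jconsole if lookups fail.
            ObjectName totalSize = new ObjectName(
                    "org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize");
            ObjectName waitingOnCommit = new ObjectName(
                    "org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnCommit");
            ObjectName waitingOnAllocation = new ObjectName(
                    "org.apache.cassandra.metrics:type=CommitLog,name=WaitingOnSegmentAllocation");

            // TotalCommitLogSize is a gauge (bytes); the other two are timers,
            // so read a latency percentile and the max.
            System.out.println("TotalCommitLogSize: "
                    + mbeans.getAttribute(totalSize, "Value"));
            System.out.println("WaitingOnCommit 99th percentile: "
                    + mbeans.getAttribute(waitingOnCommit, "99thPercentile"));
            System.out.println("WaitingOnSegmentAllocation max: "
                    + mbeans.getAttribute(waitingOnAllocation, "Max"));
        } finally {
            connector.close();
        }
    }
}

If TotalCommitLogSize sits near the configured ceiling, or the two timers climb right after the DROP TABLE, that would line up with Nate's theory that the commit log is thrashing on slow storage.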