I was able to get metrics, but nothing stands out. When the applications
start up and a table is dropped, a subsequent write shortly thereafter fails
with a NoHostAvailableException that is caused by an
OperationTimedOutException. I am not 100% certain which write the
timeout occurs on because there are multiple apps running, but it happens
fairly consistently almost immediately after the table is dropped. I don't
see any indication of a server-side timeout or any dropped mutations
reported in the log.
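
In case it helps anyone reproduce this, below is roughly the client-side check
I am running with the DataStax Java driver 3.x. It is only a sketch: the
contact point, keyspace, and table names are placeholders, and raising the
socket read timeout is just an experiment to see whether the write eventually
succeeds, not a fix.

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.exceptions.NoHostAvailableException;

    public class DropThenWriteCheck {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")           // placeholder contact point
                    .withSocketOptions(new SocketOptions()
                            .setReadTimeoutMillis(30000))   // driver default is 12000 ms
                    .build();
                 Session session = cluster.connect("my_ks")) {  // placeholder keyspace

                // the schema change that precedes the failing writes
                session.execute("DROP TABLE IF EXISTS old_table");

                try {
                    // hypothetical write issued right after the schema change
                    session.execute("INSERT INTO my_table (id, val) VALUES (1, 'x')");
                } catch (NoHostAvailableException e) {
                    // print the per-host cause; in our case it is an OperationTimedOutException
                    e.getErrors().forEach((host, error) ->
                            System.out.println(host + " -> " + error));
                }
            }
        }
    }

Dumping the per-host errors from NoHostAvailableException.getErrors() is just
the easiest way I know to see which node timed out and with what.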

On Tue, Sep 20, 2016 at 11:07 PM, John Sanda <john.sa...@gmail.com> wrote:

> Thanks Nate. We do not have monitoring set up yet, but I should be able to
> get the deployment updated with a metrics reporter. I'll update the thread
> with my findings.
>
> On Tue, Sep 20, 2016 at 10:30 PM, Nate McCall <n...@thelastpickle.com>
> wrote:
>
>> If you can get to them in the test env. you want to look in
>> o.a.c.metrics.CommitLog for:
>> - TotalCommitLogSize: if this hovers near commitlog_total_space_in_mb and
>> never goes down, you are thrashing on segment allocation
>> - WaitingOnCommit: this is the time spent waiting on calls to sync and
>> will start to climb real fast if you can't sync within the sync interval
>> - WaitingOnSegmentAllocation: how long it took to allocate a new
>> commitlog segment; if it is all over the place, it is IO bound
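>>
>> If you don't have a reporter wired up yet, something like this is one way
>> to spot-check those three values over JMX. Rough sketch only: it assumes
>> the default JMX port (7199), no JMX auth, and the usual
>> org.apache.cassandra.metrics naming for the CommitLog mbeans.
>>
>>     import javax.management.MBeanServerConnection;
>>     import javax.management.ObjectName;
>>     import javax.management.remote.JMXConnector;
>>     import javax.management.remote.JMXConnectorFactory;
>>     import javax.management.remote.JMXServiceURL;
>>
>>     public class CommitLogMetricsProbe {
>>         public static void main(String[] args) throws Exception {
>>             JMXServiceURL url = new JMXServiceURL(
>>                     "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>>             JMXConnector jmxc = JMXConnectorFactory.connect(url);
>>             try {
>>                 MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
>>
>>                 // TotalCommitLogSize is a gauge, so read its Value attribute
>>                 ObjectName size = new ObjectName(
>>                         "org.apache.cassandra.metrics:type=CommitLog,name=TotalCommitLogSize");
>>                 System.out.println("TotalCommitLogSize = " + mbs.getAttribute(size, "Value"));
>>
>>                 // WaitingOnCommit and WaitingOnSegmentAllocation are timers;
>>                 // Mean is enough for a quick look
>>                 for (String name : new String[] {"WaitingOnCommit", "WaitingOnSegmentAllocation"}) {
>>                     ObjectName timer = new ObjectName(
>>                             "org.apache.cassandra.metrics:type=CommitLog,name=" + name);
>>                     System.out.println(name + " mean = " + mbs.getAttribute(timer, "Mean"));
>>                 }
>>             } finally {
>>                 jmxc.close();
>>             }
>>         }
>>     }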
>>
>> Try turning all the commit log settings way down for low-IO test
>> infrastructure like this. Maybe total commit log size of like 32mb with 4mb
>> segments (or even lower depending on test data volume) so they basically
>> flush constantly and don't try to hold any tables open. Also lower
>> concurrent_writes substantially while you are at it to add some write
>> throttling.
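>>
>> Concretely, that would be something like the following in cassandra.yaml.
>> Treat the numbers as rough guesses to tune against your test data volume
>> rather than recommendations:
>>
>>     commitlog_total_space_in_mb: 32    # tiny total so segments recycle constantly
>>     commitlog_segment_size_in_mb: 4    # small segments flush dirty tables sooner
>>     concurrent_writes: 8               # default is 32; pick something well below it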
>>
>> On Wed, Sep 21, 2016 at 2:14 PM, John Sanda <john.sa...@gmail.com> wrote:
>>
>>> I have seen in various threads on the list that 3.0.x is probably best
>>> for prod. Just wondering though if there is anything in particular in 3.7
>>> to be wary of.
>>>
>>> I need to check with one of our QA engineers to get specifics on the
>>> storage. Here is what I do know. We have a blade center running lots of
>>> virtual machines for various testing. Some of those vm's are running
>>> Cassandra and the Java web apps I previously mentioned via docker
>>> containers. The storage is shared. Beyond that I don't have any more
>>> specific details at the moment. I can also tell you that the storage can be
>>> quite slow.
>>>
>>> I have come across different threads that talk to one degree or another
>>> about the flush queue getting full. I have been looking at the code in
>>> ColumnFamilyStore.java. Is perDiskFlushExecutors the thread pool I should
>>> be interested in? It uses an unbounded queue, so I am not really sure what
>>> it means for it to get full. Is there anything I can check or look for to
>>> see if writes are getting blocked?
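>>>
>>> The closest thing I have found to check so far is nodetool tpstats, e.g.:
>>>
>>>     nodetool -h <cassandra-host> tpstats
>>>
>>> My (possibly wrong) assumption is that non-zero values in the Blocked or
>>> "All time blocked" columns for the MemtableFlushWriter pool would mean
>>> flushes, and eventually writes, are backing up. Is that the right place
>>> to look?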
>>>
>>> On Tue, Sep 20, 2016 at 8:41 PM, Jonathan Haddad <j...@jonhaddad.com>
>>> wrote:
>>>
>>>> If you haven't yet deployed to prod I strongly recommend *not* using
>>>> 3.7.
>>>>
>>>> What network storage are you using?  Outside of a handful of highly
>>>> experienced experts using EBS in very specific ways, it usually ends in
>>>> failure.
>>>>
>>>> On Tue, Sep 20, 2016 at 3:30 PM John Sanda <john.sa...@gmail.com>
>>>> wrote:
>>>>
>>>>> I am deploying multiple Java web apps that connect to a Cassandra 3.7
>>>>> instance. Each app creates its own schema at start up. One of the schema
>>>>> changes involves dropping a table. I am seeing frequent client-side
>>>>> timeouts reported by the DataStax driver after the DROP TABLE statement is
>>>>> executed. I don't see this behavior in all environments. I do see it
>>>>> consistently in a QA environment in which Cassandra is running in docker
>>>>> with network storage, so writes are pretty slow from the get-go. In my logs
>>>>> I see a lot of tables getting flushed, which I guess are all of the dirty
>>>>> column families in the respective commit log segment. Then I see a whole
>>>>> bunch of flushes getting queued up. Can I reach a point in which too many
>>>>> table flushes get queued such that writes would be blocked?
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> - John
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> - John
>>>
>>
>>
>>
>> --
>> -----------------
>> Nate McCall
>> Wellington, NZ
>> @zznate
>>
>> CTO
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
> --
>
> - John
>



-- 

- John
