It's great that you found a specific release that triggers the problem -
the 2.1.x line has a huge number of changes.

How many partitions and rows do you have? What's the largest row count for
a single partition? And all of these CQL tables are COMPACT STORAGE,
correct? Are you writing a large number of skinny partitions or a smaller
number of very wide partitions? It wouldn't surprise me if behavior for
large partitions varies between releases since they can be so
memory-intensive.
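
If you don't have dashboards with those numbers handy, a rough way to check
(just a sketch - the keyspace and table names are placeholders) is nodetool's
per-table stats and histograms, e.g.:

  # per-table key counts and compacted partition min/mean/max sizes
  nodetool cfstats <your_keyspace> | grep -e 'Table:' -e 'keys' -e 'partition'
  # partition size and cell count percentiles for a single table
  nodetool cfhistograms <your_keyspace> <your_table>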

I see one change in 2.1.5 that could plausibly add some memory overhead:
Write partition size estimates into a system table (CASSANDRA-7688)
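
Since those estimates get persisted, on 2.1.5+ you could also eyeball what
Cassandra thinks your partition counts and mean sizes are straight from cqlsh
(a sketch - the keyspace name is a placeholder, and the numbers are reported
per token range, so you'd sum across ranges for totals):

  SELECT table_name, partitions_count, mean_partition_size
  FROM system.size_estimates WHERE keyspace_name = '<your_keyspace>';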

At this stage it would probably help for you to try to produce a reasonably
small repro test case that you could file as a Jira. And if you could run
that repro test case on 3.x to verify that the problem still exists, that
would be helpful as well.
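
One way to package that repro might be a cassandra-stress user profile that
models your schema and batch sizes - roughly something like this (a sketch of
the 2.1-era stress syntax; the profile yaml, counts and node address are all
placeholders):

  cassandra-stress user profile=<your_profile>.yaml ops\(insert=1\) \
      n=1000000 -rate threads=50 -node <node_ip>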

How long does it take to repro the timeout?

Can you repro the timeout using a single node?

What is the pattern of the timeouts - just random and occasional, or heavy
and continuous once they start?

Are they occurring uniformly on all three nodes?

If you bounce the cluster and continue testing, do the timeouts commence
immediately, fairly soon, or only after about as long as they take from a
clean fresh start?



-- Jack Krupansky

On Wed, Feb 24, 2016 at 7:04 PM, Mike Heffner <m...@librato.com> wrote:

> Nate,
>
> So we have run several install tests, bisecting the 2.1.x release line,
> and we believe that the regression was introduced in version 2.1.5. This is
> the first release that clearly hits the timeout for us.
>
> It looks like quite a large release, so our next step will likely be
> bisecting the major commits to see if we can narrow it down:
> https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt.
> Obviously, any suggestions on potential suspects are appreciated.
>
> These are the memtable settings we've changed from the defaults during
> our testing:
>
> memtable_allocation_type: offheap_objects
> memtable_flush_writers: 8
>
>
> Cheers,
>
> Mike
>
> On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com>
> wrote:
>
>> The biggest change which *might* explain your behavior has to do with the
>> changes in memtable flushing between 2.0 and 2.1:
>> https://issues.apache.org/jira/browse/CASSANDRA-5549
>>
>> However, the tpstats you posted shows no dropped mutations; seeing dropped
>> mutations would have made me more certain of this as the cause.
>>
>> What values do you have right now for each of these (my recommendations
>> for each on a c4.2xl with stock cassandra-env.sh are in parentheses; there
>> is a cassandra.yaml sketch of them after the list):
>>
>> - memtable_flush_writers (2)
>> - memtable_heap_space_in_mb (2048)
>> - memtable_offheap_space_in_mb (2048)
>> - memtable_cleanup_threshold (0.11)
>> - memtable_allocation_type (offheap_objects)
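>>
>> As a sketch, those recommended values as a cassandra.yaml fragment (adjust
>> to your heap and instance size):
>>
>> memtable_flush_writers: 2
>> memtable_heap_space_in_mb: 2048
>> memtable_offheap_space_in_mb: 2048
>> memtable_cleanup_threshold: 0.11
>> memtable_allocation_type: offheap_objects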
>>
>> The biggest win IMO will be moving to offheap_objects. By default,
>> everything is on heap. Regardless, spending some time tuning these for your
>> workload will pay off.
>>
>> You may also want to be explicit about
>>
>> - native_transport_max_concurrent_connections
>> - native_transport_max_concurrent_connections_per_ip
>>
>> Depending on the driver, these may now be allowing 32k streams per
>> connection(!) as detailed in v3 of the native protocol:
>>
>> https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
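>>
>> For example (illustrative values only, not a recommendation - size these
>> to your client topology):
>>
>> native_transport_max_concurrent_connections: 128
>> native_transport_max_concurrent_connections_per_ip: 8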
>>
>>
>>
>> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote:
>>
>>> Anuj,
>>>
>>> So we originally started testing with Java 8 + G1; however, we were able
>>> to reproduce the same results with the default CMS settings that ship in
>>> cassandra-env.sh from the Deb package. We didn't detect any large GC pauses
>>> during the runs.
>>>
>>> The query pattern during our testing was 100% writes, batching (mostly via
>>> Thrift) to 5 tables, at between 6 and 1500 rows per batch.
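>>>
>>> (For reference, each batch is shaped roughly like the following - keyspace,
>>> table and column names here are placeholders, not our real schema:)
>>>
>>> BEGIN UNLOGGED BATCH
>>>   INSERT INTO ks.table1 (key, ts, value) VALUES (?, ?, ?);
>>>   INSERT INTO ks.table2 (key, ts, value) VALUES (?, ?, ?);
>>>   -- ... up to ~1500 rows spread across the 5 tables
>>> APPLY BATCH;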
>>>
>>> Mike
>>>
>>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
>>> wrote:
>>>
>>>> What's the GC overhead? Can you share your GC collector and
>>>> settings?
>>>>
>>>>
>>>> What's your query pattern? Do you use secondary indexes, batches, IN
>>>> clauses, etc.?
>>>>
>>>>
>>>> Anuj
>>>>
>>>>
>>>>
>>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner
>>>> <m...@librato.com> wrote:
>>>> Alain,
>>>>
>>>> Thanks for the suggestions.
>>>>
>>>> Sure, tpstats are here:
>>>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the
>>>> metrics across the ring, there were no blocked tasks or dropped messages.
>>>>
>>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk.
>>>> Similarly, there are no long GC pauses.
>>>>
>>>> We haven't noticed latency on any particular table higher than the others,
>>>> or correlated with the occurrence of a timeout. We have noticed with
>>>> further testing that running cassandra-stress against the ring, while our
>>>> workload is writing to the same ring, will incur similar 10 second
>>>> timeouts. If our workload is not writing to the ring, cassandra-stress will
>>>> run without hitting timeouts. This seems to imply that our workload pattern
>>>> is causing something to block cluster-wide, since the stress tool writes to
>>>> a different keyspace than our workload.
>>>>
>>>> I mentioned in another reply that we've tracked it to something between
>>>> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was
>>>> introduced in.
>>>>
>>>> Cheers,
>>>>
>>>> Mike
>>>>
>>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> What about the output of tpstats? I imagine you have dropped messages
>>>>> there. Any blocked threads? Could you paste this output here?
>>>>>
>>>>> Could this be due to some network hiccup when accessing the disks, since
>>>>> they are EBS? Can you think of any way of checking this? Do you have a lot
>>>>> of GC entries in the logs, and how long are the pauses (use something
>>>>> like: grep -i 'GCInspector' /var/log/cassandra/system.log)?
>>>>>
>>>>> Something else you could check is the local write stats, to see if only
>>>>> one table is affected or if this is keyspace- / cluster-wide. You can use
>>>>> the metrics exposed by Cassandra, or if you have no dashboards I believe
>>>>> 'nodetool cfstats <myks> | grep -e 'Table:' -e 'Local'' should give you a
>>>>> rough idea of local latencies.
>>>>>
>>>>> Those are just things I would check; I don't have a clue what is
>>>>> happening here, but I hope this will help.
>>>>>
>>>>> C*heers,
>>>>> -----------------
>>>>> Alain Rodriguez
>>>>> France
>>>>>
>>>>> The Last Pickle
>>>>> http://www.thelastpickle.com
>>>>>
>>>>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>
>>>>>> Jaydeep,
>>>>>>
>>>>>> No, we don't use any lightweight transactions.
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <
>>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>>
>>>>>>> Are you guys using lightweight transactions in your write path?
>>>>>>>
>>>>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <
>>>>>>> fabrice.faco...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Are your commitlog and data on the same disk? If yes, you should put
>>>>>>>> the commitlog on a separate disk that doesn't have a lot of IO.
>>>>>>>>
>>>>>>>> Other IO may have a great impact on your commitlog writes and may
>>>>>>>> even block them.
>>>>>>>>
>>>>>>>> An example of the impact IO can have, even for async writes:
>>>>>>>>
>>>>>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>>>>
>>>>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>>>>> > Jeff,
>>>>>>>> >
>>>>>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>>>>>>>> >
>>>>>>>> > Mike
>>>>>>>> >
>>>>>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <
>>>>>>>> jeff.ji...@crowdstrike.com>
>>>>>>>> > wrote:
>>>>>>>> >>
>>>>>>>> >> What disk size are you using?
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> From: Mike Heffner
>>>>>>>> >> Reply-To: "user@cassandra.apache.org"
>>>>>>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>>>>>> >> To: "user@cassandra.apache.org"
>>>>>>>> >> Cc: Peter Norton
>>>>>>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>>>>>> >>
>>>>>>>> >> Paulo,
>>>>>>>> >>
>>>>>>>> >> Thanks for the suggestion; we ran some tests against CMS and saw
>>>>>>>> the same
>>>>>>>> >> timeouts. On that note though, we are going to try doubling the
>>>>>>>> instance
>>>>>>>> >> sizes and testing with double the heap (even though current
>>>>>>>> usage is low).
>>>>>>>> >>
>>>>>>>> >> Mike
>>>>>>>> >>
>>>>>>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <
>>>>>>>> pauloricard...@gmail.com>
>>>>>>>> >> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Are you using the same GC settings as the staging 2.0 cluster?
>>>>>>>> If not,
>>>>>>>> >>> could you try using the default GC settings (CMS) and see if
>>>>>>>> that changes
>>>>>>>> >>> anything? This is just a wild guess, but there were reports
>>>>>>>> before of
>>>>>>>> >>> G1-caused instabilities with small heap sizes (< 16GB - see
>>>>>>>> CASSANDRA-10403
>>>>>>>> >>> for more context). Please ignore if you already tried reverting
>>>>>>>> back to CMS.
>>>>>>>> >>>
>>>>>>>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>>>>>>>> >>>>
>>>>>>>> >>>> Hi all,
>>>>>>>> >>>>
>>>>>>>> >>>> We've recently embarked on a project to update our Cassandra
>>>>>>>> >>>> infrastructure running on EC2. We are long time users of 2.0.x
>>>>>>>> and are
>>>>>>>> >>>> testing out a move to version 2.2.5 running on VPC with EBS.
>>>>>>>> Our test setup
>>>>>>>> >>>> is a 3 node, RF=3 cluster supporting a small write load
>>>>>>>> (mirror of our
>>>>>>>> >>>> staging load).
>>>>>>>> >>>>
>>>>>>>> >>>> We are writing at QUORUM and while p95's look good compared to
>>>>>>>> our
>>>>>>>> >>>> staging 2.0.x cluster, we are seeing frequent write operations
>>>>>>>> that time out
>>>>>>>> >>>> at the max write_request_timeout_in_ms (10 seconds). CPU
>>>>>>>> across the cluster
>>>>>>>> >>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is
>>>>>>>> running with the
>>>>>>>> >>>> Oracle JDK 8u60 with G1GC, and any GC pauses are
>>>>>>>> less than 500ms.
>>>>>>>> >>>>
>>>>>>>> >>>> We run on c4.2xl instances with GP2 EBS attached storage for
>>>>>>>> data and
>>>>>>>> >>>> commitlog directories. The nodes are using EC2 enhanced
>>>>>>>> networking and have
>>>>>>>> >>>> the latest Intel network driver module. We are running on HVM
>>>>>>>> instances
>>>>>>>> >>>> using Ubuntu 14.04.2.
>>>>>>>> >>>>
>>>>>>>> >>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table
>>>>>>>> is similar
>>>>>>>> >>>> to the definition here:
>>>>>>>> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>>>>>> >>>>
>>>>>>>> >>>> This is our cassandra.yaml:
>>>>>>>> >>>>
>>>>>>>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>>>>> >>>>
>>>>>>>> >>>> As I mentioned, we use 8u60 with G1GC and have applied many of
>>>>>>>> the GC
>>>>>>>> >>>> settings in Al Tobey's tuning guide. This is our upstart
>>>>>>>> config with JVM and
>>>>>>>> >>>> other CPU settings:
>>>>>>>> https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>>>>> >>>>
>>>>>>>> >>>> We've used several of the sysctl settings from Al's guide as
>>>>>>>> well:
>>>>>>>> >>>> https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>>>>> >>>>
>>>>>>>> >>>> Our client application is able to write either Thrift
>>>>>>>> batches
>>>>>>>> >>>> using the Astyanax driver or async CQL INSERTs using the DataStax
>>>>>>>> Java driver.
>>>>>>>> >>>>
>>>>>>>> >>>> For testing against Thrift (our legacy infra uses this) we
>>>>>>>> write batches
>>>>>>>> >>>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch
>>>>>>>> execution is
>>>>>>>> >>>> around 45ms, but our maximum (p100) sits below 150ms except
>>>>>>>> when it
>>>>>>>> >>>> periodically spikes to the full 10 seconds.
>>>>>>>> >>>>
>>>>>>>> >>>> Testing the same write path using CQL writes instead
>>>>>>>> demonstrates
>>>>>>>> >>>> similar behavior. Low p99s except for periodic full timeouts.
>>>>>>>> We enabled
>>>>>>>> >>>> tracing for several operations but were unable to get a trace
>>>>>>>> that completed
>>>>>>>> >>>> successfully -- Cassandra started logging many messages as:
>>>>>>>> >>>>
>>>>>>>> >>>> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE
>>>>>>>> messages
>>>>>>>> >>>> were dropped in last 5000 ms: 52499 for internal timeout and 0
>>>>>>>> for cross
>>>>>>>> >>>> node timeout
>>>>>>>> >>>>
>>>>>>>> >>>> And all the traces contained rows with a "null" source_elapsed
>>>>>>>> value:
>>>>>>>> >>>>
>>>>>>>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> We've exhausted as many configuration option permutations as
>>>>>>>> we can
>>>>>>>> >>>> think of. This cluster does not appear to be under any
>>>>>>>> significant load and
>>>>>>>> >>>> latencies seem to largely fall in two bands: low normal or max
>>>>>>>> timeout. This
>>>>>>>> >>>> seems to imply that something is getting stuck and timing out
>>>>>>>> at the max
>>>>>>>> >>>> write timeout.
>>>>>>>> >>>>
>>>>>>>> >>>> Any suggestions on what to look for? We had debug enabled for
>>>>>>>> a while but
>>>>>>>> >>>> we didn't see any messages that pointed to something obvious. Happy
>>>>>>>> to provide
>>>>>>>> >>>> any more information that may help.
>>>>>>>> >>>>
>>>>>>>> >>>> We are pretty much at the point of sprinkling debug around the
>>>>>>>> code to
>>>>>>>> >>>> track down what could be blocking.
>>>>>>>> >>>>
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks,
>>>>>>>> >>>>
>>>>>>>> >>>> Mike
>>>>>>>> >>>>
>>>>>>>> >>>> --
>>>>>>>> >>>>
>>>>>>>> >>>>   Mike Heffner <m...@librato.com>
>>>>>>>> >>>>   Librato, Inc.
>>>>>>>> >>>>
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >>
>>>>>>>> >>   Mike Heffner <m...@librato.com>
>>>>>>>> >>   Librato, Inc.
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> >
>>>>>>>> >   Mike Heffner <m...@librato.com>
>>>>>>>> >   Librato, Inc.
>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Close the World, Open the Net
>>>>>>>> http://www.linux-wizard.net
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>   Mike Heffner <m...@librato.com>
>>>>>>   Librato, Inc.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>   Mike Heffner <m...@librato.com>
>>>>   Librato, Inc.
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>   Mike Heffner <m...@librato.com>
>>>   Librato, Inc.
>>>
>>>
>>
>>
>> --
>> -----------------
>> Nate McCall
>> Austin, TX
>> @zznate
>>
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>
