Oh yes it is, like Counters :-)
On Sat, Dec 24, 2016 at 4:02 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Anecdotally, CAS works differently than the typical Cassandra workload. If
> you run a stress instance against 3 nodes on one host, you find that you
> typically run into CPU issues, but with a CAS workload you see things
> timing out before you hit 100% CPU. It is a strange beast.
>
> On Fri, Dec 23, 2016 at 7:28 AM, horschi <hors...@gmail.com> wrote:
>
>> Update: I replaced all quorum reads on that table with serial reads, and
>> these errors became less frequent. Somehow quorum reads on CAS values
>> cause most of these WTEs.
>>
>> Also I found two tickets on that topic:
>> https://issues.apache.org/jira/browse/CASSANDRA-9328
>> https://issues.apache.org/jira/browse/CASSANDRA-8672
>>
>> On Thu, Dec 15, 2016 at 3:14 PM, horschi <hors...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I would like to warm up this old thread. I did some debugging and found
>>> out that the timeouts are coming from StorageProxy.proposePaxos():
>>> callback.isFullyRefused() returns false and therefore triggers a
>>> WriteTimeout.
>>>
>>> Looking at my ccm cluster logs, I can see that two replica nodes return
>>> different results in their ProposeVerbHandler. In my opinion the
>>> coordinator should not throw an Exception in such a case, but instead
>>> retry the operation.
>>>
>>> What do the CAS/Paxos experts on this list say to this? Feel free to
>>> instruct me to do further tests/code changes. I'd be glad to help.
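[The coordinator behaviour described above can be sketched as a small decision function. This is a hypothetical model, not Cassandra's actual code: proposePaxos() needs a quorum of accepts to succeed, can safely restart the Paxos round only when every replica refused (the isFullyRefused() case), and in the mixed case reports a WriteTimeout because the outcome of the round is uncertain.]

```python
# Hypothetical model of the coordinator-side decision discussed above
# (not actual Cassandra code): a propose round succeeds on a quorum of
# accepts, is safely retryable only when *all* replies refused, and
# otherwise surfaces as a WriteTimeout.

def propose_outcome(accepts: int, refusals: int, quorum: int) -> str:
    """Classify a Paxos propose round from the replica replies seen."""
    if accepts >= quorum:
        return "accepted"        # proposal wins, commit can proceed
    if accepts == 0 and refusals > 0:
        return "fully-refused"   # clean loss: safe to retry with a newer ballot
    return "write-timeout"       # mixed replies: outcome uncertain

# With RF=3 a quorum is 2. One accept plus one rejection (two replicas
# answering differently, as described above) is neither a win nor a
# clean refusal, so the coordinator times out instead of retrying.
print(propose_outcome(2, 0, 2))  # accepted
print(propose_outcome(0, 2, 2))  # fully-refused
print(propose_outcome(1, 1, 2))  # write-timeout
```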
>>> Log:
>>>
>>> node1/logs/system.log:WARN  [SharedPool-Worker-5] 2016-12-15 14:48:36,896
>>> PaxosState.java:124 - Rejecting proposal for
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_1
>>> columns=[[] | [value]]
>>> node1/logs/system.log- Row: id=@ | value=<tombstone>) because inProgress
>>> is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
>>> key=locktest_1 columns=[[] | [value]]
>>> --
>>> node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15
>>> 14:48:36,980 StorageProxy.java:506 - proposePaxos:
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_1
>>> columns=[[] | [value]]
>>> node1/logs/system.log- Row: id=@ | value=<tombstone>)//1//0
>>> --
>>> node2/logs/system.log:WARN  [SharedPool-Worker-7] 2016-12-15 14:48:36,969
>>> PaxosState.java:117 - Accepting proposal:
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_1
>>> columns=[[] | [value]]
>>> node2/logs/system.log- Row: id=@ | value=<tombstone>)
>>> --
>>> node3/logs/system.log:WARN  [SharedPool-Worker-2] 2016-12-15 14:48:36,897
>>> PaxosState.java:124 - Rejecting proposal for
>>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc, [MDS.Lock] key=locktest_1
>>> columns=[[] | [value]]
>>> node3/logs/system.log- Row: id=@ | value=<tombstone>) because inProgress
>>> is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4, [MDS.Lock]
>>> key=locktest_1 columns=[[] | [value]]
>>>
>>> kind regards,
>>> Christian
>>>
>>> On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers <datag...@aol.com> wrote:
>>>
>>>> My thinking was that, due to the size of the data, there may be I/O
>>>> issues. But it sounds more like you're competing for locks and hit a
>>>> deadlock issue.
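[The accept/reject split in the logs above follows the replica-side rule that a proposal loses to any newer in-progress ballot: nodes 1 and 3 had already seen ballot 2d8146b0..., so they reject 2d803540..., while node 2 had not and accepts it. A minimal sketch of that rule, with plain integers standing in for the TimeUUID ballots; this is an illustrative model, not the real PaxosState implementation.]

```python
# Toy replica-side accept rule, mirroring the "Rejecting proposal ...
# because inProgress is now <newer ballot>" messages in the logs.
# Integers stand in for TimeUUID ballots (hypothetical simplification).

class PaxosReplica:
    """Toy replica state: tracks the newest ballot it has seen."""

    def __init__(self) -> None:
        self.in_progress = 0  # newest ballot promised/accepted so far

    def propose(self, ballot: int) -> bool:
        # Accept only if this ballot is at least as new as anything
        # already in progress; otherwise reject it.
        if ballot >= self.in_progress:
            self.in_progress = ballot
            return True
        return False

# Two contending lock clients: ballot 1 reaches replica r2 first, but
# r1 and r3 have already seen the competing, newer ballot 2. The result
# is the mixed accept/reject split shown in the node logs.
r1, r2, r3 = PaxosReplica(), PaxosReplica(), PaxosReplica()
r1.in_progress = r3.in_progress = 2
print([r.propose(1) for r in (r1, r2, r3)])  # [False, True, False]
```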
>>>> Regards,
>>>> Denise
>>>> Cell - (860) 989-3431
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Apr 15, 2016, at 9:00 AM, horschi <hors...@gmail.com> wrote:
>>>>
>>>> Hi Denise,
>>>>
>>>> in my case it's a small blob I am writing (should be around 100 bytes):
>>>>
>>>> CREATE TABLE "Lock" (
>>>>     lockname varchar,
>>>>     id varchar,
>>>>     value blob,
>>>>     PRIMARY KEY (lockname, id)
>>>> ) WITH COMPACT STORAGE
>>>>   AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
>>>>                       'chunk_length_kb' : '8' };
>>>>
>>>> You ask because large values are known to cause issues? Anything
>>>> special you have in mind?
>>>>
>>>> kind regards,
>>>> Christian
>>>>
>>>> On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers <datag...@aol.com> wrote:
>>>>
>>>>> Also, what type of data were you reading/writing?
>>>>>
>>>>> Regards,
>>>>> Denise
>>>>>
>>>>> Sent from my iPad
>>>>>
>>>>> On Apr 15, 2016, at 8:29 AM, horschi <hors...@gmail.com> wrote:
>>>>>
>>>>> Hi Jan,
>>>>>
>>>>> were you able to resolve your problem?
>>>>>
>>>>> We are trying the same and also see a lot of WriteTimeouts:
>>>>> WriteTimeoutException: Cassandra timeout during write query at
>>>>> consistency SERIAL (2 replica were required but only 1 acknowledged
>>>>> the write)
>>>>>
>>>>> How many clients were competing for a lock in your case? In our case
>>>>> it's only two :-(
>>>>>
>>>>> cheers,
>>>>> Christian
>>>>>
>>>>> On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>>
>>>>>> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen <
>>>>>> jan.algermis...@nordsc.com> wrote:
>>>>>>
>>>>>>> I am experimenting with C* 2.0 (and today's java-driver 2.0
>>>>>>> snapshot) for implementing distributed locks.
>>>>>>
>>>>>> [ and I'm experiencing the problem described in the subject ... ]
>>>>>>
>>>>>>> Any idea how to approach this problem?
>>>>>>
>>>>>> 1) Upgrade to 2.0.1 release.
>>>>>> 2) Try to reproduce symptoms.
>>>>>> 3) If able to, file a JIRA at
>>>>>> https://issues.apache.org/jira/secure/Dashboard.jspa including repro
>>>>>> steps.
>>>>>> 4) Reply to this thread with the JIRA ticket URL.
>>>>>>
>>>>>> =Rob
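[The lock pattern being tested throughout this thread is each client doing a conditional "insert if not exists" of a lock row, which against the "Lock" table shown earlier would be a lightweight transaction along the lines of `INSERT INTO "Lock" (lockname, id, value) VALUES (?, ?, ?) IF NOT EXISTS;`. That statement and the class below are assumptions for illustration; the thread only shows the schema, not the client queries. An in-memory sketch of the CAS semantics:]

```python
# Toy in-memory model of the compare-and-set lock pattern: only the
# first writer of a lock row wins, and only the holder may release.
# Class and method names are hypothetical; a dict stands in for the
# Cassandra partition.

class TinyLockTable:
    def __init__(self) -> None:
        self.rows: dict[str, str] = {}

    def try_acquire(self, lockname: str, owner: str) -> bool:
        """CAS insert-if-not-exists: analogous to [applied] = true/false."""
        if lockname in self.rows:
            return False          # someone else already holds the lock
        self.rows[lockname] = owner
        return True

    def release(self, lockname: str, owner: str) -> bool:
        """CAS delete-if-owner: only the current holder may release."""
        if self.rows.get(lockname) == owner:
            del self.rows[lockname]
            return True
        return False

# Two competing clients, as in the thread: the loser must retry, and in
# Cassandra each such conditional write is a full Paxos round.
locks = TinyLockTable()
print(locks.try_acquire("job-42", "client-A"))  # True
print(locks.try_acquire("job-42", "client-B"))  # False: contended
print(locks.release("job-42", "client-A"))      # True
```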