Anecdotally, CAS works differently than the typical Cassandra workload. If you run a stress instance against 3 nodes on one host, you typically run into CPU limits, but if you are doing a CAS workload you see things timing out before you ever hit 100% CPU. It is a strange beast.
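For reference, the "CAS workload" here means lightweight transactions: conditional writes that take the Paxos path rather than the normal write path. Below is a minimal sketch, assuming the DataStax Java driver and an illustrative contact point, keyspace and owner value (none of these come from the thread), of the kind of conditional write that produces these timeouts:

import java.nio.ByteBuffer;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class CasWriteSketch {
    public static void main(String[] args) {
        // Contact point and keyspace name are assumptions for illustration.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try {
            Session session = cluster.connect("mds");
            // A conditional (CAS) write: unlike a plain INSERT/UPDATE, this runs
            // a Paxos round across the replicas and can time out under
            // contention long before the nodes are CPU-bound.
            Statement cas = new SimpleStatement(
                    "INSERT INTO \"Lock\" (lockname, id, value) VALUES (?, ?, ?) IF NOT EXISTS",
                    "locktest_1", "owner-a", ByteBuffer.wrap(new byte[] { 1 }));
            ResultSet rs = session.execute(cas);
            // LWT results carry an [applied] column that tells whether the
            // condition held and the write was applied.
            System.out.println("applied: " + rs.one().getBool("[applied]"));
        } catch (WriteTimeoutException wte) {
            // For CAS writes the reported write type is typically CAS (the
            // Paxos prepare/propose phase) or SIMPLE (the commit phase); the
            // outcome of the round is unknown to the client at this point.
            System.err.println("CAS write timed out, writeType=" + wte.getWriteType());
        } finally {
            cluster.close();
        }
    }
}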
On Fri, Dec 23, 2016 at 7:28 AM, horschi <hors...@gmail.com> wrote:

> Update: I replaced all quorum reads on that table with serial reads, and
> now these errors occur less often. Somehow quorum reads on CAS values
> cause most of these WTEs.
>
> Also I found two tickets on that topic:
> https://issues.apache.org/jira/browse/CASSANDRA-9328
> https://issues.apache.org/jira/browse/CASSANDRA-8672
>
> On Thu, Dec 15, 2016 at 3:14 PM, horschi <hors...@gmail.com> wrote:
>
>> Hi,
>>
>> I would like to warm up this old thread. I did some debugging and found
>> out that the timeouts are coming from StorageProxy.proposePaxos() -
>> callback.isFullyRefused() returns false and therefore triggers a
>> WriteTimeout.
>>
>> Looking at my ccm cluster logs, I can see that two replica nodes return
>> different results in their ProposeVerbHandler. In my opinion the
>> coordinator should not throw an Exception in such a case, but instead
>> retry the operation.
>>
>> What do the CAS/Paxos experts on this list say to this? Feel free to
>> instruct me to do further tests/code changes. I'd be glad to help.
>>
>> Log:
>>
>> node1/logs/system.log:WARN [SharedPool-Worker-5] 2016-12-15 14:48:36,896
>> PaxosState.java:124 - Rejecting proposal for
>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>> node1/logs/system.log- Row: id=@ | value=<tombstone>) because
>> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>> --
>> node1/logs/system.log:ERROR [SharedPool-Worker-12] 2016-12-15 14:48:36,980
>> StorageProxy.java:506 - proposePaxos:
>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>> node1/logs/system.log- Row: id=@ | value=<tombstone>)//1//0
>> --
>> node2/logs/system.log:WARN [SharedPool-Worker-7] 2016-12-15 14:48:36,969
>> PaxosState.java:117 - Accepting proposal:
>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>> node2/logs/system.log- Row: id=@ | value=<tombstone>)
>> --
>> node3/logs/system.log:WARN [SharedPool-Worker-2] 2016-12-15 14:48:36,897
>> PaxosState.java:124 - Rejecting proposal for
>> Commit(2d803540-c2cd-11e6-2e48-53a129c60cfc,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>> node3/logs/system.log- Row: id=@ | value=<tombstone>) because
>> inProgress is now Commit(2d8146b0-c2cd-11e6-f996-e5c8d88a1da4,
>> [MDS.Lock] key=locktest_ 1 columns=[[] | [value]]
>>
>> kind regards,
>> Christian
>>
>> On Fri, Apr 15, 2016 at 8:27 PM, Denise Rogers <datag...@aol.com> wrote:
>>
>>> My thinking was that due to the size of the data there may be I/O
>>> issues. But it sounds more like you're competing for locks and hit a
>>> deadlock issue.
>>>
>>> Regards,
>>> Denise
>>> Cell - (860)989-3431
>>>
>>> Sent from my iPhone
>>>
>>> On Apr 15, 2016, at 9:00 AM, horschi <hors...@gmail.com> wrote:
>>>
>>> Hi Denise,
>>>
>>> in my case it's a small blob I am writing (should be around 100 bytes):
>>>
>>> CREATE TABLE "Lock" (
>>>     lockname varchar,
>>>     id varchar,
>>>     value blob,
>>>     PRIMARY KEY (lockname, id)
>>> ) WITH COMPACT STORAGE
>>>   AND COMPRESSION = { 'sstable_compression' : 'SnappyCompressor',
>>>                       'chunk_length_kb' : '8' };
>>>
>>> You ask because large values are known to cause issues? Anything special
>>> you have in mind?
>>>
>>> kind regards,
>>> Christian
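Against a schema like the "Lock" table quoted above, the locking pattern under discussion is usually built on conditional statements and the [applied] column they return. The exact statements used in the thread are not shown, so the following is only a sketch of the pattern, assuming the DataStax Java driver and an illustrative owner token:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Sketch of an LWT-based lock on the "Lock" table shown above. The owner
// token and method names are illustrative, not taken from the thread.
public class LockSketch {
    private final Session session;
    private final PreparedStatement acquireStmt;
    private final PreparedStatement releaseStmt;

    public LockSketch(Session session) {
        this.session = session;
        // Acquire: succeeds only if no row exists yet for (lockname, id).
        this.acquireStmt = session.prepare(
                "INSERT INTO \"Lock\" (lockname, id, value) VALUES (?, ?, ?) IF NOT EXISTS");
        // Release: conditional delete so only the current owner can release.
        this.releaseStmt = session.prepare(
                "DELETE FROM \"Lock\" WHERE lockname = ? AND id = ? IF value = ?");
    }

    public boolean acquire(String lockname, String id, String owner) {
        ByteBuffer token = ByteBuffer.wrap(owner.getBytes(StandardCharsets.UTF_8));
        Row row = session.execute(acquireStmt.bind(lockname, id, token)).one();
        return row.getBool("[applied]");   // true => we hold the lock
    }

    public boolean release(String lockname, String id, String owner) {
        ByteBuffer token = ByteBuffer.wrap(owner.getBytes(StandardCharsets.UTF_8));
        Row row = session.execute(releaseStmt.bind(lockname, id, token)).one();
        return row.getBool("[applied]");   // false => we did not hold the lock
    }
}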
>>> On Fri, Apr 15, 2016 at 2:42 PM, Denise Rogers <datag...@aol.com> wrote:
>>>
>>>> Also, what type of data were you reading/writing?
>>>>
>>>> Regards,
>>>> Denise
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Apr 15, 2016, at 8:29 AM, horschi <hors...@gmail.com> wrote:
>>>>
>>>> Hi Jan,
>>>>
>>>> were you able to resolve your problem?
>>>>
>>>> We are trying the same and also see a lot of WriteTimeouts:
>>>> WriteTimeoutException: Cassandra timeout during write query at
>>>> consistency SERIAL (2 replica were required but only 1 acknowledged
>>>> the write)
>>>>
>>>> How many clients were competing for a lock in your case? In our case
>>>> it's only two :-(
>>>>
>>>> cheers,
>>>> Christian
>>>>
>>>> On Tue, Sep 24, 2013 at 12:18 AM, Robert Coli <rc...@eventbrite.com>
>>>> wrote:
>>>>
>>>>> On Mon, Sep 16, 2013 at 9:09 AM, Jan Algermissen
>>>>> <jan.algermis...@nordsc.com> wrote:
>>>>>
>>>>>> I am experimenting with C* 2.0 (and today's java-driver 2.0
>>>>>> snapshot) for implementing distributed locks.
>>>>>
>>>>> [ and I'm experiencing the problem described in the subject ... ]
>>>>>
>>>>>> Any idea how to approach this problem?
>>>>>
>>>>> 1) Upgrade to the 2.0.1 release.
>>>>> 2) Try to reproduce the symptoms.
>>>>> 3) If able to, file a JIRA at
>>>>>    https://issues.apache.org/jira/secure/Dashboard.jspa including
>>>>>    repro steps.
>>>>> 4) Reply to this thread with the JIRA ticket URL.
>>>>>
>>>>> =Rob
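When one of these SERIAL write timeouts does occur, the client does not know whether its proposal was accepted. One way to handle it, in line with the serial-read change described in the December update above, is to re-read the contended row at SERIAL consistency and decide from the current state whether the write took effect. A minimal sketch, assuming the DataStax Java driver; the helper names and the owner-token comparison are illustrative:

import java.nio.ByteBuffer;

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class SerialReadOnTimeout {

    // Reads the lock row at SERIAL consistency. A serial read runs a Paxos
    // round itself, so it observes (and completes) any in-progress CAS
    // update instead of racing with it the way a QUORUM read can.
    static Row readLockSerial(Session session, String lockname, String id) {
        Statement read = new SimpleStatement(
                "SELECT value FROM \"Lock\" WHERE lockname = ? AND id = ?", lockname, id)
                .setConsistencyLevel(ConsistencyLevel.SERIAL);
        return session.execute(read).one();
    }

    // Illustrative handling of a timed-out CAS write: the outcome is unknown,
    // so check the row afterwards. Assumes the token uniquely identifies this
    // client (a hypothetical convention, not something from the thread).
    static boolean acquireWithCheck(Session session, String lockname, String id,
                                    ByteBuffer token) {
        Statement acquire = new SimpleStatement(
                "INSERT INTO \"Lock\" (lockname, id, value) VALUES (?, ?, ?) IF NOT EXISTS",
                lockname, id, token);
        try {
            return session.execute(acquire).one().getBool("[applied]");
        } catch (WriteTimeoutException wte) {
            // The proposal may or may not have gone through. Re-read at SERIAL
            // to see whether our token actually made it into the row.
            Row row = readLockSerial(session, lockname, id);
            return row != null && token.equals(row.getBytes("value"));
        }
    }
}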