Hmm, no way we can do that in prod :/

Sent from my iPhone

> On 21 June 2016, at 18:50, Julien Anguenot <jul...@anguenot.org> wrote:
> 
> See my comments on the issue: I had to truncate and reinsert data in
> these corrupted tables.
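> 
> (A rough sketch of that recovery, using the table from this thread as a
> stand-in; the values are placeholders, and this assumes you still have a
> clean source for the rows, such as an application-side export:
> 
>   TRUNCATE keyspace.table;
>   INSERT INTO keyspace.table (a, b, c, f, g, h)
>   VALUES ('xxx', 'xxx', 'ccc', '2089-11-30 23:00:00+0000', ['fff'],
>           '2014-12-31 23:00:00+0000');
> 
> i.e. wipe the table's data, then rewrite every row from the clean source.)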
> 
> AFAIK, there is no evidence that UDTs are responsible for this behavior.
> 
>> On Tue, Jun 21, 2016 at 11:45 AM, Oskar Kjellin <oskar.kjel...@gmail.com> wrote:
>> Yeah, I saw that one. We're not using UDTs in the affected tables though.
>> 
>> Did you resolve it?
>> 
>> Sent from my iPhone
>> 
>>> On 21 June 2016, at 18:27, Julien Anguenot <jul...@anguenot.org> wrote:
>>> 
>>> I have experienced similar duplicate-primary-key behavior with a couple
>>> of tables after upgrading from 2.2.x to 3.0.x.
>>> 
>>> See comments on the Jira issue I opened at the time over there:
>>> https://issues.apache.org/jira/browse/CASSANDRA-11887
>>> 
>>> 
>>>> On Tue, Jun 21, 2016 at 10:47 AM, Oskar Kjellin <oskar.kjel...@gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> We've done this upgrade in both dev and stage before, and we did not see
>>>> any similar issues. After upgrading production today, we have a lot of
>>>> issues though.
>>>> 
>>>> The main issue is that the DataStax client quite often does not get the
>>>> data back (even though it's the same query). I see similar flakiness when
>>>> simply running cqlsh; when it does return, it returns broken data.
>>>> 
>>>> We are running a 3 node cluster with RF 3.
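>>>> 
>>>> (For reference, a keyspace at RF 3 would have been created along these
>>>> lines; the SimpleStrategy replication class is an assumption, only the
>>>> factor of 3 is stated here:
>>>> 
>>>>   CREATE KEYSPACE keyspace WITH replication =
>>>>       {'class': 'SimpleStrategy', 'replication_factor': 3};
>>>> 
>>>> a three-node cluster at RF 3 means every node holds a full copy of the
>>>> data.)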
>>>> 
>>>> I have this table
>>>> 
>>>> CREATE TABLE keyspace.table (
>>>>     a text,
>>>>     b text,
>>>>     c text,
>>>>     d list<text>,
>>>>     e text,
>>>>     f timestamp,
>>>>     g list<text>,
>>>>     h timestamp,
>>>>     PRIMARY KEY (a, b, c)
>>>> );
>>>> 
>>>> 
>>>> Roughly every other time I query (it's random, not strictly alternating), I get:
>>>> 
>>>> 
>>>> SELECT * FROM table WHERE a = 'xxx' AND b = 'xxx';
>>>> 
>>>>  a   | b   | c   | d    | e    | f                               | g       | h
>>>> -----+-----+-----+------+------+---------------------------------+---------+---------------------------------
>>>>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | 2014-12-31 23:00:00.000000+0000
>>>>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | 2016-06-17 13:29:36.000000+0000
>>>> 
>>>> 
>>>> Which is the expected output.
>>>> 
>>>> 
>>>> But I also get:
>>>> 
>>>>  a   | b   | c   | d    | e    | f                               | g       | h
>>>> -----+-----+-----+------+------+---------------------------------+---------+---------------------------------
>>>>  xxx | xxx | ccc | null | null | null                            | null    | null
>>>>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | null
>>>>  xxx | xxx | ccc | null | null | null                            | null    | 2014-12-31 23:00:00.000000+0000
>>>>  xxx | xxx | ddd | null | null | null                            | null    | null
>>>>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | null
>>>>  xxx | xxx | ddd | null | null | null                            | null    | 2016-06-17 13:29:36.000000+0000
>>>> 
>>>> 
>>>> Notice that the same primary key is returned three times, each row
>>>> carrying a different subset of the data. I believe this is what's
>>>> currently killing our production environment.
>>>> 
>>>> 
>>>> I'm running upgradesstables at the moment, but it's not finished yet. I
>>>> started a repair earlier, but nothing happened. upgradesstables has now
>>>> finished on 2 out of 3 nodes, but production is still down :/
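>>>> 
>>>> For reference, these are the standard nodetool operations, run per node
>>>> (the keyspace/table names below are placeholders):
>>>> 
>>>>   nodetool upgradesstables
>>>>   nodetool repair keyspace table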
>>>> 
>>>> 
>>>> We also see these in the logs, over and over again:
>>>> 
>>>> DEBUG [ReadRepairStage:4] 2016-06-21 15:44:01,119 ReadCallback.java:235 - Digest mismatch:
>>>> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-1566729966326640413, 336b35356c49537731797a4a5f64627a797236) (b3dcfcbeed6676eae7ff88cc1bd251fb vs 6e7e9225871374d68a7cdb54ae70726d)
>>>>     at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.5.0.jar:3.5.0]
>>>>     at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:226) ~[apache-cassandra-3.5.0.jar:3.5.0]
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
>>>> 
>>>> 
>>>> Any help is much appreciated.
>>> 
>>> 
>>> 
>>> --
>>> Julien Anguenot (@anguenot)
> 
> 
> 
> -- 
> Julien Anguenot (@anguenot)
