Hi Oskar,

I know this won't help you as quickly as you would like, but please consider
updating the JIRA issue with details of your environment, as it may help
move the investigation along.
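
If you do end up going the sstabledump route Julien suggests below, the dedup
step can be as simple as collapsing the dump back down to one row per primary
key before truncating and re-inserting. A rough, untested sketch in Python (it
assumes the 3.x sstabledump JSON layout; list columns, TTLs and tombstones
would need extra handling):

import json, sys

# Collapse `sstabledump <sstable> > dump.json` output into one merged row
# per primary key, keeping the last non-null value seen for each column.
merged = {}
with open(sys.argv[1]) as f:
    for partition in json.load(f):
        pk = tuple(partition["partition"]["key"])
        for row in partition.get("rows", []):
            if row.get("type") != "row":
                continue  # skip range tombstone markers etc.
            key = pk + tuple(row.get("clustering", []))
            cols = merged.setdefault(key, {})
            for cell in row.get("cells", []):
                if cell.get("value") is not None:
                    cols[cell["name"]] = cell["value"]

# One clean row per primary key, ready to re-insert with a driver after
# truncating the table.
for key, cols in merged.items():
    print(json.dumps({"key": list(key), "columns": cols}))

That at least gives you something to sanity-check against the cqlsh output
before you truncate anything in production.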

Good luck!

On Tue, Jun 21, 2016 at 12:21 PM, Julien Anguenot <jul...@anguenot.org>
wrote:

> You could try running sstabledump on that one corrupted table, write some
> (Python) code that strips out the duplicates by processing the sstabledump
> output (might not be bulletproof depending on the data, I agree), then
> truncate the table and re-insert the rows without duplicates.
>
> On Tue, Jun 21, 2016 at 11:52 AM, Oskar Kjellin <oskar.kjel...@gmail.com>
> wrote:
> > Hmm, no way we can do that in prod :/
> >
> > Sent from my iPhone
> >
> >> On 21 juni 2016, at 18:50, Julien Anguenot <jul...@anguenot.org> wrote:
> >>
> >> See my comments on the issue: I had to truncate and reinsert data in
> >> these corrupted tables.
> >>
> >> AFAIK, there is no evidence that UDTs are responsible for this bad behavior.
> >>
> >>> On Tue, Jun 21, 2016 at 11:45 AM, Oskar Kjellin <oskar.kjel...@gmail.com> wrote:
> >>> Yeah, I saw that one. We're not using UDTs in the affected tables, though.
> >>>
> >>> Did you resolve it?
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On 21 juni 2016, at 18:27, Julien Anguenot <jul...@anguenot.org> wrote:
> >>>>
> >>>> I have experienced similar duplicate primary key behavior with a couple
> >>>> of tables after upgrading from 2.2.x to 3.0.x.
> >>>>
> >>>> See my comments on the JIRA issue I opened at the time:
> >>>> https://issues.apache.org/jira/browse/CASSANDRA-11887
> >>>>
> >>>>
> >>>>> On Tue, Jun 21, 2016 at 10:47 AM, Oskar Kjellin <oskar.kjel...@gmail.com> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> We've done this upgrade in both dev and stage before, and we did not
> >>>>> see similar issues. After upgrading production today we have a lot of
> >>>>> issues, though.
> >>>>>
> >>>>> The main issue is that the DataStax client quite often does not get the
> >>>>> data back (even though it's the same query). I see similar flakiness
> >>>>> simply running cqlsh: when it does return data, the data it returns is
> >>>>> broken.
> >>>>>
> >>>>> We are running a 3-node cluster with RF 3.
> >>>>>
> >>>>> I have this table:
> >>>>>
> >>>>> CREATE TABLE keyspace.table (
> >>>>>     a text,
> >>>>>     b text,
> >>>>>     c text,
> >>>>>     d list<text>,
> >>>>>     e text,
> >>>>>     f timestamp,
> >>>>>     g list<text>,
> >>>>>     h timestamp,
> >>>>>     PRIMARY KEY (a, b, c)
> >>>>> )
> >>>>>
> >>>>>
> >>>>> Every other time I query (not exactly every other time, but at random) I get:
> >>>>>
> >>>>> SELECT * from table where a = 'xxx' and b = 'xxx'
> >>>>>
> >>>>>  a   | b   | c   | d    | e    | f                               | g       | h
> >>>>> -----+-----+-----+------+------+---------------------------------+---------+---------------------------------
> >>>>>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | 2014-12-31 23:00:00.000000+0000
> >>>>>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | 2016-06-17 13:29:36.000000+0000
> >>>>>
> >>>>> Which is the expected output.
> >>>>>
> >>>>>
> >>>>> But I also get:
> >>>>>
> >>>>>  a   | b   | c   | d    | e    | f                               | g       | h
> >>>>> -----+-----+-----+------+------+---------------------------------+---------+---------------------------------
> >>>>>  xxx | xxx | ccc | null | null | null                            | null    | null
> >>>>>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | null
> >>>>>  xxx | xxx | ccc | null | null | null                            | null    | 2014-12-31 23:00:00.000000+0000
> >>>>>  xxx | xxx | ddd | null | null | null                            | null    | null
> >>>>>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | null
> >>>>>  xxx | xxx | ddd | null | null | null                            | null    | 2016-06-17 13:29:36.000000+0000
> >>>>>
> >>>>>
> >>>>> Notice that the same PK is returned 3 times, with different parts of the
> >>>>> data in each row. I believe this is what's currently killing our
> >>>>> production environment.
> >>>>>
> >>>>>
> >>>>> I'm running upgradesstables as of this moment, but it's not finished yet.
> >>>>> I started a repair before, but nothing happened. The upgradesstables run
> >>>>> has now finished on 2 out of 3 nodes, but production is still down :/
> >>>>>
> >>>>>
> >>>>> We also see these in the logs, over and over again:
> >>>>>
> >>>>> DEBUG [ReadRepairStage:4] 2016-06-21 15:44:01,119 ReadCallback.java:235 - Digest mismatch:
> >>>>> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-1566729966326640413, 336b35356c49537731797a4a5f64627a797236) (b3dcfcbeed6676eae7ff88cc1bd251fb vs 6e7e9225871374d68a7cdb54ae70726d)
> >>>>>     at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.5.0.jar:3.5.0]
> >>>>>     at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:226) ~[apache-cassandra-3.5.0.jar:3.5.0]
> >>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
> >>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
> >>>>>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
> >>>>>
> >>>>>
> >>>>> Any help is much appreciated.
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Julien Anguenot (@anguenot)
> >>
> >>
> >>
> >> --
> >> Julien Anguenot (@anguenot)
>
>
>
> --
> Julien Anguenot (@anguenot)
>
