Cool! Thanks for sharing the RCA.

On Wed, Oct 9, 2019 at 2:56 PM Philip Ó Condúin <philipocond...@gmail.com>
wrote:
> Just to follow up on this issue as others may see it in the future, we
> cracked it!
>
> Our datafile corruption issues were a problem with the OS wrongly taking
> one block belonging to a C* data file, thinking it was no longer used, and
> treating it as a free block that would later be reused.
>
> For example:
> C* deletes a file after compaction, the OS collects all blocks which are
> free now and sends a TRIM command to the SSD, but the SSD from time to
> time picks the wrong block, not the one reported by the OS - does the trim
> - causing zeroized blocks to be seen in the datafile, and the block is
> later used for a different file.
> So the symptom is: we suddenly see 4096 zeroes in the datafile - it means
> the SSD just trimmed the block; after some time we can see some data
> written to those blocks - it means the block is now used by another file,
> and that gives us a corrupt file.
>
> We turned off the scheduled TRIM function on the OS and we are no longer
> getting corruptions.
>
> This was very difficult to pinpoint.
>
> On Thu, 15 Aug 2019 at 00:09, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
>> If you hadn't mentioned the fact you are using physical disks I would have
>> guessed you were using virtual disks on a SAN. I've seen this sort of thing
>> happen a lot there. Are there any virtual layers between the cassandra
>> process and the hardware? Just a reminder, fsync can be a liar and the
>> virtual layer can mock the response back to user land while the actual bits
>> can be dropped before hitting the disk.
>>
>> If not, you should be looking hard at your disk options. fstab,
>> schedulers, etc. In that case, you need this:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>
>>
>> Patrick
>>
>> On Wed, Aug 14, 2019 at 2:03 PM Forkalsrud, Erik <eforkals...@cj.com>
>> wrote:
>>
>>> The dmesg command will usually show information about hardware errors.
>>>
>>> An example from a spinning disk:
>>> sd 0:0:10:0: [sdi] Unhandled sense code
>>> sd 0:0:10:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>>> sd 0:0:10:0: [sdi] Sense Key : Medium Error [current]
>>> Info fld=0x6fc72
>>> sd 0:0:10:0: [sdi] Add. Sense: Unrecovered read error
>>> sd 0:0:10:0: [sdi] CDB: Read(10): 28 00 00 06 fc 70 00 00 08 00
>>>
>>>
>>> Also, you can read the file like
>>> "cat /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db >
>>> /dev/null"
>>> If you get an error message, it's probably a hardware issue.
>>>
>>> - Erik -
>>>
>>> ------------------------------
>>> *From:* Philip Ó Condúin <philipocond...@gmail.com>
>>> *Sent:* Thursday, August 8, 2019 09:58
>>> *To:* user@cassandra.apache.org <user@cassandra.apache.org>
>>> *Subject:* Re: Datafile Corruption
>>>
>>> Hi Jon,
>>>
>>> Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme
>>> but we could still be using it.
>>> We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.
>>>
>>> Our kernel is the following; we're using Red Hat so I'm told we can't
>>> upgrade the version until the next major release anyway.
>>> root@cass 0 17:32:28 ~ # uname -r
>>> 3.10.0-957.5.1.el7.x86_64
>>>
>>> Cheers,
>>> Phil
>>>
>>> On Thu, 8 Aug 2019 at 17:35, Jon Haddad <j...@jonhaddad.com> wrote:
>>>
>>> Any chance you're using NVMe with an older Linux kernel? I've seen a
>>> *lot* of filesystem errors from using older CentOS versions. You'll want
>>> to be using a version > 4.15.
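>>>
>>> A quick way to check both of those (a minimal sketch, assuming the
>>> stock util-linux on RHEL 7; device names are whatever your box has):
>>>
>>> # kernel version
>>> uname -r
>>> # any NVMe block devices visible?
>>> ls /dev/nvme* 2>/dev/null
>>> # ROTA=0 means the kernel sees the disk as non-rotational (SSD)
>>> lsblk -d -o NAME,ROTA,MODEL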
>>>
>>> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin <
>>> philipocond...@gmail.com> wrote:
>>>
>>> *@Jeff* - If it was hardware, that would explain it all, but do you
>>> think it's possible to have every server in the cluster with a hardware
>>> issue?
>>> The data is sensitive and the customer would lose their mind if I sent
>>> it off-site, which is a pity because I could really do with the help.
>>> The corruption is occurring irregularly on every server, instance, and
>>> column family in the cluster. Out of 72 instances, we are getting maybe 10
>>> corrupt files per day.
>>> We are using vnodes (256) and it is happening in both DCs.
>>>
>>> *@Asad* - internode compression is set to ALL on every server. I have
>>> checked the packets for the private interconnect and I can't see any
>>> dropped packets; there are dropped packets for other interfaces, but not
>>> for the private ones. I will get the network team to double-check this.
>>> The corruption is only on the application schema; we are not getting
>>> corruption on any system or cass keyspaces. Corruption is happening in
>>> both DCs. We are getting corruption for the 1 application schema we have
>>> across all tables in the keyspace; it's not limited to one table.
>>> I'm not sure why the app team decided not to use default compression; I
>>> must ask them.
>>>
>>>
>>>
>>> I have been checking /var/log/messages today going back a few weeks
>>> and can see a serious amount of broken pipe errors across all servers and
>>> instances.
>>> Here is a snippet from one server, but most pipe errors are similar:
>>>
>>> Jul 9 03:00:08 cassandra: INFO 02:00:08 Writing
>>> Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072
>>> ops, 0%/0% of on/off-heap limit)
>>> Jul 9 03:00:13 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:19 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:22 cassandra: ERROR 02:00:22 Got an IOException during
>>> write!
>>> Jul 9 03:00:22 cassandra: java.io.IOException: Broken pipe
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
>>> ~[libthrift-0.9.2.jar:0.9.2]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.Message.write(Message.java:222)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:25 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:30 cassandra: ERROR 02:00:30 Got an IOException during
>>> write!
>>> Jul 9 03:00:30 cassandra: java.io.IOException: Broken pipe
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
>>> ~[libthrift-0.9.2.jar:0.9.2]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.Message.write(Message.java:222)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:31 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:37 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:43 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>>
>>>
>>> On Thu, 8 Aug 2019 at 15:42, ZAIDI, ASAD A <az1...@att.com> wrote:
>>>
>>> Did you check whether packets are being dropped on the network interfaces
>>> the Cassandra instances are consuming (ifconfig -a)? Internode compression
>>> is set for all endpoints – maybe the network is playing a role here?
>>>
>>> Is this corruption limited to certain keyspaces/tables or DCs, or is it
>>> widespread? The log snippet you shared looked like only a specific
>>> keyspace/table is affected – is that correct?
>>>
>>> When you remove the corrupted sstable of a certain table, I guess you
>>> verify all nodes for corrupted sstables of the same table (maybe with the
>>> nodetool scrub tool) so as to limit the spread of corruption – right?
>>>
>>> Just curious – you're not using the lz4/default compressor for all
>>> tables; there must be some reason for it.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Philip Ó Condúin [mailto:philipocond...@gmail.com]
>>> *Sent:* Thursday, August 08, 2019 6:20 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Datafile Corruption
>>>
>>>
>>>
>>> Hi All,
>>>
>>> Thank you so much for the replies.
>>>
>>> Currently, I have the following list of things that can potentially cause
>>> some sort of corruption in a Cassandra cluster.
>>>
>>> - Sudden power cut - *We have had no power cuts in the datacenters*
>>> - Network issues - *no network issues from what I can tell*
>>> - Disk full - *I don't think this is an issue for us, see disks
>>> below.*
>>> - An issue in the Cassandra version, like CASSANDRA-13752 - *couldn't
>>> find any Jira issues similar to ours.*
>>> - Bit flips - *we have compression enabled so I don't think this
>>> should be an issue.*
>>> - Repair during upgrade has caused corruption too - *we have not
>>> upgraded*
>>> - Dropping and adding columns with the same name but a different
>>> type - *I will need to ask the apps team how they are using the
>>> database.*
>>>
>>>
>>> OK, let me try and explain the issue we are having; I am under a lot of
>>> pressure from above to get this fixed and I can't figure it out.
>>>
>>> This is a PRE-PROD environment.
>>>
>>> - 2 datacenters.
>>> - 9 physical servers in each datacenter
>>> - 4 Cassandra instances on each server
>>> - 72 Cassandra instances across the 2 data centres, 36 in site A, 36
>>> in site B.
>>>
>>>
>>> We also have 2 Reaper nodes we use for repair: one Reaper node in each
>>> datacenter, each running with its own Cassandra back end, in a cluster
>>> together.
>>>
>>> OS Details [Red Hat Linux]
>>> cass_a@x 0 10:53:01 ~ $ uname -a
>>> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
>>> NAME="Red Hat Enterprise Linux Server"
>>> VERSION="7.6 (Maipo)"
>>> ID="rhel"
>>>
>>> Storage Layout
>>> cass_a@xx 0 10:46:28 ~ $ df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/mapper/vg01-lv_root 20G 2.2G 18G 11% /
>>> devtmpfs 63G 0 63G 0% /dev
>>> tmpfs 63G 0 63G 0% /dev/shm
>>> tmpfs 63G 4.1G 59G 7% /run
>>> tmpfs 63G 0 63G 0% /sys/fs/cgroup
>> 4 cassandra instances
>>> /dev/sdd 1.5T 802G 688G 54% /data/ssd4
>>> /dev/sda 1.5T 798G 692G 54% /data/ssd1
>>> /dev/sdb 1.5T 681G 810G 46% /data/ssd2
>>> /dev/sdc 1.5T 558G 932G 38% /data/ssd3
>>> /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>> Cassandra load is about 200GB and the rest of the space is snapshots
>>>
>>> CPU
>>> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
>>> CPU(s): 64
>>> Thread(s) per core: 2
>>> Core(s) per socket: 16
>>> Socket(s): 2
>>>
>>> *Description of problem:*
>>> During repair of the cluster, we are seeing multiple corruptions in the
>>> log files on a lot of instances. There seems to be no pattern to the
>>> corruption. It seems that the repair job is finding all the corrupted
>>> files for us. The repair will hang on the node where the corrupted file
>>> is found. To fix this we remove/rename the datafile and bounce the
>>> Cassandra instance. Our hardware/OS team have stated there is no problem
>>> on their side. I do not believe it is the repair causing the corruption.
>>>
>>> We have maintenance scripts that run every night, running compactions
>>> and creating snapshots. I decided to turn these off, fix any corruptions
>>> we currently had, and run a major compaction on all nodes; once this was
>>> done we had a "clean" cluster and we left the cluster for a few days.
>>> After the process we noticed one corruption in the cluster; this datafile
>>> was created after I turned off the maintenance scripts, so my theory of
>>> the scripts causing the issue was wrong. We then kicked off another
>>> repair and started to find more corrupt files created after the
>>> maintenance script was turned off.
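>>>
>>> (A side note on the remove-and-bounce step above: before deleting a
>>> datafile it can be worth letting Cassandra try to rewrite it first. A
>>> hedged sketch using the stock 2.2 tools - the table name is a
>>> placeholder, and the keyspace is taken from the paths in this thread:
>>>
>>> # online: rewrites the table's sstables, snapshotting first and
>>> # discarding rows it cannot read
>>> nodetool scrub KeyspaceMetadata <table>
>>> # offline equivalent, run while the instance is stopped
>>> sstablescrub KeyspaceMetadata <table>
>>>
>>> Whether scrub can salvage a block the SSD has zeroed out is another
>>> matter.)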
>>>
>>> So let me give you an example of a corrupted file and maybe someone
>>> might be able to work through it with me?
>>>
>>> When this corrupted file was reported in the log it looks like it was
>>> the repair that found it.
>>>
>>> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00"
>>> --until "2019-08-07 22:45:00"
>>>
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Writing
>>> Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1
>>> ops, 0%/0% of on/off-heap limit)
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a
>>> merkle tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on
>>> KeyspaceMetadata/x, (-1476350953672479093,-1474461
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread
>>> Thread[ValidationExecutor:825,1,main]
>>> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError:
>>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>>> /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:340)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:366)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:81)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:169)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.Iterators$7.computeNext(Iterators.java:645)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:174)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.LazilyCompactedRow.update(LazilyCompactedRow.java:187)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.repair.Validator.rowHash(Validator.java:201)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.repair.Validator.add(Validator.java:150)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1166)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> ~[na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: Caused by:
>>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>>> /data/ssd2/data/KeyspaceMetadata/x-x/l
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:216)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:226)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:352)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: ... 27 common frames omitted
>>> Aug 07 22:30:33 cassandra[34611]: Caused by:
>>> org.apache.cassandra.io.compress.CorruptBlockException:
>>> (/data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: ... 30 common frames omitted
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Not a global repair,
>>> will not do anticompaction
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Stopping gossiper
>>> Aug 07 22:30:33 cassandra[34611]: WARN 21:30:33 Stopping gossip by
>>> operator request
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Announcing shutdown
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Node /10.2.57.37
>>> state jump to shutdown
>>>
>>>
>>> So I went to the file system to see when this corrupt file was created,
>>> and it was created on July 30th at 15:55.
>>>
>>> root@x 0 01:14:03 ~ # ls -l
>>> /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>> -rw-r--r-- 1 cass_b cass_b 3182243670 Jul 30
>>> 15:55 /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>>
>>>
>>>
>>> So I checked /var/log/messages for errors around that time.
>>> The only thing that stands out to me is the message "Cannot perform a
>>> full major compaction as repaired and unrepaired sstables cannot be
>>> compacted together"; I'm not sure, though, whether this would be an
>>> issue or cause corruption.
>>>
>>> Jul 30 15:55:06 x systemd: Created slice User Slice of root.
>>> Jul 30 15:55:06 x systemd: Started Session c165280 of user root.
>>> Jul 30 15:55:06 x audispd: node=x. type=USER_START
>>> msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:06 x systemd: Removed slice User Slice of root.
>>> Jul 30 15:55:14 x tag_audit_log: type=USER_CMD
>>> msg=audit(1564498506.013:457932): pid=17533 uid=509 auid=4294967295
>>> ses=4294967295 msg='cwd="/"
>>> cmd=2F7573722F7362696E2F69706D692D73656E736F7273202D2D71756965742D6361636865202D2D7364722D63616368652D7265637265617465202D2D696E746572707265742D6F656D2D64617461202D2D6F75747075742D73656E736F722D7374617465202D2D69676E6F72652D6E6F742D617661696C61626C652D73656E736F7273202D2D6F75747075742D73656E736F722D7468726573686F6C6473
>>> terminal=? res=success'
>>> Jul 30 15:55:14 x tag_audit_log: type=USER_START
>>> msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing
>>> Memtable-compactions_in_progress@1462227999(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Cannot perform a full major
>>> compaction as repaired and unrepaired sstables cannot be compacted
>>> together. These two set of sstables will be compacted separately.
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing
>>> Memtable-compactions_in_progress@1198535528(1.002KiB serialized bytes,
>>> 57 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:20 x cassandra: INFO 14:55:20 Writing
>>> Memtable-compactions_in_progress@2039409834(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT
>>> msg=audit(1564498524.409:457934): pid=46620 uid=0 auid=464400029 ses=2747
>>> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=?
>>> terminal=/dev/pts/0 res=success'
>>> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT
>>> msg=audit(1564498524.409:457935): pid=4878 uid=0 auid=464400029 ses=2749
>>> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=?
>>> terminal=/dev/pts/1 res=success'
>>>
>>> Jul 30 15:55:57 x systemd: Created slice User Slice of root.
>>> Jul 30 15:55:57 x systemd: Started Session c165288 of user root.
>>> Jul 30 15:55:57 x audispd: node=x. type=USER_START
>>> msg=audit(1564498557.294:457958): pid=19687 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:57 x audispd: node=x. type=USER_START
>>> msg=audit(1564498557.298:457959): pid=19690 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_systemd,pam_keyinit,pam_limits,pam_unix
>>> acct="cass_b" exe="/usr/sbin/runuser" hostname=? addr=? terminal=?
>>> res=success'
>>> Jul 30 15:55:58 x systemd: Removed slice User Slice of root.
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing
>>> Memtable-compactions_in_progress@1532791194(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Cannot perform a full major
>>> compaction as repaired and unrepaired sstables cannot be compacted
>>> together. These two set of sstables will be compacted separately.
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing
>>> Memtable-compactions_in_progress@1455399453(0.281KiB serialized bytes,
>>> 16 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:56:04 x tag_audit_log: type=USER_CMD
>>> msg=audit(1564498555.190:457951): pid=19294 uid=509 auid=4294967295
>>> ses=4294967295 msg='cwd="/"
>>> cmd=72756E75736572202D73202F62696E2F62617368202D6C20636173735F62202D632063617373616E6472612D6D6574612F63617373616E6472612F62696E2F6E6F6465746F6F6C2074707374617473
>>> terminal=? res=success'
>>>
>>>
>>>
>>> We have checked a number of other things like NTP settings, etc., but
>>> nothing is telling us what could cause so many corruptions across the
>>> entire cluster.
>>> Things were healthy with this cluster for months; the only thing I can
>>> think of is that we started loading data, going from 20GB per instance
>>> up to 200GB where it sits now - maybe this just highlighted the issue.
>>>
>>>
>>>
>>> Compaction and Compression on Keyspace CFs [mixture]
>>> All CFs are using compression.
>>>
>>> AND compaction = {'min_threshold': '4', 'class':
>>> 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
>>> 'max_threshold': '32'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>>
>>> AND compaction = {'min_threshold': '4', 'class':
>>> 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
>>> 'max_threshold': '32'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*LZ4Compressor*'}
>>>
>>> AND compaction = {'class': 'org.apache.cassandra.db.compaction.
>>> *LeveledCompactionStrategy*'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>>
>>> --We are also using internode network compression:
>>> internode_compression: all
>>>
>>>
>>>
>>> Does anyone have any idea what I should check next?
>>> Our next theory is that there may be an issue with checksums, but I'm
>>> not sure where to go with this.
>>>
>>>
>>>
>>> Any help would be very much appreciated before I lose the last bit of
>>> hair I have on my head.
>>>
>>>
>>>
>>> Kind Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>> On Wed, 7 Aug 2019 at 20:51, Nitan Kainth <nitankai...@gmail.com> wrote:
>>>
>>> Repair during upgrade has caused corruption too.
>>>
>>>
>>>
>>> Also, dropping and adding columns with the same name but a different type
>>>
>>>
>>>
>>> Regards,
>>>
>>> Nitan
>>>
>>> Cell: 510 449 9629
>>>
>>>
>>> On Aug 7, 2019, at 2:42 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>> Is compression enabled?
>>>
>>>
>>>
>>> If not, bit flips on disk can corrupt data files, and reads + repair may
>>> send that corruption to other hosts in the cluster
>>>
>>>
>>> On Aug 7, 2019, at 3:46 AM, Philip Ó Condúin <philipocond...@gmail.com>
>>> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I am currently experiencing multiple datafile corruptions across most
>>> nodes in my cluster; there seems to be no pattern to the corruption. I'm
>>> starting to think it might be a bug; we're using Cassandra 2.2.13.
>>>
>>>
>>>
>>> Without going into detail about the issue, I just want to confirm
>>> something.
>>>
>>>
>>>
>>> Can someone share with me a list of scenarios that would cause
>>> corruption?
>>>
>>>
>>>
>>> 1. OS failure
>>>
>>> 2. Cassandra disturbed during writing
>>>
>>>
>>>
>>> etc etc.
>>>
>>>
>>>
>>> I need to investigate each scenario and don't want to leave any out.
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Phil
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Phil
>>>
>>
>
> --
> Regards,
> Phil
>

--
regards,
Laxmikant Upadhyay
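P.S. For anyone who finds this later: the "scheduled TRIM function" in the
RCA above is typically systemd's fstrim.timer on RHEL 7, though some setups
run fstrim from cron instead - which mechanism was disabled here is my
assumption, not something stated in the thread. A minimal sketch to check
and turn it off:

# is a periodic TRIM timer active?
systemctl status fstrim.timer
# stop and disable it
systemctl stop fstrim.timer
systemctl disable fstrim.timer
# also check for a cron-driven fstrim
grep -r fstrim /etc/cron* 2>/dev/null

Note that the "discard" mount option in /etc/fstab is a separate, continuous
form of TRIM; disabling the scheduled job does not affect it.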