Cool! Thanks for sharing the RCA.

On Wed, Oct 9, 2019 at 2:56 PM Philip Ó Condúin <philipocond...@gmail.com>
wrote:
> Just to follow up on this issue as others may see it in the future, we
> cracked it!
>
> Our datafile corruption issues were a problem with the OS wrongly taking
> one block belonging to a C* data file, thinking it was no longer used, and
> treating it as a free block that would later be reused.
>
> For example:
> C* deletes a file after compaction, the OS collects all blocks which are
> free now and sends a TRIM command to the SSD, but the SSD from time to
> time picks the wrong block, not the one reported by the OS - does the trim
> - causing zeroized blocks to be seen in the datafile, and the block is
> later used for a different file.
> So the symptom is: we suddenly see 4096 zeroes in the datafile - it means
> the SSD just trimmed the block; after some time we can see some data
> written to those blocks - it means the block is now used by another file,
> and that gives us a corrupt file.
>
> We turned off the scheduled TRIM function on the OS and we are no longer
> getting corruptions.
>
> This was very difficult to pinpoint.
>
> On Thu, 15 Aug 2019 at 00:09, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
>> If you hadn't mentioned the fact you are using physical disks I would have
>> guessed you were using virtual disks on a SAN. I've seen this sort of thing
>> happen a lot there. Are there any virtual layers between the cassandra
>> process and the hardware? Just a reminder, fsync can be a liar and the
>> virtual layer can mock the response back to user land while the actual bits
>> can be dropped before hitting the disk.
>>
>> If not, you should be looking hard at your disk options. fstab,
>> schedulers, etc. In that case, you need this:
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>>
>>
>> Patrick
>>
>> On Wed, Aug 14, 2019 at 2:03 PM Forkalsrud, Erik <eforkals...@cj.com>
>> wrote:
>>
>>> The dmesg command will usually show information about hardware errors.
>>>
>>> An example from a spinning disk:
>>> sd 0:0:10:0: [sdi] Unhandled sense code
>>> sd 0:0:10:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>>> sd 0:0:10:0: [sdi] Sense Key : Medium Error [current]
>>> Info fld=0x6fc72
>>> sd 0:0:10:0: [sdi] Add. Sense: Unrecovered read error
>>> sd 0:0:10:0: [sdi] CDB: Read(10): 28 00 00 06 fc 70 00 00 08 00
>>>
>>>
>>> Also, you can read the file like
>>> "cat /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db >
>>> /dev/null"
>>> If you get an error message, it's probably a hardware issue.
>>>
>>> - Erik -
>>>
>>> ------------------------------
>>> *From:* Philip Ó Condúin <philipocond...@gmail.com>
>>> *Sent:* Thursday, August 8, 2019 09:58
>>> *To:* user@cassandra.apache.org <user@cassandra.apache.org>
>>> *Subject:* Re: Datafile Corruption
>>>
>>> Hi Jon,
>>>
>>> Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme
>>> but we could still be using it.
>>> We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.
>>>
>>> Our kernel is the following; we're using Red Hat so I'm told we can't
>>> upgrade the version until the next major release anyway.
>>> root@cass 0 17:32:28 ~ # uname -r
>>> 3.10.0-957.5.1.el7.x86_64
>>>
>>> Cheers,
>>> Phil
>>>
>>> On Thu, 8 Aug 2019 at 17:35, Jon Haddad <j...@jonhaddad.com> wrote:
>>>
>>> Any chance you're using NVMe with an older Linux kernel? I've seen a
>>> *lot* of filesystem errors from using older CentOS versions. You'll want
>>> to be using a version > 4.15.
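>>>
>>> A quick way to check both of those (a minimal sketch, assuming the
>>> stock util-linux on RHEL 7; device names are whatever your box has):
>>>
>>> # kernel version
>>> uname -r
>>> # any NVMe block devices visible?
>>> ls /dev/nvme* 2>/dev/null
>>> # ROTA=0 means the kernel sees the disk as non-rotational (SSD)
>>> lsblk -d -o NAME,ROTA,MODEL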
>>>
>>> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin <
>>> philipocond...@gmail.com> wrote:
>>>
>>> *@Jeff* - If it was hardware, that would explain it all, but do you
>>> think it's possible to have every server in the cluster with a hardware
>>> issue?
>>> The data is sensitive and the customer would lose their mind if I sent
>>> it off-site, which is a pity because I could really do with the help.
>>> The corruption is occurring irregularly on every server, instance, and
>>> column family in the cluster. Out of 72 instances, we are getting maybe 10
>>> corrupt files per day.
>>> We are using vnodes (256) and it is happening in both DCs.
>>>
>>> *@Asad* - internode compression is set to ALL on every server. I have
>>> checked the packets for the private interconnect and I can't see any
>>> dropped packets; there are dropped packets for other interfaces, but not
>>> for the private ones. I will get the network team to double-check this.
>>> The corruption is only on the application schema; we are not getting
>>> corruption on any system or cass keyspaces. Corruption is happening in
>>> both DCs. We are getting corruption for the 1 application schema we have
>>> across all tables in the keyspace; it's not limited to one table.
>>> I'm not sure why the app team decided not to use default compression; I
>>> must ask them.
>>>
>>>
>>>
>>> I have been checking /var/log/messages today going back a few weeks
>>> and can see a serious amount of broken pipe errors across all servers and
>>> instances.
>>> Here is a snippet from one server, but most pipe errors are similar:
>>>
>>> Jul 9 03:00:08 cassandra: INFO 02:00:08 Writing
>>> Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072
>>> ops, 0%/0% of on/off-heap limit)
>>> Jul 9 03:00:13 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:19 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:22 cassandra: ERROR 02:00:22 Got an IOException during
>>> write!
>>> Jul 9 03:00:22 cassandra: java.io.IOException: Broken pipe
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:22 cassandra: at
>>> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
>>> ~[libthrift-0.9.2.jar:0.9.2]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.Message.write(Message.java:222)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:22 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:25 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:30 cassandra: ERROR 02:00:30 Got an IOException during
>>> write!
>>> Jul 9 03:00:30 cassandra: java.io.IOException: Broken pipe
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>>> ~[na:1.8.0_172]
>>> Jul 9 03:00:30 cassandra: at
>>> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
>>> ~[libthrift-0.9.2.jar:0.9.2]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.Message.write(Message.java:222)
>>> ~[thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:30 cassandra: at
>>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
>>> [thrift-server-0.3.7.jar:na]
>>> Jul 9 03:00:31 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:37 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>> Jul 9 03:00:43 kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>>>
>>>
>>> On Thu, 8 Aug 2019 at 15:42, ZAIDI, ASAD A <az1...@att.com> wrote:
>>>
>>> Did you check whether packets are being dropped on the network interfaces
>>> the Cassandra instances are consuming (ifconfig -a)? Internode compression
>>> is set for all endpoints – maybe the network is playing a role here?
>>>
>>> Is this corruption limited to certain keyspaces/tables or DCs, or is it
>>> widespread? The log snippet you shared looked like only a specific
>>> keyspace/table is affected – is that correct?
>>>
>>> When you remove the corrupted sstable of a certain table, I guess you
>>> verify all nodes for corrupted sstables of the same table (maybe with the
>>> nodetool scrub tool) so as to limit the spread of corruption – right?
>>>
>>> Just curious – you're not using the lz4/default compressor for all
>>> tables; there must be some reason for it.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Philip Ó Condúin [mailto:philipocond...@gmail.com]
>>> *Sent:* Thursday, August 08, 2019 6:20 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Datafile Corruption
>>>
>>>
>>>
>>> Hi All,
>>>
>>> Thank you so much for the replies.
>>>
>>> Currently, I have the following list of things that can potentially cause
>>> some sort of corruption in a Cassandra cluster.
>>>
>>> - Sudden power cut - *We have had no power cuts in the datacenters*
>>> - Network issues - *no network issues from what I can tell*
>>> - Disk full - *I don't think this is an issue for us, see disks
>>> below.*
>>> - An issue in the Cassandra version, like CASSANDRA-13752 - *couldn't
>>> find any Jira issues similar to ours.*
>>> - Bit flips - *we have compression enabled so I don't think this
>>> should be an issue.*
>>> - Repair during upgrade has caused corruption too - *we have not
>>> upgraded*
>>> - Dropping and adding columns with the same name but a different
>>> type - *I will need to ask the apps team how they are using the
>>> database.*
>>>
>>>
>>> OK, let me try and explain the issue we are having; I am under a lot of
>>> pressure from above to get this fixed and I can't figure it out.
>>>
>>> This is a PRE-PROD environment.
>>>
>>> - 2 datacenters.
>>> - 9 physical servers in each datacenter
>>> - 4 Cassandra instances on each server
>>> - 72 Cassandra instances across the 2 data centres, 36 in site A, 36
>>> in site B.
>>>
>>>
>>> We also have 2 Reaper nodes we use for repair: one Reaper node in each
>>> datacenter, each running with its own Cassandra back end, in a cluster
>>> together.
>>>
>>> OS Details [Red Hat Linux]
>>> cass_a@x 0 10:53:01 ~ $ uname -a
>>> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
>>> NAME="Red Hat Enterprise Linux Server"
>>> VERSION="7.6 (Maipo)"
>>> ID="rhel"
>>>
>>> Storage Layout
>>> cass_a@xx 0 10:46:28 ~ $ df -h
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/mapper/vg01-lv_root 20G 2.2G 18G 11% /
>>> devtmpfs 63G 0 63G 0% /dev
>>> tmpfs 63G 0 63G 0% /dev/shm
>>> tmpfs 63G 4.1G 59G 7% /run
>>> tmpfs 63G 0 63G 0% /sys/fs/cgroup
>> 4 cassandra instances
>>> /dev/sdd 1.5T 802G 688G 54% /data/ssd4
>>> /dev/sda 1.5T 798G 692G 54% /data/ssd1
>>> /dev/sdb 1.5T 681G 810G 46% /data/ssd2
>>> /dev/sdc 1.5T 558G 932G 38% /data/ssd3
>>> /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>> Cassandra load is about 200GB and the rest of the space is snapshots
>>>
>>> CPU
>>> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
>>> CPU(s): 64
>>> Thread(s) per core: 2
>>> Core(s) per socket: 16
>>> Socket(s): 2
>>>
>>> *Description of problem:*
>>> During repair of the cluster, we are seeing multiple corruptions in the
>>> log files on a lot of instances. There seems to be no pattern to the
>>> corruption. It seems that the repair job is finding all the corrupted
>>> files for us. The repair will hang on the node where the corrupted file
>>> is found. To fix this we remove/rename the datafile and bounce the
>>> Cassandra instance. Our hardware/OS team have stated there is no problem
>>> on their side. I do not believe it is the repair causing the corruption.
>>>
>>> We have maintenance scripts that run every night, running compactions
>>> and creating snapshots. I decided to turn these off, fix any corruptions
>>> we currently had, and run a major compaction on all nodes; once this was
>>> done we had a "clean" cluster and we left the cluster for a few days.
>>> After the process we noticed one corruption in the cluster; this datafile
>>> was created after I turned off the maintenance scripts, so my theory of
>>> the scripts causing the issue was wrong. We then kicked off another
>>> repair and started to find more corrupt files created after the
>>> maintenance script was turned off.
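>>>
>>> (A side note on the remove-and-bounce step above: before deleting a
>>> datafile it can be worth letting Cassandra try to rewrite it first. A
>>> hedged sketch using the stock 2.2 tools - the table name is a
>>> placeholder, and the keyspace is taken from the paths in this thread:
>>>
>>> # online: rewrites the table's sstables, snapshotting first and
>>> # discarding rows it cannot read
>>> nodetool scrub KeyspaceMetadata <table>
>>> # offline equivalent, run while the instance is stopped
>>> sstablescrub KeyspaceMetadata <table>
>>>
>>> Whether scrub can salvage a block the SSD has zeroed out is another
>>> matter.)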
>>>
>>> So let me give you an example of a corrupted file and maybe someone
>>> might be able to work through it with me?
>>>
>>> When this corrupted file was reported in the log it looks like it was
>>> the repair that found it.
>>>
>>> $ journalctl -u cassmeta-cass_b.service --since "2019-08-07 22:25:00"
>>> --until "2019-08-07 22:45:00"
>>>
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Writing
>>> Memtable-compactions_in_progress@830377457(0.008KiB serialized bytes, 1
>>> ops, 0%/0% of on/off-heap limit)
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Failed creating a
>>> merkle tree for [repair #9587a200-b95a-11e9-8920-9f72868b8375 on
>>> KeyspaceMetadata/x, (-1476350953672479093,-1474461
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Exception in thread
>>> Thread[ValidationExecutor:825,1,main]
>>> Aug 07 22:30:33 cassandra[34611]: org.apache.cassandra.io.FSReadError:
>>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>>> /x/ssd2/data/KeyspaceMetadata/x-1e453cb0
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:365)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:361)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:340)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:366)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:81)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:169)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.Iterators$7.computeNext(Iterators.java:645)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>>> ~[guava-16.0.jar:na]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:174)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.LazilyCompactedRow.update(LazilyCompactedRow.java:187)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.repair.Validator.rowHash(Validator.java:201)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.repair.Validator.add(Validator.java:150)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1166)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> ~[na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> [na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
>>> Aug 07 22:30:33 cassandra[34611]: Caused by:
>>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>>> /data/ssd2/data/KeyspaceMetadata/x-x/l
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:216)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:226)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:352)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: ... 27 common frames omitted
>>> Aug 07 22:30:33 cassandra[34611]: Caused by:
>>> org.apache.cassandra.io.compress.CorruptBlockException:
>>> (/data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big
>>> Aug 07 22:30:33 cassandra[34611]: at
>>> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)
>>> ~[apache-cassandra-2.2.13.jar:2.2.13]
>>> Aug 07 22:30:33 cassandra[34611]: ... 30 common frames omitted
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Not a global repair,
>>> will not do anticompaction
>>> Aug 07 22:30:33 cassandra[34611]: ERROR 21:30:33 Stopping gossiper
>>> Aug 07 22:30:33 cassandra[34611]: WARN 21:30:33 Stopping gossip by
>>> operator request
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Announcing shutdown
>>> Aug 07 22:30:33 cassandra[34611]: INFO 21:30:33 Node /10.2.57.37
>>> state jump to shutdown
>>>
>>>
>>> So I went to the file system to see when this corrupt file was created,
>>> and it was created on July 30th at 15:55.
>>>
>>> root@x 0 01:14:03 ~ # ls -l
>>> /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>> -rw-r--r-- 1 cass_b cass_b 3182243670 Jul 30
>>> 15:55 /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db
>>>
>>>
>>>
>>> So I checked /var/log/messages for errors around that time.
>>> The only thing that stands out to me is the message "Cannot perform a
>>> full major compaction as repaired and unrepaired sstables cannot be
>>> compacted together"; I'm not sure, though, whether this would be an
>>> issue or cause corruption.
>>>
>>> Jul 30 15:55:06 x systemd: Created slice User Slice of root.
>>> Jul 30 15:55:06 x systemd: Started Session c165280 of user root.
>>> Jul 30 15:55:06 x audispd: node=x. type=USER_START
>>> msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:06 x systemd: Removed slice User Slice of root.
>>> Jul 30 15:55:14 x tag_audit_log: type=USER_CMD
>>> msg=audit(1564498506.013:457932): pid=17533 uid=509 auid=4294967295
>>> ses=4294967295 msg='cwd="/"
>>> cmd=2F7573722F7362696E2F69706D692D73656E736F7273202D2D71756965742D6361636865202D2D7364722D63616368652D7265637265617465202D2D696E746572707265742D6F656D2D64617461202D2D6F75747075742D73656E736F722D7374617465202D2D69676E6F72652D6E6F742D617661696C61626C652D73656E736F7273202D2D6F75747075742D73656E736F722D7468726573686F6C6473
>>> terminal=? res=success'
>>> Jul 30 15:55:14 x tag_audit_log: type=USER_START
>>> msg=audit(1564498506.021:457933): pid=17533 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing
>>> Memtable-compactions_in_progress@1462227999(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Cannot perform a full major
>>> compaction as repaired and unrepaired sstables cannot be compacted
>>> together. These two set of sstables will be compacted separately.
>>> Jul 30 15:55:19 x cassandra: INFO 14:55:19 Writing
>>> Memtable-compactions_in_progress@1198535528(1.002KiB serialized bytes,
>>> 57 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:20 x cassandra: INFO 14:55:20 Writing
>>> Memtable-compactions_in_progress@2039409834(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT
>>> msg=audit(1564498524.409:457934): pid=46620 uid=0 auid=464400029 ses=2747
>>> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=?
>>> terminal=/dev/pts/0 res=success'
>>> Jul 30 15:55:24 x audispd: node=x. type=USER_LOGOUT
>>> msg=audit(1564498524.409:457935): pid=4878 uid=0 auid=464400029 ses=2749
>>> msg='op=login id=464400029 exe="/usr/sbin/sshd" hostname=? addr=?
>>> terminal=/dev/pts/1 res=success'
>>>
>>> Jul 30 15:55:57 x systemd: Created slice User Slice of root.
>>> Jul 30 15:55:57 x systemd: Started Session c165288 of user root.
>>> Jul 30 15:55:57 x audispd: node=x. type=USER_START
>>> msg=audit(1564498557.294:457958): pid=19687 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_limits,pam_keyinit,pam_limits,pam_tty_audit,pam_systemd,pam_unix
>>> acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'
>>> Jul 30 15:55:57 x audispd: node=x. type=USER_START
>>> msg=audit(1564498557.298:457959): pid=19690 uid=0 auid=4294967295
>>> ses=4294967295 msg='op=PAM:session_open
>>> grantors=pam_keyinit,pam_systemd,pam_keyinit,pam_limits,pam_unix
>>> acct="cass_b" exe="/usr/sbin/runuser" hostname=? addr=? terminal=?
>>> res=success'
>>> Jul 30 15:55:58 x systemd: Removed slice User Slice of root.
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing
>>> Memtable-compactions_in_progress@1532791194(0.008KiB serialized bytes,
>>> 1 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Cannot perform a full major
>>> compaction as repaired and unrepaired sstables cannot be compacted
>>> together. These two set of sstables will be compacted separately.
>>> Jul 30 15:56:02 x cassandra: INFO 14:56:02 Writing
>>> Memtable-compactions_in_progress@1455399453(0.281KiB serialized bytes,
>>> 16 ops, 0%/0% of on/off-heap limit)
>>> Jul 30 15:56:04 x tag_audit_log: type=USER_CMD
>>> msg=audit(1564498555.190:457951): pid=19294 uid=509 auid=4294967295
>>> ses=4294967295 msg='cwd="/"
>>> cmd=72756E75736572202D73202F62696E2F62617368202D6C20636173735F62202D632063617373616E6472612D6D6574612F63617373616E6472612F62696E2F6E6F6465746F6F6C2074707374617473
>>> terminal=? res=success'
>>>
>>>
>>>
>>> We have checked a number of other things like NTP settings, etc., but
>>> nothing is telling us what could cause so many corruptions across the
>>> entire cluster.
>>> Things were healthy with this cluster for months; the only thing I can
>>> think of is that we started loading data, going from 20GB per instance
>>> up to 200GB where it sits now - maybe this just highlighted the issue.
>>>
>>>
>>>
>>> Compaction and Compression on Keyspace CFs [mixture]
>>> All CFs are using compression.
>>>
>>> AND compaction = {'min_threshold': '4', 'class':
>>> 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
>>> 'max_threshold': '32'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>>
>>> AND compaction = {'min_threshold': '4', 'class':
>>> 'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
>>> 'max_threshold': '32'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*LZ4Compressor*'}
>>>
>>> AND compaction = {'class': 'org.apache.cassandra.db.compaction.
>>> *LeveledCompactionStrategy*'}
>>> AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.*SnappyCompressor*'}
>>>
>>> --We are also using internode network compression:
>>> internode_compression: all
>>>
>>>
>>>
>>> Does anyone have any idea what I should check next?
>>> Our next theory is that there may be an issue with checksums, but I'm
>>> not sure where to go with this.
>>>
>>>
>>>
>>> Any help would be very much appreciated before I lose the last bit of
>>> hair I have on my head.
>>>
>>>
>>>
>>> Kind Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>> On Wed, 7 Aug 2019 at 20:51, Nitan Kainth <nitankai...@gmail.com> wrote:
>>>
>>> Repair during upgrade has caused corruption too.
>>>
>>>
>>>
>>> Also, dropping and adding columns with the same name but a different type
>>>
>>>
>>>
>>> Regards,
>>>
>>> Nitan
>>>
>>> Cell: 510 449 9629
>>>
>>>
>>> On Aug 7, 2019, at 2:42 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>> Is compression enabled?
>>>
>>>
>>>
>>> If not, bit flips on disk can corrupt data files, and reads + repair may
>>> send that corruption to other hosts in the cluster
>>>
>>>
>>> On Aug 7, 2019, at 3:46 AM, Philip Ó Condúin <philipocond...@gmail.com>
>>> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I am currently experiencing multiple datafile corruptions across most
>>> nodes in my cluster; there seems to be no pattern to the corruption. I'm
>>> starting to think it might be a bug; we're using Cassandra 2.2.13.
>>>
>>>
>>>
>>> Without going into detail about the issue, I just want to confirm
>>> something.
>>>
>>>
>>>
>>> Can someone share with me a list of scenarios that would cause
>>> corruption?
>>>
>>>
>>>
>>> 1. OS failure
>>>
>>> 2. Cassandra disturbed during writing
>>>
>>>
>>>
>>> etc etc.
>>>
>>>
>>>
>>> I need to investigate each scenario and don't want to leave any out.
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Regards,
>>>
>>> Phil
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Phil
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Phil
>>>
>>
>
> --
> Regards,
> Phil
>

--
regards,
Laxmikant Upadhyay
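P.S. For anyone who finds this later: the "scheduled TRIM function" in the
RCA above is typically systemd's fstrim.timer on RHEL 7, though some setups
run fstrim from cron instead - which mechanism was disabled here is my
assumption, not something stated in the thread. A minimal sketch to check
and turn it off:

# is a periodic TRIM timer active?
systemctl status fstrim.timer
# stop and disable it
systemctl stop fstrim.timer
systemctl disable fstrim.timer
# also check for a cron-driven fstrim
grep -r fstrim /etc/cron* 2>/dev/null

Note that the "discard" mount option in /etc/fstab is a separate, continuous
form of TRIM; disabling the scheduled job does not affect it.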