Stefan Miklosovic created CASSANDRA-20490:
---------------------------------------------
Summary: Encountered "duplicate hardlink error" when repairing
Key: CASSANDRA-20490
URL: https://issues.apache.org/jira/browse/CASSANDRA-20490
Project: Apache Cassandra
Issue Type: Bug
Components: Consistency/Repair
Reporter: Stefan Miklosovic

A user reported:

Hi all,

We experience the following issue when executing full sequential repairs in Cassandra 4.0.10.

ERROR [RepairSnapshotExecutor:1] 2023-11-07 13:22:50,267 CassandraDaemon.java:581 - Exception in thread Thread[RepairSnapshotExecutor:1,5,main]
java.lang.RuntimeException: Tried to create duplicate hard link to /opt/ddb/data/pool/data1/test_keyspace/test1-c4b33340f0a211edb0cb2fb04a4be304/snapshots/bec3dba0-7d70-11ee-99d3-7bda513c2b90/nb-1-big-Filter.db
    at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:185)
    at org.apache.cassandra.io.sstable.format.SSTableReader.createLinks(SSTableReader.java:1624)
    at org.apache.cassandra.io.sstable.format.SSTableReader.createLinks(SSTableReader.java:1606)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1852)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2031)
    at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2017)
    at org.apache.cassandra.db.repair.CassandraTableRepairManager.lambda$snapshot$0(CassandraTableRepairManager.java:74)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Unknown Source)
ERROR [AntiEntropyStage:1] 2023-11-07 13:22:50,267 CassandraDaemon.java:581 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Unable to take a snapshot bec3dba0-7d70-11ee-99d3-7bda513c2b90 on test_keyspace/test1

This behavior reproduces consistently when all of the following are true:

* It is a normal sequential repair (--full and --sequential),
* It is not a global repair, meaning at least one datacenter is defined (--in-dc or --in-local-dc),
* The repair affects more than two Cassandra nodes.

For more than two Cassandra nodes, the parent repair session consists of multiple separate repair sessions towards different target endpoints. Full sequential repairs require that all participants flush and snapshot the data before starting the repair. Unfortunately, the separate repair sessions collide: the first one creates the ephemeral snapshot successfully, and the second one, which tries to create the same snapshot (the same hard links) on the same node, fails with the above error.

This issue is not seen in global repairs, where datacenters and hosts are not defined, because in that case there is an explicit check for an existing snapshot before proceeding.

I found a few issues in Jira about duplicate hard links, but all of them are from older versions and seem unrelated to this one.

Could you please help with this issue?

Thank you,
Panagiotis

https://lists.apache.org/thread/kwz89po5gkx68bhof7l7o0yykz48bnbw
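
The collision described in the report can be illustrated outside Cassandra with plain java.nio calls. The sketch below is not Cassandra code: the class SnapshotLinkSketch, its two methods, and the temporary directory layout are hypothetical names chosen for illustration. It only shows that creating the same hard link twice fails, and that a check for an existing snapshot directory (the kind of guard the report says global repairs have) avoids the second attempt.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class SnapshotLinkSketch {

    // Unguarded: hard-links one component into the snapshot directory.
    // Throws FileAlreadyExistsException if the link is already there.
    static void linkUnguarded(Path snapshotDir, Path component) throws IOException {
        Files.createDirectories(snapshotDir);
        Files.createLink(snapshotDir.resolve(component.getFileName()), component);
    }

    // Guarded: skips the work when the snapshot directory already exists.
    // Returns true only if this call created the snapshot.
    static boolean linkIfSnapshotAbsent(Path snapshotDir, Path component) throws IOException {
        if (Files.exists(snapshotDir)) {
            return false;
        }
        linkUnguarded(snapshotDir, component);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path table = Files.createTempDirectory("test1");
        Path component = Files.createFile(table.resolve("nb-1-big-Filter.db"));
        Path snapshot = table.resolve("snapshots").resolve("session-id");

        linkUnguarded(snapshot, component);      // first repair session succeeds
        try {
            linkUnguarded(snapshot, component);  // second session hits the same path
        } catch (FileAlreadyExistsException e) {
            System.out.println("duplicate hard link: " + e.getFile());
        }
        System.out.println("guarded attempt created snapshot: "
                + linkIfSnapshotAbsent(snapshot, component));
    }
}
```

Note that a check-then-create guard like this is not atomic, so two sessions that snapshot at the same instant could still race; whether the real fix needs stronger coordination is left open here.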