There are at least two bugs in the compaction lifecycle transaction log: one that can write an ABORT and an ADD in the wrong order (which prevents startup), and one that allows invalid timestamps in the log file (which, again, prevents startup).
I believe it's safe to work around the former by removing the .log file, and you can work around the latter by using `touch` to update the timestamps of the data file that mismatches, but I can't find the relevant JIRAs to be 100% sure.

(Also, it may be a good trigger to cut a new release, because things that block startup are obviously quite serious.)

On Wed, Aug 30, 2023 at 6:59 AM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> Hi all - I replaced a node in a 14 node cluster, and it rebuilt OK. I
> started to see a lot of timeout errors, and discovered one of the nodes
> had this message constantly repeated:
> "waiting to acquire a permit to begin streaming" - so perhaps I hit this
> bug:
> https://www.mail-archive.com/commits@cassandra.apache.org/msg284709.html
>
> I then restarted that node, but it gave a bunch of errors about
> "unexpected disk state: failed to read transaction log"
> I deleted the corresponding files and got that node to come up, but now
> when I restart any of the other nodes in the cluster, they too do not
> start back up:
>
> Example:
>
> INFO [main] 2023-08-30 09:50:46,130 LogTransaction.java:544 - Verifying logfile transaction [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,154 LogReplicaSet.java:145 - Mismatched line in file nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log: got 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]' expected 'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]', giving up
> ERROR [main] 2023-08-30 09:50:46,155 LogFile.java:164 - Failed to read records for transaction log [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> ERROR [main] 2023-08-30 09:50:46,156 LogTransaction.java:559 - Unexpected disk state: failed to read transaction log [nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3, /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
> Files and contents follow:
> /data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]
> ABORT:[,0,0][737437348]
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
> /data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
> ***Does not match <ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]> in first replica file
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
> ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
> ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
> ERROR [main] 2023-08-30 09:50:46,156 CassandraDaemon.java:897 - Cannot remove temporary or obsoleted files for doc.extractedmetadata due to a problem with transaction log files. Please check records with problems in the log messages above and fix them. Refer to the 3.0 upgrading instructions in NEWS.txt for a description of transaction log files.
>
> I then delete the files and eventually after many iterations, the node
> comes back up.
> The table 'extractedmetadata' has 29 billion records. Just a data point
> here - I think the 'right' thing to do is just to go to each node and
> stop it, clean up the files, and finally get each one back up?
>
> -Joe
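
If you want to check the other nodes before restarting them, here is a rough Python sketch for comparing the replica copies of a transaction log, the way the quoted output above compares the /data/1 and /data/4 copies. It is not Cassandra's own code. The record layout (TYPE:[path,number,number][checksum]) is only inferred from the quoted lines, treating COMMIT as a terminal record alongside ABORT is my assumption, and the function names are made up for illustration. It does not verify checksums.

```python
import re
from typing import List, Optional, Tuple

# Assumed layout, inferred from the quoted log lines: TYPE:[path,n,n][checksum]
RECORD_RE = re.compile(r"^[A-Z]+:\[[^,\]]*,\d+,\d+\]\[\d+\]$")

# ABORT appears in the quoted log; COMMIT is an assumption on my part.
TERMINAL_PREFIXES = ("ABORT:", "COMMIT:")


def parse_records(text: str) -> List[str]:
    """Return the non-empty record lines, rejecting anything unrecognised."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if RECORD_RE.match(line) is None:
            raise ValueError(f"unrecognised record: {line!r}")
        records.append(line)
    return records


def first_mismatch(
    a: List[str], b: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, record_a, record_b) at the first difference, else None."""
    for i in range(max(len(a), len(b))):
        rec_a = a[i] if i < len(a) else None
        rec_b = b[i] if i < len(b) else None
        if rec_a != rec_b:
            return i, rec_a, rec_b
    return None


def records_after_terminal(records: List[str]) -> List[str]:
    """Return any records that follow the first ABORT/COMMIT record."""
    for i, rec in enumerate(records):
        if rec.startswith(TERMINAL_PREFIXES):
            return records[i + 1:]
    return []


if __name__ == "__main__":
    import sys

    # Usage: python compare_txn_logs.py <replica-1.log> <replica-2.log>
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        rep_a, rep_b = parse_records(f1.read()), parse_records(f2.read())
    print("first mismatch:", first_mismatch(rep_a, rep_b))
    print("records after ABORT/COMMIT:", records_after_terminal(rep_a))
```

Run against the two files quoted above, I would expect this to flag the first line as the mismatch and to list the ADD records that follow the ABORT in the /data/1 copy, which is the wrong-order symptom. It only reads the files, so it is safe to run on a stopped node before deciding whether to remove the .log files.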