Hi all - I replaced a node in a 14-node cluster, and it rebuilt OK. I
then started seeing a lot of timeout errors, and discovered that one of
the nodes was constantly repeating this message:
"waiting to acquire a permit to begin streaming" - so perhaps I hit this
bug:
https://www.mail-archive.com/commits@cassandra.apache.org/msg284709.html
I then restarted that node, but it gave a bunch of errors about
"unexpected disk state: failed to read transaction log"
I deleted the corresponding files and got that node to come up, but now
when I restart any of the other nodes in the cluster, they fail to
start as well. Example:
INFO [main] 2023-08-30 09:50:46,130 LogTransaction.java:544 - Verifying
logfile transaction
[nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
ERROR [main] 2023-08-30 09:50:46,154 LogReplicaSet.java:145 - Mismatched
line in file nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log: got
'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]'
expected
'ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]',
giving up
ERROR [main] 2023-08-30 09:50:46,155 LogFile.java:164 - Failed to read
records for transaction log
[nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
ERROR [main] 2023-08-30 09:50:46,156 LogTransaction.java:559 -
Unexpected disk state: failed to read transaction log
[nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log in
/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3,
/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3]
Files and contents follow:
/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]
ABORT:[,0,0][737437348]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb_txn_stream_6bfe4220-43b9-11ee-9649-316c953ea746.log
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37640-big-,0,8][2833571752]
***Does not match
<ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37639-big-,0,8][1997892352]>
in first replica file
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37644-big-,0,8][3122518803]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37643-big-,0,8][2875951075]
ADD:[/data/1/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37642-big-,0,8][884016253]
ADD:[/data/4/cassandra/data/doc/extractedmetadata-25c210e0ada011ebade9fdc1d34336d3/nb-37641-big-,0,8][926833718]
ERROR [main] 2023-08-30 09:50:46,156 CassandraDaemon.java:897 - Cannot
remove temporary or obsoleted files for doc.extractedmetadata due to a
problem with transaction log files. Please check records with problems
in the log messages above and fix them. Refer to the 3.0 upgrading
instructions in NEWS.txt for a description of transaction log files.
I then delete the offending files, and eventually, after many
iterations, the node comes back up.
Just as a data point, the table 'extractedmetadata' has 29 billion
records. Is the 'right' thing to do here simply to go to each node in
turn, stop it, clean up the leftover transaction log files, and bring
it back up?
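For reference, here is a rough sketch of the per-node procedure I have in mind. The data root (/data with numbered disk directories) and the keyspace path ("doc") mirror my layout; the helper function name is just for illustration:

```shell
# Hedged sketch: list leftover streaming transaction logs under a data root.
# The "doc" keyspace path and /data layout are assumptions from my cluster.
list_stream_txn_logs() {
  # $1: data root, e.g. /data (containing the numbered disk dirs 1/, 4/, ...)
  find "$1" -path '*/cassandra/data/doc/*' -name 'nb_txn_stream_*.log' -print 2>/dev/null
}

# On each node, roughly:
#   1. nodetool drain, then stop Cassandra on the node
#   2. list_stream_txn_logs /data    # review the mismatched logs first
#   3. remove them, then start Cassandra again and wait for the node
#      to rejoin before moving on to the next one
```

Rolling through one node at a time, rather than all at once, keeps the rest of the cluster serving reads while each node restarts.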
-Joe