[ 
https://issues.apache.org/jira/browse/KAFKA-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096436#comment-17096436
 ] 

yongmao wang commented on KAFKA-7575:
-------------------------------------

Here is my initial investigation of this issue:

 

The error happened when renaming a file (replication-offset-checkpoint.tmp) to 
another name, and it caused the Kafka service to stop. In the Kafka logs, we 
saw that "AccessDeniedException" and "FileAlreadyExistsException" were thrown. 
After running some test code, I found that the same errors could be reproduced 
only when a buffered reader is created with the "Files.newBufferedReader()" 
method and is not closed before the file move operation.

I reproduced the same exceptions by writing test code on my Windows debug 
machine.

I ran the same file-move code as the Kafka method "atomicMoveWithFallback()" 
and tried to find out which case would generate the same exceptions. The 
screenshot below shows that if a buffered reader is created with the 
"Files.newBufferedReader()" method and is not closed before the file move 
operation, the move throws "java.nio.file.AccessDeniedException" first and 
then "java.nio.file.FileAlreadyExistsException":

!https://rally1.rallydev.com/slm/attachment/372362614736/Screen%20Shot%202020-02-25%20at%206.31.31%20AM.png|width=1256,height=497!
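
For readers without access to the screenshot, the reproduction described above 
can be sketched roughly as follows. This is a minimal sketch, not the original 
test code and not a copy of Kafka's "atomicMoveWithFallback()": the class name 
"OpenReaderMoveRepro" and the helpers "moveWithFallback()" and "run()" are 
hypothetical, and the two-step move (atomic first, then a replacing move) is 
an assumption about the pattern being exercised. It records the class names of 
any exceptions instead of asserting a particular outcome, because the result 
depends on the operating system.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class OpenReaderMoveRepro {

    // Hypothetical helper: try an atomic move, then fall back to a replacing
    // move, recording the class name of every IOException seen on the way.
    static List<String> moveWithFallback(Path source, Path target) {
        List<String> errors = new ArrayList<>();
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException outer) {
            errors.add(outer.getClass().getName());
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException inner) {
                errors.add(inner.getClass().getName());
            }
        }
        return errors;
    }

    // Performs the move while a reader on the target is still open
    // (leaveOpen = true) or after it has been closed (leaveOpen = false).
    static List<String> run(boolean leaveOpen) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-repro");
        Path target = dir.resolve("replication-offset-checkpoint");
        Path tmp = dir.resolve("replication-offset-checkpoint.tmp");
        Files.write(target, "old".getBytes(StandardCharsets.UTF_8));
        Files.write(tmp, "new".getBytes(StandardCharsets.UTF_8));

        BufferedReader reader = Files.newBufferedReader(target, StandardCharsets.UTF_8);
        try {
            reader.readLine();
            if (!leaveOpen) {
                reader.close();
            }
            return moveWithFallback(tmp, target);
        } finally {
            reader.close(); // closing an already closed reader has no effect
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("reader closed before move: " + run(false));
        System.out.println("reader open during move:   " + run(true));
    }
}
```

On Windows, the second line is the case expected to list exceptions; on 
POSIX systems a rename over an open file normally succeeds, so both lists are 
likely to be empty there.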


Then I reviewed the Kafka source code again and found another method, 
["read()" in 
"CheckpointFile.scala"|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/checkpoints/CheckpointFile.scala#L127].
 It calls "Files.newBufferedReader()" to read the contents of the file 
"replication-offset-checkpoint". So my guess is that the error occurs when 
another thread is calling this read() method while the write() method is trying 
to move the replication-offset-checkpoint.tmp file to 
replication-offset-checkpoint, which makes the file rename operation fail.

Then I used the "Find Usages" function to find out where this method is 
called, but the read() method is used in a lot of places, so it is very 
difficult to identify which call site causes this threading problem.

I noticed that in the same file, "CheckpointFile.scala", the write() method 
sits alongside the "read()" method that opens a buffered reader, so I 
suspected a multithreading problem. However, both methods wrap their whole 
code block in "lock synchronized", so they should not be able to run at the 
same time. This is puzzling, and for now it is very difficult to pin down the 
root cause.
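
To make the locking pattern concrete, here is a minimal Java sketch of a 
checkpoint-style class in which read() and write() share one lock and the 
reader is closed before the lock is released. It is an illustration only: 
"SimpleCheckpoint" and its fields are hypothetical names, and it is not 
Kafka's CheckpointFile implementation. Note that a lock like this only 
serializes callers that share the same object; whether that matters for this 
issue is an open question, not a finding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a checkpoint file; not Kafka's CheckpointFile.
public class SimpleCheckpoint {
    private final Object lock = new Object();
    private final Path path;
    private final Path tempPath;

    public SimpleCheckpoint(Path path) {
        this.path = path;
        this.tempPath = path.resolveSibling(path.getFileName() + ".tmp");
    }

    // Writes to the .tmp file first, then moves it over the real file.
    public void write(List<String> lines) throws IOException {
        synchronized (lock) {
            Files.write(tempPath, lines, StandardCharsets.UTF_8);
            Files.move(tempPath, path, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // try-with-resources closes the reader before the lock is released,
    // so a write() on the same instance never sees an open reader.
    public List<String> read() throws IOException {
        synchronized (lock) {
            List<String> lines = new ArrayList<>();
            try (BufferedReader reader =
                     Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            return lines;
        }
    }
}
```

A caller would construct one instance per checkpoint path and share it between 
threads, for example `new SimpleCheckpoint(dir.resolve("replication-offset-checkpoint"))`.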


What are your thoughts? Thanks.

> 'Error while writing to checkpoint file' Issue
> ----------------------------------------------
>
>                 Key: KAFKA-7575
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7575
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 1.1.1
>         Environment: Windows 10, Kafka 1.1.1
>            Reporter: Dasun Nirmitha
>            Priority: Major
>         Attachments: Dry run error.rar
>
>
> I'm currently testing a Java Kafka producer application coded to retrieve a 
> db value from a local mysql db and produce to a single topic. Locally I've 
> got a Zookeeper server and a Kafka single broker running.
> My issue is I need to produce this from the Kafka producer each second, and 
> that works for around 2 hours until broker throws an 'Error while writing to 
> checkpoint file' and shuts down. Producing with a 1 minute interval works 
> with no issues but unfortunately I need the produce interval to be 1 second.
> I have attached a rar containing screenshots of the Errors thrown from the 
> Broker and my application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
