Tianning Zhang <Tianning.Zhang@...> writes:

> Hi,
>
> After switching from Kafka version 2.10-0.8.1.1 to 2.10-0.8.2.1 I frequently encounter the
> exception below, which results in re-election of the leaders for the partitions. This exception
> occurs every one to several hours.
>
> We have a 3-node cluster. We did not see this problem while we were still using version 0.8.1.1.
> It does not seem to be a problem of limits on open sockets/files, as increasing the limits did not
> alleviate the problem.
>
> Has anyone seen a similar problem? What are possible fixes?
>
> I noticed that a similar issue was reported in
> (http://mail-archives.apache.org/mod_mbox/kafka-users/201501.mbox/%3CCAHwHRrW7qMLByv85pe7Vg17ksFrMSWtwHxGSjuHJawDFeWBuag-JsoAwUIsXosN+BqQ9rBEUg <at> public.gmane.org%3E)
> for version 0.8.1. Did you find a solution in the end?
>
> Thanks!
>
> Tianning
> __________________________
> FATAL [Replica Manager on Broker 3]: Error writing to highwatermark file: (kafka.server.ReplicaManager)
> java.io.FileNotFoundException: /data/replication-offset-checkpoint.tmp (Permission denied)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:165)
>     at kafka.server.OffsetCheckpoint.write(OffsetCheckpoint.scala:37)
>     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$2.apply(Re
>     at kafka.server.ReplicaManager$$anonfun$checkpointHighWatermarks$2.apply(Re
>     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(Tra
>     at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
>     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scal
>     at kafka.server.ReplicaManager.checkpointHighWatermarks(ReplicaManager.scal
>     at kafka.server.ReplicaManager$$anonfun$1.apply$mcV$sp(ReplicaManager.scala
>     at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:
>     at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:35
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.acc
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja
>     at java.lang.Thread.run(Thread.java:722)
The above problem in our data center has been solved. It appears to have been a reliability problem with the storage system: after moving to a new storage provider, the exception disappeared.
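For context, the checkpoint that fails here is written via a temporary file that is then swapped into place (the .tmp suffix in the exception is that temporary file). Below is a minimal sketch of that write-then-rename pattern in Scala, not the actual kafka.server.OffsetCheckpoint code; the object name CheckpointSketch and the method signature are made up for illustration. It shows why a storage layer that intermittently refuses to create the temp file makes the scheduled high-watermark task blow up with exactly this exception.

    import java.io._

    // Sketch of a write-to-temp-then-rename checkpoint, similar in spirit to what the
    // stack trace suggests OffsetCheckpoint.write does (this is NOT the Kafka source).
    object CheckpointSketch {
      def write(dir: File, offsets: Map[String, Long]): Unit = {
        val temp   = new File(dir, "replication-offset-checkpoint.tmp")
        val target = new File(dir, "replication-offset-checkpoint")

        // Opening the temp file is where a FileNotFoundException (Permission denied)
        // surfaces when the underlying storage misbehaves.
        val writer = new BufferedWriter(new FileWriter(temp))
        try {
          offsets.foreach { case (topicPartition, offset) =>
            writer.write(s"$topicPartition $offset")
            writer.newLine()
          }
          writer.flush()
        } finally {
          writer.close()
        }

        // Replace the old checkpoint in one step so readers never see a partial file.
        if (!temp.renameTo(target))
          throw new IOException(s"Failed to rename ${temp.getAbsolutePath} to ${target.getAbsolutePath}")
      }
    }

Because the broker only ever writes the temp file and then renames it, any transient permission or I/O failure from the storage backend hits this scheduled task first, which matches the storage-reliability explanation above.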