Re: Repair fails with java.io.IOError: java.io.EOFException

Sylvain Lebresne Tue, 26 Jul 2011 00:01:18 -0700

> If they are and repair has completed, use nodetool cleanup to remove the
> data the node is no longer responsible for. See bootstrap section above.


I've seen that said a few times, so allow me to correct it. Cleanup is useless
after a repair. 'nodetool cleanup' removes rows the node is no longer responsible
for and is thus useful only after operations that change the range a node is
responsible for (bootstrap, move, decommission). After a repair, you will need
compaction to kick in to see your disk usage come back to normal.
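
As a rough sketch of that rule of thumb, the shell function below maps an
operation to the follow-up that reclaims disk space. The function name
`followup_for` is hypothetical and only prints a suggestion. `cleanup` and
`compact` are nodetool subcommands, but forcing a compaction with `compact`
is an assumption about what the operator wants; waiting for normal compaction
to run works too. The host `localhost` is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cassandra): prints which nodetool
# subcommand is the relevant follow-up after a given operation.
followup_for() {
    case "$1" in
        # These change the range a node is responsible for, so a node may
        # still hold rows it no longer owns; cleanup removes them.
        bootstrap|move|decommission) echo "cleanup" ;;
        # Repair does not change ownership; the extra data only goes away
        # once compaction runs.
        repair) echo "compact" ;;
        *) echo "unknown" ;;
    esac
}

# Print (do not run) the follow-up command for a node on localhost.
echo "nodetool -h localhost $(followup_for repair)"
```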

--
Sylvain

> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
>
> Looks like the repair finished successfully the second time. However, the
> cluster is still severely unbalanced. I was hoping the repair would balance
> the nodes. We're using random partitioner. One node has 900GB and others
> have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
> insanely high. Not sure why or how to troubleshoot.
>
>
>
> On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui <cassandral...@gmail.com>
> wrote:
>>
>> I don't see a JVM crashlog ( hs_err_pid[pid].log) in
>> ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
>>
>> We're running a pretty up-to-date Sun Java:
>>
>> ubuntu@ip-10-2-x-x:/tmp$ java -version
>> java version "1.6.0_24"
>> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
>> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
>>
>> I'm gonna restart the Repair process in a few more hours. If there are any
>> additional debug or troubleshooting logs you'd like me to enable first,
>> please let me know.
>>
>> - Sameer
>>
>>
>>
>> On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> Did you check for a JVM crash log?
>>>
>>> You should make sure you're running the latest Sun JVM, older versions
>>> and OpenJDK in particular are prone to segfaulting.
>>>
>>> On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
>>> <cassandral...@gmail.com> wrote:
>>> > We are starting Cassandra with "brisk cassandra", so as a stand-alone
>>> > process, not a service.
>>> >
>>> > The syslog on the node doesn't show anything regarding the Cassandra
>>> > Java
>>> > process around the time the last entries were made in the Cassandra
>>> > system.log (2011-07-21 13:01:51):
>>> >
>>> > Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
>>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
>>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
>>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
>>> > /proc/kmsg started.
>>> > Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software="rsyslogd"
>>> > swVersion="4.2.0" x-pid="663" x-info="http://www.rsyslog.com"]
>>> > (re)start
>>> >
>>> >
>>> > The last thing in the Cassandra log before INFO Logging initialized is:
>>> >
>>> >  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
>>> > 128)
>>> > GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
>>> > is
>>> > 4030726144
>>> >
>>> >
>>> > I can start Repair again, but am worried that it will crash Cassandra
>>> > again,
>>> > so I want to turn on any debugging or helpful logs to diagnose the
>>> > crash if
>>> > it happens again.
>>> >
>>> >
>>> > - Sameer
>>> >
>>> >
>>> > On Thu, Jul 21, 2011 at 4:30 PM, aaron morton <aa...@thelastpickle.com>
>>> > wrote:
>>> >>
>>> >> The default init.d script will direct std out/err to that file, how
>>> >> are
>>> >> you starting brisk / cassandra ?
>>> >> Check the syslog and other logs in /var/log to see if the OS killed
>>> >> cassandra.
>>> >> Also, what was the last thing in the cassandra log before INFO [main]
>>> >> 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
>>> >> initialised ?
>>> >>
>>> >> Cheers
>>> >>
>>> >> -----------------
>>> >> Aaron Morton
>>> >> Freelance Cassandra Developer
>>> >> @aaronmorton
>>> >> http://www.thelastpickle.com
>>> >> On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
>>> >>
>>> >> Hey Aaron,
>>> >>
>>> >> I don't have any output.log files in that folder:
>>> >>
>>> >> ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
>>> >> ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
>>> >> system.log     system.log.11  system.log.4  system.log.7
>>> >> system.log.1   system.log.2   system.log.5  system.log.8
>>> >> system.log.10  system.log.3   system.log.6  system.log.9
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
>>> >> <aa...@thelastpickle.com>
>>> >> wrote:
>>> >>>
>>> >>> Check /var/log/cassandra/output.log (assuming the default init
>>> >>> scripts)
>>> >>> A
>>> >>> -----------------
>>> >>> Aaron Morton
>>> >>> Freelance Cassandra Developer
>>> >>> @aaronmorton
>>> >>> http://www.thelastpickle.com
>>> >>> On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
>>> >>>
>>> >>> Hmm. Just looked at the log more closely.
>>> >>>
>>> >>> So, what actually happened is while Repair was running on this
>>> >>> specific
>>> >>> node, the Cassandra java process terminated itself automatically. The
>>> >>> last
>>> >>> entries in the log are:
>>> >>>
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 1931111120
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java
>>> >>> (line
>>> >>> 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600
>>> >>> used; max
>>> >>> is 4030726144
>>> >>>
>>> >>> When we came in this morning, nodetool ring from another node showed
>>> >>> the
>>> >>> 1st node as down and OpsCenter also reported it as down.
>>> >>>
>>> >>> Next we ran "sudo netstat -anp | grep 7199" from the 1st node to see
>>> >>> the
>>> >>> status of the Cassandra PID and it was not running.
>>> >>>
>>> >>> We then started Cassandra:
>>> >>>
>>> >>> INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java
>>> >>> (line
>>> >>> 78) Logging initialized
>>> >>>  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java
>>> >>> (line
>>> >>> 96) Heap size: 3894411264/3894411264
>>> >>>  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA
>>> >>> mlockall successful
>>> >>>  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line
>>> >>> 121)
>>> >>> Loading settings from
>>> >>> file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
>>> >>>
>>> >>>
>>> >>> It was during this start process that the java.io.EOFException was
>>> >>> seen,
>>> >>> but yes, like you said Jonathan, the Cassandra process started back
>>> >>> up and
>>> >>> joined the ring.
>>> >>>
>>> >>> We're now wondering why the Repair failed and why Cassandra crashed
>>> >>> in
>>> >>> the first place. We only had default level logging enabled. Is there
>>> >>> something else I can check or that you suspect?
>>> >>>
>>> >>> Should we turn the logging up to debug and retry the Repair?
>>> >>>
>>> >>>
>>> >>> - Sameer
>>> >>>
>>> >>>
>>> >>> On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis <jbel...@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Looks harmless to me.
>>> >>>>
>>> >>>> On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
>>> >>>> <cassandral...@gmail.com> wrote:
>>> >>>> > While running Repair on a 0.8.1 node, we got this error in the
>>> >>>> > system.log:
>>> >>>> >
>>> >>>> > ERROR [Thread-23] 2011-07-21 15:48:43,868
>>> >>>> > AbstractCassandraDaemon.java
>>> >>>> > (line
>>> >>>> > 113) Fatal exception in thread Thread[Thread-23,5,main]
>>> >>>> > java.io.IOError: java.io.EOFException
>>> >>>> > at
>>> >>>> >
>>> >>>> >
>>> >>>> > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
>>> >>>> > Caused by: java.io.EOFException
>>> >>>> > at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>> >>>> > at
>>> >>>> >
>>> >>>> >
>>> >>>> > org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
>>> >>>> >
>>> >>>> > There's just a bunch of informational messages about Gossip before
>>> >>>> > this.
>>> >>>> >
>>> >>>> > Looks like the file or stream unexpectedly ended?
>>> >>>> >
>>> >>>> >
>>> >>>> > http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
>>> >>>> >
>>> >>>> > Is this a bug or something wrong in our environment?
>>> >>>> >
>>> >>>> >
>>> >>>> > - Sameer
>>> >>>> >
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Jonathan Ellis
>>> >>>> Project Chair, Apache Cassandra
>>> >>>> co-founder of DataStax, the source for professional Cassandra
>>> >>>> support
>>> >>>> http://www.datastax.com
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>
>

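The two checks suggested in the quoted thread (looking for a JVM crash log,
and checking whether the OS killed the process) can be sketched in shell as
below. The function names are hypothetical, and the grep patterns assume the
usual Linux OOM-killer wording, which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helpers; the caller supplies the directories and log file.

# List HotSpot fatal-error logs (hs_err_pid<pid>.log) found directly
# inside each of the given directories.
find_crash_logs() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            find "$dir" -maxdepth 1 -type f -name 'hs_err_pid*.log'
        fi
    done
}

# Print lines from a syslog-style file that suggest the kernel OOM killer
# terminated a process. Returns non-zero if nothing matches.
find_oom_kills() {
    grep -i -E 'out of memory|killed process' "$1"
}

# Example calls, using the locations mentioned in the thread:
#   find_crash_logs /tmp ~/brisk/resources/cassandra/bin
#   find_oom_kills /var/log/syslog
```

No output from either call does not rule out a crash; it only means these two
places hold no evidence of one.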