Re: Repair fails with java.io.IOError: java.io.EOFException

aaron morton Tue, 26 Jul 2011 02:00:02 -0700

Was guessing something like a token move may have happened in the past. 
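
To check whether a past token move left the ring uneven, one option is to compare the tokens shown by nodetool ring against evenly spaced ones. A minimal sketch, assuming RandomPartitioner's token space of 0 to 2**127; the function names and node count are hypothetical, and the code does not handle duplicate tokens:

```python
# Sketch only: evenly spaced tokens for RandomPartitioner.
# Assumption: the token space runs from 0 to 2**127.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return one evenly spaced token per node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Return the fraction of the ring owned by each token, in sorted order.

    A node owns the range from the previous token (exclusive) up to its own
    token (inclusive), wrapping around the ring.
    """
    ordered = sorted(tokens)
    return [
        ((tok - ordered[i - 1]) % RING_SIZE or RING_SIZE) / RING_SIZE
        for i, tok in enumerate(ordered)
    ]


if __name__ == "__main__":
    # Hypothetical 4-node ring; substitute the real node count.
    for token, share in zip(balanced_tokens(4), ownership(balanced_tokens(4))):
        print(token, share)
```

Feeding the real tokens from nodetool ring into ownership() should show whether one node owns a much larger slice than the others.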

Good suggestion to also kick off a major compaction. I've seen that make a big 
difference even for apps that do not do deletes, but do do overwrites.
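
As a rough illustration of why overwrites alone can inflate disk usage: each flush can leave an older version of the same row in an older file until compaction merges them. This is a toy model with made-up data (a dict per "SSTable"), not Cassandra's on-disk format or its compaction code:

```python
# Toy model only: each "SSTable" maps row key -> (timestamp, value).
# This is NOT how Cassandra stores data; it only illustrates the idea that
# overwritten versions pile up until a compaction merges the files.

def stored_cells(sstables):
    """Count every stored version across all tables."""
    return sum(len(table) for table in sstables)


def major_compact(sstables):
    """Merge all tables into one, keeping the newest version of each key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return [merged]


if __name__ == "__main__":
    # Hypothetical data: user1 written three times, user2 twice, no deletes.
    sstables = [
        {"user1": (1, "a"), "user2": (1, "b")},
        {"user1": (2, "a2")},
        {"user1": (3, "a3"), "user2": (2, "b2")},
    ]
    print(stored_cells(sstables))                 # versions before compaction
    print(stored_cells(major_compact(sstables)))  # versions after compaction
```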


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote:

>> If they are, and repair has completed, use nodetool cleanup to remove the
>> data the node is no longer responsible for. See the bootstrap section above.
> 
> I've seen that said a few times, so allow me to correct it. Cleanup is useless
> after a repair. 'nodetool cleanup' removes rows the node is no longer
> responsible for, and is thus useful only after operations that change the
> range a node is responsible for (bootstrap, move, decommission). After a
> repair, you will need compaction to kick in to see your disk usage come back
> to normal.
> 
> --
> Sylvain
> 
>> Hope that helps.
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
>> 
>> Looks like the repair finished successfully the second time. However, the
>> cluster is still severely unbalanced. I was hoping the repair would balance
>> the nodes. We're using the random partitioner. One node has 900GB and others
>> have 128GB, 191GB, 129GB, 257GB, etc. The 900GB and the 646GB are just
>> insanely high. Not sure why or how to troubleshoot.
>> 
>> 
>> 
>> On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui <cassandral...@gmail.com>
>> wrote:
>>> 
>>> I don't see a JVM crash log (hs_err_pid[pid].log) in
>>> ~/brisk/resources/cassandra/bin or /tmp. So maybe the JVM didn't crash?
>>> 
>>> We're running a pretty up-to-date Sun Java:
>>> 
>>> ubuntu@ip-10-2-x-x:/tmp$ java -version
>>> java version "1.6.0_24"
>>> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
>>> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
>>> 
>>> I'm gonna restart the Repair process in a few more hours. If there are any
>>> additional debug or troubleshooting logs you'd like me to enable first,
>>> please let me know.
>>> 
>>> - Sameer
>>> 
>>> 
>>> 
>>> On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>> 
>>>> Did you check for a JVM crash log?
>>>> 
>>>> You should make sure you're running the latest Sun JVM; older versions,
>>>> and OpenJDK in particular, are prone to segfaulting.
>>>> 
>>>> On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
>>>> <cassandral...@gmail.com> wrote:
>>>>> We are starting Cassandra with "brisk cassandra", so as a stand-alone
>>>>> process, not a service.
>>>>> 
>>>>> The syslog on the node doesn't show anything regarding the Cassandra Java
>>>>> process around the time the last entries were made in the Cassandra
>>>>> system.log (2011-07-21 13:01:51):
>>>>> 
>>>>> Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>> Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>> Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>>>> Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started.
>>>>> Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="663" x-info="http://www.rsyslog.com"] (re)start
>>>>> 
>>>>> 
>>>>> The last thing in the Cassandra log before INFO Logging initialized is:
>>>>> 
>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144
>>>>> 
>>>>> 
>>>>> I can start Repair again, but am worried that it will crash Cassandra
>>>>> again, so I want to turn on any debugging or helpful logs to diagnose the
>>>>> crash if it happens again.
>>>>> 
>>>>> 
>>>>> - Sameer
>>>>> 
>>>>> 
>>>>> On Thu, Jul 21, 2011 at 4:30 PM, aaron morton <aa...@thelastpickle.com>
>>>>> wrote:
>>>>>> 
>>>>>> The default init.d script will direct stdout/stderr to that file. How are
>>>>>> you starting brisk / cassandra?
>>>>>> Check the syslog and other logs in /var/log to see if the OS killed
>>>>>> cassandra.
>>>>>> Also, what was the last thing in the cassandra log before this line?
>>>>>> INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> -----------------
>>>>>> Aaron Morton
>>>>>> Freelance Cassandra Developer
>>>>>> @aaronmorton
>>>>>> http://www.thelastpickle.com
>>>>>> On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
>>>>>> 
>>>>>> Hey Aaron,
>>>>>> 
>>>>>> I don't have any output.log files in that folder:
>>>>>> 
>>>>>> ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
>>>>>> ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
>>>>>> system.log     system.log.11  system.log.4  system.log.7
>>>>>> system.log.1   system.log.2   system.log.5  system.log.8
>>>>>> system.log.10  system.log.3   system.log.6  system.log.9
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
>>>>>> <aa...@thelastpickle.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Check /var/log/cassandra/output.log (assuming the default init scripts)
>>>>>>> A
>>>>>>> -----------------
>>>>>>> Aaron Morton
>>>>>>> Freelance Cassandra Developer
>>>>>>> @aaronmorton
>>>>>>> http://www.thelastpickle.com
>>>>>>> On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
>>>>>>> 
>>>>>>> Hmm. Just looked at the log more closely.
>>>>>>> 
>>>>>>> So, what actually happened is that while Repair was running on this
>>>>>>> specific node, the Cassandra java process terminated itself
>>>>>>> automatically. The last entries in the log are:
>>>>>>> 
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 1931111120 used; max is 4030726144
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144
>>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144
>>>>>>> 
>>>>>>> When we came in this morning, nodetool ring from another node showed
>>>>>>> the 1st node as down and OpsCenter also reported it as down.
>>>>>>> 
>>>>>>> Next we ran "sudo netstat -anp | grep 7199" from the 1st node to see
>>>>>>> the status of the Cassandra PID and it was not running.
>>>>>>> 
>>>>>>> We then started Cassandra:
>>>>>>> 
>>>>>>>  INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized
>>>>>>>  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264
>>>>>>>  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful
>>>>>>>  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
>>>>>>> 
>>>>>>> 
>>>>>>> It was during this start process that the java.io.EOFException was
>>>>>>> seen, but yes, like you said Jonathan, the Cassandra process started
>>>>>>> back up and joined the ring.
>>>>>>> 
>>>>>>> We're now wondering why the Repair failed and why Cassandra crashed
>>>>>>> in the first place. We only had default level logging enabled. Is there
>>>>>>> something else I can check or that you suspect?
>>>>>>> 
>>>>>>> Should we turn the logging up to debug and retry the Repair?
>>>>>>> 
>>>>>>> 
>>>>>>> - Sameer
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis <jbel...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Looks harmless to me.
>>>>>>>> 
>>>>>>>> On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
>>>>>>>> <cassandral...@gmail.com> wrote:
>>>>>>>>> While running Repair on a 0.8.1 node, we got this error in the
>>>>>>>>> system.log:
>>>>>>>>> 
>>>>>>>>> ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main]
>>>>>>>>> java.io.IOError: java.io.EOFException
>>>>>>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
>>>>>>>>> Caused by: java.io.EOFException
>>>>>>>>>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>>>>>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
>>>>>>>>> 
>>>>>>>>> There's just a bunch of informational messages about Gossip before this.
>>>>>>>>> 
>>>>>>>>> Looks like the file or stream unexpectedly ended?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
>>>>>>>>> 
>>>>>>>>> Is this a bug or something wrong in our environment?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> - Sameer
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Jonathan Ellis
>>>>>>>> Project Chair, Apache Cassandra
>>>>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>>>>> http://www.datastax.com
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of DataStax, the source for professional Cassandra support
>>>> http://www.datastax.com
>>> 
>> 
>> 
>>
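
For the GCInspector lines quoted in the thread, a small sketch like the one below could pull out pause time and heap usage to see how full the heap was before the process went away. The regular expression is an assumption based only on the log lines shown above (with each entry joined back onto one line); the function name is hypothetical:

```python
import re

# Assumption: pattern derived only from the GCInspector lines quoted in this
# thread; other Cassandra versions may log a different format.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<pause_ms>\d+) ms, "
    r"(?P<reclaimed>\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)


def parse_gc_line(line):
    """Return a dict of GC figures for one log line, or None if no match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    used = int(match.group("used"))
    maximum = int(match.group("max"))
    return {
        "collector": match.group("collector"),
        "pause_ms": int(match.group("pause_ms")),
        "reclaimed": int(match.group("reclaimed")),
        "used": used,
        "max": maximum,
        "heap_fraction": used / maximum,
    }


if __name__ == "__main__":
    last = (
        " INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java "
        "(line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving "
        "2040879600 used; max is 4030726144"
    )
    print(parse_gc_line(last))
```

On the last entry before the restart this reports the heap as roughly half full, which by itself does not point to heap exhaustion.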
