Looks like the repair finished successfully the second time. However, the
cluster is still severely unbalanced. I was hoping the repair would balance
the nodes. We're using random partitioner. One node has 900GB and others
have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
insanely high. Not sure why or how to troubleshoot.



On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui <cassandral...@gmail.com>wrote:

> I don't see a JVM crashlog ( hs_err_pid[pid].log) in
> ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
>
> We're running a pretty up to date with Sun Java:
>
> ubuntu@ip-10-2-x-x:/tmp$ java -version
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
>
> I'm gonna restart the Repair process in a few more hours. If there are any
> additional debug or troubleshooting logs you'd like me to enable first,
> please let me know.
>
> - Sameer
>
>
>
>
> On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Did you check for a JVM crash log?
>>
>> You should make sure you're running the latest Sun JVM, older versions
>> and OpenJDK in particular are prone to segfaulting.
>>
>> On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
>> <cassandral...@gmail.com> wrote:
>> > We are starting Cassandra with "brisk cassandra", so as a stand-alone
>> > process, not a service.
>> >
>> > The syslog on the node doesn't show anything regarding the Cassandra
>> Java
>> > process around the time the last entries were made in the Cassandra
>> > system.log (2011-07-21 13:01:51):
>> >
>> > Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v
>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>> > Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v
>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>> > Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v
>> > debian-sa1 > /dev/null && debian-sa1 1 1)
>> > Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source =
>> > /proc/kmsg started.
>> > Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software="rsyslogd"
>> > swVersion="4.2.0" x-pid="663" x-info="http://www.rsyslog.com";]
>> (re)start
>> >
>> >
>> > The last thing in the Cassandra log before INFO Logging initialized is:
>> >
>> >  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line
>> 128)
>> > GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max
>> is
>> > 4030726144
>> >
>> >
>> > I can start Repair again, but am worried that it will crash Cassandra
>> again,
>> > so I want to turn on any debugging or helpful logs to diagnose the crash
>> if
>> > it happens again.
>> >
>> >
>> > - Sameer
>> >
>> >
>> > On Thu, Jul 21, 2011 at 4:30 PM, aaron morton <aa...@thelastpickle.com>
>> > wrote:
>> >>
>> >> The default init.d script will direct std out/err to that file, how are
>> >> you starting brisk / cassandra ?
>> >> Check the syslog and other logs in /var/log to see if the OS killed
>> >> cassandra.
>> >> Also, what was the last thing in the casandra log before INFO [main]
>> >> 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
>> >> initialised ?
>> >>
>> >> Cheers
>> >>
>> >> -----------------
>> >> Aaron Morton
>> >> Freelance Cassandra Developer
>> >> @aaronmorton
>> >> http://www.thelastpickle.com
>> >> On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
>> >>
>> >> Hey Aaron,
>> >>
>> >> I don't have any output.log files in that folder:
>> >>
>> >> ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
>> >> ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
>> >> system.log     system.log.11  system.log.4  system.log.7
>> >> system.log.1   system.log.2   system.log.5  system.log.8
>> >> system.log.10  system.log.3   system.log.6  system.log.9
>> >>
>> >>
>> >>
>> >> On Thu, Jul 21, 2011 at 3:40 PM, aaron morton <aa...@thelastpickle.com
>> >
>> >> wrote:
>> >>>
>> >>> Check /var/log/cassandra/output.log (assuming the default init
>> scripts)
>> >>> A
>> >>> -----------------
>> >>> Aaron Morton
>> >>> Freelance Cassandra Developer
>> >>> @aaronmorton
>> >>> http://www.thelastpickle.com
>> >>> On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
>> >>>
>> >>> Hmm. Just looked at the log more closely.
>> >>>
>> >>> So, what actually happened is while Repair was running on this
>> specific
>> >>> node, the Cassandra java process terminated itself automatically. The
>> last
>> >>> entries in the log are:
>> >>>
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888
>> used; max
>> >>> is 4030726144
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688
>> used; max
>> >>> is 4030726144
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 1931111120
>> used; max
>> >>> is 4030726144
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368
>> used; max
>> >>> is 4030726144
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176
>> used; max
>> >>> is 4030726144
>> >>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java
>> (line
>> >>> 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600
>> used; max
>> >>> is 4030726144
>> >>>
>> >>> When we came in this morning, nodetool ring from another node showed
>> the
>> >>> 1st node as down and OpsCenter also reported it as down.
>> >>>
>> >>> Next we ran "sudo netstat -anp | grep 7199" from the 1st node to see
>> the
>> >>> status of the Cassandra PID and it was not running.
>> >>>
>> >>> We then started Cassandra:
>> >>>
>> >>> INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line
>> >>> 78) Logging initialized
>> >>>  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java
>> (line
>> >>> 96) Heap size: 3894411264/3894411264
>> >>>  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA
>> >>> mlockall successful
>> >>>  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line
>> 121)
>> >>> Loading settings from
>> >>> file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
>> >>>
>> >>>
>> >>> It was during this start process that the java.io.EOFException was
>> seen,
>> >>> but yes, like you said Jonathan, the Cassandra process started back up
>> and
>> >>> joined the ring.
>> >>>
>> >>> We're now wondering why the Repair failed and why Cassandra crashed in
>> >>> the first place. We only had default level logging enabled. Is there
>> >>> something else I can check or that you suspect?
>> >>>
>> >>> Should we turn the logging up to debug and retry the Repair?
>> >>>
>> >>>
>> >>> - Sameer
>> >>>
>> >>>
>> >>> On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis <jbel...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Looks harmless to me.
>> >>>>
>> >>>> On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
>> >>>> <cassandral...@gmail.com> wrote:
>> >>>> > While running Repair on a 0.8.1 node, we got this error in the
>> >>>> > system.log:
>> >>>> >
>> >>>> > ERROR [Thread-23] 2011-07-21 15:48:43,868
>> AbstractCassandraDaemon.java
>> >>>> > (line
>> >>>> > 113) Fatal exception in thread Thread[Thread-23,5,main]
>> >>>> > java.io.IOError: java.io.EOFException
>> >>>> > at
>> >>>> >
>> >>>> >
>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
>> >>>> > Caused by: java.io.EOFException
>> >>>> > at java.io.DataInputStream.readInt(DataInputStream.java:375)
>> >>>> > at
>> >>>> >
>> >>>> >
>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
>> >>>> >
>> >>>> > There's just a bunch of informational messages about Gossip before
>> >>>> > this.
>> >>>> >
>> >>>> > Looks like the file or stream unexpectedly ended?
>> >>>> >
>> >>>> >
>> http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
>> >>>> >
>> >>>> > Is this a bug or something wrong in our environment?
>> >>>> >
>> >>>> >
>> >>>> > - Sameer
>> >>>> >
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Jonathan Ellis
>> >>>> Project Chair, Apache Cassandra
>> >>>> co-founder of DataStax, the source for professional Cassandra support
>> >>>> http://www.datastax.com
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>

Reply via email to