> If they are and repair has completed use nodetool cleanup to remove the
> data the node is no longer responsible for. See bootstrap section above.

I've seen that said a few times, so allow me to correct it. Cleanup is useless
after a repair. 'nodetool cleanup' removes rows the node is no longer
responsible for, and is thus useful only after operations that change the range
a node is responsible for (bootstrap, move, decommission). After a repair, you
will need compaction to kick in to see your disk usage come back to normal.

--
Sylvain

> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
>
> Looks like the repair finished successfully the second time. However, the
> cluster is still severely unbalanced. I was hoping the repair would balance
> the nodes. We're using random partitioner. One node has 900GB and others
> have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
> insanely high. Not sure why or how to troubleshoot.
>
>
>
> On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui <cassandral...@gmail.com>
> wrote:
>>
>> I don't see a JVM crashlog ( hs_err_pid[pid].log) in
>> ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
>>
>> We're running a pretty up-to-date Sun Java:
>>
>> ubuntu@ip-10-2-x-x:/tmp$ java -version
>> java version "1.6.0_24"
>> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
>> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
>>
>> I'm gonna restart the Repair process in a few more hours. If there are any
>> additional debug or troubleshooting logs you'd like me to enable first,
>> please let me know.
>>
>> - Sameer
>>
>>
>>
>> On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> Did you check for a JVM crash log?
>>>
>>> You should make sure you're running the latest Sun JVM, older versions
>>> and OpenJDK in particular are prone to segfaulting.
>>>
>>> On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
>>> <cassandral...@gmail.com> wrote:
>>> > We are starting Cassandra with "brisk cassandra", so as a stand-alone
>>> > process, not a service.
>>> >
>>> > The syslog on the node doesn't show anything regarding the Cassandra Java
>>> > process around the time the last entries were made in the Cassandra
>>> > system.log (2011-07-21 13:01:51):
>>> >
>>> > Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
>>> > Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source = /proc/kmsg started.
>>> > Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="663" x-info="http://www.rsyslog.com"] (re)start
>>> >
>>> >
>>> > The last thing in the Cassandra log before INFO Logging initialized is:
>>> >
>>> > INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144
>>> >
>>> >
>>> > I can start Repair again, but am worried that it will crash Cassandra again,
>>> > so I want to turn on any debugging or helpful logs to diagnose the crash if
>>> > it happens again.
>>> >
>>> >
>>> > - Sameer
>>> >
>>> >
>>> > On Thu, Jul 21, 2011 at 4:30 PM, aaron morton <aa...@thelastpickle.com>
>>> > wrote:
>>> >>
>>> >> The default init.d script will direct std out/err to that file, how are
>>> >> you starting brisk / cassandra ?
>>> >> Check the syslog and other logs in /var/log to see if the OS killed
>>> >> cassandra.
>>> >> Also, what was the last thing in the cassandra log before INFO [main]
>>> >> 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging
>>> >> initialised ?
>>> >>
>>> >> Cheers
>>> >>
>>> >> -----------------
>>> >> Aaron Morton
>>> >> Freelance Cassandra Developer
>>> >> @aaronmorton
>>> >> http://www.thelastpickle.com
>>> >> On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
>>> >>
>>> >> Hey Aaron,
>>> >>
>>> >> I don't have any output.log files in that folder:
>>> >>
>>> >> ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
>>> >> ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
>>> >> system.log system.log.11 system.log.4 system.log.7
>>> >> system.log.1 system.log.2 system.log.5 system.log.8
>>> >> system.log.10 system.log.3 system.log.6 system.log.9
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
>>> >> <aa...@thelastpickle.com>
>>> >> wrote:
>>> >>>
>>> >>> Check /var/log/cassandra/output.log (assuming the default init
>>> >>> scripts)
>>> >>> A
>>> >>> -----------------
>>> >>> Aaron Morton
>>> >>> Freelance Cassandra Developer
>>> >>> @aaronmorton
>>> >>> http://www.thelastpickle.com
>>> >>> On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
>>> >>>
>>> >>> Hmm. Just looked at the log more closely.
>>> >>>
>>> >>> So, what actually happened is while Repair was running on this specific
>>> >>> node, the Cassandra java process terminated itself automatically.
>>> >>> The last entries in the log are:
>>> >>>
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java (line 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving 1845274888 used; max is 4030726144
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java (line 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving 1864471688 used; max is 4030726144
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java (line 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving 1931111120 used; max is 4030726144
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java (line 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving 1955746368 used; max is 4030726144
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java (line 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving 1969361176 used; max is 4030726144
>>> >>> INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java (line 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used; max is 4030726144
>>> >>>
>>> >>> When we came in this morning, nodetool ring from another node showed the
>>> >>> 1st node as down and OpsCenter also reported it as down.
>>> >>>
>>> >>> Next we ran "sudo netstat -anp | grep 7199" from the 1st node to see the
>>> >>> status of the Cassandra PID and it was not running.
>>> >>>
>>> >>> We then started Cassandra:
>>> >>>
>>> >>> INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78) Logging initialized
>>> >>> INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java (line 96) Heap size: 3894411264/3894411264
>>> >>> INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line 106) JNA mlockall successful
>>> >>> INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java (line 121) Loading settings from file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
>>> >>>
>>> >>>
>>> >>> It was during this start process that the java.io.EOFException was seen,
>>> >>> but yes, like you said Jonathan, the Cassandra process started back up and
>>> >>> joined the ring.
>>> >>>
>>> >>> We're now wondering why the Repair failed and why Cassandra crashed in
>>> >>> the first place. We only had default level logging enabled. Is there
>>> >>> something else I can check or that you suspect?
>>> >>>
>>> >>> Should we turn the logging up to debug and retry the Repair?
>>> >>>
>>> >>>
>>> >>> - Sameer
>>> >>>
>>> >>>
>>> >>> On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis <jbel...@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Looks harmless to me.
>>> >>>>
>>> >>>> On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
>>> >>>> <cassandral...@gmail.com> wrote:
>>> >>>> > While running Repair on a 0.8.1 node, we got this error in the
>>> >>>> > system.log:
>>> >>>> >
>>> >>>> > ERROR [Thread-23] 2011-07-21 15:48:43,868 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-23,5,main]
>>> >>>> > java.io.IOError: java.io.EOFException
>>> >>>> > at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
>>> >>>> > Caused by: java.io.EOFException
>>> >>>> > at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>> >>>> > at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
>>> >>>> >
>>> >>>> > There's just a bunch of informational messages about Gossip before
>>> >>>> > this.
>>> >>>> >
>>> >>>> > Looks like the file or stream unexpectedly ended?
>>> >>>> >
>>> >>>> > http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
>>> >>>> >
>>> >>>> > Is this a bug or something wrong in our environment?
>>> >>>> >
>>> >>>> >
>>> >>>> > - Sameer
>>> >>>> >
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Jonathan Ellis
>>> >>>> Project Chair, Apache Cassandra
>>> >>>> co-founder of DataStax, the source for professional Cassandra
>>> >>>> support
>>> >>>> http://www.datastax.com
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>
>
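
P.S. To make the cleanup-versus-compaction point at the top of this message concrete, here is a toy sketch in Python. It is not Cassandra code: the functions `in_range`, `cleanup`, `compact` and `size`, the dict-per-sstable layout, and the token values are all made up for illustration, and it assumes repair streamed in some rows the node already held.

```python
"""Toy model of what 'nodetool cleanup' and compaction each reclaim.

All names and the data layout are hypothetical; this only mirrors the
behaviour described in the thread and is not how Cassandra is implemented.
"""


def in_range(token, start, end):
    """True if token lies in the ring range (start, end], with wrap-around."""
    if start < end:
        return start < token <= end
    return token > start or token <= end


def cleanup(sstables, start, end):
    """Drop rows whose token falls outside the range the node owns."""
    return [{t: v for t, v in s.items() if in_range(t, start, end)}
            for s in sstables]


def compact(sstables):
    """Merge all sstables into one, keeping a single copy of each row."""
    merged = {}
    for s in sstables:
        merged.update(s)
    return [merged]


def size(sstables):
    """Total number of stored rows, counting duplicates across sstables."""
    return sum(len(s) for s in sstables)


owned = (0, 100)                        # node owns tokens in (0, 100]
on_disk = [{10: "a", 20: "b"}]          # data before repair
streamed = {10: "a", 20: "b", 30: "c"}  # assumed: repair re-sent rows we had
after_repair = on_disk + [streamed]

print(size(after_repair))                  # 5 rows on disk after repair
print(size(cleanup(after_repair, *owned)))  # 5: every row is still in range
print(size(compact(after_repair)))          # 3: duplicates merged away

# Only after the owned range shrinks (e.g. a move) does cleanup drop anything.
moved = (0, 25)
print(size(cleanup(compact(after_repair), *moved)))  # 2: token 30 removed
```

In this model cleanup changes nothing right after repair because no row is outside the node's range; only merging the duplicate copies (compaction) shrinks the data.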