Re: Having trouble getting cassandra to stay up

Stu Hood Fri, 24 Dec 2010 22:22:37 -0800

With a very small amount of memory, the Cassandra process may be getting
killed by the Linux OOM killer, which should result in a log message to the
kernel logs. See
http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
to locate the error if it exists.
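
As a rough sketch of that check, assuming the kernel log has already been
captured as text (for example from dmesg), the Python below looks for the
"Killed process" lines that the OOM killer writes. The name find_oom_kills
and the sample log lines are illustrative, and the exact message wording
varies between kernel versions.

```python
import re

# Matches kernel lines such as "Killed process 1234 (java)".
# The wording is an assumption based on common kernel output and
# may differ between kernel versions.
OOM_KILL_RE = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)


def find_oom_kills(log_text):
    """Return (pid, process_name) pairs for each OOM kill line found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Illustrative sample only, not taken from the machines in this thread.
SAMPLE_LOG = (
    "[ 1234.567890] Out of memory: kill process 2345 (java) score 901 or a child\n"
    "[ 1234.567999] Killed process 2345 (java)\n"
)
print(find_oom_kills(SAMPLE_LOG))
```

If the list comes back with a java entry at about the time Cassandra
disappeared, the OOM killer is the likely cause.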


On Fri, Dec 24, 2010 at 6:46 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

> One last clarification, given that you are running with -f: does "fully die"
> mean a return to the command prompt with no action on your part? If you press
> Ctrl-C while Cassandra is running in foreground mode (i.e. with -f), the
> process WILL be killed. Try running in background mode (without the -f).
>
>
>
> Removing the contents of /var/lib/cassandra/ and using the default
> cassandra.yaml and cassandra-env.sh is effectively the same as a complete
> reset. You can also just delete the installation and re-extract the provided
> tarball.
>
>
>
> Given the limited amount of RAM on a micro instance, you might try using
> JNA (download it from
> https://jna.dev.java.net/servlets/ProjectDocumentList?folderID=12329&expandFolder=12329&folderID=0
> and put the jar in Cassandra's lib directory; see
> https://issues.apache.org/jira/browse/CASSANDRA-1214) or setting
> disk_access_mode: standard in cassandra.yaml.
>
>
>
> Other than that, I am out of ideas; perhaps somebody else can comment. I
> have set up Cassandra 0.7 RC2 on various EC2 Ubuntu 10.10 instances with no
> issues (although not on a micro in quite some time). Having problems with a
> stock Ubuntu image and the provided Cassandra tarball, with no tinkering
> with the Cassandra or system settings, is very strange. Again, if worst comes
> to worst, start with a fresh m1.small instance; it takes me less than half an
> hour to be up and running from scratch.
>
>
>
> Dan
>
>
>
> *From:* Alex Quan [mailto:alex.q...@tinkur.com]
> *Sent:* December-24-10 17:44
> *To:* user@cassandra.apache.org
> *Subject:* Re: Having trouble getting cassandra to stay up
>
>
>
> I am running bin/cassandra with the -f option, and it does seem to fully
> die rather than just stall.
>
> I have also tried using the cassandra-cli to create a keyspace. It works
> for a little while and then dies shortly after accepting the request. The
> vmstat output after it dies is as follows:
>
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0      0 311240    424  23356    0    0    14     4   13    2  0  0 99  0
>
> I also tried creating a keyspace with cassandra-cli after deleting all the
> contents of cassandra/data and cassandra/commitlog, and it still dies
> almost immediately after the keyspace creation. I am not sure why this is the
> case. Is there a way to fully remove Cassandra and start off with a completely
> fresh copy?
>
> Thanks
>
> Alex
>
> On Fri, Dec 24, 2010 at 1:42 PM, Dan Hendry <dan.hendry.j...@gmail.com>
> wrote:
>
> Hum, very strange.
>
>
>
> More what I was trying to get at was: did the process truly die or was it
> just non-responsive and looking like it was dead? It would be very strange
> if the actual process was dying without any warnings in the logs. Presumably
> you are running bin/cassandra *without* the -f option? What is the output of
> top/vmstat on the dead node after Cassandra has 'died'? Sorry I was not
> clear on this initially.
>
> I have no experience with pycassa but you might want to try using the
> Cassandra CLI to create keyspaces and column families to rule out some sort
> of client weirdness. Also, you haven't made any changes to cassandra-env.sh
> have you? EC2 micros have a very limited amount of ram. I have also seen
> their CPU bursting cause problems but that does not seem to be the issue
> here. I might also suggest you try a m1.small instead just to be safe; they
> are still pretty cheap when you run them as spot instances.
>
>
>
> As a last-ditch effort (given that this is a test cluster), you can delete
> the contents of /var/lib/cassandra/data/* and /var/lib/cassandra/commitlog/*
> to effectively reset your nodes.
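
A minimal sketch of that reset in Python, assuming Cassandra is stopped and
the data really is disposable. The helper names clear_directory and
reset_node are hypothetical, and the default base path is simply the
/var/lib/cassandra location mentioned in this thread.

```python
import os
import shutil


def clear_directory(path):
    """Delete everything inside path but keep the directory itself."""
    removed = []
    for name in sorted(os.listdir(path)):
        target = os.path.join(path, name)
        if os.path.isdir(target) and not os.path.islink(target):
            shutil.rmtree(target)
        else:
            os.remove(target)
        removed.append(name)
    return removed


def reset_node(base="/var/lib/cassandra"):
    """Empty the data and commitlog directories under base.

    Destructive: only meant for a stopped node in a test cluster.
    Returns the names removed from each directory.
    """
    return {
        sub: clear_directory(os.path.join(base, sub))
        for sub in ("data", "commitlog")
    }
```

Nothing runs on import; call reset_node() explicitly, and pass a different
base if the directories in cassandra.yaml point elsewhere.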
>
>
>
> On Fri, Dec 24, 2010 at 12:48 PM, Alex Quan <alex.q...@tinkur.com> wrote:
>
> Sorry, but I am not sure how to answer all the questions you have posed,
> since a lot of what I am working with is quite new to me and I haven't
> used many of the tools mentioned, but I will try to answer to the best of
> my knowledge. I am trying to get Cassandra to run across 2 nodes that are
> both Amazon EC2 micro instances. I believe they are running 64-bit Ubuntu
> Linux 10.01 with Java version 1.6.0_23. When I said "killed", that was what
> was printed to the console when the process died, so I am not sure exactly
> what it means. Here is some of the info from before Cassandra went down:
>
> ring:
>
> Address         Status State   Load            Owns    Token
>                                                        111232248257764777335763873822010980488
> 10.127.155.205  Up     Normal  85.17 KB        59.06%  41570168072350555868554892080805525145
> 10.122.123.210  Up     Normal  91.1 KB         40.94%  111232248257764777335763873822010980488
>
> vmstat before cassandra is up:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0      0 328196    632  13936    0    0    12     4   13    1  0  0 99  0
>
> vmstat after cassandra is up:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  2      0   5660    116  10312    0    0    12     4   13    1  0  0 99  0
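
Reading the two samples above side by side, free memory falls from 328196
to 5660 once Cassandra is up, with no swap in use, which is consistent with
memory pressure on the micro instance. The sketch below pairs vmstat column
names with their values to make that comparison explicit. parse_vmstat is a
hypothetical helper, and the strings are the samples above with the wrapped
lines rejoined.

```python
def parse_vmstat(header_line, value_line):
    """Pair each vmstat column name with its integer value."""
    names = header_line.split()
    values = [int(v) for v in value_line.split()]
    if len(names) != len(values):
        raise ValueError("header and value column counts differ")
    return dict(zip(names, values))


HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa"
# Values copied from the two vmstat samples quoted in this message.
BEFORE = "0 0 0 328196 632 13936 0 0 12 4 13 1 0 0 99 0"
AFTER = "0 2 0 5660 116 10312 0 0 12 4 13 1 0 0 99 0"

before = parse_vmstat(HEADER, BEFORE)
after = parse_vmstat(HEADER, AFTER)

# vmstat reports the memory columns in KiB by default.
print("free before: %d KiB, after: %d KiB" % (before["free"], after["free"]))
print("swap used after: %d KiB" % after["swpd"])
```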
>
> Then after I run a line like sys.create_keyspace('testing', 1) in pycassa
> with the connections setup to point to my machine I get the following error:
>
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/usr/local/lib/python2.6/dist-packages/pycassa-1.0.2-py2.6.egg/pycassa/system_manager.py",
> line 365, in drop_keyspace
>     schema_version = self._conn.system_drop_keyspace(keyspace)
>   File
> "/usr/local/lib/python2.6/dist-packages/pycassa-1.0.2-py2.6.egg/pycassa/cassandra/Cassandra.py",
> line 1255, in system_drop_keyspace
>     return self.recv_system_drop_keyspace()
>   File
> "/usr/local/lib/python2.6/dist-packages/pycassa-1.0.2-py2.6.egg/pycassa/cassandra/Cassandra.py",
> line 1266, in recv_system_drop_keyspace
>     (fname, mtype, rseqid) = self._iprot.readMessageBegin()
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/protocol/TBinaryProtocol.py",
> line 126, in readMessageBegin
>     sz = self.readI32()
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/protocol/TBinaryProtocol.py",
> line 203, in readI32
>     buff = self.trans.readAll(4)
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/transport/TTransport.py",
> line 58, in readAll
>     chunk = self.read(sz-have)
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/transport/TTransport.py",
> line 272, in read
>     self.readFrame()
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/transport/TTransport.py",
> line 276, in readFrame
>     buff = self.__trans.readAll(4)
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/transport/TTransport.py",
> line 58, in readAll
>     chunk = self.read(sz-have)
>   File
> "/usr/local/lib/python2.6/dist-packages/thrift05-0.5.0-py2.6-linux-x86_64.egg/thrift/transport/TSocket.py",
> line 108, in read
>     raise TTransportException(type=TTransportException.END_OF_FILE,
> message='TSocket read 0 bytes')
> thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
>
> and then cassandra on the machine dies. Here is some of the log from
> the machine that died:
>
>  INFO [FlushWriter:1] 2010-12-24 03:24:01,999 Memtable.java (line 162)
> Completed flushing /var/lib/cassandra/data/system/LocationInfo-e-24-Data.db
> (301 bytes)
>  INFO [main] 2010-12-24 03:24:02,003 Mx4jTool.java (line 73) Will not load
> MX4J, mx4j-tools.jar is not in the classpath
>  INFO [main] 2010-12-24 03:24:02,048 CassandraDaemon.java (line 77) Binding
> thrift service to /0.0.0.0:9160
>  INFO [main] 2010-12-24 03:24:02,050 CassandraDaemon.java (line 91) Using
> TFramedTransport with a max frame size of 15728640 bytes.
>  INFO [main] 2010-12-24 03:24:02,053 CassandraDaemon.java (line 119)
> Listening for thrift clients...
>  INFO [MigrationStage:1] 2010-12-24 03:26:42,226 ColumnFamilyStore.java
> (line 639) switching in a fresh Memtable for Migrations at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1293161040907.log',
> position=10873)
>  INFO [MigrationStage:1] 2010-12-24 03:26:42,226 ColumnFamilyStore.java
> (line 943) Enqueuing flush of memtable-migrati...@948345082(5902 bytes, 1
> operations)
>  INFO [FlushWriter:1] 2010-12-24 03:26:42,226 Memtable.java (line 155)
> Writing memtable-migrati...@948345082(5902 bytes, 1 operations)
>  INFO [MigrationStage:1] 2010-12-24 03:26:42,238 ColumnFamilyStore.java
> (line 639) switching in a fresh Memtable for Schema at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1293161040907.log',
> position=10873)
>  INFO [MigrationStage:1] 2010-12-24 03:26:42,238 ColumnFamilyStore.java
> (line 943) Enqueuing flush of memtable-sch...@212165140(2194 bytes, 3
> operations)
>  INFO [FlushWriter:1] 2010-12-24 03:26:45,351 Memtable.java (line 162)
> Completed flushing /var/lib/cassandra/data/system/Migrations-e-11-Data.db
> (6035 bytes)
>  INFO [FlushWriter:1] 2010-12-24 03:26:45,531 Memtable.java (line 155)
> Writing memtable-sch...@212165140(2194 bytes, 3 operations)
>
> and the log on the machine that stays up:
>
> ERROR [ReadStage:4] 2010-12-24 03:24:01,979 AbstractCassandraDaemon.java
> (line 90) Fatal exception in thread Thread[ReadStage:4,5,main]
> org.apache.avro.AvroTypeException: Found
> {"type":"record","name":"CfDef","namespace":"org.apache.cassandra.avro","fields":[{"name":"keyspace","type":"string"},{"name":"name","type":"string"},{"name":"column_type","type":["string","null"]},{"name":"comparator_type","type":["string","null"]},{"name":"subcomparator_type","type":["string","null"]},{"name":"comment","type":["string","null"]},{"name":"row_cache_size","type":["double","null"]},{"name":"key_cache_size","type":["double","null"]},{"name":"read_repair_chance","type":["double","null"]},{"name":"gc_grace_seconds","type":["int","null"]},{"name":"default_validation_class","type":["null","string"],"default":null},{"name":"min_compaction_threshold","type":["null","int"],"default":null},{"name":"max_compaction_threshold","type":["null","int"],"default":null},{"name":"row_cache_save_period_in_seconds","type":["int","null"],"default":0},{"name":"key_cache_save_period_in_seconds","type":["int","null"],"default":3600},{"name":"memtable_flush_after_mins","type":["int","null"],"default":60},{"name":"memtable_throughput_in_mb","type":["null","int"],"default":null},{"name":"memtable_operations_in_millions","type":["null","double"],"default":null},{"name":"id","type":["int","null"]},{"name":"column_metadata","type":[{"type":"array","items":{"type":"record","name":"ColumnDef","fields":[{"name":"name","type":"bytes"},{"name":"validation_class","type":"string"},{"name":"index_type","type":[{"type":"enum","name":"IndexType","symbols":["KEYS"],"aliases":["org.apache.cassandra.config.avro.IndexType"]},"null"]},{"name":"index_name","type":["string","null"]}]}},"null"]}]},
> expecting
> {"type":"record","name":"CfDef","namespace":"org.apache.cassandra.avro","fields":[{"name":"keyspace","type":"string"},{"name":"name","type":"string"},{"name":"column_type","type":["string","null"]},{"name":"comparator_type","type":["string","null"]},{"name":"subcomparator_type","type":["string","null"]},{"name":"comment","type":["string","null"]},{"name":"row_cache_size","type":["double","null"]},{"name":"key_cache_size","type":["double","null"]},{"name":"read_repair_chance","type":["double","null"]},{"name":"replicate_on_write","type":["boolean","null"]},{"name":"gc_grace_seconds","type":["int","null"]},{"name":"default_validation_class","type":["null","string"],"default":null},{"name":"min_compaction_threshold","type":["null","int"],"default":null},{"name":"max_compaction_threshold","type":["null","int"],"default":null},{"name":"row_cache_save_period_in_seconds","type":["int","null"],"default":0},{"name":"key_cache_save_period_in_seconds","type":["int","null"],"default":3600},{"name":"memtable_flush_after_mins","type":["int","null"],"default":60},{"name":"memtable_throughput_in_mb","type":["null","int"],"default":null},{"name":"memtable_operations_in_millions","type":["null","double"],"default":null},{"name":"id","type":["int","null"]},{"name":"column_metadata","type":[{"type":"array","items":{"type":"record","name":"ColumnDef","fields":[{"name":"name","type":"bytes"},{"name":"validation_class","type":"string"},{"name":"index_type","type":[{"type":"enum","name":"IndexType","symbols":["KEYS"],"aliases":["org.apache.cassandra.config.avro.IndexType"]},"null"]},{"name":"index_name","type":["string","null"]}],"aliases":["org.apache.cassandra.config.avro.ColumnDef"]}},"null"]}],"aliases":["org.apache.cassandra.config.avro.CfDef"]}
>     at
> org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:212)
>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>     at
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:121)
>     at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:138)
>     at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:114)
>     at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:142)
>     at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:114)
>     at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:118)
>     at
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:142)
>     at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:114)
>     at
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:105)
>     at
> org.apache.cassandra.io.SerDeUtils.deserializeWithSchema(SerDeUtils.java:98)
>     at
> org.apache.cassandra.db.migration.Migration.deserialize(Migration.java:274)
>     at
> org.apache.cassandra.db.DefinitionsUpdateResponseVerbHandler.doVerb(DefinitionsUpdateResponseVerbHandler.java:56)
>     at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
>  INFO [GossipStage:1] 2010-12-24 03:24:02,151 Gossiper.java (line 583) Node
> /10.127.155.205 has restarted, now UP again
>  INFO [GossipStage:1] 2010-12-24 03:24:02,151 StorageService.java (line
> 670) Node /10.127.155.205 state jump to normal
>  INFO [HintedHandoff:1] 2010-12-24 03:24:02,151 HintedHandOffManager.java
> (line 191) Started hinted handoff for endpoint /10.127.155.205
>  INFO [HintedHandoff:1] 2010-12-24 03:24:02,152 HintedHandOffManager.java
> (line 247) Finished hinted handoff of 0 rows to endpoint /10.127.155.205
>  INFO [WRITE-/10.127.155.205] 2010-12-24 03:26:47,789
> OutboundTcpConnection.java (line 115) error writing to /10.127.155.205
>  INFO [ScheduledTasks:1] 2010-12-24 03:26:58,899 Gossiper.java (line 195)
> InetAddress /10.127.155.205 is now dead.
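
Comparing the two schemas in the AvroTypeException above, the "expecting"
schema has a replicate_on_write field that the "Found" schema lacks. That
may mean the two nodes are running different Cassandra builds, although
this is an inference from the log and not something confirmed in this
thread. A small sketch of the comparison, using hypothetical helper names
and heavily abbreviated copies of the two CfDef schemas:

```python
import json


def field_names(schema_json):
    """Return the top-level field names of an Avro record schema."""
    schema = json.loads(schema_json)
    return [field["name"] for field in schema["fields"]]


def missing_fields(found_json, expected_json):
    """List fields in the expected schema that the found schema lacks."""
    found = set(field_names(found_json))
    return [name for name in field_names(expected_json) if name not in found]


# Abbreviated for illustration: only a few fields of each schema are kept.
FOUND = (
    '{"type":"record","name":"CfDef","fields":['
    '{"name":"read_repair_chance","type":["double","null"]},'
    '{"name":"gc_grace_seconds","type":["int","null"]}]}'
)
EXPECTED = (
    '{"type":"record","name":"CfDef","fields":['
    '{"name":"read_repair_chance","type":["double","null"]},'
    '{"name":"replicate_on_write","type":["boolean","null"]},'
    '{"name":"gc_grace_seconds","type":["int","null"]}]}'
)

print(missing_fields(FOUND, EXPECTED))
```

Pasting the full schemas from the log in place of the abbreviated strings
should work the same way, since both are plain JSON.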
>
> The ring output on my node that stays up:
>
> Address         Status State   Load            Owns
> Token
>
> 111232248257764777335763873822010980488
> 10.127.155.205  Down   Normal  85.17 KB        59.06%
> 41570168072350555868554892080805525145
> 10.122.123.210  Up     Normal  91.1 KB         40.94%
> 111232248257764777335763873822010980488
>
> I am not sure how to use the jmx tools to connect to these machines so I
> can't really answer that but hopefully this is enough information to
> diagnose my problem, thanks
>
> Alex
>
>
>
> On Thu, Dec 23, 2010 at 4:35 PM, Dan Hendry <dan.hendry.j...@gmail.com>
> wrote:
>
> Your details are rather vague. What do you mean by killed? Is the Cassandra
> java process still running? Any other warning or error log messages (from
> either node)? Could you provide the last few Cassandra log lines from each
> machine? Can you connect to the node via JMX? What is the output of nodetool
> ring from the second node (which is presumably still alive)? Is there any
> unusual system activity: high CPU usage, low CPU usage, problems with disk
> IO (can be checked with vmstat)?
>
>
>
> Can you provide any further system information? Linux/windows, java
> version, 32/64 bit, amount of ram?
>
>
>
> On Thu, Dec 23, 2010 at 1:42 PM, Alex Quan <alex.q...@tinkur.com> wrote:
>
> Hi,
>
> I am a newbie to Cassandra and am using Cassandra RC 2. I initially had
> Cassandra working on one node and was able to create keyspaces and column
> families and populate the database fine. I tried adding a second node by
> changing the seed to point to another node and setting listen_address and
> rpc_address to blank. I then started up the second node, and it seemed to
> connect fine according to nodetool, but after that I couldn't get it to
> accept any commands, and whenever I tried to make a new keyspace or column
> family it would kill my initial node after a message like this:
>
>  INFO 18:19:49,335 switching in a fresh Memtable for Schema at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1293127746481.log',
> position=9143)
>  INFO 18:19:49,335 Enqueuing flush of memtable-sch...@1358138608(2410
> bytes, 5 operations)
> Killed
>
> and the next few times I start up the server a similar message pops up
> until, I am guessing, everything is flushed out. Then it starts fine until
> I try to add anything to it. I tried changing the yaml file back to the
> original setup and this still happens. I don't know what to try to get it
> to work properly. If you guys can help I would be really grateful.
>
> Alex
>
>
>
>
>
>
>
>
>
