Re: Monitoring Cluster with JMX

2011-02-09 Thread Roland Gude
Unfortunately not, as the nagios JMX check expects a numeric return value and 
only allows for defining thresholds for issuing warnings or errors depending on 
that value. It does not allow for post processing the return values.

roland

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Tuesday, 8 February 2011 21:32
To: dev@cassandra.apache.org
Subject: Re: Monitoring Cluster with JMX

Can't you get the length of the list on the monitoring side of things?
aaron
On 08 Feb, 2011, at 10:25 PM, Roland Gude wrote:
Hello,

We are trying to monitor our Cassandra cluster with Nagios JMX checks. While
there are JMX attributes which expose the lists of reachable/unreachable hosts,
it would be very helpful to have additional numeric attributes exposing the
sizes of these lists. These could be used to set thresholds in Nagios
monitoring, e.g. Nagios issues a warning if fewer than 3 hosts are reachable.
This is probably not hard to do, and we are willing to implement and supply
patches if someone could point us in the right direction on where to implement it.
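
A minimal sketch of what such numeric attributes could look like as a standard
JMX MBean. Every name here (HostCountExample, HostCountsMBean, the
"example.monitoring" object name) is hypothetical and not existing Cassandra
code; the host lists are passed in as suppliers so the counts are derived from
whatever already backs the list attributes. It assumes Java 8 or later for the
lambda syntax.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.Collection;
import java.util.function.Supplier;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical names throughout; none of this is existing Cassandra code.
final class HostCountExample {

    // Standard MBean interface: each getter becomes a read-only JMX attribute.
    public interface HostCountsMBean {
        int getReachableCount();
        int getUnreachableCount();
    }

    // Derives the counts from the same collections that back the list attributes.
    public static final class HostCounts implements HostCountsMBean {
        private final Supplier<Collection<String>> reachable;
        private final Supplier<Collection<String>> unreachable;

        public HostCounts(Supplier<Collection<String>> reachable,
                          Supplier<Collection<String>> unreachable) {
            this.reachable = reachable;
            this.unreachable = unreachable;
        }

        @Override public int getReachableCount()   { return reachable.get().size(); }
        @Override public int getUnreachableCount() { return unreachable.get().size(); }
    }

    // Registers the bean on the platform MBean server and returns its name.
    static ObjectName register(HostCounts counts, String name) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName objectName = new ObjectName(name);
        server.registerMBean(counts, objectName);
        return objectName;
    }

    public static void main(String[] args) throws Exception {
        // Example addresses only; a real bean would read the live host lists.
        HostCounts counts = new HostCounts(
                () -> Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"),
                () -> Arrays.asList("10.0.0.4"));
        ObjectName name = register(counts, "example.monitoring:type=HostCounts");
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        System.out.println(server.getAttribute(name, "ReachableCount"));
        System.out.println(server.getAttribute(name, "UnreachableCount"));
    }
}
```

A Nagios JMX check could then read ReachableCount as a plain integer and apply
its thresholds directly.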

Greetings,
roland

--
YOOCHOOSE GmbH

Roland Gude
Software Engineer

Im Mediapark 8, 50670 Köln

+49 221 4544151 (Tel)
+49 221 4544159 (Fax)
+49 171 7894057 (Mobil)


Email: roland.g...@yoochoose.com
WWW: www.yoochoose.com

YOOCHOOSE GmbH
Geschäftsführer: Dr. Uwe Alkemper, Michael Friedmann
Handelsregister: Amtsgericht Köln HRB 65275
Ust-Ident-Nr: DE 264 773 520
Sitz der Gesellschaft: Köln


Re: New feature / educational project

2011-02-09 Thread Dave Revell
+1 for interest. This feature would be great if done well.
On Feb 8, 2011 11:54 PM, "m...@monit.dk"  wrote:
> I noticed :)
>
> But the question was if such a feature could make it to the trunk - and as
> far as I see, there is enough interest around this.
>
> - Reply message -
> From: "Tristan Tarrant" 
> Date: Wed, Feb 9, 2011 05:56
> Subject: New feature / educational project
> To: 
>
> With Java 6 there is no need to add Rhino, as the javax.script package
> is already there.
> Tristan
> On Feb 8, 2011 9:56 PM, "Morten Wegelbye Nissen"  wrote:
>> Hello mighty developers of Cassandra,
>>
>> I have been thinking of creating a feature like stored procedures for
>> Cassandra.
>> The concept is actually pretty simple: add one of the JavaScript compilers
>> (Mozilla Rhino or a similar one). Save the JS source in a CF in the system
>> keyspace. Add a feature to Thrift to invoke the code. Return results just
>> like get_slice.
>> Needless to say, the execution environment needs access to the
>> keyspaces and needs to be sandboxed (i.e. no access to the filesystem etc.).
>>
>> On the CLI it would be something like: > invoke myProc param1, param2,
>> param3
>>
>> The alternative, where extensions are done by implementing interfaces like
>> the existing ones, would require a rather complex distribution of jars.
>>
>> Now I might have the option to get this done as an educational project,
>> after which I would like to release the code freely.
>>
>> Would a feature like that ever make it to the core of Cassandra?
>>
>> ./Morten
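
For reference, a minimal sketch of the javax.script calls Tristan mentions,
evaluating a JavaScript source string and invoking one of its functions by
name. The class and method names (StoredScriptSketch, invoke) are hypothetical,
and nothing here touches Cassandra, Thrift, or sandboxing. Java 6 shipped a
Rhino-based engine, but JDK 15 and later bundle no JavaScript engine, so the
sketch returns null when none is found.

```java
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical sketch: shows only the javax.script calls, with no sandboxing
// and no Cassandra integration.
final class StoredScriptSketch {

    // Evaluates the source, then calls one of its functions by name.
    // Returns null when this JVM provides no JavaScript engine.
    static Object invoke(String source, String function, Object... args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            return null;
        }
        engine.eval(source);
        return ((Invocable) engine).invokeFunction(function, args);
    }

    public static void main(String[] args) throws Exception {
        // In the proposal the source would be read from a column family instead.
        String source = "function myProc(a, b) { return a + b; }";
        Object result = invoke(source, "myProc", 2, 3);
        System.out.println(result == null ? "no JavaScript engine available" : result);
    }
}
```

The hard parts of the proposal (restricting what the script can reach, and
exposing keyspaces to it) are deliberately left out of this sketch.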


Re: Monitoring Cluster with JMX

2011-02-09 Thread Ryan King
If you're using 0.7, I'd skip JMX and use the mx4j HTTP interface, then
write scripts that convert the data to the format you need.
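
A rough sketch of the conversion step only, assuming the host list has already
been fetched as text. The input format (a bracketed, comma-separated list), the
class name ReachableHostsCheck, and the thresholds are all assumptions; how the
list is retrieved from mx4j is not shown. The status codes follow the Nagios
plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the input format and thresholds are assumptions.
final class ReachableHostsCheck {

    // Status codes follow the Nagios plugin convention.
    static final int OK = 0, WARNING = 1, CRITICAL = 2;

    // Splits a list such as "[10.0.0.1, 10.0.0.2]" into its non-empty entries.
    static List<String> parseHosts(String raw) {
        List<String> hosts = new ArrayList<>();
        String body = raw.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        for (String part : body.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    // Fewer reachable hosts is worse, so both thresholds are lower bounds.
    static int status(int reachable, int warnBelow, int critBelow) {
        if (reachable < critBelow) return CRITICAL;
        if (reachable < warnBelow) return WARNING;
        return OK;
    }

    // Builds the single output line: human-readable text, then perf data after '|'.
    static String report(int reachable, int code) {
        String label = code == OK ? "OK" : code == WARNING ? "WARNING" : "CRITICAL";
        return "HOSTS " + label + " - " + reachable + " reachable | reachable=" + reachable;
    }
}
```

A wrapper's main method would print the report line and pass the status code
to System.exit, which is what Nagios inspects.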

-ryan

On Wed, Feb 9, 2011 at 2:47 AM, Roland Gude  wrote:
> Unfortunately not, as the nagios JMX check expects a numeric return value and 
> only allows for defining thresholds for issuing warnings or errors depending 
> on that value. It does not allow for post processing the return values.
>
> roland



-- 
-@rk


Re: Monitoring Cluster with JMX

2011-02-09 Thread Roland Gude
Ah... thanks for the pointer.
This should indeed be much simpler.

Thanks.

-Original Message-
From: Ryan King [mailto:r...@twitter.com]
Sent: Wednesday, 9 February 2011 18:11
To: dev@cassandra.apache.org
Subject: Re: Monitoring Cluster with JMX

If you're using 0.7, I'd skip jmx and use the mx4j http interface then
write scripts that convert the data to the format you need.

-ryan


Re: Using Cassandra-cli

2011-02-09 Thread Eranda Sooriyabandara
Hi Vishal, Aaron and all,

Thanks for the help. I tried it and it worked for me.
However, I could not find any documentation of the attributes that some
commands accept.

e.g.
update column family <name> [with <att1>=<value1> [and <att2>=<value2> ...]];
create keyspace <name> [with <att1>=<value1> [and <att2>=<value2> ...]];
(for example, comparator=UTF8Type and default_validation_class=UTF8Type can be
set as attributes)

Is there any documentation that lists the applicable attributes
in each case?

Thanks,
Eranda

P.S. I put up a blog post on Cassandra-cli at
http://emsooriyabandara.blogspot.com/ and would appreciate corrections if I
got anything wrong.


Re: Gossip messages at DEBUG

2011-02-09 Thread Brandon Williams
On Tue, Feb 8, 2011 at 9:01 PM, Aaron Morton wrote:

> I've just put the latest 0.7 build on a node and it's logging gossip
> messages at DEBUG and making the logs really hard to use. Anyone object to
> moving these to TRACE level ?
>

Moved to TRACE.  I think when this was moved from sendRR to sendOneWay
gossip wasn't considered.

-Brandon


Re: Gossip messages at DEBUG

2011-02-09 Thread Aaron Morton
thanks.
A
On 10 Feb, 2011, at 08:21 AM, Brandon Williams wrote:
> On Tue, Feb 8, 2011 at 9:01 PM, Aaron Morton wrote:
>
> > I've just put the latest 0.7 build on a node and it's logging gossip
> > messages at DEBUG and making the logs really hard to use. Anyone object to
> > moving these to TRACE level ?
> >
>
> Moved to TRACE.  I think when this was moved from sendRR to sendOneWay
> gossip wasn't considered.
>
> -Brandon


Re: SEVERE Data Corruption Problems

2011-02-09 Thread Jonathan Ellis
Hi Dan,

it would be very useful to test with 0.7 branch instead of 0.7.0 so at
least you're not chasing known and fixed bugs like CASSANDRA-1992.

As you say, there's a lot of people who aren't seeing this, so it
would also be useful if you can provide some kind of test harness
where you can say "point this at a cluster and within a few hours

On Wed, Feb 9, 2011 at 4:31 PM, Dan Hendry  wrote:
> I have been having SEVERE data corruption issues with SSTables in my
> cluster. For one CF it was happening almost daily (I have since shut down
> the service using that CF as it was too much work to manage the Cassandra
> errors). At this point, I can’t see how it is anything but a Cassandra bug,
> yet it’s somewhat strange and very scary that I am the only one who seems to
> be having such serious issues. Most of my data is indexed in two ways, so I
> have been able to write a validator which goes through and backfills
> missing data, but it’s kind of defeating the whole point of Cassandra. The
> only way I have found to deal with issues when they crop up, and to prevent
> nodes crashing from repeated failed compactions, is to delete the SSTable. My
> cluster is running a slightly modified 0.7.0 version which logs which files
> errors occur for, so that I can stop the node and delete them.
>
>
>
> The problem:
>
> -  Reads, compactions and hinted handoff fail with various
> exceptions (samples shown at the end of this email) which seem to indicate
> sstable corruption.
>
> -  I have seen failed reads/compactions/hinted handoff on 4 out of 4
> nodes (RF=2) for 3 different super column families and 1 standard column
> family (4 out of 11) and, just now, the Hints system CF (if it matters, the
> ring has not changed since one CF which has been giving me trouble was
> created). I have checked SMART disk info and run various diagnostics and there
> do not seem to be any hardware issues. Besides, what are the chances of all
> four nodes having the same hardware problems at the same time when, for all
> other purposes, they appear fine?
>
> -  I have added logging which outputs which sstables are causing
> exceptions to be thrown. The corrupt sstables have been both freshly flushed
> memtables and the output of compaction (i.e., 4 sstables which all seem to be
> fine get compacted to 1 which is then corrupt). It seems that the majority
> of corrupt sstables are post-compaction (vs post-memtable flush).
>
> -  The one CF which was giving me the most problems was heavily
> written to (1000-1500 writes/second continually across the cluster). For
> that CF, I was having to delete 4-6 sstables a day across the cluster (and
> the number was going up; even the number of problems for the remaining CFs is
> going up). The other CFs which have had corrupt sstables are also quite
> heavily written to (generally a few hundred writes a second across the
> cluster).
>
> -  Most of the time (5/6 attempts) when this problem occurs,
> sstable2json also fails. I have however, had one case where I was able to
> export the sstable to json, then re-import it at which point I was no longer
> seeing exceptions.
>
> -  The cluster has been running for a little over 2 months now; the
> problem seems to have sprung up in the last 3-4 weeks and seems to be
> steadily getting worse.
>
>
>
> Ultimately, I think I am hitting some subtle race condition somewhere. I
> have been starting to dig into the Cassandra code but I barely know where to
> start looking. I realize I have not provided nearly enough information to
> easily debug the problem, but PLEASE keep your eyes open for possibly racy or
> buggy code which could cause these sorts of problems. I am willing to
> provide full Cassandra logs and a corrupt SSTable on an individual basis:
> please email me and let me know.
>
>
>
> Here is possibly relevant information and my theories on a possible root
> cause. Again, I know little about the Cassandra code base and have only
> moderate Java experience, so these theories may be way off base.
>
> -  Strictly speaking, I probably don’t have enough memory for my
> workload. I see stop-the-world GC occurring ~30 times/day/node, often causing
> Cassandra to hang for 30+ seconds (according to the GC logs). Could there be
> some Java bug where a full GC in the middle of writing or flushing
> (compaction/memtable flush) or doing some other disk-based activity causes
> some sort of data corruption?
>
> -  Writes are usually done at ConsistencyLevel ONE with additional
> client side retry logic. Given that I often see consecutive nodes in the
> ring down, could there be some edge condition where dying at just the right
> time causes parts of mutations/messages to be lost?
>
> -  All of the CFs which have been causing me problems have large
> rows which are compacted incrementally. Could there be some problem with the
> incremental compaction logic?
>
> -  My cluster has a fairly heavy write load (again, the most
> problemat

RE: SEVERE Data Corruption Problems

2011-02-09 Thread Dan Hendry
I will put two nodes on 0.7. Did you really mean CASSANDRA-1992? I looked
over the bug report and patch but can’t see how it is related to the problems
I have been having. I am not performing bootstraps or repairs, and I haven’t
since one of the most problematic CFs was created. I have also looked
over the resolved issues for 0.7.1 and did not see anything which I thought
could be related.

I would love to provide a test cluster, and we actually have one for our
development environment, but it is working flawlessly: exact same Cassandra
version, application code, Java version and OS. The only difference is that
it has a far lower write load and is in EC2 instead of on physical machines.
It’s one of the reasons I believe I am hitting some strange race/edge
condition somewhere.

Looking over the user list, it seems at least one other person is having the
same type of problem:
http://www.mail-archive.com/user@cassandra.apache.org/msg09838.html .
Although I have not seen the second error (possibly because I don’t do range
slices), the first error looks eerily familiar.

Dan

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: February-09-11 18:14
To: dev
Subject: Re: SEVERE Data Corruption Problems

Hi Dan,

it would be very useful to test with 0.7 branch instead of 0.7.0 so at
least you're not chasing known and fixed bugs like CASSANDRA-1992.

As you say, there's a lot of people who aren't seeing this, so it
would also be useful if you can provide some kind of test harness
where you can say "point this at a cluster and within a few hours


Re: Using Cassandra-cli

2011-02-09 Thread Vishal Gupta
Hi Eranda,

You can refer to the book "Cassandra: The Definitive Guide". There are also web
apps (http://github.com/suguru/cassandra-webconsole) which help you do the
same via a browser.

You can also read these articles:
1) http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
2)
http://maxgrinev.com/2010/07/09/a-quick-introduction-to-the-cassandra-data-model/

Regards,
vishal

On Thu, Feb 10, 2011 at 12:45 AM, Eranda Sooriyabandara
<0704...@gmail.com> wrote:

> Hi Vishal, Aaron and all,
>
> Thanks for the help. I tried it and it worked for me.
> However, I could not find any documentation of the attributes that some
> commands accept.
>
> e.g.
> update column family <name> [with <att1>=<value1> [and <att2>=<value2> ...]];
> create keyspace <name> [with <att1>=<value1> [and <att2>=<value2> ...]];
> (for example, comparator=UTF8Type and default_validation_class=UTF8Type can be
> set as attributes)
>
> Is there any documentation that lists the applicable attributes
> in each case?
>
> Thanks,
> Eranda
>
> P.S. I put up a blog post on Cassandra-cli at
> http://emsooriyabandara.blogspot.com/ and would appreciate corrections if I
> got anything wrong.
>