RE: Reboot, now node down 0.8rc1

2011-05-24 Thread Scott McPheeters
It was a 0.8beta2 install that last week I upgraded to rc1.  

I will turn the logging up to debug the next time I have issues to get
more details.

Thank you.



Scott

-Original Message-
From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Monday, May 23, 2011 6:42 PM
To: user@cassandra.apache.org
Subject: Re: Reboot, now node down 0.8rc1

You could have removed the affected commit log file and then run a
nodetool repair after the node had started. 

It would be handy to have some more context for the problem. Was this an
upgrade from 0.7 or a fresh install?

If you are running the rc's it's handy to turn logging up to DEBUG so
there are some more details if things go wrong.

Cheers 
 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 24 May 2011, at 08:05, Scott McPheeters wrote:

> Since this is a testing system, I deleted the commit log and it came
> right up.  My question now is, let's say I had a ton of data in the
> commit log that this node needs now.  What is the best way to get the
> data back to the node?  Does a nodetool repair do this? Or do I need to
> decommission the node and bring it back?  Or am I missing completely
> what the commitlog is?
> 
> 
> Scott
> 
> 
> -Original Message-
> From: Scott McPheeters [mailto:smcpheet...@healthx.com] 
> Sent: Monday, May 23, 2011 2:18 PM
> To: user@cassandra.apache.org
> Subject: Reboot, now node down 0.8rc1
> 
> I have a test node system running release 0.8rc1.  I rebooted node3 and
> now Cassandra is failing on startup.
> 
> Any ideas?  I am not sure where to begin.
> 
> Debian 6, plenty of disk space, Cassandra 0.8rc1
> 
> 
> INFO 13:48:58,192 Creating new commitlog segment /home/cassandra/commitlog/CommitLog-1306172938192.log
> INFO 13:48:58,236 Replaying /home/cassandra/commitlog/CommitLog-1305918923361.log
> INFO 13:49:04,041 Finished reading /home/cassandra/commitlog/CommitLog-1305918923361.log
> java.lang.reflect.InvocationTargetException
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:160)
> Caused by: java.io.IOError: java.io.EOFException
>        at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:265)
>        at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:281)
>        at org.apache.cassandra.io.util.ColumnIterator.next(ColumnSortedMap.java:236)
>        at java.util.concurrent.ConcurrentSkipListMap.buildFromSorted(ConcurrentSkipListMap.java:1493)
>        at java.util.concurrent.ConcurrentSkipListMap.<init>(ConcurrentSkipListMap.java:1443)
>        at org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:402)
>        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:136)
>        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:126)
>        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:367)
>        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:271)
>        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:158)
>        at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:174)
>        at org.apache.cassandra.service.AbstractCassandraDaemon.init(AbstractCassandraDaemon.java:216)
>        ... 5 more
> Caused by: java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readFully(DataInputStream.java:152)
>        at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:394)
>        at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:368)
>        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:87)
>        at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:261)
>        ... 17 more
> Cannot load daemon
> Service exit with a return value of 3
> 
> 
> Scott
> 



Re: Reboot, now node down 0.8rc1

2011-05-24 Thread Sylvain Lebresne
Do you have one of your super column families with a fairly short
value for gc_grace_seconds?

I (strongly) suspect you're hitting
https://issues.apache.org/jira//browse/CASSANDRA-2675.

--
Sylvain



RE: Reboot, now node down 0.8rc1

2011-05-24 Thread Scott McPheeters
I have not changed any defaults as of yet.  Yes, I do have super columns, but 
my gc_grace_seconds is default.


Scott


Re: repair question

2011-05-24 Thread Sylvain Lebresne
On Mon, May 23, 2011 at 9:21 PM, Peter Schuller
 wrote:
>> I'm a bit lost: I tried a repair yesterday with only one CF and that didn't 
>> really work the way I expected but I thought that would be a bug which only 
>> affects that special case.
>>
>> So I tried again for all CFs.
>>
>> I started with a nicely compacted machine with around 320GB of load. Total 
>> disc space on this node was 1.1TB.
>
> Did you do repairs simultaneously on all nodes?
>
> I have seen very significant disk space increases under some
> circumstances. While I haven't filed a ticket about it because there
> was never time to confirm, I believe two things were at play:
>
> (1) nodes were sufficiently out of sync in a sufficiently spread out
> fashion that the granularity of the merkle tree (IIRC, and if I read
> correctly, it divides the ring into up to 2^15 segments but no more)
> became ineffective so that repair effectively had to transfer all the
> data. at first I thought there was an outright bug, but after looking
> at the code I suspected it was just the merkle tree granularity.

It's slightly more complicated than that, but that's the main idea, yes.
The key samples are used to help split the merkle tree "where the data is".
Basically this means the ring is not divided into 2^15 equal segments; in fact
it's not the ring that is divided into 2^15 at all, but the range the node owns
(which is a little bit better). In 0.8 it's even a little bit better, since we
compute a tree for each range, which means that each time we are dividing only
one range (while in 0.7 it's all the ranges the node is responsible for, taken
together). Not sure how clear I am, but anyway it is true that at some point
the granularity of the tree may be "not enough".

And the more spread out the out-of-sync data is, the worse it will be. Though in
general we can expect it not to be too spread out, for the same reason that
caches work.

So maybe it is worth improving the granularity of the tree. It is true that
having a fixed granularity as we do is probably a bit naive. In any case, it
will still be a trade-off between the precision of the tree and its size
(which impacts the memory used, and we have to transfer it -- though the
transfer could be done level by level; right now the whole tree is always
transferred, which is not so optimal).

Now there are probably a number of ideas we could implement to improve that
precision, but that may require non-trivial changes, so really I'd like to be
sure we're not beating a dead horse. Maybe a first step would be to instrument
the repair process to compute the number of rows (and data size) that each
merkle tree segment ends up containing, for instance.
I've created https://issues.apache.org/jira/browse/CASSANDRA-2698 for that.

>
> (2) I suspected at the time that a contributing factor was also that
> as one repair might cause a node to significantly increase it's live
> sstables temporarily until they are compacted, another repair on
> another node may start and start validating compaction and streaming
> of that data - leading to disk space bloat essentially being
> "contagious"; the third node streaming from the node that was
> temporarily bloated, will receive even more data from that node than
> it normally would.
>
> We're making sure to only run one repair at a time between any hosts
> that are neighbors of each other (meaning that at RF=3, that's 1
> concurrent repair per 6 nodes in the cluster).
>
> I'd be interested in hearing anyone confirm or deny whether my
> understanding of (1) in particular is correct. To connect it to
> reality: a 20 GB CF divided into 2^15 segments implies each segment is
> over 600 kbyte in size. For CF:s with tens or hundreds of millions of
> small rows and a fairly random (with respect to partitioner) update
> pattern, it's not very difficult to end up in a situation where most
> 600 kbyte chunks contain out-of-synch data. Particularly in a
> situation with lots of dropped messages.
>
> I'm getting the 2^15 from AntiEntropyService.Validator.Validator()
> which passes a maxsize of 2^15 to the MerkleTree constructor.
>
> --
> / Peter Schuller
>


Re: repair question

2011-05-24 Thread Sylvain Lebresne
On Tue, May 24, 2011 at 12:40 AM, Daniel Doubleday
 wrote:
> We are performing the repair on one node only. Other nodes receive reasonable 
> amounts of data (~500MB).  It's only the repairing node itself which 
> 'explodes'.

That, for instance, is a bit weird. That the node on which the repair
is performed gets more data is expected, since it is repaired against all
its "neighbors" while the neighbors themselves get repaired only
against that given node. But when the differences between two nodes A and B are
computed, the ranges to repair are streamed both from A to B and from
B to A. Unless A and B are wildly out of sync (like A has no data and
B has tons of it), around the same amount of data should transit in
both directions. So with RF=3, the node on which repair was started should get
around 4 times (up to 6 times if you have a weird topology) as much data
as any neighboring node, but that's it. Whereas if I'm correct, you
are reporting that the neighboring nodes get ~500MB and the
"coordinator" gets > 700GB?!
Honestly I'm not sure an imprecision of the merkle tree could account
for that behavior.

Anyway, Daniel, would you be able to share the logs of the nodes (at
least the node on which repair is started) ? I'm not sure how much
that could help but that cannot hurt.

--
Sylvain

>
> I must admit that I'm a noob when it comes to aes/repair. It's just strange
> that a cluster that is up and running with no probs is doing that. But I
> understand that it's not supposed to do what it's doing. I just hope that I
> find out why soon enough.


Re: repair question

2011-05-24 Thread Edward Capriolo

I never run repair for just this reason. It is very intensive and it
produces a lot of data. I am still on 0.6.X so this would be better for me
when I upgrade. If I had to take a wild stab at it I would guess that "guys
like us" with 300 GB of data and possibly tiny rows run a greater chance of
something not being in sync than those with 20 GB a node, for example.

If you are doing a high volume of inserts and disabled HH, or even set it to
only store hints for an hour, and you had a three hour outage, some nodes are
going to be out of sync.


EC2 node adding trouble

2011-05-24 Thread Marcus Bointon
Hi,

First time here. I'm having trouble adding a third node to an existing 2-node
ring running cassandra 0.8rc1 (successfully upgraded from 0.72) on ubuntu on EC2.

Evidently the seed node is working as the second node is already talking to it, 
nodetool lists both nodes.
From the new node I can successfully telnet to ports 7000, 7199 and 9160 on the 
seed node (so it's not a connectivity issue - necessary ports are open on node 
firewalls and EC2 security groups).
All nodes have identical cassandra.yaml files, apart from the new node has 
auto-bootstrap set to true.

Despite all this, cassandra still says "No other nodes seen!  Unable to 
bootstrap" when starting on the third node. Any ideas? How can I troubleshoot 
further?

Thanks,

Marcus

issue/minor bug with counters ?

2011-05-24 Thread Yang
If you have only counter columns in your keyspace and do a lot of
updates on a few keys, the getLiveSize() of the memtable actually returns
the total amount of traffic that has gone into the memtable, not the real
size, so you end up producing very small SSTables, only a few KBytes each.
(I have already changed the CF memtable_throughput, memtable_operations, and
memtable_flush_after params.)


This could be changed easily by modifying CounterColumn.resolve() so that
it updates currentThroughput with the ***difference*** between the old
column size and the new column size, not just the size of the new column.
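
As a standalone illustration of the delta-style bookkeeping being proposed (this is
not Cassandra source, just a sketch of the two accounting strategies using made-up
field names):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // 'throughput' grows by the full size of every update; 'liveSize' grows only by the
    // delta between old and new values, so repeated updates to one column don't inflate it.
    public class MemtableAccounting {
        private final ConcurrentHashMap<String, byte[]> columns = new ConcurrentHashMap<String, byte[]>();
        private final AtomicLong throughput = new AtomicLong(); // traffic-style accounting
        private final AtomicLong liveSize = new AtomicLong();   // delta-style accounting

        void resolve(String name, byte[] newValue) {
            byte[] old = columns.put(name, newValue);
            throughput.addAndGet(newValue.length);
            liveSize.addAndGet(newValue.length - (old == null ? 0 : old.length));
        }

        public static void main(String[] args) {
            MemtableAccounting m = new MemtableAccounting();
            for (int i = 0; i < 100000; i++) {
                m.resolve("counter-1", new byte[32]);  // many updates to the same column
            }
            // Prints roughly: throughput = 3200000, liveSize = 32. Flushing based on the
            // first number is what produces the tiny SSTables described above.
            System.out.println("throughput = " + m.throughput + ", liveSize = " + m.liveSize);
        }
    }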


Re: issue/minor bug with counters ?

2011-05-24 Thread Jonathan Ellis
"the total amount of traffic that has gone into the Memtable" is how
throughput is defined, so this is working as expected.


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: issue/minor bug with counters ?

2011-05-24 Thread Yang
Then I guess Memtable.getLiveSize() should not rely on currentThroughput for the
size calculation; an extra variable (currentSize) would be needed during the
bookkeeping in resolve().




Link & mirrors to download Cassandra is down...

2011-05-24 Thread Sameer Farooqui
http://cassandra.apache.org/download

If you click the link on the right to download 0.7.6-2, the main link to
download Cassandra (@ kahuki.com) and a bunch of the mirrors are down.

Also, one of the HTTP mirrors takes me to a bare chested picture of Brad
Pitt:
http://www.gossipcheck.com/mirrors/apache//cassandra/0.7.6/apache-cassandra-0.7.6-2-bin.tar.gz



- Sameer


Re: Cassandra 0.8 questions

2011-05-24 Thread Jian Fang
Does anyone have a good suggestion on my second question? I believe that
question is a pretty common one.

My third question is a design question. For the same data, we can store
it in multiple column families or in a single column family with multiple
super columns.
From a Cassandra read/write performance point of view, what are the general
rules for when to use multiple column families and when to use a single
column family?

Thanks again,

John

On Mon, May 23, 2011 at 5:47 PM, Jian Fang wrote:

> Hi,
>
> I am pretty new to Cassandra and am going to use Cassandra 0.8.0. I have
> two questions (sorry if they are very basic ones):
>
> 1) I have a column family to hold many super columns, say 30. When I first
> insert the data to the column family, do I need to insert each column one at
> a time or can I insert the whole column family in one transaction (or
> call?)? The latter one seems to be more efficient to me. Does Cassandra
> support that?
>
> For example, I saw the following code to do insertion (with Hector),
>
> Mutator<String> m = HFactory.createMutator(keyspace, stringSerializer);
> m.insert(p.getCassandraKey(), colFamily,
>     HFactory.createStringColumn("type", p.getStringValue()));
> m.insert(p.getCassandraKey(), colFamily,
>     HFactory.createColumn("data", p.getCompressedXML(),
>         StringSerializer.get(), BytesArraySerializer.get()));
>
> Will the insertions be two separate calls to Cassandra? Or they are just
> one transaction? If it is the former case, is there any way to make them as
> one call to Cassandra?
>
> 2) How to store a list/array of data in Cassandra? For example, I have a
> data field called categories, which include none or many categories and each
> category includes a category id and a category description. Usually, how do
> people handle this scenario when they use Cassandra?
>
> Thanks in advance,
>
> John
>
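
On the first question: with Hector, a Mutator can queue several insertions and send
them to Cassandra as a single batch_mutate call. A minimal sketch, assuming the
0.7/0.8-era Hector API and reusing the names from the snippet above (the Page
interface is just a stand-in for the poster's object):

    import me.prettyprint.cassandra.serializers.BytesArraySerializer;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class BatchInsertExample {
        static void save(Keyspace keyspace, String colFamily, Page p) {
            Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
            // addInsertion only queues the mutation; nothing is sent to the server yet.
            m.addInsertion(p.getCassandraKey(), colFamily,
                    HFactory.createStringColumn("type", p.getStringValue()));
            m.addInsertion(p.getCassandraKey(), colFamily,
                    HFactory.createColumn("data", p.getCompressedXML(),
                            StringSerializer.get(), BytesArraySerializer.get()));
            m.execute();  // both columns go out in one round trip
        }

        // Stand-in for the poster's object; only the accessors used above.
        interface Page {
            String getCassandraKey();
            String getStringValue();
            byte[] getCompressedXML();
        }
    }

Note this batch is not a transaction in the relational sense: there is no rollback
across rows; it simply saves round trips.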


Re: Link & mirrors to download Cassandra is down...

2011-05-24 Thread Jeremy Hanna
The link was fixed in cassandra.apache.org/download a couple of hours ago.  For 
the time being it may be better to scroll down to the Backup Sites section and 
use one of those links.




Re: EC2 node adding trouble

2011-05-24 Thread Sameer Farooqui
What region and availability zones are the different nodes in? Are you using
EC2 Snitch? Did you set up the cluster using the Datastax AMI?

- Sameer




Re: Cassandra 0.8 questions

2011-05-24 Thread Victor Kabdebon
It's not really possible to give a general answer to your second question; it
depends on your implementation. Personally I do two things: the first one is
to map arrays with a key, using the name of the column as the key of your array
and the value of the column as the data storage. However for some applications,
as I am using Java, I just serialize my ArrayList (or List) and push all the
content into one column. It all depends on what you want to achieve.

Third question: try to make CFs according to what you want to achieve. I am
designing an internal messaging system and I use only two column families to
hold the message lists, message and message box. I would have used one, but I
need one that is sorted by TimeUUID and the other one by UTF8Type. I think
there is a general consensus here: try to avoid super columns. Two sets of
columns can do the same job as one SuperColumn, and that is
the preferred scheme.

Again, just experiment and be ready to change your organization if you are
beginning with Cassandra; this is the best way to figure out what to do for
your data organization.

Victor Kabdebon
http://www.voxnucleus.fr
http://www.victorkabdebon.net
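
As a sketch of the first approach Victor describes (one column per list element),
applied to the categories example from the question; the column family name, the
"cat:" prefix and the Hector calls are assumptions, not something from the thread:

    import java.util.Map;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    // One row per item, one column per category: column name = category id,
    // column value = category description. "None or many" categories falls out
    // naturally, since a row can simply have no such columns.
    public class CategoryListExample {
        static void saveCategories(Keyspace ks, String itemKey, Map<String, String> categories) {
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            for (Map.Entry<String, String> cat : categories.entrySet()) {
                m.addInsertion(itemKey, "Items",
                        HFactory.createStringColumn("cat:" + cat.getKey(), cat.getValue()));
            }
            m.execute();  // one batch; read back later with a column slice on the row
        }
    }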



Re: Cassandra 0.8 questions

2011-05-24 Thread Jian Fang
Thanks a lot. This is really helpful.

John



Re: repair question

2011-05-24 Thread Daniel Doubleday
Ok thanks for your help Sylvain - much appreciated 

In short: I believe that most of this is me not looking clearly yesterday.
There are only one or two points that I don't get.
Maybe you could help me out there.

First, the ~500MB thing is BS. The closer neighbors received around 80G and the
other 2 around 40G.
Sorry about that, but I got your attention :-)

My missing pieces are:

1. Why was I running out of space? I checked again and found that I started
with 761G of free disc space.

To make it simple I will only look at one CF, 'BlobStore', which is the evil
large one that makes up about 80%.

I grepped for the streaming metadata in the log and summed it up: total
streaming file size was 279G.
This comes as a real surprise but still ...

2. The file access times are strange: why does the node receive data before 
differencing has finished?

On the repairing node I see that the first differencing for that one ended at 13:02:

grep streaming /var/log/cassandra/system.log

 INFO [AntiEntropyStage:1] 2011-05-23 13:02:52,990 AntiEntropyService.java (line 491) Performing streaming repair of 2088 ranges for #


a listing I did on that node in the data dir shows that data files arrive much 
earlier

ls -al *tmp*
...
-rw-r--r-- 1 cass cass   146846246 May 23 12:14 BlobStore-tmp-f-16356-Data.db
-rw-r--r-- 1 cass cass  701291 May 23 12:14 BlobStore-tmp-f-16357-Data.db
-rw-r--r-- 1 cass cass 6628735 May 23 12:14 BlobStore-tmp-f-16358-Data.db
-rw-r--r-- 1 cass cass9991 May 23 12:14 BlobStore-tmp-f-16359-Data.db
...

The youngest file for every CF was written at 12:14, which is the time the first
differencing ended:

 INFO [AntiEntropyStage:1] 2011-05-23 12:14:36,255 AntiEntropyService.java (line 491) Performing streaming repair of 71 ranges for #

https://issues.apache.org/jira/browse/CASSANDRA-2670: Only one tree request was
generated, but still the repairing node got all that data from the other CFs.
That's in fact one of the reasons why I thought that there might be a bug that
sends too much data in the first place.

Thanks for reading this book
Daniel

If your interested here's the log: http://dl.dropbox.com/u/5096376/system.log.gz





I also lied about total size of one node. It wasn't 320 but 280. All nodes 


Re: EC2 node adding trouble

2011-05-24 Thread Marcus Bointon
On 24 May 2011, at 19:33, Sameer Farooqui wrote:

> What region and availability zones are the different nodes in? Are you using 
> EC2 Snitch? Did you set up the cluster using the Datastax AMI?

The two existing ones are in us-east-1c and us-east-1d, the new one is in 
us-east-1c, so all same region, different AZs. I'm using the Alestic Ubuntu 
AMIs. I've not heard of either EC2 snitch or Datastax. Are they particularly 
suited to cassandra?

Marcus

Re: repair question

2011-05-24 Thread Peter Schuller
> And the more spread out the out of sync is, the worse it will be. Though in
> general we can expect that to not be too spread out. For the same reason than
> why caches work.

(Speaking generally now, not addressing the OP's issue)

I'm not sure I buy that. Unless your data is such that hotness is
expected to be related somehow to the order of the data, even with a
workload quite susceptible to caching you very quickly spread requests
out over lots of merkle ring segments.

For example, suppose you have a CF with lots of smallish rows - say
250 byte per row on average. Given a 50 GB load and 2^15 for the range
(let's assume 0.8 improvements), each chunk/merkle segment will be 1.6
mb in size. You don't need to hit a lot of unique keys in order to
make the next repair transfer most content. 2^15 is only 32k; if a
node is e.g. overloaded or some other event causes more than just
random sporadic message dropping for any non-trivial period, it is not
difficult at all to hit >> 32k unique keys being written to for pretty
plausible workloads.
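
The arithmetic behind those numbers, as a small sketch (the 250-byte row size is
the assumption stated above):

    // Back-of-envelope numbers for merkle tree granularity at 2^15 segments.
    public class MerkleGranularity {
        public static void main(String[] args) {
            long segments = 1L << 15;            // maximum number of tree segments
            long[] loadsGb = {20, 50, 320};
            for (long gb : loadsGb) {
                double segmentBytes = gb * 1024.0 * 1024 * 1024 / segments;
                double rowsPerSegment = segmentBytes / 250;  // ~250-byte rows
                System.out.printf("%4d GB load -> %.0f KB per segment, ~%.0f rows per segment%n",
                        gb, segmentBytes / 1024, rowsPerSegment);
            }
        }
    }

For 50 GB this gives roughly 1.6 MB and ~6500 rows per segment, so a single
out-of-sync row among those ~6500 is enough to make the whole segment stream.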

It's somewhat of a concern particularly because if the original reason for the
dropped messages is overload, and you have lots of data and are bottlenecking
on I/O, the "over-repair" will then cause further data bloat (even if
temporary), causing more dropped messages, causing the next repair to do the
same, etc.
> So maybe it is worth improving the granularity of the tree. It is true that
> having a fixed granularity as we do is probably a bit naive. In any case, it
> will still be a trade-off between the precision of the tree and it's size
> (which impact memory used and we have to transfer it (though the transfer part
> could be done by tree level -- right now the whole tree is always transfered
> which is not so optimal)).

Longer term I'm thinking something more incremental and less "bulky"
is desired, and such a solution might have the side-effect of avoiding
the granularity problem.

For example, consider a repair process where the originating node
iterates over its own range in a compaction-like fashion, calculating
small merkle trees as required. The node could ask neighbors to
deliver relevant data in a fashion more similar to hinted hand-off or
read-repair. You'd now have an incremental process that behaves more
like normal writes in terms of the impact on data size, compaction etc
instead of a sudden spike in sstable sizes and large long-running
compaction/streaming jobs.

Further the process should be fairly insensitive to issues like total
data size, the average amount of rows out of synch, etc. It could grab
one chunk at a time, with a "chunk" being some fixed reasonable size
like say 1 GB spread out over appropriate sstables (i.e., take
whatever segment of the ring corresponds to 1 GB). There would be no
need to keep long-term refs to obsolete sstables during this process
since the iteration would be entirely in token order so both the
receiver and the sender in the synchronization process should see a
minimal impact in terms of sstable retention, backed up compactions,
etc. By doing chunks however, you avoid the cost of traversing the
data in a seek bound fashion, since each chunk of reasonable size
would have compaction-like performance characteristics.

Hmmm, I'm starting to like this idea more and more the more I think of it ;)

-- 
/ Peter Schuller


Re: repair question

2011-05-24 Thread Peter Schuller
> Hmmm, I'm starting to like this idea more and more the more I think of it ;)

Filed: https://issues.apache.org/jira/browse/CASSANDRA-2699

-- 
/ Peter Schuller


Re: EC2 node adding trouble

2011-05-24 Thread Sameer Farooqui
Even with AutoBootstrap it is recommended that you always specify
the InitialToken on the new node, because automatic selection of an initial
token will almost certainly result in an unbalanced ring.

Right now, I'm afraid that if you simply copied the YAML file from one of
the two nodes to the 3rd node, then the 3rd node has an incorrect
initial_token setting (it may be conflicting with node 1 or 2's range).

Some info about how to choose tokens:
http://wiki.apache.org/cassandra/Operations#Token_selection

http://journal.paul.querna.org/articles/2010/09/24/cassandra-token-selection/

So, once you know what token each of the 3 nodes should have, shut down the
first two nodes, change their tokens and add the correct token to the 3rd
node (in the YAML file).
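
For reference, the balanced-token formula from those articles (token_i = i * 2^127 / N
for RandomPartitioner) worked out for a 3-node ring; a minimal sketch:

    import java.math.BigInteger;

    // Evenly spaced initial_token values for RandomPartitioner.
    public class TokenCalc {
        public static void main(String[] args) {
            int nodeCount = 3;  // size of the ring being planned
            BigInteger ringSize = BigInteger.valueOf(2).pow(127);
            for (int i = 0; i < nodeCount; i++) {
                BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                           .divide(BigInteger.valueOf(nodeCount));
                System.out.println("node " + i + ": initial_token = " + token);
            }
        }
    }

Node 0 gets token 0, node 1 gets 2^127/3, node 2 gets 2*2^127/3; each value goes
into that node's cassandra.yaml.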

If this doesn't fix it, start to look at the Cassandra log file to
troubleshoot. It's located usually at /var/log/cassandra and is by default
only logging at INFO level. You can make it more granular by changing the
logging level in the file /conf/log4j-server.properties and
change the line with INFO in it to:
log4j.rootLogger=DEBUG,stdout,R

Note that this will slow things down performance-wise in the cluster since
every I/O is logged. But it really helps with learning how Cassandra works
and with troubleshooting.

research links:
http://wiki.apache.org/cassandra/FAQ#existing_data_when_adding_new_nodes



On Tue, May 24, 2011 at 12:17 PM, Marcus Bointon
wrote:

> On 24 May 2011, at 19:33, Sameer Farooqui wrote:
>
> > What region and availability zones are the different nodes in? Are you
> using EC2 Snitch? Did you set up the cluster using the Datastax AMI?
>
> The two existing ones are in us-east-1c and us-east-1d, the new one is in
> us-east-1c, so all same region, different AZs. I'm using the Alestic Ubuntu
> AMIs. I've not heard of either EC2 snitch or Datastax. Are they particularly
> suited to cassandra?
>
> Marcus


Re: EC2 node adding trouble

2011-05-24 Thread aaron morton
Check the listen_address and rpc_address in the yaml file for each node. I
think they are normally set to the private and public addresses respectively.

This may make your life easier:
http://www.datastax.com/dev/blog/setting-up-a-cassandra-cluster-with-the-datastax-ami

Cheers
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com




Measure Latency

2011-05-24 Thread Stephan Pfammatter
What’s the recommended way of measuring latency between nodes in a cluster?

By that I’m not referring to read/write latency for a given KS/CF.

Basically I want to inject a row in a node A and want to see how long it
takes to get to node B (assuming proper RF is set).


I have already some network monitoring in our windows environment between my
distributed nodes in place. But I still would like to get a handle on how a
potential network and/or cassandra slowness affects inter-node latency.
Maybe there is a tool/cmd that I'm not aware of? Tx.


How to make use of Cassandra raw row keys?

2011-05-24 Thread Suan Aik Yeo
We're using Cassandra to store our sessions, all in a single column family
"Sessions" with the format:
Sessions['session_key'] = {'val': }
(session_key is a randomly generated hash)

The "raw" keys I'm talking about are for example the 'key' value as seen
from Cassandra DEBUG output:
insert writing local RowMutation(keyspace='my_keyspace',
key='73657373696f6e3a6365613765323931353838616437343732363130646163666331643161393334',
modifications=[Sessions])

Today we ran into a problem where a session with a given key (say
"session:12345") seemingly disappeared (at least it appeared that way to the
client app), but in the server log DEBUG output, the "raw" Cassandra key
that seemed to correspond to that session_key (say "a12345f") was still
being used as evidenced by DEBUG log output. Indeed, none of the existing
session_keys corresponded to the "a12345f" raw key. However, in
Cassandra-cli when I do the "list Sessions" command, the "a12345f" raw key
shows up as part of the output.

I'd like to dig further into the issue, but first I need to find out:
what are these keys and how are they determined?
Is there any way I could use them in querying Cassandra to find out what
they're pointing to? (Seems that even the cli expects the "session:12345"
type key rather than raw ones when querying)


Thanks,
Suan


Re: Measure Latency

2011-05-24 Thread Aaron Morton
Once the cluster has returned to the client you know the write has been committed
to the Consistency Level number of nodes. i.e. if you send an insert using QUORUM
consistency to a cluster with Replication Factor 3, and you get a non-error
response, you know the write has occurred on at least 2 nodes (one may be the one
the client is connected to). After the initial request has completed it's up to
the "Eventual Consistency" features of Read Repair, Hinted Handoff and
Anti-Entropy to distribute changes.

Does that help or are you thinking of something else?

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
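
If a rough number is still wanted, one way is a client-side probe: write a marker
row through a connection pinned to node A, then poll for it through a connection
pinned to node B. A sketch using Hector as I recall its 0.7/0.8-era API; the host
names, keyspace and "Probe" column family are placeholders, and for a meaningful
result both keyspaces would also need their read/write consistency level set to
ONE, which is left out here:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.ColumnQuery;

    // Write a marker through node A, then time how long until node B can read it.
    public class PropagationProbe {
        public static void main(String[] args) throws InterruptedException {
            StringSerializer ss = StringSerializer.get();

            // Two separate connections, each pinned to a single node (placeholder hosts).
            Cluster clusterA = HFactory.getOrCreateCluster("probeA", "node-a:9160");
            Cluster clusterB = HFactory.getOrCreateCluster("probeB", "node-b:9160");
            Keyspace ksA = HFactory.createKeyspace("MyKeyspace", clusterA);
            Keyspace ksB = HFactory.createKeyspace("MyKeyspace", clusterB);

            String key = "latency-probe-" + System.currentTimeMillis();
            long start = System.nanoTime();
            Mutator<String> m = HFactory.createMutator(ksA, ss);
            m.insert(key, "Probe", HFactory.createStringColumn("marker", "x"));

            ColumnQuery<String, String, String> q =
                    HFactory.createColumnQuery(ksB, ss, ss, ss)
                            .setColumnFamily("Probe").setKey(key).setName("marker");
            HColumn<String, String> col;
            while ((col = q.execute().get()) == null) {   // null until visible on node B
                Thread.sleep(1);
            }
            System.out.printf("visible on node B after %.2f ms%n",
                    (System.nanoTime() - start) / 1e6);
        }
    }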




Re: How to make use of Cassandra raw row keys?

2011-05-24 Thread Aaron Morton
The key printed in the DEBUG message is the byte array the server was given as
the key, converted to hex. Your client API may have converted the string to
ASCII bytes before sending it to the server.

e.g. here is me writing a 'foo' key to the server:

DEBUG 15:52:15,818 insert writing local RowMutation(keyspace='dev', key='666f6f', modifications=[data])

You can tell the CLI what data type the keys are, see the assume statement. e.g.

assume my_cf keys as ascii;

will tell the cli to convert them back to ascii for you.

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
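
For completeness, the hex key from the DEBUG line in the question can be turned
back into the original string with a few lines of plain Java (standalone sketch;
the key is copied from the log output above):

    import java.io.UnsupportedEncodingException;

    // Decode the hex row key that Cassandra logs back into the string the client sent.
    public class HexKeyDecoder {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // Key copied from the DEBUG line in the original question.
            String hexKey = "73657373696f6e3a6365613765323931353838616437343732363130646163666331643161393334";
            byte[] raw = new byte[hexKey.length() / 2];
            for (int i = 0; i < raw.length; i++) {
                raw[i] = (byte) Integer.parseInt(hexKey.substring(2 * i, 2 * i + 2), 16);
            }
            // Prints "session:cea7e291588ad7472610dacfc1d1a934" for the key above.
            System.out.println(new String(raw, "UTF-8"));
        }
    }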


monitoring cassandra with JMX

2011-05-24 Thread vineet daniel
Hi

I have just written a little note on how to monitor cassandra...
http://vineetdaniel.me/2011/03/26/monitoring-cassandra-with-jmx/

I hope it helps the community.



Regards
Vineet Daniel
Cell  : +918106217121
Websites :
Blog    |
Linkedin
|  Twitter 
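
For anyone who prefers plain JMX code to a GUI, a minimal sketch of the same idea:
connect to the node's JMX port (7199 by default) and read an attribute off the
StorageService MBean. The MBean and attribute names here
(org.apache.cassandra.db:type=StorageService, LiveNodes) are the ones I recall
from 0.7/0.8; treat them as assumptions and verify with jconsole first:

    import java.util.List;

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Read the list of live nodes from a Cassandra node over JMX.
    public class JmxPeek {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName storageService =
                        new ObjectName("org.apache.cassandra.db:type=StorageService");
                @SuppressWarnings("unchecked")
                List<String> liveNodes =
                        (List<String>) mbs.getAttribute(storageService, "LiveNodes");
                System.out.println("Live nodes: " + liveNodes);
            } finally {
                connector.close();
            }
        }
    }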


Re: EC2 node adding trouble

2011-05-24 Thread Marcus Bointon
On 25 May 2011, at 02:10, aaron morton wrote:

> Check the listen_address and rpc_address in the yaml file for each node. I 
> think they are normally set to the private and public respectively. 

On all nodes, listen_address and rpc_address are blank and 0.0.0.0 
respectively, the seed node address uses an EC2 external elastic IP's DNS name 
so that it resolves correctly to the seed's internal IP even if it changes. 
Those settings work fine for my existing pair - starting and stopping the 
instances (resulting in internal IP changes) doesn't break anything. As far as 
I can see, setting fixed IPs in here is a recipe for management nightmares on 
EC2.

Sound sane?

Marcus

Re: EC2 node adding trouble

2011-05-24 Thread Marcus Bointon
On 24 May 2011, at 23:58, Sameer Farooqui wrote:

> Even with AutoBootstrap it is recommended that you always specify the 
> InitialToken on the new node because the picking of an initial token will 
> almost certainly result in an unbalanced ring.
> 
> Right now, I'm afraid that if you simply copied the YAML file from one of the 
> two nodes to the 3rd node, then the 3rd node has an incorrect initial_token 
> setting (it may be conflicting with node 1 or 2's range). 
> 
> Some info about how to choose tokens:
> http://wiki.apache.org/cassandra/Operations#Token_selection
> http://journal.paul.querna.org/articles/2010/09/24/cassandra-token-selection/
> 
> So, once you know what token each of the 3 nodes should have, shut down the 
> first two nodes, change their tokens and add the correct token to the 3rd 
> node (in the YAML file).

At the moment I'm not bothered about balance - just having it work at all would 
be good! initial_token is currently blank (i.e. self-defining according to the 
docs) on all nodes - I can see why that might cause balance issues, but I don't 
see that it would prevent it working as there should not be a token clash. If 
there was a token clash, wouldn't it throw an error saying so?

That second article, while wonderfully practical and informative, suggests that 
cassandra is really not suited to rapid scaling up or down - i.e. adding or 
removing nodes is neither quick nor painless, e.g. scaling from 5 to 20 nodes 
and back is something that might happen over weeks rather than minutes. Maybe 
cassandra just isn't the right tool for the job in this case?

Marcus