Token != DecoratedKey assertion

2011-09-25 Thread Philippe
Hello,
I've seen a couple of these in my logs, running 0.8.4.
This is a RF=3, 3-node cluster. 2 nodes including this one are on 0.8.4 and
one is on 0.8.5

The node is still functioning hours later. Should I be worried?

Thanks

ERROR [ReadStage:94911] 2011-09-24 22:40:30,043 AbstractCassandraDaemon.java
(line 134) Fatal exception in thread Thread[ReadStage:94911,5,main]
java.lang.AssertionError:
DecoratedKey(Token(bytes[224ceb80b5fb11e0848783ceb9bf0002ff33]),
224ceb80b5fb11e0848783ceb9bf0002ff33) !=
DecoratedKey(Token(bytes[038453154cb0005f14]), 038453154cb0005f14)
in /var/lib/cassandra/data/X/PUBLIC_MONTHLY_20-g-10634-Data.db
at
org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:59)
at
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1315)
at
org.apache.cassandra.db.ColumnFamilyStore.cacheRow(ColumnFamilyStore.java:1182)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1222)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
at org.apache.cassandra.db.Table.getRow(Table.java:385)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:58)
at
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:642)
at
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1107)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [ReadStage:94936] 2011-09-24 22:40:30,042 AbstractCassandraDaemon.java
(line 134) Fatal exception in thread Thread[ReadStage:94936,5,main]
java.lang.AssertionError: DecoratedKey(Token(bytes[]), ) !=
DecoratedKey(Token(bytes[038453154c90005f14]), 038453154c90005f14)
in /var/lib/cassandra/data/X/PUBLIC_MONTHLY_20-g-10634-Data.db
at
org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:59)
at
org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
at
org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
at
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1315)
at
org.apache.cassandra.db.ColumnFamilyStore.cacheRow(ColumnFamilyStore.java:1182)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1222)
at
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
at org.apache.cassandra.db.Table.getRow(Table.java:385)
at
org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:58)
at
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:642)
at
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1107)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [ReadStage:94713] 2011-09-24 22:40:30,041 AbstractCassandraDaemon.java
(line 134) Fatal exception in thread Thread[ReadStage:94713,5,main]
java.lang.AssertionError:
DecoratedKey(Token(bytes[7c4831fe0001ffaa000c6c697665626f782d6265306580008002000700031195010481327e62362a002400019c0dc550c60111e001687c4831fe000100010007000311950100010481327e62362a002400019c0dc550c60111e001687c4831fe0001ffbc6c697665626f782d6631326380008002000700031195010481327e62362a002400019c0dc550c60111e001687c4831fe000100010007000311950100010481327e62362a002400019c0dc550c60111e001687c4831fe0001ffab000b062a27f9e35f1300]),
7c4831fe0001ffaa000c6c697665626f782d6265306580008002000700031195010481327e62362a002400019c0dc550c60111e001687c4831fe000100010007000311950100010481327e62362a002400019c0dc550c60111e001687c4831fe0001ffbc6c697665626f782d6631326380008

Seed vs non-seed in YAML

2011-09-25 Thread Philippe
Hello,

I'm deploying my cluster with Puppet so it's actually easier for me to add
all cassandra nodes to the seed list in the YAML file than to choose a few.
Would there be any reason NOT to do this ?

Thanks


Re: frequent node UP/Down?

2011-09-25 Thread Philippe
I have this happening on 0.8.x. It looks to me like this happens when the node
is under heavy load, such as unthrottled compactions or a huge GC.

2011/9/24 Yang 

> I'm using 1.0.0
>
>
> there seems to be too many node Up/Dead events detected by the failure
> detector.
> I'm using  a 2 node cluster on EC2, in the same region, same security
> group, so I assume the message drop
> rate should be fairly low.
> but in about every 5 minutes, I'm seeing some node detected as down,
> and then Up again quickly, like the following
>
>
>  INFO 20:30:12,726 InetAddress /10.71.111.222 is now dead.
>  INFO 20:30:32,154 InetAddress /10.71.111.222 is now UP
>
>
> does the "1 in every 5 minutes" sound roughly right for your setup? I
> just want to make sure the unresponsiveness is not
> caused by something like memtable flushing, or GC, which I can
> probably further tune.
>
>
> Thanks
> Yang
>


Re: Increasing thrift_framed_transport_size_in_mb

2011-09-25 Thread aaron morton
Some discussion of large data here 
http://wiki.apache.org/cassandra/LargeDataSetConsiderations

When creating large rows you also need to be aware of 
in_memory_compaction_limit_in_mb (see the yaml) and that all columns for a row 
are stored on the same node. So if you store one file in one row you may not 
get the best load distribution. 
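
For reference, the setting lives in cassandra.yaml; a rough sketch (64 is just 
the usual default, check your own yaml):

    # rows bigger than this are compacted with the slower two-pass path
    # instead of being built entirely in memory
    in_memory_compaction_limit_in_mb: 64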

I've heard mention before that 10MB is a reasonable max for a row if you have 
no natural partitions. 

That said, CFS in Brisk put each block in a row, and used columns for the sub 
blocks. And the default settings for HFS are:

  

<property>
  <name>fs.local.block.size</name>
  <value>67108864</value>
</property>

<property>
  <name>fs.local.subblock.size</name>
  <value>2097152</value>
</property>


Hope that helps. 

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2011, at 9:27 PM, Radim Kolar wrote:

> On 24.9.2011 at 0:05, Jonathan Ellis wrote:
>> Really large messages are not encouraged because they will fragment
>> your heap quickly.  Other than that, no.
> what is recommended chunk size for storing multi gigabyte files in cassandra? 
> 64MB is okay or its too large?



Re: Can not connect to cassandra 0.7 using CLI

2011-09-25 Thread aaron morton
Make sure that the directory /var/log/cassandra exists and the user running 
cassandra has permission to use it. 

There are some instructions here in the readme file 
https://github.com/apache/cassandra/blob/cassandra-0.7.9/README.txt#L27

Good luck. 

A

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2011, at 10:01 PM, Julio Julio wrote:

> Eric Evans  rackspace.com> writes:
> 
>> 
>> On Thu, 2010-12-02 at 08:11 +1100, Joshua Partogi wrote:
>>> It is set to localhost I didn't change it and it is the same as
>>> configured
>>> in 0.6.8. Why doesn't it work out of the box?
>>> 
>>> Thanks heaps. 
>> 
>> Try "netstat -nl | grep 9160".  Is the node listening on 9160?  Which
>> interface is it bound to?
>> 
> 
> 
> I've got the same problem :( I've tried "netstat -nl | grep 9160" 
> and received nothing on the console output. In my cassandra.yaml
> file I've got:
> listen_address: localhost
> rpc_address: localhost
> rpc_port: 9160
> rpc_keepalive: true
> 
> (it might be a clue what is wrong) when I type "cassandra" on 
> the terminal I get this:
> 
> julio@julio-System-Product-Name:~$ log4j:ERROR setFile
> (null,true) 
> call failed.
> java.io.FileNotFoundException: /var/log/cassandra/system.log 
> (Permission denied)
>   at java.io.FileOutputStream.openAppend(Native Method)
>   at java.io.FileOutputStream.
> (FileOutputStream.java:207)
> 
> and more ... ;/
> 
> what should I do? maybe I make silly mistakes but I'm completely 
> new to noSQL and Cassandra. Please, help me!
> 
> best regards
> Julio
> 
> 



Re: Could not reach schema agreement when adding a new node.

2011-09-25 Thread aaron morton
Check the schema agreement using the CLI by running describe cluster; it will 
tell you if they are in agreement.

It may have been a temporary thing while the new machine was applying its 
schema. 

If the nodes are not in agreement, or you want to dig deeper, look for log 
messages from "Migration".
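
For example, something along these lines (output trimmed, and the schema 
version UUID below is made up):

    $ bin/cassandra-cli -h 192.168.1.9
    [default@unknown] describe cluster;
    Cluster Information:
       Snitch: org.apache.cassandra.locator.SimpleSnitch
       Partitioner: org.apache.cassandra.dht.RandomPartitioner
       Schema versions:
            75eece10-bf48-11e0-0000-4d205df954a7: [192.168.1.9, 192.168.1.10]

If more than one schema version is listed, the nodes disagree.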


Cheers


-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 24/09/2011, at 10:10 PM, Dikang Gu wrote:

> I found this in the system.log when adding a new node to the cluster.
> 
> Anyone familiar with this?
> 
> ERROR [HintedHandoff:2] 2011-09-24 18:01:30,498 AbstractCassandraDaemon.java 
> (line 113) Fatal exception in thread Thread[HintedHandoff:2,1,main]
> java.lang.RuntimeException: java.lang.RuntimeException: Could not reach 
> schema agreement with /192.168.1.9 in 6ms
>   at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.RuntimeException: Could not reach schema agreement with 
> /192.168.1.9 in 6ms
>   at 
> org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:290)
>   at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:301)
>   at 
> org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89)
>   at 
> org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:394)
>   at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
>   ... 3 more
> 
> Thanks.
> 
> -- 
> Dikang Gu
> 
> 0086 - 18611140205
> 



Re: progress of sstableloader keeps 0?

2011-09-25 Thread aaron morton
Cassandra 0.8 can read data from previous versions, i.e. if you upgrade to 0.8 it can 
read the existing files from 0.7. 

But what you are doing with the sstable loader is (AFAIK) only copying the Data 
portion of the CF. Once the table is loaded the node will then build the Index 
and the Filter; this is the createBuilder() call in the stack. It's throwing 
because version 0.8 does not want to make version 0.8 Index and Filter 
files for a version 0.7 Data file. 

We get the same problem when upgrading from 0.7 to 0.8, where Repair will not 
work because it is streaming a 0.7 version data file and the recipient then 
tries to build the Index and Filter files. 

So to read 0.7 data from 0.8 just copy over *all* the files for the keyspace 
(data, filter and index). Then scrub the nodes so that repair can work. 
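
Roughly like this, with cassandra stopped on the target node (the paths and the 
keyspace name are only examples):

    cp /backup/0.7-snapshot/MyKeyspace/*-Data.db   /var/lib/cassandra/data/MyKeyspace/
    cp /backup/0.7-snapshot/MyKeyspace/*-Index.db  /var/lib/cassandra/data/MyKeyspace/
    cp /backup/0.7-snapshot/MyKeyspace/*-Filter.db /var/lib/cassandra/data/MyKeyspace/
    # start cassandra, then rewrite the sstables in the current format
    nodetool -h localhost scrub MyKeyspace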

Hope that helps. 

 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 6:07 PM, Yan Chunlu wrote:

> yes, I did.  thought 0.8 is downward compatible. is there other ways to load 
> 0.7's data into 0.8?  will copy the data dir directly will work?   I would 
> like to put load of three nodes into one node.
> 
>  thanks!
> 
> On Sun, Sep 25, 2011 at 11:52 AM, aaron morton  
> wrote:
> Looks like it is complaining that you are trying to load a 0.7 SSTable in 
> 0.8. 
> 
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/09/2011, at 5:23 PM, Yan Chunlu wrote:
> 
>> sorry I did not look into it  after check it I found version mismatch 
>> exception is in the log:
>> ERROR [Thread-17] 2011-09-22 08:24:24,248 AbstractCassandraDaemon.java (line 
>> 139) Fatal exception in thread Thread[Thread-17,5,main]
>> java.lang.RuntimeException: Cannot recover SSTable 
>> /disk2/cassandra/data/reddit/Comments-tmp-f-1 due to version mismatch. 
>> (current version is g).
>> at 
>> org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
>> at 
>> org.apache.cassandra.db.compaction.CompactionManager.submitSSTableBuild(CompactionManager.java:1097)
>> at 
>> org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:110)
>> at 
>> org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
>> at 
>> org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
>> at 
>> org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:189)
>> at 
>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>> 
>> 
>> does that mean I need to run scrub before running the loader?  could I just 
>> delete it and keep going?  thanks!
>> 
>> On Fri, Sep 23, 2011 at 2:16 AM, Jonathan Ellis  wrote:
>> Did you check for errors in logs on both loader + target?
>> 
>> On Thu, Sep 22, 2011 at 10:52 AM, Yan Chunlu  wrote:
>> > I took a snapshot of one of my node in a cluster 0.7.4(N=RF=3).   use
>> > sstableloader to load the snapshot data to another 1 node cluster(N=RF=1).
>> >
>> > after execute  "bin/sstableloader  /disk2/mykeyspace/"
>> >
>> > it says"Starting client (and waiting 30 seconds for gossip) ..."
>> > "Streaming revelant part of  cf1.db. to [10.23.2.4]"
>> > then showing the progress indicator and stopped. nothing changed after
>> > then.
>> > progress: [/10.28.53.16 1/72 (0)] [total: 0 - 0MB/s (avg: 0MB/s)]]]
>> >
>> > I use nodetool to check the node 10.23.2.4, nothing changed. no data copied
>> > to it. and the data dir also keep its original size. is there anything
>> > wrong? how can I tell what was going on there?
>> > thanks!
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>> 
> 
> 



Re: Moving to a new cluster

2011-09-25 Thread aaron morton
sounds like it. 

A
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 6:10 PM, Yan Chunlu wrote:

> thanks!  is that similar problem described in this thread?
> 
>  
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-repair-caused-high-disk-space-usage-td6695542.html
> 
> On Sun, Sep 25, 2011 at 11:33 AM, aaron morton  
> wrote:
> It can result in a lot of data on the node you run repair on. Where a lot 
> means perhaps 2 or more  times more data.
> 
> My unscientific approach is to repair one CF at a time so you can watch the 
> disk usage and repair the smaller CF's first. After the repair compact if you 
> need to. 
> 
> I think  the amount of extra data will be related to how out of sync things 
> are, so once you get repair working smoothly it will be less of problem.
> 
> Cheers
> 
> 
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/09/2011, at 3:04 AM, Yan Chunlu wrote:
> 
>> 
>> hi Aaron:
>> 
>> could you explain more about the issue about repair make space usage going 
>> crazy?
>> 
>> I am planning to upgrade my cluster from 0.7.4 to 0.8.6, which is because 
>> the repair never works on 0.7.4 for me.
>> more specifically, CASSANDRA-2280 and CASSANDRA-2156.
>> 
>> 
>> from your description, I really worried about 0.8.6 might make it worse...
>> 
>> thanks!
>> 
>> On Thu, Sep 22, 2011 at 7:25 AM, aaron morton  
>> wrote:
>> How much data is on the nodes in cluster 1 and how much disk space on 
>> cluster 2 ? Be aware that Cassandra 0.8 has an issue where repair can go 
>> crazy and use a lot of space. 
>> 
>> If you are not regularly running repair I would also repair before the move.
>> 
>> The repair after the copy is a good idea but should technically not be 
>> necessary. If you can practice the move watch the repair to see if much is 
>> transferred (check the logs). There is always a small transfer, but if you 
>> see data been transferred for several minutes I would investigate. 
>> 
>> When you start a repair it will repair will the other nodes it replicates 
>> data with. So you only need to run it every RF nodes. Start it one one, 
>> watch the logs to see who it talks to and then start it on the first node it 
>> does not talk to. And so on. 
>> 
>> Add a snapshot before the clean (repair will also snapshot before it runs)
>> 
>> Scrub is not needed unless you are migrating or you have file errors.
>> 
>> If your cluster is online, consider running the clean every RFth node rather 
>> than all at once (e.g. 1,4, 7, 10 then 2,5,8,11). It will have less impact 
>> on clients. 
>> 
>> Cheers
>> 
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 22/09/2011, at 10:27 AM, Philippe wrote:
>> 
>>> Hello,
>>> We're currently running on a 3-node RF=3 cluster. Now that we have a better 
>>> grip on things, we want to replace it with a 12-node RF=3 cluster of 
>>> "smaller" servers. So I wonder what the best way to move the data to the 
>>> new cluster would be. I can afford to stop writing to the current cluster 
>>> for whatever time is necessary. Has anyone written up something on this 
>>> subject ?
>>> 
>>> My plan is the following (nodes in cluster 1 are node1.1->1.3, nodes in 
>>> cluster 2 are node2.1->2.12)
>>> stop writing to current cluster & drain it
>>> get a snapshot on each node
>>> Since it's RF=3, each node should have all the data, so assuming I set the 
>>> tokens correctly I would move the snapshot from node1.1 to node2.1, 2.2, 
>>> 2.3 and 2.4 then node1.2->node2.5,2.6,2.,2.8, etc. This is because the 
>>> range for node1.1 is now spread across 2.1->2.4
>>> Run repair & clean & scrub on each node (more or less in //)
>>> What do you think ?
>>> Thanks
>> 
>> 
> 
> 



Re: CMS GC initial-mark taking 6 seconds , bad?

2011-09-25 Thread aaron morton
It does seem long and will be felt by your application. 

Are you running a 47GB heap? Most peeps seem to think 8 to 12 GB is about the 
viable maximum. 

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 7:14 PM, Yang wrote:

> I see the following in my GC log
> 
> 1910.513: [GC [1 CMS-initial-mark: 2598619K(26214400K)]
> 13749939K(49807360K), 6.0696680 secs] [Times: user=6.10 sys=0.00,
> real=6.07 secs]
> 
> so there is a stop-the-world period of 6 seconds. does this sound bad
> ? or 6 seconds is OK  and we should expect the built-in
> fault-tolerance of Cassandra handle this?
> 
> Thanks
> Yang



Re: progress of sstableloader keeps 0?

2011-09-25 Thread Yan Chunlu
thanks!  another problem is what if the number of nodes is not the same?

in my case I am moving data from a 3-node cluster to 1 node, and the keyspace files on
the 3 nodes might use the same names...

I am using the new cluster only for emergency use, so only 1 node is
attached.

On Sun, Sep 25, 2011 at 5:20 PM, aaron morton wrote:

> That can read data from previous versions, i.e. if you upgrade to 0.8 it
> can read the existing files from 0.7.
>
> But what you are doing with the sstable loader is (AFAIK) only copying the
> Data portion of the CF. Once the table is loaded the node will then build
> the Index and the Filter, this is the createBuild() call in the stack. It's
> throwing because version 0.8 does not want to make version 0.8 Index and and
> Filter files for a version 0.7 Data file.
>
> We get the same problem when upgrading from 0.7 to 0.8, where Repair will
> not work because it is streaming a 0.7 version data file and the recipient
> then tries to build the Index and Filter files.
>
> So to read 0.7 data from 0.8 just copy over *all* the files for the
> keyspace (data, filter and index). Then scrub the nodes so that repair can
> work.
>
> Hope that helps.
>
>
>  -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 25/09/2011, at 6:07 PM, Yan Chunlu wrote:
>
> yes, I did.  thought 0.8 is downward compatible. is there other ways to
> load 0.7's data into 0.8?  will copy the data dir directly will work?   I
> would like to put load of three nodes into one node.
>
>  thanks!
>
> On Sun, Sep 25, 2011 at 11:52 AM, aaron morton wrote:
>
>> Looks like it is complaining that you are trying to load a 0.7 SSTable in
>> 0.8.
>>
>>
>> Cheers
>>
>>  -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 23/09/2011, at 5:23 PM, Yan Chunlu wrote:
>>
>> sorry I did not look into it  after check it I found version mismatch
>> exception is in the log:
>> ERROR [Thread-17] 2011-09-22 08:24:24,248 AbstractCassandraDaemon.java
>> (line 139) Fatal exception in thread Thread[Thread-17,5,main]
>> java.lang.RuntimeException: Cannot recover SSTable
>> /disk2/cassandra/data/reddit/Comments-tmp-f-1 due to version mismatch.
>> (current version is g).
>> at
>> org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
>> at
>> org.apache.cassandra.db.compaction.CompactionManager.submitSSTableBuild(CompactionManager.java:1097)
>> at
>> org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:110)
>> at
>> org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
>> at
>> org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
>> at
>> org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:189)
>> at
>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>
>>
>> does that mean I need to run scrub before running the loader?  could I
>> just delete it and keep going?  thanks!
>>
>> On Fri, Sep 23, 2011 at 2:16 AM, Jonathan Ellis wrote:
>>
>>> Did you check for errors in logs on both loader + target?
>>>
>>> On Thu, Sep 22, 2011 at 10:52 AM, Yan Chunlu 
>>> wrote:
>>> > I took a snapshot of one of my node in a cluster 0.7.4(N=RF=3).   use
>>> > sstableloader to load the snapshot data to another 1 node
>>> cluster(N=RF=1).
>>> >
>>> > after execute  "bin/sstableloader  /disk2/mykeyspace/"
>>> >
>>> > it says"Starting client (and waiting 30 seconds for gossip) ..."
>>> > "Streaming revelant part of  cf1.db. to [10.23.2.4]"
>>> > then showing the progress indicator and stopped. nothing changed after
>>> > then.
>>> > progress: [/10.28.53.16 1/72 (0)] [total: 0 - 0MB/s (avg: 0MB/s)]]]
>>> >
>>> > I use nodetool to check the node 10.23.2.4, nothing changed. no data
>>> copied
>>> > to it. and the data dir also keep its original size. is there anything
>>> > wrong? how can I tell what was going on there?
>>> > thanks!
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>>
>>
>>
>
>


Re: CMS GC initial-mark taking 6 seconds , bad?

2011-09-25 Thread Peter Schuller
> I see the following in my GC log
>
> 1910.513: [GC [1 CMS-initial-mark: 2598619K(26214400K)]
> 13749939K(49807360K), 6.0696680 secs] [Times: user=6.10 sys=0.00,
> real=6.07 secs]
>
> so there is a stop-the-world period of 6 seconds. does this sound bad
> ? or 6 seconds is OK  and we should expect the built-in
> fault-tolerance of Cassandra handle this?

initial-mark pauses are stop-the-world, so a 6 second initial-mark
would have paused the node for those 6 seconds.

The initial mark is essentially marking roots for old-gen; that should
include thread stacks and such, but will also include younger
generations. You might read [1] which talks a bit about it; a
recommendation there is to make sure that initial marks happen right
after a young-gen collection, and they advise increasing heap size
sufficiently to allow an initial mark to trigger (I suppose by
heuristics) after the young gen collection, prior to the CMS trigger.
It makes sense, especially given that initial-mark is single-threaded,
to try to do that (and leave the young-gen smaller, collected by the
parallel collector). However I'm not entirely clear on what VM options
are required for this. I had a brief look at the code but it wasn't
obvious at cursory glance under what circumstances an initial mark is
triggered right after young-gen vs. not. In your case you clearly have
enough heap.

Can you correlate with ParNew collections and see if the initial mark
pauses seem to happen immediately after a ParNew, or somewhere in
between, in the cases where they take this long?

Also, as a mitigation: What's your young generation size? One way to
mitigate the problem, if it is indeed the young gen marking that is
taking time, is to decrease the size of the young generation to leave
less work for initial marking. Normally the young gen is sized based
on expected pause times given parallel ParNew collections, but if a
non-parallel initial-mark is having to do marking of the same contents
the pause time could be higher (hence the discussion above).

Also, is each initial mark this long, or is that something that
happens once in a while?

As for Cassandra dealing with it: It is definitely not a good thing to
have 6 second pauses. Even with all other nodes up, it takes time for
the dynamic snitch to realize what's going on and you will tend to see
a subset of requests to the cluster get 'stuck' in circumstances like
that. Also, if you're e.g. doing QUORUM at RF=3, if a node is down for
legitimate reasons, another node having a 6 second pause will by
necessity cause high latency for requests during that period.
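
As a rough sketch of the knobs discussed above, something like this in
cassandra-env.sh (the sizes are illustrative only; the flags are the standard
Sun/Oracle JVM ones):

    # separate GC log so it is not lost in stdout, with stop times included
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    # a smaller young gen leaves less for the single-threaded initial mark to scan
    JVM_OPTS="$JVM_OPTS -Xmn800M"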

[1] http://answerpot.com/showthread.php?1558705-CMS+initial+mark+pauses


-- 
/ Peter Schuller (@scode on twitter)


Re: frequent node UP/Down?

2011-09-25 Thread Radim Kolar

On 25.9.2011 at 9:29, Philippe wrote:
I have this happening on 0.8.x It looks to me as this happens when the 
node is under heavy load such as unthrottled compactions or a huge GC.
I have this problem too. Node down detection must be improved - 
increase the timeouts a bit or make more tries before making a decision. If a 
node is under load (especially if there is swap activity), it is often 
marked unavailable.


Re: Token != DecoratedKey assertion

2011-09-25 Thread Jonathan Ellis
Assertion errors are bugs, so that should worry you.

However, I'd upgrade before filing a ticket.  There were a lot of
fixes in 0.8.5.

On Sun, Sep 25, 2011 at 2:27 AM, Philippe  wrote:
> Hello,
> I've seen a couple of these in my logs, running 0.8.4.
> This is a RF=3, 3-node cluster. 2 nodes including this one are on 0.8.4 and
> one is on 0.8.5
>
> The node is still functionning hours later. Should I be worried ?
>
> Thanks
>
> ERROR [ReadStage:94911] 2011-09-24 22:40:30,043 AbstractCassandraDaemon.java
> (line 134) Fatal exception in thread Thread[ReadStage:94911,5,main]
> java.lang.AssertionError:
> DecoratedKey(Token(bytes[224ceb80b5fb11e0848783ceb9bf0002ff33]),
> 224ceb80b5fb11e0848783ceb9bf0002ff33) !=
> DecoratedKey(Token(bytes[038453154cb0005f14]), 038453154cb0005f14)
> in /var/lib/cassandra/data/X/PUBLIC_MONTHLY_20-g-10634-Data.db
>     at
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.(SSTableSliceIterator.java:59)
>     at
> org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
>     at
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1315)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.cacheRow(ColumnFamilyStore.java:1182)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1222)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
>     at org.apache.cassandra.db.Table.getRow(Table.java:385)
>     at
> org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:58)
>     at
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:642)
>     at
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1107)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> ERROR [ReadStage:94936] 2011-09-24 22:40:30,042 AbstractCassandraDaemon.java
> (line 134) Fatal exception in thread Thread[ReadStage:94936,5,main]
> java.lang.AssertionError: DecoratedKey(Token(bytes[]), ) !=
> DecoratedKey(Token(bytes[038453154c90005f14]), 038453154c90005f14)
> in /var/lib/cassandra/data/X/PUBLIC_MONTHLY_20-g-10634-Data.db
>     at
> org.apache.cassandra.db.columniterator.SSTableSliceIterator.(SSTableSliceIterator.java:59)
>     at
> org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:66)
>     at
> org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1315)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.cacheRow(ColumnFamilyStore.java:1182)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1222)
>     at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1169)
>     at org.apache.cassandra.db.Table.getRow(Table.java:385)
>     at
> org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:58)
>     at
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:642)
>     at
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1107)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> ERROR [ReadStage:94713] 2011-09-24 22:40:30,041 AbstractCassandraDaemon.java
> (line 134) Fatal exception in thread Thread[ReadStage:94713,5,main]
> java.lang.AssertionError:
> DecoratedKey(Token(bytes[7c4831fe0001ffaa000c6c697665626f782d6265306580008002000700031195010481327e62362a002400019c0dc550c60111e001687c4831fe000100010007000311950100010481327e62362a002400019c0dc550c60111e001687c4831fe0001ffbc6c697665626f782d6631326380008002000700031195010481327e62362a002400019c0dc550c60111e001687c4831fe000100010007000311950100010481327e62362a002400019c0dc550c60111e001687c4831fe0001ffab000b062a27f9e35f1300]),
> 7c4831fe0001ffaa000c6c697665626f782d62653

Re: messages stopped for 3 minutes?

2011-09-25 Thread Jonathan Ellis
What makes you think the problem is on the receiving node, rather than
the sending node?

On Sun, Sep 25, 2011 at 1:19 AM, Yang  wrote:
> I constantly see TimedOutException , then followed by
> UnavailableException in my logs,
> so I added some extra debugging to Gossiper. notifyFailureDetector()
>
>
>
>    void notifyFailureDetector(InetAddress endpoint, EndpointState
> remoteEndpointState)
>    {
>        IFailureDetector fd = FailureDetector.instance;
>        EndpointState localEndpointState = endpointStateMap.get(endpoint);
>        logger.debug("notify failure detector");
>        /*
>         * If the local endpoint state exists then report to the FD only
>         * if the versions workout.
>        */
>        if ( localEndpointState != null )
>        {
>                logger.debug("notify failure detector, endpoint");
>            int localGeneration =
> localEndpointState.getHeartBeatState().getGeneration();
>            int remoteGeneration =
> remoteEndpointState.getHeartBeatState().getGeneration();
>            if ( remoteGeneration > localGeneration )
>            {
>                localEndpointState.updateTimestamp();
>                logger.debug("notify failure detector --- report 1");
>                fd.report(endpoint);
>                return;
>            }
>
>
>
>
> then I found that this method stopped being called for a period of 3
> minutes, so of course the detector considers the other side to be
> dead.
>
> but since these 2 boxes are in the same EC2 region, same security
> group, there is no reason there is a network issue that long. so I
> ran a background job that just does
>
> echo | nc $the_other_box 7000   in a loop
>
> and this always works fine, without failing to contact the 7000 port.
>
>
> so somehow the messages were not delivered or received, how could I debug 
> this?
> (extra logging attached)
>
> Thanks
> Yang
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: frequent node UP/Down?

2011-09-25 Thread Radim Kolar

On 25.9.2011 at 14:31, Radim Kolar wrote:

On 25.9.2011 at 9:29, Philippe wrote:
I have this happening on 0.8.x It looks to me as this happens when 
the node is under heavy load such as unthrottled compactions or a 
huge GC.
i have this problem too. Node down detection must be improved - 
increased timeouts a bit or make more tries before making decision. If 
node is under load (especially if there is swap activity), it is often 
marked unavailable.
Also, an algorithm like the one used in the BGP routing protocol should be 
implemented to prevent route flapping. It should guard against cases 
like this:


  INFO [GossipTasks:1] 2011-09-25 14:56:36,544 Gossiper.java (line 695) 
InetAddress /216.17.99.40 is now dead.
 INFO [GossipStage:1] 2011-09-25 14:56:36,641 Gossiper.java (line 681) 
InetAddress /216.17.99.40 is now UP
 INFO [GossipTasks:1] 2011-09-25 14:56:37,823 Gossiper.java (line 695) 
InetAddress /216.17.99.40 is now dead.
 INFO [GossipStage:1] 2011-09-25 14:56:37,971 Gossiper.java (line 681) 
InetAddress /216.17.99.40 is now UP


Route flap protection works like this: announce the first state change to the 
peer immediately; if the state changes again in less than (say) 30 seconds, 
announce the next change only after 30 seconds; if the route keeps flapping 
up/down, increase the report interval to 60 seconds, and so on.
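
A rough sketch of that dampening idea (illustrative only, not Cassandra code;
the class and the constants are made up):

    public class FlapDampener
    {
        private static final long STEP_MS = 30000;   // first hold-down window
        private static final long MAX_MS  = 300000;  // cap on the back-off

        private long holdDownMs = 0;       // current suppression window
        private long lastAnnouncedMs = 0;  // when we last announced a change

        /** Returns true if this UP/DOWN transition should be announced now. */
        public synchronized boolean shouldAnnounce(long nowMs)
        {
            if (nowMs - lastAnnouncedMs < holdDownMs)
                return false;              // still inside the hold-down window: swallow the flap

            // changes arriving faster than STEP_MS apart escalate the window;
            // a quiet period resets it
            holdDownMs = (nowMs - lastAnnouncedMs < STEP_MS)
                         ? Math.min(holdDownMs + STEP_MS, MAX_MS)
                         : 0;
            lastAnnouncedMs = nowMs;
            return true;
        }
    }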


Re: messages stopped for 3 minutes?

2011-09-25 Thread Yang
thanks Jonathan,


I really don't know. I just did further tests to catch jstacks on
the receiving side overnight, and I'm going through these stacks
now.  If I can't find anything suspicious, I'll add this debugging to
the sending side too.

another useful piece of info: when I did a single-node setup, I also
found a lot of TimedOutException, similar to what I see with the
2-node setup. I think I didn't see the UnavailableException, probably
because it's just a single node, and the node always believes itself
to be available.

I think GC is not the culprit here, since I don't see any lengthy
GC logging around when the delay is happening. No compaction/flushing
either.



On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis  wrote:
> What makes you think the problem is on the receiving node, rather than
> the sending node?
>
> On Sun, Sep 25, 2011 at 1:19 AM, Yang  wrote:
>> I constantly see TimedOutException , then followed by
>> UnavailableException in my logs,
>> so I added some extra debugging to Gossiper. notifyFailureDetector()
>>
>>
>>
>>    void notifyFailureDetector(InetAddress endpoint, EndpointState
>> remoteEndpointState)
>>    {
>>        IFailureDetector fd = FailureDetector.instance;
>>        EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>        logger.debug("notify failure detector");
>>        /*
>>         * If the local endpoint state exists then report to the FD only
>>         * if the versions workout.
>>        */
>>        if ( localEndpointState != null )
>>        {
>>                logger.debug("notify failure detector, endpoint");
>>            int localGeneration =
>> localEndpointState.getHeartBeatState().getGeneration();
>>            int remoteGeneration =
>> remoteEndpointState.getHeartBeatState().getGeneration();
>>            if ( remoteGeneration > localGeneration )
>>            {
>>                localEndpointState.updateTimestamp();
>>                logger.debug("notify failure detector --- report 1");
>>                fd.report(endpoint);
>>                return;
>>            }
>>
>>
>>
>>
>> then I found that this method stopped being called for a period of 3
>> minutes, so of course the detector considers the other side to be
>> dead.
>>
>> but since these 2 boxes are in the same EC2 region, same security
>> group, there is no reason there is a network issue that long. so I
>> ran a background job that just does
>>
>> echo | nc $the_other_box 7000   in a loop
>>
>> and this always works fine, without failing to contact the 7000 port.
>>
>>
>> so somehow the messages were not delivered or received, how could I debug 
>> this?
>> (extra logging attached)
>>
>> Thanks
>> Yang
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: frequent node UP/Down?

2011-09-25 Thread Brandon Williams
On Sat, Sep 24, 2011 at 4:54 PM, Yang  wrote:
> I'm using 1.0.0
>
>
> there seems to be too many node Up/Dead events detected by the failure
> detector.
> I'm using  a 2 node cluster on EC2, in the same region, same security
> group, so I assume the message drop
> rate should be fairly low.
> but in about every 5 minutes, I'm seeing some node detected as down,
> and then Up again quickly

This is fairly common on ec2 due to wild variance in the network.
Increase your phi_convict_threshold to 10 or higher (but I wouldn't go
over 12, this is roughly an exponential increase)
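
That is a one-line change in cassandra.yaml on each node (8 is the default,
from memory):

    # how sure the failure detector must be before convicting a node;
    # higher values tolerate bigger latency spikes before marking it down
    phi_convict_threshold: 10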

-Brandon


Re: CMS GC initial-mark taking 6 seconds , bad?

2011-09-25 Thread Yang
Thanks Peter and Aaron.


right now I have too much logging so the CMS logging gets flushed out
(somehow it does not appear in system.log, only on stdout). I'll
keep an eye on the correlation with ParNew as I get more logging.

Yang

On Sun, Sep 25, 2011 at 3:59 AM, Peter Schuller
 wrote:
>> I see the following in my GC log
>>
>> 1910.513: [GC [1 CMS-initial-mark: 2598619K(26214400K)]
>> 13749939K(49807360K), 6.0696680 secs] [Times: user=6.10 sys=0.00,
>> real=6.07 secs]
>>
>> so there is a stop-the-world period of 6 seconds. does this sound bad
>> ? or 6 seconds is OK  and we should expect the built-in
>> fault-tolerance of Cassandra handle this?
>
> initial-mark pauses are stop-the-world, so a 6 second initial-mark
> would have paused the node for those 6 seconds.
>
> The initial mark is essentially marking roots for old-gen; that should
> include thread stacks and such, but will also include younger
> generations. You might read [1] which talks a bit about it; a
> recommendation there is to make sure that initial marks happen right
> after a young-gen collection, and they advise increasing heap size
> sufficiently to allow an ininitial mark to trigger (I suppose by
> heuristics) after the young gen collection, prior to the CMS trigger.
> It makes sense, especially given that initial-mark is single-threaded,
> to try do to that (and leave the young-gen smaller, collected by the
> parallel collector). However I'm not entirely clear on what VM options
> are required for this. I had a brief look at the code but it wasn't
> obvious at cursory glance under what circumstances an initial mark is
> triggered right after young-gen vs. not. In your case you clearly have
> enough heap.
>
> Can you correlate with ParNew collections and see if the initial mark
> pauses seem to happen immediately after a ParNew, or somewhere in
> between, in the cases where they take this long?
>
> Also, as a mitigationg: What's your young generation size? One way to
> mitigate the problem, if it is indeed the young gen marking that is
> taking time, is to decrease the size of the young generation to leave
> less work for initial marking. Normally the young gen is sized based
> on expected pause times given parallel ParNew ollections, but if a
> non-parallel initial-mark is having to do marking of the same contents
> the pause time could be higher (hence the discussion above).
>
> Also, is each initial mark this long, or is that something that
> happens once in a while?
>
> As for Cassandra dealing with it: It is definitely not a good thing to
> have 6 second pauses. Even with all other nodes up, it takes time for
> the dynamic snitch to realize what's going on and you will tend to see
> a subset of requests to the cluster get 'stuck' in circumstances like
> that. Also, if you're e.g. doing QUORUM at RF=3, if a node is down for
> legitimate reasons, another node having a 6 second pause will by
> necessity cause high latency for requests during that period.
>
> [1] http://answerpot.com/showthread.php?1558705-CMS+initial+mark+pauses
>
>
> --
> / Peter Schuller (@scode on twitter)
>


Re: frequent node UP/Down?

2011-09-25 Thread Yang
Thanks Brandon.

I suspected that, but I think that's precluded as a possibility since
I set up another background job to run
echo | nc other_box 7000
in a loop.
This job seems to be working fine all the time, so the network seems fine.

Yang

On Sun, Sep 25, 2011 at 10:39 AM, Brandon Williams  wrote:
> On Sat, Sep 24, 2011 at 4:54 PM, Yang  wrote:
>> I'm using 1.0.0
>>
>>
>> there seems to be too many node Up/Dead events detected by the failure
>> detector.
>> I'm using  a 2 node cluster on EC2, in the same region, same security
>> group, so I assume the message drop
>> rate should be fairly low.
>> but in about every 5 minutes, I'm seeing some node detected as down,
>> and then Up again quickly
>
> This is fairly common on ec2 due to wild variance in the network.
> Increase your phi_convict_threshold to 10 or higher (but I wouldn't go
> over 12, this is roughly an exponential increase)
>
> -Brandon
>


Re: frequent node UP/Down?

2011-09-25 Thread Brandon Williams
On Sun, Sep 25, 2011 at 12:52 PM, Yang  wrote:
> Thanks Brandon.
>
> I suspected that, but I think that's precluded as a possibility since
> I setup another background job to do
> echo | nc other_box 7000
> in a loop,
> this job seems to be working fine all the time, so network seems fine.

This isn't measuring latency, however.  That is how the failure
detector works, using probability to estimate the likelihood that a
given host is alive, based on previous history.  The situation on ec2
is something like the following: 99% of pings are 1ms, but sometimes
there are brief periods of 100ms, and this is where the FD says "this
is not realistic, I think the host is dead" but then receives the
ping, and thus the flapping.  I've seen it a million times, increasing
the phi threshold always solves it.
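
Roughly, and glossing over the real implementation, the conviction math looks
something like this (a toy sketch, not the actual Cassandra code):

    // Assume heartbeat inter-arrival times are roughly exponential with the
    // observed mean; phi is -log10 of the chance a live node would still be
    // silent after this long.
    public final class PhiSketch
    {
        static double phi(long nowMs, long lastHeartbeatMs, double meanIntervalMs)
        {
            double silence = nowMs - lastHeartbeatMs;
            return -Math.log10(Math.exp(-silence / meanIntervalMs)); // == silence / (mean * ln 10)
        }
    }

In this toy model, with heartbeats about a second apart, a threshold of 8 works
out to somewhere around 18 seconds of silence; raising the threshold simply
demands a longer silence before a node is convicted.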

-Brandon


Re: frequent node UP/Down?

2011-09-25 Thread Yang
Thanks Brandon.

I'll try this.

but you can also see my later post regarding message drop :
http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3ccaanh3_8aehidyh9ybt82_emh3likbcdsenrak3jhfzaj2l+...@mail.gmail.com%3E

that seems to show something in either code or background load causing
messages to be really dropped


Yang

On Sun, Sep 25, 2011 at 10:59 AM, Brandon Williams  wrote:
> On Sun, Sep 25, 2011 at 12:52 PM, Yang  wrote:
>> Thanks Brandon.
>>
>> I suspected that, but I think that's precluded as a possibility since
>> I setup another background job to do
>> echo | nc other_box 7000
>> in a loop,
>> this job seems to be working fine all the time, so network seems fine.
>
> This isn't measuring latency, however.  That is how the failure
> detector works, using probability to estimate the likelihood that a
> given host is alive, based on previous history.  The situation on ec2
> is something like the following: 99% of pings are 1ms, but sometimes
> there are brief periods of 100ms, and this is where the FD says "this
> is not realistic, I think the host is dead" but then receives the
> ping, and thus the flapping.  I've seen it a million times, increasing
> the phi threshold always solves it.
>
> -Brandon
>


Re: frequent node UP/Down?

2011-09-25 Thread Brandon Williams
On Sun, Sep 25, 2011 at 1:10 PM, Yang  wrote:
> Thanks Brandon.
>
> I'll try this.
>
> but you can also see my later post regarding message drop :
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3ccaanh3_8aehidyh9ybt82_emh3likbcdsenrak3jhfzaj2l+...@mail.gmail.com%3E
>
> that seems to show something in either code or background load causing
> messages to be really dropped

I see.  My guess is then this: there is a local clock problem, causing
generations to be the same, thus not notifying the FD.  So perhaps the
problem is not network-related, but it is something in the ec2
environment.

-Brandon


Re: adding node without bootstrap

2011-09-25 Thread aaron morton
That message will be logged if the RF on the keyspace is 1 or if the other 
nodes are not up. 

What's the RF? 

You should also sort out the tokens before going too far. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 7:35 PM, Radim Kolar wrote:

> 
>> If you join a node with auto_bootstrap=false you had better be working at 
>> quorum or higher to avoid stale/not found reads. You should then repair the 
>> node right away to get all the missing data back on the node. This is not 
>> suggested. It is best to leave auto_boostrap=true and let Cassandra handle 
>> this on the front end.
> This do not works. I joined ring with node without bootstrap and result is 
> like this:
> 
> 216.17.99.40datacenter1 rack1   Up Normal  1.17 GB 99.64% 
>  83030609119105147711596238577753588267
> 64.6.104.18 datacenter1 rack1   Up Normal  43.15 KB0.36%  
>  83648735508289295779178617154261005054
> 
> Well, this was expected. But running repair on both nodes didnt do anything:
> 
> INFO [GossipStage:1] 2011-09-25 08:18:34,287 Gossiper.java (line 715) Node 
> /216.17.99.40 is now part of the cluster
> INFO [GossipStage:1] 2011-09-25 08:18:34,287 Gossiper.java (line 681) 
> InetAddress /216.17.99.40 is now UP
> INFO [AntiEntropySessions:1] 2011-09-25 08:22:16,066 AntiEntropyService.java 
> (line 648) No neighbors to repair with for test on 
> (83030609119105147711596238577753588267,83648735508289295779178617154261005054]:
>  manual-repair-04dd27f0-401b-4452-b0eb-853beeda197b completed.
> 
> Data are not moved to new node. Maybe tokens are not too random. I deleted 
> new node and retried:
> 
> 64.6.104.18 datacenter1 rack1   Up Normal  45.52 KB56.94% 
>  9762979552315026283322466206354139578
> 216.17.99.40datacenter1 rack1   Up Normal  1.17 GB 43.06% 
>  83030609119105147711596238577753588267
> 
> and still nothing, while running repair on both nodes.
> 
> INFO [AntiEntropySessions:1] 2011-09-25 08:29:13,447 AntiEntropyService.java 
> (line 648) No neighbors to repair with for test on 
> (83030609119105147711596238577753588267,9762979552315026283322466206354139578]:
>  manual-repair-87bfcc67-2b99-4285-8571-e5bd168ef5e0 completed.
> 
> Can you try this too? i cant get scenario: make 1 node - add data, add second 
> node without bootstrap then repair on both work.



Re: Seed vs non-seed in YAML

2011-09-25 Thread aaron morton
Seeds will not auto-bootstrap themselves when you add them to  the cluster. 

Normal approach is to have 2 or 3 per DC. 

You may also be interested in how Gossip uses the seed list 
http://wiki.apache.org/cassandra/ArchitectureGossip
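
For what it's worth, the relevant piece of cassandra.yaml is just the seed 
list (the exact layout varies a little between versions); something like:

    # two or three stable nodes per DC is plenty; seeds will not auto-bootstrap
    seeds:
        - 10.0.0.1
        - 10.0.0.2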

cheers

 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 8:28 PM, Philippe wrote:

> Hello,
> 
> I'm deploying my cluster with Puppet so it's actually easier for me to add 
> all cassandra nodes to the seed list in the YAML file than to choose a few.
> Would there be any reason NOT to do this ?
> 
> Thanks



Re: progress of sstableloader keeps 0?

2011-09-25 Thread aaron morton
If you had RF 3 in a 3-node cluster and everything was repaired, you *should* be 
ok taking the data from only 1 node, provided the cluster is not receiving writes. 

If you want to merge the data from all 3 nodes, rename the files; AFAIK they do not 
have to have contiguous file numbers. 
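
A sketch of the renaming (hypothetical paths; the point is just to bump the
generation number in the file name so sstables coming from different nodes do
not collide on the target):

    # keep node1's files as they are; shift node2's generation 1 up to 101, etc.
    for f in node2-snapshot/Comments-f-1-*; do
      cp "$f" "/var/lib/cassandra/data/reddit/$(basename "$f" | sed 's/-f-1-/-f-101-/')"
    done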

Cheers


-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 25/09/2011, at 10:45 PM, Yan Chunlu wrote:

> thanks!  another problem is what if cluster number are not the same?
> 
> in my case I am move 3 nodes cluster data to 1 node,  the keyspace files in 3 
> nodes might use the same name...
> 
> I am using the new cluster only for emergency usage, so only 1 node is 
> attached.
> 
> On Sun, Sep 25, 2011 at 5:20 PM, aaron morton  wrote:
> That can read data from previous versions, i.e. if you upgrade to 0.8 it can 
> read the existing files from 0.7. 
> 
> But what you are doing with the sstable loader is (AFAIK) only copying the 
> Data portion of the CF. Once the table is loaded the node will then build the 
> Index and the Filter, this is the createBuild() call in the stack. It's 
> throwing because version 0.8 does not want to make version 0.8 Index and and 
> Filter files for a version 0.7 Data file. 
> 
> We get the same problem when upgrading from 0.7 to 0.8, where Repair will not 
> work because it is streaming a 0.7 version data file and the recipient then 
> tries to build the Index and Filter files. 
> 
> So to read 0.7 data from 0.8 just copy over *all* the files for the keyspace 
> (data, filter and index). Then scrub the nodes so that repair can work. 
> 
> Hope that helps. 
> 
>  
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 25/09/2011, at 6:07 PM, Yan Chunlu wrote:
> 
>> yes, I did.  thought 0.8 is downward compatible. is there other ways to load 
>> 0.7's data into 0.8?  will copy the data dir directly will work?   I would 
>> like to put load of three nodes into one node.
>> 
>>  thanks!
>> 
>> On Sun, Sep 25, 2011 at 11:52 AM, aaron morton  
>> wrote:
>> Looks like it is complaining that you are trying to load a 0.7 SSTable in 
>> 0.8. 
>> 
>> 
>> Cheers
>> 
>> -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 23/09/2011, at 5:23 PM, Yan Chunlu wrote:
>> 
>>> sorry I did not look into it  after check it I found version mismatch 
>>> exception is in the log:
>>> ERROR [Thread-17] 2011-09-22 08:24:24,248 AbstractCassandraDaemon.java 
>>> (line 139) Fatal exception in thread Thread[Thread-17,5,main]
>>> java.lang.RuntimeException: Cannot recover SSTable 
>>> /disk2/cassandra/data/reddit/Comments-tmp-f-1 due to version mismatch. 
>>> (current version is g).
>>> at 
>>> org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
>>> at 
>>> org.apache.cassandra.db.compaction.CompactionManager.submitSSTableBuild(CompactionManager.java:1097)
>>> at 
>>> org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:110)
>>> at 
>>> org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
>>> at 
>>> org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
>>> at 
>>> org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:189)
>>> at 
>>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>> 
>>> 
>>> does that mean I need to run scrub before running the loader?  could I just 
>>> delete it and keep going?  thanks!
>>> 
>>> On Fri, Sep 23, 2011 at 2:16 AM, Jonathan Ellis  wrote:
>>> Did you check for errors in logs on both loader + target?
>>> 
>>> On Thu, Sep 22, 2011 at 10:52 AM, Yan Chunlu  wrote:
>>> > I took a snapshot of one of my node in a cluster 0.7.4(N=RF=3).   use
>>> > sstableloader to load the snapshot data to another 1 node cluster(N=RF=1).
>>> >
>>> > after execute  "bin/sstableloader  /disk2/mykeyspace/"
>>> >
>>> > it says"Starting client (and waiting 30 seconds for gossip) ..."
>>> > "Streaming revelant part of  cf1.db. to [10.23.2.4]"
>>> > then showing the progress indicator and stopped. nothing changed after
>>> > then.
>>> > progress: [/10.28.53.16 1/72 (0)] [total: 0 - 0MB/s (avg: 0MB/s)]]]
>>> >
>>> > I use nodetool to check the node 10.23.2.4, nothing changed. no data 
>>> > copied
>>> > to it. and the data dir also keep its original size. is there anything
>>> > wrong? how can I tell what was going on there?
>>> > thanks!
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>> 
>> 
>> 
> 
> 



Re: adding node without bootstrap

2011-09-25 Thread Radim Kolar

On 25.9.2011 at 22:40, aaron morton wrote:

That message will be logged if there RF on the keyspace is 1 or if the other 
nodes are not up.
What's the RF ?

rf is 1.


Re: adding node without bootstrap

2011-09-25 Thread aaron morton
Then there is nothing to repair. 

Set a better token, use cassandra-cli to increase the RF to 2, and then kick off 
repair. 
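
Roughly (the keyspace here is "test", as in the log above; the CLI syntax for
strategy_options differs slightly between versions):

    [default@unknown] update keyspace test with strategy_options = [{replication_factor:2}];

    $ nodetool -h 64.6.104.18 repair test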

A
 
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26/09/2011, at 10:12 AM, Radim Kolar wrote:

> On 25.9.2011 at 22:40, aaron morton wrote:
>> That message will be logged if there RF on the keyspace is 1 or if the other 
>> nodes are not up.
>> What's the RF ?
> rf is 1.



Surgecon Meetup?

2011-09-25 Thread Chris Burroughs
Surge [1] is a scalability-focused conference in late September hosted in
Baltimore.  It's a pretty cool conference with a good mix of
operationally minded people interested in scalability, distributed
systems, systems-level performance and good stuff like that.  You should
go! [2]

Anyway, I'll be there, and if any other Cassandra users are
coming I'm happy to help herd us towards meeting up, lunch, hacking,
etc.  I *think* there might be some time for structured BoF type
sessions as well.


[1] http://omniti.com/surge/2011

[2] Actually tickets recently sold out, you should go in 2012!


Re: Token != DecoratedKey assertion

2011-09-25 Thread Philippe
Just did.
Could there be data corruption or will repairs do this?

Thanks
On 25 Sept 2011 at 15:30, "Jonathan Ellis"  wrote:
> Assertion errors are bugs, so that should worry you.
>
> However, I'd upgrade before filing a ticket. There were a lot of
> fixes in 0.8.5.
>
> On Sun, Sep 25, 2011 at 2:27 AM, Philippe  wrote:
>> Hello,
>> I've seen a couple of these in my logs, running 0.8.4.
>> This is a RF=3, 3-node cluster. 2 nodes including this one are on 0.8.4
and
>> one is on 0.8.5
>>
>> The node is still functionning hours later. Should I be worried ?
>>
>> Thanks
>>

Re: Possibility of going OOM using get_count

2011-09-25 Thread Boris Yen
Hi Aaron,

Thanks for the explanation. I know the performance will vary when the
offset is a very large number, just like what has been mentioned
on CASSANDRA-261. Even if users implement the offset on the client side,
they suffer the same issues. I just think it would be nice if Cassandra could
provide this function internally; of course this function would have its
limitations, just like any other function Cassandra has, Counters for
example.

In CASSANDRA-261, it seems Cassandra had the offset function; however, due
to some RR issues it was removed in CASSANDRA-286. I think the reason
why CASSANDRA-261 had the RR issue is that it changed the internal
mechanism in order to provide the offset function. CASSANDRA-2894, on the
other hand, only changes code in "CassandraServer", so it should not have the
same issue as CASSANDRA-261. Therefore, I was wondering if you could
reconsider putting the offset function back into Cassandra. It would be really
helpful for many users.

Regards
Boris

On Sun, Sep 25, 2011 at 12:21 PM, aaron morton wrote:

> The changes in get_count() are designed to stop counts for very large rows
> running out of memory as they try to hold millions of columns in memory.
>
> So if you ask to count all the cols in a row with 1M cols, it will (by
> default) read the first 1024 columns, and then the next 1024 using the last
> column read as the first column for the next page.
>
> The important part is that it is actually reading the columns. Tombstones
> mean we do not know if a column should be a member of the result set for a
> query until it is read and reconciled with all the other versions of a
> column. E.g. 3 sstables each have a value for a column; if one is a
> tombstone then the column may or may not be deleted. We do not know until
> all 3 column versions are reconciled.
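
A conceptual sketch of that reconciliation, assuming nothing beyond what the
paragraph above describes; the ColumnVersion class and isLive method are
invented for illustration and are not Cassandra's internal API. The version
with the highest timestamp wins, and only after comparing every version do we
know whether the column is live:

    // Conceptual sketch only, not Cassandra's internals: reconcile several versions
    // of the same column (e.g. one per sstable), where any version may be a tombstone.
    import java.util.Arrays;
    import java.util.List;

    final class ColumnVersion
    {
        final long timestamp;     // write timestamp supplied by the client
        final boolean tombstone;  // true if this version is a delete marker

        ColumnVersion(long timestamp, boolean tombstone)
        {
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }
    }

    final class Reconciler
    {
        // The latest write wins, whether it is a value or a delete; only after
        // looking at every version do we know if the column belongs in the result.
        static boolean isLive(List<ColumnVersion> versions)
        {
            ColumnVersion winner = null;
            for (ColumnVersion v : versions)
                if (winner == null || v.timestamp > winner.timestamp)
                    winner = v;
            return winner != null && !winner.tombstone;
        }

        public static void main(String[] args)
        {
            // two live values and one newer tombstone: the column is deleted
            List<ColumnVersion> versions = Arrays.asList(
                new ColumnVersion(10, false),
                new ColumnVersion(12, false),
                new ColumnVersion(15, true));
            System.out.println(isLive(versions)); // prints false
        }
    }
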
>
> get_count() is like get_slice() but we do not return the columns, just the
> count of them. Counting 1M columns still takes a long time. And finding the
> 999,980th column will also take a long time, but if you know the name of the
> 999,980th column it will be mucho faster.
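
A minimal sketch of that count-by-paging pattern as a client could write it
against the 0.8 Thrift API; the page size, host, keyspace, column family and
row key below are assumptions for the example, and error handling is omitted:

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class PagedCount
    {
        private static final int PAGE_SIZE = 1024; // same default page size as CASSANDRA-2894

        static int countColumns(Cassandra.Client client, ByteBuffer key, String cf) throws Exception
        {
            ColumnParent parent = new ColumnParent(cf);
            ByteBuffer start = ByteBuffer.wrap(new byte[0]);  // empty start = beginning of the row
            ByteBuffer finish = ByteBuffer.wrap(new byte[0]);
            int total = 0;
            boolean firstPage = true;

            while (true)
            {
                SlicePredicate predicate = new SlicePredicate();
                predicate.setSlice_range(new SliceRange(start, finish, false, PAGE_SIZE));
                List<ColumnOrSuperColumn> page = client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);
                if (page.isEmpty())
                    break;

                // every page after the first re-returns its start column, so don't count it twice
                total += firstPage ? page.size() : page.size() - 1;
                firstPage = false;

                if (page.size() < PAGE_SIZE)
                    break; // a short page means we reached the end of the row

                // the last column read becomes the start of the next page
                start = page.get(page.size() - 1).column.name;
            }
            return total;
        }

        public static void main(String[] args) throws Exception
        {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Keyspace1"); // assumed keyspace, CF and row key
            System.out.println(countColumns(client, ByteBuffer.wrap("rowkey".getBytes()), "Standard1"));
            transport.close();
        }
    }

Using the last column read as the start of the next page lets each page begin
with a seek rather than a scan, which is why a known start column is so much
cheaper than a numeric offset.
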
>
> Some experiments I did a while ago on query plans
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ - cass 1.0 will
> probably invalidate this.
>
> Cheers
>
>
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/09/2011, at 6:01 PM, Boris Yen wrote:
>
>
>
> On Fri, Sep 23, 2011 at 12:28 PM, aaron morton wrote:
>
>> Offsets have been discussed in previously. IIRC the main concerns were
>> either:
>>
>> There is no way to reliably count to start the offset, i.e. we do not lock
>> the row
>>
>
> In the new get_count function, cassandra does the internal paging in order
> to get the total count. Without locking the row,  the count could still be
> unreliable (someone might be deleting some columns while cassandra is
> counting the columns).
>
>
>>
>> Or performance related, as there is not a reliable way to skip 10,000
>> columns other than counting 10,000 columns. With a start col we can search.
>>
>>
> I am just curious: basically, "skip 10,000 columns to get the start column"
> can be done the same way Cassandra does it for the new get_count function (internal
> paging). I just cannot think of a reason why it is doable for get_count but
> cannot be done for the offset.
>
> I know the result might not be reliable and the performance might vary
> depending on the offset, but if Cassandra can use internal paging to get the
> count, it should be able to apply the same method to get the start column
> for the offset.
>
>
>> Cheers
>>
>>  -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 22/09/2011, at 8:50 PM, Boris Yen wrote:
>>
>> I was wondering if it is possible to use a similar approach to
>> CASSANDRA-2894 to
>> have the slice_predicate support the offset concept? With an offset, it would
>> be much easier to implement paging from the client side.
>>
>> Boris
>>
>> On Mon, Sep 19, 2011 at 9:45 PM, Jonathan Ellis wrote:
>>
>>> Unfortunately no, because you don't know what the actual
>>> last-column-counted was.
>>>
>>> On Mon, Sep 19, 2011 at 4:25 AM, aaron morton 
>>> wrote:
>>> > get_count() supports the same predicate as get_slice. So you can
>>> implement
>>> > the paging yourself.
>>> > Cheers
>>> > -
>>> > Aaron Morton
>>> > Freelance Cassandra Developer
>>> > @aaronmorton
>>> > http://www.thelastpickle.com
>>> > On 19/09/2011, at 8:45 PM, Tharindu Mathew wrote:
>>> >
>>> >
>>> > On Mon, Sep 19, 2011 at 12:40 PM, Benoit Perroud 
>>> wrote:
>>> >>
>>> >> The workaround for 0.7 is calling get_slice and counting on the client side.
>>> >> It's heavier, sure, but you will then be able to set start column
>>> >> accordingly.
>>> >
>>> > I was afraid of that :(
>>> 

Re: progress of sstableloader keeps 0?

2011-09-25 Thread Yan Chunlu
thank you very much aaron. your explanation  is clear enough and very
helpful!

On Mon, Sep 26, 2011 at 4:58 AM, aaron morton wrote:

> If you had RF3 in a 3 node cluster and everything was repaired you *should*
> be ok to only take the data from 1 node, if the cluster is not receiving
> writes.
>
> If you want to merge the data from 3 nodes, rename the files; AFAIK they do
> not have to have contiguous file numbers.
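
A rough sketch of that renaming, assuming the 0.8-style on-disk names of the
form <cf>-<version>-<generation>-<Component>.db seen elsewhere on this list;
the directory path and the generation offset below are made up for the
example, and every component of an SSTable is shifted by the same amount so
the files stay paired:

    // Illustrative sketch only: shift the generation number of every SSTable
    // component file in a directory by a fixed offset so files copied from
    // another node do not collide with existing ones. Only the .db components
    // are handled here; the directory and OFFSET are assumptions.
    import java.io.File;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ShiftGenerations
    {
        // e.g. PUBLIC_MONTHLY_20-g-10634-Data.db -> (cf, version, generation, component)
        private static final Pattern SSTABLE =
            Pattern.compile("(.+)-([a-z]+)-(\\d+)-([A-Za-z]+)\\.db");

        private static final int OFFSET = 100000; // pick something larger than any existing generation

        public static void main(String[] args)
        {
            File dir = new File(args.length > 0 ? args[0] : "/tmp/node2-snapshot");
            File[] files = dir.listFiles();
            if (files == null)
                return;
            for (File f : files)
            {
                Matcher m = SSTABLE.matcher(f.getName());
                if (!m.matches())
                    continue;
                int newGen = Integer.parseInt(m.group(3)) + OFFSET;
                String newName = m.group(1) + "-" + m.group(2) + "-" + newGen + "-" + m.group(4) + ".db";
                // Data, Index and Filter of one sstable keep matching generations
                // because the same offset is applied to each component.
                if (!f.renameTo(new File(dir, newName)))
                    System.err.println("could not rename " + f.getName());
            }
        }
    }

After renaming, the files from each node can be copied into the target node's
keyspace data directory without clobbering each other.
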
>
> Cheers
>
>
>  -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 25/09/2011, at 10:45 PM, Yan Chunlu wrote:
>
> thanks! Another problem: what if the number of nodes in the two clusters is not the same?
>
> in my case I am moving data from a 3-node cluster to 1 node; the keyspace files on
> the 3 nodes might use the same names...
>
> I am using the new cluster only for emergency usage, so only 1 node is
> attached.
>
> On Sun, Sep 25, 2011 at 5:20 PM, aaron morton wrote:
>
>> Cassandra can read data from previous versions, i.e. if you upgrade to 0.8 it
>> can read the existing files from 0.7.
>>
>> But what you are doing with the sstable loader is (AFAIK) only copying the
>> Data portion of the CF. Once the table is loaded the node will then build
>> the Index and the Filter; this is the createBuilder() call in the stack. It's
>> throwing because version 0.8 does not want to make version 0.8 Index and
>> Filter files for a version 0.7 Data file.
>>
>> We get the same problem when upgrading from 0.7 to 0.8, where Repair will
>> not work because it is streaming a 0.7 version data file and the recipient
>> then tries to build the Index and Filter files.
>>
>> So to read 0.7 data from 0.8 just copy over *all* the files for the
>> keyspace (data, filter and index). Then scrub the nodes so that repair can
>> work.
>>
>> Hope that helps.
>>
>>
>>  -
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 25/09/2011, at 6:07 PM, Yan Chunlu wrote:
>>
>> yes, I did. I thought 0.8 was backward compatible. Are there other ways to
>> load 0.7's data into 0.8? Will copying the data dir directly work? I
>> would like to put the load of three nodes onto one node.
>>
>>  thanks!
>>
>> On Sun, Sep 25, 2011 at 11:52 AM, aaron morton 
>> wrote:
>>
>>> Looks like it is complaining that you are trying to load a 0.7 SSTable in
>>> 0.8.
>>>
>>>
>>> Cheers
>>>
>>>  -
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 23/09/2011, at 5:23 PM, Yan Chunlu wrote:
>>>
>>> sorry, I did not look into it. After checking it, I found a version mismatch
>>> exception in the log:
>>> ERROR [Thread-17] 2011-09-22 08:24:24,248 AbstractCassandraDaemon.java
>>> (line 139) Fatal exception in thread Thread[Thread-17,5,main]
>>> java.lang.RuntimeException: Cannot recover SSTable
>>> /disk2/cassandra/data/reddit/Comments-tmp-f-1 due to version mismatch.
>>> (current version is g).
>>> at
>>> org.apache.cassandra.io.sstable.SSTableWriter.createBuilder(SSTableWriter.java:240)
>>> at
>>> org.apache.cassandra.db.compaction.CompactionManager.submitSSTableBuild(CompactionManager.java:1097)
>>> at
>>> org.apache.cassandra.streaming.StreamInSession.finished(StreamInSession.java:110)
>>> at
>>> org.apache.cassandra.streaming.IncomingStreamReader.readFile(IncomingStreamReader.java:104)
>>> at
>>> org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:61)
>>> at
>>> org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:189)
>>> at
>>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>>
>>>
>>> does that mean I need to run scrub before running the loader?  could I
>>> just delete it and keep going?  thanks!
>>>
>>> On Fri, Sep 23, 2011 at 2:16 AM, Jonathan Ellis wrote:
>>>
 Did you check for errors in logs on both loader + target?

 On Thu, Sep 22, 2011 at 10:52 AM, Yan Chunlu 
 wrote:
 > I took a snapshot of one of my nodes in a 0.7.4 cluster (N=RF=3), then used
 > sstableloader to load the snapshot data into another 1-node
 > cluster (N=RF=1).
 >
 > after executing "bin/sstableloader /disk2/mykeyspace/"
 >
 > it says"Starting client (and waiting 30 seconds for gossip) ..."
 > "Streaming revelant part of  cf1.db. to [10.23.2.4]"
 > then showing the progress indicator and stopped. nothing changed after
 > then.
 > progress: [/10.28.53.16 1/72 (0)] [total: 0 - 0MB/s (avg: 0MB/s)]]]
 >
 > I used nodetool to check the node 10.23.2.4: nothing changed, no data was
 > copied to it, and the data dir also keeps its original size. Is there anything
 > wrong? How can I tell what is going on there?
 > thanks!



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www