Re: Rolling upgrade from 1.1.12 to 1.2.5 visibility issue

2013-06-21 Thread Polytron Feng
Hi Aaron,

Thank you for your reply. We tried increasing the PHI threshold but still hit the
same issue. We switched to Ec2Snitch and PropertyFileSnitch instead and they work
without this problem; it seems to happen only with the Ec2MultiRegionSnitch
config. Although we can work around the problem with PropertyFileSnitch, we
hit another bug, an EOFException:
https://issues.apache.org/jira/browse/CASSANDRA-5476. We will upgrade to
1.1.12 first and wait for the fix for CASSANDRA-5476.
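
For reference, the PropertyFileSnitch workaround we tried only needs a small
cassandra-topology.properties on every node. A sketch for a single-DC cluster
like this one, using the elided IPs from this thread, might look like:

    # cassandra-topology.properties
    122.248.xxx.xxx=ap-southeast:1b
    54.251.xxx.xxx=ap-southeast:1b
    54.254.xxx.xxx=ap-southeast:1b
    # fallback for any node not listed above
    default=ap-southeast:1b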

Thank you!




On Thu, Jun 20, 2013 at 5:49 PM, aaron morton wrote:

> I once had something like this. Looking at your logs I do not think it's
> the same thing, but here is a post on it:
> http://thelastpickle.com/2011/12/15/Anatomy-of-a-Cassandra-Partition/
>
> It's a little different in 1.2, but the GossipDigestAckVerbHandler (and
> ACK2) should be calling Gossiper.instance.notifyFailureDetector, which will
> result in the FailureDetector being called. This will keep the remote node
> marked as up. It looks like this is happening.
>
>
> TRACE [GossipTasks:1] 2013-06-19 07:44:52,359 FailureDetector.java
> (line 189) PHI for /54.254.xxx.xxx : 8.05616263930532
>
> The default phi_convict_threshold is 8, so this node thinks the other is
> just sick enough to be marked as down.
>
> As a workaround, try increasing phi_convict_threshold to 12. Not sure
> why the 1.2 node thinks this, or whether anything has changed.
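>
> A minimal sketch of that change in cassandra.yaml (12 is just the suggested
> starting point above, not a tested value):
>
>     # cassandra.yaml
>     # default is 8; raise it if the failure detector is too trigger-happy
>     phi_convict_threshold: 12
>
> The node needs a restart to pick it up.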
>
> I used to think there was a way to dump the phi values for nodes, but I
> cannot find it. If you call dumpInterArrivalTimes on
> the org.apache.cassandra.net:type=FailureDetector MBean it will dump a
> file in the temp dir called "failuredetector-*" with the arrival times for
> messages from the other nodes. That may help.
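>
> If you do not have a JMX console handy, a bare-bones client using only the
> JDK's JMX API can invoke that operation (a sketch; the MBean and operation
> names are as above, and the host/port are the usual 7199 defaults, which may
> differ in your setup):
>
>     import javax.management.MBeanServerConnection;
>     import javax.management.ObjectName;
>     import javax.management.remote.JMXConnector;
>     import javax.management.remote.JMXConnectorFactory;
>     import javax.management.remote.JMXServiceURL;
>
>     public class DumpArrivalTimes {
>         public static void main(String[] args) throws Exception {
>             // Cassandra's JMX port defaults to 7199
>             JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
>             JMXConnector jmxc = JMXConnectorFactory.connect(url);
>             try {
>                 MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
>                 ObjectName fd = new ObjectName(
>                     "org.apache.cassandra.net:type=FailureDetector");
>                 // writes failuredetector-* into the temp dir, as described above
>                 mbs.invoke(fd, "dumpInterArrivalTimes", null, null);
>             } finally {
>                 jmxc.close();
>             }
>         }
>     }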
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/06/2013, at 8:34 PM, Polytron Feng  wrote:
>
>
> Hi,
>
> We are trying to do a rolling upgrade from 1.0.12 to 1.2.5, but we found that
> the 1.2.5 node cannot see the other old nodes.
> Therefore, we tried upgrading to 1.1.12 first, and that works.
> However, we still see the same issue when doing a rolling upgrade from 1.1.12
> to 1.2.5.
> This looks like the issue fixed in
> https://issues.apache.org/jira/browse/CASSANDRA-5332, but we still see it
> in 1.2.5.
>
> Environment:
>    OS: CentOS 6
>    JDK: 6u31
>    Cluster: 3 nodes for testing, in EC2
>    Snitch: Ec2MultiRegionSnitch
>    NetworkTopologyStrategy: strategy_options = { ap-southeast:3 }
>
> We have 3 nodes and we upgraded 122.248.xxx.xxx to 1.2.5 first; the other
> 2 nodes are still on 1.1.12.
> When we restarted the upgraded node, it saw the other 2 old nodes as
> UP in the log.
> However, after a few seconds, those 2 nodes were marked as DOWN.
> This is the ring info from the 1.2.5 node - 122.248.xxx.xxx:
>
> Note: Ownership information does not include topology; for complete
> information, specify a keyspace
>
> Datacenter: ap-southeast
> ==
> Address  RackStatus State   LoadOwns
>  Token
>
>   113427455640312821154458202477256070486
> 122.248.xxx.xxx  1b  Up Normal  69.74 GB33.33%
>  1
> 54.251.xxx.xxx   1b  Down   Normal  69.77 GB33.33%
>  56713727820156410577229101238628035243
> 54.254.xxx.xxx   1b  Down   Normal  70.28 GB33.33%
>  113427455640312821154458202477256070486
>
>
> but the old 1.1.12 nodes can see the new node:
>
> Note: Ownership information does not include topology, please specify
> a keyspace.
> Address DC  RackStatus State   Load
>  OwnsToken
>
>  113427455640312821154458202477256070486
> 122.248.xxx.xxx ap-southeast1b  Up Normal  69.74 GB
>  33.33%  1
> 54.251.xxx.xxx  ap-southeast1b  Up Normal  69.77 GB
>  33.33%  56713727820156410577229101238628035243
> 54.254.xxx.xxx  ap-southeast1b  Up Normal  70.28 GB
>  33.33%  113427455640312821154458202477256070486
>
>
> We enabled trace log level to check gossip-related logs. The log below
> from the 1.2.5 node shows that the
> other 2 nodes are UP in the beginning. They seem to complete the
> SYN/ACK/ACK2 handshake cycle.
>
> TRACE [GossipStage:1] 2013-06-19 07:44:43,047
> GossipDigestSynVerbHandler.java (line 40) Received a GossipDigestSynMessage
> from /54.254.xxx.xxx
> TRACE [GossipStage:1] 2013-06-19 07:44:43,047
> GossipDigestSynVerbHandler.java (line 71) Gossip syn digests are :
> /54.254.xxx.xxx:1371617084:10967 /54.251.xxx.xxx:1371625851:2055
> TRACE [GossipStage:1] 2013-06-19 07:44:43,048 Gossiper.java (line 945)
> requestAll for /54.254.xxx.xxx
> .
>
> TRACE [GossipStage:1] 2013-06-19 07:44:43,080
> GossipDigestSynVerbHandler.java (line 84) Sending a GossipDigestAckMessage
> to /54.254.xxx.xxx
> TRACE [GossipStage:1] 2013-

Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread aaron morton
> > nodetool -h localhost flush didn't do much good.
Do you have 100's of millions of rows ?
If so see recent discussions about reducing the bloom_filter_fp_chance and 
index_sampling. 

If this is an old schema you may be using the very old setting of 0.000744 
which creates a lot of bloom filters. 
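
As a sketch (CQL3 on 1.2; the keyspace/table names are placeholders), the
setting is a table property, and only newly written sstables pick it up, so
rewrite the existing ones afterwards:

    ALTER TABLE mykeyspace.mycf WITH bloom_filter_fp_chance = 0.1;

    nodetool upgradesstables mykeyspace mycf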

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 6:36 AM, Wei Zhu  wrote:

> If you want, you can try to force the GC through Jconsole. Memory->Perform GC.
> 
> It theoretically triggers a full GC and when it will happen depends on the JVM
> 
> -Wei
> 
> From: "Robert Coli" 
> To: user@cassandra.apache.org
> Sent: Tuesday, June 18, 2013 10:43:13 AM
> Subject: Re: Heap is not released and streaming hangs at 0%
> 
> On Tue, Jun 18, 2013 at 10:33 AM, srmore  wrote:
> > But then shouldn't the JVM GC it eventually ? I can still see Cassandra alive
> > and kicking but looks like the heap is locked up even after the traffic is
> > long stopped.
> 
> No, when GC system fails this hard it is often a permanent failure
> which requires a restart of the JVM.
> 
> > nodetool -h localhost flush didn't do much good.
> 
> This adds support to the idea that your heap is too full, and not full
> of memtables.
> 
> You could try nodetool -h localhost invalidatekeycache, but that
> probably will not free enough memory to help you.
> 
> =Rob



Re: Joining distinct clusters with the same schema together

2013-06-21 Thread aaron morton
> > Question 2: is this a sane strategy?
> 
> On its face my answer is "not... really"? 
I'd go with a solid no. 

Just because the three independent clusters have a schema that looks the
same does not make them the same. The schema is a versioned document; you will
not be able to merge the clusters by merging the DCs later without downtime.

It will be easier to go with a multi DC setup from the start. 
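
For example, starting with a keyspace that replicates to each DC (a sketch in
CQL3; the keyspace and DC names are placeholders and must match what the
snitch reports):

    CREATE KEYSPACE myks WITH replication =
      { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };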

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 6:36 AM, Eric Stevens  wrote:

> On its face my answer is "not... really"? What do you view yourself as
> getting with this technique versus using built in replication? As an
> example, you lose the ability to do LOCAL_QUORUM vs EACH_QUORUM
> consistency level operations?
> 
> Doing replication manually sounds like a recipe for the DC's eventually 
> getting subtly out of sync with each other.  If a connection goes down 
> between DC's, and you are taking data at both, how will you catch each other 
> up?  C* already offers that resolution for you, and you'd have to work pretty 
> hard to reproduce it for no obvious benefit that I can see.  
> 
> For minimum effort, definitely rely on Cassandra's well-tested codebase for 
> this.
> 
> 
> 
> 
> On Wed, Jun 19, 2013 at 2:27 PM, Robert Coli  wrote:
> On Wed, Jun 19, 2013 at 10:50 AM, Faraaz Sareshwala
>  wrote:
> > Each datacenter will have a cassandra cluster with a separate set of seeds
> > specific to that datacenter. However, the cluster name will be the same.
> >
> > Question 1: is this enough to guarantee that the three datacenters will have
> > distinct cassandra clusters as well? Or will one node in datacenter A still
> > somehow be able to join datacenter B's ring?
> 
> If they have network connectivity and the same cluster name, they are
> the same logical cluster. However if your nodes share tokens and you
> have auto_bootstrap=yes (the implicit default) the second node you
> attempt to start will refuse to start because you are trying to
> bootstrap it into the range of a live node.
> 
> > For now, we are planning on using our own relay mechanism to transfer
> > data changes from one datacenter to another.
> 
> Are you planning to use the streaming commitlog functionality for
> this? Not sure how you would capture all changes otherwise, except
> having your app just write the same thing to multiple places? Unless
> data timestamps are identical between clusters, otherwise-identical
> data will not merge properly, as cassandra uses data timestamps to
> merge.
> 
> > Question 2: is this a sane strategy?
> 
> On its face my answer is "not... really"? What do you view yourself as
> getting with this technique versus using built in replication? As an
> example, you lose the ability to do LOCAL_QUORUM vs EACH_QUORUM
> consistency level operations?
> 
> > Question 3: eventually, we want to turn all these cassandra clusters into 
> > one
> > large multi-datacenter cluster. What's the best practice to do this? Should 
> > I
> > just add nodes from all datacenters to the list of seeds and let cassandra
> > resolve differences? Is there another way I don't know about?
> 
> If you are using NetworkTopologyStrategy and have the same cluster
> name for your isolated clusters, all you need to do is :
> 
> 1) configure NTS to store replicas on a per-datacenter basis
> 2) ensure that your nodes are in different logical data centers (by
> default, all nodes are in DC1/rack1)
> 3) ensure that clusters are able to reach each other
> 4) ensure that tokens do not overlap between clusters (the common
> technique with manual token assignment is that each node gets a range
> which is off-by-one)
> 5) ensure that all nodes seed lists contain (recommended) 3 seeds from each DC
> 6) rolling restart (so the new seed list is picked up)
> 7) repair ("should" only be required if writes have not replicated via
> your out of band mechanism)
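>
> A sketch of step 1 in CQL3 (assuming the keyspace name is the same in every
> cluster and the DC names below match what your snitch reports):
>
>     ALTER KEYSPACE myks WITH replication =
>       { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3 };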
> 
> Vnodes change the picture slightly because the chance of your clusters
> having conflicting tokens increases with the number of token ranges
> you have.
> 
> =Rob
> 



Re: error on startup: unable to find sufficient sources for streaming range

2013-06-21 Thread aaron morton
> On some of my nodes, I'm getting the following exception when cassandra starts
How many nodes? 
Is this a new node or an old one and this problem just started ? 

What version are you on ? 

Do you have this error from system.log ? It includes the thread name which is 
handy to debug things. Also looks like there are some lines missing from the 
first error. 
 
It looks like an error that may happen when a node is bootstrapping or 
replacing an existing node. If you can provide some more context we may be able 
to help.

Cheers
 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 10:36 AM, Faraaz Sareshwala  wrote:

> Hi,
> 
> I couldn't find any information on the following error so I apologize if it 
> has
> already been discussed.
> 
> On some of my nodes, I'm getting the following exception when cassandra starts
> up:
> 
> 2013-06-19 22:17:39.480414500 Exception encountered during startup: unable to 
> find sufficient sources for streaming range 
> (-4250921392403750427,-4250887922781325324]
> 2013-06-19 22:17:39.482733500 ERROR Exception in thread 
> Thread[StorageServiceShutdownHook,5,main] 
> (CassandraDaemon.java:org.apache.cassandra.service.CassandraDaemon$1:175)
> 2013-06-19 22:17:39.482735500 java.lang.NullPointerException
> 2013-06-19 22:17:39.482735500   at 
> org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
> 2013-06-19 22:17:39.482736500   at 
> org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
> 2013-06-19 22:17:39.482736500   at 
> org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
> 2013-06-19 22:17:39.482751500   at 
> org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)
> 2013-06-19 22:17:39.482752500   at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 2013-06-19 22:17:39.482752500   at java.lang.Thread.run(Thread.java:662)
> 
> Can someone point me to more information about what could cause this error?
> 
> Faraaz



Re: Performance Difference between Cassandra version

2013-06-21 Thread aaron morton
> I am trying to see whether there will be any performance difference between 
> Cassandra 1.0.8 vs Cassandra 1.2.2 for reading the data mainly?
1.0 has key and row caches defined per CF; 1.1 has global ones, which are better 
utilised and easier to manage. 
1.2 moves bloom filters and compression metadata off heap, which reduces GC and 
will help. 
Things normally get faster.

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 11:24 AM, Franc Carter  wrote:

> On Thu, Jun 20, 2013 at 9:18 AM, Raihan Jamal  wrote:
> I am trying to see whether there will be any performance difference between 
> Cassandra 1.0.8 vs Cassandra 1.2.2 for reading the data mainly?
> 
> Has anyone seen any major performance difference?
> 
> We are part way through a performance comparison between 1.0.9 with Size 
> Tiered Compaction and 1.2.4 with Leveled Compaction - for our use case it 
> looks like a significant performance improvement on the read side.  We are 
> finding compaction lags when we do very large bulk loads, but for us this is 
> an initialisation task and that's a reasonable trade-off
> 
> cheers
> 
> -- 
> Franc Carter | Systems architect | Sirca Ltd
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 8355 2514 
> Level 4, 55 Harrington St, The Rocks NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
> 



Re: Unit Testing Cassandra

2013-06-21 Thread aaron morton
> > 2) Second (in which I am more interested in) is for performance 
> > (stress/load) testing. 
Sometimes you can get cassandra-stress (shipped in the bin distro) to 
approximate the expected work load. It's then pretty easy to benchmark and 
test your configuration changes.
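
A sketch of a basic run (the path and flags are from the 1.2-era tool and may
differ in your version; check the tool's help output):

    # write 1,000,000 rows against one node, then read them back
    tools/bin/cassandra-stress -d 10.0.0.1 -n 1000000
    tools/bin/cassandra-stress -d 10.0.0.1 -n 1000000 -o read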

Cheers
 
-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 2:25 PM, Shahab Yunus  wrote:

> Thanks Edward, Ben and Dean for the pointers. Yes, I am using Java and these 
> sounds promising for unit testing, at least.
> 
> Regards,
> Shahab
> 
> 
> On Wed, Jun 19, 2013 at 9:58 AM, Edward Capriolo  
> wrote:
> You really do not need much in Java; you can use the embedded server. Hector 
> wraps a simple class around this called EmbeddedServerHelper.
> 
> 
> On Wednesday, June 19, 2013, Ben Boule  wrote:
> > Hi Shabab,
> >
> > Cassandra-Unit has been helpful for us for running unit tests without 
> > requiring a real cassandra instance to be running.   We only use this to 
> > test our "DAO" code which interacts with the Cassandra client.  It 
> > basically starts up an embedded instance of cassandra and fools your 
> > client/driver into using it.  It uses a non-standard port and you just need 
> > to make sure you can set the port as a parameter into your client code.
> >
> > https://github.com/jsevellec/cassandra-unit
> >
> > One important thing is to either clear out the keyspace in between tests or 
> > carefully separate your data so different tests don't collide with each 
> > other in the embedded database.
> >
> > Setup/tear down time is pretty reasonable.
> >
> > Ben
> > 
> > From: Shahab Yunus [shahab.yu...@gmail.com]
> > Sent: Wednesday, June 19, 2013 8:46 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Unit Testing Cassandra
> >
> > Thanks Stephen for you reply and explanation. My bad that I mixed those up 
> > and wasn't clear enough. Yes, I have different 2 requests/questions.
> > 1) One is for the unit testing.
> > 2) Second (in which I am more interested in) is for performance 
> > (stress/load) testing. Let us keep integration aside for now.
> > I do see some stuff out there but wanted to know recommendations from the 
> > community given their experience.
> > Regards,
> > Shahab
> >
> > On Wed, Jun 19, 2013 at 3:15 AM, Stephen Connolly 
> >  wrote:
> >>
> >> Unit testing means testing in isolation the smallest part.
> >> Unit tests should not take more than a few milliseconds to set up and 
> >> verify their assertions.
> >> As such, if your code is not factored well for testing, you would 
> >> typically use mocking (either by hand, or with mocking libraries) to mock 
> >> out the bits not under test.
> >> Extensive use of mocks is usually a smell of code that is not well 
> >> designed *for testing*
> >> If you intend to test components integrated together... That is 
> >> integration testing.
> >> If you intend to test performance of the whole or significant parts of the 
> >> whole... That is performance testing.
> >> When searching for the above, you will not get much luck if you are 
> >> looking for them in the context of "unit testing" as those things are 
> >> *outside the scope of unit testing*
> >>
> >> On Wednesday, 19 June 2013, Shahab Yunus wrote:
> >>>
> >>> Hello,
> >>>
> >>> Can anyone suggest a good/popular Unit Test tools/frameworks/utilities out
> >>> there for unit testing Cassandra stores? I am looking for testing from 
> >>> performance/load and monitoring perspective. I am using 1.2.
> >>>
> >>> Thanks a lot.
> >>>
> >>> Regards,
> >>> Shahab
> >>
> >>
> >> --
> >> Sent from my phone
> >
> > This electronic message contains information which may be confidential or 
> > privileged. The information is intended for the use of the individual or 
> > entity named above. If you are not the intended recipient, be aware that 
> > any disclosure, copying, distribution or use of the contents of this 
> > information is prohibited. If you have received this electronic 
> > transmission in error, please notify us by e-mail at 
> > (postmas...@rapid7.com) immediately.
> 



Re: Get fragments of big files (videos)

2013-06-21 Thread aaron morton
You should split the large blobs into multiple rows; I would use 10MB per 
row as a good rule of thumb. 

See http://www.datastax.com/dev/blog/cassandra-file-system-design for a 
description of a blob store in Cassandra.
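
A sketch of what the chunking can look like in CQL3 (the names and the 10MB
chunk size are placeholders, not a prescription):

    CREATE TABLE file_chunks (
      file_id  text,
      chunk_id int,   -- byte_offset / (10 * 1024 * 1024)
      data     blob,
      PRIMARY KEY (file_id, chunk_id)
    );

    -- read only the chunks that cover the byte range you need
    SELECT data FROM file_chunks
      WHERE file_id = 'video-123' AND chunk_id >= 4 AND chunk_id <= 6;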

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 8:54 PM, Simon Majou  wrote:

> Thanks Serge
> 
> Simon
> 
> 
> On Thu, Jun 20, 2013 at 10:48 AM, Serge Fonville
>  wrote:
>> Also, after a quick Google.
>> 
>> http://wiki.apache.org/cassandra/CassandraLimitations states values cannot
>> exceed 2GB; it also answers your offset question.
>> 
>> HTH
>> Kind regards/met vriendelijke groet,
>> 
>> Serge Fonville
>> 
>> http://www.sergefonville.nl
>> 
>> Convince Microsoft!
>> They need to add TRUNCATE PARTITION in SQL Server
>> https://connect.microsoft.com/SQLServer/feedback/details/417926/truncate-partition-of-partitioned-table
>> 
>> 
>> 2013/6/20 Sachin Sinha 
>>> 
>>> Fragment them in rows, that will help.
>>> 
>>> 
>>> On 20 June 2013 09:43, Simon Majou  wrote:
 
 Hello,
 
 If I store a video into a column, how can I get a fragment of it
 without having to download it entirely ? Is there a way to give an
 offset on a column ?
 
 Do I have to fragment it over a lot of small fixed-size columns ? Is
 there any disadvantage to doing so ? For example, fragmenting a 10GB file
 into 1,000 columns of 10 MB ?
 
 Simon
>>> 
>>> 
>> 



Re: Compaction not running

2013-06-21 Thread aaron morton
> Do you think it's worth posting an issue, or not enough traceable evidence ?
If you can reproduce it then certainly file a bug. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/06/2013, at 9:41 PM, Franc Carter  wrote:

> On Thu, Jun 20, 2013 at 7:27 PM, aaron morton  wrote:
>> nodetool compactionstats, gives
>> 
>> pending tasks: 13120
> If there are no errors in the log, I would say this is a bug. 
> 
> This happened after the node ran out of file descriptors, so an edge case 
> wouldn't surprise me.
> 
> I've rebuilt the node (blown the data way and am running a nodetool rebuild). 
> Do you think it's worth posting an issue, or not enough traceable evidence ?
> 
> cheers
>  
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19/06/2013, at 11:41 AM, Franc Carter  wrote:
> 
>> On Wed, Jun 19, 2013 at 9:34 AM, Bryan Talbot  wrote:
>> Manual compaction for LCS doesn't really do much.  It certainly doesn't 
>> compact all those little files into bigger files.  What makes you think that 
>> compactions are not occurring? 
>> 
>> Yeah, that's what I thought, however:-
>> 
>> nodetool compactionstats, gives
>> 
>> pending tasks: 13120
>>Active compaction remaining time :n/a
>> 
>> when I run nodetool compact in a loop the pending tasks goes down gradually.
>> 
>> This node also has vastly higher latencies (x10) than the other nodes. I saw 
>> this with a previous CF that I 'manually compacted', and when the pending 
>> tasks reached low numbers (stuck on 9) the latencies were back to low 
>> milliseconds.
>> 
>> cheers
>>  
>> -Bryan
>> 
>> 
>> 
>> On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter  
>> wrote:
>> On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter  
>> wrote:
>> On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli  wrote:
>> On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter  
>> wrote:
>> > We are running a test system with Leveled compaction on Cassandra-1.2.4.
>> > While doing an initial load of the data one of the nodes ran out of file
>> > descriptors and since then it hasn't been automatically compacting.
>> 
>> You have (at least) two options :
>> 
>> 1) increase file descriptors available to Cassandra with ulimit, if possible
>> 2) increase the size of your sstables with levelled compaction, such
>> that you have fewer of them
>> 
>> Oops, I wasn't clear enough.
>> 
>> I have increased the number of file descriptors and no longer have a file 
>> descriptor issue. However the node still doesn't compact automatically. If I 
>> run a 'nodetool compact' it will do a small amount of compaction and then 
>> stop. The Column Family is using LCS
>> 
>> Any ideas on this - compaction is still not automatically running for one of 
>> my nodes
>> 
>> thanks
>>  
>> 
>> cheers
>>  
>> 
>> =Rob
>> 
>> 
>> 
>> -- 
>> Franc Carter | Systems architect | Sirca Ltd
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 8355 2514 
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>> 
>> 
>> 
>> 
>> -- 
>> Franc Carter | Systems architect | Sirca Ltd
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 8355 2514 
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Franc Carter | Systems architect | Sirca Ltd
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 8355 2514 
>> Level 4, 55 Harrington St, The Rocks NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>> 
> 
> 
> 
> 
> -- 
> Franc Carter | Systems architect | Sirca Ltd
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 8355 2514 
> Level 4, 55 Harrington St, The Rocks NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
> 



Re: Confirm with cqlsh of Cassandra-1.2.5, the behavior of the export/import

2013-06-21 Thread aaron morton
That looks like it may be a bug, can you raise a ticket at 
https://issues.apache.org/jira/browse/CASSANDRA

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/06/2013, at 1:56 AM, hiroshi.kise...@hitachi.com wrote:

> 
> Dear everyone.
> 
> I'm Hiroshi Kise.
> I have been checking the export/import behavior of cqlsh in Cassandra 1.2.5.
> When I use cqlsh's COPY command on data that includes "{" and "[" (i.e.
> collection types), it looks like data integrity is compromised during the
> export/import round trip.
> Is that expected?
> 
> If there is a mistake somewhere, such as in my CREATE TABLE definitions,
> please tell me the right way to do this.
> 
> 
> Concrete operation is as follows.
> -*-*-*-*-*-*-*-*
> (1)map type's export/import
> 
> [root@castor bin]# ./cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
> Use HELP for help.
> cqlsh> create keyspace maptestks with replication  = { 'class' : 
> 'SimpleStrategy', 'replication_factor' : '1' };
> cqlsh> use maptestks;
> cqlsh:maptestks> create table maptestcf (rowkey varchar PRIMARY KEY, 
> targetmap map<text, text>);
> cqlsh:maptestks> insert into maptestcf (rowkey, targetmap) values 
> ('rowkey',{'mapkey':'mapvalue'});
> cqlsh:maptestks> select * from maptestcf;
> 
> rowkey | targetmap
> +
> rowkey | {mapkey: mapvalue}
> cqlsh:maptestks>  copy maptestcf to 'maptestcf-20130619.txt';
> 1 rows exported in 0.008 seconds.
> cqlsh:maptestks> exit;
> 
> [root@castor bin]# cat maptestcf-20130619.txt
> rowkey,{mapkey: mapvalue}
>    <(a)
> 
> [root@castor bin]# ./cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
> Use HELP for help.
> cqlsh> create keyspace mapimptestks with replication  = { 'class' : 
> 'SimpleStrategy', 'replication_factor' : '1' };
> cqlsh> use mapimptestks;
> cqlsh:mapimptestks> create table mapimptestcf (rowkey varchar PRIMARY KEY, 
> targetmap map<text, text>);
> 
> cqlsh:mapimptestks> copy mapimptestcf from ' maptestcf-20130619.txt ';
> Bad Request: line 1:83 no viable alternative at input '}'
> Aborting import at record #0 (line 1). Previously-inserted values still 
> present.
> 0 rows imported in 0.025 seconds.
> -*-*-*-*-*-*-*-*
> (2)list type's export/import
> 
> [root@castor bin]#./cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
> Use HELP for help.
> cqlsh> create keyspace listtestks with replication  = { 'class' : 
> 'SimpleStrategy', 'replication_factor' : '1' };
> cqlsh> use listtestks;
> cqlsh:listtestks> create table listtestcf (rowkey varchar PRIMARY KEY, value 
> list<text>);
> cqlsh:listtestks> insert into listtestcf (rowkey,value) values 
> ('rowkey',['value1','value2']);
> cqlsh:listtestks> select * from listtestcf;
> 
> rowkey | value
> +--
> rowkey | [value1, value2]
> 
> cqlsh:listtestks> copy listtestcf to 'listtestcf-20130619.txt';
> 1 rows exported in 0.014 seconds.
> cqlsh:listtestks> exit;
> 
> [root@castor bin]# cat listtestcf-20130619.txt
> rowkey,"[value1, value2]"
>    <(b)
> 
> [root@castor bin]# ./cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
> Use HELP for help.
> cqlsh> create keyspace listimptestks with replication  = { 'class' : 
> 'SimpleStrategy', 'replication_factor' : '1' };
> cqlsh> use listimptestks;
> cqlsh:listimptestks> create table listimptestcf (rowkey varchar PRIMARY KEY, 
> value list<text>);
> cqlsh:listimptestks> copy listimptestcf from ' listtestcf-20130619.txt ';
> Bad Request: line 1:79 no viable alternative at input ']'
> Aborting import at record #0 (line 1). Previously-inserted values still 
> present.
> 0 rows imported in 0.030 seconds.
> -*-*-*-*-*-*-*-*
> Reference: (correct, or error, in another dimension)
> 
> Manually, I have rewritten the export file.
> [root@castor bin]# cat nlisttestcf-20130619.txt
> rowkey,"['value1',' value2']"
> 
> 
> cqlsh:listimptestks> copy listimptestcf from 'nlisttestcf-20130619.txt';
> 1 rows imported in 0.035 seconds.
> 
> cqlsh:listimptestks> select * from implisttestcf;
> rowkey | value
> +--
> rowkey | [value1, value2]
> cqlsh:implisttestks> exit;
> 
> [root@castor bin]# cat nmaptestcf-20130619.txt
> rowkey,"{'mapkey': 'mapvalue'}"
> 
> [root@castor bin]# ./cqlsh
> Connected to Test Cluster at localhost:9160.
> [cqlsh 3.0.2 | Cassandra 1.2.5 | CQL spec 3.0.0 | Thrift protocol 19.36.0]
> Use HELP for help.
> cqlsh> use  mapimptestks;

Re: block size

2013-06-21 Thread aaron morton
> If I have a data in column of size 500KB, 
> 
Also some information here 
http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

The data files are memory mapped, so it's sort of OS dependent. 

A

-

Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/06/2013, at 8:29 AM, Shahab Yunus  wrote:

> Ok. Though the closest that I can find is this (Aaron Morton's great blog):
> http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
> 
> I would also like to know the answer, as I haven't come across
> 'block size' as a core concept in Cassandra (or a concept to be considered
> while developing with it), unlike Hadoop.
> 
> Regards,
> Shahab
> 
> 
> On Thu, Jun 20, 2013 at 3:38 PM, Kanwar Sangha  wrote:
> Yes. Is that not specific to Hadoop with CFS ? I want to know: if I have
> data in a column of size 500KB, how many IOPS are needed to read it ?
> (assuming we have the key cache enabled)
> 
>  
> 
>  
> 
> From: Shahab Yunus [mailto:shahab.yu...@gmail.com] 
> Sent: 20 June 2013 14:32
> To: user@cassandra.apache.org
> Subject: Re: block size
> 
>  
> 
> Have you seen this?
> 
> http://www.datastax.com/dev/blog/cassandra-file-system-design
> 
>  
> 
> Regards,
> Shahab
> 
>  
> 
> On Thu, Jun 20, 2013 at 3:17 PM, Kanwar Sangha  wrote:
> 
> Hi – What is the block size for Cassandra ? Is it taken from the OS defaults ?
> 
>  
> 
> 



Re: Compaction not running

2013-06-21 Thread Franc Carter
On Fri, Jun 21, 2013 at 6:16 PM, aaron morton wrote:

> Do you think it's worth posting an issue, or not enough traceable evidence
> ?
>
> If you can reproduce it then certainly file a bug.
>

I'll keep my eye on it to see if it happens again and there is a pattern

cheers


>
> Cheers
>
>-
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/06/2013, at 9:41 PM, Franc Carter  wrote:
>
> On Thu, Jun 20, 2013 at 7:27 PM, aaron morton wrote:
>
>> nodetool compactionstats, gives
>>
>> pending tasks: 13120
>>
>> If there are no errors in the log, I would say this is a bug.
>>
>
> This happened after the node ran out of file descriptors, so an edge case
> wouldn't surprise me.
>
> I've rebuilt the node (blown the data way and am running a nodetool
> rebuild). Do you think it's worth posting an issue, or not enough traceable
> evidence ?
>
> cheers
>
>
>>
>> Cheers
>>
>>-
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 19/06/2013, at 11:41 AM, Franc Carter 
>> wrote:
>>
>> On Wed, Jun 19, 2013 at 9:34 AM, Bryan Talbot wrote:
>>
>>> Manual compaction for LCS doesn't really do much.  It certainly doesn't
>>> compact all those little files into bigger files.  What makes you think
>>> that compactions are not occurring?
>>>
>>
>> Yeah, that's what I thought, however:-
>>
>> nodetool compactionstats, gives
>>
>> pending tasks: 13120
>>Active compaction remaining time :n/a
>>
>> when I run nodetool compact in a loop the pending tasks goes down
>> gradually.
>>
>> This node also has vastly higher latencies (x10) than the other nodes. I
>> saw this with a previous CF than I 'manually compacted', and when the
>> pending tasks reached low numbers (stuck on 9) then latencies were back to
>> low milliseconds
>>
>> cheers
>>
>>
>>> -Bryan
>>>
>>>
>>>
>>> On Tue, Jun 18, 2013 at 3:59 PM, Franc Carter >> > wrote:
>>>
 On Sat, Jun 15, 2013 at 11:49 AM, Franc Carter <
 franc.car...@sirca.org.au> wrote:

> On Sat, Jun 15, 2013 at 8:48 AM, Robert Coli wrote:
>
>> On Wed, Jun 12, 2013 at 3:26 PM, Franc Carter <
>> franc.car...@sirca.org.au> wrote:
>> > We are running a test system with Leveled compaction on
>> Cassandra-1.2.4.
>> > While doing an initial load of the data one of the nodes ran out of
>> file
>> > descriptors and since then it hasn't been automatically compacting.
>>
>> You have (at least) two options :
>>
>> 1) increase file descriptors available to Cassandra with ulimit, if
>> possible
>> 2) increase the size of your sstables with levelled compaction, such
>> that you have fewer of them
>>
>
> Oops, I wasn't clear enough.
>
> I have increased the number of file descriptors and no longer have a
> file descriptor issue. However the node still doesn't compact
> automatically. If I run a 'nodetool compact' it will do a small amount of
> compaction and then stop. The Column Family is using LCS
>

 Any ideas on this - compaction is still not automatically running for
 one of my nodes

 thanks


>
> cheers
>
>
>>
>> =Rob
>>
>
>
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
>  
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 8355 2514
>  Level 4, 55 Harrington St, The Rocks NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
>
>


 --
 *Franc Carter* | Systems architect | Sirca Ltd
  
 franc.car...@sirca.org.au | www.sirca.org.au
 Tel: +61 2 8355 2514
  Level 4, 55 Harrington St, The Rocks NSW 2000
 PO Box H58, Australia Square, Sydney NSW 1215


>>>
>>
>>
>> --
>> *Franc Carter* | Systems architect | Sirca Ltd
>>  
>> franc.car...@sirca.org.au | www.sirca.org.au
>> Tel: +61 2 8355 2514
>>  Level 4, 55 Harrington St, The Rocks NSW 2000
>> PO Box H58, Australia Square, Sydney NSW 1215
>>
>>
>>
>
>
> --
> *Franc Carter* | Systems architect | Sirca Ltd
>  
> franc.car...@sirca.org.au | www.sirca.org.au
> Tel: +61 2 8355 2514
>  Level 4, 55 Harrington St, The Rocks NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>


-- 

*Franc Carter* | Systems architect | Sirca Ltd
 

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 8355 2514

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215


Re: nodetool ring showing different 'Load' size

2013-06-21 Thread Rodrigo Felix
Ok. Thank you all you guys.

Att.

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP


On Wed, Jun 19, 2013 at 2:26 PM, Robert Coli  wrote:

> On Wed, Jun 19, 2013 at 5:47 AM, Michal Michalski 
> wrote:
> > You can also perform a major compaction via nodetool compact (for
> > SizeTieredCompaction), but - again - you really should not do it unless
> > you're really sure what you do, as it compacts all the SSTables together,
> > which is not something you might want to achieve in most of the cases.
>
> If you do that and discover you did not want to :
>
> https://github.com/pcmanus/cassandra/tree/sstable_split
>
> Will enable you to split your monolithic sstable back into smaller
> sstables.
>
> =Rob
> PS - @pcmanus, here's that reminder we discussed @ summit to merge
> this tool into upstream! :D
>


Re: [Cassandra] Replacing a cassandra node

2013-06-21 Thread Eric Stevens
Is there a way to replace a failed server using vnodes?  I only had
occasion to do this once, on a relatively small cluster.  At the time I
just needed to get the new server online and wasn't concerned about the
performance implications, so I just removed the failed server from the
cluster and bootstrapped a new one.  Of course that caused a bunch of key
reassignments, so I'm sure it would be less work for the cluster if I could
bring a new server online with the same vnodes as the failed server.


On Thu, Jun 20, 2013 at 2:40 PM, Robert Coli  wrote:

> On Thu, Jun 20, 2013 at 10:40 AM, Emalayan Vairavanathan
>  wrote:
> > In the case where replace a cassandra node (call it node A) with another
> one
> > that has the exact same IP (ie. during a node failure), what exactly
> should
> > we do?  Currently I understand that we should at least run "nodetool
> > repair".
>
> If you lost the data from the node, then what you want is "replace_token."
>
> If you didn't lose the data from the node (and can tolerate stale
> reads until the repair completes) you want to start the node with
> auto_bootstrap set to false and then repair.
>
> =Rob
>


Re: timeuuid and cql3 query

2013-06-21 Thread Eric Stevens
It's my understanding that if the first part of the primary key has low
cardinality, you will struggle with cluster balance, as (unless
you use WITH COMPACT STORAGE) the first entry of the primary key equates to
the row key from the traditional interface; thus all entries related to a
single value for the "counter" column will map to the same partition.

So consider the cardinality of this field. If cardinality is low, you might
need to remodel with PRIMARY KEY (counter, ts, key1) and then tack on WITH
COMPACT STORAGE (then the entire primary key becomes the row key, but you can
only have one column which is not part of the primary key). If the cardinality
of "counter" is high, then you have nothing to worry about.


On Wed, Jun 19, 2013 at 3:16 PM, Francisco Andrades Grassi <
bigjoc...@gmail.com> wrote:

> Hi,
>
> I believe what he's recommending is:
>
> CREATE TABLE count3 (
>   counter text,
>   ts timeuuid,
>   key1 text,
>   value int,
>   PRIMARY KEY (counter, ts)
> )
>
> That way *counter* will be your partitioning key, and all the rows that
> have the same *counter* value will be clustered (stored as a single wide
> row sorted by the *ts* value). In this scenario the query:
>
>  where counter = 'test' and ts > minTimeuuid('2013-06-18 16:23:00') and ts
> < minTimeuuid('2013-06-18 16:24:00');
>
> would actually be a sequential read on a wide row on a single node.
>
> --
> Francisco Andrades Grassi
> www.bigjocker.com
> @bigjocker
>
> On Jun 19, 2013, at 12:17 PM, "Ryan, Brent"  wrote:
>
>  Tyler,
>
>  You're recommending this schema instead, correct?
>
>  CREATE TABLE count3 (
>   counter text,
>   ts timeuuid,
>   key1 text,
>   value int,
>   PRIMARY KEY (ts, counter)
> )
>
>  I believe I tried this as well and ran into similar problems but I'll
> try it again.  I'm using the "ByteOrderedPartitioner" if that helps with
> the latest version of DSE community edition which I believe is Cassandra
> 1.2.3.
>
>
>  Thanks,
> Brent
>
>
>   From: Tyler Hobbs 
> Reply-To: "user@cassandra.apache.org" 
> Date: Wednesday, June 19, 2013 11:00 AM
> To: "user@cassandra.apache.org" 
> Subject: Re: timeuuid and cql3 query
>
>
> On Wed, Jun 19, 2013 at 8:08 AM, Ryan, Brent  wrote:
>
>>
>>  CREATE TABLE count3 (
>>   counter text,
>>   ts timeuuid,
>>   key1 text,
>>   value int,
>>   PRIMARY KEY ((counter, ts))
>> )
>>
>
> Instead of doing a composite partition key, remove a set of parens and let
> ts be your clustering key.  That will cause cql rows to be stored in sorted
> order by the ts column (for a given value of "counter") and allow you to do
> the kind of query you're looking for.
>
>
> --
> Tyler Hobbs
> DataStax 
>
>
>


Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread srmore
On Fri, Jun 21, 2013 at 2:53 AM, aaron morton wrote:

> > nodetool -h localhost flush didn't do much good.
>
> Do you have 100's of millions of rows ?
> If so see recent discussions about reducing the bloom_filter_fp_chance and
> index_sampling.
>
Yes, I have 100's of millions of rows.


>
> If this is an old schema you may be using the very old setting of 0.000744
> which creates a lot of bloom filters.
>
The bloom_filter_fp_chance value was changed from the default to 0.1. I looked
at the filters and they are about 2.5G on disk, and I have around 8G of heap.
I will try increasing the value to 0.7 and report my results.

It also appears to be a case of hard GC failure (as Rob mentioned), as the
heap is never released even after 24+ hours of idle time; the JVM needs to
be restarted to reclaim the heap.

Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/06/2013, at 6:36 AM, Wei Zhu  wrote:
>
> If you want, you can try to force the GC through Jconsole. Memory->Perform
> GC.
>
> It theoretically triggers a full GC and when it will happen depends on the
> JVM
>
> -Wei
>
> --
> *From: *"Robert Coli" 
> *To: *user@cassandra.apache.org
> *Sent: *Tuesday, June 18, 2013 10:43:13 AM
> *Subject: *Re: Heap is not released and streaming hangs at 0%
>
> On Tue, Jun 18, 2013 at 10:33 AM, srmore  wrote:
> > But then shouldn't JVM C G it eventually ? I can still see Cassandra
> alive
> > and kicking but looks like the heap is locked up even after the traffic
> is
> > long stopped.
>
> No, when GC system fails this hard it is often a permanent failure
> which requires a restart of the JVM.
>
> > nodetool -h localhost flush didn't do much good.
>
> This adds support to the idea that your heap is too full, and not full
> of memtables.
>
> You could try nodetool -h localhost invalidatekeycache, but that
> probably will not free enough memory to help you.
>
> =Rob
>
>
>


Re: timeuuid and cql3 query

2013-06-21 Thread Ryan, Brent
Yes.  The problem is that I can't use "counter" as the partition key; otherwise 
I'd wind up with hot spots in my cluster where the majority of the data is being 
written to a single node. The only real way around this problem with Cassandra 
is to follow along with what this blog does:

http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
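
A sketch of the bucketing approach from that post, applied to the table in
this thread (the bucket column, its format, and the bucket size are
placeholders):

    CREATE TABLE count3 (
      counter     text,
      time_bucket text,    -- e.g. '2013-06-18:16' for hourly buckets
      ts          timeuuid,
      key1        text,
      value       int,
      PRIMARY KEY ((counter, time_bucket), ts, key1)
    );

    SELECT * FROM count3
     WHERE counter = 'test' AND time_bucket = '2013-06-18:16'
       AND ts > minTimeuuid('2013-06-18 16:23:00')
       AND ts < minTimeuuid('2013-06-18 16:24:00');

Writes for a given counter then rotate through one partition per bucket
instead of landing on a single partition forever.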


From: Eric Stevens <migh...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, June 21, 2013 8:38 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: timeuuid and cql3 query

It's my understanding that if cardinality of the first part of the primary key 
has low cardinality, you will struggle with cluster balance as (unless you use 
WITH COMPACT STORAGE) the first entry of the primary key equates to the row key 
from the traditional interface, thus all entries related to a single value for 
the "counter" column will map to the same partition.

So consider the cardinality of this field, if cardinality is low, you might 
need to remodel with PRIMARY KEY (counter, ts, key1) then tack on WITH COMPACT 
STORAGE (then the entire primary key becomes the row key, but you can only have 
one column which is not part of the primary key)  If cardinality of "counter" 
is high, then you have nothing to worry about.


On Wed, Jun 19, 2013 at 3:16 PM, Francisco Andrades Grassi
<bigjoc...@gmail.com> wrote:
Hi,

I believe what he's recommending is:

CREATE TABLE count3 (
  counter text,
  ts timeuuid,
  key1 text,
  value int,
  PRIMARY KEY (counter, ts)
)

That way counter will be your partitioning key, and all the rows that have the 
same counter value will be clustered (stored as a single wide row sorted by the 
ts value). In this scenario the query:

 where counter = 'test' and ts > minTimeuuid('2013-06-18 16:23:00') and ts < 
minTimeuuid('2013-06-18 16:24:00');

would actually be a sequential read on a wide row on a single node.

--
Francisco Andrades Grassi
www.bigjocker.com
@bigjocker

On Jun 19, 2013, at 12:17 PM, "Ryan, Brent" <br...@cvent.com> wrote:

Tyler,

You're recommending this schema instead, correct?

CREATE TABLE count3 (
  counter text,
  ts timeuuid,
  key1 text,
  value int,
  PRIMARY KEY (ts, counter)
)

I believe I tried this as well and ran into similar problems but I'll try it 
again.  I'm using the "ByteOrderedPartitioner" if that helps with the latest 
version of DSE community edition which I believe is Cassandra 1.2.3.


Thanks,
Brent


From: Tyler Hobbs <ty...@datastax.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, June 19, 2013 11:00 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: timeuuid and cql3 query


On Wed, Jun 19, 2013 at 8:08 AM, Ryan, Brent <br...@cvent.com> wrote:

CREATE TABLE count3 (
  counter text,
  ts timeuuid,
  key1 text,
  value int,
  PRIMARY KEY ((counter, ts))
)

Instead of doing a composite partition key, remove a set of parens and let ts 
be your clustering key.  That will cause cql rows to be stored in sorted order 
by the ts column (for a given value of "counter") and allow you to do the kind 
of query you're looking for.


--
Tyler Hobbs
DataStax




NREL has released open source Databus on github for time series data

2013-06-21 Thread Hiller, Dean
NREL has released their open source databus.  They spin it as energy data (and 
a system for campus energy/building energy) but it is very general right now 
and probably will stay pretty general.  More information can be found here

http://www.nrel.gov/analysis/databus/

The source code can be found here
https://github.com/deanhiller/databus

Star the project if you like the idea.  NREL just did a big press release and 
is developing a community around the project.  It is in its early stages but 
there are users using it and I am helping HP set an instance up this month.  If 
you want to become a committer on the project, let me know as well.

Later,
Dean



Cassandra terminates with OutOfMemory (OOM) error

2013-06-21 Thread Mohammed Guller
We have a 3-node cassandra cluster on AWS. These nodes are running cassandra 
1.2.2 and have 8GB memory. We didn't change any of the default heap or GC 
settings. So each node is allocating 1.8GB of heap space. The rows are wide; 
each row stores around 260,000 columns. We are reading the data using Astyanax. 
If our application tries to read 80,000 columns each from 10 or more rows at 
the same time, some of the nodes run out of heap space and terminate with OOM 
error. Here is the error message:

java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
at 
org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
at 
org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
at 
org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
at 
org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
at 
org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
at 
org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
at 
org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
at 
org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
at 
org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
at org.apache.cassandra.db.Table.getRow(Table.java:355)
at 
org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
   at 
org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
at 
org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

ERROR 02:14:05,351 Exception in thread Thread[Thrift:6,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.lang.Long.toString(Long.java:269)
at java.lang.Long.toString(Long.java:764)
at 
org.apache.cassandra.dht.Murmur3Partitioner$1.toString(Murmur3Partitioner.java:171)
at 
org.apache.cassandra.service.StorageService.describeRing(StorageService.java:1068)
at 
org.apache.cassandra.thrift.CassandraServer.describe_ring(CassandraServer.java:1192)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3766)
at 
org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3754)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

The data in each column is less than 50 bytes. After adding all the column 
overheads (column name + metadata), it should not be more than 100 bytes. So 
reading 80,000 columns from 10 rows each means that we are reading 80,000 * 10 
* 100 = 80 MB of data. It is large, but not large enough to fill up the 1.8 GB 
heap. So I wonder why the heap is getting full. If the data request is too big 
to fulfill in a reasonable amount of time, I would expect Cassandra to return a 
TimeOutException instead of terminating.

One easy solution is to increase the heap size. However, that means Cassandra can 
still crash if someone reads 100 rows.  I wonder if there is some other Cassandra 
setting that I can tweak to prevent the OOM exception?

Thanks,
Mohammed


Re: Cassandra terminates with OutOfMemory (OOM) error

2013-06-21 Thread Jabbar Azam
Hello Mohammed,

You should increase the heap space. You should also tune the garbage
collection so young generation objects are collected faster, relieving
pressure on the heap. We have been using JDK 7, and it uses G1 as the default
collector. It does a better job than me trying to optimise the JDK 6 GC
collectors.
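
The heap settings live in cassandra-env.sh; a sketch (the sizes are only
examples for an 8GB box, not a recommendation):

    # conf/cassandra-env.sh
    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="400M"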

Bear in mind though that the OS will need memory, as will the row cache and
the file system cache. Memory usage will depend on the workload of
your system.

I'm sure you'll also get good advice from other members of the mailing list.

Thanks

Jabbar Azam


On 21 June 2013 18:49, Mohammed Guller  wrote:

>  We have a 3-node cassandra cluster on AWS. These nodes are running
> cassandra 1.2.2 and have 8GB memory. We didn't change any of the default
> heap or GC settings. So each node is allocating 1.8GB of heap space. The
> rows are wide; each row stores around 260,000 columns. We are reading the
> data using Astyanax. If our application tries to read 80,000 columns each
> from 10 or more rows at the same time, some of the nodes run out of heap
> space and terminate with OOM error. Here is the error message:
>
> ** **
>
> java.lang.OutOfMemoryError: Java heap space
>
> at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
>
> at
> org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
> 
>
> at
> org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
> 
>
> at
> org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
> 
>
> at
> org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
> 
>
> at
> org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
> 
>
> at
> org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
> 
>
> at
> org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
> 
>
> at
> org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
> 
>
> at
> org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
> 
>
> at
> org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
> 
>
> at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
> 
>
> at
> org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
> 
>
> at org.apache.cassandra.db.Table.getRow(Table.java:355)
>
> at
> org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
> 
>
>at
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
> 
>
> at
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> 
>
> at java.lang.Thread.run(Thread.java:722)
>
> ** **
>
> ERROR 02:14:05,351 Exception in thread Thread[Thrift:6,5,main]
>
> java.lang.OutOfMemoryError: Java heap space
>
> at java.lang.Long.toString(Long.java:269)
>
> at java.lang.Long.toString(Long.java:764)
>
> at
> org.apache.cassandra.dht.Murmur3Partitioner$1.toString(Murmur3Partitioner.java:171)
> 
>
> at
> org.apache.cassandra.service.StorageService.describeRing(StorageService.java:1068)
> 
>
> at
> org.apache.cassandra.thrift.CassandraServer.describe_ring(CassandraServer.java:1192)
> 
>
> at
> org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3766)
> 
>
> at
> org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3754)
> 
>
> at
> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
>
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
> 
>
> at
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> 
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> 
>
> at java.lang.Thread.run(Thread.java:722)
>
> ** **
>
> The data in each column is less than 50 bytes. After adding all the column
> overheads (column name + metadata), it should not be more than 100 bytes.
> So reading 80,000 columns from 10 rows each means that we are reading
> 80,000 * 10 * 100 = 80 MB of data. It is large, but not large

Cassandra driver performance question...

2013-06-21 Thread Tony Anecito
Hi All,
I am using the JDBC driver and noticed that if I run the same query twice, the 
second time it is much faster.
I set up the row cache and column family cache and it does not seem to make a 
difference.

I am wondering how to set up Cassandra such that the first query is always as 
fast as the second one. The second one was 1.8 msec and the first 
28 msec for the same exact parameters. I am using a PreparedStatement.

Thanks!

Re: Cassandra driver performance question...

2013-06-21 Thread Jabbar Azam
Hello Tony,

I would guess that the first query's data is put into the row cache and
the filesystem cache. The second query gets the data from the row cache and/or
the filesystem cache, so it'll be faster.

If you want to make it consistently faster having a key cache will
definitely help. The following advice from Aaron Morton will also help

"You can also see what it looks like from the server side.

nodetool proxyhistograms will show you full request latency recorded
by the coordinator.
nodetool cfhistograms will show you the local read latency, this is
just the time it takes
to read data on a replica and does not include network or wait times.

If the proxyhistograms is showing most requests running faster than
your app says it's your
app."


http://mail-archives.apache.org/mod_mbox/cassandra-user/201301.mbox/%3ce3741956-c47c-4b43-ad99-dad8afc3a...@thelastpickle.com%3E
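
A sketch of both pieces (CQL3 on 1.2; the keyspace/table names are
placeholders):

    ALTER TABLE mykeyspace.mytable WITH caching = 'keys_only';

    nodetool -h localhost proxyhistograms
    nodetool -h localhost cfhistograms mykeyspace mytable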



Thanks

Jabbar Azam


On 21 June 2013 21:29, Tony Anecito  wrote:

> Hi All,
> I am using jdbc driver and noticed that if I run the same query twice the
> second time it is much faster.
> I setup the row cache and column family cache and it not seem to make a
> difference.
>
> I am wondering how to setup cassandra such that the first query is always
> as fast as the second one. The second one was 1.8msec and the first 28msec
> for the same exact paremeters. I am using preparestatement.
>
> Thanks!
>


Re: [Cassandra] Replacing a cassandra node with one of the same IP

2013-06-21 Thread Mahony, Robin
Please note that I am currently using version 1.2.2 of Cassandra.  Also we are 
using virtual nodes.

My question mainly stems from the fact that the nodes appear to be aware that 
the node uuid changes for the IP (from reading the logs), so I am just 
wondering if this means the hinted handoffs are also updated to reflect the new 
Cassandra node uuid. If that was the case, I would not think a nodetool cleanup 
would be necessary.

- Forwarded Message -
From: Robert Coli <rc...@eventbrite.com>
To: user@cassandra.apache.org; Emalayan Vairavanathan <svemala...@yahoo.com>
Sent: Thursday, 20 June 2013 11:40 AM
Subject: Re: [Cassandra] Replacing a cassandra node

On Thu, Jun 20, 2013 at 10:40 AM, Emalayan Vairavanathan
mailto:svemala...@yahoo.com>> wrote:
> In the case where we replace a Cassandra node (call it node A) with another one
> that has the exact same IP (i.e. during a node failure), what exactly should
> we do?  Currently I understand that we should at least run "nodetool
> repair".

If you lost the data from the node, then what you want is "replace_token."

If you didn't lose the data from the node (and can tolerate stale
reads until the repair completes) you want to start the node with
auto_bootstrap set to false and then repair.

=Rob


crashed while running repair

2013-06-21 Thread Franc Carter
Hi,

I am experimenting with Cassandra-1.2.4, and got a crash while running
repair. The nodes have 24GB of RAM with an 8GB heap. Any ideas on what I may
have missed in the config? The log is below.

ERROR [Thread-136019] 2013-06-22 06:30:05,861 CassandraDaemon.java (line
174) Exception in thread Thread[Thread-136019,5,main]
FSReadError in
/var/lib/cassandra/data/cut3/Price/cut3-Price-ib-44369-Index.db
at
org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200)
at
org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168)
at
org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:340)
at
org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
at
org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:194)
at
org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
at
org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:238)
at
org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:178)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
Caused by: java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
at
org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192)
... 8 more
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
... 9 more
ERROR [Thread-136019] 2013-06-22 06:30:05,865 FileUtils.java (line 375)
Stopping gossiper


thanks

-- 

*Franc Carter* | Systems architect | Sirca Ltd
 

franc.car...@sirca.org.au | www.sirca.org.au

Tel: +61 2 8355 2514

Level 4, 55 Harrington St, The Rocks NSW 2000

PO Box H58, Australia Square, Sydney NSW 1215


Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread Bryan Talbot
bloom_filter_fp_chance = 0.7 is probably way too large to be effective and
you'll probably have issues compacting deleted rows and get poor read
performance with a value that high.  I'd guess that anything larger than
0.1 might as well be 1.0.

-Bryan
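For a rough sense of scale, the standard Bloom filter sizing approximation (not a Cassandra-specific figure) ties the per-key memory cost to the target false-positive chance p:

\[
\text{bits per key} \approx \frac{-\ln p}{(\ln 2)^2}:
\qquad p = 0.000744 \Rightarrow \approx 15\ \text{bits},\quad
p = 0.1 \Rightarrow \approx 4.8\ \text{bits},\quad
p = 0.7 \Rightarrow \approx 0.74\ \text{bits}.
\]

So moving from the very old 0.000744 default to 0.1 already shrinks the filters roughly 3x; pushing on to 0.7 buys only about another 6x while letting roughly 70% of absent keys through, which is the point above.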



On Fri, Jun 21, 2013 at 5:58 AM, srmore  wrote:

>
> On Fri, Jun 21, 2013 at 2:53 AM, aaron morton wrote:
>
>> > nodetool -h localhost flush didn't do much good.
>>
>> Do you have 100's of millions of rows ?
>> If so see recent discussions about reducing the bloom_filter_fp_chance
>> and index_sampling.
>>
> Yes, I have 100's of millions of rows.
>
>
>>
>> If this is an old schema you may be using the very old setting of
>> 0.000744 which creates a lot of bloom filters.
>>
>> bloom_filter_fp_chance value that was changed from default to 0.1, looked
> at the filters and they are about 2.5G on disk and I have around 8G of heap.
> I will try increasing the value to 0.7 and report my results.
>
> It also appears to be a case of hard GC failure (as Rob mentioned) as the
> heap is never released, even after 24+ hours of idle time, the JVM needs to
> be restarted to reclaim the heap.
>
> Cheers
>>
>>-
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 20/06/2013, at 6:36 AM, Wei Zhu  wrote:
>>
>> If you want, you can try to force the GC through Jconsole.
>> Memory->Perform GC.
>>
>> It theoretically triggers a full GC and when it will happen depends on
>> the JVM
>>
>> -Wei
>>
>> --
>> *From: *"Robert Coli" 
>> *To: *user@cassandra.apache.org
>> *Sent: *Tuesday, June 18, 2013 10:43:13 AM
>> *Subject: *Re: Heap is not released and streaming hangs at 0%
>>
>> On Tue, Jun 18, 2013 at 10:33 AM, srmore  wrote:
>> > But then shouldn't JVM C G it eventually ? I can still see Cassandra
>> alive
>> > and kicking but looks like the heap is locked up even after the traffic
>> is
>> > long stopped.
>>
>> No, when GC system fails this hard it is often a permanent failure
>> which requires a restart of the JVM.
>>
>> > nodetool -h localhost flush didn't do much good.
>>
>> This adds support to the idea that your heap is too full, and not full
>> of memtables.
>>
>> You could try nodetool -h localhost invalidatekeycache, but that
>> probably will not free enough memory to help you.
>>
>> =Rob
>>
>>
>>
>


Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread Andrew Bialecki
We're potentially considering increasing the size of our sstables for some
column families from 10MB to something larger.

In test, we've been trying to verify that the sstable file sizes change and
then do a bit of benchmarking. However, when we alter the column
family and then run "nodetool upgradesstables -a keyspace columnfamily,"
the files in the data directory are re-written, but the file sizes
are the same.

Is this the expected behavior? If not, what's the right way to upgrade
them? If this is expected, how can we benchmark the read/write performance
with varying sstable sizes?

Thanks in advance!

Andrew


Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread Robert Coli
On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki
 wrote:
> However when we run alter the column
> family and then run "nodetool upgradesstables -a keyspace columnfamily," the
> files in the data directory have been re-written, but the file sizes are the
> same.
>
> Is this the expected behavior? If not, what's the right way to upgrade them.
> If this is expected, how can we benchmark the read/write performance with
> varying sstable sizes.

It is expected; upgradesstables/scrub/clean compactions work on a
single sstable at a time and are not capable of combining or
splitting them.

In theory you could probably :

1) start out with the largest size you want to test
2) stop your node
3) use sstable_split [1] to split sstables
4) start node, test
5) repeat 2-4

I am not sure if there is anything about level compaction which makes
this infeasible.

=Rob
[1] https://github.com/pcmanus/cassandra/tree/sstable_split


Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread Wei Zhu
I think the new SSTables will be written at the new size. In order for that to happen, you
need to trigger a compaction so that new SSTables are generated. For LCS,
there is no major compaction though. You can run a nodetool repair and
hopefully it will bring in some new SSTables and compactions will kick in.
Or you can change the $CFName.json file under your data directory and move
every SSTable to level 0. You need to stop your node, write a simple script to
alter that file (a sketch follows the quoted message below), and start the node again.

I think it would be helpful to have a nodetool command to change the SSTable
size and trigger a rebuild of the SSTables.

Thanks. 
-Wei 

- Original Message -

From: "Robert Coli"  
To: user@cassandra.apache.org 
Sent: Friday, June 21, 2013 4:51:29 PM 
Subject: Re: Updated sstable size for LCS, ran upgradesstables, file sizes 
didn't change 

On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki 
 wrote: 
> However when we run alter the column 
> family and then run "nodetool upgradesstables -a keyspace columnfamily," the 
> files in the data directory have been re-written, but the file sizes are the 
> same. 
> 
> Is this the expected behavior? If not, what's the right way to upgrade them. 
> If this is expected, how can we benchmark the read/write performance with 
> varying sstable sizes. 

It is expected, upgradesstables/scrub/clean compactions work on a 
single sstable at a time, they are not capable of combining or 
splitting them. 

In theory you could probably : 

1) start out with the largest size you want to test 
2) stop your node 
3) use sstable_split [1] to split sstables 
4) start node, test 
5) repeat 2-4 

I am not sure if there is anything about level compaction which makes 
this infeasible. 

=Rob 
[1] https://github.com/pcmanus/cassandra/tree/sstable_split 
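Picking up the "simple script" idea above, here is a rough Java sketch using Jackson. The {"generations": [{"generation": N, "members": [...]}]} layout is an assumption about the LCS manifest file of this era; inspect your own $CFName.json, back it up, and keep the node stopped before trying anything like this (or simply delete the file, as suggested in the next reply).

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import java.io.File;

public class ResetLeveledManifest {
    public static void main(String[] args) throws Exception {
        // Usage: java ResetLeveledManifest /var/lib/cassandra/data/<ks>/<cf>/<cf>.json
        File manifest = new File(args[0]);
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(manifest);

        // Collect every sstable listed under any level (assumed manifest layout).
        ArrayNode allMembers = mapper.createArrayNode();
        for (JsonNode generation : root.get("generations")) {
            for (JsonNode member : generation.get("members")) {
                allMembers.add(member);
            }
        }

        // Write a new manifest that places all of them in level 0.
        ObjectNode level0 = mapper.createObjectNode();
        level0.put("generation", 0);
        level0.set("members", allMembers);

        ObjectNode newRoot = mapper.createObjectNode();
        ArrayNode generations = mapper.createArrayNode();
        generations.add(level0);
        newRoot.set("generations", generations);

        mapper.writerWithDefaultPrettyPrinter().writeValue(manifest, newRoot);
    }
}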



Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread sankalp kohli
I think you can remove the json file which stores the mapping of which
sstable is in which level. Cassandra will then treat all
sstables as being in level 0, which will trigger a compaction. But if you have a lot of
data, it will be very slow, as you will keep compacting data between L1 and
L0.
This also happens when you write very fast and have a pile-up in L0. A
comment from the code explains what I am saying:
// LevelDB gives each level a score of how much data it contains vs its ideal amount, and
// compacts the level with the highest score. But this falls apart spectacularly once you
// get behind.  Consider this set of levels:
// L0: 988 [ideal: 4]
// L1: 117 [ideal: 10]
// L2: 12  [ideal: 100]
//
// The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll
// do is compact a batch of MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the
// result (say, 120 sstables) in L1. Then we'll compact the next batch of MAX_COMPACTING_L0,
// and so forth.  So we spend most of our i/o rewriting the L1 data with each batch.
//
// If we could just do *all* L0 a single time with L1, that would be ideal.  But we can't
// -- see the javadoc for MAX_COMPACTING_L0.
//
// LevelDB's way around this is to simply block writes if L0 compaction falls behind.
// We don't have that luxury.
//
// So instead, we
// 1) force compacting higher levels first, which minimizes the i/o needed to compact
//    optimially which gives us a long term win, and
// 2) if L0 falls behind, we will size-tiered compact it to reduce read overhead until
//    we can catch up on the higher levels.
//
// This isn't a magic wand -- if you are consistently writing too fast for LCS to keep
// up, you're still screwed.  But if instead you have intermittent bursts of activity,
// it can help a lot.


On Fri, Jun 21, 2013 at 5:42 PM, Wei Zhu  wrote:

> I think the new SSTable will be in the new size. In order to do that, you
> need to trigger a compaction so that the new SSTables will be generated.
> for LCS, there is no major compaction though. You can run a nodetool repair
> and hopefully you will bring some new SSTables and compactions will kick in.
> Or you can change the $CFName.json file under your data directory and move
> every SSTable to level 0. You need to stop your node,  write a simple
> script to alter that file and start the node again.
>
> I think it will be helpful to have a nodetool command to change the
> SSTable Size and trigger the rebuild of the SSTables.
>
> Thanks.
> -Wei
>
> --
> *From: *"Robert Coli" 
> *To: *user@cassandra.apache.org
> *Sent: *Friday, June 21, 2013 4:51:29 PM
> *Subject: *Re: Updated sstable size for LCS, ran upgradesstables, file
> sizes didn't change
>
>
> On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki
>  wrote:
> > However when we run alter the column
> > family and then run "nodetool upgradesstables -a keyspace columnfamily,"
> the
> > files in the data directory have been re-written, but the file sizes are
> the
> > same.
> >
> > Is this the expected behavior? If not, what's the right way to upgrade
> them.
> > If this is expected, how can we benchmark the read/write performance with
> > varying sstable sizes.
>
> It is expected, upgradesstables/scrub/clean compactions work on a
> single sstable at a time, they are not capable of combining or
> splitting them.
>
> In theory you could probably :
>
> 1) start out with the largest size you want to test
> 2) stop your node
> 3) use sstable_split [1] to split sstables
> 4) start node, test
> 5) repeat 2-4
>
> I am not sure if there is anything about level compaction which makes
> this infeasible.
>
> =Rob
> [1] https://github.com/pcmanus/cassandra/tree/sstable_split
>
>


Re: Heap is not released and streaming hangs at 0%

2013-06-21 Thread sankalp kohli
I will take a heap dump and see what's in there rather than guessing.


On Fri, Jun 21, 2013 at 4:12 PM, Bryan Talbot wrote:

> bloom_filter_fp_chance = 0.7 is probably way too large to be effective and
> you'll probably have issues compacting deleted rows and get poor read
> performance with a value that high.  I'd guess that anything larger than
> 0.1 might as well be 1.0.
>
> -Bryan
>
>
>
> On Fri, Jun 21, 2013 at 5:58 AM, srmore  wrote:
>
>>
>> On Fri, Jun 21, 2013 at 2:53 AM, aaron morton wrote:
>>
>>> > nodetool -h localhost flush didn't do much good.
>>>
>>> Do you have 100's of millions of rows ?
>>> If so see recent discussions about reducing the bloom_filter_fp_chance
>>> and index_sampling.
>>>
>> Yes, I have 100's of millions of rows.
>>
>>
>>>
>>> If this is an old schema you may be using the very old setting of
>>> 0.000744 which creates a lot of bloom filters.
>>>
>>> bloom_filter_fp_chance value that was changed from default to 0.1,
>> looked at the filters and they are about 2.5G on disk and I have around 8G
>> of heap.
>> I will try increasing the value to 0.7 and report my results.
>>
>> It also appears to be a case of hard GC failure (as Rob mentioned) as the
>> heap is never released, even after 24+ hours of idle time, the JVM needs to
>> be restarted to reclaim the heap.
>>
>> Cheers
>>>
>>>-
>>> Aaron Morton
>>> Freelance Cassandra Consultant
>>> New Zealand
>>>
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 20/06/2013, at 6:36 AM, Wei Zhu  wrote:
>>>
>>> If you want, you can try to force the GC through Jconsole.
>>> Memory->Perform GC.
>>>
>>> It theoretically triggers a full GC and when it will happen depends on
>>> the JVM
>>>
>>> -Wei
>>>
>>> --
>>> *From: *"Robert Coli" 
>>> *To: *user@cassandra.apache.org
>>> *Sent: *Tuesday, June 18, 2013 10:43:13 AM
>>> *Subject: *Re: Heap is not released and streaming hangs at 0%
>>>
>>> On Tue, Jun 18, 2013 at 10:33 AM, srmore  wrote:
>>> > But then shouldn't JVM C G it eventually ? I can still see Cassandra
>>> alive
>>> > and kicking but looks like the heap is locked up even after the
>>> traffic is
>>> > long stopped.
>>>
>>> No, when GC system fails this hard it is often a permanent failure
>>> which requires a restart of the JVM.
>>>
>>> > nodetool -h localhost flush didn't do much good.
>>>
>>> This adds support to the idea that your heap is too full, and not full
>>> of memtables.
>>>
>>> You could try nodetool -h localhost invalidatekeycache, but that
>>> probably will not free enough memory to help you.
>>>
>>> =Rob
>>>
>>>
>>>
>>
>


Re: crashed while running repair

2013-06-21 Thread sankalp kohli
Looks like a memory map failed. On a 64-bit system you should have unlimited
virtual memory, but Linux has a limit on the number of memory maps a process
can create (vm.max_map_count). Look at these two places:

http://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed
https://blog.kumina.nl/2011/04/cassandra-java-io-ioerror-java-io-ioexception-map-failed/
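If the per-process map limit is the suspect, one quick check (a rough sketch, assuming a Linux host; the pid in the path is a placeholder for Cassandra's real pid) is to count the entries in /proc/<pid>/maps and compare against vm.max_map_count, which defaults to 65530 on many distributions:

import java.io.BufferedReader;
import java.io.FileReader;

public class CountMemoryMaps {
    public static void main(String[] args) throws Exception {
        // e.g. java CountMemoryMaps /proc/12345/maps   (12345 = Cassandra's pid, a placeholder)
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        long maps = 0;
        while (reader.readLine() != null) {
            maps++;   // each line in /proc/<pid>/maps is one mapped region
        }
        reader.close();
        System.out.println("mapped regions: " + maps);
    }
}

Cassandra mmaps index and data segments per sstable, so a node with many sstables can hit this limit well before it runs out of RAM; raising the sysctl is the usual workaround.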




On Fri, Jun 21, 2013 at 3:22 PM, Franc Carter wrote:

>
> Hi,
>
> I am experimenting with Cassandra-1.2.4, and got a crash while running
> repair. The nodes has 24GB of ram with an 8GB heap. Any ideas on my I may
> have missed in the config ? Log is below
>
> ERROR [Thread-136019] 2013-06-22 06:30:05,861 CassandraDaemon.java (line
> 174) Exception in thread Thread[Thread-136019,5,main]
> FSReadError in
> /var/lib/cassandra/data/cut3/Price/cut3-Price-ib-44369-Index.db
> at
> org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200)
> at
> org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:340)
> at
> org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:319)
> at
> org.apache.cassandra.streaming.IncomingStreamReader.streamIn(IncomingStreamReader.java:194)
> at
> org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:122)
> at
> org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:238)
> at
> org.apache.cassandra.net.IncomingTcpConnection.handleStream(IncomingTcpConnection.java:178)
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
> Caused by: java.io.IOException: Map failed
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:748)
> at
> org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192)
> ... 8 more
> Caused by: java.lang.OutOfMemoryError: Map failed
> at sun.nio.ch.FileChannelImpl.map0(Native Method)
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
> ... 9 more
> ERROR [Thread-136019] 2013-06-22 06:30:05,865 FileUtils.java (line 375)
> Stopping gossiper
>
>
> thanks
>
> --
>
> *Franc Carter* | Systems architect | Sirca Ltd
>  
>
> franc.car...@sirca.org.au | www.sirca.org.au
>
> Tel: +61 2 8355 2514
>
> Level 4, 55 Harrington St, The Rocks NSW 2000
>
> PO Box H58, Australia Square, Sydney NSW 1215
>
>
>


Re: Cassandra terminates with OutOfMemory (OOM) error

2013-06-21 Thread sankalp kohli
Looks like you are putting a lot of pressure on the heap by doing a slice
query on a large row.
Do you have a lot of deletes/tombstones on the rows? That might be causing a
problem.
Also, why are you returning so many columns at once? You can use the
auto-paginate feature in Astyanax, as in the sketch below.

Also, do you see a lot of GC happening?
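A rough sketch of that auto-pagination, assuming a generic wide row; the column family, row key, and page size are placeholders, and the Keyspace is assumed to come from an already configured AstyanaxContext.

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.query.RowQuery;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.util.RangeBuilder;

public class PaginatedWideRowRead {

    // Placeholder column family definition.
    private static final ColumnFamily<String, String> CF_WIDE =
            new ColumnFamily<String, String>("wide_cf",
                    StringSerializer.get(), StringSerializer.get());

    // Reads a wide row in pages of ~1000 columns instead of one huge slice.
    public static void readRow(Keyspace keyspace, String rowKey) throws ConnectionException {
        RowQuery<String, String> query = keyspace
                .prepareQuery(CF_WIDE)
                .getKey(rowKey)
                .autoPaginate(true)
                .withColumnRange(new RangeBuilder().setLimit(1000).build());

        ColumnList<String> page;
        while (!(page = query.execute().getResult()).isEmpty()) {
            for (Column<String> column : page) {
                // process column.getName() / column.getStringValue()
            }
        }
    }
}

Each execute() call returns the next page, and the loop ends when a page comes back empty, so the heap only ever holds one page of columns at a time.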


On Fri, Jun 21, 2013 at 1:13 PM, Jabbar Azam  wrote:

> Hello Mohammed,
>
> You should increase the heap space. You should also tune the garbage
> collection so young generation objects are collected faster, relieving
> pressure on the heap. We have been using JDK 7 and it uses G1 as the default
> collector. It does a better job than me trying to optimise the JDK 6 GC
> collectors.
>
> Bear in mind though that the OS will need memory, so will the row cache
> and the filing system. Although memory usage will depend on the workload of
> your system.
>
> I'm sure you'll also get good advice from other members of the mailing
> list.
>
> Thanks
>
> Jabbar Azam
>
>
> On 21 June 2013 18:49, Mohammed Guller  wrote:
>
>>  We have a 3-node cassandra cluster on AWS. These nodes are running
>> cassandra 1.2.2 and have 8GB memory. We didn't change any of the default
>> heap or GC settings. So each node is allocating 1.8GB of heap space. The
>> rows are wide; each row stores around 260,000 columns. We are reading the
>> data using Astyanax. If our application tries to read 80,000 columns each
>> from 10 or more rows at the same time, some of the nodes run out of heap
>> space and terminate with OOM error. Here is the error message:
>>
>> java.lang.OutOfMemoryError: Java heap space
>> at java.nio.HeapByteBuffer.duplicate(HeapByteBuffer.java:107)
>> at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:50)
>> at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
>> at org.apache.cassandra.db.marshal.AbstractCompositeType.split(AbstractCompositeType.java:126)
>> at org.apache.cassandra.db.filter.ColumnCounter$GroupByPrefix.count(ColumnCounter.java:96)
>> at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:164)
>> at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:136)
>> at org.apache.cassandra.db.filter.QueryFilter.collateOnDiskAtom(QueryFilter.java:84)
>> at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:294)
>> at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
>> at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
>> at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
>> at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
>> at org.apache.cassandra.db.Table.getRow(Table.java:355)
>> at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
>> at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
>> at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>>
>> ERROR 02:14:05,351 Exception in thread Thread[Thrift:6,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>> at java.lang.Long.toString(Long.java:269)
>> at java.lang.Long.toString(Long.java:764)
>> at org.apache.cassandra.dht.Murmur3Partitioner$1.toString(Murmur3Partitioner.java:171)
>> at org.apache.cassandra.service.StorageService.describeRing(StorageService.java:1068)
>> at org.apache.cassandra.thrift.CassandraServer.describe_ring(CassandraServer.java:1192)
>> at org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3766)
>> at org.apache.cassandra.thrift.Cassandra$Processor$describe_ring.getResult(Cassandra.java:3754)
>> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32)
>> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34)
>> at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199)
>> at java.util.concu

Re: Cassandra driver performance question...

2013-06-21 Thread Tony Anecito
Thanks Jabbar,
 
I ran nodetool as suggested and it showed 0 latency for the row count I have.

I also ran the cli list command for the table hit by my JDBC PreparedStatement, and
it was slow: about 121 msecs the first time I ran it and 40 msecs the second time,
versus the JDBC call of 38 msecs to start with, unless I run it twice
as well and get 1.5-2.5 msecs for executeQuery the second time the
PreparedStatement is called.

I ran describe from the cli for the table and it said caching is "ALL", which is
correct.

A real mystery, and I need to understand better what is going on.
 
Regards,
-Tony

From: Jabbar Azam 
To: user@cassandra.apache.org; Tony Anecito  
Sent: Friday, June 21, 2013 3:32 PM
Subject: Re: Cassandra driver performance question...



Hello Tony, 

I would guess that the first queries data  is put into the row cache and the 
filesystem cache. The second query gets the data from the row cache and or the 
filesystem cache so it'll be faster.

If you want to make it consistently faster having a key cache will definitely 
help. The following advice from Aaron Morton will also help 
"You can also see what it looks like from the server side. 

nodetool proxyhistograms will show you full request latency recorded by the 
coordinator. 
nodetool cfhistograms will show you the local read latency, this is just the 
time it takes
to read data on a replica and does not include network or wait times. 

If the proxyhistograms is showing most requests running faster than your app 
says it's your
app."

http://mail-archives.apache.org/mod_mbox/cassandra-user/201301.mbox/%3ce3741956-c47c-4b43-ad99-dad8afc3a...@thelastpickle.com%3E



Thanks

Jabbar Azam



On 21 June 2013 21:29, Tony Anecito  wrote:

Hi All,
>I am using jdbc driver and noticed that if I run the same query twice the 
>second time it is much faster.
>I setup the row cache and column family cache and it not seem to make a 
>difference.
>
>
>I am wondering how to setup cassandra such that the first query is always as 
>fast as the second one. The second one was 1.8msec and the first 28msec for 
>the same exact paremeters. I am using preparestatement.
>
>
>Thanks!

Re: Cassandra driver performance question...

2013-06-21 Thread Tony Anecito
Hi Jabbar,
 
I think I know what is going on. I happened across a change mentioned by the
JDBC driver developers regarding metadata caching. It seems the metadata caching
was moved from the Connection object to the PreparedStatement object. So I am
wondering if the time difference I am seeing on the second use of a
PreparedStatement is because the metadata is cached by then.

So my question is how to test this theory? Is there a way to stop the metadata
from coming across from Cassandra? A 20x performance improvement would be nice
to have.
 
Thanks,
-Tony
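One way to test the theory from the client side is to keep prepared statements in a small cache keyed by the CQL text, so only the very first preparation of a given query can pay any metadata cost; the class and names below are illustrative, not part of the driver.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedStatements {
    private final Connection conn;
    private final Map<String, PreparedStatement> cache =
            new ConcurrentHashMap<String, PreparedStatement>();

    public CachedStatements(Connection conn) {
        this.conn = conn;
    }

    // Returns the cached statement for this CQL text, preparing it only once.
    public PreparedStatement get(String cql) throws SQLException {
        PreparedStatement ps = cache.get(cql);
        if (ps == null) {
            ps = conn.prepareStatement(cql);   // any metadata fetch happens once, here
            PreparedStatement prev = cache.putIfAbsent(cql, ps);
            if (prev != null) {
                ps.close();
                ps = prev;
            }
        }
        return ps;
    }
}

If the "first call" penalty disappears once statements are reused from such a cache, that points at per-statement metadata caching; if it stays, the cost is more likely server side (cold key/row cache or page cache), which matches Jabbar's earlier explanation.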

From: Tony Anecito 
To: "user@cassandra.apache.org"  
Sent: Friday, June 21, 2013 8:56 PM
Subject: Re: Cassandra driver performance question...



Thanks Jabbar,
 
I ran nodetool as suggested and it 0 latency for the row count I have.
 
I also ran cli list command for the table hit by my JDBC perparedStatement and 
it was slow like 121msecs the first time I ran it and second time I ran it it 
was 40msecs versus jdbc call of 38msecs to start with unless I run it twice 
also and get 1.5-2.5msecs for executeQuery the second time the 
preparedStatement is called.
 
I ran describe from cli for the table and it said caching is "ALL" which is 
correct.
 
A real mystery and I need to understand better what is going on.
 
Regards,
-Tony

From: Jabbar Azam 
To: user@cassandra.apache.org; Tony Anecito  
Sent: Friday, June 21, 2013 3:32 PM
Subject: Re: Cassandra driver performance question...



Hello Tony, 

I would guess that the first queries data  is put into the row cache and the 
filesystem cache. The second query gets the data from the row cache and or the 
filesystem cache so it'll be faster.

If you want to make it consistently faster having a key cache will definitely 
help. The following advice from Aaron Morton will also help 
"You can also see what it looks like from the server side. 

nodetool proxyhistograms will show you full request latency recorded by the 
coordinator. 
nodetool cfhistograms will show you the local read latency, this is just the 
time it takes
to read data on a replica and does not include network or wait times. 

If the proxyhistograms is showing most requests running faster than your app 
says it's your
app."

http://mail-archives.apache.org/mod_mbox/cassandra-user/201301.mbox/%3ce3741956-c47c-4b43-ad99-dad8afc3a...@thelastpickle.com%3E



Thanks

Jabbar Azam



On 21 June 2013 21:29, Tony Anecito  wrote:

Hi All,
>I am using jdbc driver and noticed that if I run the same query twice the 
>second time it is much faster.
>I setup the row cache and column family cache and it not seem to make a 
>difference.
>
>
>I am wondering how to setup cassandra such that the first query is always as 
>fast as the second one. The second one was 1.8msec and the first 28msec for 
>the same exact paremeters. I am using preparestatement.
>
>
>Thanks!