Strange streaming behaviour

2010-05-03 Thread Dr. Martin Grabmüller
Hello list, I encountered a problem with streaming in my test cluster but found nothing like this in JIRA or on the list. I'm running a test cluster of three nodes, RF=3, Cassandra 0.6.1. I started the first node and inserted some data, then bootstrapped the other machines one after the other. N

Re: ColumnFamilyOutputFormat?

2010-05-03 Thread Johan Oskarsson
I wrote this CassandraOutputFormat last year. It is most likely not working against newer/current versions of Cassandra, but if you want something to work with it can be used as a starting point. http://github.com/johanoskarsson/cassandraoutputformat /Johan On 30 apr 2010, at 14.14, Utku Can T

Feeding in specific Cassandra columns into Hadoop

2010-05-03 Thread Mark Schnitzius
Hi all... I am trying to feed a specific list of Cassandra column names in as input to a Hadoop process, but for some reason it only feeds in some of the columns I specify, not all. This is a short description of the problem - I'll see if anyone might have some insight before I dump a big load of

Re: Login failure with SimpleAuthenticator

2010-05-03 Thread Julio Carlos Barrera Juez
Hi again. My system log says: ERROR [pool-1-thread-1] 2010-05-03 12:54:03,801 Cassandra.java (line 1153) Internal error processing login java.lang.RuntimeException: Unexpected authentication problem at org.apache.cassandra.auth.SimpleAuthenticator.login(SimpleAuthenticator.java:113) at org.apache

Re: Login failure with SimpleAuthenticator

2010-05-03 Thread roger schildmeijer
You need to define two more properties: passwd.properties and access.properties (hint -Dpasswd.properties=/user/schildmeijer/cassandra/conf/passwd.properties and analogous for access.properties) // Roger Schildmeijer On Mon, May 3, 2010 at 1:06 PM, Julio Carlos Barrera Juez < juliocar...@gmail

Distributed export and import into cassandra

2010-05-03 Thread Utku Can Topçu
Hey All, I have a simple sample use case. The aim is to export the columns in a column family into flat files, with keys in the range from k1 to k2. Since all the nodes in the cluster are supposed to contain some part of the data, is it possible to make each node dump its own local data v

RE: Cassandra on Windows network latency

2010-05-03 Thread Viktor Jevdokimov
Yes, we have already figured that out :) Thanks! -Original Message- From: Carlos Alvarez [mailto:cbalva...@gmail.com] Sent: Thursday, April 29, 2010 4:03 PM To: user@cassandra.apache.org Subject: Re: Cassandra on Windows network latency Are you using TSocket in the client?. If yes, use

Primary and Backup clusters

2010-05-03 Thread Viktor Jevdokimov
Hello, Our system (not Cassandra) has a backup cluster in a different datacenter in case of primary cluster unavailability or for software upgrades. 100% of traffic goes to the primary cluster. We switch 100% of traffic to the backup cluster in the cases above for a short time, then when issues are resolved, traff

Re: inserting new rows with one key vs. inserting new columns in a row performance

2010-05-03 Thread malsmith
I've seen this too (your second case) - it seems like the entire row contents (or some big subset of the row) are loaded to memory on the server before any column value is returned. The partitioner selection did not make any difference to performance in my case. I did not find a way around this e

Re: inserting new rows with one key vs. inserting new columns in a row performance

2010-05-03 Thread Sylvain Lebresne
Make sure you have disabled the row cache. If you have the row cache enabled, the entire row does get loaded into memory. Otherwise it does not. On Mon, May 3, 2010 at 3:06 PM, malsmith wrote: > I've seen this too (your second case) - it seems like the entire row > contents (or some big subset of the row) are lo

Re: Search Sample and Relation question because UDDI as Key

2010-05-03 Thread Jonathan Shook
I am only speaking to your second question. It may be helpful to think of modeling your storage layout in terms of * lists * sets * hash maps ... and certain combinations of these. Since there are no schema-defined relations, your relations may appear implicit between different views or "copies"

Re: A simple Cassandra CLI in Python with readline support

2010-05-03 Thread Jonathan Ellis
That's fine, although GPLv3 software cannot be included in Apache projects. http://www.apache.org/licenses/GPL-compatibility.html On Sat, May 1, 2010 at 8:25 PM, Shuge Lee wrote: > Thanks for reply. > Add GPLv3 license, github.com/shuge/shuge-cassandra/downloads. > > 2010/5/1 Jonathan Ellis >>

Re: Design Query

2010-05-03 Thread Jonathan Ellis
On Sat, May 1, 2010 at 6:34 AM, Rakesh Rajan wrote: > I am evaluating cassandra to implement activity streams. We currently have > over 100 feeds with total entries exceeding 32000 implemented using > redis ( ~320 entries / feed). Would like hear from the community on how to > use cassandr

Re: strange get_range_slices behaviour v0.6.1

2010-05-03 Thread Jonathan Ellis
Util.range returns a Range object which is end-exclusive. (You want "Bounds" for end-inclusive.) On Sun, May 2, 2010 at 7:19 AM, aaron morton wrote: > Hey there, I'm still getting odd behavior with get_range_slices. I've created > a JUnit test that illustrates the case. > Could someone take a loo

Re: Row slice / cache performance

2010-05-03 Thread Jonathan Ellis
On Sun, May 2, 2010 at 1:00 PM, James Golick wrote: > the ConcurrentSkipListMap (ColumnFamily.columns_). > SliceQueryFilter.getMemColumnIterator @ ~30% - Virtually all the time in > here is spent in ConcurrentSkipListMap$Values.toArray() Besides the UUID optimization you posted, we should do an

Re: Feeding in specific Cassandra columns into Hadoop

2010-05-03 Thread Jonathan Ellis
Can you reproduce outside the Hadoop environment, i.e. w/ Thrift code? On Mon, May 3, 2010 at 5:49 AM, Mark Schnitzius wrote: > Hi all...  I am trying to feed a specific list of Cassandra column names in > as input to a Hadoop process, but for some reason it only feeds in some of > the columns I

Re: Distributed export and import into cassandra

2010-05-03 Thread Jonathan Ellis
sstable2json does this. (you'd want to perform nodetool compact first, so there is only one sstable for the CF you want.) On Mon, May 3, 2010 at 6:17 AM, Utku Can Topçu wrote: > Hey All, > > I have a simple sample use case, > The aim is to export the columns in a column family into flat files wi

Re: Primary and Backup clusters

2010-05-03 Thread Jonathan Ellis
I suppose you could use rsync (sstable files are immutable, so you don't need to worry about not getting a "consistent" version of the data files), but compared to letting Cassandra handle the replication the way it's designed to, # you'll generate a lot of disk i/o doing that vs # your backup clu

Re: Strange streaming behaviour

2010-05-03 Thread Gary Dusbabek
Martin, Please create a ticket and include the relevant parts of your storage-conf. To summarize, the output gives you the impression that bootstrap has completed normally, but when you check, it appears to be hung on the receiving nodes? Do you mind turning debug on and seeing if you can re

Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-03 Thread Erik Holstad
Hey! We are currently using Cassandra 0.5.1 and I'm getting a StackOverflowError when comparing two ColumnOrSuperColumn objects. It turns out the compareTo function for byte [] has an infinite loop in libthrift-r820831.jar. We are planning to upgrade to 0.6.1 but are not ready to do it today, so j

Bootstrap problem

2010-05-03 Thread David Koblas
Trying to add a node to an existing cluster and getting the following error (using 0.6.1): INFO [main] 2010-05-03 08:36:58,960 CommitLog.java (line 169) Log replay complete INFO [main] 2010-05-03 08:36:58,993 SystemTable.java (line 164) Saved Token found: 113225717064305079230489016527619806

Skip large size (Configurable) SSTable in minor or/and major compaction

2010-05-03 Thread Schubert Zhang
We made a patch against the 0.6 branch and 0.6.1 for this feature. https://issues.apache.org/jira/browse/CASSANDRA-1041

Re: Bootstrap problem

2010-05-03 Thread Schubert Zhang
It seems the node you are adding is not a "new" node. INFO [main] 2010-05-03 08:36:58,993 SystemTable.java (line 164) Saved Token found: 113225717064305079230489016527619806663 INFO [main] 2010-05-03 08:36:58,994 SystemTable.java (line 179) Saved ClusterName found: Image Cluster The log above says, this node

Re: Bootstrap problem

2010-05-03 Thread David Koblas
It started out new, didn't cut and paste the "original" startup, but here it is... INFO [main] 2010-05-03 08:34:43,305 DatabaseDescriptor.java (line 229) Auto DiskAccessMode determined to be mmap INFO [main] 2010-05-03 08:34:43,637 SystemTable.java (line 139) Saved Token not found. Using 113

debian packages

2010-05-03 Thread Lee Parker
Is there a reason why the jvm options are so different in the debian version from the standard cassandra.in.sh? The following lines are completely missing from the init script: -XX:TargetSurvivorRatio=90 \ -XX:+AggressiveOpts \ -XX:+UseParNewGC \ -XX:+UseConcMarkSwe

Re: debian packages

2010-05-03 Thread Eric Evans
On Mon, 2010-05-03 at 13:30 -0500, Lee Parker wrote: > I see these in the JVM_EXTRA_OPS line of /etc/defaults/cassandra, but > I don't see where this data is actually passed into the jvm in the > init script. Am I missing something? JVM_EXTRA_OPS should be getting passed (they used to be), so th

replication with large rows

2010-05-03 Thread Lee Parker
I have a CF on our cluster which has several rows with 200k+ columns of TimeUUID data. I have noticed recently that this CF is reaching my memtable thresholds (128M or 1.5 mill obj) far more frequently than I would expect (every 10 minutes or so). This CF is used as an index of items in another C

Re: debian packages

2010-05-03 Thread Lee Parker
Thanks. I'll apply the patch. I'm not real familiar with the JVM options, but I assume that on a production machine I should remove -Xdebug and the -Xrunjdwp options. Lee Parker On Mon, May 3, 2010 at 2:29 PM, Eric Evans wrote: > On Mon, 2010-05-03 at 13:30 -0500, Lee Parker wrote: > > I see t

Re: debian packages

2010-05-03 Thread Eric Evans
On Mon, 2010-05-03 at 14:39 -0500, Lee Parker wrote: > Thanks. I'll apply the patch. I'm not real familiar with the JVM > options, but I assume that on a production machine I should remove > -Xdebug and the -Xrunjdwp options. Yes. -- Eric Evans eev...@rackspace.com

replication and memtable flush

2010-05-03 Thread Lee Parker
I have a cluster which contains two CFs. One is a bunch of rows with 10-15 columns per row. The other is an index of those items with only a few rows, but thousands of columns per row. I am noticing that the replication of data between the nodes in the cluster is causing a lot of memtable flushi

Determining Cassandra System/Memory Requirements

2010-05-03 Thread Jon Graham
Hello Everyone, Is there a practical formula for determining Cassandra system requirements using OrderPreservingPartitioner ? We have hundreds of millions of rows in a single column family with a potential target of maybe a billion rows. How can we estimate the Cassandra system requirements give

Re: Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-03 Thread Jonathan Ellis
You'd need to check out thrift r820831, fix the compareTo code, then build a new jar. You can't just use the jar from 0.6 or from current thrift trunk because Thrift breaks backwards compatibility frequently, and there were such changes between our 0.5 and 0.6. So, yes, you could do it, but it's
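
For readers following the thread, a minimal sketch of a terminating byte-array comparison is given below. It is written in Python purely for illustration and is not the Thrift code; the function name `compare_bytes` and the length-first ordering rule are assumptions made here, not taken from libthrift.

```python
def compare_bytes(a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1. Hypothetical helper, not the Thrift implementation.

    Assumed ordering: shorter arrays sort first, then arrays of equal
    length are compared element by element.
    """
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    # Plain iteration, no recursion, so the call always terminates.
    for x, y in zip(a, b):
        if x != y:
            # Python bytes are unsigned (0..255); Java's byte is signed,
            # so values >= 0x80 would order differently in Java.
            return -1 if x < y else 1
    return 0
```

Because the loop walks both arrays once and never calls itself, it cannot overflow the stack the way the reported compareTo does.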

Re: Determining Cassandra System/Memory Requirements

2010-05-03 Thread Jonathan Ellis
Short answer: no, there is no formula into which you can plug numbers. Longer answer: benchmark with a subset of your data and extrapolate. The closer the test data is to real data, the more accurate it will be. Yes, compaction is O(N) wrt the amount of data in the system, so don't do it more tha

Re: Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-03 Thread Anthony Molinaro
On Mon, May 03, 2010 at 05:46:51PM -0500, Jonathan Ellis wrote: > You'd need to check out thrift r820831, fix the compareTo code, then > build a new jar. > > You can't just use the jar from 0.6 or from current thrift trunk > because Thrift breaks backwards compatibility frequently, and there > we

Re: Feeding in specific Cassandra columns into Hadoop

2010-05-03 Thread Mark Schnitzius
If I take the exact same SlicePredicate that fails in the Hadoop example, and pass it in to a multiget_slice, the data is returned successfully. So it appears the problem does lie somewhere in the tie-in to Hadoop. I will try to create a maximally-trimmed-down example that's complete enough to ru

Re: How do you, Bloom filter of the false positive rate or remove the problem of distributed databases?

2010-05-03 Thread Kauzki Aranami
Let me rephrase my question. How does Cassandra deal with bloom filter false positives on deleted records? Bloom filters can return false positives, especially for deleted records. How does Cassandra detect them? And, how does Cassandra remove those *detected* false positives from the blo

Re: Feeding in specific Cassandra columns into Hadoop

2010-05-03 Thread Jonathan Ellis
We serialize the SlicePredicate as part of the Hadoop Configuration string. It's quite possible that either - one of your column names is exposing a bug in the Thrift json serializer - Hadoop is silently truncating large predicates You should test that getSlicePredicate(conf).equals(originalPr
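
The round-trip check suggested above can be sketched generically. The Python below is only an illustration of the idea: it does not use Thrift or Hadoop, every function name in it is hypothetical, and the lossy text decoding is one assumed way binary data can change when stored in a string; the thread does not state that this is the actual cause.

```python
import base64


def to_config_string(raw: bytes) -> str:
    # base64 keeps arbitrary bytes intact inside a text-only config value.
    return base64.b64encode(raw).decode("ascii")


def from_config_string(text: str) -> bytes:
    return base64.b64decode(text.encode("ascii"))


def lossy_round_trip(raw: bytes) -> bytes:
    # Decoding arbitrary bytes as UTF-8 replaces invalid sequences with
    # U+FFFD, so the bytes that come back can differ from the originals.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def survives_round_trip(raw: bytes, round_trip) -> bool:
    # The check itself: serialize, deserialize, compare with the original.
    return round_trip(raw) == raw
```

With a binary column name such as `bytes([0x00, 0xFF, 0x80, 0x41])`, the base64 path returns the original bytes while the lossy path does not, which is the kind of difference an equality test on the deserialized predicate would expose.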

Re: How do you, Bloom filter of the false positive rate or remove the problem of distributed databases?

2010-05-03 Thread Jonathan Ellis
On Mon, May 3, 2010 at 8:45 PM, Kauzki Aranami wrote: > Let me rephrase my question. > > How does Cassandra deal with bloom filter's false positives on deleted > records? The same way it deals with tombstones that it encounters otherwise (part of a row slice, or in a memtable). All the bloom fi
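
As background to this exchange, a minimal Bloom filter sketch in Python is shown below. It is a generic textbook construction, not Cassandra's implementation; the class, the `read` helper, the sizes and the hash scheme are all assumptions chosen for illustration. It shows why a false positive needs no special removal step: the filter only says "possibly present", and the real lookup gives the final answer.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch (hypothetical, not Cassandra's code)."""

    def __init__(self, size_bits: int = 64, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single integer

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means only "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def read(key: bytes, bloom: BloomFilter, store: dict):
    if not bloom.might_contain(key):
        return None  # skip the lookup entirely
    # A false positive costs one wasted lookup; the store decides the result.
    return store.get(key)
```

In this sketch a key that was never added can still pass `might_contain`, but `read` returns `None` for it because the underlying store has no entry.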

Re: Error in TBaseHelper compareTo(byte [] a , byte [] b)

2010-05-03 Thread Jonathan Ellis
Thrift is good about wire compatibility. We're talking about running Java code built against one API, against another version of the thrift jar. Different ball game. On Mon, May 3, 2010 at 6:00 PM, Anthony Molinaro wrote: > > On Mon, May 03, 2010 at 05:46:51PM -0500, Jonathan Ellis wrote: >> Yo

Re: replication with large rows

2010-05-03 Thread Jonathan Ellis
replication in Cassandra is per-operation, not per-row On Mon, May 3, 2010 at 2:40 PM, Lee Parker wrote: > I have a CF on our cluster which has several rows with 200k+ columns of > TimeUUID data.  I have noticed recently that this CF is reaching my memtable > thresholds (128M or 1.5 mill obj) far

Cassandra and Request routing

2010-05-03 Thread Olivier Mallassi
Hi all, I can't figure out how to deal with request routing... In fact I have two nodes in the "Test Cluster" and I wrote the client as specified here http://wiki.apache.org/cassandra/ThriftExamples#Java. The Keyspace is the default one (KeySpace1, replicatorFactor 1..) The Seeds are well configu

Re: Feeding in specific Cassandra columns into Hadoop

2010-05-03 Thread Mark Schnitzius
> > You should test that getSlicePredicate(conf).equals(originalPredicate) > > That's it! The byte arrays are slightly different after setting it on the Hadoop config. Below is a simple test which demonstrates the bug -- it should print "true" but instead prints "false". Please let me know if a

performance tuning - where does the slowness come from?

2010-05-03 Thread Ran Tavory
I'm looking into performance issues on a 0.6.1 cluster. I see two symptoms: 1. Reads and writes are slow 2. One of the hosts is doing a lot of GC. 1 is slow in the sense that in normal state the cluster used to make around 3-5k reads and writes per second (6-10k operations per second), but now it's

Re: Cassandra and Request routing

2010-05-03 Thread Jonathan Shook
I think you may have found the "eventually" in eventually consistent. With a replication factor of 1, you are allowing the client thread to continue to the read on node 2 before it is replicated to node 2. Try setting your replication factor higher for different results. Jonathan On Tue, May 4, 2010 a