How to setup multinode cluster over different servers through firewall

2010-05-11 Thread Xu Hui Hui
Hi there! 1. I'm new to Cassandra. 2. I created a multinode cluster which worked well between two computers on a LAN (192.168.1.31, 192.168.1.33) with reference to the official wiki. 3. When creating a multinode cluster between my computer (192.168.1.31) and a remote server (y.y.y.y) it didn't work. I

Re: Is SuperColumn necessary?

2010-05-11 Thread David Boxenhorn
Don't think of it as getting rid of supercolumns. Think of it as adding superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology: array[dim1][dim2][dim3]...[dimN] = value Or, as said above: On Mon, May 10, 2010 at 11:

how to count the columns

2010-05-11 Thread vd
Hi, can we count the total number of columns in a ColumnFamily? If yes, how?

Re: Is SuperColumn necessary?

2010-05-11 Thread Torsten Curdt
Exactly. On Tue, May 11, 2010 at 10:20, David Boxenhorn wrote: > Don't think of it as getting rid of supercolumns. Think of it as adding > superdupercolumns, supertriplecolumns, etc. Or, in sparse array terminology: > array[dim1][dim2][dim3]...[dimN] = value > > Or, as said above: > >    Type="UTF8

Re: how to count the columns

2010-05-11 Thread Sylvain Lebresne
> Can we count the total no. of columns in a ColumnFamily, if yes how ? You should have a look at the map reduce support. There is an example in contrib/word_count. Now if the question was, is there a simple api call that returns the total no. of columns in a ColumnFamily, then the answer is no.

Re: Tuning Cassandra

2010-05-11 Thread David Boxenhorn
Turns out the problem is with batch mutate. When I mutate chunks 100 times bigger, it goes 100 times faster. Now I sometimes run out of memory. On Mon, May 10, 2010 at 8:17 PM, B. Todd Burruss wrote: > have you put your commit log on a disk by itself? not a logical parti

updating column names

2010-05-11 Thread vd
Hi, I have a column named colom. Can we update the column name "colom" to "column" at runtime or via the API?

Re: Is SuperColumn necessary?

2010-05-11 Thread vd
Hi, can we do a range search on the ID:ID format, given that this would be treated as a single ID by the API, or can it split on ':'? If not, then how can we avoid using supercolumns where we need to associate 'n' number of rows to a single ID. Like CatID1-> articleID1 CatID1-> articleID

What is the optimal size of batch mutate batches?

2010-05-11 Thread David Boxenhorn
I am saving a large amount of data to Cassandra using batch mutate. I have found that my speed is proportional to the size of the batch. It was very slow when I was inserting one row at a time, but when I created batches of 100 rows and mutated them together, it went 100 times faster. (OK, I didn't

Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread Ben Browning
I like to base my batch sizes off of the total number of columns instead of the number of rows. This effectively means counting the number of Mutation objects in your mutation map and submitting the batch once it reaches a certain size. For my data, batch sizes of about 25,000 columns work best. Yo
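The column-count approach described above can be sketched as follows. This is a minimal illustration, not the poster's actual code: the function name `batches_by_column_count`, the `(key, mutations)` row shape, and the dict-based mutation map are all assumptions made for the example, and no real client library is used. The 25,000 default is the figure quoted in the message and should be tuned against your own data.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical shapes: a row is (row_key, list_of_mutations), and a batch is a
# plain dict mapping row_key -> list_of_mutations (a stand-in for a mutation map).
Row = Tuple[str, List[Any]]
Batch = Dict[str, List[Any]]


def batches_by_column_count(rows: Iterable[Row], max_columns: int = 25_000) -> Iterator[Batch]:
    """Group rows into batches whose total mutation count stays near max_columns."""
    batch: Batch = {}
    count = 0
    for key, mutations in rows:
        # Flush the current batch before it would exceed the limit.
        if batch and count + len(mutations) > max_columns:
            yield batch
            batch, count = {}, 0
        # A single row larger than max_columns still goes out as its own batch.
        batch.setdefault(key, []).extend(mutations)
        count += len(mutations)
    if batch:
        yield batch
```

Counting mutations rather than rows keeps batch size roughly constant even when some rows are much wider than others.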

Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread David Boxenhorn
Thanks a lot! 25,000 is a number I can work with. Any other suggestions? On Tue, May 11, 2010 at 3:21 PM, Ben Browning wrote: > I like to base my batch sizes off of the total number of columns > instead of the number of rows. This effectively means counting the > number of Mutation objects in y

Re: busy thread on IncomingStreamReader ?

2010-05-11 Thread Ingram Chen
We had this problem initially, but it disappeared after several days of operation, so we had no chance to investigate it further. 2010/5/10 Даниел Симеонов > Hi, > I've experienced the same problem, two nodes got stuck with CPU at 99% > and the following source code from IncomingStreamRead

Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread Ben Browning
The main thing is to test on your data - 25k works great for me but if your values are substantially smaller or larger it might not for you. Not specific to batches, but if you have a decent size cluster and have to do lots of inserts make sure your client is multi-threaded so that it's not the bo

Re: Rolling upgrade

2010-05-11 Thread Gary Dusbabek
On Mon, May 10, 2010 at 17:01, Tatsuya Kawano wrote: > Hi, > > Does Cassandra support rolling restart recipe between minor version > upgrade? I mean rolling restart is a way to upgrade Cassandra version > or change configuration **without** bringing down the whole cluster. > The recipe will be som

Re: updating column names

2010-05-11 Thread Gary Dusbabek
On Tue, May 11, 2010 at 04:45, vd wrote: > Hi > > I have a column named colom. Can we update column name "colom" to > "column" during runtime or via API ? > This will require two operations: remove and insert. Gary.
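The two-step rename can be sketched like this. A plain Python dict stands in for a single row here (an assumption for illustration only); with a real client the two steps would be that client's insert and remove calls, and the pair would not be atomic. The function name `rename_column` is hypothetical.

```python
def rename_column(row: dict, old_name: str, new_name: str) -> None:
    """Rename a column in an in-memory row by inserting under the new name, then removing the old one."""
    if old_name not in row:
        raise KeyError(old_name)
    if old_name == new_name:
        return  # nothing to do
    # Insert first so the value is never absent between the two steps.
    row[new_name] = row[old_name]
    # Then remove the column stored under the old name.
    del row[old_name]
```

For example, calling `rename_column(row, "colom", "column")` on `{"colom": "v"}` leaves `{"column": "v"}`.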

Re: What is the optimal size of batch mutate batches?

2010-05-11 Thread Gary Dusbabek
On Tue, May 11, 2010 at 06:54, David Boxenhorn wrote: > My problem is that my rows are of very varying degrees of bushiness (i.e. > number of supercolums and columns per row). I inserted 592,500 rows > successfully, in a few minutes, and then I hit a batch of exceptionally > bushy rows and ran out

Re: Is multiget_slice performant when you're looking for lots of keys?

2010-05-11 Thread Jonathan Ellis
multiget performs in O(N) with the number of rows requested. so will range scanning. if you want to query millions of records of one type i would create a CF per type and use hadoop to parallelize the computation. On Fri, May 7, 2010 at 6:16 PM, James wrote: > Hi all, > Apologies if I'm still s

Re: How to setup multinode cluster over different servers through firewall

2010-05-11 Thread Gary Dusbabek
You didn't give a lot of details about the remoteness of the remote server. Remote hosts will not be able to contact any host on 192.168.*.* over the internet without routing support. If the remote host is on the same network as the 192.168.*.* host, it should work unless one of those hosts is run

Re: Overfull node

2010-05-11 Thread Jonathan Ellis
s/keyspace/token/ and you've got it. On Mon, May 10, 2010 at 10:34 AM, David Koblas wrote: > Sounds great, will give it a go.  However, just to make sure I understand > getting the keyspace correct. > > Lets say I've got: >    A -- Node before overfull node in keyspace order >    O -- Overfull no

Re: Strange behavior with large column values

2010-05-11 Thread Jonathan Ellis
Sounds like https://issues.apache.org/jira/browse/THRIFT-638 On Tue, May 11, 2010 at 1:50 AM, Jared Laprise wrote: > Hello all, I’m really stumped on this issue. > > > > I’m using the PHP Thrift client along with Pandra. > > > > I have a ColumnFamily `Groups` and once I set the `description` colu

Re: Tuning Cassandra

2010-05-11 Thread Jonathan Ellis
Using multiple client threads (w/ pooled thrift connections) will be even better than mutating really large chunks at a time. On Tue, May 11, 2010 at 4:16 AM, David Boxenhorn wrote: > Turns out the problem is with batch mutate. I mutate chunks 100 times > bigger, it goes 100 times faster. > > Now
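A minimal sketch of the multi-threaded client idea, using only the Python standard library. The `send_batch` callable is hypothetical: it stands for whatever function writes one batch through a connection taken from your client's pool. The worker count of 8 is an arbitrary assumption, not a recommendation from the thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List


def write_concurrently(batches: Iterable[Any],
                       send_batch: Callable[[Any], Any],
                       workers: int = 8) -> List[Any]:
    """Submit each batch on a worker thread and return results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every batch and re-raises the first exception, if any.
        return list(pool.map(send_batch, batches))
```

This can be combined with moderately sized batches so that neither a single thread nor a single huge mutation becomes the bottleneck.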

Re: Is multiget_slice performant when you're looking for lots of keys?

2010-05-11 Thread David Boxenhorn
I have a similar issue, but I can't create a CF per type, because types are an open-ended set in my case (they are geographical locations). So I wanted to have one CF for types, and a supercolumn for each type, with the keys as columns per supercolumn. Is it a problem for me to have millions of co

Re: Is SuperColumn necessary?

2010-05-11 Thread Jonathan Shook
This is one of the sticking points with the key concatenation argument. You can't simply access subpartitions of data along an aggregate name using a concatenated key unless you can efficiently address a range of the keys according to a property of a subset. I'm hoping this will bear out with more

Re: Is SuperColumn necessary?

2010-05-11 Thread David Boxenhorn
I would like an API with a variable number of arguments. Using Java varargs, something like value = keyspace.get("articles", "cars", "John Smith", "2010-05-01", "comment-25"); or valueArray = keyspace.get("articles", predicate1, predicate2, predicate3, predicate4); The storage layout would be

Read Latency

2010-05-11 Thread Wayne
I am evaluating Cassandra, and Read latency is the biggest concern in terms of performance. As I test various scenarios and configurations I am getting surprising results. I have a 2 node cluster with both nodes connected to direct attached storage. The read latency pulling data off the raid 10 sto

Re: Read Latency

2010-05-11 Thread Jonathan Shook
RAID may be less valuable to you here. More useful to you would be to split the storage according to http://wiki.apache.org/cassandra/CassandraHardware When Cassandra is accessing effectively random parts of a large data store, expect it to be constantly hitting certain "always hot" parts of files

Re: Read Latency

2010-05-11 Thread Peter Schüller
> isolated requests, obviously in scale the RAID should perform better... I > have not started testing concurrent reads in scale as the single reads are > too slow to begin with. I am getting 20-30ms response time off of internal Concurrent reads is what you need to do in order to see the benefit

Re: Is SuperColumn necessary?

2010-05-11 Thread Schubert Zhang
Hi Stu, Thanks for your hard work. That's not easy work. After days of reading the code with my partners, we really see that the current implementation of the storage layer should be rewritten for clarity. On Tue, May 11, 2010 at 12:44 AM, Stu Hood wrote: > I think that it

Re: Is SuperColumn necessary?

2010-05-11 Thread Schubert Zhang
I would appreciate keeping the Cassandra core data model clear and pure. On Tue, May 11, 2010 at 5:20 AM, Mike Malone wrote: > On Mon, May 10, 2010 at 1:38 PM, AJ Chen wrote: > >> Could someone confirm this discussion is not about abandoning supercolumn >> family? I have found modeling data with supercol

Re: How to write WHERE .. LIKE query ?

2010-05-11 Thread Mike Malone
On Mon, May 10, 2010 at 11:36 PM, vd wrote: > Hi Mike > > AFAIK cassandra queries only on keys and not on column names, please > verify. > Incorrect. You can slice a row or rows (identified by a key) on a column name range (e.g., "a" through "m") or ask for specific columns in a row or rows (e.g
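To illustrate what slicing a row on a column name range means, here is a small in-memory sketch. It is not Cassandra's API: a dict stands in for one row, plain string ordering stands in for the column comparator, and the function name `slice_columns` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Any, Dict, List, Tuple


def slice_columns(row: Dict[str, Any], start: str, finish: str) -> List[Tuple[str, Any]]:
    """Return (name, value) pairs whose names fall in [start, finish], in sorted order."""
    names = sorted(row)                 # columns are kept sorted by name
    lo = bisect_left(names, start)      # first name >= start
    hi = bisect_right(names, finish)    # one past the last name <= finish
    return [(name, row[name]) for name in names[lo:hi]]
```

Note that with plain string ordering a name such as "mango" sorts after "m", so a slice from "a" through "m" does not include it.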

Re: Read Latency

2010-05-11 Thread Schubert Zhang
"(I originally saw 3-5 ms read latency with a small amount of data and 1 Keyspace/CF)? " The 3~5ms latency is offered by the Filesystem page cache. Because your dataset is small, it can be cached totally by Filesystm. 2010/5/11 Peter Schüller > > isolated requests, obviously in scale the RAID

Why not to delete the "Compacted" file immediately? What is the policy in 0.6.1?

2010-05-11 Thread Schubert Zhang
In the current 0.6.1, after a long compaction, the old SSTable files are still there, each marked by a zero-sized "CFName-id-Compacted" file. Why not delete them immediately? What is the policy in 0.6.1? See the following examples. -rw-rw-r-- 1 cassandra cassandra 0 May 11 23:35 LZO-12

Nodes Levels of Hierarchy in Cassandra.

2010-05-11 Thread xmanach.ext
Hi. I understood I can use this structure with Cassandra: KeySpace_3nodeslevels = { Key_Family_alfa : { Key_Super_Column_A : { Key_Column_1: "Value_alfa_A_1", Key_Column_2: "Value_alfa_A_2"}, Key_Super_Column_B : { Key_Column_1: "Value_alfa_B

Re: How to write WHERE .. LIKE query ?

2010-05-11 Thread Schubert Zhang
In the future, maybe Cassandra can provide some "Filter" or "Coprocessor" interfaces, just like Bigtable does. But for now, Cassandra is too young, and there are many things to do for a clear core. On Tue, May 11, 2010 at 11:35 PM, Mike Malone wrote: > On Mon, May 10, 2010 at 11:36 PM, vd wrote:

Re: Is multiget_slice performant when you're looking for lots of keys?

2010-05-11 Thread Schubert Zhang
> Is it a problem for me to have millions of columns in a supercolumn? You will have a problem, because there is no index within a supercolumn for its subcolumns. On Tue, May 11, 2010 at 10:03 PM, David Boxenhorn wrote: > I have a similar issue, but I can't create a CF per type, because types are > an open-en

Re: Is SuperColumn necessary?

2010-05-11 Thread Mike Malone
On Tue, May 11, 2010 at 7:46 AM, David Boxenhorn wrote: > I would like an API with a variable number of arguments. Using Java > varargs, something like > > value = keyspace.get("articles", "cars", "John Smith", "2010-05-01", > "comment-25"); > > or > > valueArray = keyspace.get("articles", predic

Re: How to write WHERE .. LIKE query ?

2010-05-11 Thread Mike Malone
On Tue, May 11, 2010 at 8:54 AM, Schubert Zhang wrote: > In the future, maybe cassandra can provide some "Filter" or "Coprocessor" > interfaces. Just like what of Bigtable do. > But now, cassandra is too young, there are many things to do for a clear > core. There's been talk of adding coproces

replication impact on write throughput

2010-05-11 Thread Bill de hOra
If I had 10 Cassandra nodes each with a write capacity of 5K per second and a replication factor of 2, would that mean the expected write capacity of the system would be ~25K writes per second because the nodes are also serving other nodes and not just clients? I know this is highly simplified
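The back-of-the-envelope estimate in the question can be written out as a small calculation. This is only the simplified model the poster describes (every node has the same capacity, every client write costs one node-level write per replica, and reads, consistency level, and hot spots are ignored); it is not a confirmed answer from the thread, and the function name is hypothetical.

```python
def estimated_cluster_writes_per_sec(nodes: int,
                                     per_node_wps: float,
                                     replication_factor: int) -> float:
    """Simplified estimate: total node write capacity divided by the number of replicas per write."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # Each client write is stored on `replication_factor` nodes,
    # so it consumes that many node-level writes.
    return nodes * per_node_wps / replication_factor
```

Under these assumptions, 10 nodes at 5K writes per second with a replication factor of 2 give 10 × 5,000 / 2 = 25,000 client writes per second.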

Re: Read Latency

2010-05-11 Thread B. Todd Burruss
you can try this benchmarking tool to compare your drive(s) http://freshmeat.net/projects/fio/ ... you can simulate various loads, etc. my RAID0 outperforms single drive (as mentioned below) under heavy concurrent reads. On 05/11/2010 08:15 AM, Peter Schüller wrote: isolated requests, obvio

0.6.2

2010-05-11 Thread B. Todd Burruss
i was thinking about doing some testing with 0.6.2 ... do the devs consider the tip of 0.6 branch ok to test with?

Re: Why not to delete the "Compacted" file immediately? What is the policy in 0.6.1?

2010-05-11 Thread Jonathan Ellis
from http://wiki.apache.org/cassandra/ArchitectureInternals: Making this concurrency-safe without blocking writes or reads while we remove the old SSTables from the list and add the new one is tricky, because naive approaches require waiting for all readers of the old sstables to finish before del

Re: 0.6.2

2010-05-11 Thread Jonathan Ellis
Yes On Tue, May 11, 2010 at 11:19 AM, B. Todd Burruss wrote: > i was thinking about doing some testing with 0.6.2 ... do the devs consider > the tip of 0.6 branch ok to test with? > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra

Apache Cassandra and Net::Cassandra::Easy

2010-05-11 Thread Scott Doty
Hi folks, I'm trying to wrap my head around Net::Cassandra::Easy, and it's making me cross-eyed. My prototype app can be seen here: http://bito.ponzo.net/Hatchet/ The idea is to index logfiles by various keys, using Cassandra's extreme write speed to keep up with the millions of lines of logfil

Re: Cassandra training on May 21 in Palo Alto

2010-05-11 Thread Vick Khera
On Fri, May 7, 2010 at 6:56 AM, Matt Revelle wrote: > Reston, VA is a good spot in the DC metro area for tech events. +1

Re: Why not to delete the "Compacted" file immediately? What is the policy in 0.6.1?

2010-05-11 Thread Schubert Zhang
Thanks jonathan, clear! On Wed, May 12, 2010 at 12:22 AM, Jonathan Ellis wrote: > from http://wiki.apache.org/cassandra/ArchitectureInternals: > > Making this concurrency-safe without blocking writes or reads while we > remove the old SSTables from the list and add the new one is tricky, > becau

Re: performance tuning - where does the slowness come from?

2010-05-11 Thread B. Todd Burruss
another note on this ... since all my nodes are very well balanced and were started at the same time, i notice that they all do garbage collection at about the same time. this of course causes a performance issue. i also have noticed that with the default JVM options and heavy load, ConcMark

Re: Apache Cassandra and Net::Cassandra::Easy

2010-05-11 Thread Ted Zlatanov
On Tue, 11 May 2010 09:40:02 -0700 Scott Doty wrote: SD> I'm trying to wrap my head around Net::Cassandra::Easy, and it's making SD> me cross-eyed. SD> My prototype app can be seen here: SD> http://bito.ponzo.net/Hatchet/ SD> The idea is to index logfiles by various keys, using Cassandra's ex

Re: replication impact on write throughput

2010-05-11 Thread Mark Greene
If you have for example, your replication factor equal to the total amount of nodes in the ring, I suspect you will hit a brick wall pretty soon. The biggest impact on your write performance will most likely be the consistency level of your writes. In other words, how many nodes you want to wait f

Re: replication impact on write throughput

2010-05-11 Thread Peter Schüller
> The biggest impact on your write performance will most likely be the > consistency level of your writes. In other words, how many nodes you want to > wait for before you acknowledge the write back to the client. I believe the consistency level is only expected to have a significant impact on lat

Inverted Indexing a ColumnFamily

2010-05-11 Thread Utku Can Topçu
Hello All, I guess the subject talks for itself. I'm currently developing a document analysis engine using cassandra as the scalable storage. I just want to briefly make an overview of the data model I'm using for this purpose. "the key" is formed in the format of timestamp.random(), so that it'

Re: Apache Cassandra and Net::Cassandra::Easy

2010-05-11 Thread Scott Doty
On 05/11/2010 10:45 AM, Ted Zlatanov wrote: > SD> My prototype app can be seen here: > > SD> http://bito.ponzo.net/Hatchet/ > > > The latest N::C::Easy will not work with Cassandra 0.6.x, the only > target is SVN trunk. I can't discover the API version on the server so > there's no way to anticipa

Real-time Web Analysis tool using Cassandra. Doubts...

2010-05-11 Thread Paulo Gabriel Poiati
Hi all. I'm thinking about implementing a real-time WA tool using Cassandra as my storage. But I have some questions first. I'm considering Cassandra because of its excellent write performance, horizontal scalability and its tunable consistency level. - First of all, my first thought is to have t

Re: replication impact on write throughput

2010-05-11 Thread Bill de hOra
Mark Greene wrote: If you have for example, your replication factor equal to the total amount of nodes in the ring, I suspect you will hit a brick wall pretty soon. Right :) So if we said there were 100 nodes at 5K wps with R=2, then would that suggest the cluster can support 250K wps? Again

Re: performance tuning - where does the slowness come from?

2010-05-11 Thread Jonathan Ellis
We're working on better GC defaults for 0.6.2. Thanks! On Tue, May 11, 2010 at 12:00 PM, B. Todd Burruss wrote: > another note on this ... since all my nodes are very well balanced and were > started at the same time, i notice that they all do garbage collection at > about the same time.  this o

Re: Apache Cassandra and Net::Cassandra::Easy

2010-05-11 Thread Jonathan Ellis
2010/5/11 Ted Zlatanov : > The latest N::C::Easy will not work with Cassandra 0.6.x, the only > target is SVN trunk.  I can't discover the API version on the server so > there's no way to anticipate such breakage as you see (I suspect it's > due to API mismatch).  The Cassandra developers haven't a

Re: replication impact on write throughput

2010-05-11 Thread Jonathan Ellis
On Tue, May 11, 2010 at 11:10 AM, Bill de hOra wrote: > I know this is highly simplified take on things (ie no consideration for > reads or quorum), I'm just trying to understand what the implication of > replication is on write scalability. Intuitively it would seem actual write > capacity is tot

Re: Apache Cassandra and Net::Cassandra::Easy

2010-05-11 Thread Ted Zlatanov
On Tue, 11 May 2010 14:29:13 -0500 Jonathan Ellis wrote: JE> 2010/5/11 Ted Zlatanov : >> The latest N::C::Easy will not work with Cassandra 0.6.x, the only >> target is SVN trunk.  I can't discover the API version on the server so >> there's no way to anticipate such breakage as you see (I suspe

(Binary)Memtable flushing

2010-05-11 Thread Tobias Jungen
Yet another BMT question, though this may apply to regular memtables as well... After doing a batch insert, I accidentally submitted the flush command twice. To my surprise, the target node's log indicates that it wrote a new *-Data.db file, and the disk usage went up accordingly. I tested and i

nodetool drain disables writes?

2010-05-11 Thread Anthony Molinaro
Hi, I thought that 'nodetool drain' was supposed to flush the commit logs through the system, which it appears to do (verified by running ls in the commit log directory and seeing no files). However, it also appears to disable writes completely (ie, scripts attempting to write data were frozen,

what/how do you guys monitor "slow" nodes?

2010-05-11 Thread S Ahmed
If you have 3-4 nodes, how do you monitor the performance of each node?

Re: what/how do you guys monitor "slow" nodes?

2010-05-11 Thread Jordan Pittier - Rezel
For sure you have to pay particular attention to memory allocation on each node; in particular, be sure your servers don't swap. Then you can monitor how load is balanced among your nodes (nodetool -h XX ring). On Tue, May 11, 2010 at 11:46 PM, S Ahmed wrote: > If you have 3-4 nodes, how do you mon

Re: replication impact on write throughput

2010-05-11 Thread Mark Greene
I was under the impression from what I've seen talked about on this list (perhaps I'm wrong here) that given the write throughput of one node in a cluster (again assuming each node has a given throughput and the same config) that you would simply multiply that throughput by the number of nodes you

Re: How to setup multinode cluster over different servers through firewall

2010-05-11 Thread Xu Hui Hui
On Tue, May 11, 2010 at 9:30 PM, Gary Dusbabek wrote: > If the remote host is on the same network as the 192.168.*.* host, it should > work > unless one of those hosts is running a local firewall. I don't quite understand what you meant by "on the same network", but connecting them with a VPN works. T

Timed out reads still in queue

2010-05-11 Thread Jeremy Dunck
Reddit posted a blog entry about some recent downtime, partially due to issues with Cassandra. http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html This part surprised me: " First, Cassandra has an internal queue of work to do. When it times out a client (10s by default), it still

Re: replication impact on write throughput

2010-05-11 Thread Paul Prescod
On Tue, May 11, 2010 at 5:56 PM, Mark Greene wrote: > I was under the impression from what I've seen talked about on this list > (perhaps I'm wrong here) that given the write throughput of one node in a > cluster (again assuming each node has a given throughput and the same > config) that you woul

Unexpected TType error on web traffic increase

2010-05-11 Thread Waqas Badar
Dear all, We are using Cassandra on a website. Whenever website traffic increases, we get the following error (Python): File "/usr/local/lib/python2.6/dist-packages/pycassa/columnfamily.py", line 199, in multiget self._rcl(read_consistency_level)) File "/usr/local/lib/p