Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Edward Capriolo
On Mon, Oct 25, 2010 at 10:19 PM, Takayuki Tsunakawa wrote: > Hello, Mike, > > Thank you for your advice. I'll close this thread with this mail (I've been > afraid I was interrupting the community developers with cloudy questions.) > I'm happy to know that any clearly known limitation does not exi

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Takayuki Tsunakawa
Hello, Mike, Thank you for your advice. I'll close this thread with this mail (I've been afraid I was interrupting the community developers with cloudy questions.) I'm happy to know that any clearly known limitation does not exist to limit the cluster to a couple hundred nodes. If our project

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Mike Malone
Hey Takayuki, I don't think you're going to find anyone willing to promise that Cassandra will fit your petabyte scale data analysis problem. That's a lot of data, and there's not a ton of operational experience at that scale within the community. And the people who do work on that sort of problem

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Takayuki Tsunakawa
Hello, Edward, Thank you for giving me insight into large-disk nodes. From: "Edward Capriolo" > Index sampling on startup. If you have very small rows your indexes > become large. These have to be sampled on startup and sampling our > indexes for 300 GB of data can take 5 minutes. This is goin

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Takayuki Tsunakawa
Hello, Jonathan, From: "Jonathan Ellis" > There is no reason Cassandra cannot scale to 1000s or more nodes with > the current architecture. Oh, really? I got the impression that gossip exchanges limit the number of nodes in a cluster when I read the Dynamo paper and "Cassandra - A Decentra

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Edward Capriolo
On Mon, Oct 25, 2010 at 12:37 PM, Jonathan Ellis wrote: > On Sun, Oct 24, 2010 at 9:09 PM, Takayuki Tsunakawa > wrote: >> From: "Jonathan Ellis" >>> (b) Cassandra generates input splits from the sampling of keys each >>> node has in memory.  So if a node does end up with no data for a >>> keyspa

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-25 Thread Jonathan Ellis
On Sun, Oct 24, 2010 at 9:09 PM, Takayuki Tsunakawa wrote: > From: "Jonathan Ellis" >> (b) Cassandra generates input splits from the sampling of keys each >> node has in memory.  So if a node does end up with no data for a >> keyspace (because of bad OOP balancing for instance) it will have no >>

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-24 Thread Takayuki Tsunakawa
Hello, Jonathan, Thank you for your kind reply. Could you give me some more opinions/comments? From: "Jonathan Ellis" > (b) Cassandra generates input splits from the sampling of keys each > node has in memory. So if a node does end up with no data for a > keyspace (because of bad OOP balancing

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-22 Thread Jonathan Ellis
On Fri, Oct 22, 2010 at 3:30 AM, Takayuki Tsunakawa wrote: > Yes, I meant one map task would be sent to each task tracker, resulting in > 1,000 concurrent map tasks in the cluster. ColumnFamilyInputFormat cannot > identify the nodes that actually hold some data, so the job tracker will > send the

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-22 Thread Aaron Morton
. > > Regards, > Takayuki Tsunakawa > > - Original Message - > From: aaron morton > To: user@cassandra.apache.org > Sent: Friday, October 22, 2010 4:05 PM > Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes > of data >

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-22 Thread Takayuki Tsunakawa
d. Regards, Takayuki Tsunakawa - Original Message - From: aaron morton To: user@cassandra.apache.org Sent: Friday, October 22, 2010 4:05 PM Subject: Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data For plain old log analysis the Cloudera Hadoop distribution

Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-22 Thread Aaron Morton
For plain old log analysis the Cloudera Hadoop distribution may be a better match. Flume is designed to help with streaming data into HDFS, the LZO compression extensions would help with the data size, and Pig would make the analysis easier (IMHO). http://www.cloudera.com/hadoop/ http://www.clou

[Q] MapReduce behavior and Cassandra's scalability for petabytes of data

2010-10-21 Thread Takayuki Tsunakawa
Hello, I'm evaluating whether Cassandra fits a certain customer well. The customer will collect petabytes of logs and analyze them. Could you tell me if my understanding is correct and/or give me your opinions? I'm sorry that the analysis requirement is not clear yet. 1. MapReduce behavior I read