Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data

Edward Capriolo Mon, 25 Oct 2010 20:25:27 -0700

On Mon, Oct 25, 2010 at 10:19 PM, Takayuki Tsunakawa
<tsunakawa.ta...@jp.fujitsu.com> wrote:
> Hello, Mike,
>
> Thank you for your advice. I'll close this thread with this mail (I've been
> afraid I was interrupting the community developers with cloudy questions.)
> I'm happy to know that there is no clearly known limitation that restricts
> the cluster to a couple hundred nodes. If our project starts with
> Cassandra and encounters any issues or interesting things, I'll report here.
>
> Regards,
> Takayuki Tsunakawa
>
> From: Mike Malone
> Hey Takayuki,
>
> I don't think you're going to find anyone willing to promise that Cassandra
> will fit your petabyte scale data analysis problem. That's a lot of data,
> and there's not a ton of operational experience at that scale within the
> community. And the people who do work on that sort of problem tend to be
> busy ;). If your problem is that big, you're probably going to need to do
> some experimentation and see if the system will scale for you. I'm sure
> someone here can answer any specific questions that may come up if you do
> that sort of work.
>
> As you mentioned, the first concern I'd have with a cluster that big is
> whether gossip will scale. I'd suggest taking a look at the gossip code.
> Cassandra nodes are "omniscient" in the sense that they all try to maintain
> full ring state for the entire cluster. At a certain cluster size that no
> longer works.
>
> My best guess is that a cluster of 1000 machines would be fine. Maybe even
> an order of magnitude bigger than that. I could be completely wrong, but
> given the low overhead that I've observed that estimate seems reasonable. If
> you do find that gossip won't work in your situation it would be interesting
> to hear why. You may even consider modifying / updating gossip to work for
> you. The code isn't as scary as it may seem. At that scale it's likely
> you'll encounter bugs and corner cases that other people haven't, so it's
> probably worth familiarizing yourself with the code anyways if you decide to
> use Cassandra.
>
> Mike
>


I miscommunicated my idea. I was not describing the time to compute
splits. I was describing how it takes me 5 minutes to start a
Cassandra node with 300 GB of data and large indexes caused by small
rows.

As for statistics on join times, I do not have them. Intensive
operations like compactions and joins get absorbed by large clusters.
By this I mean that if you have 100 nodes, adding the 101st node has a
small impact on the cluster at large.
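
To put a rough number on that, here is a back-of-the-envelope sketch. It
is not a measurement. It assumes an idealized ring where data is spread
evenly and the joining node ends up with an equal share; the function
name is made up for illustration, and real token placement and
replication will change the figures.

```python
def share_moved_to_new_node(existing_nodes: int) -> float:
    """Fraction of total data a joining node takes over.

    Assumption (not from Cassandra itself): data is evenly balanced and
    the new node ends up owning an equal 1/(N+1) slice of the ring.
    """
    if existing_nodes < 1:
        raise ValueError("need at least one existing node")
    return 1.0 / (existing_nodes + 1)


# Compare a small cluster with a large one.
for n in (3, 10, 100):
    share = share_moved_to_new_node(n)
    print(f"{n} -> {n + 1} nodes: {share:.1%} of the data moves")
```

Under those assumptions, going from 3 to 4 nodes moves a quarter of the
data, while going from 100 to 101 moves about 1%.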
