Hello, Mike,

Thank you for your advice. I'll close this thread with this mail (I've been
afraid I was bothering the community developers with vague questions).
I'm happy to know that there is no known limitation that restricts
the cluster to a couple hundred nodes. If our project starts with
Cassandra and encounters any issues or interesting things, I'll report them here.

Regards,
Takayuki Tsunakawa


From: Mike Malone
Hey Takayuki,


I don't think you're going to find anyone willing to promise that Cassandra
will fit your petabyte scale data analysis problem. That's a lot of data,
and there's not a ton of operational experience at that scale within the
community. And the people who do work on that sort of problem tend to be
busy ;). If your problem is that big, you're probably going to need to do
some experimentation and see if the system will scale for you. I'm sure
someone here can answer any specific questions that may come up if you do
that sort of work.


As you mentioned, the first concern I'd have with a cluster that big is
whether gossip will scale. I'd suggest taking a look at the gossip code.
Cassandra nodes are "omniscient" in the sense that they all try to maintain
full ring state for the entire cluster. At a certain cluster size that no
longer works.
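
To make the "full ring state" point concrete, here is a toy sketch in Python.
It is not Cassandra's gossip code: the function name simulate_gossip, the
one-random-peer-per-round rule, and the push-pull merge are all assumptions
made for illustration. It only shows that when every node tracks every other
node, each node ends up holding one entry per cluster member.

```python
import random


def simulate_gossip(num_nodes, seed=0, max_rounds=1000):
    """Toy push-pull gossip (hypothetical model, not Cassandra's Gossiper).

    Returns (rounds_until_every_node_knows_every_node, entries_per_node).
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes to gossip")
    rng = random.Random(seed)
    # views[i] maps node id -> highest heartbeat version node i has seen.
    # At start each node knows only about itself.
    views = [{i: 1} for i in range(num_nodes)]
    for round_no in range(1, max_rounds + 1):
        for i in range(num_nodes):
            # Pick one random peer other than ourselves.
            peer = rng.randrange(num_nodes - 1)
            if peer >= i:
                peer += 1
            # Push-pull exchange: both sides keep the newest version
            # of every entry either of them has seen.
            merged = dict(views[i])
            for node, version in views[peer].items():
                if version > merged.get(node, 0):
                    merged[node] = version
            views[i] = merged
            views[peer] = dict(merged)
        # Converged once every node holds an entry for every node.
        if all(len(view) == num_nodes for view in views):
            return round_no, num_nodes
    raise RuntimeError("gossip did not converge within max_rounds")


if __name__ == "__main__":
    for n in (10, 100, 1000):
        rounds, entries = simulate_gossip(n)
        print(n, "nodes:", rounds, "rounds,", entries, "entries per node,",
              n * entries, "entries cluster-wide")
```

In this model the per-node state grows linearly with cluster size and the
cluster-wide total grows quadratically, which is the kind of growth to keep
an eye on when experimenting at larger scales.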


My best guess is that a cluster of 1000 machines would be fine. Maybe even
an order of magnitude bigger than that. I could be completely wrong, but
given the low overhead that I've observed that estimate seems reasonable. If
you do find that gossip won't work in your situation it would be interesting
to hear why. You may even consider modifying / updating gossip to work for
you. The code isn't as scary as it may seem. At that scale it's likely
you'll encounter bugs and corner cases that other people haven't, so it's
probably worth familiarizing yourself with the code anyways if you decide to
use Cassandra.


Mike
