Hey Takayuki,

I don't think you're going to find anyone willing to promise that Cassandra
will fit your petabyte-scale data analysis problem. That's a lot of data,
and there's not a ton of operational experience at that scale within the
community. And the people who do work on that sort of problem tend to be
busy ;). If your problem is that big, you're probably going to need to do
some experimentation and see if the system will scale for you. I'm sure
someone here can answer any specific questions that may come up if you do
that sort of work.

As you mentioned, the first concern I'd have with a cluster that big is
whether gossip will scale. I'd suggest taking a look at the gossip code.
Cassandra nodes are "omniscient" in the sense that they all try to maintain
full ring state for the entire cluster. At a certain cluster size that no
longer works.
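
To make the "omniscient" point concrete, here is a toy sketch of the idea.
This is plain Java I wrote for illustration, not Cassandra's actual Gossiper
or EndpointState classes (the names and fields are made up), but it shows why
state grows with the cluster: every node keeps an entry for every other node,
and the digests exchanged in a gossip round grow along with that map.

    import java.net.InetAddress;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy model only; not Cassandra's real gossip classes.
    class ToyEndpointState {
        long generation;   // bumped when the node restarts
        long version;      // bumped as the node's state changes
        String status;     // e.g. NORMAL / LEAVING, tokens, load, ...
    }

    class ToyGossiper {
        // One entry per known endpoint: N nodes means N entries on every node.
        final Map<InetAddress, ToyEndpointState> endpointStates =
                new ConcurrentHashMap<InetAddress, ToyEndpointState>();

        // Rough size of the full-ring state this node has to carry around.
        long approxStateBytes(int bytesPerEndpoint) {
            return (long) endpointStates.size() * bytesPerEndpoint;
        }
    }

The real code does more than this (heartbeats, versioned application state,
failure detection), but the per-endpoint bookkeeping is the part that matters
for your question.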

My best guess is that a cluster of 1000 machines would be fine, maybe even
an order of magnitude bigger than that. I could be completely wrong, but
given the low overhead I've observed, that estimate seems reasonable (rough
numbers below). If you do find that gossip won't work in your situation, it
would be interesting to hear why. You may even consider modifying or
updating gossip to work for you. The code isn't as scary as it may seem. At
that scale it's likely you'll encounter bugs and corner cases that other
people haven't, so it's probably worth familiarizing yourself with the code
anyway if you decide to use Cassandra.
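
For a rough sense of scale (illustrative numbers only; I'm assuming something
like 1 KB of gossip state per endpoint, which I haven't measured):

    1,000 nodes  -> roughly 1 MB of ring state held on every node
    10,000 nodes -> roughly 10 MB

So the memory for the ring state itself shouldn't be the limiting factor at
those sizes; if something does break, it's more likely to be message handling
or how long the cluster takes to converge after changes.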

Mike

On Tue, Oct 26, 2010 at 1:09 AM, Takayuki Tsunakawa <
tsunakawa.ta...@jp.fujitsu.com> wrote:

> Hello, Edward,
>
> Thank you for giving me insight into large disk nodes.
>
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> > Index sampling on startup. If you have very small rows, your indexes
> > become large. These have to be sampled on startup, and sampling our
> > indexes for 300 GB of data can take 5 minutes. This is going to be
> > optimized soon.
>
> 5 minutes for 300 GB of data ... that's not cheap, is it? Scaled linearly,
> 3 TB of data will lead to 50 minutes just for computing input splits. This
> is too expensive when I want only part of the 3 TB of data.
>
>
> > (Just wanted to note some of this as I am in the middle of joining a
> > node right now :)
>
> Good luck. I'd appreciate it if you could share some performance numbers
> from joining nodes (amount of data, time to distribute the data, load
> impact on applications, etc.). The cluster our customer is thinking of is
> likely to become very large, so I'm interested in the elasticity. Yahoo!'s
> YCSB report makes me worry about adding nodes.
>
> Regards,
> Takayuki Tsunakawa
>
>
> From: "Edward Capriolo" <edlinuxg...@gmail.com>
> [Q3]
> There are some challenges with very large disk nodes.
> Caveats:
> I will use words like "long", "slow", and "large" in relative terms. If
> you have great equipment, e.g. 10G Ethernet between nodes, it will not
> take "long" to transfer data. If you have an insane disk pack it may not
> take "long" to compact 200 GB of data. I am basing these statements on
> server-class hardware: ~32 GB RAM, ~2x processors, ~6-disk SAS RAID.
>
> Index sampling on startup. If you have very small rows, your indexes
> become large. These have to be sampled on startup, and sampling our
> indexes for 300 GB of data can take 5 minutes. This is going to be
> optimized soon.
>
> Joining nodes: with larger systems, joining a new node involves a lot of
> data transfer and can take a "long" time. The node join process is going
> to be optimized in 0.7 and 0.8 (quite drastic changes in 0.7).
>
> Major compactions and very large normal compactions can take a "long"
> time. For example, while a 200 GB compaction that takes 30 minutes is
> running, other sstables build up, and more sstables mean "slower" reads.
>
> Achieving a high RAM/disk ratio may be easier with several smaller nodes
> than with one big node with 128 GB of RAM ($$$).
>
> As Jonathan pointed out, nothing technically is stopping larger disk
> nodes.
>
> (Just wanted to note some of this as I am in the middle of joining a node
> right now :)
>
>
>
