On Thu, Jun 9, 2011 at 7:40 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>
> On Thu, Jun 9, 2011 at 4:23 AM, AJ <a...@dude.podzone.net> wrote:
>>
>> [Please feel free to correct me on anything or suggest other workarounds
>> that could be employed now to help.]
>>
>> Hello,
>>
>> This is purely theoretical, as I don't have a big working cluster yet and
>> am still in the planning stages, but from what I understand, while Cass
>> scales well horizontally, EACH node will not handle a data store in the
>> terabyte range well... for understandable reasons such as simple hardware
>> and bandwidth limitations.  But, looking forward and pushing the envelope,
>> I think there might be ways to at least manage these issues until
>> broadband speeds, disk, and memory technology catch up.
>>
>> The biggest issues with big data clusters that I am currently aware of
>> are:
>>
>> > Disk I/O problems during major compactions and repairs.
>> > Bandwidth limitations during new node commissioning.
>>
>> Here are a few ideas I've thought of:
>>
>> 1.)  Load-balancing:
>>
>> During a major compaction, repair, or other similarly severe
>> performance-impacting process, allow the node to broadcast that it is
>> temporarily unavailable so requests for data can be sent to other nodes in
>> the cluster.  The node could still "wake up" and pause or cancel its
>> compaction in the case of a failed node where no other nodes can provide
>> the requested data.  The node would be considered "degraded" by other
>> nodes, rather than down.  (In fact, a general load-balancing scheme could
>> be devised if each node broadcast its current load level and maybe even
>> hop counts between data centers.)
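>>
>> As a rough sketch of what I mean (the class and method names here are
>> made up for illustration, not actual Cassandra APIs), a coordinator could
>> simply prefer replicas that have not gossiped a "degraded" flag, falling
>> back to degraded ones only when nothing healthier is available:
>>
>>     // Hypothetical sketch only -- not real Cassandra code.
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     enum NodeState { UP, DEGRADED, DOWN }
>>
>>     class Replica {
>>         final String address;
>>         final NodeState state;
>>         Replica(String address, NodeState state) {
>>             this.address = address;
>>             this.state = state;
>>         }
>>     }
>>
>>     class DegradedAwareSelector {
>>         // Prefer UP replicas; use DEGRADED ones (e.g. mid-compaction)
>>         // only when no healthy replica can serve the request.
>>         static List<Replica> choose(List<Replica> replicas) {
>>             List<Replica> healthy = new ArrayList<Replica>();
>>             List<Replica> degraded = new ArrayList<Replica>();
>>             for (Replica r : replicas) {
>>                 if (r.state == NodeState.UP) healthy.add(r);
>>                 else if (r.state == NodeState.DEGRADED) degraded.add(r);
>>             }
>>             return healthy.isEmpty() ? degraded : healthy;
>>         }
>>     }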
>>
>> 2.)  Data Transplants:
>>
>> Since commissioning a new node that is due to receive data in the TB range
>> could mean days or weeks of data transfer, it would be much more efficient
>> to just courier the data.  Perhaps the SSTables (maybe from a snapshot)
>> could be transplanted from one production node into a new node to help
>> jump-start the bootstrap process.  The new node could sort things out
>> during the bootstrapping phase so that it ends up balanced correctly, just
>> as if it had started out with no data as usual.  If this could cut the
>> bandwidth in half, that would be a great benefit.  However, this would
>> work well mostly if the transplanted data came from a keyspace that used a
>> random partitioner; coming from an ordered partitioner may not be so
>> helpful, since the rows in the transplanted data might never be used on
>> the new node.
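>>
>> For example (a hypothetical helper, not an actual Cassandra class), the
>> "sorting out" step could just re-hash each transplanted key and drop rows
>> whose token falls outside the new node's range, roughly the way a random
>> partitioner assigns tokens:
>>
>>     // Hypothetical sketch: keep only transplanted rows that the new
>>     // node actually owns.  Ignores wrap-around ranges for simplicity.
>>     import java.math.BigInteger;
>>     import java.security.MessageDigest;
>>
>>     class TokenRangeFilter {
>>         // MD5-based token, similar in spirit to a random partitioner.
>>         static BigInteger token(byte[] key) throws Exception {
>>             byte[] d = MessageDigest.getInstance("MD5").digest(key);
>>             return new BigInteger(d).abs();
>>         }
>>
>>         // True if the row's token lands in the new node's (left, right].
>>         static boolean ownedByNewNode(byte[] key, BigInteger left,
>>                                       BigInteger right) throws Exception {
>>             BigInteger t = token(key);
>>             return t.compareTo(left) > 0 && t.compareTo(right) <= 0;
>>         }
>>     }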
>>
>> 3.)  Strategic Partitioning:
>>
>> Of course, there are surely other issues to contend with, such as RAM
>> requirements for caching purposes.  That may be managed by a partition
>> strategy that allows certain nodes to specialize in a certain subset of
>> the data, chosen geographically or however the designer likes.
>> Replication would still be done as usual, but this may help the cache be
>> better utilized by allowing it to focus on the subset of data that makes
>> up the majority of the node's data, versus a random sampling of the entire
>> cluster.  IOW, while a node may specialize in a certain subset and also
>> contain replicated rows from outside that subset, it will still (mostly)
>> be queried for data from within its subset, so the cache will hold mostly
>> data from that special subset, which could increase the cache hit rate.
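>>
>> Client-side, I picture something as simple as this (the names and the
>> region-prefixed key scheme are just assumptions for illustration): send
>> reads to the replica that "specializes" in the key's region so its cache
>> stays full of that subset:
>>
>>     // Hypothetical client-side routing sketch for cache locality.
>>     import java.util.List;
>>     import java.util.Map;
>>
>>     class RegionAwareRouter {
>>         // e.g. "us-east" -> "10.0.1.5"
>>         private final Map<String, String> preferredNodeByRegion;
>>
>>         RegionAwareRouter(Map<String, String> preferredNodeByRegion) {
>>             this.preferredNodeByRegion = preferredNodeByRegion;
>>         }
>>
>>         // Assumes keys look like "us-east:customer42".
>>         String pickNode(String key, List<String> liveNodes) {
>>             String region = key.substring(0, key.indexOf(':'));
>>             String preferred = preferredNodeByRegion.get(region);
>>             if (preferred != null && liveNodes.contains(preferred))
>>                 return preferred;
>>             // Fall back to any live node.
>>             int i = (key.hashCode() & 0x7fffffff) % liveNodes.size();
>>             return liveNodes.get(i);
>>         }
>>     }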
>>
>> This may not be a huge help for TB-sized data nodes, since even 32 GB of
>> RAM would still be relatively tiny in comparison to the data size, but I
>> include it just in case it spurs other ideas.  Also, I do not know how
>> Cass decides which node to query for data in the first place... so maybe
>> not the best idea.
>>
>> 4.)  Compressed Columns:
>>
>> Some sort of data compression of certain columns could be very helpful,
>> especially since text can often be compressed to less than 50% of its
>> size under the right conditions.  Overall native disk compression will
>> not help the bandwidth issue, since the data would be decompressed before
>> transit.  If the data were stored compressed, then Cass could even send
>> it to the client compressed so as to offload the decompression to the
>> client.  Likewise, during node commissioning, the data would never have
>> to be decompressed, saving on CPU and BW.  Alternately, a client could
>> tell Cass to decompress the data before transmit if needed.  This,
>> combined with idea #2 (transplants), could help speed up new node
>> bootstrapping, but only when a large portion of the data consists of very
>> large column values, so that compression is practical and efficient.  Of
>> course, the client could handle all the compression today without Cass
>> even knowing about it, so building this into Cass would just be a
>> convenience, but still nice to have.
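>>
>> Today this part is easy enough to do entirely on the client with plain
>> java.util.zip -- nothing Cassandra-specific -- e.g. gzip large text
>> values before insert and gunzip them after read:
>>
>>     // Client-side compression of large column values.
>>     import java.io.ByteArrayInputStream;
>>     import java.io.ByteArrayOutputStream;
>>     import java.util.zip.GZIPInputStream;
>>     import java.util.zip.GZIPOutputStream;
>>
>>     class ColumnCodec {
>>         // Compress before writing; store the result as the column value.
>>         static byte[] compress(byte[] value) throws Exception {
>>             ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>             GZIPOutputStream gz = new GZIPOutputStream(bos);
>>             gz.write(value);
>>             gz.close();
>>             return bos.toByteArray();
>>         }
>>
>>         // Decompress after reading to recover the original value.
>>         static byte[] decompress(byte[] stored) throws Exception {
>>             GZIPInputStream gz =
>>                 new GZIPInputStream(new ByteArrayInputStream(stored));
>>             ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>             byte[] buf = new byte[4096];
>>             int n;
>>             while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
>>             return bos.toByteArray();
>>         }
>>     }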
>>
>> 5.)  Postponed Major Compactions:
>>
>> The option to postpone auto-triggered major compactions until a
>> pre-defined time of day or week or until staff can do it manually.
>>
>> 6.?)  Finally, some have suggested just using more nodes with less data
>> storage each, which may solve most if not all of these problems.  But I'm
>> still fuzzy on that.  The trade-offs would be more infrastructure and
>> maintenance costs, a higher chance that some server will fail... maybe
>> higher bandwidth between nodes due to a larger cluster???  I need more
>> clarity on this alternative.  Imagine a total data size of 100 TB and the
>> choice between 200 nodes or 50.  What is the cost of more nodes, all
>> things being equal?
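>>
>> A quick back-of-envelope comparison (the 50 MB/s effective streaming rate
>> is only an assumption, not a benchmark) of how long re-streaming one
>> failed node's share would take:
>>
>>     // Back-of-envelope: time to re-stream one node's share of 100 TB.
>>     class RebuildTime {
>>         public static void main(String[] args) {
>>             double totalMB = 100.0 * 1024 * 1024;   // 100 TB in MB
>>             double rateMBperSec = 50.0;             // assumed rate
>>             for (int nodes : new int[] { 50, 200 }) {
>>                 double perNodeMB = totalMB / nodes;
>>                 double hours = perNodeMB / rateMBperSec / 3600;
>>                 System.out.printf("%d nodes -> %.0f GB/node, ~%.1f hours%n",
>>                                   nodes, perNodeMB / 1024, hours);
>>             }
>>         }
>>     }
>>
>> With those numbers, 50 nodes means roughly 2 TB and about half a day of
>> streaming per failed node, versus about 500 GB and under three hours with
>> 200 nodes.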
>>
>> Please contribute additional ideas and strategies/patterns for the benefit
>> of all!
>>
>> Thanks for listening and keep up the good work guys!
>>
>
> Some of these things are challenges, and a few are being worked on in one
> way or another.
>
> 1) Dynamic snitch was implemented to determine slow-acting nodes and
> re-balance load.
>
> 2) You can do a budget bootstrap with rsync, as long as you know what data
> to copy where. 0.7.X made the data-moving process more efficient.

I don't think you can do this with counters (unless things have
changed since we originally developed them).

> 3) There are many cases where different partition strategies can
> theoretically be better. The question is: for the normal use case, which
> is best?
>
> 4) Compressed SSTables are on the way. This will be nice because it can
> help you get more out of the disk caches.

https://issues.apache.org/jira/browse/CASSANDRA-674

> 5) Compactions *are* a good thing. You can already postpone them by
> setting the compaction thresholds to 0. That is not great, though, because
> smaller compactions can run really fast and you want those to happen
> regularly. Another way I take care of this is by forcing major compactions
> on my own schedule. This makes it very unlikely that a large compaction
> will kick off at random during peak time. 0.8.X has multi-threaded
> compaction and a throttling limit, so that looks promising.

If you throttle your compactions, you have a continuous incremental
process, which is much less painful.
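
Conceptually it's just capping the bytes/sec compaction is allowed to
write. A toy version of the idea (not the actual 0.8 implementation,
which I believe is configured via compaction_throughput_mb_per_sec in
cassandra.yaml) looks something like:

    // Toy byte-rate throttle to illustrate continuous, incremental
    // compaction; not Cassandra's real code.
    class Throttle {
        private final long bytesPerSec;
        private long windowStart = System.currentTimeMillis();
        private long bytesThisWindow = 0;

        Throttle(long bytesPerSec) { this.bytesPerSec = bytesPerSec; }

        // Call after writing n bytes of compaction output; sleeps
        // whenever the current window is over budget.
        void onWrite(long n) throws InterruptedException {
            bytesThisWindow += n;
            long elapsedMs =
                Math.max(System.currentTimeMillis() - windowStart, 1);
            long allowed = bytesPerSec * elapsedMs / 1000;
            if (bytesThisWindow > allowed)
                Thread.sleep((bytesThisWindow - allowed) * 1000 / bytesPerSec);
            if (elapsedMs >= 1000) {
                windowStart = System.currentTimeMillis();
                bytesThisWindow = 0;
            }
        }
    }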

>
> More nodes vs. fewer nodes... +1 for more nodes. This does not mean you
> need to go very small, but the larger disk configurations are just more
> painful, unless you can get very/very/very fast disks.
>

I would still like to be able to run multiple nodes on a single
machine (without having multiple IPs)

-ryan
