On Thu, Jun 9, 2011 at 7:40 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> On Thu, Jun 9, 2011 at 4:23 AM, AJ <a...@dude.podzone.net> wrote:
>>
>> [Please feel free to correct me on anything or suggest other workarounds that could be employed now to help.]
>>
>> Hello,
>>
>> This is purely theoretical, as I don't have a big working cluster yet and am still in the planning stages, but from what I understand, while Cass scales well horizontally, EACH node will not handle a data store in the terabyte range well, for understandable reasons such as simple hardware and bandwidth limitations. But, looking forward and pushing the envelope, I think there might be ways to at least manage these issues until broadband speeds, disk, and memory technology catch up.
>>
>> The biggest issues with big-data clusters that I am currently aware of are:
>>
>> - Disk I/O problems during major compactions and repairs.
>> - Bandwidth limitations during new-node commissioning.
>>
>> Here are a few ideas I've thought of:
>>
>> 1.) Load-balancing:
>>
>> During a major compaction, repair, or other severely performance-impacting process, allow the node to broadcast that it is temporarily unavailable so requests for data can be sent to other nodes in the cluster. The node could still "wake up" and pause or cancel its compaction in the case of a failed node, whereby there are no other nodes that can provide the requested data. The node would be considered "degraded" by other nodes, rather than down. (In fact, a general load-balancing scheme could be devised if each node broadcasts its current load level and maybe even the hop count between data centers.)
>>
>> 2.) Data Transplants:
>>
>> Commissioning a new node that is due to receive data in the TB range could take days or weeks of transfer, so it would be much more efficient to just courier the data. Perhaps the SSTables (maybe from a snapshot) could be transplanted from a production node into a new node to jump-start the bootstrap process. The new node could sort things out during the bootstrapping phase so that it ends up balanced correctly, as if it had started out with no data as usual. If this could cut the bandwidth in half, that would be a great benefit. However, this would work well mostly if the transplanted data came from a keyspace that used a random partitioner; coming from an ordered partitioner may not be so helpful if the rows in the transplanted data would never be used on the new node.
>>
>> 3.) Strategic Partitioning:
>>
>> Of course, there are surely other issues to contend with, such as RAM requirements for caching purposes. That might be managed by a partitioning strategy that allows certain nodes to specialize in a certain subset of the data, chosen geographically or however the designer likes. Replication would still be done as usual, but this may help the cache be better utilized by allowing it to focus on the subset of data that makes up the majority of the node's data, versus a random sampling of the entire cluster. In other words, while a node may specialize in a certain subset and also contain replicated rows from outside that subset, it will still (mostly) only be queried for data from within its subset, so the cache will hold mostly data from this special subset, which could increase the cache hit rate.
>>
>> This may not be a huge help for TB-sized data nodes, since even 32 GB of RAM would still be relatively tiny compared to the data size, but I include it in case it spurs other ideas. Also, I do not know how Cass decides which node to query for data in the first place... maybe not the best idea.
>>
>> 4.) Compressed Columns:
>>
>> Some sort of compression of certain columns could be very helpful, especially since text can often compress to less than 50% of its size under the right conditions. Overall native disk compression will not help the bandwidth issue, since the data would be decompressed before transit. If the data were stored compressed, Cass could even send it to the client compressed, offloading the decompression to the client. Likewise, during node commissioning the data would never have to be decompressed, saving CPU and bandwidth. Alternately, a client could tell Cass to decompress the data before transmit if needed. This, combined with idea #2 (transplants), could help speed up new-node bootstrapping, but only when a large portion of the data consists of very large column values, making compression practical and efficient. Of course, the client could handle all the compression today without Cass even knowing about it, so building this into Cass would just be a convenience, but still nice to have nonetheless.
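A minimal sketch of the client-side compression described in idea #4, done entirely outside Cassandra. The client.insert/client.get calls are placeholders for whatever client library is in use, not a real driver API; only the zlib part is concrete.

    import zlib

    def put_compressed(client, key, column, text):
        # Compress the column value before it ever reaches Cassandra;
        # the server just stores opaque bytes.
        client.insert(key, column, zlib.compress(text.encode('utf-8')))

    def get_compressed(client, key, column):
        # Decompress on the client, so the CPU cost stays off the cluster
        # and the value travels over the wire compressed.
        raw = client.get(key, column)
        return zlib.decompress(raw).decode('utf-8')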
>> 5.) Postponed Major Compactions:
>>
>> The option to postpone auto-triggered major compactions until a pre-defined time of day or week, or until staff can run them manually.
>>
>> 6.?) Finally, some have suggested just using more nodes with less data storage each, which may solve most if not all of these problems. But I'm still fuzzy on that. The trade-offs would be more infrastructure and maintenance costs, a higher chance that some server will fail... maybe higher bandwidth between nodes due to a larger cluster? I need more clarity on this alternative. Imagine a total data size of 100 TB and the choice between 200 nodes or 50. What is the cost of more nodes, all things being equal?
>>
>> Please contribute additional ideas and strategies/patterns for the benefit of all!
>>
>> Thanks for listening and keep up the good work guys!
>>
>
> Some of these things are challenges, and a few are being worked on in one way or another.
>
> 1) The dynamic snitch was implemented to detect slow nodes and re-balance load.
>
> 2) You can do a budget bootstrap with rsync, as long as you know what data to copy where. 0.7.X made the data-moving process more efficient.

I don't think you can do this with counters (unless things have changed since we originally developed them).
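Regarding point 2, a rough sketch of the "know what data to copy where" approach: rsync an existing snapshot's SSTables over to the new node before it joins. The keyspace, snapshot tag, target host, and the /var/lib/cassandra/data snapshot layout shown here are placeholders to adjust for your cluster and Cassandra version; it also assumes the snapshot was already taken on the source node (e.g. with nodetool snapshot) and that SSH to the new node works without a password.

    import subprocess

    # Placeholders -- adjust for your cluster and Cassandra version.
    KEYSPACE = "MyKeyspace"
    TAG = "transplant"                      # snapshot tag on the source node
    NEW_NODE = "newnode.example.com"        # target host
    DATA_DIR = "/var/lib/cassandra/data"    # default data directory

    src = "%s/%s/snapshots/%s/" % (DATA_DIR, KEYSPACE, TAG)
    dst = "%s:%s/%s/" % (NEW_NODE, DATA_DIR, KEYSPACE)

    # Snapshot SSTables are immutable, so rsync can be re-run safely and
    # will only transfer files it has not already copied.
    subprocess.check_call(["rsync", "-av", src, dst])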
> 3) There are many cases where different partition strategies can theoretically be better. The question is: for the normal use case, what is best?
>
> 4) Compressed SSTables are on the way. This will be nice because it can help maximize disk caches. https://issues.apache.org/jira/browse/CASSANDRA-674
>
> 5) Compactions *are* a good thing. You can already do this by setting compaction thresholds to 0. That is not great, because smaller compactions can run really fast and you want those to happen regularly. Another way I take care of this is forcing major compactions on my own schedule. This makes it very unlikely that a large compaction will happen at random during peak time. 0.8.X has multi-threaded compaction and a throttling limit, so that looks promising.

If you throttle your compactions, you have a continuous incremental process, which is much less painful.

> More nodes vs. fewer nodes... +1 for more nodes. This does not mean you need to go very small, but the larger disk configurations are just more painful, unless you can get very/very/very fast disks.

I would still like to be able to run multiple nodes on a single machine (without having multiple IPs).

-ryan
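Regarding point 5 above, a rough sketch of "forcing major compactions on my schedule": a small script run from an off-peak cron job that triggers a major compaction per column family via nodetool. The host, keyspace, and column family names are placeholders, and nodetool invocation details vary a little between Cassandra versions.

    import subprocess

    # Placeholders -- adjust for your cluster.
    HOST = "localhost"
    KEYSPACE = "MyKeyspace"
    COLUMN_FAMILIES = ["Users", "Events"]

    # Force a major compaction for each column family, one at a time, so
    # the heavy disk I/O happens when we choose it rather than whenever
    # the automatic thresholds happen to fire.
    for cf in COLUMN_FAMILIES:
        subprocess.check_call(["nodetool", "-h", HOST, "compact", KEYSPACE, cf])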