Thanks Tyler, this is really useful.

Also, I noticed that you can specify multiple data file directories located on different disks. Say I have a machine with 4 x 500GB drives; what would be the difference between the following two setups:

  1. each drive mounted separately, each with its own data file
     directory (so 4 data file dirs)
  2. all disks in RAID0, mounted as one drive with a single data
     folder on it

In other words, does splitting the data folder into smaller ones bring any performance or stability advantages?
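
For option 1, I guess the config would look something like this
(data_file_directories in 0.7's cassandra.yaml; the mount points here
are just made up for illustration):

    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data
        - /mnt/disk4/cassandra/data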


On 10/12/2010 00:03, Tyler Hobbs wrote:
Yes, that's correct, but I wouldn't push it too far. You'll become much more sensitive to disk usage changes; in particular, rebalancing your cluster will be particularly difficult, and repair will also become dangerous. Disk performance also tends to drop when a disk nears capacity.

There's no recommended maximum size -- it all depends on your access rates. Anywhere from 10GB to 1TB is typical.

- Tyler

On Thu, Dec 9, 2010 at 5:52 PM, Rustam Aliyev <rus...@code.az> wrote:


    That depends on your scenario.  In the worst case of one big CF,
    there's not much that can be easily done for the disk usage of
    compaction and cleanup (which is essentially compaction).

    If, instead, you have several column families and no single CF
    makes up the majority of your data, you can push your disk usage
    a bit higher.


    Is there any formula to calculate this? Let's say I have 500GB in
    a single CF, so I need at least 500GB of free space for
    compaction. If I partition this CF and split it into 10
    proportional CFs of 50GB each, does that mean I will need only
    50GB of free space?
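
    In other words, is the rule of thumb simply the following? (This
    is just my back-of-envelope sketch in Python, not an official
    formula.)

        # Worst case for a major compaction: the CF is rewritten in
        # full, so the free space needed is roughly the size of the
        # largest CF being compacted at any one time.
        cf_sizes_gb = [50] * 10            # ten proportional 50GB CFs
        free_needed_gb = max(cf_sizes_gb)  # ~50GB, vs ~500GB for one big CF
        print(free_needed_gb)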

    Also, is there a recommended maximum data size per node?

    Thanks.


    A fundamental idea behind Cassandra's architecture is that disk
    space is cheap (which, indeed, it is).  If you are particularly
    sensitive to this, Cassandra might not be the best solution to
    your problem.  Also keep in mind that Cassandra performs well
    with average disks, so you don't need to spend a lot there.
    Additionally, most people find that the replication protects
    their data enough to allow them to use RAID 0 instead of 1, 10,
    5, or 6.

    - Tyler

    On Thu, Dec 9, 2010 at 12:20 PM, Rustam Aliyev <rus...@code.az> wrote:

        Are there any plans to improve this in the future?

        For big data clusters this could be very expensive. Based on
        your comment, I will need 200TB of storage for 100TB of data
        to keep Cassandra running.

        --
        Rustam.

        On 09/12/2010 17:56, Tyler Hobbs wrote:
        If you are on 0.6, repair is particularly dangerous with
        respect to disk space usage.  If your replica is
        sufficiently out of sync, you can triple your disk usage
        pretty easily.  This has been improved in 0.7, so repairs
        should use about half as much disk space, on average.

        In general, yes, keep your nodes under 50% disk usage at all
        times.  Any of: compaction, cleanup, snapshotting, repair,
        or bootstrapping (the latter two are improved in 0.7) can
        double your disk usage temporarily.
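
        A quick way to keep an eye on this on each node (just a
        sketch in plain Python, nothing Cassandra-specific; point it
        at whatever data directory your config uses):

            import os

            def disk_usage_fraction(path):
                # Fraction of the filesystem holding `path` in use.
                st = os.statvfs(path)
                total = st.f_blocks * st.f_frsize
                avail = st.f_bavail * st.f_frsize
                return (total - avail) / float(total)

            # Hypothetical path -- substitute your own data directory.
            if disk_usage_fraction("/var/lib/cassandra/data") > 0.5:
                print("over 50% used: compaction/repair may not "
                      "have room to double")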

        You should plan to add more disk space or add nodes when you
        get close to this limit.  Once you go over 50%, it's more
        difficult to add nodes, at least in 0.6.

        - Tyler

        On Thu, Dec 9, 2010 at 11:19 AM, Mark <static.void....@gmail.com> wrote:

            I recently ran into a problem during a repair operation
            where my nodes completely ran out of space and my whole
            cluster was... well, clusterfucked.

            I want to make sure how to prevent this problem in the
            future.

            Should I make sure that at all times every node is under
            50% of its disk space? Are there any normal day-to-day
            operations that would cause any one node to double in
            size that I should be aware of? If one or more nodes
            surpass the 50% mark, what should I plan to do?

            Thanks for any advice.



