This is great information - thank you!

I'm coming from HDFS+Hbase, lots of nodes, nodes with many spindles.  When a drive fails in this environment (which happens a lot with 16-24 drives per node), HDFS removes that one failed volume and then maintains the 3x replication with the rest of the cluster.  As long as that drive is not the boot volume, we can then replace the failed drive live, typically without even doing a reboot of that one node.  We'll usually have two drives in a mirror RAID for boot, and then JBOD for the data drives.

If a data drive fails on a Cassandra server, does the whole node come down?

-Joe

On 1/20/2021 12:13 PM, Durity, Sean R wrote:

This is a great way to think through the problem and solution. I will add that part of my calculation on failure time is how long it takes to actually replace a drive and/or a server with (however many) drives. We pay for very fast vendor SLAs. In reality, however, there is quite a bit more activity before any of those SLAs kick in and the hardware is actually ready for use by Cassandra. So, I calculate my needed capacity and preferred node sizes with those factors included. (This is for on-prem hardware, not a cloud there's-always-a-spare model.)

Sean Durity

*From:* Jeff Jirsa <jji...@gmail.com>
*Sent:* Wednesday, January 20, 2021 11:59 AM
*To:* cassandra <user@cassandra.apache.org>
*Subject:* [EXTERNAL] Re: Node Size

Not going to give a number other than to say that 1TB/instance is probably super, super conservative in 2021. The modern number is likely considerably higher. But let's look at this from first principles. There are basically two things to worry about here:


1) Can you get enough CPU/memory to support a query load over that much data, and

2) When that machine fails, what happens?

Let's set aside 1, because you can certainly find some query pattern that works, e.g. write-only with time window compaction or something where there's very little actual work to maintain state.

So focusing on 2, a few philosophical notes:

2.a) For each range, cassandra streams from one replica. That means if you use a single token and RF=3, you're probably streaming from 3 hosts at a time.

2.b) In cassandra 0.whatever through 3.11, streaming during replacement presumed that only a portion of each data file would be sent to the new node, so it deserialized and reserialized most of the contents even when the whole file was being sent (in LCS, sending the whole file is COMMON; in TWCS/STCS, it's less common).

2.c) Each data file doing the partial-file streaming ser/deser uses exactly one core/thread on the receiving side. Adding extra CPU doesn't speed up streaming when you have to serialize/deserialize.

2.d) The more disks you put into a system, the more likely it is that any disk on a host fails, so your frequency of failure will go up with more disks.
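Point 2.d can be put in rough numbers. A minimal sketch, assuming independent disk failures and a hypothetical 2% annual failure rate (AFR) per disk — both assumptions mine, not from this thread:

```python
# Probability that at least one of N disks in a node fails within a
# year, assuming independent failures and an assumed 2% per-disk AFR.
afr = 0.02  # hypothetical per-disk annual failure rate

for disks in (1, 8, 16, 24):
    p_any = 1 - (1 - afr) ** disks
    print(f"{disks:2d} disks -> {p_any:.1%} chance of >=1 disk failure/year")
```

With 24 disks per node the chance of at least one disk failure per node-year is roughly 38% under these assumptions, which matches the "happens a lot" experience described earlier in the thread.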

What's that mean?

The time it takes to rebuild a failed node depends on:
- Whether or not you're using vnodes (recalling that Joey at Netflix did some fun math that says lots of vnodes makes your chance of outage/dataloss go up very very quickly)

- Whether or not you're using LCS (recalling that LCS is super IO intensive compared to other compaction strategies)

- Whether or not you're running RAID on the host

Vnodes means more streaming sources, but also increases your chance of an outage with concurrent host failures.
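The vnode trade-off above can be gestured at with a toy estimate. The neighbor-count formula below is my own rough simplification of the kind of math in the Netflix analysis, not its exact model:

```python
# Rough estimate: with RF=3, a token range loses QUORUM when 2 of its
# 3 replicas are down.  Estimate the chance that a second concurrent
# node failure lands on a "neighbor" (a node sharing at least one
# replica set with the first failed node).
def p_quorum_loss(n_nodes, num_tokens, rf=3):
    # Each token contributes roughly 2*(rf-1) distinct neighbors,
    # capped at everyone else in the cluster.
    neighbors = min(n_nodes - 1, 2 * (rf - 1) * num_tokens)
    return neighbors / (n_nodes - 1)

for v in (1, 4, 16, 256):
    print(f"num_tokens={v:3d}: ~{p_quorum_loss(100, v):.0%} chance a 2nd "
          f"concurrent failure shares a range (100-node cluster)")
```

Under this toy model, a 100-node cluster with a single token has a ~4% chance that a second concurrent failure breaks quorum somewhere, while with 256 vnodes essentially any second failure does.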

LCS means streaming is faster, but also requires a lot more IO to maintain.

RAID is ... well, RAID. You're still doing the same type of rebuild operation there, and losing capacity, so ... don't do that, probably.

If you are clever enough to run more than one cassandra instance on the host, you protect yourself from the "bad" vnode behaviors (likelihood of an outage with 2 hosts down, ability to do simultaneous hosts joining/leaving/moving, etc), but it requires multiple IPs and a lot more effort.

So, how much data can you put onto a machine? Calculate your failure rate. Calculate your rebuild time. Figure out your chances of two failures in that same window, and the cost to your business of an outage/data loss if that were to happen. Keep adjusting fill sizes / ratios until you get numbers that work for you.
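The sizing loop described above can be sketched numerically. Every number below is a placeholder assumption to show the shape of the calculation, not a recommendation:

```python
import math

# Chance that a second node fails while the first is still rebuilding,
# treating failures as a Poisson process across the other n-1 nodes.
# node_afr is an assumed per-node annual failure rate.
def p_overlap(n_nodes, node_afr, rebuild_days):
    rate = (n_nodes - 1) * node_afr / 365.0  # cluster failures per day
    return 1 - math.exp(-rate * rebuild_days)

# Bigger nodes -> more data -> longer rebuild -> bigger overlap window.
for rebuild_days in (1, 3, 7, 14):
    print(f"rebuild={rebuild_days:2d}d -> "
          f"{p_overlap(50, 0.05, rebuild_days):.1%} chance of a 2nd failure")
```

The point of the exercise is the last step Jeff describes: multiply that overlap probability by the business cost of an outage/data loss, and keep adjusting node fill size until the product is acceptable.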

On Wed, Jan 20, 2021 at 7:59 AM Joe Obernberger <joseph.obernber...@gmail.com <mailto:joseph.obernber...@gmail.com>> wrote:

    Thank you Sean and Yakir.  Is 4.x the same?

    So if you were to build a 1PByte system, you would want 512-1024
    nodes?  That doesn't seem space-efficient vs., say, 48TByte nodes,
    where you would need ~21 machines.
    What would you do to build a 1PByte configuration?  I know there
    are a lot of - it depends - on that question, but say it was a
    write heavy, light read setup.  Thank you!
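The node-count arithmetic in the question is easy to check. Note this deliberately ignores replication factor and compaction/repair headroom, which a real capacity plan must multiply in:

```python
import math

# Raw data footprint divided by usable data per node (no replication,
# no headroom) -- the same back-of-envelope as in the question above.
total_tb = 1024  # ~1 PByte of raw data
for per_node_tb in (1, 2, 48):
    nodes = math.ceil(total_tb / per_node_tb)
    print(f"{per_node_tb:2d} TByte/node -> {nodes} nodes")
```

With RF=3, each of those node counts roughly triples before headroom is even considered, which is part of why the small-node guidance feels expensive at petabyte scale.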

    -Joe

    On 1/20/2021 10:06 AM, Durity, Sean R wrote:

        Yakir is correct. While it is feasible to have large disk
        nodes, the practical aspect of managing them is an issue. With
        the current technology, I do not build nodes with more than
        about 3.5 TB of disk available. I prefer 1-2 TB, but
        costs/number of nodes can change the considerations.

        Putting more than 1 node of Cassandra on a given host is also
        possible, but you will want to consider your availability if
        that hardware goes down. Losing 2 or more nodes with one
        failure is usually not good.

        NOTE: DataStax has some new features for supporting much
        larger disks and alleviating many of the admin pains
        associated with it. I don’t have personal experience with it,
        yet, but I will be testing it soon. In my understanding it is
        for use cases with massive needs for disk, but low to moderate
        throughput (ie, where node expansion is only for disk, not
        additional traffic).

        Sean Durity

        *From:* Yakir Gibraltar <yaki...@gmail.com>
        <mailto:yaki...@gmail.com>
        *Sent:* Wednesday, January 20, 2021 9:21 AM
        *To:* user@cassandra.apache.org <mailto:user@cassandra.apache.org>
        *Subject:* [EXTERNAL] Re: Node Size

        It is possible to use large nodes and they will work; the
        problems with large nodes will be:

          * Maintenance operations like joining/removing nodes will take more time.
          * Larger heap
          * etc.

        On Wed, Jan 20, 2021 at 3:54 PM Joe Obernberger
        <joseph.obernber...@gmail.com
        <mailto:joseph.obernber...@gmail.com>> wrote:

            Anyone know where I could find out more information on this?
            Thanks!

            -Joe

            On 1/13/2021 8:42 AM, Joe Obernberger wrote:
            > Reading the documentation on Cassandra 3.x, there are
            > recommendations that node size should be ~1TByte of data.
            > Modern servers can have 24 SSDs, each at 2TBytes in size
            > for data. Is that a bad idea for Cassandra? Does 4.0beta4
            > handle larger nodes?
            > We have machines that have 16, 8TByte SATA drives - would
            > that be a bad server for Cassandra? Would it make sense
            > to run multiple copies of Cassandra on the same node in
            > that case?
            >
            > Thanks!
            >
            > -Joe
            >

            


--
        *Best regards,*

        *Yakir Gibraltar*

        ------------------------------------------------------------------------


        The information in this Internet Email is confidential and may
        be legally privileged. It is intended solely for the
        addressee. Access to this Email by anyone else is
        unauthorized. If you are not the intended recipient, any
        disclosure, copying, distribution or any action taken or
        omitted to be taken in reliance on it, is prohibited and may
        be unlawful. When addressed to our clients any opinions or
        advice contained in this Email are subject to the terms and
        conditions expressed in any applicable governing The Home
        Depot terms of business or client engagement letter. The Home
        Depot disclaims all responsibility and liability for the
        accuracy and content of this attachment and for any damages or
        losses arising from any inaccuracies, errors, viruses, e.g.,
        worms, trojan horses, etc., or other items of a destructive
        nature, which may be contained in this attachment and shall
        not be liable for direct, indirect, consequential or special
        damages in connection with this e-mail message or its attachment.




