This is great information - thank you!
I'm coming from HDFS+Hbase, lots of nodes, nodes with many spindles.
When a drive fails in this environment (which happens a lot with 16-24
drives per node), HDFS removes that one failed volume and then maintains
the 3x replication with the rest of the cluster. As long as that drive
is not the boot volume, we can then replace the failed drive live,
typically without even doing a reboot of that one node. We'll usually
have two drives in a RAID-1 mirror for boot, and then JBOD for the data
drives.
If a data drive fails on a Cassandra server, does the whole node come down?
-Joe
On 1/20/2021 12:13 PM, Durity, Sean R wrote:
This is a great way to think through the problem and solution. I will
add that part of my calculation on failure time is how long does it
take to actually replace a drive and/or a server with (however many)
drives? We pay for very fast vendor SLAs. However, in reality, there
has been quite a bit more activity before any of those SLAs kicks in
and then the hardware is actually ready for use by Cassandra. So, I
calculate my needed capacity and preferred node sizes with those
factors included. (This is for on-prem hardware, not a
cloud-there’s-always-a-spare model.)
Sean Durity
*From:* Jeff Jirsa <jji...@gmail.com>
*Sent:* Wednesday, January 20, 2021 11:59 AM
*To:* cassandra <user@cassandra.apache.org>
*Subject:* [EXTERNAL] Re: Node Size
Not going to give a number other than to say that 1TB/instance is
probably super super super conservative in 2021. The modern number is
likely considerably higher. But let's look at this from first
principles. There are basically two things to worry about here:
1) Can you get enough CPU/memory to support a query load over that
much data, and
2) When that machine fails, what happens?
Let's set aside 1, because you can certainly find some query pattern
that works, e.g. write-only with time window compaction or something
where there's very little actual work to maintain state.
So focusing on 2, a few philosophical notes:
2.a) For each range, cassandra streams from one replica. That means if
you use a single token and RF=3, you're probably streaming from 3
hosts at a time
2.b) In Cassandra 0.whatever to 3.11, streaming during replacement
presumed that you would only send a portion of each data file to the
new node, so it deserialized and reserialized most of the contents,
even if the whole file was being sent (in LCS, sending the whole file
is COMMON; in TWCS / STCS, it's less common).
2.c) Each data file doing the partial-file streaming ser/deser uses
exactly one core/thread on the receiving side. Adding extra CPU doesn't
speed up streaming when you have to serialize/deserialize.
2.d) The more disks you put into a system, the more likely it is that
some disk on the host fails, so your frequency of failure goes up with
disk count.
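To put a rough number on 2.d, here's a back-of-envelope sketch in
Python (the 2% annualized failure rate is an assumed placeholder, not a
measured number; plug in your drives' real AFR):

    # P(at least one disk on a host fails within a year) vs. disk count.
    afr = 0.02  # assumed annualized failure rate per disk (placeholder)

    for disks in (1, 8, 16, 24):
        p_any = 1 - (1 - afr) ** disks
        print(f"{disks:2d} disks: P(>=1 disk failure/year) = {p_any:.1%}")

At an assumed 2% AFR, a single-disk host sees a failure in about 2% of
years, but a 24-disk host sees at least one failure in roughly 38% of
years.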
What's that mean?
The time it takes to rebuild a failed node depends on:
- Whether or not you're using vnodes (recalling that Joey at Netflix
did some fun math showing that lots of vnodes make your chance of
outage/data loss go up very, very quickly)
- Whether or not you're using LCS (recalling that LCS is super IO
intensive compared to other compaction strategies)
- Whether or not you're running RAID on the host
Vnodes mean more streaming sources, but also increase your chance of
an outage with concurrent host failures.
LCS means streaming is faster, but also requires a lot more IO to
maintain.
RAID is ... well, RAID. You're still doing the same type of rebuild
operation there, and losing capacity, so ... don't do that, probably.
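Combining 2.a and 2.c gives a rough rebuild-time estimate: throughput
is bounded by the number of concurrent streams times a single-core
ser/deser rate. A sketch with assumed inputs (the per-stream rate in
particular is something to measure on your own hardware):

    # Rough rebuild-time estimate. Streaming is bottlenecked at one
    # core/thread per incoming stream (2.c), so throughput scales with
    # the number of source replicas (2.a), not with total CPU.
    data_per_node_tb = 4.0  # assumed data per node (placeholder)
    streams = 3             # e.g. single token, RF=3 -> ~3 sources
    mb_per_stream = 50.0    # assumed ser/deser-limited MB/s per stream

    total_mb = data_per_node_tb * 1024 * 1024
    hours = total_mb / (streams * mb_per_stream) / 3600
    print(f"~{hours:.1f} hours to restream {data_per_node_tb:.0f} TB")

Under those assumptions, 4 TB restreams in about 8 hours; the window
grows linearly with data per node.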
If you are clever enough to run more than one Cassandra instance on
the host, you protect yourself from the "bad" vnode behaviors (the
increased likelihood of an outage with 2 hosts down, the inability to
do simultaneous host joins/leaves/moves, etc.), but it requires
multiple IPs and a lot more effort.
So, how much data can you put onto a machine? Calculate your failure
rate. Calculate your rebuild time. Figure out your chances of two
failures in that same window, and the cost to your business of an
outage/data loss if that were to happen. Keep adjusting fill sizes /
ratios until you get numbers that work for you.
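As a sketch of that calculation (every input here is an assumption to
replace with your own measurements):

    # Given one node failure, estimate the chance a second node fails
    # before the rebuild finishes. All inputs are placeholders.
    nodes = 100           # assumed cluster size
    node_afr = 0.05       # assumed annualized failure rate per node
    rebuild_hours = 24.0  # plug in your rebuild-time estimate

    window_years = rebuild_hours / (24 * 365)
    # Per-node probability of failing within the rebuild window
    p_window = 1 - (1 - node_afr) ** window_years
    # P(at least one of the remaining nodes fails during the window)
    p_second = 1 - (1 - p_window) ** (nodes - 1)
    print(f"P(second failure during rebuild) ~= {p_second:.2%}")

Note that with many vnodes nearly any concurrent second failure
intersects some shared token range (Joey's math again), while with
single tokens only a few replica neighbors matter, so weight that
probability accordingly.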
On Wed, Jan 20, 2021 at 7:59 AM Joe Obernberger
<joseph.obernber...@gmail.com <mailto:joseph.obernber...@gmail.com>>
wrote:
Thank you Sean and Yakir. Is 4.x the same?
So if you were to build a 1 PByte system, you would want 512-1024
nodes? That doesn't seem space-efficient vs., say, 48 TByte nodes,
where you would need ~21 machines.
What would you do to build a 1 PByte configuration? I know there are a
lot of "it depends" answers to that question, but say it was a
write-heavy, light-read setup. Thank you!
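Spelling out my arithmetic (taking 1 PByte as ~1000 TBytes and ignoring
replication factor, which scales every option equally):

    # Node counts needed to hold ~1 PB at different per-node densities.
    import math

    total_tb = 1000  # ~1 PByte of data
    for node_tb in (1, 2, 3.5, 48):
        print(f"{node_tb:>4} TB/node -> {math.ceil(total_tb / node_tb):4d} nodes")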
-Joe
On 1/20/2021 10:06 AM, Durity, Sean R wrote:
Yakir is correct. While it is feasible to have large disk
nodes, the practical aspect of managing them is an issue. With
the current technology, I do not build nodes with more than
about 3.5 TB of disk available. I prefer 1-2 TB, but
costs/number of nodes can change the considerations.
Putting more than 1 node of Cassandra on a given host is also
possible, but you will want to consider your availability if
that hardware goes down. Losing 2 or more nodes with one
failure is usually not good.
NOTE: DataStax has some new features for supporting much
larger disks and alleviating many of the admin pains
associated with them. I don't have personal experience with it
yet, but I will be testing it soon. In my understanding it is
for use cases with massive disk needs but low-to-moderate
throughput (i.e., where node expansion is only for disk, not
additional traffic).
Sean Durity
*From:* Yakir Gibraltar <yaki...@gmail.com>
<mailto:yaki...@gmail.com>
*Sent:* Wednesday, January 20, 2021 9:21 AM
*To:* user@cassandra.apache.org <mailto:user@cassandra.apache.org>
*Subject:* [EXTERNAL] Re: Node Size
It is possible to use large nodes and they will work; the
problems with large nodes will be:
* Maintenance operations like joining/removing nodes will take more time.
* Larger heap
* etc.
On Wed, Jan 20, 2021 at 3:54 PM Joe Obernberger
<joseph.obernber...@gmail.com
<mailto:joseph.obernber...@gmail.com>> wrote:
Anyone know where I could find out more information on this?
Thanks!
-Joe
On 1/13/2021 8:42 AM, Joe Obernberger wrote:
> Reading the documentation on Cassandra 3.x, there are recommendations
> that node size should be ~1 TByte of data. Modern servers can have 24
> SSDs, each 2 TBytes in size, for data. Is that a bad idea for
> Cassandra? Does 4.0beta4 handle larger nodes?
> We have machines that have 16 8-TByte SATA drives - would that be a
> bad server for Cassandra? Would it make sense to run multiple copies
> of Cassandra on the same node in that case?
>
> Thanks!
>
> -Joe
>
--
*Best regards,*
*Yakir Gibraltar*