4.0 has gone some way toward enabling denser nodes, but it wasn't a main focus. We're probably still only thinking that 4TB - 8TB nodes will be feasible (and then maybe only for expert users). With dense nodes, the main problems tend to be streaming, compaction, and repairs.
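To see why streaming in particular hurts as nodes get denser, here is a rough back-of-envelope sketch of how long re-streaming a replacement node takes. The throughput figure is an illustrative assumption, not a benchmark; real streaming speed depends on hardware, compaction load, and version.

```python
import math

def rebuild_hours(node_tb: float, throughput_mb_s: float) -> float:
    """Hours to re-stream node_tb terabytes of data at throughput_mb_s MB/s."""
    total_mb = node_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return total_mb / throughput_mb_s / 3600

# 25 MB/s is a made-up "effective" rate to illustrate the scaling,
# not a measured Cassandra streaming throughput.
for tb in (2, 4, 8):
    print(f"{tb} TB node @ 25 MB/s effective: {rebuild_hours(tb, 25):.0f} hours")
```

The point is that replacement time grows linearly with node density, so an 8TB node can leave the cluster degraded for days rather than hours.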
Ebay uses Cassandra and claims to have 80+ petabytes. What do they do? They:

1. likely have a lot of nodes (1000+ node clusters are possible, just hard), and
2. undoubtedly spread that 80 petabytes across many clusters.

raft.so - Cassandra consulting, support, and managed services

On Fri, Apr 9, 2021 at 11:15 PM Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> We run a ~1PByte HBase cluster on top of Hadoop/HDFS that works pretty
> well. I would love to be able to use Cassandra instead on a system like
> that. HBase queries / scans are not the easiest to deal with, but, as with
> Cassandra, if you know the primary key, you can get to your data fast, even
> in trillions of rows. Cassandra offers some capabilities that HBase
> doesn't that I would like to leverage, but yeah - how can you use Cassandra
> with modern equipment in a bare metal environment? Kubernetes could make
> sense as long as you're able to maintain data locality with however your
> storage is configured.
>
> Even all SSDs - you can get a system with 24, 2 TByte SSDs, which is too
> large for 1 instance of Cassandra. Does 4.x address any of this?
>
> Ebay uses Cassandra and claims to have 80+ petabytes. What do they do?
>
> -Joe
>
> On 4/8/2021 6:35 PM, Elliott Sims wrote:
>
> I'm not sure I'd suggest building a single DIY Backblaze pod. The SATA
> port multipliers are a pain both from a supply chain and systems management
> perspective. Can be worth it when you're amortizing that across a lot of
> servers and can exert some leverage over wholesale suppliers, but less so
> for a one-off. There are a lot more whitebox/OEM/etc. options for
> high-density storage servers these days from Seagate, Dell, HP, Supermicro,
> etc. that are worth a look.
>
> I'd agree with this (both examples) sounding like a poor fit for
> Cassandra.
> Seems like you could always just spin up a bunch of Cassandra VMs in the
> ESX cluster instead of one big one, but something like MySQL or PostgreSQL
> might suit your needs better. Or even some sort of flat-file archive with
> something like Parquet, if it's more being kept "just in case" with no need
> for quick random access.
>
> For the 10PB example, it may be time to look at something like Hadoop, or
> maybe Ceph.
>
> On Thu, Apr 8, 2021 at 10:39 AM Bowen Song <bo...@bso.ng> wrote:
>
>> This is off-topic. But if your goal is to maximise storage density while
>> also ensuring data durability and availability, this is what you should
>> be looking at:
>>
>> - hardware:
>>   https://www.backblaze.com/blog/open-source-data-storage-server/
>> - architecture and software:
>>   https://www.backblaze.com/blog/vault-cloud-storage-architecture/
>>
>> On 08/04/2021 17:50, Joe Obernberger wrote:
>>
>> I am also curious about this question. Say your use case is to store
>> 10PBytes of data in a new server room / data-center with new equipment:
>> what makes the most sense? If your database is primarily write with
>> little read, I think you'd want to maximize disk space per rack space. So
>> you may opt for a 2U server with 24 3.5" disks at 16TBytes each for a
>> node with 384TBytes of disk - so ~27 servers for 10PBytes.
>>
>> Cassandra doesn't seem to be a good choice for that configuration; the
>> rule of thumb that I'm hearing is ~2TBytes per node, in which case we'd
>> need over 5000 servers. This seems really unreasonable.
>>
>> -Joe
>>
>> On 4/8/2021 9:56 AM, Lapo Luchini wrote:
>>
>> Hi, one project I wrote is using Cassandra to back the huge amount of
>> data it needs (data is written only once and read very rarely, but needs
>> to be accessible for years, so the storage needs become huge over time; I
>> chose Cassandra mainly for its horizontal scalability regarding disk
>> size) and a client of mine needs to install that on his hosts.
>> Problem is, while I usually use a cluster of 6 "smallish" nodes (which
>> can grow in time), he only has big ESX servers with huge disk space
>> (which is already RAID-6 redundant) but wouldn't have the possibility to
>> have 3+ nodes per DC.
>>
>> This is outside my usual experience with Cassandra and, as far as I've
>> read around, outside most use cases found on the website or this mailing
>> list, so the question is: does it make sense to use Cassandra with a big
>> (let's say 6TB today, up to 20TB in a few years) single-node DataCenter,
>> and another single-node DataCenter (to act as disaster recovery)?
>>
>> Thanks in advance for any suggestion or comment!
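Tying the thread's numbers together, a quick sizing sketch shows why the quoted single-node-DC layout fights the per-node rule of thumb. The 2TB/node density, RF=3, and 3-node minimum below are illustrative assumptions drawn from the discussion, not recommendations.

```python
import math

def nodes_per_dc(data_tb: float, rf: int, tb_per_node: float,
                 min_nodes: int = 3) -> int:
    """Nodes a DC needs so each node stays at or under tb_per_node
    after intra-DC replication, with a floor for basic availability."""
    return max(min_nodes, math.ceil(data_tb * rf / tb_per_node))

# Lapo's figures: 6 TB today, 20 TB in a few years, at the
# ~2 TB/node rule of thumb mentioned upthread (assumed RF=3 per DC).
for data_tb in (6, 20):
    print(f"{data_tb} TB, RF=3, 2 TB/node: "
          f"{nodes_per_dc(data_tb, 3, 2)} nodes per DC")
```

Under these assumptions even today's 6TB wants around nine small nodes per DC, which is why a single 6-20TB node per DataCenter sits so far outside the usual deployment pattern.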