> 
> ## Requirements
> 
> * ~1 PB usable space for file storage, extensible in the future
> * The files are mostly "hot" data, no cold storage
> * Purpose: storage for big files, mostly accessed from Windows
>   workstations (10G access)
> * Performance: the more, the better :)
> 
> 
> ## Global design
> 
> * 8+3 Erasure Coded pool

EC performance for RBD is going to be mediocre at best, esp. on spinners.
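To put rough numbers on that (Python sketch; the 4 MiB default RBD object size
is the only real constant here, the rest is illustrative):

```python
# Illustrative only: write fan-out of the proposed 8+3 EC pool vs. 3x
# replication, assuming the default 4 MiB RBD object size.
k, m = 8, 3                       # data chunks + coding chunks from the proposal
object_size = 4 * 1024 * 1024     # default RBD object size, bytes

chunk = object_size // k          # what each OSD stores for one object
print(f"chunk per OSD: {chunk // 1024} KiB")                # 512 KiB
print(f"OSDs touched per object write, EC 8+3: {k + m}")    # 11
print("OSDs touched per object write, 3x rep: 3")
# Worse, a client write that covers only part of a stripe forces a
# read-modify-write of the whole stripe on EC, which is where the latency
# pain on spinners really shows up.
```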

> * ZFS on RBD, exposed via samba shares (cluster with failover)

Why ZFS? Mind you I like ZFS, but layering it on top of RBD is more overhead 
and complexity.

>   * 128 GB RAM

Nowhere near enough.  You're going to want 256 GB at the very least.
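Back-of-the-envelope sketch (Python; the drive count per node is my assumption,
not from your post):

```python
# Rough RAM sizing for an OSD node.  24 OSDs per chassis is hypothetical;
# substitute your real drive count.
osds_per_node = 24            # assumption, not from the original post
osd_memory_target_gib = 4     # BlueStore default osd_memory_target
recovery_headroom = 1.5       # OSDs overshoot the target during recovery/backfill
os_and_gateways_gib = 16      # OS, plus Samba/ZFS or mon/mgr if colocated

needed_gib = osds_per_node * osd_memory_target_gib * recovery_headroom + os_and_gateways_gib
print(f"~{needed_gib:.0f} GiB recommended")   # ~160 GiB with these inputs
# 128 GiB is already marginal at 24 OSDs, and denser chassis (or ZFS ARC on
# the same boxes) push you toward 256 GiB quickly.
```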

> * Networking : 2 x Cisco N3K 3132Q or 3164Q
>   * 2 x 40G per server for ceph network (LACP/VPC for HA)
>   * 2 x 40G per server for public network (LACP/VPC for HA)

Don’t bother with a separate replication (cluster) network.

> * We're used to running mon and mgr daemons on a few of our OSD nodes, without
>   any issue so far: is this a bad idea for a big cluster?

Contention for resources can lead to a vicious circle, and a failure or
maintenance event that takes out a mon/mgr and OSDs at the same time can be
ugly.  Put your mons on something cheap: five of them, or three if you must.

> * We thought about using cache tiering on an SSD pool, but a large part of the PB
>   is used on a daily basis, so we expect the cache would not be very effective
>   and would be really expensive?

Cache tiering is deprecated at best.  Not a good idea to invest in it.  If 
you’re going to use SSDs, there are better ways.
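Rough numbers on why the cache doesn't pay off (the daily working-set fraction
below is a guess, plug in your own measurement):

```python
# Rough cost of a cache tier over a mostly-hot 1 PB.
usable_pb = 1.0
daily_working_set_fraction = 0.4   # hypothetical "large part of the PB"
cache_replication = 3              # cache pools are typically 3x replicated SSD

working_set_tb = usable_pb * 1000 * daily_working_set_fraction
raw_ssd_tb = working_set_tb * cache_replication
print(f"raw SSD just for the cache tier: ~{raw_ssd_tb:.0f} TB")   # ~1200 TB
# At that scale you are effectively buying a second, all-flash cluster, so
# spend the money on flash OSDs (or WAL/DB devices) instead of a cache tier.
```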

> * Could a 2x10G network be enough?

Yes.
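If you want to sanity-check that with arithmetic (drive count and per-drive
throughput here are assumptions, not from your post):

```python
# With HDD OSDs the disks, not a bonded 2x10G link, are the bottleneck.
bond_gbit = 2 * 10
bond_gbyte = bond_gbit / 8                    # ~2.5 GB/s ceiling per node

hdds_per_node = 24                            # hypothetical chassis
mb_per_s_per_hdd = 100                        # realistic mixed-workload figure,
                                              # well below the streaming spec sheet
disk_gbyte = hdds_per_node * mb_per_s_per_hdd / 1000

print(f"bonded NIC ceiling:        {bond_gbyte:.1f} GB/s")
print(f"realistic disk throughput: {disk_gbyte:.1f} GB/s")
# The two are already in the same ballpark, and your 10G clients won't be
# pushing more than that per server anyway.
```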

> * ZFS on Ceph? Any thoughts?

ZFS is great, but unless you have a specific need, it sounds like a lot of 
overhead and complexity.

> * Hardware RAID with battery-backed write cache - will allow OSDs to ack
>   writes before hitting spinning rust.

Disagree.  See my litany from a few months ago.  Use a plain, IT-mode HBA.  
Take the $$ you save and put it toward building your cluster out of SSDs 
instead of HDDs.  That way you don’t have to mess with the management hassles 
of maintaining and allocating external WAL+DB partitions too.

> 3x replication instead of EC

This.  The performance of EC RBD volumes will likely disappoint you, especially
on spinners.  Having suffered 3R RBD on LFF spinners myself, I predict that you
would also be unhappy unless your use-case is only archival / backups or some
other cold, latency-tolerant workload.
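For the capacity side of that trade-off (simple multipliers against the ~1 PB
usable requirement quoted above, ignoring full-ratio headroom):

```python
# Raw capacity needed for ~1 PB usable under 8+3 EC vs. 3x replication.
usable_pb = 1.0
ec_k, ec_m = 8, 3

ec_raw = usable_pb * (ec_k + ec_m) / ec_k     # 1.375 PB
rep_raw = usable_pb * 3                       # 3.0 PB

print(f"8+3 EC raw:         {ec_raw:.2f} PB")
print(f"3x replication raw: {rep_raw:.2f} PB")
# Replication costs a bit over twice the raw disk of 8+3 EC; that delta is
# the price of RBD latency that spinners can actually deliver, or part of
# the budget for going all-flash instead.
```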
