[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Burkhard Linke
Hi, On 2/2/21 9:32 PM, Loïc Dachary wrote: Hi Greg, On 02/02/2021 20:34, Gregory Farnum wrote: *snipsnap* Right. Dan's comment gave me pause: it does not seem to be a good idea to assume an RBD image of infinite size. A friend who read this thread suggested a sensible approach (which als

[ceph-users] Re: XFS block size on RBD / EC vs space amplification

2021-02-03 Thread Konstantin Shalygin
Actually, with the latest patches from Igor the default min alloc size for HDD is 4K. k Sent from my iPhone > On 2 Feb 2021, at 13:12, Gilles Mocellin > wrote: > > Hello, > > As we know, with 64k for bluestore_min_alloc_size_hdd (I'm only using HDDs), > in certain conditions, especially with erasure codi
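A back-of-the-envelope sketch of why the old 64K HDD default hurts small writes on EC pools; the 4+2 profile and the 16 KiB object size below are illustrative assumptions, not values taken from the thread, and real BlueStore allocation has more moving parts than this:

# Rough space amplification for one small object on an EC pool.
# Assumptions: k=4,m=2 erasure-code profile, 16 KiB logical object.
k, m = 4, 2                      # k data chunks + m coding chunks
logical = 16 * 1024              # object size in bytes

def stored_bytes(min_alloc):
    chunk = logical / k                  # data bytes per chunk (4 KiB here)
    per_osd = max(chunk, min_alloc)      # each chunk pads up to one allocation unit
    return per_osd * (k + m)

for min_alloc in (64 * 1024, 4 * 1024):
    amp = stored_bytes(min_alloc) / logical
    print(f"min_alloc={min_alloc // 1024}K -> "
          f"{stored_bytes(min_alloc) / 1024:.0f}K stored, amplification x{amp:.1f}")
# 64K: 6 * 64K = 384K for 16K of data (x24.0); 4K: 6 * 4K = 24K (x1.5)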

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Loïc Dachary
> > Just my 2 cents: > > You could use the first byte of the SHA sum to identify the image, e.g. using > a fixed number of 256 images. Or some flexible approach similar to the way > filestore used to store rados objects. A friend suggested the same to save space. Good idea. OpenPGP_signatu
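A minimal sketch of the 256-image sharding idea, assuming SHA-256 digests and a hypothetical "packfile-XX" image naming scheme (neither the hash algorithm nor the names are specified in the thread):

import hashlib

def image_for(content: bytes) -> str:
    """Route an object to one of 256 RBD images using the first byte
    of its SHA-256 digest."""
    first_byte = hashlib.sha256(content).digest()[0]   # value in 0..255
    return f"packfile-{first_byte:02x}"                # hypothetical image name

for blob in (b"hello", b"world"):
    print(image_for(blob))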

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Burkhard Linke
Hi, On 2/3/21 9:41 AM, Loïc Dachary wrote: Just my 2 cents: You could use the first byte of the SHA sum to identify the image, e.g. using a fixed number of 256 images. Or some flexible approach similar to the way filestore used to store rados objects. A friend suggested the same to save spac

[ceph-users] Worst thing that can happen if I have size= 2

2021-02-03 Thread Mario Giammarco
Hello, Imagine this situation: - 3 servers with ceph - a pool with size 2 min 1 I know perfectly well that size 3 and min 2 is better. I would like to know what is the worst thing that can happen: - a disk breaks and another disk breaks before ceph has reconstructed the second replica, ok I lose data - if

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Magnus HAGDORN
if an OSD becomes unavailable (broken disk, rebooting server) then all I/O to the PGs stored on that OSD will block until a replication level of 2 is reached again. So, for a highly available cluster you need a replication level of 3 On Wed, 2021-02-03 at 10:24 +0100, Mario Giammarco wrote: > Hello,
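A minimal illustration (not Ceph code) of the min_size rule under discussion, assuming the usual behaviour that a PG only serves I/O while at least min_size replicas are up:

def pg_serves_io(replicas_up: int, min_size: int) -> bool:
    """A PG accepts I/O only while it has at least min_size replicas available."""
    return replicas_up >= min_size

# size=2 pool with one OSD down:
print(pg_serves_io(replicas_up=1, min_size=2))  # False -> I/O blocks until recovery
print(pg_serves_io(replicas_up=1, min_size=1))  # True  -> writes land on a single copy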

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Max Krasilnikov
Good day! Wed, Feb 03, 2021 at 09:29:52AM +, Magnus.Hagdorn wrote: > if an OSD becomes unavailable (broken disk, rebooting server) then all > I/O to the PGs stored on that OSD will block until replication level of > 2 is reached again. So, for a highly available cluster you need a > repli

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Magnus HAGDORN
On Wed, 2021-02-03 at 09:39 +, Max Krasilnikov wrote: > > if an OSD becomes unavailable (broken disk, rebooting server) then all > > I/O to the PGs stored on that OSD will block until replication level of > > 2 is reached again. So, for a highly available cluster you need a > > replicatio

[ceph-users] Re: XFS block size on RBD / EC vs space amplification

2021-02-03 Thread Gilles Mocellin
Hello, thank you for your response. Erasure Coding gets better and we really cannot afford the storage overhead of x3 replication. Anyway, as I understand the problem, it is also present with replication, just less amplified (blocks are not divided between OSDs, just replicated fully). Le 2
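As a rough comparison under the same illustrative assumptions as the earlier sketch (a 16 KiB object and the old 64K bluestore_min_alloc_size_hdd), replication pads one full-size copy per replica instead of padding every EC chunk, which is the "less amplified" effect described above:

# Replicated x3: the whole 16 KiB object pads to 64K on each of 3 OSDs.
replicated = 3 * 64 * 1024        # 192 KiB stored -> x12 amplification
# EC 4+2: six 4 KiB chunks each pad to 64K (see the earlier sketch).
erasure = 6 * 64 * 1024           # 384 KiB stored -> x24 amplification
print(replicated / (16 * 1024), erasure / (16 * 1024))   # 12.0 24.0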

[ceph-users] Re: XFS block size on RBD / EC vs space amplification

2021-02-03 Thread Gilles Mocellin
Hello, thanks, I've seen that. But is it the only solution? Do I have alternatives for my use case, like forcing big blocks client side? I said XFS (4k block size), but perhaps straight in krbd, as it seems the block device is shown as a drive with 512B sectors. But I don't really

[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-03 Thread Eugen Block
How do you manage the db_sizes of your SSDs? Is that managed automatically by ceph-volume? You could try to add another config and see what it does, maybe try to add block_db_size? Quoting Tony Liu: All mon, mgr, crash and osd are upgraded to 15.2.8. It actually fixed another issue (no

[ceph-users] Re: replace OSD without PG remapping

2021-02-03 Thread Frank Schilder
You asked about exactly this before: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IGYCYJTAMBDDOD2AQUCJQ6VSUWIO4ELW/#ZJU3555Z5WQTJDPCTMPZ6LOFTIUKKQUS It is not possible to avoid remapping, because if the PGs are not remapped you would have degraded redundancy. In any storage sy

[ceph-users] 15.2.9 ETA?

2021-02-03 Thread David Orman
Hi, We're looking forward to a few of the major bugfixes for breaking mgr issues with larger clusters that have been merged into the Octopus branch, as well as the updated cheroot pushed to EPEL, which should make it into the next container build. It's been quite some time since the last (15.2.8) release,

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Adam Boyhan
Isn't this somewhat reliant on the OSD type? Redhat/Micron/Samsung/Supermicro have all put out white papers backing the idea of 2 copies on NVMe's as safe for production. From: "Magnus HAGDORN" To: pse...@avalon.org.ua Cc: "ceph-users" Sent: Wednesday, February 3, 2021 4:43:08 AM Subjec

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread DHilsbos
Adam; I'd like to see that / those white papers. I suspect what they're advocating is multiple OSD daemon processes per NVMe device. This is something which can improve performance. Though I've never done it, I believe you partition the device, and then create your OSD pointing at a partitio

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Martin Verges
Hello Adam, 2 copies are safe, min size 1 is not. As long as there is no write while one copy is missing, you can recover from that or from the unavailable copy when it comes online again. If you have min size 1 and therefore write data to a single copy, no safety net will protect you. In gen

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Adam Boyhan
No problem. They have been around for quite some time now. Even speaking to the Ceph engineers over at supermicro while we spec'd our hardware, they agreed as well. https://www.supermicro.com/white_paper/white_paper_Ceph-Ultra.pdf

[ceph-users] reinstalling node with orchestrator/cephadm

2021-02-03 Thread Kenneth Waegeman
Hi all, I'm running a 15.2.8 cluster using ceph orch with all daemons adopted to cephadm. I tried reinstalling an OSD node. Is there a way to make ceph orch/cephadm activate the devices on this node again, ideally automatically? I tried running `cephadm ceph-volume -- lvm activate --all` but t

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Simon Ironside
On 03/02/2021 09:24, Mario Giammarco wrote: Hello, Imagine this situation: - 3 servers with ceph - a pool with size 2 min 1 I know perfectly the size 3 and min 2 is better. I would like to know what is the worst thing that can happen: Hi Mario, This thread is worth a read, it's an oldie but a

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Mario Giammarco
Thanks Simon and thanks to the other people who have replied. Sorry, let me try to explain myself better. It is evident to me that if I have two copies of data, one breaks, and while ceph is creating a new copy of the data the disk with the second copy also breaks, you lose the data. It is obvious and a

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Dan van der Ster
Ceph is many times riskier with min_size 1 than good old raid1: With raid1, having disks A and B, when disk A fails, you start recovery to a new disk A'. If disk B fails during recovery then you have a disaster. With Ceph, we have multiple servers and multiple disks: When an OSD fails an
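A toy calculation of this argument, with made-up numbers: during the recovery window, raid1 is only exposed to the one surviving mirror failing, while a size=2/min_size=1 pool is exposed to the failure of any of the many OSDs that hold the sole surviving copies of the failed OSD's PGs:

# Illustrative numbers only: p is the chance any one disk dies during recovery.
p = 0.001           # assumed per-disk failure probability during the recovery window
n_peers = 50        # assumed number of OSDs holding the sole surviving copies
                    # of the failed OSD's PGs

raid1_risk = p                          # only the single mirror disk matters
ceph_risk = 1 - (1 - p) ** n_peers      # any one of the peer OSDs failing loses some PGs
print(f"raid1: {raid1_risk:.4f}   ceph size=2/min_size=1: {ceph_risk:.4f}")
# ~0.0010 vs ~0.0488: roughly n_peers times more exposure during recovery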

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Loïc Dachary
Hi Matt, I did not know about pixz, thanks for the pointer. The idea it implements is also new to me and it looks like it can usefully be applied to this use case. I'm not going to say "awesome" because I can't grasp how useful it really is right now. But I'll definitely think about it :-) Chee

[ceph-users] Re: Worst thing that can happen if I have size= 2

2021-02-03 Thread Simon Ironside
On 03/02/2021 19:48, Mario Giammarco wrote: It is obvious and a bit paranoid, because many servers at many customers run on raid1, and so you are saying: yeah, you have two copies of the data but you can break both. Consider that in ceph recovery is automatic, while with raid1 someone must manually g

[ceph-users] mon db high iops

2021-02-03 Thread Seena Fallah
Hi all, My monitor nodes are flapping up and down because of paxos lease timeouts, and there is high iops (2k iops) and 500MB/s throughput on /var/lib/ceph/mon/ceph.../store.db/. My cluster is in a recovery state and there is a bunch of degraded pgs on my cluster. It seems it's doing a 200k block
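A quick sanity check on those figures, assuming they are sustained averages (the interpretation is a guess, since the message is truncated):

iops = 2_000
throughput = 500 * 1024 * 1024            # 500 MB/s, treated as MiB/s for simplicity
print(f"{throughput / iops / 1024:.0f} KiB per op")   # ~256 KiB per operation
# Writes of roughly 200-250 KiB would be consistent with RocksDB compaction in
# the mon store rather than lots of small random updates.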

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Matt Wilder
If it were me, I would do something along the lines of: - Bundle larger blocks of code into pixz (essentially indexed tar files, allowing random access) and store them in RadosGW. - Build a small frontend that fetches (with caching) them and provides the file content
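A hypothetical sketch of that frontend, assuming RGW's S3 API via boto3, a local file cache, and pixz's indexed single-member extraction piped through tar; the endpoint, bucket layout, paths and names are all invented for illustration, not a tested design:

import os
import subprocess
import boto3

s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:7480")  # assumed RGW endpoint
CACHE_DIR = "/var/cache/bundles"                                      # assumed local cache

def fetch_bundle(bucket: str, key: str) -> str:
    """Download the .tpxz bundle once and reuse the cached copy afterwards."""
    path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        s3.download_file(bucket, key, path)
    return path

def read_member(bundle_path: str, member: str) -> bytes:
    """Extract one file: pixz -x emits a partial tar stream, tar -xO prints its content."""
    with open(bundle_path, "rb") as f:
        pixz = subprocess.run(["pixz", "-x", member], stdin=f,
                              capture_output=True, check=True)
    tar = subprocess.run(["tar", "-xO", "-f", "-", member], input=pixz.stdout,
                         capture_output=True, check=True)
    return tar.stdout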

[ceph-users] Re: replace OSD without PG remapping

2021-02-03 Thread Tony Liu
Thank you Frank, "degradation is exactly what needs to be avoided/fixed at all cost", loud and clear, point taken! I didn't actually quite get it last time. I used to think degradation would be OK, but now I agree with you, that is not OK at all for production storage. Appreciate your patience!

[ceph-users] Re: Using RBD to pack billions of small files

2021-02-03 Thread Loïc Dachary
Hi Federico, On 04/02/2021 05:51, Federico Lucifredi wrote: > Hi Loïc, > I am intrigued, but am missing something: why not use RGW and store the > source code files as objects? RGW has native compression and can take care of > that behind the scenes. Excellent question! > > Is the desi