[ceph-users] CephFS per client monitoring

2021-02-02 Thread Erwin Bogaard
Hi,

we're mainly using CephFS to give access to storage.
At all times we can see that all clients combined use "X MiB/s" and "Y
op/s" for read and write, using the CLI or the Ceph dashboard.
With a tool like iftop, I can get a bit of insight into which clients most
of the data 'flows' to, but it isn't really precise.

Is there any way to get a MiB/s and op/s number per CephFS client?

Thanks,
Erwin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: no device listed after adding host

2021-02-02 Thread Eugen Block
Just a note: you don't need to install any additional package to run  
ceph-volume:


host1:~ # cephadm ceph-volume lvm list

Did you resolve the missing OSDs, since you posted a follow-up
question? If not, did you check all the logs on the OSD host, e.g.
'journalctl -f' or ceph-volume.log in /var/log/ceph//? I would
expect to find clues there about what's going on.



Zitat von Tony Liu :


Hi Eugen,

I installed ceph-osd on the osd-host to run ceph-volume,
which then lists all devices. But "ceph orch device ls" on the
controller (mon and mgr) still doesn't show those devices.
This worked when I initially built the cluster. Not sure what
is missing here. Trying to find out how to trace it. Any idea?


Thanks!
Tony

-Original Message-
From: Eugen Block 
Sent: Monday, February 1, 2021 12:33 PM
To: Tony Liu 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: no device listed after adding host

Hi,

you could try

ceph-volume inventory

to see if it finds or reports anything.


Zitat von Tony Liu :

> "ceph log last cephadm" shows the host was added without errors.
> "ceph orch host ls" shows the host as well.
> "python3 -c import sys;exec(...)" is running on the host.
> But still no devices on this host is listed.
> Where else can I check?
>
> Thanks!
> Tony
>> -Original Message-
>> From: Tony Liu 
>> Sent: Sunday, January 31, 2021 9:23 PM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] no device listed after adding host
>>
>> Hi,
>>
>> I added a host by "ceph orch host add ceph-osd-5 10.6.10.84 ceph-osd".
>> I can see the host by "ceph orch host ls", but no devices listed by
>> "ceph orch device ls ceph-osd-5". I tried "ceph orch device zap
>> ceph-osd-5 /dev/sdc --force", which works fine. Wondering why no
>> devices listed? What I am missing here?
>>
>>
>> Thanks!
>> Tony
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-02 Thread Eugen Block
Have you tried with a newer version of Ceph? There has been a major
rewrite of ceph-volume in 15.2.8 [1]; maybe this has already been resolved?


[1] https://docs.ceph.com/en/latest/releases/octopus/#notable-changes
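
If the cluster is managed by cephadm, the upgrade itself is a single
command (a sketch; check the upgrade documentation for your environment
first):

```
ceph orch upgrade start --ceph-version 15.2.8
# and watch the progress with
ceph orch upgrade status
```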


Zitat von Tony Liu :


Hi,

When I initially built the cluster with Octopus 15.2.5, this is the OSD
service spec file that was applied.
```
service_type: osd
service_id: osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
data_devices:
  rotational: 1
db_devices:
  rotational: 0
```
After applying it, all HDDs were added and the DB for each HDD was created
on SSD.

Here is the export of OSD service spec.
```
# ceph orch ls --service_name osd.osd-spec --export
service_type: osd
service_id: osd-spec
service_name: osd.osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
spec:
  data_devices:
rotational: 1
  filter_logic: AND
  objectstore: bluestore
```
Why doesn't db_devices show up there?

When I replaced a disk recently, once the new disk was installed and
zapped, the OSD was automatically re-created, but its DB was created on the
HDD, not the SSD. I assume this is because of that missing db_devices?

I tried to update the service spec, with the same result: db_devices
doesn't show up when I export it.

Is this a known issue, or is there something I am missing?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS per client monitoring

2021-02-02 Thread Venky Shankar
On Tue, Feb 2, 2021 at 1:55 PM Erwin Bogaard  wrote:
>
> Hi,
>
> we're using mainly CephFS to give access to storage.
> At all times we can see that all clients combines use "X MiB/s" and "y
> op/s" for read and write by using the cli or ceph dashboard.
> With a tool like iftop, I can get a bit of insight to which clients most
> data 'flows', but it isn't really precise.
>
> is there any way to get a MiB/s and op/s number per CephFS client?

cephfs-top is one option and would be available in Pacific release --
https://docs.ceph.com/en/latest/cephfs/cephfs-top/

Right now, just a handful of client metrics are tracked. Many others
will be added in the future.
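
Once you're on Pacific, getting it running would roughly look like this
(taken from the docs above, so treat it as a sketch rather than tested
instructions):

```
# enable the MDS metrics aggregation module in the manager
ceph mgr module enable stats

# cephfs-top uses a dedicated client by default
ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r'

# then run the curses-based, top-like tool
cephfs-top
```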

>
> Thanks,
> Erwin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Increasing QD=1 performance (lowering latency)

2021-02-02 Thread Wido den Hollander
Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest are 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.

With some tuning you can get to ~700us latency for a 4k write with
qd=1 (replication, size=3).

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

A 700us latency means the result will be roughly 1500 IOPS (1000 / 0.7).

Compared to, let's say, a BSD machine running ZFS, that's on the low
side. With ZFS+NVMe you'll be able to reach somewhere between 7,000 and
10,000 IOPS; the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3,84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governor to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
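
As a rough illustration, the Ceph and governor parts of that tuning look
something like this (example values, adapt to your environment):

```
# silence the most expensive debug logging on the OSDs
ceph config set osd debug_osd 0/0
ceph config set osd debug_ms 0/0
ceph config set osd debug_bluestore 0/0

# force the performance governor on every OSD node
cpupower frequency-set -g performance

# C-states are typically limited via the kernel cmdline (e.g.
# processor.max_cstate=1) or a tuned profile such as latency-performance
```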

Higher clock speeds (new AMD Epyc coming in March!) help to reduce the
latency, and going towards 25Gbit/100Gbit might help as well.

These are, however, only very small increments and might reduce the
latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.

In the meantime, am I overlooking something here? Can we further reduce
the latency of the current OSDs?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd recommended scheduler

2021-02-02 Thread mj

Hi!

Interesting. Didn't know that.

We are running cfq on the OSDs, and added to ceph.conf:


[osd]
osd_disk_thread_ioprio_class  = idle
osd_disk_thread_ioprio_priority = 7


Since we recently switched from HDD to SSD OSDs, I guess we should also 
change from CFQ to noop.


Is there also something we need to change accordingly in ceph.conf?

We simply added to rc.local:


echo cfq > /sys/block/sda/queue/scheduler



echo cfq > /sys/block/sdf/queue/scheduler


Anything else to do, besides changing cfq to noop in the above..?

Thanks for the tip!

MJ

On 2/2/21 8:44 AM, Wido den Hollander wrote:



On 28/01/2021 18:09, Andrei Mikhailovsky wrote:


Hello everyone,

Could someone please let me know what is the recommended modern
kernel disk scheduler that should be used for SSD and HDD osds? The
information in the manuals is pretty dated and refers to schedulers
that have been deprecated in recent kernels.




Afaik noop is usually the one used for flash devices.

CFQ is used on HDDs most of the time as it allows for better 
scheduling/QoS.


Wido


Thanks

Andrei
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] XFS block size on RBD / EC vs space amplification

2021-02-02 Thread Gilles Mocellin

Hello,

As we know, with 64k for bluestore_min_alloc_size_hdd (I'm only using
HDDs), in certain conditions, especially with erasure coding, there's a
waste of space when writing objects smaller than 64k x k (EC: k+m).

Every object is divided into k elements, each written on a different OSD.
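
To make the amplification concrete, a small worked example, assuming an
EC 4+2 pool and the default 64k bluestore_min_alloc_size_hdd (my real
profile may differ):

```
object size            = 100 KiB
chunk per data OSD     = 100 KiB / 4  =  25 KiB    (k = 4)
allocated per chunk    =  64 KiB                   (rounded up to min_alloc_size)
raw space written      = 6 x 64 KiB   = 384 KiB    (k + m = 6 chunks)
raw space ideally      = 6 x 25 KiB   = 150 KiB    -> amplification ~2.5x
```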

My main use case is big (40TB) RBD images mounted as XFS filesystems on 
Linux servers,

exposed to our backup software.
So, it's mainly big files.

My thought, but I'd like some other points of view, is that I could deal
with the amplification by using bigger block sizes on my XFS filesystems,
instead of reducing bluestore_min_alloc_size_hdd on all OSDs.

What do you think?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-02 Thread Eugen Block

Hi,

I would recommend updating (again); here's my output from a 15.2.8
test cluster:



host1:~ # ceph orch ls --service_name osd.default --export
service_type: osd
service_id: default
service_name: osd.default
placement:
  hosts:
  - host4
  - host3
  - host1
  - host2
spec:
  block_db_size: 4G
  data_devices:
rotational: 1
size: '20G:'
  db_devices:
size: '10G:'
  filter_logic: AND
  objectstore: bluestore


Regards,
Eugen


Zitat von Tony Liu :


Hi,

When I initially built the cluster with Octopus 15.2.5, this is the OSD
service spec file that was applied.
```
service_type: osd
service_id: osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
data_devices:
  rotational: 1
db_devices:
  rotational: 0
```
After applying it, all HDDs were added and the DB for each HDD was created
on SSD.

Here is the export of OSD service spec.
```
# ceph orch ls --service_name osd.osd-spec --export
service_type: osd
service_id: osd-spec
service_name: osd.osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
spec:
  data_devices:
rotational: 1
  filter_logic: AND
  objectstore: bluestore
```
Why doesn't db_devices show up there?

When I replaced a disk recently, once the new disk was installed and
zapped, the OSD was automatically re-created, but its DB was created on the
HDD, not the SSD. I assume this is because of that missing db_devices?

I tried to update the service spec, with the same result: db_devices
doesn't show up when I export it.

Is this a known issue, or is there something I am missing?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd recommended scheduler

2021-02-02 Thread Dan van der Ster
cfq, and now bfq, are the only I/O schedulers that implement fair sharing
across processes, and they are also the only schedulers that implement
I/O priorities (e.g. ionice).

We run this script via rc.local on all our ceph clusters:

https://gist.github.com/dvanders/968d5862f227e0dd988eb5db8fbba203
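
The core of that approach boils down to a few lines; a simplified sketch
(not the gist itself, and bfq may need 'modprobe bfq' first on some
kernels):

```
# pick an I/O scheduler per device based on its rotational flag
for dev in /sys/block/sd*; do
    if [ "$(cat "$dev/queue/rotational")" = "1" ]; then
        echo bfq  > "$dev/queue/scheduler"   # HDD: fair sharing and ionice support
    else
        echo none > "$dev/queue/scheduler"   # flash: minimal scheduling overhead
    fi
done
```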

-- dan

On Tue, Feb 2, 2021 at 11:08 AM Andrei Mikhailovsky  wrote:
>
> Thanks for your reply, Wido.
>
> Isn't CFQ being deprecated in the latest kernel versions? From what I've read 
> in the Ubuntu support pages, the cfq, deadline and noop are no longer 
> supported since 2019 / kernel version 5.3 and later. There are, however, the 
> following schedulers:  bfq, kyber, mq-deadline and none. Could someone please 
> suggest which of these new schedulers does ceph team recommend using for HDD 
> drives and SSD drives? We have both drive types in use.
>
> Many thanks
>
> Andrei
>
> - Original Message -
> > From: "Wido den Hollander" 
> > To: "Andrei Mikhailovsky" , "ceph-users" 
> > 
> > Sent: Tuesday, 2 February, 2021 07:44:13
> > Subject: Re: [ceph-users] osd recommended scheduler
>
> > On 28/01/2021 18:09, Andrei Mikhailovsky wrote:
> >>
> >> Hello everyone,
> >>
> >> Could some one please let me know what is the recommended modern kernel 
> >> disk
> >> scheduler that should be used for SSD and HDD osds? The information in the
> >> manuals is pretty dated and refer to the schedulers which have been 
> >> deprecated
> >> from the recent kernels.
> >>
> >
> > Afaik noop is usually the one use for Flash devices.
> >
> > CFQ is used on HDDs most of the time as it allows for better scheduling/QoS.
> >
> > Wido
> >
> >> Thanks
> >>
> >> Andrei
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg repair or pg deep-scrub does not start

2021-02-02 Thread Marcel Kuiper
Hi

I've got an old cluster running Ceph 10.2.11 with the filestore backend.
Last week a PG was reported inconsistent with a scrub error:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 38.20 is active+clean+inconsistent, acting [1778,1640,1379]
1 scrub errors

I first tried 'ceph pg repair' but nothing seemed to happen, then

# rados list-inconsistent-obj 38.20 --format=json-pretty

showed that the problem was on osd 1379. The logs showed that that osd had
read errors, so I decided to mark that osd out for replacement. Later on I
removed it from the crush map and deleted the osd. My thought was that
the missing replica would get backfilled onto another osd and everything
would be OK again. The PG got another osd assigned, but the health error stayed:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 38.20 is active+clean+inconsistent, acting [1778,1640,1384]
1 scrub errors

Now I get an error on:

# rados list-inconsistent-obj 38.20 --format=json-pretty
No scrub information available for pg 38.20
error 2: (2) No such file or directory

And if I try

# ceph pg deep-scrub 38.20
instructing pg 38.20 on osd.1778 to deep-scrub

The deep scrub does not get scheduled. The same goes for

# ceph daemon osd.1778 trigger_scrub 38.20   (run on the storage node)

Nothing appears in the logs concerning the scrubbing of PG 38.20. I do see
in the log that other PGs get (deep) scrubbed according to the automatic
scheduling.

There is no recovery going on, but just to be sure I set
'ceph daemon osd.1778 config set osd_scrub_during_recovery true'.

Also, the load limit is set way higher than the actual system load.

I checked the other osds and there are no scrubs going on on them when I
schedule the deep scrub.

I found some reports of people that had the same problem, but no
solution was found (for example https://tracker.ceph.com/issues/15781).
Even in Mimic and Luminous there were similar cases.

- Does anyone know what logging I should increase in order to get more
information as to why my deep scrub does not get scheduled (see the
example below)?
- Is there a way in Jewel to see the list of scheduled scrubs and their
dates for an osd?
- Does anyone have advice on how to proceed in clearing this PG error?
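
For the first question, this is roughly what I have in mind on the node
holding the acting primary, though I'm unsure which subsystem/level is the
relevant one:

```
# show the current scrub-related settings via the admin socket
ceph daemon osd.1778 config show | grep scrub

# temporarily raise OSD debug logging to see scrub scheduling decisions
ceph tell osd.1778 injectargs '--debug_osd 10/10'
```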

Thanks for any help

Marcel



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd recommended scheduler

2021-02-02 Thread Andrei Mikhailovsky
Thanks for your reply, Wido.

Isn't CFQ deprecated in the latest kernel versions? From what I've read
on the Ubuntu support pages, cfq, deadline and noop are no longer supported
since 2019 / kernel version 5.3 and later. There are, however, the following
schedulers: bfq, kyber, mq-deadline and none. Could someone please suggest
which of these new schedulers the Ceph team recommends for HDD drives and
SSD drives? We have both drive types in use.

Many thanks

Andrei

- Original Message -
> From: "Wido den Hollander" 
> To: "Andrei Mikhailovsky" , "ceph-users" 
> 
> Sent: Tuesday, 2 February, 2021 07:44:13
> Subject: Re: [ceph-users] osd recommended scheduler

> On 28/01/2021 18:09, Andrei Mikhailovsky wrote:
>> 
>> Hello everyone,
>> 
>> Could some one please let me know what is the recommended modern kernel disk
>> scheduler that should be used for SSD and HDD osds? The information in the
>> manuals is pretty dated and refer to the schedulers which have been 
>> deprecated
>> from the recent kernels.
>> 
> 
> Afaik noop is usually the one use for Flash devices.
> 
> CFQ is used on HDDs most of the time as it allows for better scheduling/QoS.
> 
> Wido
> 
>> Thanks
>> 
>> Andrei
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: no device listed after adding host

2021-02-02 Thread Juan Miguel Olmo Martinez
Hi Eugen Block


Useful tips for creating OSDs:

1. Check device availability on your cluster hosts:
# ceph orch device ls

2. Devices not available:
This usually means that you have created LVs on these devices (i.e. the
devices are not clean). A "ceph orch device zap ..." will fix that.

3. The OSD does not start: check what its status is with:
ceph orch ls osd --format yaml


-- 

Juan Miguel Olmo Martínez

Senior Software Engineer

Red Hat 

jolmo...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: `cephadm` not deploying OSDs from a storage spec

2021-02-02 Thread Juan Miguel Olmo Martinez
Hi Davor,

Use "ceph orch ls osd --format yaml" to have more info about the problems
deploying the osd service, probably that will give you clues about what is
happening. Share the input if you cannot solve the problem:-)

The same command can be used for other services like the node-exporter,
although in that case I think that the problem was a bug fixed a few days
ago.
https://github.com/ceph/ceph/pull/38946
The fix was backported to pacific last week.

BR


-- 

Juan Miguel Olmo Martínez

Senior Software Engineer

Red Hat 

jolmo...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is unknown pg going to be active after osds are fixed?

2021-02-02 Thread Jeremy Austin
I'm in a similar but not identical situation.

I was in the middle of a rebalance on a small test cluster, with about
1% of pgs degraded, and shut the cluster entirely down for maintenance. On
startup, many pgs are entirely unknown, and most are stale. In fact most pgs
can't be queried! No mon failures. No obvious signs of OSD failure (and the
problem is too widespread for that). Is there a specific way to force OSDs
to rescan and re-advertise their pgs? Is there a specific startup order
that fixes this, i.e., start all OSDs first and then start mons?

I'm baffled,
Jeremy

On Mon, Feb 1, 2021 at 10:43 PM Wido den Hollander  wrote:

>
>
> On 01/02/2021 22:48, Tony Liu wrote:
> > Hi,
> >
> > With 3 replicas, a pg hs 3 osds. If all those 3 osds are down,
> > the pg becomes unknow. Is that right?
> >
>
> Yes. As no OSD can report the status to the MONs.
>
> > If those 3 osds are replaced and in and on, is that pg going to
> > be eventually back to active? Or anything else has to be done
> > to fix it?
> >
>
> If you can bring back the OSDs without wiping them: Yes
>
> As you mention the word 'replaced' I was wondering what you mean by
> that. If you replace the disks without data recovery the PGs will be lost.
>
> So you need to bring back the OSDs with their data in tact for the PG to
> come back online.
>
> Wido
>
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Jeremy Austin
jhaus...@gmail.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using RBD to pack billions of small files

2021-02-02 Thread Gregory Farnum
Packing's obviously a good idea for storing these kinds of artifacts
in Ceph, and hacking through the existing librbd might indeed be
easier than building something up from raw RADOS, especially if you
want to use stuff like rbd-mirror.

My main concern would just be, as Dan points out, that we don't test
rbd with extremely large images and we know deleting that image will
take a long time. I don't know of other issues off the top of my
head, and in the worst case you could always fall back to manipulating
it with raw librados if there is an issue.

But you might also check in on the status of Danny Al-Gaaf's rados
email project. Email and these artifacts seemingly have a lot in
common.
-Greg

On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary  wrote:
>
> Hi Dan,
>
> On 01/02/2021 21:13, Dan van der Ster wrote:
> > Hi Loïc,
> >
> > We've never managed 100TB+ in a single RBD volume. I can't think of
> > anything, but perhaps there are some unknown limitations when they get so
> > big.
> > It should be easy enough to use rbd bench to create and fill a massive test
> > image to validate everything works well at that size.
> Good idea! I'll look for a cluster with 100TB of free space and post my 
> findings.
> >
> > Also, I assume you'll be doing the IO from just one client? Multiple
> > readers/writers to a single volume could get complicated.
> Yes.
> >
> > Otherwise, yes RBD sounds very convenient for what you need.
> It is inspired by 
> https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf which 
> suggests an ad-hoc implementation to pack immutable objects together. But I 
> think RBD already provides the underlying logic, even though it is not 
> specialized for this use case. RGW also packs small objects together and 
> would be a good candidate. But it provides more flexibility to modify/delete 
> objects and I assume it will be slower to write N objects with RGW than to 
> write them sequentially on an RBD image. But I did not try and maybe I should.
>
> To be continued.
> >
> > Cheers, Dan
> >
> >
> > On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary  wrote:
> >
> >> Bonjour,
> >>
> >> In the context Software Heritage (a noble mission to preserve all source
> >> code)[0], artifacts have an average size of ~3KB and there are billions of
> >> them. They never change and are never deleted. To save space it would make
> >> sense to write them, one after the other, in an every growing RBD volume
> >> (more than 100TB). An index, located somewhere else, would record the
> >> offset and size of the artifacts in the volume.
> >>
> >> I wonder if someone already implemented this idea with success? And if
> >> not... does anyone see a reason why it would be a bad idea?
> >>
> >> Cheers
> >>
> >> [0] https://docs.softwareheritage.org/
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> >>
> >>
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using RBD to pack billions of small files

2021-02-02 Thread Anthony D'Atri
I’d be nervous about a plan to utilize a single volume growing indefinitely.
I would think that, from a blast radius perspective, you’d want to strike a
balance between a single monolithic blockchain-style volume and a zillion tiny
files. Perhaps a strategy to shard into, say, 10 TB volumes. That size is
large enough to hold lots of immutable code yet not so unwieldy that it becomes
infeasible to manage.


> Packing's obviously a good idea for storing these kinds of artifacts
> in Ceph, and hacking through the existing librbd might indeed be
> easier than building something up from raw RADOS, especially if you
> want to use stuff like rbd-mirror.
> 
> My main concern would just be as Dan points out, that we don't test
> rbd with extremely large images and we know deleting that image will
> take a long time — I don't know of other issues off the top of my
> head, and in the worst case you could always fall back to manipulating
> it with raw librados if there is an issue.
> 
> But you might also check in on the status of Danny Al-Gaaf's rados
> email project. Email and these artifacts seemingly have a lot in
> common.
> -Greg
> 
> On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary  wrote:
>> 
>> Hi Dan,
>> 
>> On 01/02/2021 21:13, Dan van der Ster wrote:
>>> Hi Loïc,
>>> 
>>> We've never managed 100TB+ in a single RBD volume. I can't think of
>>> anything, but perhaps there are some unknown limitations when they get so
>>> big.
>>> It should be easy enough to use rbd bench to create and fill a massive test
>>> image to validate everything works well at that size.
>> Good idea! I'll look for a cluster with 100TB of free space and post my 
>> findings.
>>> 
>>> Also, I assume you'll be doing the IO from just one client? Multiple
>>> readers/writers to a single volume could get complicated.
>> Yes.
>>> 
>>> Otherwise, yes RBD sounds very convenient for what you need.
>> It is inspired by 
>> https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf which 
>> suggests an ad-hoc implementation to pack immutable objects together. But I 
>> think RBD already provides the underlying logic, even though it is not 
>> specialized for this use case. RGW also packs small objects together and 
>> would be a good candidate. But it provides more flexibility to modify/delete 
>> objects and I assume it will be slower to write N objects with RGW than to 
>> write them sequentially on an RBD image. But I did not try and maybe I 
>> should.
>> 
>> To be continued.
>>> 
>>> Cheers, Dan
>>> 
>>> 
>>> On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary  wrote:
>>> 
 Bonjour,
 
 In the context Software Heritage (a noble mission to preserve all source
 code)[0], artifacts have an average size of ~3KB and there are billions of
 them. They never change and are never deleted. To save space it would make
 sense to write them, one after the other, in an every growing RBD volume
 (more than 100TB). An index, located somewhere else, would record the
 offset and size of the artifacts in the volume.
 
 I wonder if someone already implemented this idea with success? And if
 not... does anyone see a reason why it would be a bad idea?
 
 Cheers
 
 [0] https://docs.softwareheritage.org/
 
 --
 Loïc Dachary, Artisan Logiciel Libre
 
 
 
 
 
 
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> 
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using RBD to pack billions of small files

2021-02-02 Thread Loïc Dachary
Hi Greg,

On 02/02/2021 20:34, Gregory Farnum wrote:
> Packing's obviously a good idea for storing these kinds of artifacts
> in Ceph, and hacking through the existing librbd might indeed be
> easier than building something up from raw RADOS, especially if you
> want to use stuff like rbd-mirror.
>
> My main concern would just be as Dan points out, that we don't test
> rbd with extremely large images and we know deleting that image will
> take a long time — I don't know of other issues off the top of my
> head, and in the worst case you could always fall back to manipulating
> it with raw librados if there is an issue.
Right. Dan's comment gave me pause: it does not seem to be
a good idea to assume an RBD image of infinite size. A friend who read this
thread suggested a sensible approach (which is also in line with the
Haystack paper): instead of making a single gigantic image, make
multiple 1TB images. The index is bigger:

SHA256 sum of the artifact => name/uuid of the 1TB image,offset,size

instead of

SHA256 sum of the artifact  => offset,size

But each image still provides packing for over 100 million artifacts when the
average artifact size is around 3KB. It also allows:

* multiple writers (one for each image),
* rbd-mirroring individual 1TB images to a different Ceph cluster (challenging 
with a single 100TB+ image),
* copying a 1TB image from a pool with a given erasure code profile to another 
pool with a different profile,
* growing from 1TB to 2TB in the future by merging two 1TB images,
* etc.
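
As a very rough, untested sketch of the write path (using the kernel client
for simplicity; the offset would come from the external index):

```
# create one 1 TiB shard and map it through the kernel RBD client
rbd create --size 1T --pool artifacts shard-0001
DEV=$(rbd map artifacts/shard-0001)

# append one artifact at the current tail offset, then record
# (sha256 -> shard-0001, OFFSET, SIZE) in the external index
dd if=artifact.bin of="$DEV" bs=4M seek="$OFFSET" oflag=seek_bytes conv=notrunc
```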

> But you might also check in on the status of Danny Al-Gaaf's rados
> email project. Email and these artifacts seemingly have a lot in
> common.
They do. This is inspiring:

https://github.com/ceph-dovecot/dovecot-ceph-plugin
https://github.com/ceph-dovecot/dovecot-ceph-plugin/tree/master/src/librmb

Thanks for the pointer.

Cheers
> -Greg
>
> On Mon, Feb 1, 2021 at 12:52 PM Loïc Dachary  wrote:
>> Hi Dan,
>>
>> On 01/02/2021 21:13, Dan van der Ster wrote:
>>> Hi Loïc,
>>>
>>> We've never managed 100TB+ in a single RBD volume. I can't think of
>>> anything, but perhaps there are some unknown limitations when they get so
>>> big.
>>> It should be easy enough to use rbd bench to create and fill a massive test
>>> image to validate everything works well at that size.
>> Good idea! I'll look for a cluster with 100TB of free space and post my 
>> findings.
>>> Also, I assume you'll be doing the IO from just one client? Multiple
>>> readers/writers to a single volume could get complicated.
>> Yes.
>>> Otherwise, yes RBD sounds very convenient for what you need.
>> It is inspired by 
>> https://static.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf which 
>> suggests an ad-hoc implementation to pack immutable objects together. But I 
>> think RBD already provides the underlying logic, even though it is not 
>> specialized for this use case. RGW also packs small objects together and 
>> would be a good candidate. But it provides more flexibility to modify/delete 
>> objects and I assume it will be slower to write N objects with RGW than to 
>> write them sequentially on an RBD image. But I did not try and maybe I 
>> should.
>>
>> To be continued.
>>> Cheers, Dan
>>>
>>>
>>> On Sat, Jan 30, 2021, 4:01 PM Loïc Dachary  wrote:
>>>
 Bonjour,

 In the context Software Heritage (a noble mission to preserve all source
 code)[0], artifacts have an average size of ~3KB and there are billions of
 them. They never change and are never deleted. To save space it would make
 sense to write them, one after the other, in an every growing RBD volume
 (more than 100TB). An index, located somewhere else, would record the
 offset and size of the artifacts in the volume.

 I wonder if someone already implemented this idea with success? And if
 not... does anyone see a reason why it would be a bad idea?

 Cheers

 [0] https://docs.softwareheritage.org/

 --
 Loïc Dachary, Artisan Logiciel Libre






 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io

>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Loïc Dachary, Artisan Logiciel Libre




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw bucket index issue

2021-02-02 Thread Fox, Kevin M
Ping


From: Fox, Kevin M 
Sent: Tuesday, December 29, 2020 3:17 PM
To: ceph-users@ceph.io
Subject: [ceph-users] radosgw bucket index issue

We have a fairly old cluster that has over time been upgraded to Nautilus. We
were digging through some things and found 3 bucket indexes without a
corresponding bucket. They should have been deleted but somehow were left
behind. When we try to delete a bucket index, it is not allowed because the
bucket is not found. The bucket index list command works fine without the
bucket, though. Is there a way to delete the indexes? Or maybe somehow relink
the bucket so it can be deleted again?

Thanks,
Kevin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is unknown pg going to be active after osds are fixed?

2021-02-02 Thread Tony Liu
Thank you all for the kind responses!
This problem didn't happen naturally; it was caused by an operational
mistake. Anyway, 3 OSDs were replaced with zapped disks. That caused
two unknown PGs. The data on those 2 PGs is unfortunately permanently lost.
"pg dump" shows unknown. "pg map " shows those 3 replaced OSDs.
"pg query " can't find it. I did "osd force-create-pg " to
recreate them. The PG map remains on those 3 OSDs.
Now they are active+clean.


Tony
> -Original Message-
> From: Jeremy Austin 
> Sent: Tuesday, February 2, 2021 8:58 AM
> To: Wido den Hollander 
> Cc: Tony Liu ; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: is unknown pg going to be active after
> osds are fixed?
> 
> I'm in a similar but not identical situation.
> 
> I was in the middle of a rebalance on a small test cluster, without
> about 1% of pgs degraded, and shut the cluster entirely down for
> maintenance. On startup, many pgs are entirely unknown, and most stale.
> In fact most pgs can't be queried! No mon failures. No obvious signs of
> OSD failure (and the problem is too widespread for that.) Is there a
> specific way to force OSDs to rescan and re-advertise their pgs? Is
> there a specific startup order that fixes this, i.e., start all OSDs
> first and then start mons?
> 
> I'm baffled,
> Jeremy
> 
> On Mon, Feb 1, 2021 at 10:43 PM Wido den Hollander   > wrote:
> 
> 
> 
> 
>   On 01/02/2021 22:48, Tony Liu wrote:
>   > Hi,
>   >
>   > With 3 replicas, a pg hs 3 osds. If all those 3 osds are down,
>   > the pg becomes unknow. Is that right?
>   >
> 
>   Yes. As no OSD can report the status to the MONs.
> 
>   > If those 3 osds are replaced and in and on, is that pg going to
>   > be eventually back to active? Or anything else has to be done
>   > to fix it?
>   >
> 
>   If you can bring back the OSDs without wiping them: Yes
> 
>   As you mention the word 'replaced' I was wondering what you mean by
>   that. If you replace the disks without data recovery the PGs will
> be lost.
> 
>   So you need to bring back the OSDs with their data in tact for the
> PG to
>   come back online.
> 
>   Wido
> 
>   >
>   > Thanks!
>   > Tony
>   > ___
>   > ceph-users mailing list -- ceph-users@ceph.io  us...@ceph.io>
>   > To unsubscribe send an email to ceph-users-le...@ceph.io
> 
>   >
>   ___
>   ceph-users mailing list -- ceph-users@ceph.io  us...@ceph.io>
>   To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> 
> 
> --
> 
> Jeremy Austin
> jhaus...@gmail.com 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: no device listed after adding host

2021-02-02 Thread Tony Liu
This works after upgrading from 15.2.5 to 15.2.8.
I see an improvement: "orch host add" now does some checking
and shows explicit messages if anything is missing.
But I am still not sure how 15.2.5 worked initially when building
the cluster. Anyway, I am good now.

Thanks!
Tony
> -Original Message-
> From: Eugen Block 
> Sent: Tuesday, February 2, 2021 12:32 AM
> To: Tony Liu 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: no device listed after adding host
> 
> Just a note: you don't need to install any additional package to run
> ceph-volume:
> 
> host1:~ # cephadm ceph-volume lvm list
> 
> Did you resolve the missing OSDs since you posted a follow-up question?
> If not did you check all the logs on the OSD host, e.g.
> 'journalctl -f' or ceph-volume.log in /var/log/ceph//? I would
> expect to find clues what's going on.
> 
> 
> Zitat von Tony Liu :
> 
> > Hi Eugen,
> >
> > I installed ceph-osd on the osd-host to run ceph-volume, which then
> > lists all devices. But "ceph orch device ls" on the controller (mon
> > and mgr) still doesn't show those devices.
> > This worked when I initially built the cluster. Not sure what is
> > missing here. Trying to find out how to trace it. Any idea?
> >
> >
> > Thanks!
> > Tony
> >> -Original Message-
> >> From: Eugen Block 
> >> Sent: Monday, February 1, 2021 12:33 PM
> >> To: Tony Liu 
> >> Cc: ceph-users@ceph.io
> >> Subject: Re: [ceph-users] Re: no device listed after adding host
> >>
> >> Hi,
> >>
> >> you could try
> >>
> >> ceph-volume inventory
> >>
> >> to see if it finds or reports anything.
> >>
> >>
> >> Zitat von Tony Liu :
> >>
> >> > "ceph log last cephadm" shows the host was added without errors.
> >> > "ceph orch host ls" shows the host as well.
> >> > "python3 -c import sys;exec(...)" is running on the host.
> >> > But still no devices on this host is listed.
> >> > Where else can I check?
> >> >
> >> > Thanks!
> >> > Tony
> >> >> -Original Message-
> >> >> From: Tony Liu 
> >> >> Sent: Sunday, January 31, 2021 9:23 PM
> >> >> To: ceph-users@ceph.io
> >> >> Subject: [ceph-users] no device listed after adding host
> >> >>
> >> >> Hi,
> >> >>
> >> >> I added a host by "ceph orch host add ceph-osd-5 10.6.10.84 ceph-
> osd".
> >> >> I can see the host by "ceph orch host ls", but no devices listed
> >> >> by "ceph orch device ls ceph-osd-5". I tried "ceph orch device zap
> >> >> ceph-osd-5 /dev/sdc --force", which works fine. Wondering why no
> >> >> devices listed? What I am missing here?
> >> >>
> >> >>
> >> >> Thanks!
> >> >> Tony
> >> >> ___
> >> >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send
> >> >> an email to ceph-users-le...@ceph.io
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send
> >> > an email to ceph-users-le...@ceph.io
> >>
> >>
> 
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: db_devices doesn't show up in exported osd service spec

2021-02-02 Thread Tony Liu
All mon, mgr, crash and osd are upgraded to 15.2.8. It actually
fixed another issue (no device listed after adding host).
But this issue remains. 
```
# cat osd-spec.yaml 
service_type: osd
service_id: osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
data_devices:
  rotational: 1
db_devices:
  rotational: 0

# ceph orch apply osd -i osd-spec.yaml 
Scheduled osd.osd-spec update...

# ceph orch ls --service_name osd.osd-spec --export
service_type: osd
service_id: osd-spec
service_name: osd.osd-spec
placement:
  host_pattern: ceph-osd-[1-3]
spec:
  data_devices:
rotational: 1
  filter_logic: AND
  objectstore: bluestore
```
db_devices still doesn't show up.
I keep scratching my head...


Thanks!
Tony
> -Original Message-
> From: Eugen Block 
> Sent: Tuesday, February 2, 2021 2:20 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: db_devices doesn't show up in exported osd
> service spec
> 
> Hi,
> 
> I would recommend to update (again), here's my output from a 15.2.8 test
> cluster:
> 
> 
> host1:~ # ceph orch ls --service_name osd.default --export
> service_type: osd
> service_id: default
> service_name: osd.default
> placement:
>hosts:
>- host4
>- host3
>- host1
>- host2
> spec:
>block_db_size: 4G
>data_devices:
>  rotational: 1
>  size: '20G:'
>db_devices:
>  size: '10G:'
>filter_logic: AND
>objectstore: bluestore
> 
> 
> Regards,
> Eugen
> 
> 
> Zitat von Tony Liu :
> 
> > Hi,
> >
> > When build cluster Octopus 15.2.5 initially, here is the OSD service
> > spec file applied.
> > ```
> > service_type: osd
> > service_id: osd-spec
> > placement:
> >   host_pattern: ceph-osd-[1-3]
> > data_devices:
> >   rotational: 1
> > db_devices:
> >   rotational: 0
> > ```
> > After applying it, all HDDs were added and DB of each hdd is created
> > on SSD.
> >
> > Here is the export of OSD service spec.
> > ```
> > # ceph orch ls --service_name osd.osd-spec --export
> > service_type: osd
> > service_id: osd-spec
> > service_name: osd.osd-spec
> > placement:
> >   host_pattern: ceph-osd-[1-3]
> > spec:
> >   data_devices:
> > rotational: 1
> >   filter_logic: AND
> >   objectstore: bluestore
> > ```
> > Why db_devices doesn't show up there?
> >
> > When I replace a disk recently, when the new disk was installed and
> > zapped, OSD was automatically re-created, but DB was created on HDD,
> > not SSD. I assume this is because of that missing db_devices?
> >
> > I tried to update service spec, the same result, db_devices doesn't
> > show up when export it.
> >
> > Is this some known issue or something I am missing?
> >
> >
> > Thanks!
> > Tony
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] replace OSD without PG remapping

2021-02-02 Thread Tony Liu
Hi,

There are multiple different procedures to replace an OSD.
What I want is to replace an OSD without PG remapping.

#1
I tried "orch osd rm --replace", which sets OSD reweight 0 and
status "destroyed". "orch osd rm status" shows "draining".
All PGs on this OSD are remapped. Checking "pg dump", I can't find
this OSD any more.

1) Given [1], setting weight 0 seems better than setting reweight 0.
Is that right? If yes, should we change the behavior of "orch osd
rm --replace"?

2) "ceph status" doesn't show anything about OSD draining.
Is there any way to see the progress of draining?
Is there actually copy happening? The PG on this OSD is remapped
and copied to another OSD, right?

3) When OSD is replaced, there will be remapping and backfilling.

4) There is remapping in #2 and remapping again in #3.
I want to avoid it.

#2
Is there any procedure that neither marks the OSD out (sets reweight 0)
nor sets weight 0, which would keep the PG map unchanged and just warn
about reduced redundancy (one of the 3 OSDs of a PG is down), so that when
the OSD is replaced there is no remapping, just data backfilling?
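
Something along these lines is what I'm picturing, keeping the OSD id so
the PG map never changes (untested sketch; the cephadm specifics may need
adjusting):

```
ceph osd set noout                           # don't mark anything out while we work
ceph orch daemon stop osd.12                 # example id
ceph osd destroy 12 --yes-i-really-mean-it   # keeps the id and CRUSH position

# swap the physical disk, then recreate the OSD reusing the same id
cephadm ceph-volume lvm create --osd-id 12 --data /dev/sdX

ceph osd unset noout
```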

[1] 
https://ceph.com/geen-categorie/difference-between-ceph-osd-reweight-and-ceph-osd-crush-reweight/


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Module 'cephadm' has failed: 'NoneType' object has no attribute 'split'

2021-02-02 Thread Tony Liu
Hi,

After upgrading from 15.2.5 to 15.2.8, I see this health error.
Has anyone seen this? "ceph log last cephadm" doesn't show anything
about it. How can I trace it?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Module 'cephadm' has failed: 'NoneType' object has no attribute 'split'

2021-02-02 Thread Tony Liu
File \"/usr/share/ceph/mgr/cephadm/module.py\", line 442, in serve
  serve.serve()
File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 66, in serve
  self.mgr.rm_util.process_removal_queue()
File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 348, in 
process_removal_queue
  self.mgr._remove_daemon(osd.fullname, osd.hostname)
File \"/usr/share/ceph/mgr/cephadm/module.py\", line 1808, in _remove_daemon
  (daemon_type, daemon_id) = name.split('.', 1)

When process_removal_queue calls _remove_daemon(name, host),
"osd.fullname" is None. "osd" is from the list "to_remove_osds".
Seems like a bug to me.
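
If the pending removal entry is what trips this (just a guess on my part),
clearing the removal queue might stop the error until there's a fix:

```
# see what's still queued for removal, and cancel it if it looks bogus
ceph orch osd rm status
ceph orch osd rm stop <osd_id>
```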


Thanks!
Tony
> -Original Message-
> From: Tony Liu 
> Sent: Tuesday, February 2, 2021 7:20 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Module 'cephadm' has failed: 'NoneType' object has
> no attribute 'split'
> 
> Hi,
> 
> After upgrading from 15.2.5 to 15.2.8, I see this health error.
> Has anyone seen this? "ceph log last cephadm" doesn't show anything
> about it. How can I trace it?
> 
> 
> Thanks!
> Tony
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io