[ceph-users] BlueFS spillover detected, why, what?

2020-08-20 Thread Simon Oosthoek

Hi

Recently our ceph cluster (Nautilus) has been experiencing BlueFS spillovers on 
just 2 OSDs, and I disabled the warning for these OSDs.

(ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

I'm wondering what causes this and how this can be prevented.

As I understand it, the RocksDB for the OSD needs to store more than fits 
on the NVMe logical volume (123 GB for a 12 TB OSD). A way to fix it could be 
to increase the logical volume on the NVMe (if there were space on the 
NVMe, which there isn't at the moment).
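
For reference, if there were free extents in the NVMe volume group, my
understanding is that growing a DB volume would look roughly like this
(untested sketch; the VG/LV names, the extra size and the OSD id are
placeholders, and the OSD has to be stopped first):

   systemctl stop ceph-osd@125
   lvextend -L +100G /dev/<nvme-vg>/<db-lv-for-osd-125>
   ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-125
   systemctl start ceph-osd@125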


This is the current size of the cluster and how much is free:

[root@cephmon1 ~]# ceph df
RAW STORAGE:
    CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
    hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
    TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63

POOLS:
    POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
    cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
    cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
    cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
    cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
    rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
    .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
    default.rgw.control  16       0 B         8       0 B       0     167 TiB
    default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB
    default.rgw.log      18       0 B       207       0 B       0     167 TiB
    cephfs_data_ec57     20   433 MiB       230   1.2 GiB       0     278 TiB


The amount used can still grow a bit before we need to add nodes, but 
apparently we are running into the limits of our RocksDB partitions.


Did we choose a parameter (e.g. minimal object size) too small, so we 
have too many objects on these spillover OSDs? Or is it that too many 
small files are stored on the cephfs filesystems?


When we expand the cluster, we can choose larger NVMe devices to allow 
larger RocksDB partitions, but is that the right way to deal with this, 
or should we adjust some parameters on the cluster that will reduce the 
RocksDB size?


Cheers

/Simon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH FS is always showing the status as creating

2020-08-20 Thread Alokkumar Mahajan
Hello Nathan,
Below is the output of ceph status:-

  cluster:
id: a3ede5f7-ade8-4bfd-91f4-568e19ca9e69
health: HEALTH_WARN
1 MDSs report slow metadata IOs
Degraded data redundancy: 12563/37689 objects degraded
(33.333%), 109 pgs degraded
application not enabled on 2 pool(s)

  services:
mon: 1 daemons, quorum vl-co-qbr
mgr: node1(active)
mds: cephfs-1/1/1 up  {0=vl-co-qbr=up:creating}
osd: 4 osds: 2 up, 2 in

  data:
pools:   4 pools, 248 pgs
objects: 12.56 k objects, 37 GiB
usage:   90 GiB used, 510 GiB / 600 GiB avail
pgs: 12563/37689 objects degraded (33.333%)
 139 active+undersized
 109 active+undersized+degraded

  io:
client:   1.4 KiB/s wr, 0 op/s rd, 5 op/s wr
recovery: 12 B/s, 0 keys/s, 1 objects/s

On Wed, 19 Aug 2020 at 21:26, Nathan Fish  wrote:

> Have you created any MDS daemons? Can you paste "ceph status"?
>
> On Wed, Aug 19, 2020 at 11:52 AM Alokkumar Mahajan
>  wrote:
> >
> > Hello,
> > We have created CEPH FS but it is always showing the status as creating.
> >
> > ceph fs get returns below output:-
> >
> > ===
> > Filesystem 'cephfs' (4)
> > fs_name cephfs
> > epoch   2865929
> > flags   12
> > created 2020-08-07 05:05:58.033824
> > modified2020-08-14 03:15:49.727680
> > tableserver 0
> > root0
> > session_timeout 60
> > session_autoclose   300
> > max_file_size   1099511627776
> > min_compat_client   -1 (unspecified)
> > last_failure0
> > last_failure_osd_epoch  652
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
> > table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in  0
> > up  {0=1494099}
> > failed
> > damaged
> > stopped
> > data_pools  [32]
> > metadata_pool   33
> > inline_data disabled
> > balancer
> > standby_count_wanted0
> > 1494099:10.18.97.47:6800/2041780514
> >  'vl-pun-qa' mds.0.2748022
> > up:creating seq 149289
> > 
> >
> > We are CEPH 13.2.6 MIMIC Version.
> >
> > I am new to CEPH so i am really not sure where to start checking this,
> any
> > help will be greatly appreciated.
> >
> > Thanks,
> > -alok
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Dan van der Ster
Hi Frank,

I didn't get time yet. On our side, I was planning to see if the issue
persists after upgrading to v14.2.11 -- it includes some updates to
how the osdmap is referenced across OSD.cc.

BTW, do you happen to have osd_map_dedup set to false? We do, and that
surely increases the osdmap memory usage somewhat.
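
For example, on the OSD host (osd.0 is just an example id):

  ceph daemon osd.0 config get osd_map_dedup

On Nautilus the central config database should also answer
"ceph config get osd osd_map_dedup", if I remember correctly.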

-- Dan




On Thu, Aug 20, 2020 at 9:33 AM Frank Schilder  wrote:
>
> Hi Dan and Mark,
>
> could you please let me know if you can read the files with the version info 
> I provided in my previous e-mail? I'm in the process of collecting data with 
> more FS activity and would like to send it in a format that is useful for 
> investigation.
>
> Right now I'm observing a daily growth of swap of ca. 100-200MB on servers 
> with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS 
> manages to keep enough RAM available. Also the mempool dump still shows onode 
> and data cached at a seemingly reasonable level. Users report a more stable 
> performance of the FS after I increased the cach min sizes on all OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 17 August 2020 09:37
> To: Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Hi Dan,
>
> I use the container 
> docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I 
> can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, 
> its a Centos 7 build. The version is:
>
> # ceph -v
> ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>
> On Centos, the profiler packages are called different, without the "google-" 
> prefix. The version I have installed is
>
> # pprof --version
> pprof (part of gperftools 2.0)
>
> Copyright 1998-2007 Google Inc.
>
> This is BSD licensed software; see the source for copying conditions
> and license information.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> It is possible to install pprof inside this container and analyse the 
> *.heap-files I provided.
>
> If this doesn't work for you and you want me to generate the text output for 
> heap-files, I can do that. Please let me know if I should do all files and 
> with what option (eg. against a base etc.).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 August 2020 10:38:57
> To: Frank Schilder
> Cc: Mark Nelson; ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> Hi Frank,
>
> I'm having trouble getting the exact version of ceph you used to
> create this heap profile.
> Could you run the google-pprof --text steps at [1] and share the output?
>
> Thanks, Dan
>
> [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
>
> On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
> >
> > Hi Mark,
> >
> > here is a first collection of heap profiling data (valid 30 days):
> >
> > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
> >
> > This was collected with the following config settings:
> >
> >   osd  dev  osd_memory_cache_min  
> > 805306368
> >   osd  basicosd_memory_target 
> > 2147483648
> >
> > Setting the cache_min value seems to help keeping cache space available. 
> > Unfortunately, the above collection is for 12 days only. I needed to 
> > restart the OSD and will need to restart it soon again. I hope I can then 
> > run a longer sample. The profiling does cause slow ops though.
> >
> > Maybe you can see something already? It seems to have collected some leaked 
> > memory. Unfortunately, it was a period of extremely low load. Basically, 
> > with the day of recording the utilization dropped to almost zero.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Frank Schilder 
> > Sent: 21 July 2020 12:57:32
> > To: Mark Nelson; Dan van der Ster
> > Cc: ceph-users
> > Subject: [ceph-users] Re: OSD memory leak?
> >
> > Quick question: Is there a way to change the frequency of heap dumps? On 
> > this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a 
> > function HeapProfilerSetAllocationInterval() is mentioned, but no other way 
> > of configuring this. Is there a config parameter or a ceph daemon call to 
> > adjust this?
> >
> > If not, can I change the dump path?
> >
> > Its likely to overrun my log partition quickly if I cannot adjust either of 
> > the two.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Frank Schilder 
> > Sent: 20 July 2020 15:19:05
> > To: Ma

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Eugen Block

Hi Martin,

have you seen this blog post [1]? It describes how to recover from 
inactive and incomplete PGs (on a size 1 pool). I haven't tried any of 
that but it could be worth a try. Apparently it would only work if the 
affected PGs have 0 objects, but that seems to be the case, right?


Regards,
Eugen

[1]  
https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1



Zitat von Martin Palma :


If Ceph consultants are reading this please feel free to contact me
off list. We are seeking for someone who can help us of course we will
pay.



On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  wrote:


After doing some research I suspect the problem is that during the
cluster was backfilling an OSD was removed.

Now the PGs which are inactive and incomplete have all the same
(removed OSD) in the "down_osds_we_would_probe" output and the peering
is blocked by "peering_blocked_by_history_les_bound". We tried to set
the "osd_find_best_info_ignore_history_les = true" but with no success
the OSDs keep in a peering loop.

On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  wrote:
>
> Here is the output with all OSD up and running.
>
> ceph -s: https://pastebin.com/5tMf12Lm
> ceph health detail: https://pastebin.com/avDhcJt0
> ceph osd tree: https://pastebin.com/XEB0eUbk
> ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
>
> On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  wrote:
> >
> > Hi Peter,
> >
> > On the weekend another host was down due to power problems, which was
> > restarted. Therefore these outputs also include some "Degraded data
> > redundancy" messages. And one OSD crashed due to a disk error.
> >
> > ceph -s: https://pastebin.com/Tm8QHp52
> > ceph health detail: https://pastebin.com/SrA7Bivj
> > ceph osd tree: https://pastebin.com/nBK8Uafd
> > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> >
> > No it's not a EC pool which has the inactive+incomplete PGs.
> >
> > ceph osd crush dump | jq '[.rules, .tunables]':  
https://pastebin.com/gqDTjfat

> >
> > Best,
> > Martin
> >
> > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> >  wrote:
> > >
> > > Dear Martin,
> > >
> > > Can you provide some details?
> > >
> > > ceph -s
> > > ceph health detail
> > > ceph osd tree
> > > ceph osd pool ls detail
> > >
> > > If it's EC (you implied it's not) also show the crush  
rules...and may as well include tunables (because greatly raising  
choose_total_tries, eg. 200 may be the solution to your problem):

> > > ceph osd crush dump | jq '[.rules, .tunables]'
> > >
> > > Peter
> > >
> > > On 8/16/20 1:18 AM, Martin Palma wrote:
> > > > Yes, but that didn’t help. After some time they have  
blocked requests again

> > > > and remain inactive and incomplete.
> > > >
> > > > On Sat, 15 Aug 2020 at 16:58,  wrote:
> > > >
> > > >> Did you tried to restart the sayed osds?
> > > >>
> > > >>
> > > >>
> > > >> Hth
> > > >>
> > > >> Mehmet
> > > >>
> > > >>
> > > >>
> > > >> Am 12. August 2020 21:07:55 MESZ schrieb Martin Palma  
:

> > > >>
> > >  Are the OSDs online? Or do they refuse to boot?
> > > >>> Yes. They are up and running and not marked as down or out of the
> > > >>> cluster.
> > >  Can you list the data with ceph-objectstore-tool on these OSDs?
> > > >>> If you mean the "list" operation on the PG works if an output for
> > > >>> example:
> > > >>> $ ceph-objectstore-tool --data-path  
/var/lib/ceph/osd/ceph-63 --pgid

> > > >>> 22.11a --op list
> > > >>
> > > >>>  
["22.11a",{"oid":"1001c1ee04f.0007","key":"","snapid":-2,"hash":3825189146,"max":0,"pool":22,"namespace":"","max":0}]

> > > >>
> > > >>>  
["22.11a",{"oid":"1000448667f.","key":"","snapid":-2,"hash":4294951194,"max":0,"pool":22,"namespace":"","max":0}]

> > > >>> ...
> > > >>> If I run "ceph pg ls incomplete" in the output only one PG has
> > > >>> objects... all others have 0 objects.
> > > >>> ___
> > > >>> ceph-users mailing list -- ceph-users@ceph.io
> > > >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >> ___
> > > >>
> > > >> ceph-users mailing list -- ceph-users@ceph.io
> > > >>
> > > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >>
> > > >>
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > >
> > > --
> > > 
> > > Peter Maloney
> > > Brockmann Consult GmbH
> > > www.brockmann-consult.de
> > > Chrysanderstr. 1
> > > D-21029 Hamburg, Germany
> > > Tel: +49 (0)40 69 63 89 - 320
> > > E-mail: peter.malo...@brockmann-consult.de
> > > Amtsgericht Hamburg HRB 157689
> > > Geschäftsführer Dr. Carsten Brockmann
> > > 
> > >

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: CEPH FS is always showing the status as creating

2020-08-20 Thread Eugen Block
You need to fix the out OSDs first. The default pool size is very  
likely three and you only have two OSDs up, that's why 33% of your PGs  
are degraded. I'm pretty sure if you fix that your cephfs will become  
active.
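
Roughly something like this (pool name and OSD ids are only examples, 
adjust them to your setup):

  ceph osd tree                         # identify the two down OSDs
  ceph osd pool get cephfs_data size    # confirm the replica count (likely 3)
  systemctl start ceph-osd@<id>         # on the host of each down OSD
  ceph osd in <id>                      # if an OSD stays marked out

Once three OSDs are up and in, the PGs can become active+clean and the 
MDS should be able to finish creating.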



Zitat von Alokkumar Mahajan :


Hello Nathan,
Below is the output of ceph status:-

  cluster:
id: a3ede5f7-ade8-4bfd-91f4-568e19ca9e69
health: HEALTH_WARN
1 MDSs report slow metadata IOs
Degraded data redundancy: 12563/37689 objects degraded
(33.333%), 109 pgs degraded
application not enabled on 2 pool(s)

  services:
mon: 1 daemons, quorum vl-co-qbr
mgr: node1(active)
mds: cephfs-1/1/1 up  {0=vl-co-qbr=up:creating}
osd: 4 osds: 2 up, 2 in

  data:
pools:   4 pools, 248 pgs
objects: 12.56 k objects, 37 GiB
usage:   90 GiB used, 510 GiB / 600 GiB avail
pgs: 12563/37689 objects degraded (33.333%)
 139 active+undersized
 109 active+undersized+degraded

  io:
client:   1.4 KiB/s wr, 0 op/s rd, 5 op/s wr
recovery: 12 B/s, 0 keys/s, 1 objects/s

On Wed, 19 Aug 2020 at 21:26, Nathan Fish  wrote:


Have you created any MDS daemons? Can you paste "ceph status"?

On Wed, Aug 19, 2020 at 11:52 AM Alokkumar Mahajan
 wrote:
>
> Hello,
> We have created CEPH FS but it is always showing the status as creating.
>
> ceph fs get returns below output:-
>
> ===
> Filesystem 'cephfs' (4)
> fs_name cephfs
> epoch   2865929
> flags   12
> created 2020-08-07 05:05:58.033824
> modified2020-08-14 03:15:49.727680
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> min_compat_client   -1 (unspecified)
> last_failure0
> last_failure_osd_epoch  652
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
> table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in  0
> up  {0=1494099}
> failed
> damaged
> stopped
> data_pools  [32]
> metadata_pool   33
> inline_data disabled
> balancer
> standby_count_wanted0
> 1494099:10.18.97.47:6800/2041780514
>  'vl-pun-qa' mds.0.2748022
> up:creating seq 149289
> 
>
> We are CEPH 13.2.6 MIMIC Version.
>
> I am new to CEPH so i am really not sure where to start checking this,
any
> help will be greatly appreciated.
>
> Thanks,
> -alok
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Michael Bisig
Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVMe partition. The 
levels are sized at roughly 300 MB, 3 GB, 30 GB and 300 GB. Any DB data above the 
largest level that fits will automatically end up on the slow device. 
In your setup, where you have 123 GB per OSD, that means you only use 30 GB of the 
fast device. The DB which spills over this limit will be offloaded to the HDD and, 
accordingly, it slows down requests and compactions.

You can check what your OSD currently consumes with:
  ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. 
These change regularly because of the ongoing compactions, but the Prometheus mgr 
module exports these values so that you can track them.
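
For example, on the OSD's host (osd.125 as an example; jq required):

  ceph daemon osd.125 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'

A non-zero `slow_used_bytes` is essentially the spillover the warning is about.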

Small files generally lead to a bigger RocksDB, especially when you use EC, but 
this depends on the actual number of files and their sizes.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

Hi

Recently our ceph cluster (nautilus) is experiencing bluefs spillovers, 
just 2 osd's and I disabled the warning for these osds.
(ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

I'm wondering what causes this and how this can be prevented.

As I understand it the rocksdb for the OSD needs to store more than fits 
on the NVME logical volume (123G for 12T OSD). A way to fix it could be 
to increase the logical volume on the nvme (if there was space on the 
nvme, which there isn't at the moment).

This is the current size of the cluster and how much is free:

[root@cephmon1 ~]# ceph df
RAW STORAGE:
    CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
    hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
    TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63

POOLS:
    POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
    cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
    cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
    cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
    cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
    rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
    .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
    default.rgw.control  16       0 B         8       0 B       0     167 TiB
    default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB
    default.rgw.log      18       0 B       207       0 B       0     167 TiB
    cephfs_data_ec57     20   433 MiB       230   1.2 GiB       0     278 TiB

The amount used can still grow a bit before we need to add nodes, but 
apparently we are running into the limits of our rocskdb partitions.

Did we choose a parameter (e.g. minimal object size) too small, so we 
have too much objects on these spillover OSDs? Or is it that too many 
small files are stored on the cephfs filesystems?

When we expand the cluster, we can choose larger nvme devices to allow 
larger rocksdb partitions, but is that the right way to deal with this, 
or should we adjust some parameters on the cluster that will reduce the 
rocksdb size?

Cheers

/Simon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Snapshot Children not exists / children relation broken

2020-08-20 Thread Konstantin Shalygin


On 8/3/20 2:07 PM, Torsten Ennenbach wrote:

Hi Jason.

Well, I don't tried that, because I am afraid to break something :/ I don’t 
really understand what are you doing there:(

Thanks anyways.



May be you catch this [1] bug? I have how-to solution [2] to resolve 
this, please try again.



[1] https://tracker.ceph.com/issues/19413

[2] 
https://k0ste.ru/how-to-delete-rbd-snapshot-in-luminous-ceph-cluster-kraken-bug.html



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Simon Oosthoek

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 GB 
per OSD on unused NVMe space, because only 30 GB is actually used...?


And to improve the space for RocksDB, we need to plan for 300 GB per 
RocksDB partition in order to benefit from this advantage.


Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)


I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming out 
of the ceph cluster, even when you look only superficially...
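
As a first step I guess something like this on each OSD node would already 
give an overview (sketch, assuming the default admin socket paths and jq 
being installed):

   for sock in /var/run/ceph/ceph-osd.*.asok; do
       printf '%s: ' "$sock"
       ceph daemon "$sock" perf dump | jq -c '.bluefs | {db_used_bytes, slow_used_bytes}'
   done

or scraping the same values from the prometheus mgr module, as you suggest.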


Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:

Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVME partition. The 
values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a limit will 
automatically end up on slow devices.
In your setup where you have 123GB per OSD that means you only use 30GB of fast 
device. The DB which spills over this limit will be offloaded to the HDD and 
accordingly, it slows down requests and compactions.

You can proof what your OSD currently consumes with:
   ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`. 
This changes regularly because of the ongoing compactions but Prometheus mgr 
module exports these values such that you can track it.

Small files generally leads to bigger RocksDB, especially when you use EC, but 
this depends on the actual amount and file sizes.

I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

 Hi

 Recently our ceph cluster (nautilus) is experiencing bluefs spillovers,
 just 2 osd's and I disabled the warning for these osds.
 (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

 I'm wondering what causes this and how this can be prevented.

 As I understand it the rocksdb for the OSD needs to store more than fits
 on the NVME logical volume (123G for 12T OSD). A way to fix it could be
 to increase the logical volume on the nvme (if there was space on the
 nvme, which there isn't at the moment).

 This is the current size of the cluster and how much is free:

 [root@cephmon1 ~]# ceph df
 RAW STORAGE:
     CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
     hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
     TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63

 POOLS:
     POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
     cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
     cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
     cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
     cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
     rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
     .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
     default.rgw.control  16       0 B         8       0 B       0     167 TiB
     default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB
     default.rgw.log      18       0 B       207       0 B       0     167 TiB
     cephfs_data_ec57     20   433 MiB       230   1.2 GiB       0     278 TiB

 The amount used can still grow a bit before we need to add nodes, but
 apparently we are running into the limits of our rocskdb partitions.

 Did we choose a parameter (e.g. minimal object size) too small, so we
 have too much objects on these spillover OSDs? Or is it that too many
 small files are stored on the cephfs filesystems?

 When we expand the cluster, we can choose larger nvme devices to allow
 larger rocksdb partitions, but is that the right way to deal with this,
 or should we adjust some parameters on the cluster that will reduce the
 rocksdb size?

 Cheers

 /Simon
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm not working with non-root user

2020-08-20 Thread Amudhan P
Hi,

Has any of you used the cephadm bootstrap command with a non-root user?
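
For what it's worth, the steps I found so far look roughly like this 
(untested sketch; whether "ceph cephadm set-user" exists depends on the 
exact Octopus release):

  ceph cephadm get-user                # currently still reports root
  ceph cephadm set-user non-rootuser   # newer Octopus releases
  ceph cephadm get-pub-key > cephadm.pub
  # append cephadm.pub to ~non-rootuser/.ssh/authorized_keys on every host
  # and give non-rootuser passwordless sudo there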


On Wed, Aug 19, 2020 at 11:30 AM Amudhan P  wrote:

> Hi,
>
> I am trying to install ceph 'octopus' using cephadm. In bootstrap
> command, I have specified a non-root user account as ssh-user.
> cephadm bootstrap --mon-ip xx.xxx.xx.xx --ssh-user non-rootuser
>
> when bootstrap about to complete it threw an error stating.
>
> 
> INFO:cephadm:Non-zero exit code 2 from /usr/bin/podman run --rm --net=host
> --ipc=host -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15 -e NODE_NAME=node1
> -v /var/log/ceph/ae4ed114-e145-11ea-9c1f-0025900a8ebe:/var/log/ceph:z
> -v /tmp/ceph-tmpm22k9j9w:/etc/ceph/ceph.client.admin.keyring:z
> -v /tmp/ceph-tmpe1ltigk8:/etc/ceph/ceph.conf:z --entrypoint
> /usr/bin/ceph docker.io/ceph/ceph:v15 orch host add node1
> INFO:cephadm:/usr/bin/ceph:stderr Error ENOENT: Failed to connect to node1
> (node1).
> INFO:cephadm:/usr/bin/ceph:stderr Check that the host is reachable and
> accepts connections using the cephadm SSH key
> INFO:cephadm:/usr/bin/ceph:stderr
> INFO:cephadm:/usr/bin/ceph:stderr you may want to run:
> INFO:cephadm:/usr/bin/ceph:stderr > ceph cephadm get-ssh-config >
> ssh_config
> INFO:cephadm:/usr/bin/ceph:stderr > ceph config-key get
> mgr/cephadm/ssh_identity_key > key
> INFO:cephadm:/usr/bin/ceph:stderr > ssh -F ssh_config -i key root@node1
> "
> In the above steps, it's trying to connect as root to the node, and when I
> downloaded the ssh_config file it also specified 'root' inside. So I
> modified the config file and uploaded it back to ceph, but ssh to
> node1 is still not working.
>
> To confirm if I have used the right command been used during bootstrap. I
> have tried the below command.
>
> " ceph config-key dump mgr/cephadm/ssh_user"
> {
> "mgr/cephadm/ssh_user": "non-rootuser"
> }
>
> and the output shows the user I have used during bootstrap  "non-rootuser"
>
> but at the same time when I run cmd " ceph cephadm get-user " the output
> still shows 'root' as the user.
>
> Why is the change not taking effect? Has anyone faced a similar issue during
> bootstrap?
>
> Is there any way to avoid using containers with cephadm?
>
> regards
> Amudhan
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Dan van der Ster
Something else to help debugging is

ceph pg 17.173 query

at the end it should say why the pg is incomplete.
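
e.g.

  ceph pg 17.173 query | jq '.recovery_state'

the recovery_state section lists the probed/blocking OSDs and the reason.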

-- dan



On Thu, Aug 20, 2020 at 10:01 AM Eugen Block  wrote:
>
> Hi Martin,
>
> have you seen this blog post [1]? It describes how to recover from
> inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> that but it could be worth a try. Apparently it only would work if the
> affected PGs have 0 objects but that seems to be the case, right?
>
> Regards,
> Eugen
>
> [1]
> https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
>
>
> Zitat von Martin Palma :
>
> > If Ceph consultants are reading this please feel free to contact me
> > off list. We are seeking for someone who can help us of course we will
> > pay.
> >
> >
> >
> > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  wrote:
> >>
> >> After doing some research I suspect the problem is that during the
> >> cluster was backfilling an OSD was removed.
> >>
> >> Now the PGs which are inactive and incomplete have all the same
> >> (removed OSD) in the "down_osds_we_would_probe" output and the peering
> >> is blocked by "peering_blocked_by_history_les_bound". We tried to set
> >> the "osd_find_best_info_ignore_history_les = true" but with no success
> >> the OSDs keep in a peering loop.
> >>
> >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  wrote:
> >> >
> >> > Here is the output with all OSD up and running.
> >> >
> >> > ceph -s: https://pastebin.com/5tMf12Lm
> >> > ceph health detail: https://pastebin.com/avDhcJt0
> >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> >> >
> >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  wrote:
> >> > >
> >> > > Hi Peter,
> >> > >
> >> > > On the weekend another host was down due to power problems, which was
> >> > > restarted. Therefore these outputs also include some "Degraded data
> >> > > redundancy" messages. And one OSD crashed due to a disk error.
> >> > >
> >> > > ceph -s: https://pastebin.com/Tm8QHp52
> >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> >> > >
> >> > > No it's not a EC pool which has the inactive+incomplete PGs.
> >> > >
> >> > > ceph osd crush dump | jq '[.rules, .tunables]':
> >> https://pastebin.com/gqDTjfat
> >> > >
> >> > > Best,
> >> > > Martin
> >> > >
> >> > > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> >> > >  wrote:
> >> > > >
> >> > > > Dear Martin,
> >> > > >
> >> > > > Can you provide some details?
> >> > > >
> >> > > > ceph -s
> >> > > > ceph health detail
> >> > > > ceph osd tree
> >> > > > ceph osd pool ls detail
> >> > > >
> >> > > > If it's EC (you implied it's not) also show the crush
> >> rules...and may as well include tunables (because greatly raising
> >> choose_total_tries, eg. 200 may be the solution to your problem):
> >> > > > ceph osd crush dump | jq '[.rules, .tunables]'
> >> > > >
> >> > > > Peter
> >> > > >
> >> > > > On 8/16/20 1:18 AM, Martin Palma wrote:
> >> > > > > Yes, but that didn’t help. After some time they have
> >> blocked requests again
> >> > > > > and remain inactive and incomplete.
> >> > > > >
> >> > > > > On Sat, 15 Aug 2020 at 16:58,  wrote:
> >> > > > >
> >> > > > >> Did you tried to restart the sayed osds?
> >> > > > >>
> >> > > > >>
> >> > > > >>
> >> > > > >> Hth
> >> > > > >>
> >> > > > >> Mehmet
> >> > > > >>
> >> > > > >>
> >> > > > >>
> >> > > > >> Am 12. August 2020 21:07:55 MESZ schrieb Martin Palma
> >> :
> >> > > > >>
> >> > > >  Are the OSDs online? Or do they refuse to boot?
> >> > > > >>> Yes. They are up and running and not marked as down or out of the
> >> > > > >>> cluster.
> >> > > >  Can you list the data with ceph-objectstore-tool on these OSDs?
> >> > > > >>> If you mean the "list" operation on the PG works if an output for
> >> > > > >>> example:
> >> > > > >>> $ ceph-objectstore-tool --data-path
> >> /var/lib/ceph/osd/ceph-63 --pgid
> >> > > > >>> 22.11a --op list
> >> > > > >>
> >> > > > >>>
> >> ["22.11a",{"oid":"1001c1ee04f.0007","key":"","snapid":-2,"hash":3825189146,"max":0,"pool":22,"namespace":"","max":0}]
> >> > > > >>
> >> > > > >>>
> >> ["22.11a",{"oid":"1000448667f.","key":"","snapid":-2,"hash":4294951194,"max":0,"pool":22,"namespace":"","max":0}]
> >> > > > >>> ...
> >> > > > >>> If I run "ceph pg ls incomplete" in the output only one PG has
> >> > > > >>> objects... all others have 0 objects.
> >> > > > >>> ___
> >> > > > >>> ceph-users mailing list -- ceph-users@ceph.io
> >> > > > >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >> > > > >> ___
> >> > > > >>
> >> > > > >> ceph-users mailing list -- ceph-users@ceph.io
> >> > > > >>
> >> > > > >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >> > >

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Michael Bisig
Hi Simon

Unfortunately, the other NVMe space is wasted, or at least that is the 
information we gathered during our research. This is due to RocksDB's 
level management, which is explained here 
(https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think 
it's a hard limit, but it will be something slightly above these values. Also consult 
this thread 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html). 
It's probably better to go a bit over these limits to be on the safe side.
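
The arithmetic behind those numbers, as far as I understand it (the exact 
base and multiplier are assumptions about Ceph's RocksDB defaults):

  L1 ~ max_bytes_for_level_base   ~ 256 MB
  L2 ~ 256 MB * 10                ~ 2.5 GB
  L3 ~ 2.5 GB * 10                ~  25 GB
  L4 ~  25 GB * 10                ~ 250 GB

A level only stays on the fast device if it fits there completely, which is 
why the useful DB sizes cluster around roughly 300 MB, 3 GB, 30 GB and 
300 GB once you add some headroom for the WAL and L0.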

Exactly, reality is always different. We also struggle with small files, which 
lead to further problems. Accordingly, the right initial sizing is pretty 
important and depends on your individual use case.

Regards,
Michael

On 20.08.20, 10:40, "Simon Oosthoek"  wrote:

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 GB 
per OSD on unused NVME space, because only 30GB is actually used...?

And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage

Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)

I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming out 
of the ceph cluster, even when you look only superficially...

Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:
> Hi Simon
> 
> As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space 
above such a limit will automatically end up on slow devices.
> In your setup where you have 123GB per OSD that means you only use 30GB 
of fast device. The DB which spills over this limit will be offloaded to the 
HDD and accordingly, it slows down requests and compactions.
> 
> You can proof what your OSD currently consumes with:
>ceph daemon osd.X perf dump
> 
> Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing compactions 
but Prometheus mgr module exports these values such that you can track it.
> 
> Small files generally leads to bigger RocksDB, especially when you use 
EC, but this depends on the actual amount and file sizes.
> 
> I hope this helps.
> Regards,
> Michael
> 
> On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
> 
>  Hi
> 
>  Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,
>  just 2 osd's and I disabled the warning for these osds.
>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> 
>  I'm wondering what causes this and how this can be prevented.
> 
>  As I understand it the rocksdb for the OSD needs to store more than 
fits
>  on the NVME logical volume (123G for 12T OSD). A way to fix it could 
be
>  to increase the logical volume on the nvme (if there was space on the
>  nvme, which there isn't at the moment).
> 
>  This is the current size of the cluster and how much is free:
> 
>  [root@cephmon1 ~]# ceph df
>  RAW STORAGE:
>      CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
>      hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
>      TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
>
>  POOLS:
>      POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
>      cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
>      cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
>      cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
>      cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
>      rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
>      .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
>      default.rgw.control  16       0 B         8       0 B       0     167 TiB
>      default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB
>      default.rgw.log      18       0 B       207       0 B       0     167 TiB
>      cephfs_data_ec57     20   433 MiB       230   1.2 GiB       0     278 TiB
> 
>  The amount used can still grow a bit before we need to add nodes, but
>  apparently we are running into the limits of our rocksdb partitions.

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Simon Oosthoek

Hi Michael,

thanks for the pointers! This is our first production ceph cluster and 
we have to learn as we go... Small files are always a problem for all 
(networked) filesystems; usually they just trash performance, but in 
this case they have another unfortunate side effect on the RocksDB :-(


Cheers

/Simon

On 20/08/2020 11:06, Michael Bisig wrote:

Hi Simon

Unfortunately, the other NVME space is wasted or at least, this is the 
information we gathered during our research. This fact is due to the RocksDB 
level management which is explained here 
(https://github.com/facebook/rocksdb/wiki/Leveled-Compaction). I don't think 
it's a hard limit but it will be something above these values. Also consult 
this thread 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html).
 It's probably better to go a bit over these limits to be on the safe side.

Exactly, reality is always different. We also struggle with small files which 
lead to further problems. Accordingly, the right initial setting is pretty 
important and depends on your individual usecase.

Regards,
Michael

On 20.08.20, 10:40, "Simon Oosthoek"  wrote:

 Hi Michael,

 thanks for the explanation! So if I understand correctly, we waste 93 GB
 per OSD on unused NVME space, because only 30GB is actually used...?

 And to improve the space for rocksdb, we need to plan for 300GB per
 rocksdb partition in order to benefit from this advantage

 Reducing the number of small files is something we always ask of our
 users, but reality is what it is ;-)

 I'll have to look into how I can get an informative view on these
 metrics... It's pretty overwhelming the amount of information coming out
 of the ceph cluster, even when you look only superficially...

 Cheers,

 /Simon

 On 20/08/2020 10:16, Michael Bisig wrote:
 > Hi Simon
 >
 > As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every DB space above such a 
limit will automatically end up on slow devices.
 > In your setup where you have 123GB per OSD that means you only use 30GB 
of fast device. The DB which spills over this limit will be offloaded to the HDD 
and accordingly, it slows down requests and compactions.
 >
 > You can proof what your OSD currently consumes with:
 >ceph daemon osd.X perf dump
 >
 > Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing compactions but 
Prometheus mgr module exports these values such that you can track it.
 >
 > Small files generally leads to bigger RocksDB, especially when you use 
EC, but this depends on the actual amount and file sizes.
 >
 > I hope this helps.
 > Regards,
 > Michael
 >
 > On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
 >
 >  Hi
 >
 >  Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,
 >  just 2 osd's and I disabled the warning for these osds.
 >  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
 >
 >  I'm wondering what causes this and how this can be prevented.
 >
 >  As I understand it the rocksdb for the OSD needs to store more than 
fits
 >  on the NVME logical volume (123G for 12T OSD). A way to fix it 
could be
 >  to increase the logical volume on the nvme (if there was space on 
the
 >  nvme, which there isn't at the moment).
 >
 >  This is the current size of the cluster and how much is free:
 >
 >  [root@cephmon1 ~]# ceph df
 >  RAW STORAGE:
 >      CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
 >      hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
 >      TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
 >
 >  POOLS:
 >      POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
 >      cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
 >      cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
 >      cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
 >      cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
 >      rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
 >      .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
 >      default.rgw.control  16       0 B         8       0 B       0     167 TiB
 >      default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Hi Simon,


starting with Nautilus v14.2.10, BlueStore is able to use the 'wasted' space on 
the DB volume.


see this PR: https://github.com/ceph/ceph/pull/29687

A nice overview of the overall BlueFS/RocksDB design can be found here:

https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

It also includes an overview of (as well as some additional concerns about) the 
changes brought by the above-mentioned PR.
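
A quick way to check whether all daemons are already on v14.2.10 or later:

  ceph versions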



Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:

Hi Michael,

thanks for the explanation! So if I understand correctly, we waste 93 
GB per OSD on unused NVME space, because only 30GB is actually used...?


And to improve the space for rocksdb, we need to plan for 300GB per 
rocksdb partition in order to benefit from this advantage


Reducing the number of small files is something we always ask of our 
users, but reality is what it is ;-)


I'll have to look into how I can get an informative view on these 
metrics... It's pretty overwhelming the amount of information coming 
out of the ceph cluster, even when you look only superficially...


Cheers,

/Simon

On 20/08/2020 10:16, Michael Bisig wrote:

Hi Simon

As far as I know, RocksDB only uses "leveled" space on the NVME 
partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every 
DB space above such a limit will automatically end up on slow devices.
In your setup where you have 123GB per OSD that means you only use 
30GB of fast device. The DB which spills over this limit will be 
offloaded to the HDD and accordingly, it slows down requests and 
compactions.


You can proof what your OSD currently consumes with:
   ceph daemon osd.X perf dump

Informative values are `db_total_bytes`, `db_used_bytes` and 
`slow_used_bytes`. This changes regularly because of the ongoing 
compactions but Prometheus mgr module exports these values such that 
you can track it.


Small files generally leads to bigger RocksDB, especially when you 
use EC, but this depends on the actual amount and file sizes.


I hope this helps.
Regards,
Michael

On 20.08.20, 09:10, "Simon Oosthoek"  wrote:

 Hi

 Recently our ceph cluster (nautilus) is experiencing bluefs 
spillovers,

 just 2 osd's and I disabled the warning for these osds.
 (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)

 I'm wondering what causes this and how this can be prevented.

 As I understand it the rocksdb for the OSD needs to store more 
than fits
 on the NVME logical volume (123G for 12T OSD). A way to fix it 
could be
 to increase the logical volume on the nvme (if there was space 
on the

 nvme, which there isn't at the moment).

 This is the current size of the cluster and how much is free:

 [root@cephmon1 ~]# ceph df
 RAW STORAGE:
     CLASS    SIZE       AVAIL     USED      RAW USED   %RAW USED
     hdd      1.8 PiB    842 TiB   974 TiB   974 TiB        53.63
     TOTAL    1.8 PiB    842 TiB   974 TiB   974 TiB        53.63

 POOLS:
     POOL                 ID   STORED    OBJECTS   USED      %USED   MAX AVAIL
     cephfs_data           1   572 MiB   121.26M   2.4 GiB       0     167 TiB
     cephfs_metadata       2    56 GiB     5.15M    57 GiB       0     167 TiB
     cephfs_data_3copy     8   201 GiB    51.68k   602 GiB    0.09     222 TiB
     cephfs_data_ec83     13   643 TiB   279.75M   953 TiB   58.86     485 TiB
     rbd                  14    21 GiB     5.66k    64 GiB       0     222 TiB
     .rgw.root            15   1.2 KiB         4     1 MiB       0     167 TiB
     default.rgw.control  16       0 B         8       0 B       0     167 TiB
     default.rgw.meta     17     765 B         4     1 MiB       0     167 TiB
     default.rgw.log      18       0 B       207       0 B       0     167 TiB
     cephfs_data_ec57     20   433 MiB       230   1.2 GiB       0     278 TiB

 The amount used can still grow a bit before we need to add 
nodes, but
 apparently we are running into the limits of our rocskdb 
partitions.


 Did we choose a parameter (e.g. minimal object size) too small, 
so we
 have too much objects on these spillover OSDs? Or is it that too 
many

 small files are stored on the cephfs filesystems?

 When we expand the cluster, we can choose larger nvme devices to 
allow
 larger rocksdb partitions, but is that the right way to deal 
with this,
 or should we adjust some parameters on the cluster that will 
reduce the

 rocksdb size?

 Cheers

 /Simon
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Martin Palma
All inactive and incomplete PGs are blocked by OSD 81 which does not
exist anymore:
...
"down_osds_we_would_probe": [
81
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]
...

Here the full output: https://pastebin.com/V5EPZ0N7

On Thu, Aug 20, 2020 at 10:58 AM Dan van der Ster  wrote:
>
> Something else to help debugging is
>
> ceph pg 17.173 query
>
> at the end it should say why the pg is incomplete.
>
> -- dan
>
>
>
> On Thu, Aug 20, 2020 at 10:01 AM Eugen Block  wrote:
> >
> > Hi Martin,
> >
> > have you seen this blog post [1]? It describes how to recover from
> > inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> > that but it could be worth a try. Apparently it only would work if the
> > affected PGs have 0 objects but that seems to be the case, right?
> >
> > Regards,
> > Eugen
> >
> > [1]
> > https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
> >
> >
> > Zitat von Martin Palma :
> >
> > > If Ceph consultants are reading this please feel free to contact me
> > > off list. We are seeking for someone who can help us of course we will
> > > pay.
> > >
> > >
> > >
> > > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  wrote:
> > >>
> > >> After doing some research I suspect the problem is that during the
> > >> cluster was backfilling an OSD was removed.
> > >>
> > >> Now the PGs which are inactive and incomplete have all the same
> > >> (removed OSD) in the "down_osds_we_would_probe" output and the peering
> > >> is blocked by "peering_blocked_by_history_les_bound". We tried to set
> > >> the "osd_find_best_info_ignore_history_les = true" but with no success
> > >> the OSDs keep in a peering loop.
> > >>
> > >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  wrote:
> > >> >
> > >> > Here is the output with all OSD up and running.
> > >> >
> > >> > ceph -s: https://pastebin.com/5tMf12Lm
> > >> > ceph health detail: https://pastebin.com/avDhcJt0
> > >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> > >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> > >> >
> > >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  wrote:
> > >> > >
> > >> > > Hi Peter,
> > >> > >
> > >> > > On the weekend another host was down due to power problems, which was
> > >> > > restarted. Therefore these outputs also include some "Degraded data
> > >> > > redundancy" messages. And one OSD crashed due to a disk error.
> > >> > >
> > >> > > ceph -s: https://pastebin.com/Tm8QHp52
> > >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> > >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> > >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> > >> > >
> > >> > > No it's not a EC pool which has the inactive+incomplete PGs.
> > >> > >
> > >> > > ceph osd crush dump | jq '[.rules, .tunables]':
> > >> https://pastebin.com/gqDTjfat
> > >> > >
> > >> > > Best,
> > >> > > Martin
> > >> > >
> > >> > > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> > >> > >  wrote:
> > >> > > >
> > >> > > > Dear Martin,
> > >> > > >
> > >> > > > Can you provide some details?
> > >> > > >
> > >> > > > ceph -s
> > >> > > > ceph health detail
> > >> > > > ceph osd tree
> > >> > > > ceph osd pool ls detail
> > >> > > >
> > >> > > > If it's EC (you implied it's not) also show the crush
> > >> rules...and may as well include tunables (because greatly raising
> > >> choose_total_tries, eg. 200 may be the solution to your problem):
> > >> > > > ceph osd crush dump | jq '[.rules, .tunables]'
> > >> > > >
> > >> > > > Peter
> > >> > > >
> > >> > > > On 8/16/20 1:18 AM, Martin Palma wrote:
> > >> > > > > Yes, but that didn’t help. After some time they have
> > >> blocked requests again
> > >> > > > > and remain inactive and incomplete.
> > >> > > > >
> > >> > > > > On Sat, 15 Aug 2020 at 16:58,  wrote:
> > >> > > > >
> > >> > > > >> Did you tried to restart the sayed osds?
> > >> > > > >>
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> Hth
> > >> > > > >>
> > >> > > > >> Mehmet
> > >> > > > >>
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> Am 12. August 2020 21:07:55 MESZ schrieb Martin Palma
> > >> :
> > >> > > > >>
> > >> > > >  Are the OSDs online? Or do they refuse to boot?
> > >> > > > >>> Yes. They are up and running and not marked as down or out of 
> > >> > > > >>> the
> > >> > > > >>> cluster.
> > >> > > >  Can you list the data with ceph-objectstore-tool on these 
> > >> > > >  OSDs?
> > >> > > > >>> If you mean the "list" operation on the PG works if an output 
> > >> > > > >>> for
> > >> > > > >>> example:
> > >> > > > >>> $ ceph-objectstore-tool --data-path
> > >> /var/lib/ceph/osd/ceph-63 --pgid
> > >> > > > >>> 22.11a --op list
> > >> > > > >>
> > >> > > > >>>
> > >> ["22.11a",{"oid":"1001c1ee04f.0007","key":"","snapid":-2,"hash":3825189146,"ma

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Dan van der Ster
Did you already mark osd.81 as lost?

AFAIU you need to `ceph osd lost 81`, and *then* you can try the
osd_find_best_info_ignore_history_les option.
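
Roughly (sketch, please double-check before running; <id> is the primary 
OSD of a stuck PG):

  ceph osd lost 81 --yes-i-really-mean-it
  ceph config set osd.<id> osd_find_best_info_ignore_history_les true   # or via ceph.conf on older releases
  ceph osd down <id>   # force the PG to re-peer with the option in effect

and remove the option again once the PGs are active.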

-- dan


On Thu, Aug 20, 2020 at 11:31 AM Martin Palma  wrote:
>
> All inactive and incomplete PGs are blocked by OSD 81 which does not
> exist anymore:
> ...
> "down_osds_we_would_probe": [
> 81
> ],
> "peering_blocked_by": [],
> "peering_blocked_by_detail": [
> {
> "detail": "peering_blocked_by_history_les_bound"
> }
> ]
> ...
>
> Here the full output: https://pastebin.com/V5EPZ0N7
>
> On Thu, Aug 20, 2020 at 10:58 AM Dan van der Ster  wrote:
> >
> > Something else to help debugging is
> >
> > ceph pg 17.173 query
> >
> > at the end it should say why the pg is incomplete.
> >
> > -- dan
> >
> >
> >
> > On Thu, Aug 20, 2020 at 10:01 AM Eugen Block  wrote:
> > >
> > > Hi Martin,
> > >
> > > have you seen this blog post [1]? It describes how to recover from
> > > inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> > > that but it could be worth a try. Apparently it only would work if the
> > > affected PGs have 0 objects but that seems to be the case, right?
> > >
> > > Regards,
> > > Eugen
> > >
> > > [1]
> > > https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
> > >
> > >
> > > Zitat von Martin Palma :
> > >
> > > > If Ceph consultants are reading this please feel free to contact me
> > > > off list. We are seeking for someone who can help us of course we will
> > > > pay.
> > > >
> > > >
> > > >
> > > > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  wrote:
> > > >>
> > > >> After doing some research I suspect the problem is that during the
> > > >> cluster was backfilling an OSD was removed.
> > > >>
> > > >> Now the PGs which are inactive and incomplete have all the same
> > > >> (removed OSD) in the "down_osds_we_would_probe" output and the peering
> > > >> is blocked by "peering_blocked_by_history_les_bound". We tried to set
> > > >> the "osd_find_best_info_ignore_history_les = true" but with no success
> > > >> the OSDs keep in a peering loop.
> > > >>
> > > >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  wrote:
> > > >> >
> > > >> > Here is the output with all OSD up and running.
> > > >> >
> > > >> > ceph -s: https://pastebin.com/5tMf12Lm
> > > >> > ceph health detail: https://pastebin.com/avDhcJt0
> > > >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> > > >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> > > >> >
> > > >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  wrote:
> > > >> > >
> > > >> > > Hi Peter,
> > > >> > >
> > > >> > > On the weekend another host was down due to power problems, which 
> > > >> > > was
> > > >> > > restarted. Therefore these outputs also include some "Degraded data
> > > >> > > redundancy" messages. And one OSD crashed due to a disk error.
> > > >> > >
> > > >> > > ceph -s: https://pastebin.com/Tm8QHp52
> > > >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> > > >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> > > >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> > > >> > >
> > > >> > > No it's not a EC pool which has the inactive+incomplete PGs.
> > > >> > >
> > > >> > > ceph osd crush dump | jq '[.rules, .tunables]':
> > > >> https://pastebin.com/gqDTjfat
> > > >> > >
> > > >> > > Best,
> > > >> > > Martin
> > > >> > >
> > > >> > > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> > > >> > >  wrote:
> > > >> > > >
> > > >> > > > Dear Martin,
> > > >> > > >
> > > >> > > > Can you provide some details?
> > > >> > > >
> > > >> > > > ceph -s
> > > >> > > > ceph health detail
> > > >> > > > ceph osd tree
> > > >> > > > ceph osd pool ls detail
> > > >> > > >
> > > >> > > > If it's EC (you implied it's not) also show the crush
> > > >> rules...and may as well include tunables (because greatly raising
> > > >> choose_total_tries, eg. 200 may be the solution to your problem):
> > > >> > > > ceph osd crush dump | jq '[.rules, .tunables]'
> > > >> > > >
> > > >> > > > Peter
> > > >> > > >
> > > >> > > > On 8/16/20 1:18 AM, Martin Palma wrote:
> > > >> > > > > Yes, but that didn’t help. After some time they have
> > > >> blocked requests again
> > > >> > > > > and remain inactive and incomplete.
> > > >> > > > >
> > > >> > > > > On Sat, 15 Aug 2020 at 16:58,  wrote:
> > > >> > > > >
> > > >> > > > >> Did you tried to restart the sayed osds?
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> Hth
> > > >> > > > >>
> > > >> > > > >> Mehmet
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> Am 12. August 2020 21:07:55 MESZ schrieb Martin Palma
> > > >> :
> > > >> > > > >>
> > > >> > > >  Are the OSDs online? Or do they refuse to boot?
> > > >> > > > >>> Yes. They are up and running and not marked as down or out 
> > > >> > > > >>> of the
> > > 

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Martin Palma
Yes, we already did that, but since the OSD does not exist anymore we
get the following error:

% ceph osd lost 81 --yes-i-really-mean-it
Error ENOENT: osd.81 does not exist

So we do not know how we can bring the PGs to notice that OSD 81 does
not exist anymore...

On Thu, Aug 20, 2020 at 11:41 AM Dan van der Ster  wrote:
>
> Did you already mark osd.81 as lost?
>
> AFAIU you need to `ceph osd lost 81`, and *then* you can try the
> osd_find_best_info_ignore_history_les option.
>
> -- dan
>
>
> On Thu, Aug 20, 2020 at 11:31 AM Martin Palma  wrote:
> >
> > All inactive and incomplete PGs are blocked by OSD 81 which does not
> > exist anymore:
> > ...
> > "down_osds_we_would_probe": [
> > 81
> > ],
> > "peering_blocked_by": [],
> > "peering_blocked_by_detail": [
> > {
> > "detail": "peering_blocked_by_history_les_bound"
> > }
> > ]
> > ...
> >
> > Here the full output: https://pastebin.com/V5EPZ0N7
> >
> > On Thu, Aug 20, 2020 at 10:58 AM Dan van der Ster  
> > wrote:
> > >
> > > Something else to help debugging is
> > >
> > > ceph pg 17.173 query
> > >
> > > at the end it should say why the pg is incomplete.
> > >
> > > -- dan
> > >
> > >
> > >
> > > On Thu, Aug 20, 2020 at 10:01 AM Eugen Block  wrote:
> > > >
> > > > Hi Martin,
> > > >
> > > > have you seen this blog post [1]? It describes how to recover from
> > > > inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> > > > that but it could be worth a try. Apparently it only would work if the
> > > > affected PGs have 0 objects but that seems to be the case, right?
> > > >
> > > > Regards,
> > > > Eugen
> > > >
> > > > [1]
> > > > https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
> > > >
> > > >
> > > > Zitat von Martin Palma :
> > > >
> > > > > If Ceph consultants are reading this please feel free to contact me
> > > > > off list. We are seeking for someone who can help us of course we will
> > > > > pay.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  wrote:
> > > > >>
> > > > >> After doing some research I suspect the problem is that during the
> > > > >> cluster was backfilling an OSD was removed.
> > > > >>
> > > > >> Now the PGs which are inactive and incomplete have all the same
> > > > >> (removed OSD) in the "down_osds_we_would_probe" output and the 
> > > > >> peering
> > > > >> is blocked by "peering_blocked_by_history_les_bound". We tried to set
> > > > >> the "osd_find_best_info_ignore_history_les = true" but with no 
> > > > >> success
> > > > >> the OSDs keep in a peering loop.
> > > > >>
> > > > >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  wrote:
> > > > >> >
> > > > >> > Here is the output with all OSD up and running.
> > > > >> >
> > > > >> > ceph -s: https://pastebin.com/5tMf12Lm
> > > > >> > ceph health detail: https://pastebin.com/avDhcJt0
> > > > >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> > > > >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> > > > >> >
> > > > >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > Hi Peter,
> > > > >> > >
> > > > >> > > On the weekend another host was down due to power problems, 
> > > > >> > > which was
> > > > >> > > restarted. Therefore these outputs also include some "Degraded 
> > > > >> > > data
> > > > >> > > redundancy" messages. And one OSD crashed due to a disk error.
> > > > >> > >
> > > > >> > > ceph -s: https://pastebin.com/Tm8QHp52
> > > > >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> > > > >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> > > > >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> > > > >> > >
> > > > >> > > No it's not a EC pool which has the inactive+incomplete PGs.
> > > > >> > >
> > > > >> > > ceph osd crush dump | jq '[.rules, .tunables]':
> > > > >> https://pastebin.com/gqDTjfat
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Martin
> > > > >> > >
> > > > >> > > On Sun, Aug 16, 2020 at 3:44 PM Peter Maloney
> > > > >> > >  wrote:
> > > > >> > > >
> > > > >> > > > Dear Martin,
> > > > >> > > >
> > > > >> > > > Can you provide some details?
> > > > >> > > >
> > > > >> > > > ceph -s
> > > > >> > > > ceph health detail
> > > > >> > > > ceph osd tree
> > > > >> > > > ceph osd pool ls detail
> > > > >> > > >
> > > > >> > > > If it's EC (you implied it's not) also show the crush
> > > > >> rules...and may as well include tunables (because greatly raising
> > > > >> choose_total_tries, eg. 200 may be the solution to your problem):
> > > > >> > > > ceph osd crush dump | jq '[.rules, .tunables]'
> > > > >> > > >
> > > > >> > > > Peter
> > > > >> > > >
> > > > >> > > > On 8/16/20 1:18 AM, Martin Palma wrote:
> > > > >> > > > > Yes, but that didn’t help. After some time they have
> > > > >> blocked requests again
> > > > >> > > > > an

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-20 Thread Martin Palma
On one pool, which was only a test pool, we investigated both OSDs
which host the inactive and incomplete PG with the following command:

% ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-## --pgid
 --op list

On the primary OSD for the PG we saw no output, but on the secondary
we did. So we marked that PG on that OSD as complete, which solved the
inactive/incomplete PG for that pool.
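
Concretely, that mark-complete step looks roughly like this
(ceph-objectstore-tool needs the OSD to be stopped while it runs; the
osd id and pgid below are placeholders):

% systemctl stop ceph-osd@##
% ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-## --pgid <pgid> --op mark-complete
% systemctl start ceph-osd@##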

The other PGs are from our main CephFS pool, and we fear that by doing
the above we could lose access to the whole pool and its data.

On Thu, Aug 20, 2020 at 11:49 AM Martin Palma  wrote:
>
> Yes we already did that but since the OSD does not exists anymore we
> get the following error:
>
> % ceph osd lost 81 --yes-i-really-mean-it
> Error ENOENT: osd.81 does not exist
>
> So we do not know how we can bring the PGs to notice that OSD 81 does
> not exist anymore...
>
> On Thu, Aug 20, 2020 at 11:41 AM Dan van der Ster  wrote:
> >
> > Did you already mark osd.81 as lost?
> >
> > AFAIU you need to `ceph osd lost 81`, and *then* you can try the
> > osd_find_best_info_ignore_history_les option.
> >
> > -- dan
> >
> >
> > On Thu, Aug 20, 2020 at 11:31 AM Martin Palma  wrote:
> > >
> > > All inactive and incomplete PGs are blocked by OSD 81 which does not
> > > exist anymore:
> > > ...
> > > "down_osds_we_would_probe": [
> > > 81
> > > ],
> > > "peering_blocked_by": [],
> > > "peering_blocked_by_detail": [
> > > {
> > > "detail": "peering_blocked_by_history_les_bound"
> > > }
> > > ]
> > > ...
> > >
> > > Here the full output: https://pastebin.com/V5EPZ0N7
> > >
> > > On Thu, Aug 20, 2020 at 10:58 AM Dan van der Ster  
> > > wrote:
> > > >
> > > > Something else to help debugging is
> > > >
> > > > ceph pg 17.173 query
> > > >
> > > > at the end it should say why the pg is incomplete.
> > > >
> > > > -- dan
> > > >
> > > >
> > > >
> > > > On Thu, Aug 20, 2020 at 10:01 AM Eugen Block  wrote:
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > have you seen this blog post [1]? It describes how to recover from
> > > > > inactive and incomplete PGs (on a size 1 pool). I haven't tried any of
> > > > > that but it could be worth a try. Apparently it only would work if the
> > > > > affected PGs have 0 objects but that seems to be the case, right?
> > > > >
> > > > > Regards,
> > > > > Eugen
> > > > >
> > > > > [1]
> > > > > https://medium.com/opsops/recovering-ceph-from-reduced-data-availability-3-pgs-inactive-3-pgs-incomplete-b97cbcb4b5a1
> > > > >
> > > > >
> > > > > Zitat von Martin Palma :
> > > > >
> > > > > > If Ceph consultants are reading this please feel free to contact me
> > > > > > off list. We are seeking for someone who can help us of course we 
> > > > > > will
> > > > > > pay.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 17, 2020 at 12:50 PM Martin Palma  
> > > > > > wrote:
> > > > > >>
> > > > > >> After doing some research I suspect the problem is that during the
> > > > > >> cluster was backfilling an OSD was removed.
> > > > > >>
> > > > > >> Now the PGs which are inactive and incomplete have all the same
> > > > > >> (removed OSD) in the "down_osds_we_would_probe" output and the 
> > > > > >> peering
> > > > > >> is blocked by "peering_blocked_by_history_les_bound". We tried to 
> > > > > >> set
> > > > > >> the "osd_find_best_info_ignore_history_les = true" but with no 
> > > > > >> success
> > > > > >> the OSDs keep in a peering loop.
> > > > > >>
> > > > > >> On Mon, Aug 17, 2020 at 9:53 AM Martin Palma  
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > Here is the output with all OSD up and running.
> > > > > >> >
> > > > > >> > ceph -s: https://pastebin.com/5tMf12Lm
> > > > > >> > ceph health detail: https://pastebin.com/avDhcJt0
> > > > > >> > ceph osd tree: https://pastebin.com/XEB0eUbk
> > > > > >> > ceph osd pool ls detail: https://pastebin.com/ShSdmM5a
> > > > > >> >
> > > > > >> > On Mon, Aug 17, 2020 at 9:38 AM Martin Palma  
> > > > > >> > wrote:
> > > > > >> > >
> > > > > >> > > Hi Peter,
> > > > > >> > >
> > > > > >> > > On the weekend another host was down due to power problems, 
> > > > > >> > > which was
> > > > > >> > > restarted. Therefore these outputs also include some "Degraded 
> > > > > >> > > data
> > > > > >> > > redundancy" messages. And one OSD crashed due to a disk error.
> > > > > >> > >
> > > > > >> > > ceph -s: https://pastebin.com/Tm8QHp52
> > > > > >> > > ceph health detail: https://pastebin.com/SrA7Bivj
> > > > > >> > > ceph osd tree: https://pastebin.com/nBK8Uafd
> > > > > >> > > ceph osd pool ls detail: https://pastebin.com/kYyCb7B2
> > > > > >> > >
> > > > > >> > > No it's not a EC pool which has the inactive+incomplete PGs.
> > > > > >> > >
> > > > > >> > > ceph osd crush dump | jq '[.rules, .tunables]':
> > > > > >> https://pastebin.com/gqDTjfat
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > > Martin
> > > > > >> > >

[ceph-users] Ceph on windows?

2020-08-20 Thread Stolte, Felix
Hey guys,

it seems like there was a presentation called “ceph on windows” at 
Cephalocon 2020, but I cannot find any information on that topic. Is there a 
video of the presentation out there, or any other information? I only found 
https://ceph2020.sched.com/event/ZDUK/ceph-on-windows-alessandro-pilotti-cloudbase-solutions-mike-latimer-suse


It would be great to mount RBDs directly instead of using an iSCSI gateway.

IT-Services
Telefon 02461 61-9243
E-Mail: f.sto...@fz-juelich.de
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt
-
-
 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
Hi Igor.

Could you please tell me why this config is at LEVEL_DEV (
https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options in production
environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:

> Hi Simon,
>
>
> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
> volume.
>
> see this PR: https://github.com/ceph/ceph/pull/29687
>
> Nice overview on the overall BlueFS/RocksDB design can be find here:
>
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> Which also includes some overview (as well as additional concerns) for
> changes brought by the above-mentioned PR.
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > thanks for the explanation! So if I understand correctly, we waste 93
> > GB per OSD on unused NVME space, because only 30GB is actually used...?
> >
> > And to improve the space for rocksdb, we need to plan for 300GB per
> > rocksdb partition in order to benefit from this advantage
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... It's pretty overwhelming the amount of information coming
> > out of the ceph cluster, even when you look only superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVME
> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
> >> DB space above such a limit will automatically end up on slow devices.
> >> In your setup where you have 123GB per OSD that means you only use
> >> 30GB of fast device. The DB which spills over this limit will be
> >> offloaded to the HDD and accordingly, it slows down requests and
> >> compactions.
> >>
> >> You can proof what your OSD currently consumes with:
> >>ceph daemon osd.X perf dump
> >>
> >> Informative values are `db_total_bytes`, `db_used_bytes` and
> >> `slow_used_bytes`. This changes regularly because of the ongoing
> >> compactions but Prometheus mgr module exports these values such that
> >> you can track it.
> >>
> >> Small files generally leads to bigger RocksDB, especially when you
> >> use EC, but this depends on the actual amount and file sizes.
> >>
> >> I hope this helps.
> >> Regards,
> >> Michael
> >>
> >> On 20.08.20, 09:10, "Simon Oosthoek"  wrote:
> >>
> >>  Hi
> >>
> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
> >> spillovers,
> >>  just 2 osd's and I disabled the warning for these osds.
> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
> >>
> >>  I'm wondering what causes this and how this can be prevented.
> >>
> >>  As I understand it the rocksdb for the OSD needs to store more
> >> than fits
> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
> >> could be
> >>  to increase the logical volume on the nvme (if there was space
> >> on the
> >>  nvme, which there isn't at the moment).
> >>
> >>  This is the current size of the cluster and how much is free:
> >>
> >>  [root@cephmon1 ~]# ceph df
> >>  RAW STORAGE:
> >>   CLASS SIZEAVAIL   USEDRAW USED
> >> %RAW USED
> >>   hdd   1.8 PiB 842 TiB 974 TiB  974
> >> TiB 53.63
> >>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
> >> TiB 53.63
> >>
> >>  POOLS:
> >>   POOLID STORED  OBJECTS USED
> >>  %USED MAX AVAIL
> >>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
> >>  0   167 TiB
> >>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
> >>  0   167 TiB
> >>   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
> >>  0.09   222 TiB
> >>   cephfs_data_ec8313 643 TiB 279.75M 953 TiB
> >>  58.86   485 TiB
> >>   rbd 14  21 GiB 5.66k  64 GiB
> >>  0   222 TiB
> >>   .rgw.root   15 1.2 KiB 4   1 MiB
> >>  0   167 TiB
> >>   default.rgw.control 16 0 B 8 0 B
> >>  0   167 TiB
> >>   default.rgw.meta17   765 B 4   1 MiB
> >>  0   167 TiB
> >>   default.rgw.log 18 0 B 207 0 B
> >>  0   167 TiB
> >>   cephfs_data_ec5720 433 MiB 230 1.2 GiB
> >>  0   278 TiB
> >>
> >>  The amount used can still grow a bit before we need to add
> >> nodes, but
> >>  apparently we are running into the limits of our rocskdb
> >> partitions.
> >>
> >>  D

[ceph-users] Ceph mon crash, many osd down

2020-08-20 Thread hoannv46
Hi all.

My cluster's mons log many scrub messages like these:

2020-08-20 13:12:16.393 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {auth=100} crc {auth=3066031631})
2020-08-20 13:12:16.395 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {auth=100} crc {auth=221313478})
2020-08-20 13:12:16.401 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {auth=15,config=2,health=10,logm=73} crc 
{auth=2119885989,config=3307175017,health=67914304,logm=3854202346})
2020-08-20 13:12:16.404 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {logm=100} crc {logm=3116621380})
2020-08-20 13:12:16.408 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {logm=100} crc {logm=767596958})
2020-08-20 13:12:16.411 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {logm=100} crc {logm=3982727178})
2020-08-20 13:12:16.414 7fe89becc700  0 log_channel(cluster) log [DBG] : scrub 
ok on 0,1,2,3: ScrubResult(keys {logm=100} crc {logm=4144183080})

After 900 seconds, the mon marks OSDs down:

2020-08-20 13:23:32.546 7fe89e6d1700  0 log_channel(cluster) log [INF] : 
osd.112 marked down after no beacon for 904.106586 seconds
2020-08-20 13:23:32.546 7fe89e6d1700 -1 mon.ceph-mon-1@0(leader).osd e2960665 
no beacon from osd.112 since 2020-08-20 13:08:28.441052, 904.106586 seconds 
ago.  marking down
2020-08-20 13:23:32.551 7fe89e6d1700  0 log_channel(cluster) log [WRN] : Health 
check failed: 1 osds down (OSD_DOWN)
2020-08-20 13:24:07.899 7fe89e6d1700  0 log_channel(cluster) log [INF] : 
osd.263 marked down after no beacon for 901.445447 seconds
2020-08-20 13:24:07.899 7fe89e6d1700 -1 mon.ceph-mon-1@0(leader).osd e2960666 
no beacon from osd.263 since 2020-08-20 13:09:06.454891, 901.445447 seconds 
ago.  marking down
2020-08-20 13:24:07.902 7fe89e6d1700  0 log_channel(cluster) log [WRN] : Health 
check update: 2 osds down (OSD_DOWN)
2020-08-20 13:24:13.020 7fe89e6d1700  0 log_channel(cluster) log [INF] : 
osd.384 marked down after no beacon for 900.132560 seconds
2020-08-20 13:24:13.020 7fe89e6d1700 -1 mon.ceph-mon-1@0(leader).osd e2960667 
no beacon from osd.384 since 2020-08-20 13:09:12.44, 900.132560 seconds 
ago.  marking down
2020-08-20 13:24:13.020 7fe89e6d1700  0 log_channel(cluster) log [INF] : 
osd.614 marked down after no beacon for 901.359447 seconds
2020-08-20 13:24:13.020 7fe89e6d1700 -1 mon.ceph-mon-1@0(leader).osd e2960667 
no beacon from osd.614 since 2020-08-20 13:09:11.661958, 901.359447 seconds 
ago.  marking down
2020-08-20 13:24:13.026 7fe89e6d1700  0 log_channel(cluster) log [WRN] : Health 
check update: 4 osds down (OSD_DOWN)
2020-08-20 13:24:18.084 7fe89e6d1700  0 log_channel(cluster) log [INF] : osd.34 
marked down after no beacon for 903.818250 seconds

Is this a bug in the ceph mon?
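
The 900 seconds seem to match the default of mon_osd_report_timeout
(assuming that is the timeout involved here). Should I check/raise it, e.g.

ceph config get mon mon_osd_report_timeout
ceph config set mon mon_osd_report_timeout 1800

or is the real question why the OSDs stopped sending beacons in the first
place?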

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Dan and Mark,

could you please let me know if you can read the files with the version info I 
provided in my previous e-mail? I'm in the process of collecting data with more 
FS activity and would like to send it in a format that is useful for 
investigation.

Right now I'm observing a daily growth of swap of ca. 100-200 MB on servers with 
16 OSDs each, 1 SSD and 15 HDDs. The OS+daemons operate fine; the OS manages to 
keep enough RAM available. Also, the mempool dump still shows onode and data 
cached at a seemingly reasonable level. Users report more stable performance 
of the FS after I increased the cache min sizes on all OSDs.
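
(For reference, the mempool numbers above come from the admin socket, e.g.

ceph daemon osd.0 dump_mempools

with osd.0 only as a placeholder; ceph tell osd.0 heap stats gives the
corresponding tcmalloc view.)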

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 17 August 2020 09:37
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan,

I use the container 
docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can 
see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, its a 
Centos 7 build. The version is:

# ceph -v
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

On Centos, the profiler packages are called different, without the "google-" 
prefix. The version I have installed is

# pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

It is possible to install pprof inside this container and analyse the 
*.heap-files I provided.

If this doesn't work for you and you want me to generate the text output for 
heap-files, I can do that. Please let me know if I should do all files and with 
what option (eg. against a base etc.).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 August 2020 10:38:57
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I'm having trouble getting the exact version of ceph you used to
create this heap profile.
Could you run the google-pprof --text steps at [1] and share the output?

Thanks, Dan

[1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/


On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
>
> Hi Mark,
>
> here is a first collection of heap profiling data (valid 30 days):
>
> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>
> This was collected with the following config settings:
>
>   osd  dev  osd_memory_cache_min  
> 805306368
>   osd  basicosd_memory_target 
> 2147483648
>
> Setting the cache_min value seems to help keeping cache space available. 
> Unfortunately, the above collection is for 12 days only. I needed to restart 
> the OSD and will need to restart it soon again. I hope I can then run a 
> longer sample. The profiling does cause slow ops though.
>
> Maybe you can see something already? It seems to have collected some leaked 
> memory. Unfortunately, it was a period of extremely low load. Basically, with 
> the day of recording the utilization dropped to almost zero.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 21 July 2020 12:57:32
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Quick question: Is there a way to change the frequency of heap dumps? On this 
> page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
> HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
> configuring this. Is there a config parameter or a ceph daemon call to adjust 
> this?
>
> If not, can I change the dump path?
>
> Its likely to overrun my log partition quickly if I cannot adjust either of 
> the two.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 20 July 2020 15:19:05
> To: Mark Nelson; Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Dear Mark,
>
> thank you very much for the very helpful answers. I will raise 
> osd_memory_cache_min, leave everything else alone and watch what happens. I 
> will report back here.
>
> Thanks also for raising this as an issue.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Mark Nelson 
> Sent: 20 July 2020 15:08:11
> To: Frank Schilder; Dan van der Ster
> Cc: ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> On 7/20/20 3:23 AM, Fran

[ceph-users] Re: Ceph on windows?

2020-08-20 Thread Jason Dillaman
It's an effort to expose RBD to Windows via a native driver [1]. That
driver is basically a thin NBD shim to connect with the rbd-nbd daemon
running as a Windows service.
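
For comparison, the existing Linux flow with rbd-nbd looks roughly like
this (pool/image names are just placeholders):

% rbd create mypool/myimage --size 10G
% rbd-nbd map mypool/myimage
/dev/nbd0
% mkfs.xfs /dev/nbd0

and presumably the new Windows driver takes the place of the Linux nbd
kernel module in that picture.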

On Thu, Aug 20, 2020 at 6:07 AM Stolte, Felix  wrote:
>
> Hey guys,
>
> it seems like there was a presentation called “ceph on windows” at the 
> Cephalocon 2020, but I cannot find any information on that topic. Is there a 
> video from the presentation out there or any other information? I only found 
> https://ceph2020.sched.com/event/ZDUK/ceph-on-windows-alessandro-pilotti-cloudbase-solutions-mike-latimer-suse
>
>
> Would be great to mount rbds directly instead using an iscsi gateway.
>
> IT-Services
> Telefon 02461 61-9243
> E-Mail: f.sto...@fz-juelich.de
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt
> -
> -
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

[1] https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+label%3Awin32+

-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Hi Seena,

this parameter isn't intended to be adjusted in production environments 
- the assumption is that the default behavior covers all regular customers' needs.


The issue, though, is that the default setting is invalid. It should be 
'use_some_extra'. Gonna fix that shortly...



Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV 
(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)? 
As it is documented in Ceph we can't use LEVEL_DEV in production 
environments!


Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov > wrote:


Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space
at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional concerns)
for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly, we
waste 93
> GB per OSD on unused NVME space, because only 30GB is actually
used...?
>
> And to improve the space for rocksdb, we need to plan for 300GB per
> rocksdb partition in order to benefit from this advantage
>
> Reducing the number of small files is something we always ask of
our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on these
> metrics... It's pretty overwhelming the amount of information
coming
> out of the ceph cluster, even when you look only superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the NVME
>> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
Every
>> DB space above such a limit will automatically end up on slow
devices.
>> In your setup where you have 123GB per OSD that means you only use
>> 30GB of fast device. The DB which spills over this limit will be
>> offloaded to the HDD and accordingly, it slows down requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of the ongoing
>> compactions but Prometheus mgr module exports these values such
that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB, especially when you
>> use EC, but this depends on the actual amount and file sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek" mailto:s.oosth...@science.ru.nl>> wrote:
>>
>>  Hi
>>
>>  Recently our ceph cluster (nautilus) is experiencing bluefs
>> spillovers,
>>  just 2 osd's and I disabled the warning for these osds.
>>  (ceph config set osd.125
bluestore_warn_on_bluefs_spillover false)
>>
>>  I'm wondering what causes this and how this can be prevented.
>>
>>  As I understand it the rocksdb for the OSD needs to store
more
>> than fits
>>  on the NVME logical volume (123G for 12T OSD). A way to
fix it
>> could be
>>  to increase the logical volume on the nvme (if there was
space
>> on the
>>  nvme, which there isn't at the moment).
>>
>>  This is the current size of the cluster and how much is free:
>>
>>  [root@cephmon1 ~]# ceph df
>>  RAW STORAGE:
>>   CLASS SIZE    AVAIL USED    RAW USED
>> %RAW USED
>>   hdd   1.8 PiB 842 TiB 974 TiB  974
>> TiB 53.63
>>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
>> TiB 53.63
>>
>>  POOLS:
>>   POOL    ID STORED OBJECTS USED
>>  %USED MAX AVAIL
>>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
>>  0   167 TiB
>>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
>>  0   167 TiB
>>   cephfs_data_3copy    8 201 GiB 51.68k 602 GiB
>>  0.09   222 TiB
>>   cephfs_data_ec83    13 643 TiB 279.75M 953 TiB
>>  58.86   485 TiB
>>   rbd 14  21 GiB 5.66k  64 GiB
>>  0   222 TiB
>>   .rgw.root   15 1.2 KiB 4   1 MiB

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
Great, thanks.

Is it safe to change it manually in ceph.conf now, or should I wait for the
next Nautilus release for this change? I mean, has QA been run with this
value for this config, so that we can trust it and change it, or should we
wait until the next Nautilus release where QA has run with this value?
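
I.e. would it be safe to set something like the following (assuming the
option introduced by that PR is bluestore_volume_selection_policy; please
correct me if the name is different):

[osd]
bluestore_volume_selection_policy = use_some_extra

or via the config database (probably with an OSD restart to take effect):

ceph config set osd bluestore_volume_selection_policy use_some_extra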

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:

> Hi Seena,
>
> this parameter isn't intended to be adjusted in production environments -
> it's supposed that default behavior covers all regular customers' needs.
>
> The issue though is that default setting is invalid. It should be
> 'use_some_extra'. Gonna fix that shortly...
>
>
> Thanks,
>
> Igor
>
>
>
>
> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>
> Hi Igor.
>
> Could you please tell why this config is in LEVEL_DEV (
> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
> As it is documented in Ceph we can't use LEVEL_DEV in production
> environments!
>
> Thanks
>
> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>
>> Hi Simon,
>>
>>
>> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
>> volume.
>>
>> see this PR: https://github.com/ceph/ceph/pull/29687
>>
>> Nice overview on the overall BlueFS/RocksDB design can be find here:
>>
>>
>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>
>> Which also includes some overview (as well as additional concerns) for
>> changes brought by the above-mentioned PR.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>> > Hi Michael,
>> >
>> > thanks for the explanation! So if I understand correctly, we waste 93
>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>> >
>> > And to improve the space for rocksdb, we need to plan for 300GB per
>> > rocksdb partition in order to benefit from this advantage
>> >
>> > Reducing the number of small files is something we always ask of our
>> > users, but reality is what it is ;-)
>> >
>> > I'll have to look into how I can get an informative view on these
>> > metrics... It's pretty overwhelming the amount of information coming
>> > out of the ceph cluster, even when you look only superficially...
>> >
>> > Cheers,
>> >
>> > /Simon
>> >
>> > On 20/08/2020 10:16, Michael Bisig wrote:
>> >> Hi Simon
>> >>
>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
>> >> DB space above such a limit will automatically end up on slow devices.
>> >> In your setup where you have 123GB per OSD that means you only use
>> >> 30GB of fast device. The DB which spills over this limit will be
>> >> offloaded to the HDD and accordingly, it slows down requests and
>> >> compactions.
>> >>
>> >> You can proof what your OSD currently consumes with:
>> >>ceph daemon osd.X perf dump
>> >>
>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>> >> `slow_used_bytes`. This changes regularly because of the ongoing
>> >> compactions but Prometheus mgr module exports these values such that
>> >> you can track it.
>> >>
>> >> Small files generally leads to bigger RocksDB, especially when you
>> >> use EC, but this depends on the actual amount and file sizes.
>> >>
>> >> I hope this helps.
>> >> Regards,
>> >> Michael
>> >>
>> >> On 20.08.20, 09:10, "Simon Oosthoek" 
>> wrote:
>> >>
>> >>  Hi
>> >>
>> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
>> >> spillovers,
>> >>  just 2 osd's and I disabled the warning for these osds.
>> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover false)
>> >>
>> >>  I'm wondering what causes this and how this can be prevented.
>> >>
>> >>  As I understand it the rocksdb for the OSD needs to store more
>> >> than fits
>> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
>> >> could be
>> >>  to increase the logical volume on the nvme (if there was space
>> >> on the
>> >>  nvme, which there isn't at the moment).
>> >>
>> >>  This is the current size of the cluster and how much is free:
>> >>
>> >>  [root@cephmon1 ~]# ceph df
>> >>  RAW STORAGE:
>> >>   CLASS SIZEAVAIL   USEDRAW USED
>> >> %RAW USED
>> >>   hdd   1.8 PiB 842 TiB 974 TiB  974
>> >> TiB 53.63
>> >>   TOTAL 1.8 PiB 842 TiB 974 TiB  974
>> >> TiB 53.63
>> >>
>> >>  POOLS:
>> >>   POOLID STORED  OBJECTS USED
>> >>  %USED MAX AVAIL
>> >>   cephfs_data  1 572 MiB 121.26M 2.4 GiB
>> >>  0   167 TiB
>> >>   cephfs_metadata  2  56 GiB 5.15M  57 GiB
>> >>  0   167 TiB
>> >>   cephfs_data_3copy8 201 GiB  51.68k 602 GiB
>> >>  0.09   222 TiB
>> >>   cephfs_data_ec83  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov
From a technical/developer's point of view I don't see any issues with 
tuning this option. But for now I wouldn't recommend enabling it in 
production, as it partially bypassed our regular development cycle. Being 
enabled in master by default for a while allows more developers to 
use/try the feature before release. This can be considered an 
additional implicit QA process. But as we just discovered, this hasn't 
happened.


Hence you can definitely try it, but this exposes your cluster(s) to some 
risk, as with any new (and incompletely tested) feature.



Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf until next nautilus 
release or should I wait for the next nautilus release for this 
change? I mean does qa run on this value for this config that we could 
trust and change it or should we wait until the next nautilus release 
that qa ran on this value?


On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov > wrote:


Hi Seena,

this parameter isn't intended to be adjusted in production
environments - it's supposed that default behavior covers all
regular customers' needs.

The issue though is that default setting is invalid. It should be
'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV in production
environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov mailto:ifedo...@suse.de>> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use 'wasted'
space at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can be
find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional
concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand correctly,
we waste 93
> GB per OSD on unused NVME space, because only 30GB is
actually used...?
>
> And to improve the space for rocksdb, we need to plan for
300GB per
> rocksdb partition in order to benefit from this advantage
>
> Reducing the number of small files is something we always
ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative view on
these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space on the
NVME
>> partition. The values are set to be 300MB, 3GB, 30GB and
300GB. Every
>> DB space above such a limit will automatically end up on
slow devices.
>> In your setup where you have 123GB per OSD that means you
only use
>> 30GB of fast device. The DB which spills over this limit
will be
>> offloaded to the HDD and accordingly, it slows down
requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`, `db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of the
ongoing
>> compactions but Prometheus mgr module exports these values
such that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB, especially
when you
>> use EC, but this depends on the actual amount and file sizes.
>>
>> I hope this helps.
>> Regards,
>> Michael
>>
>> On 20.08.20, 09:10, "Simon Oosthoek"
mailto:s.oosth...@science.ru.nl>>
wrote:
>>
>>  Hi
>>
>>  Recently our ceph cluster (nautilus) is experiencing
bluefs
>> spillovers,
>>  just 2 osd's and I disabled the warning for these osds.
>>  (ceph config set osd.125
bluestore_warn_on_bluefs_spillover false)
>>
>>  I'm wondering what causes this and how this can be
  

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
So you won't backport it to Nautilus until it has been the default in
master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:

> From technical/developer's point of view I don't see any issues with
> tuning this option. But since now I wouldn't  recommend to enable it in
> production as it partially bypassed our regular development cycle. Being
> enabled in master for a while by default allows more develpers to use/try
> the feature before release. This can be considered as an additional
> implicit QA process. But as we just discovered this hasn't happened.
>
> Hence you can definitely try it but this exposes your cluster(s) to some
> risk as for any new (and incompletely tested) feature
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>
> Greate, thanks.
>
> Is it safe to change it manually in ceph.conf until next nautilus release
> or should I wait for the next nautilus release for this change? I mean does
> qa run on this value for this config that we could trust and change it or
> should we wait until the next nautilus release that qa ran on this value?
>
> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>
>> Hi Seena,
>>
>> this parameter isn't intended to be adjusted in production environments -
>> it's supposed that default behavior covers all regular customers' needs.
>>
>> The issue though is that default setting is invalid. It should be
>> 'use_some_extra'. Gonna fix that shortly...
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>>
>>
>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>
>> Hi Igor.
>>
>> Could you please tell why this config is in LEVEL_DEV (
>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>> As it is documented in Ceph we can't use LEVEL_DEV in production
>> environments!
>>
>> Thanks
>>
>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>>
>>> Hi Simon,
>>>
>>>
>>> starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
>>> volume.
>>>
>>> see this PR: https://github.com/ceph/ceph/pull/29687
>>>
>>> Nice overview on the overall BlueFS/RocksDB design can be find here:
>>>
>>>
>>> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>>>
>>> Which also includes some overview (as well as additional concerns) for
>>> changes brought by the above-mentioned PR.
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
>>> > Hi Michael,
>>> >
>>> > thanks for the explanation! So if I understand correctly, we waste 93
>>> > GB per OSD on unused NVME space, because only 30GB is actually used...?
>>> >
>>> > And to improve the space for rocksdb, we need to plan for 300GB per
>>> > rocksdb partition in order to benefit from this advantage
>>> >
>>> > Reducing the number of small files is something we always ask of our
>>> > users, but reality is what it is ;-)
>>> >
>>> > I'll have to look into how I can get an informative view on these
>>> > metrics... It's pretty overwhelming the amount of information coming
>>> > out of the ceph cluster, even when you look only superficially...
>>> >
>>> > Cheers,
>>> >
>>> > /Simon
>>> >
>>> > On 20/08/2020 10:16, Michael Bisig wrote:
>>> >> Hi Simon
>>> >>
>>> >> As far as I know, RocksDB only uses "leveled" space on the NVME
>>> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB. Every
>>> >> DB space above such a limit will automatically end up on slow devices.
>>> >> In your setup where you have 123GB per OSD that means you only use
>>> >> 30GB of fast device. The DB which spills over this limit will be
>>> >> offloaded to the HDD and accordingly, it slows down requests and
>>> >> compactions.
>>> >>
>>> >> You can proof what your OSD currently consumes with:
>>> >>ceph daemon osd.X perf dump
>>> >>
>>> >> Informative values are `db_total_bytes`, `db_used_bytes` and
>>> >> `slow_used_bytes`. This changes regularly because of the ongoing
>>> >> compactions but Prometheus mgr module exports these values such that
>>> >> you can track it.
>>> >>
>>> >> Small files generally leads to bigger RocksDB, especially when you
>>> >> use EC, but this depends on the actual amount and file sizes.
>>> >>
>>> >> I hope this helps.
>>> >> Regards,
>>> >> Michael
>>> >>
>>> >> On 20.08.20, 09:10, "Simon Oosthoek" 
>>> wrote:
>>> >>
>>> >>  Hi
>>> >>
>>> >>  Recently our ceph cluster (nautilus) is experiencing bluefs
>>> >> spillovers,
>>> >>  just 2 osd's and I disabled the warning for these osds.
>>> >>  (ceph config set osd.125 bluestore_warn_on_bluefs_spillover
>>> false)
>>> >>
>>> >>  I'm wondering what causes this and how this can be prevented.
>>> >>
>>> >>  As I understand it the rocksdb for the OSD needs to store more
>>> >> than fits
>>> >>  on the NVME logical volume (123G for 12T OSD). A way to fix it
>>> >> could be
>>> >>  to increase the logical volume on the nvme (if there was space
>>> >> on the
>>> >>  nvme, which there

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:
So you won't backport it to nautilus until it gets default to master 
for a while?


On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov > wrote:


From technical/developer's point of view I don't see any issues
with tuning this option. But since now I wouldn't recommend to
enable it in production as it partially bypassed our regular
development cycle. Being enabled in master for a while by default
allows more develpers to use/try the feature before release. This
can be considered as an additional implicit QA process. But as we
just discovered this hasn't happened.

Hence you can definitely try it but this exposes your cluster(s)
to some risk as for any new (and incompletely tested) feature


Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Greate, thanks.

Is it safe to change it manually in ceph.conf until next nautilus
release or should I wait for the next nautilus release for this
change? I mean does qa run on this value for this config that we
could trust and change it or should we wait until the next
nautilus release that qa ran on this value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov mailto:ifedo...@suse.de>> wrote:

Hi Seena,

this parameter isn't intended to be adjusted in production
environments - it's supposed that default behavior covers all
regular customers' needs.

The issue though is that default setting is invalid. It
should be 'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell why this config is in LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As it is documented in Ceph we can't use LEVEL_DEV in
production environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov
mailto:ifedo...@suse.de>> wrote:

Hi Simon,


starting Nautlus v14.2.10 Bluestore is able to use
'wasted' space at DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

Nice overview on the overall BlueFS/RocksDB design can
be find here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

Which also includes some overview (as well as additional
concerns) for
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand
correctly, we waste 93
> GB per OSD on unused NVME space, because only 30GB is
actually used...?
>
> And to improve the space for rocksdb, we need to plan
for 300GB per
> rocksdb partition in order to benefit from this
advantage
>
> Reducing the number of small files is something we
always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an informative
view on these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as I know, RocksDB only uses "leveled" space
on the NVME
>> partition. The values are set to be 300MB, 3GB, 30GB
and 300GB. Every
>> DB space above such a limit will automatically end up
on slow devices.
>> In your setup where you have 123GB per OSD that means
you only use
>> 30GB of fast device. The DB which spills over this
limit will be
>> offloaded to the HDD and accordingly, it slows down
requests and
>> compactions.
>>
>> You can proof what your OSD currently consumes with:
>>    ceph daemon osd.X perf dump
>>
>> Informative values are `db_total_bytes`,
`db_used_bytes` and
>> `slow_used_bytes`. This changes regularly because of
the ongoing
>> compactions but Prometheus mgr module exports these
values such that
>> you can track it.
>>
>> Small files generally leads to bigger RocksDB,
especially when you
>> 

[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
So what do you suggest for a short-term solution? (I think you won't
backport it to Nautilus for at least about 6 months.)

Changing the DB size is too expensive, because I would have to buy new NVMe
devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing it for
months doesn't look very good!
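
(By manual compaction I mean something like, per OSD on its host:

ceph daemon osd.X compact

or offline, with the OSD stopped:

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-X compact

with osd.X / ceph-X as placeholders.)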

On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov  wrote:

> Correct.
> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>
> So you won't backport it to nautilus until it gets default to master for a
> while?
>
> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:
>
>> From technical/developer's point of view I don't see any issues with
>> tuning this option. But since now I wouldn't  recommend to enable it in
>> production as it partially bypassed our regular development cycle. Being
>> enabled in master for a while by default allows more develpers to use/try
>> the feature before release. This can be considered as an additional
>> implicit QA process. But as we just discovered this hasn't happened.
>>
>> Hence you can definitely try it but this exposes your cluster(s) to some
>> risk as for any new (and incompletely tested) feature
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>
>> Greate, thanks.
>>
>> Is it safe to change it manually in ceph.conf until next nautilus release
>> or should I wait for the next nautilus release for this change? I mean does
>> qa run on this value for this config that we could trust and change it or
>> should we wait until the next nautilus release that qa ran on this value?
>>
>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>>
>>> Hi Seena,
>>>
>>> this parameter isn't intended to be adjusted in production environments
>>> - it's supposed that default behavior covers all regular customers' needs.
>>>
>>> The issue though is that default setting is invalid. It should be
>>> 'use_some_extra'. Gonna fix that shortly...
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>>
>>>
>>> On 8/20/2020 1:44 PM, Seena Fallah wrote:
>>>
>>> Hi Igor.
>>>
>>> Could you please tell why this config is in LEVEL_DEV (
>>> https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
>>> As it is documented in Ceph we can't use LEVEL_DEV in production
>>> environments!
>>>
>>> Thanks
>>>
>>> On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:
>>>
 Hi Simon,


 starting Nautlus v14.2.10 Bluestore is able to use 'wasted' space at DB
 volume.

 see this PR: https://github.com/ceph/ceph/pull/29687

 Nice overview on the overall BlueFS/RocksDB design can be find here:


 https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

 Which also includes some overview (as well as additional concerns) for
 changes brought by the above-mentioned PR.


 Thanks,

 Igor


 On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
 > Hi Michael,
 >
 > thanks for the explanation! So if I understand correctly, we waste 93
 > GB per OSD on unused NVME space, because only 30GB is actually
 used...?
 >
 > And to improve the space for rocksdb, we need to plan for 300GB per
 > rocksdb partition in order to benefit from this advantage
 >
 > Reducing the number of small files is something we always ask of our
 > users, but reality is what it is ;-)
 >
 > I'll have to look into how I can get an informative view on these
 > metrics... It's pretty overwhelming the amount of information coming
 > out of the ceph cluster, even when you look only superficially...
 >
 > Cheers,
 >
 > /Simon
 >
 > On 20/08/2020 10:16, Michael Bisig wrote:
 >> Hi Simon
 >>
 >> As far as I know, RocksDB only uses "leveled" space on the NVME
 >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
 Every
 >> DB space above such a limit will automatically end up on slow
 devices.
 >> In your setup where you have 123GB per OSD that means you only use
 >> 30GB of fast device. The DB which spills over this limit will be
 >> offloaded to the HDD and accordingly, it slows down requests and
 >> compactions.
 >>
 >> You can proof what your OSD currently consumes with:
 >>ceph daemon osd.X perf dump
 >>
 >> Informative values are `db_total_bytes`, `db_used_bytes` and
 >> `slow_used_bytes`. This changes regularly because of the ongoing
 >> compactions but Prometheus mgr module exports these values such that
 >> you can track it.
 >>
 >> Small files generally leads to bigger RocksDB, especially when you
 >> use EC, but this depends on the actual amount and file sizes.
 >>
 >> I hope this helps.
 >> Regards,
 >> Michael
 >>
 >> On 20.08.20, 09:10, "Simon Oosthoek" 
 wrote:
 >>
 >>  Hi
 >>
 >>  Recently

[ceph-users] Re: Ceph on windows?

2020-08-20 Thread Lenz Grimmer
On 8/20/20 2:11 PM, Jason Dillaman wrote:

> It's an effort to expose RBD to Windows via a native driver [1]. That
> driver is basically a thin NBD shim to connect with the rbd-nbd daemon
> running as a Windows service.
> 
> [1] https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+label%3Awin32+

FWIW, the Windows Network Block Device driver can be found here:

  https://github.com/cloudbase/wnbd

As far as I understand it, that part is not Ceph-specific and can
basically attach to any block device shared via the NBD protocol.

Lenz

-- 
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] luks / disk encryption best practice

2020-08-20 Thread Marc Roos


I still need to move from ceph-disk to ceph-volume. When doing this, I 
wanted to also start using disk encryption. I am not really interested 
in the encryption offered by the HDD vendors.

Is there a best practice or advice on which encryption ciphers/hash to use? 
Stick to the default of CentOS 7, or maybe choose what is the default in a 
newer CentOS, or something else? Different settings for SSD / HDD?
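
For the ceph-volume side I assume the dm-crypt path would be something
like (device paths are just placeholders):

ceph-volume lvm create --dmcrypt --data /dev/sdb --block.db /dev/nvme0n1p1

and I guess running

cryptsetup benchmark

on the OSD nodes is one way to compare cipher/hash throughput locally
before deciding.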



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw beast access logs

2020-08-20 Thread Casey Bodley
Sure. You can track the nautilus backport's progress in
https://tracker.ceph.com/issues/47042.


On Wed, Aug 19, 2020 at 12:25 PM Wesley Dillingham
 wrote:
>
> We would very much appreciate having this backported to nautilus.
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn
>
>
> On Wed, Aug 19, 2020 at 9:02 AM Casey Bodley  wrote:
>>
>> On Tue, Aug 18, 2020 at 1:33 PM Graham Allan  wrote:
>> >
>> > Are there any plans to add access logs to the beast frontend, in the
>> > same way we can get with civetweb? Increasing the "debug rgw" setting
>> > really doesn't provide the same thing.
>> >
>> > Graham
>> > --
>> > Graham Allan - g...@umn.edu
>> > Associate Director of Operations - Minnesota Supercomputing Institute
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>>
>> Yes, this was implemented by Mark Kogan in
>> https://github.com/ceph/ceph/pull/33083. It looks like it was
>> backported to Octopus for 15.2.5 in
>> https://tracker.ceph.com/issues/45951. Is there interest in a nautilus
>> backport too?
>>
>> Casey
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - Stream Multiple Video Sources

2020-08-20 Thread Mike Perez

And we're live! Please join us and bring questions!

https://bluejeans.com/908675367

On 8/17/20 11:03 AM, Mike Perez wrote:


Hi all,

We have a bonus Ceph Tech Talk for August. Join us August 20th at 
17:00 UTC to hear Neeha Kompala and Jason Weng present on Edge 
Application - Streaming Multiple Video Sources.


Don't forget on August 27th at 17:00 UTC, Pritha Srivastava will also 
be presenting on this month's Ceph Tech Talk: Secure Token Service in 
the Rados Gateway.


If you're interested in giving a Ceph Tech Talk for September 24th or 
October 22nd, please let me know!


https://ceph.io/ceph-tech-talks/

--

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 
  



--

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 
  


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - Stream Multiple Video Sources

2020-08-20 Thread Marc Roos


Can't join as guest without enabling mic and/or camera???

-Original Message-
From: Mike Perez [mailto:mipe...@redhat.com] 
Sent: donderdag 20 augustus 2020 19:03
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - 
Stream Multiple Video Sources

And we're live! Please join us and bring questions!

https://bluejeans.com/908675367

On 8/17/20 11:03 AM, Mike Perez wrote:
>
> Hi all,
>
> We have a bonus Ceph Tech Talk for August. Join us August 20th at 
> 17:00 UTC to hear Neeha Kompala and Jason Weng present on Edge 
> Application - Streaming Multiple Video Sources.
>
> Don't forget on August 27th at 17:00 UTC, Pritha Srivastava will also 
> be presenting on this month's Ceph Tech Talk: Secure Token Service in 
> the Rados Gateway.
>
> If you're interested in giving a Ceph Tech Talk for September 24th or 
> October 22nd, please let me know!
>
> https://ceph.io/ceph-tech-talks/
>
> --
>
> Mike Perez
>
> He/Him
>
> Ceph Community Manager
>
> Red Hat Los Angeles 
>
> thin...@redhat.com 
> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>
> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>
> @Thingee 
>   
> 
>
-- 

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - Stream Multiple Video Sources

2020-08-20 Thread Bobby
Hi... Will it be available on YouTube?

On Thursday, August 20, 2020, Marc Roos  wrote:
>
> Can't join as guest without enabling mic and/or camera???
>
> -Original Message-
> From: Mike Perez [mailto:mipe...@redhat.com]
> Sent: donderdag 20 augustus 2020 19:03
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Bonus Ceph Tech Talk: Edge Application -
> Stream Multiple Video Sources
>
> And we're live! Please join us and bring questions!
>
> https://bluejeans.com/908675367
>
> On 8/17/20 11:03 AM, Mike Perez wrote:
>>
>> Hi all,
>>
>> We have a bonus Ceph Tech Talk for August. Join us August 20th at
>> 17:00 UTC to hear Neeha Kompala and Jason Weng present on Edge
>> Application - Streaming Multiple Video Sources.
>>
>> Don't forget on August 27th at 17:00 UTC, Pritha Srivastava will also
>> be presenting on this month's Ceph Tech Talk: Secure Token Service in
>> the Rados Gateway.
>>
>> If you're interested in giving a Ceph Tech Talk for September 24th or
>> October 22nd, please let me know!
>>
>> https://ceph.io/ceph-tech-talks/
>>
>> --
>>
>> Mike Perez
>>
>> He/Him
>>
>> Ceph Community Manager
>>
>> Red Hat Los Angeles 
>>
>> thin...@redhat.com 
>> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>>
>> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>>
>> @Thingee 
>> 
>> 
>>
> --
>
> Mike Perez
>
> He/Him
>
> Ceph Community Manager
>
> Red Hat Los Angeles 
>
> thin...@redhat.com 
> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>
> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>
> @Thingee 
> 
> 
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Mark Nelson

Hi Frank,


 I downloaded the files but haven't had time to get the environment set up yet 
either. It might be better to just generate the txt files if you can.



Thanks!

Mark


On 8/20/20 2:33 AM, Frank Schilder wrote:

Hi Dan and Mark,

could you please let me know if you can read the files with the version info I 
provided in my previous e-mail? I'm in the process of collecting data with more 
FS activity and would like to send it in a format that is useful for 
investigation.

Right now I'm observing a daily growth of swap of ca. 100-200MB on servers with 
16 OSDs each, 1 SSD and 15 HDDs. The OS+daemons operate fine; the OS manages to 
keep enough RAM available. Also, the mempool dump still shows onode and data 
cached at a seemingly reasonable level. Users report more stable performance 
of the FS after I increased the cache min sizes on all OSDs.
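
For reference, this kind of information can be pulled with commands like the 
following (osd.0 is just a placeholder; the mempool dump comes from the OSD 
admin socket, swap usage from the host):

# per-OSD memory pools, including the bluestore onode/data caches
ceph daemon osd.0 dump_mempools
# overall RAM/swap usage on the OSD host
free -h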

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 17 August 2020 09:37
To: Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Hi Dan,

I use the container 
docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I can 
see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, its a 
Centos 7 build. The version is:

# ceph -v
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

On CentOS, the profiler packages are named differently, without the "google-" 
prefix. The version I have installed is

# pprof --version
pprof (part of gperftools 2.0)

Copyright 1998-2007 Google Inc.

This is BSD licensed software; see the source for copying conditions
and license information.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

It is possible to install pprof inside this container and analyse the 
*.heap-files I provided.

If this doesn't work for you and you want me to generate the text output for 
the heap-files, I can do that. Please let me know if I should do all files and with 
what options (e.g. against a base, etc.).

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 14 August 2020 10:38:57
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I'm having trouble getting the exact version of ceph you used to
create this heap profile.
Could you run the google-pprof --text steps at [1] and share the output?

Thanks, Dan

[1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/


On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:

Hi Mark,

here is a first collection of heap profiling data (valid 30 days):

https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l

This was collected with the following config settings:

   osd   dev     osd_memory_cache_min   805306368
   osd   basic   osd_memory_target      2147483648
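
For reference, settings like these could be applied cluster-wide for all OSDs 
with something like the following (values as above: 768 MiB cache minimum, 
2 GiB memory target):

# ceph config set osd osd_memory_cache_min 805306368
# ceph config set osd osd_memory_target 2147483648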

Setting the cache_min value seems to help keep cache space available. 
Unfortunately, the above collection covers only 12 days. I needed to restart 
the OSD and will need to restart it soon again. I hope I can then run a longer 
sample. The profiling does cause slow ops, though.

Maybe you can see something already? It seems to have collected some leaked 
memory. Unfortunately, it was a period of extremely low load; basically, on 
the day of recording the utilization dropped to almost zero.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 21 July 2020 12:57:32
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Quick question: Is there a way to change the frequency of heap dumps? On this 
page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a function 
HeapProfilerSetAllocationInterval() is mentioned, but no other way of 
configuring this. Is there a config parameter or a ceph daemon call to adjust 
this?

If not, can I change the dump path?

It's likely to overrun my log partition quickly if I cannot adjust either of the 
two.
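
One possible knob, assuming the gperftools environment variables documented on 
that page are honored when the profiler runs inside the container (untested 
here; they would have to be set in the daemon's environment before start, and 
the path below is a placeholder):

# dump every 4 GiB of cumulative allocation instead of the 1 GiB default
export HEAP_PROFILE_ALLOCATION_INTERVAL=4294967296
# alternative dump path/prefix (also enables profiling from process start)
export HEAPPROFILE=/some/other/partition/osd.0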

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 20 July 2020 15:19:05
To: Mark Nelson; Dan van der Ster
Cc: ceph-users
Subject: [ceph-users] Re: OSD memory leak?

Dear Mark,

thank you very much for the very helpful answers. I will raise 
osd_memory_cache_min, leave everything else alone and watch what happens. I 
will report back here.

Thanks also for raising this as an issue.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 July 2020 15:08:11
To: Frank Schilder; Dan van der Ster
Cc: ceph-users

[ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - Stream Multiple Video Sources

2020-08-20 Thread Mike Perez

Here's the video in case you missed it:

https://www.youtube.com/watch?v=Q8bU-m07Czo

On 8/20/20 10:03 AM, Mike Perez wrote:


And we're live! Please join us and bring questions!

https://bluejeans.com/908675367

On 8/17/20 11:03 AM, Mike Perez wrote:


Hi all,

We have a bonus Ceph Tech Talk for August. Join us August 20th at 
17:00 UTC to hear Neeha Kompala and Jason Weng present on Edge 
Application - Streaming Multiple Video Sources.


Don't forget on August 27th at 17:00 UTC, Pritha Srivastava will also 
be presenting on this month's Ceph Tech Talk: Secure Token Service in 
the Rados Gateway.


If you're interested in giving a Ceph Tech Talk for September 24th or 
October 22nd, please let me know!


https://ceph.io/ceph-tech-talks/

--

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 
  



--

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 
  



--

Mike Perez

He/Him

Ceph Community Manager

Red Hat Los Angeles 

thin...@redhat.com 
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee

494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA

@Thingee 
  


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Igor Fedotov

Honestly, I don't have a perfect solution for now.

If this is urgent, you are probably better off proceeding with enabling the new 
DB space management feature.


But please do that gradually: modify 1-2 OSDs at the first stage and 
test them for some period (maybe a week or two).
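
For reference, a minimal sketch of such a staged rollout, assuming the option 
introduced by https://github.com/ceph/ceph/pull/29687 
(bluestore_volume_selection_policy, not named explicitly above) and two test 
OSDs as placeholders; the option is read at OSD start-up, so those OSDs need a 
restart afterwards:

# ceph config set osd.0 bluestore_volume_selection_policy use_some_extra
# ceph config set osd.1 bluestore_volume_selection_policy use_some_extra
# systemctl restart ceph-osd@0 ceph-osd@1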



Thanks,

Igor


On 8/20/2020 5:36 PM, Seena Fallah wrote:
So what do you suggest for a short-term solution? (I think you won't 
backport it to nautilus for at least about 6 months.)


Changing the DB size is too expensive because I would have to buy new NVMe 
devices with double the size and also redeploy all my OSDs.
Manual compaction will still have an impact on performance, and doing 
it for a month doesn't look very good!
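
(For context, the manual compaction referred to here is the per-OSD online 
compaction, roughly:

# ceph tell osd.0 compact

run per OSD, ideally off-peak, since it competes with client I/O; osd.0 is a 
placeholder.)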


On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov > wrote:


Correct.

On 8/20/2020 5:15 PM, Seena Fallah wrote:

So you won't backport it to nautilus until it has been the default in
master for a while?

On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov mailto:ifedo...@suse.de>> wrote:

From a technical/developer's point of view I don't see any
issues with tuning this option. But for now I wouldn't
recommend enabling it in production, as it partially bypassed
our regular development cycle. Being enabled in master for a
while by default allows more developers to use/try the feature
before release. This can be considered an additional
implicit QA process. But as we just discovered, this hasn't
happened.

Hence you can definitely try it, but this exposes your
cluster(s) to some risk, as for any new (and incompletely
tested) feature.


Thanks,

Igor


On 8/20/2020 4:06 PM, Seena Fallah wrote:

Great, thanks.

Is it safe to change it manually in ceph.conf before the next
nautilus release, or should I wait for the next nautilus
release for this change? I mean, does QA run with this value
for this config, so that we can trust it and change it, or should
we wait for the next nautilus release where QA has run with this
value?

On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov
mailto:ifedo...@suse.de>> wrote:

Hi Seena,

this parameter isn't intended to be adjusted in
production environments - it's assumed that the default
behavior covers all regular customers' needs.

The issue, though, is that the default setting is invalid. It
should be 'use_some_extra'. Gonna fix that shortly...


Thanks,

Igor




On 8/20/2020 1:44 PM, Seena Fallah wrote:

Hi Igor.

Could you please tell me why this config is at LEVEL_DEV

(https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
As documented in Ceph, we can't use LEVEL_DEV options in
production environments!

Thanks

On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov
mailto:ifedo...@suse.de>> wrote:

Hi Simon,


starting with Nautilus v14.2.10, Bluestore is able to use
'wasted' space at the DB
volume.

see this PR: https://github.com/ceph/ceph/pull/29687

A nice overview of the overall BlueFS/RocksDB design
can be found here:


https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf

It also includes an overview of (as well as
additional concerns about) the
changes brought by the above-mentioned PR.


Thanks,

Igor


On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> Hi Michael,
>
> thanks for the explanation! So if I understand
correctly, we waste 93
> GB per OSD on unused NVME space, because only
30GB is actually used...?
>
> And to improve the space for rocksdb, we need to
plan for 300GB per
> rocksdb partition in order to benefit from this
advantage
>
> Reducing the number of small files is something
we always ask of our
> users, but reality is what it is ;-)
>
> I'll have to look into how I can get an
informative view on these
> metrics... It's pretty overwhelming the amount of
information coming
> out of the ceph cluster, even when you look only
superficially...
>
> Cheers,
>
> /Simon
>
> On 20/08/2020 10:16, Michael Bisig wrote:
>> Hi Simon
>>
>> As far as

[ceph-users] Re: Bonus Ceph Tech Talk: Edge Application - Stream Multiple Video Sources

2020-08-20 Thread Bobby
thanks!

On Thursday, August 20, 2020, Mike Perez  wrote:
> Here's the video in case you missed it:
>
> https://www.youtube.com/watch?v=Q8bU-m07Czo
>
> On 8/20/20 10:03 AM, Mike Perez wrote:
>>
>> And we're live! Please join us and bring questions!
>>
>> https://bluejeans.com/908675367
>>
>> On 8/17/20 11:03 AM, Mike Perez wrote:
>>>
>>> Hi all,
>>>
>>> We have a bonus Ceph Tech Talk for August. Join us August 20th at 17:00
UTC to hear Neeha Kompala and Jason Weng present on Edge Application -
Streaming Multiple Video Sources.
>>>
>>> Don't forget on August 27th at 17:00 UTC, Pritha Srivastava will also
be presenting on this month's Ceph Tech Talk: Secure Token Service in the
Rados Gateway.
>>>
>>> If you're interested in giving a Ceph Tech Talk for September 24th or
October 22nd, please let me know!
>>>
>>> https://ceph.io/ceph-tech-talks/
>>>
>>> --
>>>
>>> Mike Perez
>>>
>>> He/Him
>>>
>>> Ceph Community Manager
>>>
>>> Red Hat Los Angeles 
>>>
>>> thin...@redhat.com 
>>> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>>>
>>> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>>>
>>> @Thingee 
>>> 
>>> 
>>>
>> --
>>
>> Mike Perez
>>
>> He/Him
>>
>> Ceph Community Manager
>>
>> Red Hat Los Angeles 
>>
>> thin...@redhat.com 
>> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>>
>> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>>
>> @Thingee 
>> 
>> 
>>
> --
>
> Mike Perez
>
> He/Him
>
> Ceph Community Manager
>
> Red Hat Los Angeles 
>
> thin...@redhat.com 
> M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
>
> 494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
>
> @Thingee 
> 
> 
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected, why, what?

2020-08-20 Thread Seena Fallah
OK, thanks. Also, as you mentioned in the doc you shared from cloudferro,
is it not a good idea to change `write_buffer_size` for the bluestore rocksdb to fit
our db?
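
(For context: write_buffer_size is not a standalone Ceph option, it is one 
field inside the single bluestore_rocksdb_options string, so changing it means 
restating the whole string and restarting the OSD. A purely illustrative 
sketch with a made-up value, not a recommendation:

# ceph config set osd bluestore_rocksdb_options "compression=kNoCompression,max_write_buffer_number=4,write_buffer_size=536870912"
)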

On Fri, Aug 21, 2020 at 1:46 AM Igor Fedotov  wrote:

> Honestly I don't have any perfect solution for now.
>
> If this is urgent you probably better to proceed with enabling the new DB
> space management feature.
>
> But please do that eventually, modify 1-2 OSDs at the first stage and test
> them for some period (may be a week or two).
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 5:36 PM, Seena Fallah wrote:
>
> So what do you suggest for a short term solution? (I think you won't
> backport it to nautilus at least about 6 month)
>
> Changing db size is too expensive because I should buy new NVME devices
> with double size and also redeploy all my OSDs.
> Manual compaction will still have an impact on performance and doing it
> for a month doesn't look very good!
>
> On Thu, Aug 20, 2020 at 6:52 PM Igor Fedotov  wrote:
>
>> Correct.
>> On 8/20/2020 5:15 PM, Seena Fallah wrote:
>>
>> So you won't backport it to nautilus until it gets default to master for
>> a while?
>>
>> On Thu, Aug 20, 2020 at 6:00 PM Igor Fedotov  wrote:
>>
>>> From technical/developer's point of view I don't see any issues with
>>> tuning this option. But since now I wouldn't  recommend to enable it in
>>> production as it partially bypassed our regular development cycle. Being
>>> enabled in master for a while by default allows more developers to use/try
>>> the feature before release. This can be considered as an additional
>>> implicit QA process. But as we just discovered this hasn't happened.
>>>
>>> Hence you can definitely try it but this exposes your cluster(s) to some
>>> risk as for any new (and incompletely tested) feature
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 8/20/2020 4:06 PM, Seena Fallah wrote:
>>>
>>> Great, thanks.
>>>
>>> Is it safe to change it manually in ceph.conf until next nautilus
>>> release or should I wait for the next nautilus release for this change? I
>>> mean does qa run on this value for this config that we could trust and
>>> change it or should we wait until the next nautilus release that qa ran on
>>> this value?
>>>
>>> On Thu, Aug 20, 2020 at 5:25 PM Igor Fedotov  wrote:
>>>
 Hi Seena,

 this parameter isn't intended to be adjusted in production environments
 - it's supposed that default behavior covers all regular customers' needs.

 The issue though is that default setting is invalid. It should be
 'use_some_extra'. Gonna fix that shortly...


 Thanks,

 Igor




 On 8/20/2020 1:44 PM, Seena Fallah wrote:

 Hi Igor.

 Could you please tell why this config is in LEVEL_DEV (
 https://github.com/ceph/ceph/pull/29687/files#diff-3d7a065928b2852c228ffe669d7633bbR4587)?
 As it is documented in Ceph we can't use LEVEL_DEV in production
 environments!

 Thanks

 On Thu, Aug 20, 2020 at 1:58 PM Igor Fedotov  wrote:

> Hi Simon,
>
>
> starting with Nautilus v14.2.10, Bluestore is able to use 'wasted' space at
> DB
> volume.
>
> see this PR: https://github.com/ceph/ceph/pull/29687
>
> A nice overview of the overall BlueFS/RocksDB design can be found here:
>
>
> https://cf2.cloudferro.com:8080/swift/v1/AUTH_5e376cddf8a94f9294259b5f48d7b2cd/ceph/rocksdb_in_ceph.pdf
>
> Which also includes some overview (as well as additional concerns) for
> changes brought by the above-mentioned PR.
>
>
> Thanks,
>
> Igor
>
>
> On 8/20/2020 11:39 AM, Simon Oosthoek wrote:
> > Hi Michael,
> >
> > thanks for the explanation! So if I understand correctly, we waste
> 93
> > GB per OSD on unused NVME space, because only 30GB is actually
> used...?
> >
> > And to improve the space for rocksdb, we need to plan for 300GB per
> > rocksdb partition in order to benefit from this advantage
> >
> > Reducing the number of small files is something we always ask of our
> > users, but reality is what it is ;-)
> >
> > I'll have to look into how I can get an informative view on these
> > metrics... It's pretty overwhelming the amount of information coming
> > out of the ceph cluster, even when you look only superficially...
> >
> > Cheers,
> >
> > /Simon
> >
> > On 20/08/2020 10:16, Michael Bisig wrote:
> >> Hi Simon
> >>
> >> As far as I know, RocksDB only uses "leveled" space on the NVME
> >> partition. The values are set to be 300MB, 3GB, 30GB and 300GB.
> Every
> >> DB space above such a limit will automatically end up on slow
> devices.
> >> In your setup where you have 123GB per OSD that means you only use
> >> 30GB of fast device. The DB which spills over this limit will be
> >> offloaded to the HDD and accordingly, it slows down requests and
>
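
For reference, how much of an OSD's DB currently sits on the fast device 
versus the slow one can be read from the BlueFS perf counters (osd.0 is a 
placeholder):

# ceph daemon osd.0 perf dump bluefs | egrep 'db_total_bytes|db_used_bytes|slow_total_bytes|slow_used_bytes'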

[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Dan,

no worries. I checked and osd_map_dedup is set to true, the default value.
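
For reference, the value can be checked on a running OSD via the admin socket 
(osd.0 as a placeholder):

# ceph daemon osd.0 config get osd_map_dedup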

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 20 August 2020 09:41
To: Frank Schilder
Cc: Mark Nelson; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,

I didn't get time yet. On our side, I was planning to see if the issue
persists after upgrading to v14.2.11 -- it includes some updates to
how the osdmap is referenced across OSD.cc.

BTW, do you happen to have osd_map_dedup set to false? We do, and that
surely increases the osdmap memory usage somewhat.

-- Dan

On Thu, Aug 20, 2020 at 9:33 AM Frank Schilder  wrote:
>
> Hi Dan and Mark,
>
> could you please let me know if you can read the files with the version info 
> I provided in my previous e-mail? I'm in the process of collecting data with 
> more FS activity and would like to send it in a format that is useful for 
> investigation.
>
> Right now I'm observing a daily growth of swap of ca. 100-200MB on servers 
> with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS 
> manages to keep enough RAM available. Also the mempool dump still shows onode 
> and data cached at a seemingly reasonable level. Users report a more stable 
> performance of the FS after I increased the cache min sizes on all OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 17 August 2020 09:37
> To: Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Hi Dan,
>
> I use the container 
> docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I 
> can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, 
> its a Centos 7 build. The version is:
>
> # ceph -v
> ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>
> On Centos, the profiler packages are called different, without the "google-" 
> prefix. The version I have installed is
>
> # pprof --version
> pprof (part of gperftools 2.0)
>
> Copyright 1998-2007 Google Inc.
>
> This is BSD licensed software; see the source for copying conditions
> and license information.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> It is possible to install pprof inside this container and analyse the 
> *.heap-files I provided.
>
> If this doesn't work for you and you want me to generate the text output for 
> heap-files, I can do that. Please let me know if I should do all files and 
> with what option (eg. against a base etc.).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 August 2020 10:38:57
> To: Frank Schilder
> Cc: Mark Nelson; ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> Hi Frank,
>
> I'm having trouble getting the exact version of ceph you used to
> create this heap profile.
> Could you run the google-pprof --text steps at [1] and share the output?
>
> Thanks, Dan
>
> [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
>
> On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
> >
> > Hi Mark,
> >
> > here is a first collection of heap profiling data (valid 30 days):
> >
> > https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
> >
> > This was collected with the following config settings:
> >
> >   osd   dev     osd_memory_cache_min   805306368
> >   osd   basic   osd_memory_target      2147483648
> >
> > Setting the cache_min value seems to help keeping cache space available. 
> > Unfortunately, the above collection is for 12 days only. I needed to 
> > restart the OSD and will need to restart it soon again. I hope I can then 
> > run a longer sample. The profiling does cause slow ops though.
> >
> > Maybe you can see something already? It seems to have collected some leaked 
> > memory. Unfortunately, it was a period of extremely low load. Basically, 
> > with the day of recording the utilization dropped to almost zero.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Frank Schilder 
> > Sent: 21 July 2020 12:57:32
> > To: Mark Nelson; Dan van der Ster
> > Cc: ceph-users
> > Subject: [ceph-users] Re: OSD memory leak?
> >
> > Quick question: Is there a way to change the frequency of heap dumps? On 
> > this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a 
> > function HeapProfilerSetAllocationInterval() is mentioned, but no other way 
> > of configuring this. Is there a config parameter or a ceph daemon call to 
> > adjust this?
> >
> > If

[ceph-users] Re: OSD memory leak?

2020-08-20 Thread Frank Schilder
Hi Mark and Dan,

I can generate text files. Can you let me know what you would like to see? 
Without further instructions, I can do a simple conversion and a conversion 
against the first dump as a base. I will upload an archive with converted files 
added tomorrow afternoon.
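
For reference, the two conversions would look roughly like this, assuming the 
default osd.<id>.profile.<seq>.heap naming and that the ceph-osd binary matches 
the one that produced the dumps (ids and sequence numbers are placeholders):

# simple text conversion of a single dump
pprof --text /usr/bin/ceph-osd osd.16.profile.0001.heap > osd.16.0001.txt
# the same dump converted against the first dump as a base
pprof --text --base=osd.16.profile.0001.heap /usr/bin/ceph-osd osd.16.profile.0042.heap > osd.16.0042-vs-0001.txt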

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Mark Nelson 
Sent: 20 August 2020 21:52
To: Frank Schilder; Dan van der Ster; ceph-users
Subject: Re: [ceph-users] Re: OSD memory leak?

Hi Frank,


  I downloaded but haven't had time to get the environment setup yet
either.  It might be better to just generate the txt files if you can.


Thanks!

Mark


On 8/20/20 2:33 AM, Frank Schilder wrote:
> Hi Dan and Mark,
>
> could you please let me know if you can read the files with the version info 
> I provided in my previous e-mail? I'm in the process of collecting data with 
> more FS activity and would like to send it in a format that is useful for 
> investigation.
>
> Right now I'm observing a daily growth of swap of ca. 100-200MB on servers 
> with 16 OSDs each, 1SSD and 15HDDs. The OS+daemons operate fine, the OS 
> manages to keep enough RAM available. Also the mempool dump still shows onode 
> and data cached at a seemingly reasonable level. Users report a more stable 
> performance of the FS after I increased the cache min sizes on all OSDs.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: 17 August 2020 09:37
> To: Dan van der Ster
> Cc: ceph-users
> Subject: [ceph-users] Re: OSD memory leak?
>
> Hi Dan,
>
> I use the container 
> docker.io/ceph/daemon:v3.2.10-stable-3.2-mimic-centos-7-x86_64. As far as I 
> can see, it uses the packages from http://download.ceph.com/rpm-mimic/el7, 
> its a Centos 7 build. The version is:
>
> # ceph -v
> ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
>
> On Centos, the profiler packages are called different, without the "google-" 
> prefix. The version I have installed is
>
> # pprof --version
> pprof (part of gperftools 2.0)
>
> Copyright 1998-2007 Google Inc.
>
> This is BSD licensed software; see the source for copying conditions
> and license information.
> There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
> PARTICULAR PURPOSE.
>
> It is possible to install pprof inside this container and analyse the 
> *.heap-files I provided.
>
> If this doesn't work for you and you want me to generate the text output for 
> heap-files, I can do that. Please let me know if I should do all files and 
> with what option (eg. against a base etc.).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 14 August 2020 10:38:57
> To: Frank Schilder
> Cc: Mark Nelson; ceph-users
> Subject: Re: [ceph-users] Re: OSD memory leak?
>
> Hi Frank,
>
> I'm having trouble getting the exact version of ceph you used to
> create this heap profile.
> Could you run the google-pprof --text steps at [1] and share the output?
>
> Thanks, Dan
>
> [1] https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/
>
>
> On Tue, Aug 11, 2020 at 2:37 PM Frank Schilder  wrote:
>> Hi Mark,
>>
>> here is a first collection of heap profiling data (valid 30 days):
>>
>> https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l
>>
>> This was collected with the following config settings:
>>
>>    osd   dev     osd_memory_cache_min   805306368
>>    osd   basic   osd_memory_target      2147483648
>>
>> Setting the cache_min value seems to help keeping cache space available. 
>> Unfortunately, the above collection is for 12 days only. I needed to restart 
>> the OSD and will need to restart it soon again. I hope I can then run a 
>> longer sample. The profiling does cause slow ops though.
>>
>> Maybe you can see something already? It seems to have collected some leaked 
>> memory. Unfortunately, it was a period of extremely low load. Basically, 
>> with the day of recording the utilization dropped to almost zero.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Frank Schilder 
>> Sent: 21 July 2020 12:57:32
>> To: Mark Nelson; Dan van der Ster
>> Cc: ceph-users
>> Subject: [ceph-users] Re: OSD memory leak?
>>
>> Quick question: Is there a way to change the frequency of heap dumps? On 
>> this page http://goog-perftools.sourceforge.net/doc/heap_profiler.html a 
>> function HeapProfilerSetAllocationInterval() is mentioned, but no other way 
>> of configuring this. Is there a config parameter or a ceph daemon call to 
>> adjust this?
>>
>> If not, can I change the dump path?