[ceph-users] Required caps for cephfs

2019-04-30 Thread Stolte, Felix
Hi folks,

We are using nfs-ganesha to expose CephFS (Luminous) to NFS clients. I want to 
make use of snapshots, but limit the creation of snapshots to Ceph admins. I 
read a while ago about cephx capabilities which allow/deny the creation of 
snapshots, but I can’t find the info anymore. Can someone help me?

Best regards
Felix
IT-Services
Telefon 02461 61-9243
E-Mail: f.sto...@fz-juelich.de
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-
-
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM management setup

2019-04-30 Thread Stefan Kooman
Hi,

> Any recommendations? 
> 
> .. found a lot of names allready .. 
> OpenStack 
> CloudStack 
> Proxmox 
> .. 
> 
> But recommendations are truely welcome. 
I would recommend OpenNebula. Adopters of the KISS methodology.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Required caps for cephfs

2019-04-30 Thread Darius Kasparavičius
Hi,

Only available in mimic and up.

To create or delete snapshots, clients require the ‘s’ flag in
addition to ‘rw’. Note that when the capability string also contains the
‘p’ flag, the ‘s’ flag must appear after it (all flags except ‘rw’
must be specified in alphabetical order).

http://docs.ceph.com/docs/mimic/cephfs/client-auth/
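
For illustration only (a sketch; the filesystem name "cephfs" and the client
names are placeholders): a gateway key without snapshot rights versus an
admin key that has them could be created roughly like this:

# ceph fs authorize cephfs client.nfsgw / rw
# ceph fs authorize cephfs client.snapadmin / rws

The second command ends up with 'allow rws' in the MDS caps, which is what
permits creating and deleting snapshots.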

On Tue, Apr 30, 2019 at 10:36 AM Stolte, Felix  wrote:
>
> Hi folks,
>
> we are using nfs-ganesha to expose cephfs (Luminous) to nfs clients. I want 
> to make use of snapshots, but limit the creation of snapshots to ceph admins. 
> I read about cephx capabilities which allow/deny the creation of snapshots a 
> while ago, but I can’t find the info anymore. Can someone help me?
>
> Best regards
> Felix
> IT-Services
> Telefon 02461 61-9243
> E-Mail: f.sto...@fz-juelich.de
> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
> Prof. Dr. Sebastian M. Schmidt
> -
> -
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 04:13, Adrien Gillard  wrote:
> I would add that the use of cache tiering, though still possible, is not 
> recommended

That claim lacks references. The Ceph docs I linked to didn't say so.

> comes with its own challenges.

It's challenging for some to not over-quote when replying, but I don't
think it holds true for everyone.

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-04-30 Thread Igor Podlesny
On Mon, 15 Apr 2019 at 19:40, Wido den Hollander  wrote:
>
> Hi,
>
> With the release of 12.2.12 the bitmap allocator for BlueStore is now
> available under Mimic and Luminous.
>
> [osd]
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap

Hi!

Have you tried this? :)
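
(If you do try it: after setting those two options under [osd] and restarting
the OSDs, the active allocator can be checked with something like

# ceph daemon osd.0 config get bluestore_allocator

-- osd.0 being whichever OSD is local to the node you run this on.)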

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Denny Fuchs

hi,

I want to add a memory problem as well.

What we have:

* Ceph version 12.2.11
* 5 x 512GB Samsung 850 Evo
* 5 x 1TB WD Red (5.4k)
* OS Debian Stretch ( Proxmox VE 5.x )
* 2 x CPU CPU E5-2620 v4
* Memory 64GB DDR4

I've added to ceph.conf

...

[osd]
  osd memory target = 3221225472
...

Which is active:


===
# ceph daemon osd.31 config show | grep memory_target
"osd_memory_target": "3221225472",
===

The problem is that the OSD processes are eating up my memory:

==
# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         52G        7.8G        693M        2.2G         50G
Swap:          8.0G        5.8M        8.0G
==

As example osd.31, which is a HDD (WD Red)


==
# ceph daemon osd.31 dump_mempools

...

"bluestore_alloc": {
"items": 40379056,
"bytes": 40379056
},
"bluestore_cache_data": {
"items": 1613,
"bytes": 130048000
},
"bluestore_cache_onode": {
"items": 64888,
"bytes": 43604736
},
"bluestore_cache_other": {
"items": 7043426,
"bytes": 209450352
},
...
"total": {
"items": 48360478,
"bytes": 633918931
}
=


=
# ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -30
 6.5  1.8 5040944 6594 /usr/bin/ceph-osd -f --cluster ceph --id 31 --setuser ceph --setgroup ceph
 6.4  2.4 5053492 6819 /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
 6.4  2.3 5044144 5454 /usr/bin/ceph-osd -f --cluster ceph --id 4 --setuser ceph --setgroup ceph
 6.2  1.9 4927248 6082 /usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph
 6.1  2.2 4839988 7684 /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
 6.1  2.1 4876572 8155 /usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
 5.9  1.3 4652608 5760 /usr/bin/ceph-osd -f --cluster ceph --id 32 --setuser ceph --setgroup ceph
 5.8  1.9 4699092 8374 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
 5.8  1.4 4562480 5623 /usr/bin/ceph-osd -f --cluster ceph --id 30 --setuser ceph --setgroup ceph
 5.7  1.3 4491624 7268 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph --setgroup ceph
 5.5  1.2 4430164 6201 /usr/bin/ceph-osd -f --cluster ceph --id 33 --setuser ceph --setgroup ceph
 5.4  1.4 4319480 6405 /usr/bin/ceph-osd -f --cluster ceph --id 29 --setuser ceph --setgroup ceph
 1.0  0.8 1094500 4749 /usr/bin/ceph-mon -f --cluster ceph --id fc-r02-ceph-osd-01 --setuser ceph --setgroup ceph
 0.2  4.8 948764  4803 /usr/bin/ceph-mgr -f --cluster ceph --id fc-r02-ceph-osd-01 --setuser ceph --setgroup ceph

=

After a reboot, the node uses around 30GB, but over a month it is 
again over 50GB and growing.


Any suggestions ?

cu denny
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-30 Thread Adrien Gillard
On Tue, Apr 30, 2019 at 10:06 AM Igor Podlesny  wrote:
>
> On Tue, 30 Apr 2019 at 04:13, Adrien Gillard  wrote:
> > I would add that the use of cache tiering, though still possible, is not 
> > recommended
>
> It lacks references. CEPH docs I gave links to didn't say so.

The cache tiering documentation mentions that (your link refers to it):
http://docs.ceph.com/docs/nautilus/rados/operations/cache-tiering/#a-word-of-caution

There are some threads on the mailing list referring to the subject as
well (by David Turner or Christian Balzer, for instance).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some advice about Pools and Erasure Coding

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 19:11, Adrien Gillard 
wrote:

> On Tue, Apr 30, 2019 at 10:06 AM Igor Podlesny  wrote:
> >
> > On Tue, 30 Apr 2019 at 04:13, Adrien Gillard 
> wrote:
> > > I would add that the use of cache tiering, though still possible, is
> not recommended
> >
> > It lacks references. CEPH docs I gave links to didn't say so.
>
> The cache tiering documention mentions that (your link refers to it) :
>
> http://docs.ceph.com/docs/nautilus/rados/operations/cache-tiering/#a-word-of-caution


I saw this and didn't find "not recommended" or anything like it.

>
> 
>
> There are some threads on the mailing list refering to the subject as
> well  (by David Turner or
> Christian Balzer for instance)


Thanks, I will try to find them.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]
> Any suggestions ?

-- Try a different allocator.

In Proxmox 4 they by default had this in /etc/default/ceph {{

## use jemalloc instead of tcmalloc
#
# jemalloc is generally faster for small IO workloads and when
# ceph-osd is backed by SSDs.  However, memory usage is usually
# higher by 200-300mb.
#
#LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

}},

so you may try using it in the same way, the package is still there in
Proxmox 5:

  libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

No one can tell for sure if it would help, but jemalloc "...

is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support.

..." -- http://jemalloc.net/

I noticed OSDs with jemalloc tend to have a much bigger VSZ over time, but
RSS should be fine.
I look forward to hearing about your experience with it.
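
A minimal sketch of enabling it, assuming the stock Debian/Proxmox packaging
where the ceph-osd systemd units still read /etc/default/ceph (the path and
OSD id below are just examples -- adjust to your setup):

# echo 'LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1' >> /etc/default/ceph
# systemctl restart ceph-osd@31    # restart one OSD at a time and watch RSS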

--
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Inodes on /cephfs

2019-04-30 Thread Oliver Freyermuth

Dear Cephalopodians,

we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
(librbd) as backend and sharing the libvirtd configuration between the nodes 
via CephFS
(all on Mimic).

To share the libvirtd configuration between the nodes, we have symlinked some 
folders from /etc/libvirt to their counterparts on /cephfs,
so all nodes see the same configuration.
In general, this works very well (of course, there's a "gotcha": Libvirtd needs 
reloading / restart for some changes to the XMLs, we have automated that),
but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
Whenever there's a libvirtd update, unattended upgrades fail, and we see:

  Transaction check error:
installing package libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 
needs 2 inodes on the /cephfs filesystem
installing package libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 
needs 18 inodes on the /cephfs filesystem

So it seems yum follows the symlinks and checks the available inodes on 
/cephfs. Sadly, that reveals:
  [root@kvm001 libvirt]# LANG=C df -i /cephfs/
  Filesystem     Inodes  IUsed  IFree  IUse%  Mounted on
  ceph-fuse          68     68      0   100%  /cephfs

I think that's just because there is no real "limit" on the maximum inodes on 
CephFS. However, returning 0 breaks some existing tools (notably, Yum).

What do you think? Should CephFS return something different than 0 here to not 
break existing tools?
Or should the tools behave differently? But one might also argue that if the total number 
of Inodes matches the used number of Inodes, the FS is indeed "full".
It's just unclear to me who to file a bug against ;-).

Right now, I am just using:
yum -y --setopt=diskspacecheck=0 update
as a manual workaround, but this is naturally rather cumbersome.
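
If you want to make that permanent, the same option can apparently go into
/etc/yum.conf (note it then disables the space/inode check for every
transaction, not just libvirt updates):

[main]
diskspacecheck=0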

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Data distribution question

2019-04-30 Thread Shain Miley

Hi,

We have a cluster with 235 osd's running version 12.2.11 with a 
combination of 4 and 6 TB drives.  The data distribution across osd's 
varies from 52% to 94%.


I have been trying to figure out how to get this a bit more balanced as 
we are running into 'backfillfull' issues on a regular basis.


I've tried adding more pgs...but this did not seem to do much in terms 
of the imbalance.


Here is the end output from 'ceph osd df':

MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73

We have 8199 pgs total with 6775 of them in the pool that has 97% of the 
data.


The other pools are not really used (data, metadata, .rgw.root, 
.rgw.control, etc).  I have thought about deleting those unused pools so 
that most if not all the pgs are being used by the pool with the 
majority of the data.


However...before I do that...is there anything else I can do or try in 
order to see if I can balance out the data more uniformly?


Thanks in advance,

Shain

--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 
202.513.3649

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Kenneth Van Alstyne
Shain:
Have you looked into doing a "ceph osd reweight-by-utilization" by chance?  
I’ve found that data distribution is rarely perfect and on aging clusters, I 
always have to do this periodically.
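
For reference, a cautious way to run it is a dry run first (the 120% overload
threshold, the 0.05 max weight change and the cap of 10 OSDs per pass are just
example values):

# ceph osd test-reweight-by-utilization 120 0.05 10
# ceph osd reweight-by-utilization 120 0.05 10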

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

On Apr 30, 2019, at 11:34 AM, Shain Miley 
mailto:smi...@npr.org>> wrote:

Hi,

We have a cluster with 235 osd's running version 12.2.11 with a combination of 
4 and 6 TB drives.  The data distribution across osd's varies from 52% to 94%.

I have been trying to figure out how to get this a bit more balanced as we are 
running into 'backfillfull' issues on a regular basis.

I've tried adding more pgs...but this did not seem to do much in terms of the 
imbalance.

Here is the end output from 'ceph osd df':

MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73

We have 8199 pgs total with 6775 of them in the pool that has 97% of the data.

The other pools are not really used (data, metadata, .rgw.root, .rgw.control, 
etc).  I have thought about deleting those unused pools so that most if not all 
the pgs are being used by the pool with the majority of the data.

However...before I do that...there anything else I can do or try in order to 
see if I can balance out the data more uniformly?

Thanks in advance,

Shain

--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | 
smi...@npr.org | 202.513.3649

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Jack
Hi,

I see that you are using RGW.
RGW comes with many pools, yet most of them are used for metadata and
configuration; those do not store much data.
Such pools do not need more than a couple of PGs each (I use pg_num = 8).

You need to allocate your PGs to the pool that actually stores the data.

Please do the following, to let us know more:
Print the pg_num per pool:
for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
pg_num; done

Print the usage per pool:
ceph df

Also, instead of doing a "ceph osd reweight-by-utilization", check out
the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/

Finally, in Nautilus, the PG count can now scale up and down automatically.
See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
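
A minimal sketch of turning that on (Nautilus only; the pool name is just an
example):

# ceph mgr module enable pg_autoscaler
# ceph osd pool set npr_archive pg_autoscale_mode on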


On 04/30/2019 06:34 PM, Shain Miley wrote:
> Hi,
> 
> We have a cluster with 235 osd's running version 12.2.11 with a
> combination of 4 and 6 TB drives.  The data distribution across osd's
> varies from 52% to 94%.
> 
> I have been trying to figure out how to get this a bit more balanced as
> we are running into 'backfillfull' issues on a regular basis.
> 
> I've tried adding more pgs...but this did not seem to do much in terms
> of the imbalance.
> 
> Here is the end output from 'ceph osd df':
> 
> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
> 
> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
> data.
> 
> The other pools are not really used (data, metadata, .rgw.root,
> .rgw.control, etc).  I have thought about deleting those unused pools so
> that most if not all the pgs are being used by the pool with the
> majority of the data.
> 
> However...before I do that...there anything else I can do or try in
> order to see if I can balance out the data more uniformly?
> 
> Thanks in advance,
> 
> Shain
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Kenneth Van Alstyne
Unfortunately it looks like he’s still on Luminous, but if upgrading is an 
option, the options are indeed significantly better.  If I recall correctly, at 
least the balancer module is available in Luminous.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
Knight Point Systems, LLC
Service-Disabled Veteran-Owned Business
1775 Wiehle Avenue Suite 101 | Reston, VA 20190
c: 228-547-8045 f: 571-266-3106
www.knightpoint.com
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
GSA Schedule 70 SDVOSB: GS-35F-0646S
GSA MOBIS Schedule: GS-10F-0404Y
ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message.

On Apr 30, 2019, at 12:15 PM, Jack 
mailto:c...@jack.fr.eu.org>> wrote:

Hi,

I see that you are using rgw
RGW comes with many pools, yet most of them are used for metadata and
configuration, those do not store many data
Such pools do not need more than a couple PG, each (I use pg_num = 8)

You need to allocate your pg on pool that actually stores the data

Please do the following, to let us know more:
Print the pg_num per pool:
for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
pg_num; done

Print the usage per pool:
ceph df

Also, instead of doing a "ceph osd reweight-by-utilization", check out
the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/

Finally, in nautilus, the pg can now upscale and downscale automaticaly
See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/


On 04/30/2019 06:34 PM, Shain Miley wrote:
Hi,

We have a cluster with 235 osd's running version 12.2.11 with a
combination of 4 and 6 TB drives.  The data distribution across osd's
varies from 52% to 94%.

I have been trying to figure out how to get this a bit more balanced as
we are running into 'backfillfull' issues on a regular basis.

I've tried adding more pgs...but this did not seem to do much in terms
of the imbalance.

Here is the end output from 'ceph osd df':

MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73

We have 8199 pgs total with 6775 of them in the pool that has 97% of the
data.

The other pools are not really used (data, metadata, .rgw.root,
.rgw.control, etc).  I have thought about deleting those unused pools so
that most if not all the pgs are being used by the pool with the
majority of the data.

However...before I do that...there anything else I can do or try in
order to see if I can balance out the data more uniformly?

Thanks in advance,

Shain


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
The upmap balancer in v12.2.12 works really well... Perfectly uniform on
our clusters.

.. Dan


On Tue, 30 Apr 2019, 19:22 Kenneth Van Alstyne, 
wrote:

> Unfortunately it looks like he’s still on Luminous, but if upgrading is an
> option, the options are indeed significantly better.  If I recall
> correctly, at least the balancer module is available in Luminous.
>
> Thanks,
>
> --
> Kenneth Van Alstyne
> Systems Architect
> Knight Point Systems, LLC
> Service-Disabled Veteran-Owned Business
> 1775 Wiehle Avenue Suite 101 | Reston, VA 20190
> c: 228-547-8045 f: 571-266-3106
> www.knightpoint.com
> DHS EAGLE II Prime Contractor: FC1 SDVOSB Track
> GSA Schedule 70 SDVOSB: GS-35F-0646S
> GSA MOBIS Schedule: GS-10F-0404Y
> ISO 9001 / ISO 2 / ISO 27001 / CMMI Level 3
>
> Notice: This e-mail message, including any attachments, is for the sole
> use of the intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, copy, use, disclosure,
> or distribution is STRICTLY prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail and destroy all copies
> of the original message.
>
> On Apr 30, 2019, at 12:15 PM, Jack  wrote:
>
> Hi,
>
> I see that you are using rgw
> RGW comes with many pools, yet most of them are used for metadata and
> configuration, those do not store many data
> Such pools do not need more than a couple PG, each (I use pg_num = 8)
>
> You need to allocate your pg on pool that actually stores the data
>
> Please do the following, to let us know more:
> Print the pg_num per pool:
> for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
> pg_num; done
>
> Print the usage per pool:
> ceph df
>
> Also, instead of doing a "ceph osd reweight-by-utilization", check out
> the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/
>
> Finally, in nautilus, the pg can now upscale and downscale automaticaly
> See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
>
>
> On 04/30/2019 06:34 PM, Shain Miley wrote:
>
> Hi,
>
> We have a cluster with 235 osd's running version 12.2.11 with a
> combination of 4 and 6 TB drives.  The data distribution across osd's
> varies from 52% to 94%.
>
> I have been trying to figure out how to get this a bit more balanced as
> we are running into 'backfillfull' issues on a regular basis.
>
> I've tried adding more pgs...but this did not seem to do much in terms
> of the imbalance.
>
> Here is the end output from 'ceph osd df':
>
> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
>
> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
> data.
>
> The other pools are not really used (data, metadata, .rgw.root,
> .rgw.control, etc).  I have thought about deleting those unused pools so
> that most if not all the pgs are being used by the pool with the
> majority of the data.
>
> However...before I do that...there anything else I can do or try in
> order to see if I can balance out the data more uniformly?
>
> Thanks in advance,
>
> Shain
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 00:24, Dan van der Ster  wrote:
>
> The upmap balancer in v12.2.12 works really well... Perfectly uniform on our 
> clusters.
>
> .. Dan

mode upmap ?

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, 30 Apr 2019, 19:32 Igor Podlesny,  wrote:

> On Wed, 1 May 2019 at 00:24, Dan van der Ster  wrote:
> >
> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on
> our clusters.
> >
> > .. Dan
>
> mode upmap ?
>

yes, mgr balancer, mode upmap.

..  Dan



> --
> End of message. Next message?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Shain Miley

Here is the per pool pg_num info:

'data' pg_num 64
'metadata' pg_num 64
'rbd' pg_num 64
'npr_archive' pg_num 6775
'.rgw.root' pg_num 64
'.rgw.control' pg_num 64
'.rgw' pg_num 64
'.rgw.gc' pg_num 64
'.users.uid' pg_num 64
'.users.email' pg_num 64
'.users' pg_num 64
'.usage' pg_num 64
'.rgw.buckets.index' pg_num 128
'.intent-log' pg_num 8
'.rgw.buckets' pg_num 64
'kube' pg_num 512
'.log' pg_num 8

Here is the df output:

GLOBAL:
    SIZE     AVAIL   RAW USED  %RAW USED
    1.06PiB  306TiB  778TiB    71.75
POOLS:
    NAME                ID  USED     %USED  MAX AVAIL  OBJECTS
    data                0   11.7GiB  0.14   8.17TiB    3006
    metadata            1   0B       0      8.17TiB    0
    rbd                 2   43.2GiB  0.51   8.17TiB    11147
    npr_archive         3   258TiB   97.93  5.45TiB    82619649
    .rgw.root           4   1001B    0      8.17TiB    5
    .rgw.control        5   0B       0      8.17TiB    8
    .rgw                6   6.16KiB  0      8.17TiB    35
    .rgw.gc             7   0B       0      8.17TiB    32
    .users.uid          8   0B       0      8.17TiB    0
    .users.email        9   0B       0      8.17TiB    0
    .users              10  0B       0      8.17TiB    0
    .usage              11  0B       0      8.17TiB    1
    .rgw.buckets.index  12  0B       0      8.17TiB    26
    .intent-log         17  0B       0      5.45TiB    0
    .rgw.buckets        18  24.2GiB  0.29   8.17TiB    6622
    kube                21  1.82GiB  0.03   5.45TiB    550
    .log                22  0B       0      5.45TiB    176


The stuff in the data pool and the rgw pools is old data that we used 
for testing...if you guys think that removing everything outside of rbd 
and npr_archive would make a significant impact, I will give it a try.


Thanks,

Shain



On 4/30/19 1:15 PM, Jack wrote:

Hi,

I see that you are using rgw
RGW comes with many pools, yet most of them are used for metadata and
configuration, those do not store many data
Such pools do not need more than a couple PG, each (I use pg_num = 8)

You need to allocate your pg on pool that actually stores the data

Please do the following, to let us know more:
Print the pg_num per pool:
for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
pg_num; done

Print the usage per pool:
ceph df

Also, instead of doing a "ceph osd reweight-by-utilization", check out
the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/

Finally, in nautilus, the pg can now upscale and downscale automaticaly
See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/


On 04/30/2019 06:34 PM, Shain Miley wrote:

Hi,

We have a cluster with 235 osd's running version 12.2.11 with a
combination of 4 and 6 TB drives.  The data distribution across osd's
varies from 52% to 94%.

I have been trying to figure out how to get this a bit more balanced as
we are running into 'backfillfull' issues on a regular basis.

I've tried adding more pgs...but this did not seem to do much in terms
of the imbalance.

Here is the end output from 'ceph osd df':

MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73

We have 8199 pgs total with 6775 of them in the pool that has 97% of the
data.

The other pools are not really used (data, metadata, .rgw.root,
.rgw.control, etc).  I have thought about deleting those unused pools so
that most if not all the pgs are being used by the pool with the
majority of the data.

However...before I do that...there anything else I can do or try in
order to see if I can balance out the data more uniformly?

Thanks in advance,

Shain


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
NPR | Shain Miley | Manager of Infrastructure, Digital Media | smi...@npr.org | 
202.513.3649

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
Removing pools won't make a difference.

Read up to slide 22 here:
https://www.slideshare.net/mobile/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

..
Dan

(Apologies for terseness, I'm mobile)



On Tue, 30 Apr 2019, 20:02 Shain Miley,  wrote:

> Here is the per pool pg_num info:
>
> 'data' pg_num 64
> 'metadata' pg_num 64
> 'rbd' pg_num 64
> 'npr_archive' pg_num 6775
> '.rgw.root' pg_num 64
> '.rgw.control' pg_num 64
> '.rgw' pg_num 64
> '.rgw.gc' pg_num 64
> '.users.uid' pg_num 64
> '.users.email' pg_num 64
> '.users' pg_num 64
> '.usage' pg_num 64
> '.rgw.buckets.index' pg_num 128
> '.intent-log' pg_num 8
> '.rgw.buckets' pg_num 64
> 'kube' pg_num 512
> '.log' pg_num 8
>
> Here is the df output:
>
> GLOBAL:
>  SIZEAVAIL  RAW USED %RAW USED
>  1.06PiB 306TiB   778TiB 71.75
> POOLS:
>  NAME   ID USED%USED MAX AVAIL OBJECTS
>  data   0  11.7GiB  0.14 8.17TiB 3006
>  metadata   1   0B 0 8.17TiB0
>  rbd2  43.2GiB  0.51 8.17TiB11147
>  npr_archive3   258TiB 97.93 5.45TiB 82619649
>  .rgw.root  41001B 0 8.17TiB5
>  .rgw.control   5   0B 0 8.17TiB8
>  .rgw   6  6.16KiB 0 8.17TiB   35
>  .rgw.gc7   0B 0 8.17TiB   32
>  .users.uid 8   0B 0 8.17TiB0
>  .users.email   9   0B 0 8.17TiB0
>  .users 10  0B 0 8.17TiB0
>  .usage 11  0B 0 8.17TiB1
>  .rgw.buckets.index 12  0B 0 8.17TiB   26
>  .intent-log17  0B 0 5.45TiB0
>  .rgw.buckets   18 24.2GiB  0.29 8.17TiB 6622
>  kube   21 1.82GiB  0.03 5.45TiB  550
>  .log   22  0B 0 5.45TiB  176
>
>
> The stuff in the data pool and the rwg pools is old data that we used
> for testing...if you guys think that removing everything outside of rbd
> and npr_archive would make a significant impact I will give it a try.
>
> Thanks,
>
> Shain
>
>
>
> On 4/30/19 1:15 PM, Jack wrote:
> > Hi,
> >
> > I see that you are using rgw
> > RGW comes with many pools, yet most of them are used for metadata and
> > configuration, those do not store many data
> > Such pools do not need more than a couple PG, each (I use pg_num = 8)
> >
> > You need to allocate your pg on pool that actually stores the data
> >
> > Please do the following, to let us know more:
> > Print the pg_num per pool:
> > for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
> > pg_num; done
> >
> > Print the usage per pool:
> > ceph df
> >
> > Also, instead of doing a "ceph osd reweight-by-utilization", check out
> > the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/
> >
> > Finally, in nautilus, the pg can now upscale and downscale automaticaly
> > See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
> >
> >
> > On 04/30/2019 06:34 PM, Shain Miley wrote:
> >> Hi,
> >>
> >> We have a cluster with 235 osd's running version 12.2.11 with a
> >> combination of 4 and 6 TB drives.  The data distribution across osd's
> >> varies from 52% to 94%.
> >>
> >> I have been trying to figure out how to get this a bit more balanced as
> >> we are running into 'backfillfull' issues on a regular basis.
> >>
> >> I've tried adding more pgs...but this did not seem to do much in terms
> >> of the imbalance.
> >>
> >> Here is the end output from 'ceph osd df':
> >>
> >> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
> >>
> >> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
> >> data.
> >>
> >> The other pools are not really used (data, metadata, .rgw.root,
> >> .rgw.control, etc).  I have thought about deleting those unused pools so
> >> that most if not all the pgs are being used by the pool with the
> >> majority of the data.
> >>
> >> However...before I do that...there anything else I can do or try in
> >> order to see if I can balance out the data more uniformly?
> >>
> >> Thanks in advance,
> >>
> >> Shain
> >>

Re: [ceph-users] Data distribution question

2019-04-30 Thread Jack
You have a lot of useless PGs, yet they have the same "weight" as the
useful ones.

If those pools are useless, you can:
- drop them
- raise npr_archive's pg_num using the freed PGs

As npr_archive owns 97% of your data, it should get 97% of your PGs (which
is ~8000).
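
A sketch of what raising it could look like (the target is only an example;
on Luminous pg_num can be increased but never decreased, so grow it in steps
and let backfill settle in between):

# ceph osd pool set npr_archive pg_num 8192
# ceph osd pool set npr_archive pgp_num 8192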

The balancer module is still quite useful.

On 04/30/2019 08:02 PM, Shain Miley wrote:
> Here is the per pool pg_num info:
> 
> 'data' pg_num 64
> 'metadata' pg_num 64
> 'rbd' pg_num 64
> 'npr_archive' pg_num 6775
> '.rgw.root' pg_num 64
> '.rgw.control' pg_num 64
> '.rgw' pg_num 64
> '.rgw.gc' pg_num 64
> '.users.uid' pg_num 64
> '.users.email' pg_num 64
> '.users' pg_num 64
> '.usage' pg_num 64
> '.rgw.buckets.index' pg_num 128
> '.intent-log' pg_num 8
> '.rgw.buckets' pg_num 64
> 'kube' pg_num 512
> '.log' pg_num 8
> 
> Here is the df output:
> 
> GLOBAL:
> SIZEAVAIL  RAW USED %RAW USED
> 1.06PiB 306TiB   778TiB 71.75
> POOLS:
> NAME   ID USED%USED MAX AVAIL OBJECTS
> data   0  11.7GiB  0.14 8.17TiB 3006
> metadata   1   0B 0 8.17TiB0
> rbd2  43.2GiB  0.51 8.17TiB11147
> npr_archive3   258TiB 97.93 5.45TiB 82619649
> .rgw.root  41001B 0 8.17TiB5
> .rgw.control   5   0B 0 8.17TiB8
> .rgw   6  6.16KiB 0 8.17TiB   35
> .rgw.gc7   0B 0 8.17TiB   32
> .users.uid 8   0B 0 8.17TiB0
> .users.email   9   0B 0 8.17TiB0
> .users 10  0B 0 8.17TiB0
> .usage 11  0B 0 8.17TiB1
> .rgw.buckets.index 12  0B 0 8.17TiB   26
> .intent-log17  0B 0 5.45TiB0
> .rgw.buckets   18 24.2GiB  0.29 8.17TiB 6622
> kube   21 1.82GiB  0.03 5.45TiB  550
> .log   22  0B 0 5.45TiB  176
> 
> 
> The stuff in the data pool and the rwg pools is old data that we used
> for testing...if you guys think that removing everything outside of rbd
> and npr_archive would make a significant impact I will give it a try.
> 
> Thanks,
> 
> Shain
> 
> 
> 
> On 4/30/19 1:15 PM, Jack wrote:
>> Hi,
>>
>> I see that you are using rgw
>> RGW comes with many pools, yet most of them are used for metadata and
>> configuration, those do not store many data
>> Such pools do not need more than a couple PG, each (I use pg_num = 8)
>>
>> You need to allocate your pg on pool that actually stores the data
>>
>> Please do the following, to let us know more:
>> Print the pg_num per pool:
>> for i in $(rados lspools); do echo -n "$i: "; ceph osd pool get $i
>> pg_num; done
>>
>> Print the usage per pool:
>> ceph df
>>
>> Also, instead of doing a "ceph osd reweight-by-utilization", check out
>> the balancer plugin : http://docs.ceph.com/docs/mimic/mgr/balancer/
>>
>>
>> Finally, in nautilus, the pg can now upscale and downscale automaticaly
>> See https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/
>>
>>
>>
>> On 04/30/2019 06:34 PM, Shain Miley wrote:
>>> Hi,
>>>
>>> We have a cluster with 235 osd's running version 12.2.11 with a
>>> combination of 4 and 6 TB drives.  The data distribution across osd's
>>> varies from 52% to 94%.
>>>
>>> I have been trying to figure out how to get this a bit more balanced as
>>> we are running into 'backfillfull' issues on a regular basis.
>>>
>>> I've tried adding more pgs...but this did not seem to do much in terms
>>> of the imbalance.
>>>
>>> Here is the end output from 'ceph osd df':
>>>
>>> MIN/MAX VAR: 0.73/1.31  STDDEV: 7.73
>>>
>>> We have 8199 pgs total with 6775 of them in the pool that has 97% of the
>>> data.
>>>
>>> The other pools are not really used (data, metadata, .rgw.root,
>>> .rgw.control, etc).  I have thought about deleting those unused pools so
>>> that most if not all the pgs are being used by the pool with the
>>> majority of the data.
>>>
>>> However...before I do that...there anything else I can do or try in
>>> order to see if I can balance out the data more uniformly?
>>>
>>> Thanks in advance,
>>>
>>> Shain
>>>

Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
>> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
>> > our clusters.
>>
>> mode upmap ?
>
> yes, mgr balancer, mode upmap.

I see. Was it a matter of just:

1) ceph balancer mode upmap
2) ceph balancer on

or were there any other steps?

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:26, Jack  wrote:
> If those pools are useless, you can:
> - drop them

As Dan pointed out, it's unlikely to have any effect.
The thing is, imbalance is a "property" of a pool -- I'd suppose most often
of the most loaded one (or of a few of the most loaded ones).
Pools that aren't used much don't impact it.

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [events] Ceph at Red Hat Summit May 7th 6:30pm

2019-04-30 Thread Mike Perez
Hey all,

If you happen to be attending the Boston Red Hat Summit or in the area,
please join the Ceph and Gluster community May 7th 6:30pm at our happy hour
event. Find all the details on the Eventbrite page. Looking forward to
seeing you all there!

https://www.eventbrite.com/e/ceph-and-gluster-community-happy-hour-at-red-hat-summit-registration-60698158827

--
Mike Perez (thingee)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny  wrote:
>
> On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
> >> > our clusters.
> >>
> >> mode upmap ?
> >
> > yes, mgr balancer, mode upmap.
>
> I see. Was it a matter of just:
>
> 1) ceph balancer mode upmap
> 2) ceph balancer on
>
> or were there any other steps?

All of the clients need to be Luminous or newer:

# ceph osd set-require-min-compat-client luminous

You need to enable the module:

# ceph mgr module enable balancer

You probably don't want it to run 24/7:

# ceph config-key set mgr/balancer/begin_time 0800
# ceph config-key set mgr/balancer/end_time 1800

The default rate at which it balances things is a bit too high for my taste:

# ceph config-key set mgr/balancer/max_misplaced 0.005
# ceph config-key set mgr/balancer/upmap_max_iterations 2

(Those above are optional... YMMV)

Now fail the active mgr so that the new one reads those new options above.

# ceph mgr fail 

Enable the upmap mode:

# ceph balancer mode upmap

Test it once to see that it works at all:

# ceph balancer optimize myplan
# ceph balancer show myplan
# ceph balancer reset

(any errors, start debugging -- use debug_mgr = 4/5 and check the
active mgr's log for the balancer details.)

# ceph balancer on

Now it'll start moving the PGs around until things are quite well balanced.
In our clusters that process takes a week or two... it depends on
cluster size, numpgs, etc...
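
If you want to keep an eye on it while it runs (just a suggestion):

# ceph balancer status
# ceph -s     # watch the misplaced objects percentage as it works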

Hope that helps!

Dan

>
> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:26, Igor Podlesny  wrote:
> On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform on 
> >> > our clusters.
> >>
> >> mode upmap ?
> >
> > yes, mgr balancer, mode upmap.

Also -- do your Ceph clusters have single-root hierarchy pools (like
"default"), or are there some pools that use non-default roots?

Looking through the docs I didn't find a way to narrow the balancer's scope
down to specific pool(s), although personally I would prefer it to
operate on a small set of them.

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Igor Podlesny
On Wed, 1 May 2019 at 01:58, Dan van der Ster  wrote:
> On Tue, Apr 30, 2019 at 8:26 PM Igor Podlesny  wrote:
[...]
> All of the clients need to be luminous our newer:
>
> # ceph osd set-require-min-compat-client luminous
>
> You need to enable the module:
>
> # ceph mgr module enable balancer

(Enabled by default according to the docs.)
>
> You probably don't want to it run 24/7:
>
> # ceph config-key set mgr/balancer/begin_time 0800
> # ceph config-key set mgr/balancer/end_time 1800

oh, that's handy.

> The default rate that it balances things are a bit too high for my taste:
>
> # ceph config-key set mgr/balancer/max_misplaced 0.005
> # ceph config-key set mgr/balancer/upmap_max_iterations 2
>
> (Those above are optional... YMMV)

Yep, but good to know!
>
> Now fail the active mgr so that the new one reads those new options above.
>
> # ceph mgr fail 
>
> Enable the upmap mode:
>
> # ceph balancer mode upmap
>
> Test it once to see that it works at all:
>
> # ceph balancer optimize myplan
> # ceph balancer show myplan
> # ceph balancer reset
>
> (any errors, start debugging -- use debug_mgr = 4/5 and check the
> active mgr's log for the balancer details.)
>
> # ceph balancer on
>
> Now it'll start moving the PGs around until things are quite well balanced.
> In our clusters that process takes a week or two... it depends on
> cluster size, numpgs, etc...
>
> Hope that helps!

Thank you :)

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Data distribution question

2019-04-30 Thread Dan van der Ster
On Tue, Apr 30, 2019 at 9:01 PM Igor Podlesny  wrote:
>
> On Wed, 1 May 2019 at 01:26, Igor Podlesny  wrote:
> > On Wed, 1 May 2019 at 01:01, Dan van der Ster  wrote:
> > >> > The upmap balancer in v12.2.12 works really well... Perfectly uniform 
> > >> > on our clusters.
> > >>
> > >> mode upmap ?
> > >
> > > yes, mgr balancer, mode upmap.
>
> Also -- do your CEPHs have single root hierarchy pools (like
> "default"), or there're some pools that use non-default ones?
>
> Looking through docs I didn't find a way to narrow balancer's scope
> down to specific pool(s), although personally I would prefer it to
> operate on a small set of them.
>

We have a mix of both single and dual root hierarchies -- the upmap
balancer works for all.
(E.g. this works: pool A with 3 replicas in root A, pool B with 3
replicas in root B.
However if you have a cluster with two roots, and a pool that does
something complex like put 2 replicas in root A and 1 replica in root
B -- I haven't tested that recently).

In luminous and mimic there isn't a way to scope the auto balancing
down to limited pools.
In practice that doesn't really matter, because of how it works, roughly:

while true:
   select a random pool
   get the pg distribution for that pool
   create upmaps (or remove existing upmaps) to balance the pgs for that pool
   sleep 60s

Eventually it attacks all pools and gets them fully balanced. (It
anyway spends most of the time balancing the pools that matter,
because the ones that don't have data get "balanced" quickly).
If you absolutely must limit the pools, you have to script something
to loop on `ceph balancer optimize myplan ; ceph balancer exec
myplan`
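
A rough sketch of such a wrapper -- the plan name and pool are placeholders,
and it assumes your release's `ceph balancer optimize` accepts an optional
pool argument (otherwise each pass simply optimizes everything):

while true; do
    ceph balancer optimize myplan mypool
    ceph balancer exec myplan
    ceph balancer rm myplan
    sleep 600
done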

Something to reiterate: v12.2.12 has the latest upmap balancing
heuristics, which are miles better than 12.2.11. (Big thanks to Xie
Xingguo who worked hard to get this right!!!)
Mimic v13.2.5 doesn't have those fixes (maybe in the pipeline for
13.2.6?) and I haven't checked Nautilus.
If you're on Mimic, then its upmap balancer heuristics are better
than nothing, but they might be imperfect or not work in certain cases
(e.g. multi-root).

-- Dan


> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] HEALTH_WARN - 3 modules have failed dependencies

2019-04-30 Thread Ranjan Ghosh

Hi my beloved Ceph list,

After an upgrade from Ubuntu Cosmic to Ubuntu Disco (with the corresponding Ceph 
packages updated from 13.2.2 to 13.2.4), I now get this when I enter 
"ceph health":


HEALTH_WARN 3 modules have failed dependencies

"ceph mgr module ls" only reports those 3 modules enabled:

"enabled_modules": [
    "dashboard",
    "restful",
    "status"
    ],
...

Then I found this page here:

docs.ceph.com/docs/master/rados/operations/health-checks

Under "MGR_MODULE_DEPENDENCY" it says:

"An enabled manager module is failing its dependency check. This health 
check should come with an explanatory message from the module about the 
problem."


What is "this health check"? If the page talks about "ceph health" or 
"ceph -s" then, no, there is no explanatory message there on what's wrong.


Furthermore, it says:

"This health check is only applied to enabled modules. If a module is 
not enabled, you can see whether it is reporting dependency issues in 
the output of ceph module ls."


The command "ceph module ls", however, doesn't exist. If "ceph mgr 
module ls" is really meant, then I get this:


{
    "enabled_modules": [
    "dashboard",
    "restful",
    "status"
    ],
    "disabled_modules": [
    {
    "name": "balancer",
    "can_run": true,
    "error_string": ""
    },
    {
    "name": "hello",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "influx",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "iostat",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "localpool",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "prometheus",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "selftest",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "smart",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "telegraf",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "telemetry",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    },
    {
    "name": "zabbix",
    "can_run": false,
    "error_string": "Interpreter change detected - this module 
can only be loaded into one interpreter per process."

    }
    ]
}

Usually the Ceph documentation is great, very detailed and helpful. But 
I can find nothing on how to resolve this problem. Any help is much 
appreciated.


Thank you / Best regards

Ranjan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] obj_size_info_mismatch error handling

2019-04-30 Thread ceph
Hello Reed,

I would give PG repair a try.
IIRC there shouldn't be an issue when you have size 3... it would be difficult when 
you have size 2, I guess...
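
For reference, kicking off the repair on the affected PGs would look like this
(PG IDs taken from the health output quoted below):

# ceph pg repair 17.72
# ceph pg repair 17.2b9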

Hth
Mehmet

On 29 April 2019 17:05:48 MESZ, Reed Dier  wrote:
>Hi list,
>
>Woke up this morning to two PG's reporting scrub errors, in a way that
>I haven't seen before.
>> $ ceph versions
>> {
>> "mon": {
>> "ceph version 13.2.5
>(cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
>> },
>> "mgr": {
>> "ceph version 13.2.5
>(cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 3
>> },
>> "osd": {
>> "ceph version 13.2.4
>(b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156
>> },
>> "mds": {
>> "ceph version 13.2.5
>(cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2
>> },
>> "overall": {
>> "ceph version 13.2.4
>(b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)": 156,
>> "ceph version 13.2.5
>(cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 8
>> }
>> }
>
>
>> OSD_SCRUB_ERRORS 8 scrub errors
>> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>> pg 17.72 is active+clean+inconsistent, acting [3,7,153]
>> pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
>
>Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty
>yields:
>> {
>> "epoch": 134582,
>> "inconsistents": [
>> {
>> "object": {
>> "name": "10008536718.",
>> "nspace": "",
>> "locator": "",
>> "snap": "head",
>> "version": 0
>> },
>> "errors": [],
>> "union_shard_errors": [
>> "obj_size_info_mismatch"
>> ],
>> "shards": [
>> {
>> "osd": 7,
>> "primary": false,
>> "errors": [
>> "obj_size_info_mismatch"
>> ],
>> "size": 5883,
>> "object_info": {
>> "oid": {
>> "oid": "10008536718.",
>> "key": "",
>> "snapid": -2,
>> "hash": 1752643257,
>> "max": 0,
>> "pool": 17,
>> "namespace": ""
>> },
>> "version": "134599'448331",
>> "prior_version": "134599'448330",
>> "last_reqid": "client.1580931080.0:671854",
>> "user_version": 448331,
>> "size": 3505,
>> "mtime": "2019-04-28 15:32:20.003519",
>> "local_mtime": "2019-04-28 15:32:25.991015",
>> "lost": 0,
>> "flags": [
>> "dirty",
>> "data_digest",
>> "omap_digest"
>> ],
>> "truncate_seq": 899,
>> "truncate_size": 0,
>> "data_digest": "0xf99a3bd3",
>> "omap_digest": "0x",
>> "expected_object_size": 0,
>> "expected_write_size": 0,
>> "alloc_hint_flags": 0,
>> "manifest": {
>> "type": 0
>> },
>> "watchers": {}
>> }
>> },
>> {
>> "osd": 16,
>> "primary": false,
>> "errors": [
>> "obj_size_info_mismatch"
>> ],
>> "size": 5883,
>> "object_info": {
>> "oid": {
>> "oid": "10008536718.",
>> "key": "",
>> "snapid": -2,
>> "hash": 1752643257,
>> "max": 0,
>> "pool": 17,
>> "namespace": ""
>> },
>> "version": "134599'448331",
>> "prior_version": "134599'448330",
>> "last_reqid": "client.1580931080.0:671854",
>> "user_version": 448331,
>> "size": 3505,
>> "mtime": "2019-04-28 15:32:20.003519",
>> "local_mtime": "2019-04-28 15:32:25.991015",
>> "lost": 0,
>> "flags": [
>> "dirty",
>> "data_digest",
>> 

Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Patrick Donnelly
On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
 wrote:
>
> Dear Cephalopodians,
>
> we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
> (librbd) as backend and sharing the libvirtd configuration between the nodes 
> via CephFS
> (all on Mimic).
>
> To share the libvirtd configuration between the nodes, we have symlinked some 
> folders from /etc/libvirt to their counterparts on /cephfs,
> so all nodes see the same configuration.
> In general, this works very well (of course, there's a "gotcha": Libvirtd 
> needs reloading / restart for some changes to the XMLs, we have automated 
> that),
> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
>
>Transaction check error:
>  installing package libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 
> needs 2 inodes on the /cephfs filesystem
>  installing package 
> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on the 
> /cephfs filesystem
>
> So it seems yum follows the symlinks and checks the available inodes on 
> /cephfs. Sadly, that reveals:
>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
>Filesystem Inodes IUsed IFree IUse% Mounted on
>ceph-fuse  6868 0  100% /cephfs
>
> I think that's just because there is no real "limit" on the maximum inodes on 
> CephFS. However, returning 0 breaks some existing tools (notably, Yum).
>
> What do you think? Should CephFS return something different than 0 here to 
> not break existing tools?
> Or should the tools behave differently? But one might also argue that if the 
> total number of Inodes matches the used number of Inodes, the FS is indeed 
> "full".
> It's just unclear to me who to file a bug against ;-).
>
> Right now, I am just using:
> yum -y --setopt=diskspacecheck=0 update
> as a manual workaround, but this is naturally rather cumbersome.

This is fallout from [1]. See discussion on setting f_free to 0 here
[2]. In summary, userland tools are trying to be too clever by looking
at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
there are other instances of this.]

[1] https://github.com/ceph/ceph/pull/23323
[2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unable to list rbd block > images in nautilus dashboard

2019-04-30 Thread Wes Cilldhaire
Hi, I've just noticed PR #27652, will test this out on my setup today. 

Thanks again 


From: "Wes Cilldhaire"  
To: "Ricardo Dias"  
Cc: "ceph-users"  
Sent: Tuesday, 9 April, 2019 12:54:02 AM 
Subject: Re: [ceph-users] Unable to list rbd block > images in nautilus 
dashboard 

Thank you 

- On 9 Apr, 2019, at 12:50 AM, Ricardo Dias rd...@suse.com wrote: 

> Hi Wes, 
> 
> I just filed a bug ticket in the Ceph tracker about this: 
> 
> http://tracker.ceph.com/issues/39140 
> 
> Will work on a solution ASAP. 
> 
> Thanks, 
> Ricardo Dias 
> 
> On 08/04/19 15:41, Wes Cilldhaire wrote: 
>> It's definitely ceph-mgr that is struggling here. It uses 100% of a CPU for 
>> several tens of seconds and reports the following in its log a few times before 
>> anything gets displayed 
>> 
>> Traceback (most recent call last): 
>> File "/usr/local/share/ceph/mgr/dashboard/services/exception.py", line 88, 
>> in 
>> dashboard_exception_handler 
>> return handler(*args, **kwargs) 
>> File "/usr/lib64/python2.7/site-packages/cherrypy/_cpdispatch.py", line 54, 
>> in 
>> __call__ 
>> return self.callable(*self.args, **self.kwargs) 
>> File "/usr/local/share/ceph/mgr/dashboard/controllers/__init__.py", line 
>> 649, in 
>> inner 
>> ret = func(*args, **kwargs) 
>> File "/usr/local/share/ceph/mgr/dashboard/controllers/__init__.py", line 
>> 842, in 
>> wrapper 
>> return func(*vpath, **params) 
>> File "/usr/local/share/ceph/mgr/dashboard/services/exception.py", line 44, 
>> in 
>> wrapper 
>> return f(*args, **kwargs) 
>> File "/usr/local/share/ceph/mgr/dashboard/services/exception.py", line 44, 
>> in 
>> wrapper 
>> return f(*args, **kwargs) 
>> File "/usr/local/share/ceph/mgr/dashboard/controllers/rbd.py", line 270, in 
>> list 
>> return self._rbd_list(pool_name) 
>> File "/usr/local/share/ceph/mgr/dashboard/controllers/rbd.py", line 261, in 
>> _rbd_list 
>> status, value = self._rbd_pool_list(pool) 
>> File "/usr/local/share/ceph/mgr/dashboard/tools.py", line 244, in wrapper 
>> return rvc.run(fn, args, kwargs) 
>> File "/usr/local/share/ceph/mgr/dashboard/tools.py", line 232, in run 
>> raise ViewCacheNoDataException() 
>> ViewCacheNoDataException: ViewCache: unable to retrieve data 
>> 
>> - On 5 Apr, 2019, at 5:06 PM, Wes Cilldhaire w...@sol1.com.au wrote: 
>> 
>>> Hi Lenz, 
>>> 
>>> Thanks for responding. I suspected that the number of rbd images might have 
>>> had 
>>> something to do with it so I cleaned up old disposable VM images I am no 
>>> longer 
>>> using, taking the list down from ~30 to 16, 2 in the EC pool on hdds and 
>>> the 
>>> rest on the replicated ssd pool. They vary in size from 50GB to 200GB, I 
>>> don't 
>>> have the # of objects per rbd on hand right now but maybe this is a factor 
>>> as 
>>> well, particularly with 'du'. This doesn't appear to have made a difference 
>>> in 
>>> the time and number of attempts required to list them in the dashboard. 
>>> 
>>> I suspect it might be a case of 'du on all images is always going to take 
>>> longer 
>>> than the current dashboard timeout', in which case the behaviour of the 
>>> dashboard might possibly need to change to account for this, maybe fetch 
>>> and 
>>> list the images in parallel and asynchronously or something. As it stands 
>>> it 
>>> means the dashboard isn't really usable for managing existing images, which 
>>> is 
>>> a shame because having that ability makes ceph accessible to our clients 
>>> who 
>>> are considering it and begins affording some level of self-service for them 
>>> - 
>>> one of the reasons we've been really excited for Mimic's release actually. 
>>> I 
>>> really hope I've just done something wrong :) 
>>> 
>>> I'll try to isolate which process the delay is coming from tonight as well 
>>> as 
>>> collecting other useful metrics when I'm back on that network tonight. 
>>> 
>>> Thanks, 
>>> Wes 
>>> 
>>> 
>> ___ 
>> ceph-users mailing list 
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 
> -- 
> Ricardo Dias 
> Senior Software Engineer - Storage Team 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, 
> HRB 21284 
> (AG Nürnberg) 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

-- 
Wes Cilldhaire 
Lead Developer 
Support 1300 765 122 
supp...@sol1.com.au 
Direct  02 8292 0521 
Mobile  0406 190 426 
Web http://sol1.com.au/ 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Oliver Freyermuth
On 01.05.19 at 00:51, Patrick Donnelly wrote:
> On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
>  wrote:
>>
>> Dear Cephalopodians,
>>
>> we have a classic libvirtd / KVM based virtualization cluster using Ceph-RBD 
>> (librbd) as backend and sharing the libvirtd configuration between the nodes 
>> via CephFS
>> (all on Mimic).
>>
>> To share the libvirtd configuration between the nodes, we have symlinked 
>> some folders from /etc/libvirt to their counterparts on /cephfs,
>> so all nodes see the same configuration.
>> In general, this works very well (of course, there's a "gotcha": Libvirtd 
>> needs reloading / restart for some changes to the XMLs, we have automated 
>> that),
>> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
>> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
>>
>>Transaction check error:
>>  installing package 
>> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on the 
>> /cephfs filesystem
>>  installing package 
>> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
>> the /cephfs filesystem
>>
>> So it seems yum follows the symlinks and checks the available inodes on 
>> /cephfs. Sadly, that reveals:
>>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
>>Filesystem Inodes IUsed IFree IUse% Mounted on
>>ceph-fuse      68    68     0  100% /cephfs
>>
>> I think that's just because there is no real "limit" on the maximum inodes 
>> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
>>
>> What do you think? Should CephFS return something different than 0 here to 
>> not break existing tools?
>> Or should the tools behave differently? But one might also argue that if the 
>> total number of Inodes matches the used number of Inodes, the FS is indeed 
>> "full".
>> It's just unclear to me who to file a bug against ;-).
>>
>> Right now, I am just using:
>> yum -y --setopt=diskspacecheck=0 update
>> as a manual workaround, but this is naturally rather cumbersome.
> 
> This is fallout from [1]. See discussion on setting f_free to 0 here
> [2]. In summary, userland tools are trying to be too clever by looking
> at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
> there are other instances of this.]
> 
> [1] https://github.com/ceph/ceph/pull/23323
> [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911

Thanks for the references! They certainly clarify why this decision was taken, 
and I can of course only applaud the intent of preventing misleading monitoring. 
Still, even though I don't have other affected tools at hand (yet), I am not yet 
convinced "0" is a better choice than "ULONG_MAX". 
It certainly alerts users / monitoring software that something is off, but it 
also breaks a check which every file system I have encountered so far allows. 

Yum (or any other package manager doing things in a safe manner) needs to 
ensure it can fully install a package in an "atomic" way before doing so, since 
rolling back may be complex or even impossible (on most file systems). 
So it needs a way to check whether a file system can hold the additional files, 
in terms of both space and inodes, before placing the data there, or it risks 
installing something only partially and potentially being unable to roll back. 

In most cases, the number of free inodes allows for that check. Of course, that 
number has no (direct) meaning for CephFS, so one might argue the tools should 
add an exception for CephFS - 
but as the discussion correctly stated, there's no defined way to find out 
whether a file system even has a notion of "free inodes", and - if we go for 
exceptional treatment of a list of file systems - 
not even a "clean" way to find out that the file system is CephFS (the tools 
will only see FUSE for ceph-fuse) [1]. 

So my question is: 
How are tools which need to ensure that a file system can accept a given number 
of bytes and inodes before actually placing the data there supposed to check 
that in the case of CephFS? 
And if they should not check, how do they find out that this check, which is 
valid on e.g. ext4, is not useful on CephFS? 
(or, in other words: if I were to file a bug report against Yum, I could not 
think of any implementation they could make to solve this issue)
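
For illustration, here is a minimal sketch (my own code, not Yum's actual 
implementation) of the kind of statvfs-based pre-check such tools typically 
perform, and why CephFS's current numbers trip it: 

    # Hypothetical pre-flight check a package manager might run per filesystem.
    import os

    def has_room(path, needed_bytes, needed_inodes):
        st = os.statvfs(path)
        enough_space = st.f_bavail * st.f_frsize >= needed_bytes
        # Filesystems without an inode limit (btrfs, NFS, ...) report
        # f_files == 0, which skips the inode check entirely.  CephFS now
        # reports a non-zero f_files together with f_ffree == 0, so the
        # check below fails even though the FS is not actually "full".
        enough_inodes = st.f_files == 0 or st.f_ffree >= needed_inodes
        return enough_space and enough_inodes

    print(has_room("/cephfs", 10 * 1024**2, 18))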

Of course, if it's just us, we can live with the workaround. We monitor space 
consumption on all file systems, and may start monitoring free inodes on our 
ext4 file systems, 
so that we can safely disable the Yum check on the affected nodes. 
But I wonder whether this is the best way to go (it prevents a valid use case 
of a package manager, and there seems to be no clean way to fix it inside Yum 
that I am aware of). 
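
Concretely, the persistent form of that workaround would presumably just be the 
following (assuming yum honours diskspacecheck in the main config the same way 
it does on the command line): 

    # /etc/yum.conf on the affected KVM nodes -- sketch only
    [main]
    diskspacecheck=0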

Hence, my personal preference would be ULONG_MAX, but of course feel free to 
stay with 0. If nobody else complains, it's probably a non-issue for other 
users ;-). 

Cheers,
Oliver

[1] https://github.com/ceph/ceph/pull/23323#issuecomm

[ceph-users] hardware requirements for metadata server

2019-04-30 Thread Manuel Sopena Ballesteros
Dear ceph users,

I would like to ask: does the metadata server need much block device storage, 
or does it only need RAM? How can I calculate the amount of disk and/or memory 
needed?

Thank you very much


Manuel Sopena Ballesteros

Big Data Engineer | Kinghorn Centre for Clinical Genomics


a: 384 Victoria Street, Darlinghurst NSW 2010
p: +61 2 9355 5760  |  +61 4 12 123 123
e: manuel...@garvan.org.au


NOTICE
Please consider the environment before printing this email. This message and 
any attachments are intended for the addressee named and may contain legally 
privileged/confidential/copyright information. If you are not the intended 
recipient, you should not read, use, disclose, copy or distribute this 
communication. If you have received this message in error please notify us at 
once by return email and then delete both messages. We accept no liability for 
the distribution of viruses or similar in electronic communications. This 
notice should not be removed.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] obj_size_info_mismatch error handling

2019-04-30 Thread Brad Hubbard
Which size is correct?

On Tue, Apr 30, 2019 at 1:06 AM Reed Dier  wrote:
>
> Hi list,
>
> Woke up this morning to two PG's reporting scrub errors, in a way that I 
> haven't seen before.
>
> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "mgr": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "osd": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156
> },
> "mds": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 2
> },
> "overall": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156,
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 8
> }
> }
>
>
> OSD_SCRUB_ERRORS 8 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
>
>
> Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
>
> {
> "epoch": 134582,
> "inconsistents": [
> {
> "object": {
> "name": "10008536718.",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "obj_size_info_mismatch"
> ],
> "shards": [
> {
> "osd": 7,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> },
> {
> "osd": 16,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
>   

Re: [ceph-users] obj_size_info_mismatch error handling

2019-04-30 Thread Brad Hubbard
On Wed, May 1, 2019 at 10:54 AM Brad Hubbard  wrote:
>
> Which size is correct?

Sorry, accidental discharge =D

If the object info size is *incorrect*, try forcing a write to the OI (object
info) with something like the following.

1. rados -p [name_of_pool_17] setomapval 10008536718. temporary-key anything
2. ceph pg deep-scrub 17.2b9
3. Wait for the scrub to finish
4. rados -p [name_of_pool_17] rmomapkey 10008536718. temporary-key

If the object info size is *correct*, you could try just doing a rados get
followed by a rados put of the object to see if the size is updated correctly.

It's more likely the object info size is wrong IMHO.
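
For convenience, the same steps as a rough shell sketch (the pool name is a
placeholder and the object/PG names are the ones from your scrub report, so
adjust as needed):

    #!/bin/sh
    POOL=name_of_pool_17     # substitute the real name of pool 17
    OBJ=10008536718.         # object name from the scrub report
    PG=17.2b9

    # Option A: object info size is wrong -- force an OI rewrite, deep-scrub,
    # then remove the temporary key again.
    rados -p "$POOL" setomapval "$OBJ" temporary-key anything
    ceph pg deep-scrub "$PG"
    # ...wait for the deep-scrub to finish...
    rados -p "$POOL" rmomapkey "$OBJ" temporary-key

    # Option B: object info size is correct -- rewrite the object in place.
    rados -p "$POOL" get "$OBJ" /tmp/obj.bin
    rados -p "$POOL" put "$OBJ" /tmp/obj.bin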

>
> On Tue, Apr 30, 2019 at 1:06 AM Reed Dier  wrote:
> >
> > Hi list,
> >
> > Woke up this morning to two PG's reporting scrub errors, in a way that I 
> > haven't seen before.
> >
> > $ ceph versions
> > {
> > "mon": {
> > "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
> > mimic (stable)": 3
> > },
> > "mgr": {
> > "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
> > mimic (stable)": 3
> > },
> > "osd": {
> > "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) 
> > mimic (stable)": 156
> > },
> > "mds": {
> > "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
> > mimic (stable)": 2
> > },
> > "overall": {
> > "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) 
> > mimic (stable)": 156,
> > "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
> > mimic (stable)": 8
> > }
> > }
> >
> >
> > OSD_SCRUB_ERRORS 8 scrub errors
> > PG_DAMAGED Possible data damage: 2 pgs inconsistent
> > pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> > pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
> >
> >
> > Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty 
> > yields:
> >
> > {
> > "epoch": 134582,
> > "inconsistents": [
> > {
> > "object": {
> > "name": "10008536718.",
> > "nspace": "",
> > "locator": "",
> > "snap": "head",
> > "version": 0
> > },
> > "errors": [],
> > "union_shard_errors": [
> > "obj_size_info_mismatch"
> > ],
> > "shards": [
> > {
> > "osd": 7,
> > "primary": false,
> > "errors": [
> > "obj_size_info_mismatch"
> > ],
> > "size": 5883,
> > "object_info": {
> > "oid": {
> > "oid": "10008536718.",
> > "key": "",
> > "snapid": -2,
> > "hash": 1752643257,
> > "max": 0,
> > "pool": 17,
> > "namespace": ""
> > },
> > "version": "134599'448331",
> > "prior_version": "134599'448330",
> > "last_reqid": "client.1580931080.0:671854",
> > "user_version": 448331,
> > "size": 3505,
> > "mtime": "2019-04-28 15:32:20.003519",
> > "local_mtime": "2019-04-28 15:32:25.991015",
> > "lost": 0,
> > "flags": [
> > "dirty",
> > "data_digest",
> > "omap_digest"
> > ],
> > "truncate_seq": 899,
> > "truncate_size": 0,
> > "data_digest": "0xf99a3bd3",
> > "omap_digest": "0x",
> > "expected_object_size": 0,
> > "expected_write_size": 0,
> > "alloc_hint_flags": 0,
> > "manifest": {
> > "type": 0
> > },
> > "watchers": {}
> > }
> > },
> > {
> > "osd": 16,
> > "primary": false,
> > "errors": [
> > "obj_size_info_mismatch"
> > ],
> > "size": 5883,
> > "object_info": {
> > "oid": {
> > "oid": "10008536718.",
> > "key": "",
> > "snapid": -2,
> > "hash": 1752643257,
> > "max": 0,
> > "pool": 17,
> >   

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:
> On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
> [..]
> > Any suggestions ?
>
> -- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it is expected to have a smaller memory footprint
over time:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly",
i.e. by changing the config and restarting the OSDs.
Igor (Fedotov), can you please elaborate on this matter?
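
If it is, the switch itself should presumably be nothing more than a config
change plus OSD restarts, along these lines (a sketch only, assuming the
backport uses the same option names as master):

    # ceph.conf on the OSD nodes
    [osd]
    bluestore_allocator = bitmap
    bluefs_allocator = bitmap

    # then restart the OSDs one at a time, e.g.:
    # systemctl restart ceph-osd@<id>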

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inodes on /cephfs

2019-04-30 Thread Yury Shevchuk
cephfs is not alone in this; there are other inode-less filesystems
around.  They all report zeroes:

# df -i /nfs-dir
Filesystem  Inodes IUsed IFree IUse% Mounted on
xxx.xxx.xx.x:/xxx/xxx/x  0 0 0 - /xxx

# df -i /reiserfs-dir
FilesystemInodes   IUsed   IFree IUse% Mounted on
/xxx//x0   0   0-  /xxx/xxx//x

# df -i /btrfs-dir
Filesystem   Inodes IUsed IFree IUse% Mounted on
/xxx/xx/  0 0 0 - /

Would YUM refuse to install on all of them, including mainstream btrfs?
I doubt that.  Perhaps YUM is confused by the total Inodes count, which
cephfs (alone!) reports as non-zero.  Worth a look at the YUM sources?


-- Yury

On Wed, May 01, 2019 at 01:23:57AM +0200, Oliver Freyermuth wrote:
> On 01.05.19 at 00:51, Patrick Donnelly wrote:
> > On Tue, Apr 30, 2019 at 8:01 AM Oliver Freyermuth
> >  wrote:
> >>
> >> Dear Cephalopodians,
> >>
> >> we have a classic libvirtd / KVM based virtualization cluster using 
> >> Ceph-RBD (librbd) as backend and sharing the libvirtd configuration 
> >> between the nodes via CephFS
> >> (all on Mimic).
> >>
> >> To share the libvirtd configuration between the nodes, we have symlinked 
> >> some folders from /etc/libvirt to their counterparts on /cephfs,
> >> so all nodes see the same configuration.
> >> In general, this works very well (of course, there's a "gotcha": Libvirtd 
> >> needs reloading / restart for some changes to the XMLs, we have automated 
> >> that),
> >> but there is one issue caused by Yum's cleverness (that's on CentOS 7). 
> >> Whenever there's a libvirtd update, unattended upgrades fail, and we see:
> >>
> >>Transaction check error:
> >>  installing package 
> >> libvirt-daemon-driver-network-4.5.0-10.el7_6.7.x86_64 needs 2 inodes on 
> >> the /cephfs filesystem
> >>  installing package 
> >> libvirt-daemon-config-nwfilter-4.5.0-10.el7_6.7.x86_64 needs 18 inodes on 
> >> the /cephfs filesystem
> >>
> >> So it seems yum follows the symlinks and checks the available inodes on 
> >> /cephfs. Sadly, that reveals:
> >>[root@kvm001 libvirt]# LANG=C df -i /cephfs/
> >>Filesystem Inodes IUsed IFree IUse% Mounted on
> >>ceph-fuse      68    68     0  100% /cephfs
> >>
> >> I think that's just because there is no real "limit" on the maximum inodes 
> >> on CephFS. However, returning 0 breaks some existing tools (notably, Yum).
> >>
> >> What do you think? Should CephFS return something different than 0 here to 
> >> not break existing tools?
> >> Or should the tools behave differently? But one might also argue that if 
> >> the total number of Inodes matches the used number of Inodes, the FS is 
> >> indeed "full".
> >> It's just unclear to me who to file a bug against ;-).
> >>
> >> Right now, I am just using:
> >> yum -y --setopt=diskspacecheck=0 update
> >> as a manual workaround, but this is naturally rather cumbersome.
> > 
> > This is fallout from [1]. See discussion on setting f_free to 0 here
> > [2]. In summary, userland tools are trying to be too clever by looking
> > at f_free. [I could be convinced to go back to f_free = ULONG_MAX if
> > there are other instances of this.]
> > 
> > [1] https://github.com/ceph/ceph/pull/23323
> > [2] https://github.com/ceph/ceph/pull/23323#issuecomment-409249911
> 
> Thanks for the references! That certainly enlightens me on why this decision 
> was taken, and of course I congratulate upon trying to prevent false 
> monitoring. 
> Still, even though I don't have other instances at hand (yet), I am not yet 
> convinced "0" is a better choice than "ULONG_MAX". 
> It certainly alerts users / monitoring software about doing something wrong, 
> but it prevents a check which any file system (or rather, any file system I 
> encountered so far) allows. 
> 
> Yum (or other package managers doing things in a safe manner) need to ensure 
> they can fully install a package in an "atomic" way before doing so,
> since rolling back may be complex or even impossible (for most file systems). 
> So they need a way to check if a file system can store the additional files 
> in terms of space and inodes, before placing the data there,
> or risk installing something only partially, and potentially being unable to 
> roll back. 
> 
> In most cases, the free number of inodes allows for that check. Of course, 
> that has no (direct) meaning for CephFS, so one might argue the tools should 
> add an exception for CephFS - 
> but as the discussion correctly stated, there's no defined way to find out 
> where the file system has a notion of "free inodes", and - if we go for an 
> exceptional treatment for a list of file systems - 
> not even a "clean" way to find out if the file system is CephFS (the tools 
> will only see it is FUSE for ceph-fuse) [1]. 
> 
> So my question is: 
> How are tools which need to ensure that a file system can accept a given 
> number of bytes and inodes before ac