[ceph-users] CephFS: effects of using hard links

2019-03-19 Thread Erwin Bogaard
Hi,

 

For a number of applications we use, there is a lot of file duplication. This
wastes precious storage space, which I would like to avoid.

When using a local disk, I can use a hard link to let all duplicate files
point to the same inode (use "rdfind", for example).
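For illustration, on a local filesystem I run something like the following
(the path is just a placeholder):

   # rdfind -makehardlinks true /path/to/application/data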

 

As there isn't any deduplication in Ceph(FS) I'm wondering if I can use hard
links on CephFS in the same way as I use for 'regular' file systems like
ext4 and xfs.

1. Is it advisable to use hard links on CephFS? (It isn't in the 'best
practices': http://docs.ceph.com/docs/master/cephfs/app-best-practices/)

2. Is there any performance (dis)advantage?

3. When using hard links, are there actual space savings, or is there some
trickery happening?

4. Are there any issues (other than the regular hard link 'gotcha's') I need
to keep in mind combining hard links with CephFS?

 

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] leak memory when mount cephfs

2019-03-19 Thread Zhenshi Zhou
Hi,

I mount cephfs on my client servers. Some of the servers mount without any
error, whereas others fail.

The error:
# ceph-fuse -n client.kvm -m ceph.somedomain.com:6789 /mnt/kvm -r /kvm -d
2019-03-19 17:03:29.136 7f8c80eddc80 -1 deliberately leaking some memory
2019-03-19 17:03:29.137 7f8c80eddc80  0 ceph version 13.2.4
(b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable), process
ceph-fuse, pid 2951226
ceph-fuse: symbol lookup error: ceph-fuse: undefined symbol:
_Z12pipe_cloexecPi

I'm not sure why some servers cannot mount cephfs. Is it because the servers
don't have enough memory?

Both client and server use version 13.2.4.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
Hi all,

We've just hit our first OSD replacement on a host created with
`ceph-volume lvm batch` with mixed hdds+ssds.

The hdd /dev/sdq was prepared like this:
   # ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes

Then /dev/sdq failed and was then zapped like this:
  # ceph-volume lvm zap /dev/sdq --destroy

The zap removed the pv/vg/lv from sdq, but left behind the db on
/dev/sdac (see P.S.)

Now we've replaced /dev/sdq and we're wondering how to proceed. We see
two options:
  1. reuse the existing db lv from osd.240 (Though the osd fsid will
change when we re-create, right?)
  2. remove the db lv from sdac then run
# ceph-volume lvm batch /dev/sdq /dev/sdac
 which should do the correct thing.

This is all v12.2.11 btw.
If (2) is the preferred approach, then it looks like a bug that the
db lv was not destroyed by lvm zap --destroy.

Once we sort this out, we'd be happy to contribute to the ceph-volume
lvm batch doc.

Thanks!

Dan

P.S:

= osd.240 ==

  [  db]
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd

  type                  db
  osd id                240
  cluster fsid          b4f463a0-c671-43a8-bd36-e40ab8d233d2
  cluster name          ceph
  osd fsid              d4d1fb15-a30a-4325-8628-706772ee4294
  db device             /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
  encrypted             0
  db uuid               iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
  cephx lockbox secret
  block uuid            u4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
  block device          /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
  vdo                   0
  crush device class    None
  devices               /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Alfredo Deza
On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  wrote:
>
> Hi all,
>
> We've just hit our first OSD replacement on a host created with
> `ceph-volume lvm batch` with mixed hdds+ssds.
>
> The hdd /dev/sdq was prepared like this:
># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
>
> Then /dev/sdq failed and was then zapped like this:
>   # ceph-volume lvm zap /dev/sdq --destroy
>
> The zap removed the pv/vg/lv from sdq, but left behind the db on
> /dev/sdac (see P.S.)

That is correct behavior for the zap command used.

>
> Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> two options:
>   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> change when we re-create, right?)

This is possible, but you are right that in the current state, the FSID
and other cluster data exist in the LV metadata. To reuse this LV for a
new (replaced) OSD, you would need to zap the LV *without* the --destroy
flag, which would clear all metadata on the LV and do a wipefs. The
command would need the full path to the LV associated with osd.240,
something like:

ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240

>   2. remove the db lv from sdac then run
> # ceph-volume lvm batch /dev/sdq /dev/sdac
>  which should do the correct thing.

This would also work if the db lv is fully removed with --destroy

>
> This is all v12.2.11 btw.
> If (2) is the prefered approached, then it looks like a bug that the
> db lv was not destroyed by lvm zap --destroy.

Since /dev/sdq was passed in to zap, just that one device was removed,
so this is working as expected.

Alternatively, zap has the ability to destroy or zap LVs associated
with an OSD ID. I think this is not released yet for Luminous but
should be in the next release (which seems to be what you want)

>
> Once we sort this out, we'd be happy to contribute to the ceph-volume
> lvm batch doc.
>
> Thanks!
>
> Dan
>
> P.S:
>
> = osd.240 ==
>
>   [  db]
> /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
>
>   type  db
>   osd id240
>   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
>   cluster name  ceph
>   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
>   db device
> /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
>   encrypted 0
>   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
>   cephx lockbox secret
>   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
>   block device
> /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
>   vdo   0
>   crush device classNone
>   devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Alfredo Deza
On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  wrote:
> >
> > Hi all,
> >
> > We've just hit our first OSD replacement on a host created with
> > `ceph-volume lvm batch` with mixed hdds+ssds.
> >
> > The hdd /dev/sdq was prepared like this:
> ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> >
> > Then /dev/sdq failed and was then zapped like this:
> >   # ceph-volume lvm zap /dev/sdq --destroy
> >
> > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > /dev/sdac (see P.S.)
>
> That is correct behavior for the zap command used.
>
> >
> > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > two options:
> >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > change when we re-create, right?)
>
> This is possible but you are right that in the current state, the FSID
> and other cluster data exist in the LV metadata. To reuse this LV for
> a new (replaced) OSD
> then you would need to zap the LV *without* the --destroy flag, which
> would clear all metadata on the LV and do a wipefs. The command would
> need the full path to
> the LV associated with osd.240, something like:
>
> ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
>
> >   2. remove the db lv from sdac then run
> > # ceph-volume lvm batch /dev/sdq /dev/sdac
> >  which should do the correct thing.
>
> This would also work if the db lv is fully removed with --destroy
>
> >
> > This is all v12.2.11 btw.
> > If (2) is the prefered approached, then it looks like a bug that the
> > db lv was not destroyed by lvm zap --destroy.
>
> Since /dev/sdq was passed in to zap, just that one device was removed,
> so this is working as expected.
>
> Alternatively, zap has the ability to destroy or zap LVs associated
> with an OSD ID. I think this is not released yet for Luminous but
> should be in the next release (which seems to be what you want)

Seems like 12.2.11 was released with the ability to zap by OSD ID. You
can also zap by OSD FSID; both ways will zap (and optionally destroy, if
using --destroy) all LVs associated with the OSD.

Full examples on this can be found here:

http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices


>
> >
> > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > lvm batch doc.
> >
> > Thanks!
> >
> > Dan
> >
> > P.S:
> >
> > = osd.240 ==
> >
> >   [  db]
> > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> >
> >   type  db
> >   osd id240
> >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> >   cluster name  ceph
> >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> >   db device
> > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> >   encrypted 0
> >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> >   cephx lockbox secret
> >   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
> >   block device
> > /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
> >   vdo   0
> >   crush device classNone
> >   devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > wrote:
> > >
> > > Hi all,
> > >
> > > We've just hit our first OSD replacement on a host created with
> > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > >
> > > The hdd /dev/sdq was prepared like this:
> > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > >
> > > Then /dev/sdq failed and was then zapped like this:
> > >   # ceph-volume lvm zap /dev/sdq --destroy
> > >
> > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > /dev/sdac (see P.S.)
> >
> > That is correct behavior for the zap command used.
> >
> > >
> > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > two options:
> > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > change when we re-create, right?)
> >
> > This is possible but you are right that in the current state, the FSID
> > and other cluster data exist in the LV metadata. To reuse this LV for
> > a new (replaced) OSD
> > then you would need to zap the LV *without* the --destroy flag, which
> > would clear all metadata on the LV and do a wipefs. The command would
> > need the full path to
> > the LV associated with osd.240, something like:
> >
> > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> >
> > >   2. remove the db lv from sdac then run
> > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > >  which should do the correct thing.
> >
> > This would also work if the db lv is fully removed with --destroy
> >
> > >
> > > This is all v12.2.11 btw.
> > > If (2) is the prefered approached, then it looks like a bug that the
> > > db lv was not destroyed by lvm zap --destroy.
> >
> > Since /dev/sdq was passed in to zap, just that one device was removed,
> > so this is working as expected.
> >
> > Alternatively, zap has the ability to destroy or zap LVs associated
> > with an OSD ID. I think this is not released yet for Luminous but
> > should be in the next release (which seems to be what you want)
>
> Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> can also zap by OSD FSID, both way will zap (and optionally destroy if
> using --destroy)
> all LVs associated with the OSD.
>
> Full examples on this can be found here:
>
> http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
>
>

Ohh that's an improvement! (Our goal is outsourcing the failure
handling to non-ceph experts, so this will help simplify things.)

In our example, the operator needs to know the osd id, then can do:

1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
the lvm from sdac for osd.240)
2. replace the hdd
3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240

But I just remembered that the --osd-ids flag hasn't been backported
to luminous, so we can't yet do that. I guess we'll follow the first
(1) procedure to re-use the existing db lv.
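
For reference, a minimal sketch of that first procedure (assuming ceph-volume
also accepts the full LV path for --block.db, vg/lv works as well; paths taken
from the P.S. below):

  # ceph-volume lvm zap /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
  # ceph-volume lvm create --data /dev/sdq --block.db /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd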

-- dan

> >
> > >
> > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > lvm batch doc.
> > >
> > > Thanks!
> > >
> > > Dan
> > >
> > > P.S:
> > >
> > > = osd.240 ==
> > >
> > >   [  db]
> > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > >
> > >   type  db
> > >   osd id240
> > >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> > >   cluster name  ceph
> > >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> > >   db device
> > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > >   encrypted 0
> > >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> > >   cephx lockbox secret
> > >   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
> > >   block device
> > > /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
> > >   vdo   0
> > >   crush device classNone
> > >   devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Alfredo Deza
On Tue, Mar 19, 2019 at 7:26 AM Dan van der Ster  wrote:
>
> On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We've just hit our first OSD replacement on a host created with
> > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > >
> > > > The hdd /dev/sdq was prepared like this:
> > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > >
> > > > Then /dev/sdq failed and was then zapped like this:
> > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > >
> > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > /dev/sdac (see P.S.)
> > >
> > > That is correct behavior for the zap command used.
> > >
> > > >
> > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > two options:
> > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > change when we re-create, right?)
> > >
> > > This is possible but you are right that in the current state, the FSID
> > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > a new (replaced) OSD
> > > then you would need to zap the LV *without* the --destroy flag, which
> > > would clear all metadata on the LV and do a wipefs. The command would
> > > need the full path to
> > > the LV associated with osd.240, something like:
> > >
> > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > >
> > > >   2. remove the db lv from sdac then run
> > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > >  which should do the correct thing.
> > >
> > > This would also work if the db lv is fully removed with --destroy
> > >
> > > >
> > > > This is all v12.2.11 btw.
> > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > db lv was not destroyed by lvm zap --destroy.
> > >
> > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > so this is working as expected.
> > >
> > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > with an OSD ID. I think this is not released yet for Luminous but
> > > should be in the next release (which seems to be what you want)
> >
> > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > can also zap by OSD FSID, both way will zap (and optionally destroy if
> > using --destroy)
> > all LVs associated with the OSD.
> >
> > Full examples on this can be found here:
> >
> > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> >
> >
>
> Ohh that's an improvement! (Our goal is outsourcing the failure
> handling to non-ceph experts, so this will help simplify things.)
>
> In our example, the operator needs to know the osd id, then can do:
>
> 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> the lvm from sdac for osd.240)
> 2. replace the hdd
> 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
>
> But I just remembered that the --osd-ids flag hasn't been backported
> to luminous, so we can't yet do that. I guess we'll follow the first
> (1) procedure to re-use the existing db lv.

It has! (I initially thought it hadn't.) Check whether `ceph-volume lvm zap
--help` has the flags available; I think they should appear in 12.2.11.
>
> -- dan
>
> > >
> > > >
> > > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > > lvm batch doc.
> > > >
> > > > Thanks!
> > > >
> > > > Dan
> > > >
> > > > P.S:
> > > >
> > > > = osd.240 ==
> > > >
> > > >   [  db]
> > > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > >
> > > >   type  db
> > > >   osd id240
> > > >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> > > >   cluster name  ceph
> > > >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> > > >   db device
> > > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > >   encrypted 0
> > > >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> > > >   cephx lockbox secret
> > > >   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
> > > >   block device
> > > > /dev/ceph-f78ff8a3-803d-4b6d-823b-260b301109ac/osd-data-9e4bf34d-1aa3-4c0a-9655-5dba52dcfcd7
> > > >   vdo   0
> > > >   crush device classNone
> > > >   devices   /dev/sdac
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 1:05 PM Alfredo Deza  wrote:
>
> On Tue, Mar 19, 2019 at 7:26 AM Dan van der Ster  wrote:
> >
> > On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > > >
> > > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We've just hit our first OSD replacement on a host created with
> > > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > > >
> > > > > The hdd /dev/sdq was prepared like this:
> > > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > > >
> > > > > Then /dev/sdq failed and was then zapped like this:
> > > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > > >
> > > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > > /dev/sdac (see P.S.)
> > > >
> > > > That is correct behavior for the zap command used.
> > > >
> > > > >
> > > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > > two options:
> > > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > > change when we re-create, right?)
> > > >
> > > > This is possible but you are right that in the current state, the FSID
> > > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > > a new (replaced) OSD
> > > > then you would need to zap the LV *without* the --destroy flag, which
> > > > would clear all metadata on the LV and do a wipefs. The command would
> > > > need the full path to
> > > > the LV associated with osd.240, something like:
> > > >
> > > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > > >
> > > > >   2. remove the db lv from sdac then run
> > > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > > >  which should do the correct thing.
> > > >
> > > > This would also work if the db lv is fully removed with --destroy
> > > >
> > > > >
> > > > > This is all v12.2.11 btw.
> > > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > > db lv was not destroyed by lvm zap --destroy.
> > > >
> > > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > > so this is working as expected.
> > > >
> > > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > > with an OSD ID. I think this is not released yet for Luminous but
> > > > should be in the next release (which seems to be what you want)
> > >
> > > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > > can also zap by OSD FSID, both way will zap (and optionally destroy if
> > > using --destroy)
> > > all LVs associated with the OSD.
> > >
> > > Full examples on this can be found here:
> > >
> > > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> > >
> > >
> >
> > Ohh that's an improvement! (Our goal is outsourcing the failure
> > handling to non-ceph experts, so this will help simplify things.)
> >
> > In our example, the operator needs to know the osd id, then can do:
> >
> > 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> > the lvm from sdac for osd.240)
> > 2. replace the hdd
> > 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
> >
> > But I just remembered that the --osd-ids flag hasn't been backported
> > to luminous, so we can't yet do that. I guess we'll follow the first
> > (1) procedure to re-use the existing db lv.
>
> It has! (I initially thought it wasn't). Check if `ceph-volume lvm zap
> --help` has the flags available, I think they should appear for
> 12.2.11

Is it there? Indeed I see zap --osd-id, but for the recreation I'm
referring to batch --osd-ids, which afaict is only in nautilus:

https://github.com/ceph/ceph/blob/nautilus/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L248

-- dan


> >
> > -- dan
> >
> > > >
> > > > >
> > > > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > > > lvm batch doc.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Dan
> > > > >
> > > > > P.S:
> > > > >
> > > > > = osd.240 ==
> > > > >
> > > > >   [  db]
> > > > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > > >
> > > > >   type  db
> > > > >   osd id240
> > > > >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> > > > >   cluster name  ceph
> > > > >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> > > > >   db device
> > > > > /dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > > >   encrypted 0
> > > > >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> > > > >   cephx lockbox secret
> > > > >   block uuidu4326A-Q8bH-afPb-y7Y6-ftNf-TE1X-vjunBd
> > > > >   block device
> > > > > /dev/ceph-f78ff8a3-803d-4b6d-823b

[ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Abhishek Lekshmanan

We're glad to announce the first release of Nautilus v14.2.0 stable
series. There have been a lot of changes across components from the
previous Ceph releases, and we advise everyone to go through the release
and upgrade notes carefully.

The release also saw commits from over 300 contributors and we'd like to
thank everyone from the community for making this release happen. The
next major release of Ceph will be called Octopus. Consider joining
us in the second edition of Cephalocon, at Barcelona this year
https://ceph.com/cephalocon/barcelona-2019/

For a detailed changelog please refer to the official blog entry at
http://ceph.com/releases/v14-2-0-nautilus-released/

Major Changes from Mimic


- *Dashboard*:

  The Ceph Dashboard has gained a lot of new functionality:

  * Support for multiple users / roles
  * SSO (SAMLv2) for user authentication
  * Auditing support
  * New landing page, showing more metrics and health info
  * I18N support
  * REST API documentation with Swagger API

  New Ceph management features include:

  * OSD management (mark as down/out, change OSD settings, recovery profiles)
  * Cluster config settings editor
  * Ceph Pool management (create/modify/delete)
  * ECP management
  * RBD mirroring configuration
  * Embedded Grafana Dashboards (derived from Ceph Metrics)
  * CRUSH map viewer
  * NFS Ganesha management
  * iSCSI target management (via ceph-iscsi)
  * RBD QoS configuration
  * Ceph Manager (ceph-mgr) module management
  * Prometheus alert Management

  Also, the Ceph Dashboard is now split into its own package named
  ``ceph-mgr-dashboard``. You might want to install it separately,
  if your package management software fails to do so when it installs
  ``ceph-mgr``.

- *RADOS*:

  * The number of placement groups (PGs) per pool can now be decreased
at any time, and the cluster can automatically tune the PG count
based on cluster utilization or administrator hints (a brief example
follows this list).
  * The new v2 wire protocol brings support for encryption on the wire.
  * Physical storage devices consumed by OSD and Monitor daemons are
now tracked by the cluster along with health metrics (i.e.,
SMART), and the cluster can apply a pre-trained prediction model
or a cloud-based prediction service to warn about expected
HDD or SSD failures.
  * The NUMA node for OSD daemons can easily be monitored via the
``ceph osd numa-status`` command, and configured via the
``osd_numa_node`` config option.
  * When BlueStore OSDs are used, space utilization is now broken down
by object data, omap data, and internal metadata, by pool, and by
pre- and post- compression sizes.
  * OSDs more effectively prioritize the most important PGs and
objects when performing recovery and backfill.
  * Progress for long-running background processes--like recovery
after a device failure--is now reported as part of ``ceph
status``.
  * An experimental Coupled-Layer "Clay" erasure code plugin has been
added that reduces network bandwidth and IO needed for most recovery
operations.
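
  For example, the automatic PG tuning mentioned above is driven by the new
  pg_autoscaler manager module (pool name below is a placeholder):

    ceph mgr module enable pg_autoscaler
    ceph osd pool set <pool-name> pg_autoscale_mode on
    ceph osd pool autoscale-status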

- *RGW*:

  * S3 lifecycle transition for tiering between storage classes.
  * A new web frontend (Beast) has replaced civetweb as the default,
improving overall performance.
  * A new publish/subscribe infrastructure allows RGW to feed events
to serverless frameworks like knative or data pipelies like Kafka.
  * A range of authentication features, including STS federation using
OAuth2 and OpenID::connect and an OPA (Open Policy Agent)
authentication delegation prototype.
  * The new archive zone federation feature enables full preservation
of all objects (including history) in a separate zone.

- *CephFS*:

  * MDS stability has been greatly improved for large caches and
long-running clients with a lot of RAM. Cache trimming and client
capability recall is now throttled to prevent overloading the MDS.
  * CephFS may now be exported via NFS-Ganesha clusters in environments managed
by Rook. Ceph manages the clusters and ensures high-availability and
scalability. An introductory demo is available. More automation of this
feature is expected to be forthcoming in future minor releases of Nautilus.
  * The MDS ``mds_standby_for_*``, ``mon_force_standby_active``, and
``mds_standby_replay`` configuration options have been obsoleted. Instead,
the operator may now set the new ``allow_standby_replay`` flag on the CephFS
file system. This setting causes standbys to become standby-replay for any
available rank in the file system.
  * MDS now supports dropping its cache which concurrently asks clients
to trim their caches. This is done using MDS admin socket ``cache drop``
command.
  * It is now possible to check the progress of an on-go

[ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
Hi All.

I'm trying to make heads or tails of how far we can stretch our Ceph cluster
and for which applications. Parallelism works excellently, but baseline
throughput is - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD-daemons have 16GB of cache.

Fio files attached - 4KB random read and 4KB random write - the test file is
"only" 1GB.
In this I ONLY care about raw IOPS numbers.
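
The exact job files are attached; roughly, the random-read job looks something
like this (mount point, file name and runtime here are illustrative):

[global]
ioengine=libaio
direct=1
bs=4k
iodepth=1
numjobs=1
size=1g
runtime=60
time_based
[randr]
rw=randread
filename=/mnt/rbd/fio-testfile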

I have 2 pools, both 3x replicated .. one backed by SSDs (14x 1TB S4510s)
and one by HDDs (84x 10TB).

Network latency from rbd mount to one of the osd-hosts.
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N        Min        Max     Median        Avg     Stddev
x  38   1727.07   2033.66   1954.71 1949.4789 46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N        Min        Max     Median        Avg     Stddev
x  36     400.05     455.26     436.58  433.91417  12.468187

The double (or triple) network penalty of course kicks in and delivers
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With 1GB of test file .. I would really expect this to be memory cached in
the OSD/bluestore cache
and thus deliver a read IOPS closer to theoretical max: 1s/0.108ms => 9.2K
IOPS

Again on the write side - all OSDs are backed by battery-backed write
cache, thus writes should go directly into the memory of the controller ..
still slower than reads - due to having to visit 3 hosts .. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N        Min        Max     Median        Avg     Stddev
x  38 36.91 118.8 69.14 72.926842  21.75198

This should have the same performance characteristics as the SSDs, as the
writes should be hitting the BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
    N        Min        Max     Median        Avg     Stddev
x  39      26.18     181.51      48.16  50.574872   24.01572

Same here - it should be cached in the bluestore cache, as it is 16GB x 84
OSDs .. with a 1GB testfile.

Any thoughts - suggestions - insights ?

Jesper

fio-single-thread-randr.ini
Description: Binary data


fio-single-thread-randw.ini
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Sean Purdy
Hi,


Will debian packages be released?  I don't see them in the nautilus repo.  I 
thought that Nautilus was going to be debian-friendly, unlike Mimic.


Sean

On Tue, 19 Mar 2019 14:58:41 +0100
Abhishek Lekshmanan  wrote:

> 
> We're glad to announce the first release of Nautilus v14.2.0 stable
> series. There have been a lot of changes across components from the
> previous Ceph releases, and we advise everyone to go through the release
> and upgrade notes carefully.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread Piotr Dałek
One thing you can check is the CPU performance (cpu governor in particular). 
On such light loads I've seen CPUs sitting in low performance mode (slower 
clocks), giving MUCH worse performance results than when tried with heavier 
loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core 
frequencies.
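
For example, something as simple as:

# while sleep 1; do cpupower monitor; done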


On 2019-03-19 3:17 p.m., jes...@krogh.cc wrote:

Hi All.

I'm trying to get head and tails into where we can stretch our Ceph cluster
into what applications. Parallism works excellent, but baseline throughput
it - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD-daemons have 16GB of cache.

Fio files attacher - 4KB random read and 4KB random write - test file is
"only" 1GB
In this i ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed with SSDs S4510's
(14x1TB) and one with HDD's 84x10TB.

Network latency from rbd mount to one of the osd-hosts.
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  38   1727.07   2033.66   1954.71 1949.4789 46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  36400.05455.26436.58 433.91417 12.468187

The double (or triple) network penalty of-course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With 1GB of test file .. I would really expect this to be memory cached in
the OSD/bluestore cache
and thus deliver a read IOPS closer to theoretical max: 1s/0.108ms => 9.2K
IOPS

Again on the write side - all OSDs are backed by Battery-Backed write
cache, thus writes should go directly
into memory of the constroller .. .. still slower than reads - due to
having to visit 3 hosts.. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  38 36.91 118.8 69.14 72.926842  21.75198

This should have the same performance characteristics as the SSD's as the
writes should be hitting BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  39 26.18181.51 48.16 50.574872  24.01572

Same here - shold be cached in the blue-store cache as it is 16GB x 84
OSD's  .. with a 1GB testfile.

Any thoughts - suggestions - insights ?

Jesper



--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovhcloud.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking up buckets in multi-site radosgw configuration

2019-03-19 Thread Casey Bodley



On 3/19/19 12:05 AM, David Coles wrote:

I'm looking at setting up a multi-site radosgw configuration where
data is sharded over multiple clusters in a single physical location;
and would like to understand how Ceph handles requests in this
configuration.

Looking through the radosgw source[1], it looks like radosgw will
return a 301 redirect if I request a bucket that is not in the current
zonegroup. This redirect appears to be to the endpoint for the
zonegroup (I assume as configured by `radosgw-admin zonegroup create
--endpoints`). This seems like it would work well for multiple
geographic regions (e.g. us-east and us-west) for ensuring that a
request is redirected to the region (zonegroup) that hosts the bucket.
We could possibly improve this by virtual hosted buckets and having
DNS point to the correct region for that bucket.

I notice that it's also possible to configure zones in a zonegroup
that don't perform replication[2] (e.g. us-east-1 and us-east-2). In
this case I assume that if I direct a request to the wrong zone, then
Ceph will just report that the object as not-found because, despite
the bucket metadata being replicated from the zonegroup master, the
objects will never be replicated from one zone to the other. Another
layer (like a consistent hash across the bucket name or database)
would be required for routing to the correct zone.

Is this mostly correct? Are there other ways of controlling which
cluster data is placed (i.e. placement groups)?


Yeah, correct on both points. The zonegroup redirects would be the only 
way to guide clients between clusters.
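
For reference, those redirect targets come from the endpoints configured on
the zonegroup, e.g. (names and URLs below are placeholders):

# radosgw-admin zonegroup create --rgw-zonegroup=us-east --endpoints=http://rgw-us-east.example.com:8080 --master --default
# radosgw-admin period update --commit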




Thanks!

1. 
https://github.com/ceph/ceph/blob/affb7d396f76273e885cfdbcd363c1882496726c/src/rgw/rgw_op.cc#L653-L669
2. 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/object_gateway_guide_for_red_hat_enterprise_linux/multi_site#configuring_multiple_zones_without_replication

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-19 Thread Jan Fajerski

On Tue, Mar 19, 2019 at 02:17:56PM +0100, Dan van der Ster wrote:

On Tue, Mar 19, 2019 at 1:05 PM Alfredo Deza  wrote:


On Tue, Mar 19, 2019 at 7:26 AM Dan van der Ster  wrote:
>
> On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We've just hit our first OSD replacement on a host created with
> > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > >
> > > > The hdd /dev/sdq was prepared like this:
> > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > >
> > > > Then /dev/sdq failed and was then zapped like this:
> > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > >
> > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > /dev/sdac (see P.S.)
> > >
> > > That is correct behavior for the zap command used.
> > >
> > > >
> > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > two options:
> > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > change when we re-create, right?)
> > >
> > > This is possible but you are right that in the current state, the FSID
> > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > a new (replaced) OSD
> > > then you would need to zap the LV *without* the --destroy flag, which
> > > would clear all metadata on the LV and do a wipefs. The command would
> > > need the full path to
> > > the LV associated with osd.240, something like:
> > >
> > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > >
> > > >   2. remove the db lv from sdac then run
> > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > >  which should do the correct thing.
> > >
> > > This would also work if the db lv is fully removed with --destroy
> > >
> > > >
> > > > This is all v12.2.11 btw.
> > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > db lv was not destroyed by lvm zap --destroy.
> > >
> > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > so this is working as expected.
> > >
> > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > with an OSD ID. I think this is not released yet for Luminous but
> > > should be in the next release (which seems to be what you want)
> >
> > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > can also zap by OSD FSID, both way will zap (and optionally destroy if
> > using --destroy)
> > all LVs associated with the OSD.
> >
> > Full examples on this can be found here:
> >
> > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> >
> >
>
> Ohh that's an improvement! (Our goal is outsourcing the failure
> handling to non-ceph experts, so this will help simplify things.)
>
> In our example, the operator needs to know the osd id, then can do:
>
> 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> the lvm from sdac for osd.240)
> 2. replace the hdd
> 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
>
> But I just remembered that the --osd-ids flag hasn't been backported
> to luminous, so we can't yet do that. I guess we'll follow the first
> (1) procedure to re-use the existing db lv.

It has! (I initially thought it wasn't). Check if `ceph-volume lvm zap
--help` has the flags available, I think they should appear for
12.2.11


Is it there? Indeed I see zap --osd-id, but for the recreation I'm
referring to batch --osd-ids, which afaict is only in nautilus:

https://github.com/ceph/ceph/blob/nautilus/src/ceph-volume/ceph_volume/devices/lvm/batch.py#L248

Right, this PR was not backported yet https://github.com/ceph/ceph/pull/25542
I'll get on that.

We'll probably need to look at how c-v is developed now that nautilus is out.
Maintaining three branches (luminous, mimic, nautilus) and more in the future
with essentially the same code makes no sense and adds plenty of unnecessary
work.


-- dan



>
> -- dan
>
> > >
> > > >
> > > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > > lvm batch doc.
> > > >
> > > > Thanks!
> > > >
> > > > Dan
> > > >
> > > > P.S:
> > > >
> > > > = osd.240 ==
> > > >
> > > >   [  db]
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > >
> > > >   type  db
> > > >   osd id240
> > > >   cluster fsid  b4f463a0-c671-43a8-bd36-e40ab8d233d2
> > > >   cluster name  ceph
> > > >   osd fsid  d4d1fb15-a30a-4325-8628-706772ee4294
> > > >   db device
> > > > 
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
> > > >   encrypted 0
> > > >   db uuid   iWWdyU-UhNu-b58z-ThSp-Bi3B-19iA-06iJIc
> > > >   cephx l

Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Benjamin Cherian
Hi,

I'm getting an error when trying to use the APT repo for Ubuntu bionic.
Does anyone else have this issue? Is the mirror sync actually still in
progress? Or was something setup incorrectly?

E: Failed to fetch
https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
File has unexpected size (15515 != 15488). Mirror sync in progress? [IP:
158.69.68.124 443]
   Hashes of expected file:
- Filesize:15488 [weak]
-
SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
- SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
- MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
   Release file created at: Mon, 11 Mar 2019 18:44:46 +


Thanks,
Ben


On Tue, Mar 19, 2019 at 7:24 AM Sean Purdy  wrote:

> Hi,
>
>
> Will debian packages be released?  I don't see them in the nautilus repo.
> I thought that Nautilus was going to be debian-friendly, unlike Mimic.
>
>
> Sean
>
> On Tue, 19 Mar 2019 14:58:41 +0100
> Abhishek Lekshmanan  wrote:
>
> >
> > We're glad to announce the first release of Nautilus v14.2.0 stable
> > series. There have been a lot of changes across components from the
> > previous Ceph releases, and we advise everyone to go through the release
> > and upgrade notes carefully.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-19 Thread jesper
> One thing you can check is the CPU performance (cpu governor in
> particular).
> On such light loads I've seen CPUs sitting in low performance mode (slower
> clocks), giving MUCH worse performance results than when tried with
> heavier
> loads. Try "cpupower monitor" on OSD nodes in a loop and observe the core
> frequencies.
>

Thanks for the suggestion. They seem to be all powered up .. other
suggestions/reflections are truly welcome .. Thanks.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Sarunas Burdulis
On 3/19/19 2:52 PM, Benjamin Cherian wrote:
> Hi,
> 
> I'm getting an error when trying to use the APT repo for Ubuntu bionic.
> Does anyone else have this issue? Is the mirror sync actually still in
> progress? Or was something setup incorrectly?
> 
> E: Failed to fetch
> https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
>  
> File has unexpected size (15515 != 15488). Mirror sync in progress? [IP:
> 158.69.68.124 443]
>    Hashes of expected file:
>     - Filesize:15488 [weak]
>     -
> SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
>     - SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
>     - MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
>    Release file created at: Mon, 11 Mar 2019 18:44:46 +

I'm getting the same error for `apt update` with

deb https://download.ceph.com/debian-nautilus/ bionic main

-- 
Sarunas Burdulis
Systems Administrator, Dartmouth Mathematics
math.dartmouth.edu/~sarunas



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can CephFS Kernel Client Not Read & Write at the Same Time?

2019-03-19 Thread Andrew Richards
I don't think file locks are to blame. I tried to control for that in my tests; 
I was reading with fio from one set of files (multiple fio pids spawned from a 
single command) while writing with dd to an entirely different file using a 
different shell on the same host. So one CephFS kernel client, all different 
files being acted on by different pids, and still the interruption of reads 
when writes were being synched.
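
Roughly, the overlapping load looked like the following (paths, sizes and job
counts here are illustrative rather than the exact values we used):

    fio --name=readers --directory=/cephfs/readset --numjobs=4 --rw=randread \
        --bs=4k --ioengine=libaio --direct=1 --size=1G --runtime=120 --time_based

    dd if=/dev/zero of=/cephfs/writefile bs=1M count=8192 conv=fdatasync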

Thanks,
Andrew Richards
Senior Systems Engineer
keepertechnology

> On Mar 8, 2019, at 2:14 AM, Yan, Zheng  wrote:
> 
> CephFS kernel mount blocks reads while other client has dirty data in
> its page cache.   Cache coherency rule looks like:
> 
> state 1 - only one client opens a file for read/write.  the client can
> use page cache
> state 2 - multiple clients open a file for read, no client opens the
> file for wirte. clients can use page cache
> state 3 - multiple clients open a file for read/write. client are not
> allowed to use page cache.
> 
> The behavior you saw is likely caused by state transition from 1 to 3
> 
> On Fri, Mar 8, 2019 at 8:15 AM Gregory Farnum  wrote:
>> 
>> In general, no, this is not an expected behavior.
>> 
>> My guess would be that something odd is happening with the other clients you 
>> have to the system, and there's a weird pattern with the way the file locks 
>> are being issued. Can you be more precise about exactly what workload you're 
>> running, and get the output of the session list on your MDS while doing so?
>> -Greg
>> 
>> On Wed, Mar 6, 2019 at 9:49 AM Andrew Richards 
>>  wrote:
>>> 
>>> We discovered recently that our CephFS mount appeared to be halting reads 
>>> when writes were being synched to the Ceph cluster to the point it was 
>>> affecting applications.
>>> 
>>> I also posted this as a Gist with embedded graph images to help illustrate: 
>>> https://gist.github.com/keeperAndy/aa80d41618caa4394e028478f4ad1694
>>> 
>>> The following is the plain text from the Gist.
>>> 
>>> First, details about the host:
>>> 
>>> 
>>>$ uname -r
>>>4.16.13-041613-generic
>>> 
>>>$ egrep 'xfs|ceph' /proc/mounts
>>>192.168.1.115:6789,192.168.1.116:6789,192.168.1.117:6789:/ /cephfs ceph 
>>> rw,noatime,name=cephfs,secret=,rbytes,acl,wsize=16777216 0 0
>>>/dev/mapper/tst01-lvidmt01 /rbd_xfs xfs 
>>> rw,relatime,attr2,inode64,logbsize=256k,sunit=512,swidth=1024,noquota 0 0
>>> 
>>>$ ceph -v
>>>ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
>>> (stable)
>>> 
>>>$ cat /proc/net/bonding/bond1
>>>Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
>>> 
>>>Bonding Mode: adaptive load balancing
>>>Primary Slave: None
>>>Currently Active Slave: net6
>>>MII Status: up
>>>MII Polling Interval (ms): 100
>>>Up Delay (ms): 200
>>>Down Delay (ms): 200
>>> 
>>>Slave Interface: net8
>>>MII Status: up
>>>Speed: 1 Mbps
>>>Duplex: full
>>>Link Failure Count: 2
>>>Permanent HW addr: e4:1d:2d:17:71:e1
>>>Slave queue ID: 0
>>> 
>>>Slave Interface: net6
>>>MII Status: up
>>>Speed: 1 Mbps
>>>Duplex: full
>>>Link Failure Count: 1
>>>Permanent HW addr: e4:1d:2d:17:71:e0
>>>Slave queue ID: 0
>>> 
>>> 
>>> 
>>> We had CephFS mounted alongside an XFS filesystem made up of 16 RBD images 
>>> aggregated under LVM as our storage targets. The link to the Ceph cluster 
>>> from the host is a mode 6 2x10GbE bond (bond1 above).
>>> 
>>> We started capturing network counters from the Ceph cluster connection 
>>> (bond1) on the host using ifstat at its most granular setting of 0.1 
>>> (sampling every tenth of a second). We then ran various overlapping read 
>>> and write operations in separate shells on the same host to obtain samples 
>>> of how our different means of accessing Ceph handled this. We converted our 
>>> ifstat output to CSV and insterted it into a spreadsheet to visualize the 
>>> network activity.
>>> 
>>> We found that the CephFS kernel mount did indeed appear to pause ongoing 
>>> reads when writes were being flushed from the page cache to the Ceph 
>>> cluster.
>>> 
>>> We wanted to see if we could make this more pronounced, so we added a 
>>> 6Gb-limit tc filter to the interface and re-ran our tests. This yielded 
>>> much lengthier delay periods in the reads while the writes were more slowly 
>>> flushed from the page cache to the Ceph cluster.
>>> 
>>> A more restrictive 2Gbit-limit tc filter produced much lengthier delays of 
>>> our reads as the writes were synched to the cluster.
>>> 
>>> When we tested the same I/O on the RBD-backed XFS file system on the same 
>>> host, we found a very different pattern. The reads seemed to be given 
>>> priority over the write activity, but the writes were only slowed, they 
>>> were not halted.
>>> 
>>> Finally we tested overlapping SMB client reads and writes to a Samba share 
>>> that used the userspace libceph-based VFS_Ceph module to produce the share. 
>>> In 

Re: [ceph-users] leak memory when mount cephfs

2019-03-19 Thread Brad Hubbard
On Tue, Mar 19, 2019 at 7:54 PM Zhenshi Zhou  wrote:
>
> Hi,
>
> I mount cephfs on my client servers. Some of the servers mount without any
> error whereas others don't.
>
> The error:
> # ceph-fuse -n client.kvm -m ceph.somedomain.com:6789 /mnt/kvm -r /kvm -d
> 2019-03-19 17:03:29.136 7f8c80eddc80 -1 deliberately leaking some memory
> 2019-03-19 17:03:29.137 7f8c80eddc80  0 ceph version 13.2.4 
> (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable), process ceph-fuse, 
> pid 2951226
> ceph-fuse: symbol lookup error: ceph-fuse: undefined symbol: 
> _Z12pipe_cloexecPi

$ c++filt  _Z12pipe_cloexecPi
pipe_cloexec(int*)

$ sudo find /lib* /usr/lib* -iname '*.so*' | xargs nm -AD 2>&1 | grep
_Z12pipe_cloexecPi
/usr/lib64/ceph/libceph-common.so:0063bb00 T _Z12pipe_cloexecPi
/usr/lib64/ceph/libceph-common.so.0:0063bb00 T _Z12pipe_cloexecPi

This appears to be an incompatibility between ceph-fuse and the
version of libceph-common it is finding. The version of ceph-fuse you
are using expects  libceph-common to define the function
"pipe_cloexec(int*)" but it does not. I'd say the verion of
libceph-common.so you have installed is too old. Compare it to the
version on a system that works.
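
For example (assuming an RPM-based system):

$ rpm -qf /usr/lib64/ceph/libceph-common.so.0
$ rpm -q ceph-fuse
$ nm -D /usr/lib64/ceph/libceph-common.so.0 | grep _Z12pipe_cloexecPi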

>
> I'm not sure why some servers cannot mount cephfs. Are the servers don't have
> enough memory?
>
> Both client and server use version 13.2.4.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Konstantin Shalygin

On 3/19/19 2:52 PM, Benjamin Cherian wrote:
> Hi,
>
> I'm getting an error when trying to use the APT repo for Ubuntu bionic.
> Does anyone else have this issue? Is the mirror sync actually still in
> progress? Or was something setup incorrectly?
>
> E: Failed to fetch
> https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
> File has unexpected size (15515 != 15488). Mirror sync in progress? [IP:
> 158.69.68.124 443]
>    Hashes of expected file:
>     - Filesize:15488 [weak]
>     - SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
>     - SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
>     - MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
>    Release file created at: Mon, 11 Mar 2019 18:44:46 +

I'm getting the same error for `apt update` with

deb https://download.ceph.com/debian-nautilus/ bionic main


I think you are also affected by this [1] issue.


[1] http://tracker.ceph.com/issues/38763

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD Recovery Settings

2019-03-19 Thread Brent Kennedy
I set up an SSD Luminous 12.2.11 cluster and realized after data had been
added that pg_num was not set properly on the default.rgw.buckets.data pool
( where all the data goes ).  I adjusted the settings up, but recovery is
going really slow ( like 56-110MiB/s ) ticking down at .002 per log
entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
objects ) added to the cluster before I noticed the red blinking lights.

 

I tried adjusting the recovery to:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'

 

Which did help a little, but didn't seem to have the impact I was looking
for.  I have used these settings on HDD clusters before to speed things up (
using 8 backfills and 4 max active though ).  Did I miss something, or is
this part of the pg expansion process?  Should I be doing something else
with SSD clusters?

 

Regards,

-Brent

 

Existing Clusters:

Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
virtual on SSD )

US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
behind haproxy LB

UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
gateways behind haproxy LB

US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
gateways behind haproxy LB

 

 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-19 Thread Gregory Farnum
On Tue, Mar 19, 2019 at 2:13 PM Erwin Bogaard 
wrote:

> Hi,
>
>
>
> For a number of application we use, there is a lot of file duplication.
> This wastes precious storage space, which I would like to avoid.
>
> When using a local disk, I can use a hard link to let all duplicate files
> point to the same inode (use “rdfind”, for example).
>
>
>
> As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use
> hard links on CephFS in the same way as I use for ‘regular’ file systems
> like ext4 and xfs.
>
> 1. Is it advisible to use hard links on CephFS? (It isn’t in the ‘best
> practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
>

This should be okay now. Hard links have changed a few times so Zheng can
correct me if I've gotten something wrong, but the differences between
regular files from a user/performance perspective are:
* if you take snapshots and have hard links, hard-linked files are special
and will be a member of *every* snapshot in the system (which only matters
if you actually write to them during all those snapshots)
* opening a hard-linked file may behave as if you were doing two file opens
instead of one, from a performance perspective. But this might have
changed? (In the past, you would need to look up the file name you open,
and then do another lookup on the authoritative location of the file.)


> 2. Is there any performance (dis)advantage?
>

Generally not once the file is open.

3. When using hard links, is there an actual space savings, or is there
> some trickery happening?
>

If you create a hard link, there is a single copy of the file data in RADOS
that all the file names refer to. I think that's what you're asking?
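
As a quick sanity check (generic to any POSIX filesystem, not CephFS-specific),
both names should report the same inode number and a link count of 2:

    $ ln /cephfs/dir/file-a /cephfs/dir/file-b
    $ stat -c '%i %h %n' /cephfs/dir/file-a /cephfs/dir/file-b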


> 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I
> need to keep in mind combining hard links with CephFS?
>

Not other than above.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-19 Thread Konstantin Shalygin

I setup an SSD Luminous 12.2.11 cluster and realized after data had been
added that pg_num was not set properly on the default.rgw.buckets.data pool
( where all the data goes ).  I adjusted the settings up, but recovery is
going really slow ( like 56-110MiB/s ) ticking down at .002 per log
entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
objects ) added to the cluster before I noticed the red blinking lights.

  


I tried adjusting the recovery to:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'

  


Which did help a little, but didn't seem to have the impact I was looking
for.  I have used the settings on HDD clusters before to speed things up (
using 8 backfills and 4 max active though ).  Did I miss something or is
this part of the pg expansion process.  Should I be doing something else
with SSD clusters?

  


Regards,

-Brent

  


Existing Clusters:

Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
virtual on SSD )

US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
behind haproxy LB

UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
gateways behind haproxy LB

US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
gateways behind haproxy LB


Try to lower `osd_recovery_sleep*` options.

You can get your current values from ceph admin socket like this:

```

ceph daemon osd.0 config show | jq 'to_entries[] | if 
(.key|test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'


```
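
If any of these are non-zero on an all-SSD cluster, they can be lowered at
runtime, for example:

```

ceph tell 'osd.*' injectargs '--osd-recovery-sleep-ssd 0'

```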


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH ISCSI LIO multipath change delay

2019-03-19 Thread li jerry
Hi,ALL

I've deployed a mimic (13.2.5) cluster on 3 CentOS 7.6 servers, then configured
an iscsi-target and created a LUN, referring to
http://docs.ceph.com/docs/mimic/rbd/iscsi-target-cli/.
I have another server running CentOS 7.4, on which I configured and mounted the
LUN I've just created, referring to
http://docs.ceph.com/docs/mimic/rbd/iscsi-initiator-linux/.
I'm trying to do HA testing:
1. Perform a WRITE test with the DD command
2. Stop one 'Activate' iscsi-target node (ini 0); DD IO hangs for over 25 seconds
until the iscsi-target switches to another node
3. DD IO goes back to normal
My question is, why does it take so long for the iscsi-target to switch? Are there
any settings I've misconfigured?
Usually it only takes a few seconds to switch on enterprise storage products.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Alfredo Deza
On Tue, Mar 19, 2019 at 2:53 PM Benjamin Cherian
 wrote:
>
> Hi,
>
> I'm getting an error when trying to use the APT repo for Ubuntu bionic. Does 
> anyone else have this issue? Is the mirror sync actually still in progress? 
> Or was something setup incorrectly?
>
> E: Failed to fetch 
> https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
>   File has unexpected size (15515 != 15488). Mirror sync in progress? [IP: 
> 158.69.68.124 443]
>Hashes of expected file:
> - Filesize:15488 [weak]
> - SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
> - SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
> - MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
>Release file created at: Mon, 11 Mar 2019 18:44:46 +
>

This has now been fixed, let me know if you have any more issues.


>
> Thanks,
> Ben
>
>
> On Tue, Mar 19, 2019 at 7:24 AM Sean Purdy  wrote:
>>
>> Hi,
>>
>>
>> Will debian packages be released?  I don't see them in the nautilus repo.  I 
>> thought that Nautilus was going to be debian-friendly, unlike Mimic.
>>
>>
>> Sean
>>
>> On Tue, 19 Mar 2019 14:58:41 +0100
>> Abhishek Lekshmanan  wrote:
>>
>> >
>> > We're glad to announce the first release of Nautilus v14.2.0 stable
>> > series. There have been a lot of changes across components from the
>> > previous Ceph releases, and we advise everyone to go through the release
>> > and upgrade notes carefully.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Alfredo Deza
There aren't any Debian packages built for this release because we
haven't updated the infrastructure to build (and test) Debian packages
yet.

On Tue, Mar 19, 2019 at 10:24 AM Sean Purdy  wrote:
>
> Hi,
>
>
> Will debian packages be released?  I don't see them in the nautilus repo.  I 
> thought that Nautilus was going to be debian-friendly, unlike Mimic.
>
>
> Sean
>
> On Tue, 19 Mar 2019 14:58:41 +0100
> Abhishek Lekshmanan  wrote:
>
> >
> > We're glad to announce the first release of Nautilus v14.2.0 stable
> > series. There have been a lot of changes across components from the
> > previous Ceph releases, and we advise everyone to go through the release
> > and upgrade notes carefully.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com