Re: [ceph-users] Bareos and libradosstriper works only for 4M stripe_unit size

2017-10-17 Thread Maged Mokhtar
>> Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
>> 4 objects be artificially padded (with 0's) to meet the stripe_unit?

It will be 4 objects of 24M + 1M stored on the 5th object.

If you write 104M: 4 objects of 24M + 8M stored on the 5th object.

If you write 105M: 4 objects of 24M + 8M stored on the 5th object + 1M
on the 6th object.
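To make the arithmetic explicit, here is a rough sketch of the same RAID-0
style mapping (an illustration only, not libradosstriper code; sizes are in
1M units and the parameters are the ones from the example quoted below):

# Illustration: reproduce the mapping for stripe_unit=8M, stripe_count=4,
# object_size=24M and check the 97M / 104M / 105M cases above plus 128M.
def striper_layout(total_mb, stripe_unit_mb=8, stripe_count=4, object_size_mb=24):
    """Return {object_index: size_in_MB} for a single striped write of total_mb."""
    sizes = {}
    set_mb = stripe_count * object_size_mb        # one object set holds 96M here
    for off in range(total_mb):                   # walk the data 1M at a time
        object_set, in_set = divmod(off, set_mb)
        stripe_no = in_set // stripe_unit_mb      # which stripe unit inside the set
        obj = object_set * stripe_count + stripe_no % stripe_count
        sizes[obj] = sizes.get(obj, 0) + 1
    return sizes

for total in (97, 104, 105, 128):
    print(total, striper_layout(total))
# 97  -> {0: 24, 1: 24, 2: 24, 3: 24, 4: 1}
# 104 -> {0: 24, 1: 24, 2: 24, 3: 24, 4: 8}
# 105 -> {0: 24, 1: 24, 2: 24, 3: 24, 4: 8, 5: 1}
# 128 -> {0: 24, 1: 24, 2: 24, 3: 24, 4: 8, 5: 8, 6: 8, 7: 8}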

Maged 

On 2017-10-17 01:59, Christian Wuerdig wrote:

> Maybe an additional example where the numbers don't line up quite so
> nicely would be good as well. For example, it's not immediately obvious
> to me what would happen with the stripe settings given in your example
> if you write 97M of data.
> Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
> 4 objects be artificially padded (with 0's) to meet the stripe_unit?
> 
> On Tue, Oct 17, 2017 at 12:35 PM, Alexander Kushnirenko
>  wrote: Hi, Gregory, Ian!
> 
> There is very little information on striper mode in Ceph documentation.
> Could this explanation help?
> 
> The logic of striper mode is very much the same as in RAID-0. There are 3
> parameters that drive it:
> 
> stripe_unit  - the stripe size (default=4M)
> stripe_count - how many objects to write to in parallel (default=1)
> object_size  - when to stop increasing the object size and create new objects
> (default=4M)
> 
> For example, if you write 128M of data (128 consecutive pieces of data, 1M
> each) in striped mode with the following parameters:
> stripe_unit = 8M
> stripe_count = 4
> object_size = 24M
> Then 8 objects will be created - 4 objects with 24M size and 4 objects with
> 8M size:
> 
> Obj1=24M    Obj2=24M    Obj3=24M    Obj4=24M
> 00 .. 07    08 .. 0f    10 .. 17    18 .. 1f   <-- consecutive 1M pieces of data
> 20 .. 27    28 .. 2f    30 .. 37    38 .. 3f
> 40 .. 47    48 .. 4f    50 .. 57    58 .. 5f
> 
> Obj5= 8M    Obj6= 8M    Obj7= 8M    Obj8= 8M
> 60 .. 67    68 .. 6f    70 .. 77    78 .. 7f
> 
> Alexander.
> 
> On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko
>  wrote: 
> Oh! I put in the wrong link, sorry. The picture which explains stripe_unit and
> stripe_count is here:
> 
> https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf
> 
> I tried to attach it in the mail, but it was blocked.
> 
> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko
>  wrote: 
> Hi, Ian!
> 
> Thank you for your reference!
> 
> Could you comment on the following rule:
> object_size = stripe_unit * stripe_count
> Or is it not necessarily so?
> 
> I refer to page 8 in this report:
> 
> https://indico.cern.ch/event/531810/contributions/2298934/attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
> 
> Alexander.
> 
> On Wed, Oct 11, 2017 at 1:11 PM,  wrote: 
> Hi Gregory
> 
> You're right, when setting the object layout in libradosstriper, one
> should set all three parameters (the number of stripes, the size of the
> stripe unit, and the size of the striped object). The Ceph plugin for
> GridFTP has an example of this at
> https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371
> 
> At RAL, we use the following values:
> 
> $STRIPER_NUM_STRIPES 1
> 
> $STRIPER_STRIPE_UNIT 8388608
> 
> $STRIPER_OBJECT_SIZE 67108864
> 
> Regards,
> 
> Ian Johnson MBCS
> 
> Data Services Group
> 
> Scientific Computing Department
> 
> Rutherford Appleton Laboratory
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Marco Baldini - H.S. Amiata

Hello

Here are my results.

On this node, I have 3 OSDs (1TB HDD); osd.1 and osd.2 have block.db on
90GB SSD partitions each, osd.8 has no separate block.db.


pve-hs-main[0]:~$ for i in {1,2,8} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.1 db per object: 20872
osd.2 db per object: 20416
osd.8 db per object: 16888


On this node, I have 3 OSDs (1TB HDD), each with a 60GB block.db on a
separate SSD.


pve-hs-2[0]:/$ for i in {3..5} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.3 db per object: 19053
osd.4 db per object: 18742
osd.5 db per object: 14979


On this node, I have 3 OSDs (1TB HDD) with no separate SSD.

pve-hs-3[0]:~$ for i in {0,6,7} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 27392
osd.6 db per object: 54065
osd.7 db per object: 69986


My ceph df and rados df output, in case they are useful:

pve-hs-3[0]:~$ ceph df detail
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED  OBJECTS
    8742G  6628G  2114G     24.19      187k
POOLS:
    NAME        ID  QUOTA OBJECTS  QUOTA BYTES  USED    %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE  RAW USED
    cephbackup  9   N/A            N/A          469G    7.38   2945G      120794   117k   759k   2899k  938G
    cephwin     13  N/A            N/A          73788M  1.21   1963G      18711    18711  1337k  1637k  216G
    cephnix     14  N/A            N/A          201G    3.31   1963G      52407    52407  791k   1781k  605G
pve-hs-3[0]:~$ rados df detail
POOL_NAME   USED    OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD      WR_OPS   WR
cephbackup  469G    120794   0       241588  0                   0        0         777872   7286M   2968926  718G
cephnix     201G    52407    0       157221  0                   0        0         810317   67057M  1824184  242G
cephwin     73788M  18711    0       56133   0                   0        0         1369792  155G    1677060  136G

total_objects  191912
total_used     2114G
total_avail    6628G
total_space    8742G


Can someone see a pattern?
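For reference, the arithmetic in the one-liners above can be wrapped up like
this (a rough Python sketch using the same perf counters; it assumes it runs
on the OSD host so the admin socket for "ceph daemon osd.N perf dump" is
reachable):

import json
import subprocess

def db_bytes_per_object(osd_id):
    # same ratio as the shell one-liners: bluefs db_used_bytes / bluestore_onodes
    dump = json.loads(subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump']))
    return dump['bluefs']['db_used_bytes'] / float(dump['bluestore']['bluestore_onodes'])

def projected_db_gb(osd_id, target_objects):
    # back-of-envelope DB size for a target object count, based on the current ratio
    return db_bytes_per_object(osd_id) * target_objects / (1024.0 ** 3)

# e.g. at ~20 kB/object (osd.1 above) one million objects would need ~19 GB of DB,
# while at the ~6 kB/object mentioned below in the thread it would be closer to 6 GB.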



On 17/10/2017 08:54, Wido den Hollander wrote:

On 16 October 2017 at 18:14, Richard Hesketh wrote:


On 16/10/17 13:45, Wido den Hollander wrote:

On 26 September 2017 at 16:39, Mark Nelson wrote:
On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

That's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.

It's possible that we might be able to get ranges for certain kinds of
scenarios.  Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object.  Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y.  I think
it's probably going to be tough to make it accurate for everyone though.

So I did a quick test. I wrote 75,000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~#

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido

If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 

Re: [ceph-users] Rbd resize, refresh rescan

2017-10-17 Thread Marc Roos

The rbd resize is picked up automatically on the host where it is mapped.

However, for the changes to appear in libvirt/qemu, I have to run:
virsh qemu-monitor-command vps-test2 --hmp "info block"
virsh qemu-monitor-command vps-test2 --hmp "block_resize drive-scsi0-0-0-0 12G"
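For what it's worth, the same resize can also be issued through the libvirt
Python binding instead of the monitor command. A sketch only; the domain name
is the one from my commands above and the target device ('sda' here) is an
assumption, so check virsh domblklist for the real one:

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('vps-test2')

new_size = 12 * 1024 ** 3   # 12G, expressed in bytes because of the flag below
# blockResize wants the target device as listed by domblklist (e.g. 'sda'/'vda')
dom.blockResize('sda', new_size, libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES)
conn.close()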



-Original Message-
From: Marc Roos 
Sent: Monday, 18 September 2017 23:02
To: David Turner
Subject: RE: [ceph-users] Rbd resize, refresh rescan


Yes I can remember, I guess I have to do something in kvm/virt-manager, 
so the change is relayed to the guest.

-Original Message-
From: David Turner [mailto:drakonst...@gmail.com]
Sent: Monday, 18 September 2017 23:00
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Rbd resize, refresh rescan

Disk Management in Windows should very easily extend a partition to use 
the rest of the disk.  You should just right click the partition and 
select "Extend Volume" and that's it.  I did it in Windows 10 over the 
weekend for a laptop that had been set up weird.  

On Mon, Sep 18, 2017 at 4:49 PM Marc Roos  
wrote:



Yes, I think you are right; after I saw this in dmesg, I noticed with
fdisk that the block device was updated:
 rbd21: detected capacity change from 5368709120 to 6442450944

Maybe this also works (I found something that referred to a /sys/class
path, which I don't have): echo 1 > /sys/devices/rbd/21/refresh

(I am trying to increase the size online via kvm, virtio disk in Windows
2016)


-Original Message-
From: David Turner [mailto:drakonst...@gmail.com]
Sent: Monday, 18 September 2017 22:42
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Rbd resize, refresh rescan

I've never needed to do anything other than extend the partition 
and/or
filesystem when I increased the size of an RBD.  Particularly if I
didn't partition the RBD I only needed to extend the filesystem.

Which method are you mapping/mounting the RBD?  Is it through a
Hypervisor or just mapped to a server?  What are you seeing to 
indicate
that the RBD isn't already reflecting the larger size?  Which 
version of
Ceph are you using?

On Mon, Sep 18, 2017 at 4:31 PM Marc Roos 

wrote:



Is there something like this for scsi, to rescan the size 
of the
rbd
device and make it available? (while it is being used)

echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-17 Thread Frédéric Nass
Hi folks, 

For those who missed it, the fun was here :-) : 
https://youtu.be/IgpVOOVNJc0?t=3715 

Frederic. 

- On 11 Oct 17, at 17:05, Jake Young wrote:

> On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman < [ mailto:jdill...@redhat.com 
> |
> jdill...@redhat.com ] > wrote:

>> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López < [
>> mailto:jorp...@unizar.es | jorp...@unizar.es ] > wrote:

>>> As far as I am able to understand there are 2 ways of setting iscsi for ceph

>>> 1- using the kernel (lrbd), only available on SUSE, CentOS, Fedora...

>> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
>> like
>> PetaSAN) as far as I know. This was the initial approach for Red Hat-derived
>> kernels as well until the upstream kernel maintainers indicated that they
>> really do not want a specialized target backend for just krbd. The next 
>> attempt
>> was to re-use the existing target_core_iblock to interface with krbd via the
>> kernel's block layer, but that hit similar upstream walls trying to get 
>> support
>> for SCSI command passthrough to the block layer.

>>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)

>> The TCMU approach is what upstream and Red Hat-derived kernels will support
>> going forward.
>> The lrbd project was developed by SUSE to assist with configuring a cluster 
>> of
>> iSCSI gateways via the cli. The ceph-iscsi-config + ceph-iscsi-cli projects 
>> are
>> similar in goal but take a slightly different approach. ceph-iscsi-config
>> provides a set of common Python libraries that can be re-used by 
>> ceph-iscsi-cli
>> and ceph-ansible for deploying and configuring the gateway. The 
>> ceph-iscsi-cli
>> project provides the gwcli tool which acts as a cluster-aware replacement for
>> targetcli.

>>> I don't know which one is better; I am seeing that official support is
>>> pointing to tcmu, but I haven't done any test bench.

>> We (upstream Ceph) provide documentation for the TCMU approach because that 
>> is
>> what is available against generic upstream kernels (starting with 4.14 when
>> it's out). Since it uses librbd (which still needs to undergo some 
>> performance
>> improvements) instead of krbd, we know that librbd 4k IO performance is 
>> slower
>> compared to krbd, but 64k and 128k IO performance is comparable. However, I
>> think most iSCSI tuning guides would already tell you to use larger block 
>> sizes
>> (i.e. 64K NTFS blocks or 32K-128K ESX blocks).

>>> Has anyone tried both? Do they give the same output? Are both able to
>>> manage multiple iSCSI targets mapped to a single rbd disk?

>> Assuming you mean multiple portals mapped to the same RBD disk, the answer is
>> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will
>> only configure Active/Passive because we believe there are certain edge
>> conditions that could result in data corruption if configured for 
>> Active/Active
>> ALUA.

>> The TCMU approach also does not currently support SCSI persistent reservation
>> groups (needed for Windows clustering) because that support isn't available 
>> in
>> the upstream kernel. The SUSE kernel has an approach that utilizes two
>> round-trips to the OSDs for each IO to simulate PGR support. Earlier this
>> summer I believe SUSE started to look into how to get generic PGR support
>> merged into the upstream kernel using corosync/dlm to synchronize the states
>> between multiple nodes in the target. I am not sure of the current state of
>> that work, but it would benefit all LIO targets when complete.

>>> I will try to make my own testing but if anyone has tried in advance it 
>>> would be
>>> really helpful.

>>> Jorge Pinilla López
>>> [ mailto:jorp...@unizar.es | jorp...@unizar.es ]


>>> ___
>>> ceph-users mailing list
>>> [ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ]
>>> [ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]

>> --
>> Jason
>> ___
>> ceph-users mailing list
>> [ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ]
>> [ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com |
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ]

> Thanks Jason!

> You should cut and paste that answer into a blog post on [ http://ceph.com/ |
> ceph.com ] . It is a great summary of where things stand

[ceph-users] Retrieve progress of volume flattening using RBD python library

2017-10-17 Thread Xavier Trilla
Hi,

Does anybody know if there is a way to inspect the progress of a volume 
flattening while using the python rbd library?

I mean, using the CLI it is possible to see the progress of the flattening, but
when calling volume.flatten() it just blocks until it's done.

Is there any way to infer the progress?
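For now the only thing I can think of is to run the flatten in a worker thread
so at least the caller is not blocked; progress itself is still not exposed
there as far as I can tell. A rough sketch, assuming the standard rados/rbd
Python bindings and placeholder pool/image names:

import threading
import rados
import rbd

def flatten_async(pool, image_name, conffile='/etc/ceph/ceph.conf'):
    # Image.flatten() blocks, so run it in a background thread and let the
    # caller poll the returned Thread object instead of blocking.
    def worker():
        cluster = rados.Rados(conffile=conffile)
        cluster.connect()
        ioctx = cluster.open_ioctx(pool)
        try:
            image = rbd.Image(ioctx, image_name)
            try:
                image.flatten()          # blocks, but only this worker thread
            finally:
                image.close()
        finally:
            ioctx.close()
            cluster.shutdown()

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    return t

# t = flatten_async('volumes', 'volume-xyz')   # placeholder names
# while t.is_alive(): ...                      # caller stays responsive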

Hope somebody may help.

Thanks!
Xavier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-17 Thread Jorge Pinilla López
So from what I have understood, the final summary was to support MC to be able
to do multipath Active/Active.

How is that project going?

Windows will be able to support it because they have already implemented
it client-side, but unless ESXi implements it, VMware will only be able
to do Active/Passive, am I right?

On 17/10/2017 at 11:01, Frédéric Nass wrote:
> Hi folks,
>
> For those who missed it, the fun was here :-) :
> https://youtu.be/IgpVOOVNJc0?t=3715
>
> Frederic.
>
> - On 11 Oct 17, at 17:05, Jake Young wrote:
>
>
> On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman
> mailto:jdill...@redhat.com>> wrote:
>
> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López
> mailto:jorp...@unizar.es>> wrote:
>
> As far as I am able to understand there are 2 ways of
> setting iscsi for ceph
>
> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
>
>
> The target_core_rbd approach is only utilized by SUSE (and its
> derivatives like PetaSAN) as far as I know. This was the
> initial approach for Red Hat-derived kernels as well until the
> upstream kernel maintainers indicated that they really do not
> want a specialized target backend for just krbd. The next
> attempt was to re-use the existing target_core_iblock to
> interface with krbd via the kernel's block layer, but that hit
> similar upstream walls trying to get support for SCSI command
> passthrough to the block layer.
>  
>
> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>
>
> The TCMU approach is what upstream and Red Hat-derived kernels
> will support going forward. 
>  
> The lrbd project was developed by SUSE to assist with
> configuring a cluster of iSCSI gateways via the cli.  The
> ceph-iscsi-config + ceph-iscsi-cli projects are similar in
> goal but take a slightly different approach. ceph-iscsi-config
> provides a set of common Python libraries that can be re-used
> by ceph-iscsi-cli and ceph-ansible for deploying and
> configuring the gateway. The ceph-iscsi-cli project provides
> the gwcli tool which acts as a cluster-aware replacement for
> targetcli.
>
> I don't know which one is better, I am seeing that oficial
> support is pointing to tcmu but i havent done any testbench.
>
>
> We (upstream Ceph) provide documentation for the TCMU approach
> because that is what is available against generic upstream
> kernels (starting with 4.14 when it's out). Since it uses
> librbd (which still needs to undergo some performance
> improvements) instead of krbd, we know that librbd 4k IO
> performance is slower compared to krbd, but 64k and 128k IO
> performance is comparable. However, I think most iSCSI tuning
> guides would already tell you to use larger block sizes (i.e.
> 64K NTFS blocks or 32K-128K ESX blocks).
>  
>
> Does anyone tried both? Do they give the same output? Are
> both able to manage multiple iscsi targets mapped to a
> single rbd disk?
>
>
> Assuming you mean multiple portals mapped to the same RBD
> disk, the answer is yes, both approaches should support ALUA.
> The ceph-iscsi-config tooling will only configure
> Active/Passive because we believe there are certain edge
> conditions that could result in data corruption if configured
> for Active/Active ALUA.
>
> The TCMU approach also does not currently support SCSI
> persistent reservation groups (needed for Windows clustering)
> because that support isn't available in the upstream kernel.
> The SUSE kernel has an approach that utilizes two round-trips
> to the OSDs for each IO to simulate PGR support. Earlier this
> summer I believe SUSE started to look into how to get generic
> PGR support merged into the upstream kernel using corosync/dlm
> to synchronize the states between multiple nodes in the
> target. I am not sure of the current state of that work, but
> it would benefit all LIO targets when complete.
>  
>
> I will try to make my own testing but if anyone has tried
> in advance it would be really helpful.
>
> 
> 
> *Jorge Pinilla López*
> jorp...@unizar.es 

Re: [ceph-users] Ceph-ISCSI

2017-10-17 Thread Maged Mokhtar
The issue with active/active is the following condition:
- the client initiator sends a write operation to gateway server A
- server A does not respond within the client timeout
- the client initiator re-sends the failed write operation to gateway server B
- the client initiator sends another write operation to gateway server C (or B)
  on the same sector with different data
- server A wakes up and writes its pending data, which overwrites the sector
  with the old data

As Jason mentioned, this is an edge condition, but it poses challenges on
how to deal with it. Some approaches:

- Increase the timeout of the client failover and implement fencing with a
  smaller heartbeat timeout.
- Implement a distributed operation counter (using a Ceph object or a
  distributed configuration/dlm tool) so that if server B gets an
  operation it can detect that this was because of server A failing and
  start a fencing action.
- Similar to the above, but rely on iSCSI session counters in Microsoft
  MCS. MPIO does not generate consecutive numbers across the different
  session paths.

Maged 

On 2017-10-17 12:23, Jorge Pinilla López wrote:

> So from what I have understood, the final summary was to support MC to be able
> to do multipath Active/Active.
> 
> How is that project going? Windows will be able to support it because they
> have already implemented it client-side, but unless ESXi implements it, VMware
> will only be able to do Active/Passive, am I right?
> 
> On 17/10/2017 at 11:01, Frédéric Nass wrote:
> 
> Hi folks, 
> 
> For those who missed it, the fun was here :-) : 
> https://youtu.be/IgpVOOVNJc0?t=3715 
> 
> Frederic. 
> 
> - On 11 Oct 17, at 17:05, Jake Young wrote:
> 
> On Wed, Oct 11, 2017 at 8:57 AM Jason Dillaman  wrote: 
> 
> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López  
> wrote:
> 
> As far as I am able to understand there are 2 ways of setting iscsi for ceph
> 
> 1- using the kernel (lrbd), only available on SUSE, CentOS, Fedora...
> 
> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
> like PetaSAN) as far as I know. This was the initial approach for Red 
> Hat-derived kernels as well until the upstream kernel maintainers indicated 
> that they really do not want a specialized target backend for just krbd. The 
> next attempt was to re-use the existing target_core_iblock to interface with 
> krbd via the kernel's block layer, but that hit similar upstream walls trying 
> to get support for SCSI command passthrough to the block layer. 
> 
> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli) 
> 
> The TCMU approach is what upstream and Red Hat-derived kernels will support 
> going forward.  
> 
> The lrbd project was developed by SUSE to assist with configuring a cluster 
> of iSCSI gateways via the cli.  The ceph-iscsi-config + ceph-iscsi-cli 
> projects are similar in goal but take a slightly different approach. 
> ceph-iscsi-config provides a set of common Python libraries that can be 
> re-used by ceph-iscsi-cli and ceph-ansible for deploying and configuring the 
> gateway. The ceph-iscsi-cli project provides the gwcli tool which acts as a 
> cluster-aware replacement for targetcli. 
> 
> I don't know which one is better; I am seeing that official support is
> pointing to tcmu, but I haven't done any test bench.
> 
> We (upstream Ceph) provide documentation for the TCMU approach because that 
> is what is available against generic upstream kernels (starting with 4.14 
> when it's out). Since it uses librbd (which still needs to undergo some 
> performance improvements) instead of krbd, we know that librbd 4k IO 
> performance is slower compared to krbd, but 64k and 128k IO performance is 
> comparable. However, I think most iSCSI tuning guides would already tell you 
> to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks). 
> 
> Has anyone tried both? Do they give the same output? Are both able to manage
> multiple iSCSI targets mapped to a single rbd disk?
> 
> Assuming you mean multiple portals mapped to the same RBD disk, the answer is 
> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will 
> only configure Active/Passive because we believe there are certain edge 
> conditions that could result in data corruption if configured for 
> Active/Active ALUA. 
> The TCMU approach also does not currently support SCSI persistent reservation 
> groups (needed for Windows clustering) because that support isn't available 
> in the upstream kernel. The SUSE kernel has an approach that utilizes two 
> round-trips to the OSDs for each IO to simulate PGR support. Earlier this 
> summer I believe SUSE started to look into how to get generic PGR support 
> merged into the upstream kernel using corosync/dlm to synchronize the states 
> between multiple nodes in the target. I am not sure of the current state of 
> that work, but it would benefit all LIO targets when complete. 
> 
> I will try to make my own testing but if anyone has tried in advance it would 
> be really helpful.
> 
> ---

Re: [ceph-users] cephfs: some metadata operations take seconds to complete

2017-10-17 Thread Tyanko Aleksiev
Thanks for the replies.
I'll move all our testbed installation to Luminous and redo the tests.

Cheers,
Tyanko

On 17 October 2017 at 10:14, Yan, Zheng  wrote:

> On Tue, Oct 17, 2017 at 1:07 AM, Tyanko Aleksiev
>  wrote:
> > Hi,
> >
> > At UZH we are currently evaluating cephfs as a distributed file system for
> > the scratch space of an HPC installation. Some slowdown of the metadata
> > operations seems to occur under certain circumstances. In particular,
> > commands issued after a big file deletion can take several seconds.
> >
> > Example:
> >
> > dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
> > 274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s
> >
> > dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
> > 274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s
> >
> > ls; time rm dd-test2 ; time ls
> > dd-test  dd-test2
> >
> > real0m0.004s
> > user0m0.000s
> > sys 0m0.000s
> > dd-test
> >
> > real0m8.795s
> > user0m0.000s
> > sys 0m0.000s
> >
> > Additionally, the time it takes to complete the "ls" command appears to
> be
> > proportional to the size of the deleted file. The issue described above
> is
> > not limited to "ls" but extends to other commands:
> >
> > ls ; time rm dd-test2 ; time du -hs ./*
> > dd-test  dd-test2
> >
> > real0m0.003s
> > user0m0.000s
> > sys 0m0.000s
> > 128G./dd-test
> >
> > real0m9.974s
> > user0m0.000s
> > sys 0m0.000s
> >
> > What might be causing this behavior and eventually how could we improve
> it?
> >
>
> It seems like the mds was waiting for a journal flush; it can wait up to
> 'mds_tick_interval'. This issue should be fixed in the Luminous release.
>
> Regards
> Yan, Zheng
>
> > Setup:
> >
> > - ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
> > - 3 monitors,
> > - 1 mds,
> > - 3 storage nodes with 24 X 4TB disks on each node: 1 OSD/disk (72 OSDs
> in
> > total). 4TB disks are used for the cephfs_data pool. Journaling is on
> SSDs,
> > - we installed an 400GB NVMe disk on each storage node and aggregated the
> > tree disks in crush rule. cephfs_metadata pool was then created using
> that
> > rule and therefore is hosted on the NVMes. Journaling and data are on the
> > same partition here.
> >
> > So far we are using the default ceph configuration settings.
> >
> > Clients are mounting the file system with the kernel driver using the
> > following options (again default):
> > "rw,noatime,name=admin,secret=,acl,_netdev".
> >
> > Thank you in advance for the help.
> >
> > Cheers,
> > Tyanko
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Mark Nelson



On 10/17/2017 01:54 AM, Wido den Hollander wrote:



On 16 October 2017 at 18:14, Richard Hesketh wrote:


On 16/10/17 13:45, Wido den Hollander wrote:

On 26 September 2017 at 16:39, Mark Nelson wrote:
On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

That's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.


It's possible that we might be able to get ranges for certain kinds of
scenarios.  Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object.  Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y.  I think
it's probably going to be tough to make it accurate for everyone though.


So I did a quick test. I wrote 75,000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~#

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido


If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why so much variance (these nodes are basically identical) and I 
think that the db_used_bytes includes the WAL at least in my case, as I don't 
have a separate WAL device. I'm not sure how big the WAL is relative to 
metadata and hence how much this might be thrown off, but ~6kb/object seems 
like a reasonable value to take for back-of-envelope calculating.



Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are 
welcome in this case.

Some input from a BlueStore dev might be helpful as well to see we are not 
drawing the wrong conclusions here.

Wido


I would be very careful about drawing too many conclusions given a 
single snapshot in time, especially if there haven't been a lot of 
partial object rewrites yet.  Just on the surface, 6KB/object feels low 
(especially if you they are moderately large objects), but perhaps if 
they've never been rewritten this is a reasonable lower bound.  This is 
important because things like 4MB RBD objects that are regularly 
rewritten might behave a lot differen

[ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai
Hi,

I am looking at the following issue: http://tracker.ceph.com/issues/21375

In summary, during a 'rados bench', impossible latency values (e.g.
9.00648e+07) are suddenly reported. I looked briefly at the code, it
seems CLOCK_REALTIME is used, which means that wall clock changes would
affect this output. This is a VM cluster, so the hypothesis was that the
system's clock was falling behind for some reason, then getting
readjusted (that's the only way I could reproduce the issue), which I
think is quite possible in a virtual environment.

A concern was raised: are there more critical parts of Ceph where a
clock jumping around might interfere with the behavior of the cluster?
It would be good to know if there are any, and maybe prepare for them?

Mohamad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unstable clock

2017-10-17 Thread Joao Eduardo Luis

On 10/17/2017 01:30 PM, Mohamad Gebai wrote:

A concern was raised: are there more critical parts of Ceph where a
clock jumping around might interfere with the behavior of the cluster?
It would be good to know if there are any, and maybe prepare for them?


cephx and monitor paxos leases come to mind.

  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unstable clock

2017-10-17 Thread Sage Weil
On Tue, 17 Oct 2017, Mohamad Gebai wrote:
> Hi,
> 
> I am looking at the following issue: http://tracker.ceph.com/issues/21375
> 
> In summary, during a 'rados bench', impossible latency values (e.g.
> 9.00648e+07) are suddenly reported. I looked briefly at the code, it
> seems CLOCK_REALTIME is used, which means that wall clock changes would
> affect this output. This is a VM cluster, so the hypothesis was that the
> system's clock was falling behind for some reason, then getting
> readjusted (that's the only way I could reproduce the issue), which I
> think is quite possible in a virtual environment.
> 
> A concern was raised: are there more critical parts of Ceph where a
> clock jumping around might interfere with the behavior of the cluster?

Yes, definitely.

> It would be good to know if there are any, and maybe prepare for them?

Adam added a new set of clock primitives that include a monotonic clock 
option that should be used in all cases where we're measuring the passage 
of time instead of the wall clock time.  There is a longstanding trello 
card to go through and change the latency calculations to use the 
monotonic clock.  There are probably dozens of places where an ill-timed 
clock jump is liable to trigger some random assert.  It's just a matter of 
going through and auditing calls to the legacy ceph_clock_now() method.
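For illustration only (nothing Ceph-specific): the distinction is the same one
any language makes between wall-clock and monotonic time, e.g. in Python:

import time

def do_io():
    time.sleep(0.01)     # stand-in for whatever operation is being timed

# Wall clock (the CLOCK_REALTIME / legacy ceph_clock_now() style): an NTP step
# or manual adjustment between the two reads makes the "latency" huge or negative.
start = time.time()
do_io()
wall_latency = time.time() - start

# Monotonic clock: only moves forward, immune to clock jumps, which is what
# you want when measuring elapsed time.
start = time.monotonic()
do_io()
mono_latency = time.monotonic() - start

print(wall_latency, mono_latency)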

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous : 3 clients failing to respond to cache pressure

2017-10-17 Thread Yoann Moulin
Hello,

I have a luminous (12.2.1) cluster with 3 nodes for cephfs (no rbd or rgw) and 
we hit the "X clients failing to respond to cache pressure" message.
I have 3 mds servers active.

Is this something I have to worry about ?

here some information about the cluster :

> root@iccluster054:~# ceph --cluster container -s
>   cluster:
> id: a294a95a-0baa-4641-81c1-7cd70fd93216
> health: HEALTH_WARN
> 3 clients failing to respond to cache pressure
>  
>   services:
> mon: 3 daemons, quorum 
> iccluster041.iccluster.epfl.ch,iccluster042.iccluster.epfl.ch,iccluster054.iccluster.epfl.ch
> mgr: iccluster042(active), standbys: iccluster054
> mds: cephfs-3/3/3 up  
> {0=iccluster054.iccluster.epfl.ch=up:active,1=iccluster041.iccluster.epfl.ch=up:active,2=iccluster042.iccluster.epfl.ch=up:active}
> osd: 18 osds: 18 up, 18 in
>  
>   data:
> pools:   3 pools, 544 pgs
> objects: 2357k objects, 564 GB
> usage:   2011 GB used, 65055 GB / 67066 GB avail
> pgs: 544 active+clean
>  



> root@iccluster041:~# ceph --cluster container daemon 
> mds.iccluster041.iccluster.epfl.ch perf dump mds
> {
> "mds": {
> "request": 193508283,
> "reply": 192815355,
> "reply_latency": {
> "avgcount": 192815355,
> "sum": 457371.475011160,
> "avgtime": 0.002372069
> },
> "forward": 692928,
> "dir_fetch": 1717132,
> "dir_commit": 43521,
> "dir_split": 4197,
> "dir_merge": 4244,
> "inode_max": 2147483647,
> "inodes": 11098,
> "inodes_top": 7668,
> "inodes_bottom": 3404,
> "inodes_pin_tail": 26,
> "inodes_pinned": 143,
> "inodes_expired": 138623,
> "inodes_with_caps": 87,
> "caps": 239,
> "subtrees": 15,
> "traverse": 195425369,
> "traverse_hit": 192867085,
> "traverse_forward": 692723,
> "traverse_discover": 476,
> "traverse_dir_fetch": 1714684,
> "traverse_remote_ino": 0,
> "traverse_lock": 6,
> "load_cent": 19465322425,
> "q": 0,
> "exported": 1211,
> "exported_inodes": 845556,
> "imported": 1082,
> "imported_inodes": 1209280
> }
> }

> root@iccluster054:~# ceph --cluster container daemon 
> mds.iccluster054.iccluster.epfl.ch perf dump mds
> {
> "mds": {
> "request": 267620366,
> "reply": 255792944,
> "reply_latency": {
> "avgcount": 255792944,
> "sum": 42256.407340600,
> "avgtime": 0.000165197
> },
> "forward": 11827411,
> "dir_fetch": 183,
> "dir_commit": 2607,
> "dir_split": 27,
> "dir_merge": 19,
> "inode_max": 2147483647,
> "inodes": 3740,
> "inodes_top": 2517,
> "inodes_bottom": 1149,
> "inodes_pin_tail": 74,
> "inodes_pinned": 143,
> "inodes_expired": 2103018,
> "inodes_with_caps": 57,
> "caps": 272,
> "subtrees": 8,
> "traverse": 267626346,
> "traverse_hit": 255796915,
> "traverse_forward": 11826902,
> "traverse_discover": 77,
> "traverse_dir_fetch": 30,
> "traverse_remote_ino": 0,
> "traverse_lock": 0,
> "load_cent": 26824996745,
> "q": 3,
> "exported": 1319,
> "exported_inodes": 2037400,
> "imported": 418,
> "imported_inodes": 7347
> }
> }

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous : 3 clients failing to respond to cache pressure

2017-10-17 Thread Wido den Hollander

> On 17 October 2017 at 15:35, Yoann Moulin wrote:
> 
> 
> Hello,
> 
> I have a luminous (12.2.1) cluster with 3 nodes for cephfs (no rbd or rgw) 
> and we hit the "X clients failing to respond to cache pressure" message.
> I have 3 mds servers active.
> 

What type of client? Kernel? FUSE?

If it's a kernel client, what kernel are you running?

Wido

> Is this something I have to worry about ?
> 
> here some information about the cluster :
> 
> > root@iccluster054:~# ceph --cluster container -s
> >   cluster:
> > id: a294a95a-0baa-4641-81c1-7cd70fd93216
> > health: HEALTH_WARN
> > 3 clients failing to respond to cache pressure
> >  
> >   services:
> > mon: 3 daemons, quorum 
> > iccluster041.iccluster.epfl.ch,iccluster042.iccluster.epfl.ch,iccluster054.iccluster.epfl.ch
> > mgr: iccluster042(active), standbys: iccluster054
> > mds: cephfs-3/3/3 up  
> > {0=iccluster054.iccluster.epfl.ch=up:active,1=iccluster041.iccluster.epfl.ch=up:active,2=iccluster042.iccluster.epfl.ch=up:active}
> > osd: 18 osds: 18 up, 18 in
> >  
> >   data:
> > pools:   3 pools, 544 pgs
> > objects: 2357k objects, 564 GB
> > usage:   2011 GB used, 65055 GB / 67066 GB avail
> > pgs: 544 active+clean
> >  
> 
> 
> 
> > root@iccluster041:~# ceph --cluster container daemon 
> > mds.iccluster041.iccluster.epfl.ch perf dump mds
> > {
> > "mds": {
> > "request": 193508283,
> > "reply": 192815355,
> > "reply_latency": {
> > "avgcount": 192815355,
> > "sum": 457371.475011160,
> > "avgtime": 0.002372069
> > },
> > "forward": 692928,
> > "dir_fetch": 1717132,
> > "dir_commit": 43521,
> > "dir_split": 4197,
> > "dir_merge": 4244,
> > "inode_max": 2147483647,
> > "inodes": 11098,
> > "inodes_top": 7668,
> > "inodes_bottom": 3404,
> > "inodes_pin_tail": 26,
> > "inodes_pinned": 143,
> > "inodes_expired": 138623,
> > "inodes_with_caps": 87,
> > "caps": 239,
> > "subtrees": 15,
> > "traverse": 195425369,
> > "traverse_hit": 192867085,
> > "traverse_forward": 692723,
> > "traverse_discover": 476,
> > "traverse_dir_fetch": 1714684,
> > "traverse_remote_ino": 0,
> > "traverse_lock": 6,
> > "load_cent": 19465322425,
> > "q": 0,
> > "exported": 1211,
> > "exported_inodes": 845556,
> > "imported": 1082,
> > "imported_inodes": 1209280
> > }
> > }
> 
> > root@iccluster054:~# ceph --cluster container daemon 
> > mds.iccluster054.iccluster.epfl.ch perf dump mds
> > {
> > "mds": {
> > "request": 267620366,
> > "reply": 255792944,
> > "reply_latency": {
> > "avgcount": 255792944,
> > "sum": 42256.407340600,
> > "avgtime": 0.000165197
> > },
> > "forward": 11827411,
> > "dir_fetch": 183,
> > "dir_commit": 2607,
> > "dir_split": 27,
> > "dir_merge": 19,
> > "inode_max": 2147483647,
> > "inodes": 3740,
> > "inodes_top": 2517,
> > "inodes_bottom": 1149,
> > "inodes_pin_tail": 74,
> > "inodes_pinned": 143,
> > "inodes_expired": 2103018,
> > "inodes_with_caps": 57,
> > "caps": 272,
> > "subtrees": 8,
> > "traverse": 267626346,
> > "traverse_hit": 255796915,
> > "traverse_forward": 11826902,
> > "traverse_discover": 77,
> > "traverse_dir_fetch": 30,
> > "t

Re: [ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai

On 10/17/2017 09:27 AM, Sage Weil wrote:
> On Tue, 17 Oct 2017, Mohamad Gebai wrote:
>
>> It would be good to know if there are any, and maybe prepare for them?
> Adam added a new set of clock primitives that include a monotonic clock 
> option that should be used in all cases where we're measuring the passage 
> of time instead of the wall clock time.  There is a longstanding trello 
> card to go through and change the latency calculations to use the 
> monotonic clock.  There are probably dozens of places where an ill-timed 
> clock jump is liable to trigger some random assert.  It's just a matter of 
> going through and auditing calls to the legacy ceph_clock_now() method.
>

Thanks Sage. I assume that's the card you're referring to:
https://trello.com/c/SAtGPq0N/65-use-time-span-monotonic-for-durations

I can take of that one if no one else has started working on it.

Mohamad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unstable clock

2017-10-17 Thread Sage Weil
On Tue, 17 Oct 2017, Mohamad Gebai wrote:
> On 10/17/2017 09:27 AM, Sage Weil wrote:
> > On Tue, 17 Oct 2017, Mohamad Gebai wrote:
> >
> >> It would be good to know if there are any, and maybe prepare for them?
> > Adam added a new set of clock primitives that include a monotonic clock 
> > option that should be used in all cases where we're measuring the passage 
> > of time instead of the wall clock time.  There is a longstanding trello 
> > card to go through and change the latency calculations to use the 
> > monotonic clock.  There are probably dozens of places where an ill-timed 
> > clock jump is liable to trigger some random assert.  It's just a matter of 
> > going through and auditing calls to the legacy ceph_clock_now() method.
> >
> 
> Thanks Sage. I assume that's the card you're referring to:
> https://trello.com/c/SAtGPq0N/65-use-time-span-monotonic-for-durations
> 
> I can take of that one if no one else has started working on it.

That would be wonderful!  I'm pretty sure nobody else is looking at it so 
you win today.  :)

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous : 3 clients failing to respond to cache pressure

2017-10-17 Thread Yoann Moulin

>> I have a luminous (12.2.1) cluster with 3 nodes for cephfs (no rbd or rgw) 
>> and we hit the "X clients failing to respond to cache pressure" message.
>> I have 3 mds servers active.
> 
> What type of client? Kernel? FUSE?
> 
> If it's a kernel client, what kernel are you running?

Kernel client, version 4.10.0-35-generic; it's for a Kubernetes environment.

https://kubernetes.io/docs/concepts/storage/volumes/#cephfs
https://github.com/kubernetes/examples/tree/master/staging/volumes/cephfs/

containers use this yaml template :

https://github.com/kubernetes/examples/blob/master/staging/volumes/cephfs/cephfs.yaml

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unstable clock

2017-10-17 Thread Mohamad Gebai

On 10/17/2017 09:57 AM, Sage Weil wrote:
> On Tue, 17 Oct 2017, Mohamad Gebai wrote:
>>
>> Thanks Sage. I assume that's the card you're referring to:
>> https://trello.com/c/SAtGPq0N/65-use-time-span-monotonic-for-durations
>>
>> I can take of that one if no one else has started working on it.
> That would be wonderful!  I'm pretty sure nobody else is looking at it so 
> you win today.  :)
>

Great :) Anything I should do, like add my face to the card or something?

Mohamad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD are marked as down after jewel -> luminous upgrade

2017-10-17 Thread Daniel Carrasco
Hello,

Today I decided to upgrade my Ceph cluster to the latest LTS version. To do
it I used the steps posted in the release notes:
http://ceph.com/releases/v12-2-0-luminous-released/

After upgrading all the daemons I noticed that all OSD daemons are marked
as down even though they are all working, so the cluster goes down.
Maybe the problem is the command "ceph osd require-osd-release luminous",
but all OSDs are on the Luminous version.

-
-
# ceph versions
{
"mon": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2
},
"mds": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 10
}
}

-
-
# ceph osd versions
{
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2
}

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
-1   0.08780 root default
-2   0.04390 host alantra_fs-01
 0   ssd 0.04390 osd.0  up  1.0 1.0
-3   0.04390 host alantra_fs-02
 1   ssd 0.04390 osd.1  up  1.0 1.0
-4 0  host alantra_fs-03

-
-
# ceph -s
  cluster:
id: 5f8e66b5-1adc-4930-b5d8-c0f44dc2037e
health: HEALTH_WARN
nodown flag(s) set

  services:
mon: 3 daemons, quorum alantra_fs-02,alantra_fs-01,alantra_fs-03
mgr: alantra_fs-03(active), standbys: alantra_fs-01, alantra_fs-02
mds: cephfs-1/1/1 up  {0=alantra_fs-01=up:active}, 1 up:standby
osd: 2 osds: 2 up, 2 in
 flags nodown

  data:
pools:   3 pools, 192 pgs
objects: 40177 objects, 3510 MB
usage:   7486 MB used, 84626 MB / 92112 MB avail
pgs: 192 active+clean

  io:
client:   564 kB/s rd, 767 B/s wr, 33 op/s rd, 0 op/s wr

-
-
Log:
2017-10-17 16:15:25.466807 mon.alantra_fs-02 [INF] osd.0 marked down after
no beacon for 29.864632 seconds
2017-10-17 16:15:25.467557 mon.alantra_fs-02 [WRN] Health check failed: 1
osds down (OSD_DOWN)
2017-10-17 16:15:25.467587 mon.alantra_fs-02 [WRN] Health check failed: 1
host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:27.494526 mon.alantra_fs-02 [WRN] Health check failed:
Degraded data redundancy: 63 pgs unclean (PG_DEGRADED)
2017-10-17 16:15:27.501956 mon.alantra_fs-02 [INF] Health check cleared:
OSD_DOWN (was: 1 osds down)
2017-10-17 16:15:27.501997 mon.alantra_fs-02 [INF] Health check cleared:
OSD_HOST_DOWN (was: 1 host (1 osds) down)
2017-10-17 16:15:27.502012 mon.alantra_fs-02 [INF] Cluster is now healthy
2017-10-17 16:15:27.518798 mon.alantra_fs-02 [INF] osd.0
10.20.1.109:6801/3319 boot
2017-10-17 16:15:26.414023 osd.0 [WRN] Monitor daemon marked osd.0 down,
but it is still running
2017-10-17 16:15:30.470477 mon.alantra_fs-02 [INF] osd.1 marked down after
no beacon for 25.007336 seconds
2017-10-17 16:15:30.471014 mon.alantra_fs-02 [WRN] Health check failed: 1
osds down (OSD_DOWN)
2017-10-17 16:15:30.471047 mon.alantra_fs-02 [WRN] Health check failed: 1
host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:30.532427 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1
osds down; 1 host (1 osds) down; Degraded data redundancy: 63 pgs unclean
2017-10-17 16:15:31.590661 mon.alantra_fs-02 [INF] Health check cleared:
PG_DEGRADED (was: Degraded data redundancy: 63 pgs unclean)
2017-10-17 16:15:34.703027 mon.alantra_fs-02 [INF] Health check cleared:
OSD_DOWN (was: 1 osds down)
2017-10-17 16:15:34.703061 mon.alantra_fs-02 [INF] Health check cleared:
OSD_HOST_DOWN (was: 1 host (1 osds) down)
2017-10-17 16:15:34.703078 mon.alantra_fs-02 [INF] Cluster is now healthy
2017-10-17 16:15:34.714002 mon.alantra_fs-02 [INF] osd.1
10.20.1.97:6801/2310 boot
2017-10-17 16:15:33.614640 osd.1 [WRN] Monitor daemon marked osd.1 down,
but it is still running
2017-10-17 

Re: [ceph-users] OSD are marked as down after jewel -> luminous upgrade

2017-10-17 Thread Marc Roos
Did you check this?

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39886.html 








-Original Message-
From: Daniel Carrasco [mailto:d.carra...@i2tic.com] 
Sent: Tuesday, 17 October 2017 17:49
To: ceph-us...@ceph.com
Subject: [ceph-users] OSD are marked as down after jewel -> luminous 
upgrade

Hello,

Today I've decided to upgrade my Ceph cluster to latest LTS version. To 
do it I've used the steps posted on release notes:
http://ceph.com/releases/v12-2-0-luminous-released/

After upgrade all the daemons I've noticed that all OSD daemons are 
marked as down even when all are working, so the cluster becomes down.

Maybe the problem is the command "ceph osd require-osd-release 
luminous", but all OSD are on Luminous version.


-


-

# ceph versions
{
"mon": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 2
},
"mds": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 10
}
}


-


-

# ceph osd versions
{
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
luminous (stable)": 2 }

# ceph osd tree

ID CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF 
-1   0.08780 root default   
-2   0.04390 host alantra_fs-01 
 0   ssd 0.04390 osd.0  up  1.0 1.0 
-3   0.04390 host alantra_fs-02 
 1   ssd 0.04390 osd.1  up  1.0 1.0 
-4 0  host alantra_fs-03


-


-

# ceph -s
  cluster:
id: 5f8e66b5-1adc-4930-b5d8-c0f44dc2037e
health: HEALTH_WARN
nodown flag(s) set
 
  services:
mon: 3 daemons, quorum alantra_fs-02,alantra_fs-01,alantra_fs-03
mgr: alantra_fs-03(active), standbys: alantra_fs-01, alantra_fs-02
mds: cephfs-1/1/1 up  {0=alantra_fs-01=up:active}, 1 up:standby
osd: 2 osds: 2 up, 2 in
 flags nodown
 
  data:
pools:   3 pools, 192 pgs
objects: 40177 objects, 3510 MB
usage:   7486 MB used, 84626 MB / 92112 MB avail
pgs: 192 active+clean
 
  io:
client:   564 kB/s rd, 767 B/s wr, 33 op/s rd, 0 op/s wr


-


-
Log:
2017-10-17 16:15:25.466807 mon.alantra_fs-02 [INF] osd.0 marked down 
after no beacon for 29.864632 seconds
2017-10-17 16:15:25.467557 mon.alantra_fs-02 [WRN] Health check failed: 
1 osds down (OSD_DOWN)
2017-10-17 16:15:25.467587 mon.alantra_fs-02 [WRN] Health check failed: 
1 host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:27.494526 mon.alantra_fs-02 [WRN] Health check failed: 
Degraded data redundancy: 63 pgs unclean (PG_DEGRADED)
2017-10-17 16:15:27.501956 mon.alantra_fs-02 [INF] Health check cleared: 
OSD_DOWN (was: 1 osds down)
2017-10-17 16:15:27.501997 mon.alantra_fs-02 [INF] Health check cleared: 
OSD_HOST_DOWN (was: 1 host (1 osds) down)
2017-10-17 16:15:27.502012 mon.alantra_fs-02 [INF] Cluster is now 
healthy
2017-10-17 16:15:27.518798 mon.alantra_fs-02 [INF] osd.0 
10.20.1.109:6801/3319 boot
2017-10-17 16:15:26.414023 osd.0 [WRN] Monitor daemon marked osd.0 down, 
but it is still running
2017-10-17 16:15:30.470477 mon.alantra_fs-02 [INF] osd.1 marked down 
after no beacon for 25.007336 seconds
2017-10-17 16:15:30.471014 mon.alantra_fs-02 [WRN] Health check failed: 
1 osds down (OSD_DOWN)
2017-10-17 16:15:30.471047 mon.alantra_fs-02 [WRN] Health check failed: 
1 host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:30.532427 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1 
osds down; 1 host (1 osds) down; Degraded data redundancy: 63 pgs 
unclean
2017-10-17 16:15:31.590661 mon.alantra_fs-02 [INF] Health check cleared: 
PG_DEGRADED (was: Degraded data redundancy: 63 pgs unclean)
2017-10-17 16:15:34.703027 mon.a

[ceph-users] OSD crashed while reparing inconsistent PG luminous

2017-10-17 Thread Ana Aviles
Hello all,

We had an inconsistent PG on our cluster. While performing the PG repair
operation, the OSD crashed. The OSD was not able to start again,
and there was no hardware failure on the disk itself. This is the log output:

2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log
[ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log
[ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const
ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)'
thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
recovery_info.ss.clone_snaps.end())

 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x56236c8ff3f2]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
const&, std::shared_ptr, bool,
ObjectStore::Transaction*)+0xd63) [0x56236c476213]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&,
PullOp*, std::__cxx11::list >*,
ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
 4:
(ReplicatedBackend::_do_pull_response(boost::intrusive_ptr)+0x2b5)
[0x56236c60dd75]
 5:
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x20c)
[0x56236c61196c]
 6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50)
[0x56236c521aa0]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
 8: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9)
[0x56236c3091a9]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x57) [0x56236c5a2ae7]
 10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884)
[0x56236c9041e4]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
 13: (()+0x76ba) [0x7f2366be96ba]
 14: (clone()+0x6d) [0x7f2365c603dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Thanks!

Ana


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crashed while repairing inconsistent PG luminous

2017-10-17 Thread Cassiano Pilipavicius
Hello, I have a problem with OSDs crashing after upgrading to
bluestore/luminous, due to the fact that I was using jemalloc, and it
seems that there is a bug with bluestore OSDs and jemalloc. Changing to
tcmalloc solved my issues. Don't know if you have the same issue, but in
my environment the OSDs crashed mainly while repairing, or when there
was a high load on the cluster.
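
If jemalloc was enabled the usual way on Debian/Ubuntu, i.e. via an
LD_PRELOAD line in /etc/default/ceph (path and mechanism may differ per
distro), switching back is roughly: comment that line out again and then,
one host at a time,

# systemctl restart ceph-osd.target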



On 10/17/2017 2:51 PM, Ana Aviles wrote:

Hello all,

We had an inconsistent PG on our cluster. While performing PG repair
operation the OSD crashed. The OSD was not able to start again anymore,
and there was no hardware failure on the disk itself. This is the log output

2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log
[ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log
[ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const
ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)'
thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
recovery_info.ss.clone_snaps.end())

  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x56236c8ff3f2]
  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
const&, std::shared_ptr, bool,
ObjectStore::Transaction*)+0xd63) [0x56236c476213]
  3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&,
PullOp*, std::__cxx11::list >*,
ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
  4:
(ReplicatedBackend::_do_pull_response(boost::intrusive_ptr)+0x2b5)
[0x56236c60dd75]
  5:
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x20c)
[0x56236c61196c]
  6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50)
[0x56236c521aa0]
  7: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
  8: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9)
[0x56236c3091a9]
  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x57) [0x56236c5a2ae7]
  10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884)
[0x56236c9041e4]
  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
  13: (()+0x76ba) [0x7f2366be96ba]
  14: (clone()+0x6d) [0x7f2365c603dd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Thanks!

Ana


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD are marked as down after jewel -> luminous upgrade

2017-10-17 Thread Daniel Carrasco
Thanks!!

I'll take a look later.

Anyway, all my Ceph daemons are in same version on all nodes (I've upgraded
the whole cluster).

Cheers!!

On 17 Oct 2017, 6:39 p.m., "Marc Roos" wrote:

Did you check this?

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39886.html








-Original Message-
From: Daniel Carrasco [mailto:d.carra...@i2tic.com]
Sent: dinsdag 17 oktober 2017 17:49
To: ceph-us...@ceph.com
Subject: [ceph-users] OSD are marked as down after jewel -> luminous
upgrade

Hello,

Today I've decided to upgrade my Ceph cluster to latest LTS version. To
do it I've used the steps posted on release notes:
http://ceph.com/releases/v12-2-0-luminous-released/

After upgrade all the daemons I've noticed that all OSD daemons are
marked as down even when all are working, so the cluster becomes down.

Maybe the problem is the command "ceph osd require-osd-release
luminous", but all OSD are on Luminous version.


-


-

# ceph versions
{
"mon": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2
},
"mds": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2
},
"overall": {
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 10
}
}


-


-

# ceph osd versions
{
"ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
luminous (stable)": 2 }

# ceph osd tree

ID CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
-1   0.08780 root default
-2   0.04390 host alantra_fs-01
 0   ssd 0.04390 osd.0  up  1.0 1.0
-3   0.04390 host alantra_fs-02
 1   ssd 0.04390 osd.1  up  1.0 1.0
-4 0  host alantra_fs-03


-


-

# ceph -s
  cluster:
id: 5f8e66b5-1adc-4930-b5d8-c0f44dc2037e
health: HEALTH_WARN
nodown flag(s) set

  services:
mon: 3 daemons, quorum alantra_fs-02,alantra_fs-01,alantra_fs-03
mgr: alantra_fs-03(active), standbys: alantra_fs-01, alantra_fs-02
mds: cephfs-1/1/1 up  {0=alantra_fs-01=up:active}, 1 up:standby
osd: 2 osds: 2 up, 2 in
 flags nodown

  data:
pools:   3 pools, 192 pgs
objects: 40177 objects, 3510 MB
usage:   7486 MB used, 84626 MB / 92112 MB avail
pgs: 192 active+clean

  io:
client:   564 kB/s rd, 767 B/s wr, 33 op/s rd, 0 op/s wr


-


-
Log:
2017-10-17 16:15:25.466807 mon.alantra_fs-02 [INF] osd.0 marked down
after no beacon for 29.864632 seconds
2017-10-17 16:15:25.467557 mon.alantra_fs-02 [WRN] Health check failed:
1 osds down (OSD_DOWN)
2017-10-17 16:15:25.467587 mon.alantra_fs-02 [WRN] Health check failed:
1 host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:27.494526 mon.alantra_fs-02 [WRN] Health check failed:
Degraded data redundancy: 63 pgs unclean (PG_DEGRADED)
2017-10-17 16:15:27.501956 mon.alantra_fs-02 [INF] Health check cleared:
OSD_DOWN (was: 1 osds down)
2017-10-17 16:15:27.501997 mon.alantra_fs-02 [INF] Health check cleared:
OSD_HOST_DOWN (was: 1 host (1 osds) down)
2017-10-17 16:15:27.502012 mon.alantra_fs-02 [INF] Cluster is now
healthy
2017-10-17 16:15:27.518798 mon.alantra_fs-02 [INF] osd.0
10.20.1.109:6801/3319 boot
2017-10-17 16:15:26.414023 osd.0 [WRN] Monitor daemon marked osd.0 down,
but it is still running
2017-10-17 16:15:30.470477 mon.alantra_fs-02 [INF] osd.1 marked down
after no beacon for 25.007336 seconds
2017-10-17 16:15:30.471014 mon.alantra_fs-02 [WRN] Health check failed:
1 osds down (OSD_DOWN)
2017-10-17 16:15:30.471047 mon.alantra_fs-02 [WRN] Health check failed:
1 host (1 osds) down (OSD_HOST_DOWN)
2017-10-17 16:15:30.532427 mon.alantra_fs-02 [WRN] overall HEALTH_WARN 1
osds down; 1 host (1 osds) down; Degraded data redundancy: 63 pgs
unclean
2017-10-17 16:15:31.590661 mon.alantra_fs-02 [INF] Health check cleared:
PG_DEGRADED (was

[ceph-users] Help with full osd and RGW not responsive

2017-10-17 Thread Bryan Banister
Hi all,

Still a real novice here and we didn't set up our initial RGW cluster very 
well.  We have 134 osds and set up our RGW pool with only 64 PGs, thus not all 
of our OSDs got data and now we have one that is 95% full.

This apparently has put the cluster into a HEALTH_ERR condition:
[root@carf-ceph-osd01 ~]# ceph health detail
HEALTH_ERR full flag(s) set; 1 full osd(s); 1 pools have many more objects per 
pg than average; application not enabled on 6 pool(s); too few PGs per OSD (26 
< min 30)
OSDMAP_FLAGS full flag(s) set
OSD_FULL 1 full osd(s)
osd.5 is full
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool carf01.rgw.buckets.data objects per pg (602762) is more than 18.3752 
times cluster average (32803)

There is plenty of space on most of the OSDs and we don't know how to go about
fixing this situation.  If we update the pg_num and pgp_num settings for this
pool, can we rebalance the data across the OSDs?
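
For example, would something along these lines be the right direction
(the numbers are only placeholders, not settled values):

# ceph osd pool set carf01.rgw.buckets.data pg_num 512
# ceph osd pool set carf01.rgw.buckets.data pgp_num 512
# ceph osd reweight 5 0.85

i.e. raise the PG count of the data pool and temporarily lower the weight
of the full osd.5 until the data spreads out?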

Also, seems like this is causing a problem with the RGWs, which was reporting 
this error in the logs:
2017-10-16 16:36:47.534461 7fffe6c5c700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffdc447700' had timed out after 600

After trying to restart the RGW, we see this now:
2017-10-17 10:40:38.517002 7fffe6c5c700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffddc4a700' had timed out after 600
2017-10-17 10:40:42.124046 77fd4e00  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2017-10-17 10:40:42.124162 77fd4e00  0 ceph version 12.2.0 
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), 
pid 65313
2017-10-17 10:40:42.245259 77fd4e00  0 client.769905.objecter  FULL, paused 
modify 0x5662fb00 tid 0
2017-10-17 10:45:42.124283 7fffe7bcf700 -1 Initialization timeout, failed to 
initialize
2017-10-17 10:45:42.353496 77fd4e00  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2017-10-17 10:45:42.353618 77fd4e00  0 ceph version 12.2.0 
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), 
pid 71842
2017-10-17 10:45:42.388621 77fd4e00  0 client.769986.objecter  FULL, paused 
modify 0x5662fb00 tid 0
2017-10-17 10:50:42.353731 7fffe7bcf700 -1 Initialization timeout, failed to 
initialize

Seems pretty evident that the "FULL, paused" is a problem.  So if I fix the 
first issue the RGW should be ok after?

Thanks in advance,
-Bryan



Note: This email is for the confidential use of the named addressee(s) only and 
may contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you are hereby notified that any review, dissemination 
or copying of this email is strictly prohibited, and to please notify the 
sender immediately and destroy this email and any attachments. Email 
transmission cannot be guaranteed to be secure or error-free. The Company, 
therefore, does not make any guarantees as to the completeness or accuracy of 
this email or any attachments. This email is for informational purposes only 
and does not constitute a recommendation, offer, request or solicitation of any 
kind to buy, sell, subscribe, redeem or perform any type of transaction of a 
financial product.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thick provisioning

2017-10-17 Thread Jason Dillaman
There is no existing option to thick provision images within RBD. When
an image is created or cloned, the only actions that occur are some
small metadata updates to describe the image. This allows image
creation to be a quick, constant time operation regardless of the
image size. To thick provision the entire image would require writing
data to the entire image and ensuring discard support is disabled to
prevent the OS from releasing space back (and thus re-sparsifying the
image).
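
As a rough illustration only (image name and size are made up; this is a
sketch, not a supported "thick provision" option): you could create the
image and fill it once with zeros before putting it into use, e.g. via a
kernel mapping, assuming the map comes back as /dev/rbd0:

# rbd create rbd/thickvol --size 100G
# rbd map rbd/thickvol
# dd if=/dev/zero of=/dev/rbd0 bs=4M oflag=direct
# rbd unmap /dev/rbd0

and then keep discard/TRIM disabled on top of it, for the reason above.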

On Mon, Oct 16, 2017 at 10:49 AM,   wrote:
> Hi,
>
> I have deployed a Ceph cluster (Jewel). By default all block devices that
> are created are thin provisioned.
>
> Is it possible to change this setting? I would like to have that all
> created block devices are thick provisioned.
>
> In front of the Ceph cluster, I am running Openstack.
>
> Thanks!
>
> Sinan
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-17 Thread Jiri Horky
Hi list,

we are thinking of building relatively big CEPH-based object storage for
storage of our sample files - we have about 700M files ranging from very
small (1-4KiB) files to pretty big ones (several GiB). Median of file
size is 64KiB. Since the required space is relatively large (1PiB of
usable storage), we are thinking of utilizing erasure coding for this
case. On the other hand, we need to achieve at least 1200MiB/s
throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).

Since the EC is per-object, the small objects will be striped into even
smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
object in this scenario -> number of required IOPS when using EC is
relatively high. Some vendors (such as Hitachi, but I believe EMC as
well) do offline, predefined-chunk size EC instead. The idea is to first
write objects with replication factor of 3, wait for enough objects to
fill 4x 64MiB chunks and only do EC on that. This not only makes the EC
less computationally intensive and repairs much faster, but it also
allows reading the majority of the small objects directly, by reading just
part of one of the chunks (assuming a non-degraded state) - one
chunk actually contains the whole object.
I wonder if something similar is already possible with CEPH and/or is
planned. For our use case of very small objects, it would mean close to a
3-4x improvement in terms of required IOPS.

Another option to get out of this situation would be the ability to specify
different storage pools/policies based on file size - i.e. to do 3x
replication of the very small files and only use EC for the bigger files,
where the performance hit of 4x IOPS won't be that painful. But I am
afraid this is not possible...

Any other hint is sincerely welcome.

Thank you
Jiri Horky

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] To check RBD cache enabled

2017-10-17 Thread Josy

Hi,


I am following this article :

http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/

I have enabled this flag in ceph.conf

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/



But the command to show the conf is not working :

[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon  
/etc/ceph/ceph.client.admin.keyring config show
admin_socket: exception getting command descriptions: [Errno 111] 
Connection refused


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Jean-Charles Lopez
Hi 

syntax uses the admin socket file : ceph --admin-daemon 
/var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache

Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your connection 
is using client.admin to connect to the cluster and your cluster name is set to 
the default of ceph. But obviously can’t know from here the PID and the CCTID 
you will have to identify.

You can actually do a ls /var/run/ceph to find the correct admin socket file
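
For example (the PID and CCTID in the file name below are made-up values,
yours will differ; the name follows the $cluster-$type.$id template from
the ceph.conf snippet above):

# ls /var/run/ceph/
ceph-client.admin.3721.140225533421568.asok
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.3721.140225533421568.asok config get rbd_cache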

Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

> On Oct 17, 2017, at 12:50, Josy  wrote:
> 
> Hi,
> 
> 
> I am following this article :
> 
> http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/ 
> 
> I have enabled this flag in ceph.conf
> 
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/
> 
> But the command to show the conf is not working : 
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon  
> /etc/ceph/ceph.client.admin.keyring config show
> admin_socket: exception getting command descriptions: [Errno 111] Connection 
> refused
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
hi guys, any tip or help ?

On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario 
wrote:

> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with Blue
> store (the disk is SATA, WAL and DB are on NVME).
>
> I've issued a:
> * ceph osd crush reweight osd_id 0
> * systemctl stop (osd I'd daemon)
> * umount /var/lib/ceph/osd/osd_id
> * ceph osd destroy osd_id
>
> everything seems ok, but if I leave everything as is (until I wait for the
> replacement disk) I can see that dmesg errors on writing to the device are
> still appearing.
>
> The osd is of course down and out of the crushmap.
> Am I missing something? Like a step to execute or something else?
>
> hoping to get help.
> best.
>
> ​alejandrito
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with full osd and RGW not responsive

2017-10-17 Thread Andreas Calminder
Hi,
You should most definitely look over number of pgs, there's a pg calculator
available here: http://ceph.com/pgcalc/

You can increase pgs but not the other way around (
http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/)

To solve the immediate problem with your cluster being full you can
reweight your osds: giving the full osd a lower weight will cause writes
to go to other osds and data on that osd to be migrated to other osds in
the cluster: ceph osd reweight $OSDNUM $WEIGHT, described here
http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem

When the osd isn't above the full threshold (default is 95%), the cluster
will clear its full flag and your radosgw should start accepting write
operations again, at least until another osd gets full. The main problem
here is probably the low pg count.
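
For example (120 is simply the default threshold, shown here for
illustration; the test variant only reports what would change):

# ceph osd test-reweight-by-utilization 120
# ceph osd reweight-by-utilization 120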

Regards,
Andreas

On 17 Oct 2017 19:08, "Bryan Banister"  wrote:

Hi all,



Still a real novice here and we didn’t set up our initial RGW cluster very
well.  We have 134 osds and set up our RGW pool with only 64 PGs, thus not
all of our OSDs got data and now we have one that is 95% full.



This apparently has put the cluster into a HEALTH_ERR condition:

[root@carf-ceph-osd01 ~]# ceph health detail

HEALTH_ERR full flag(s) set; 1 full osd(s); 1 pools have many more objects
per pg than average; application not enabled on 6 pool(s); too few PGs per
OSD (26 < min 30)

OSDMAP_FLAGS full flag(s) set

OSD_FULL 1 full osd(s)

osd.5 is full

MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average

pool carf01.rgw.buckets.data objects per pg (602762) is more than
18.3752 times cluster average (32803)



There is plenty of space on most of the OSDs and don’t know how to go about
fixing this situation.  If we update the pg_num and pgp_num settings for
this pool, can we rebalance the data across the OSDs?



Also, seems like this is causing a problem with the RGWs, which was
reporting this error in the logs:

2017-10-16 16:36:47.534461 7fffe6c5c700  1 heartbeat_map is_healthy
'RGWAsyncRadosProcessor::m_tp thread 0x7fffdc447700' had timed out after 600



After trying to restart the RGW, we see this now:

2017-10-17 10:40:38.517002 7fffe6c5c700  1 heartbeat_map is_healthy
'RGWAsyncRadosProcessor::m_tp thread 0x7fffddc4a700' had timed out after 600

2017-10-17 10:40:42.124046 77fd4e00  0 deferred set uid:gid to 167:167
(ceph:ceph)

2017-10-17 10:40:42.124162 77fd4e00  0 ceph version 12.2.0 (
32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown),
pid 65313

2017-10-17 10:40:42.245259 77fd4e00  0 client.769905.objecter  FULL,
paused modify 0x5662fb00 tid 0

2017-10-17 10:45:42.124283 7fffe7bcf700 -1 Initialization timeout, failed
to initialize

2017-10-17 10:45:42.353496 77fd4e00  0 deferred set uid:gid to 167:167
(ceph:ceph)

2017-10-17 10:45:42.353618 77fd4e00  0 ceph version 12.2.0 (
32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown),
pid 71842

2017-10-17 10:45:42.388621 77fd4e00  0 client.769986.objecter  FULL,
paused modify 0x5662fb00 tid 0

2017-10-17 10:50:42.353731 7fffe7bcf700 -1 Initialization timeout, failed
to initialize



Seems pretty evident that the “FULL, paused” is a problem.  So if I fix the
first issue the RGW should be ok after?



Thanks in advance,

-Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Help with full osd and RGW not responsive

2017-10-17 Thread Bryan Banister
Thanks for the response, we increased our pg count to something more reasonable 
(512 for now) and things are rebalancing.

Cheers,
-Bryan

From: Andreas Calminder [mailto:andreas.calmin...@klarna.com]
Sent: Tuesday, October 17, 2017 3:48 PM
To: Bryan Banister 
Cc: Ceph Users 
Subject: Re: [ceph-users] Help with full osd and RGW not responsive

Note: External Email

Hi,
You should most definitely look over number of pgs, there's a pg calculator 
available here: http://ceph.com/pgcalc/

You can increase pgs but not the other way around 
(http://docs.ceph.com/docs/jewel/rados/operations/placement-groups/)

To solve the immediate problem with your cluster being full you can reweight 
your osds, giving the full osd a lower weight will cause writes going to other 
osds and data on that osd being migrated to other osds in the cluster: ceph osd 
reweight $OSDNUM $WEIGHT, described here 
http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem

When the osd isn't above the full threshold, default is 95%, the cluster will 
clear its full flag and your radosgw should start accepting write operations 
again, at least until another osd gets full, main problem here is probably the 
low pg count.

Regards,
Andreas

On 17 Oct 2017 19:08, "Bryan Banister" <bbanis...@jumptrading.com> wrote:
Hi all,

Still a real novice here and we didn’t set up our initial RGW cluster very 
well.  We have 134 osds and set up our RGW pool with only 64 PGs, thus not all 
of our OSDs got data and now we have one that is 95% full.

This apparently has put the cluster into a HEALTH_ERR condition:
[root@carf-ceph-osd01 ~]# ceph health detail
HEALTH_ERR full flag(s) set; 1 full osd(s); 1 pools have many more objects per 
pg than average; application not enabled on 6 pool(s); too few PGs per OSD (26 
< min 30)
OSDMAP_FLAGS full flag(s) set
OSD_FULL 1 full osd(s)
osd.5 is full
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool carf01.rgw.buckets.data objects per pg (602762) is more than 18.3752 
times cluster average (32803)

There is plenty of space on most of the OSDs and don’t know how to go about 
fixing this situation.  If we update the pg_num and pgp_num settings for this 
pool, can we rebalance the data across the OSDs?

Also, seems like this is causing a problem with the RGWs, which was reporting 
this error in the logs:
2017-10-16 16:36:47.534461 7fffe6c5c700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffdc447700' had timed out after 600

After trying to restart the RGW, we see this now:
2017-10-17 10:40:38.517002 7fffe6c5c700  1 heartbeat_map is_healthy 
'RGWAsyncRadosProcessor::m_tp thread 0x7fffddc4a700' had timed out after 600
2017-10-17 10:40:42.124046 77fd4e00  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2017-10-17 10:40:42.124162 77fd4e00  0 ceph version 12.2.0 
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), 
pid 65313
2017-10-17 10:40:42.245259 77fd4e00  0 client.769905.objecter  FULL, paused 
modify 0x5662fb00 tid 0
2017-10-17 10:45:42.124283 7fffe7bcf700 -1 Initialization timeout, failed to 
initialize
2017-10-17 10:45:42.353496 77fd4e00  0 deferred set uid:gid to 167:167 
(ceph:ceph)
2017-10-17 10:45:42.353618 77fd4e00  0 ceph version 12.2.0 
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process (unknown), 
pid 71842
2017-10-17 10:45:42.388621 77fd4e00  0 client.769986.objecter  FULL, paused 
modify 0x5662fb00 tid 0
2017-10-17 10:50:42.353731 7fffe7bcf700 -1 Initialization timeout, failed to 
initialize

Seems pretty evident that the “FULL, paused” is a problem.  So if I fix the 
first issue the RGW should be ok after?

Thanks in advance,
-Bryan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Jamie Fargen
Alejandro-
Please provide the following information:
1) Include an example of an actual message you are seeing in dmesg.
2) Provide the output of # ceph status
3) Provide the output of # ceph osd tree

Regards,
Jamie Fargen



On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario 
wrote:

> hi guys, any tip or help ?
>
> On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with Blue
>> store (the disk is SATA, WAL and DB are on NVME).
>>
>> I've issued a:
>> * ceph osd crush reweight osd_id 0
>> * systemctl stop (osd I'd daemon)
>> * umount /var/lib/ceph/osd/osd_id
>> * ceph osd destroy osd_id
>>
>> everything seems of, but if I left everything as is ( until I wait for
>> the replaced disk ) I can see that dmesg errors on writing on the device
>> are still appearing.
>>
>> The osd is of course down and out the crushmap.
>> am I missing something ? like a step to execute or something else ?
>>
>> hoping to get help.
>> best.
>>
>> ​alejandrito
>>
>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54911 3770 1857
> _
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Jamie Fargen
Consultant
jfar...@redhat.com
813-817-4430
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to increase the size of requests written to a ceph image

2017-10-17 Thread Russell Glaue
I am running ceph jewel on 5 nodes with SSD OSDs.
I have an LVM image on a local RAID of spinning disks.
I have an RBD image in a pool of SSD disks.
Both disks are used to run an almost identical CentOS 7 system.
Both systems were installed with the same kickstart, though the disk
partitioning is different.

I want to make writes on the ceph image faster. For example, lots of
writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x
slower than on a spindle RAID disk image. The MySQL server on the ceph rbd
image has a hard time keeping up in replication.

So I wanted to test writes on these two systems
I have a 10GB compressed (gzip) file on both servers.
I simply gunzip the file on both systems, while running iostat.

The primary difference I see in the results is the average size of the
requests to the disk.
CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of its
requests is about 40x larger, but the number of writes per second is
about the same.
This makes me want to conclude that the smaller request size on the
CentOS7-ceph-rbd-ssd system is the cause of it being slow.


How can I make the size of the request larger for ceph rbd images, so I can
increase the write throughput?
Would this be related to having jumbo packets enabled in my ceph storage
network?
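
For what it's worth, the image's own layout is also worth knowing here,
since requests get split at object/stripe boundaries (image spec below is
just an example); "rbd info" shows the order/object size and any custom
stripe unit / stripe count:

$ rbd info ssdpool/mysql01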


Here is a sample of the results:

[CentOS7-lvm-raid-sata]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_var -d 5 -m -N
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_var 0.00 0.00   30.60  452.2013.60   222.15  1000.04
8.69   14.050.99   14.93   2.07 100.04
vg_root-lv_var 0.00 0.00   88.20  182.0039.2089.43   974.95
4.659.820.99   14.10   3.70 100.00
vg_root-lv_var 0.00 0.00   75.45  278.2433.53   136.70   985.73
4.36   33.261.34   41.91   0.59  20.84
vg_root-lv_var 0.00 0.00  111.60  181.8049.6089.34   969.84
2.608.870.81   13.81   0.13   3.90
vg_root-lv_var 0.00 0.00   68.40  109.6030.4053.63   966.87
1.518.460.84   13.22   0.80  14.16
...

[CentOS7-ceph-rbd-ssd]
$ gunzip large10gFile.gz &
$ iostat -x vg_root-lv_data -d 5 -m -N
Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
...
vg_root-lv_data 0.00 0.00   46.40  167.80 0.88 1.46
 22.36 1.235.662.476.54   4.52  96.82
vg_root-lv_data 0.00 0.00   16.60   55.20 0.36 0.14
 14.44 0.99   13.919.12   15.36  13.71  98.46
vg_root-lv_data 0.00 0.00   69.00  173.80 1.34 1.32
 22.48 1.255.193.775.75   3.94  95.68
vg_root-lv_data 0.00 0.00   74.40  293.40 1.37 1.47
 15.83 1.223.312.063.63   2.54  93.26
vg_root-lv_data 0.00 0.00   90.80  359.00 1.96 3.41
 24.45 1.633.631.944.05   2.10  94.38
...

[iostat key]
w/s == The number (after merges) of write requests completed per second for
the device.
wMB/s == The number of sectors (kilobytes, megabytes) written to the device
per second.
avgrq-sz == The average size (in kilobytes) of the requests that were
issued to the device.
avgqu-sz == The average queue length of the requests that were issued to
the device.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Josy

Thanks for the reply.

I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin 
file to all nodes.


-
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/client.log
debug rbd = 20
debug librbd = 20
rbd_non_blocking_aio = false
--


However the config show command still shows it as true.
---
[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon 
/var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep 
"rbd_non_blocking_aio"

    "rbd_non_blocking_aio": "true",
---

Did I miss something ?


On 18-10-2017 01:22, Jean-Charles Lopez wrote:

Hi

syntax uses the admin socket file : ceph 
--admin-daemon /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok 
config get rbd_cache


Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your 
connection is using client.admin to connect to the cluster and your 
cluster name is set to the default of ceph. But obviously can’t know 
from here the PID and the CCTID you will have to identify.


You can actually do a ls /var/run/ceph to find the correct admin 
socket file


Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

On Oct 17, 2017, at 12:50, Josy wrote:


Hi,


I am following this article :

http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/

I have enabled this flag in ceph.conf

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/



But the command to show the conf is not working :

[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon 
/etc/ceph/ceph.client.admin.keyring config show
admin_socket: exception getting command descriptions: [Errno 111] 
Connection refused


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-17 Thread Jason Dillaman
Take this with a grain of salt, but you could try passing
"min_io_size=<size>,opt_io_size=<size>"
as part of QEMU's HD device parameters to see if the OS picks up the
larger IO defaults and actually uses them:

$ qemu <...snip...> -device
driver=scsi-hd,<...snip...>,min_io_size=32768,opt_io_size=4194304


On Tue, Oct 17, 2017 at 5:12 PM, Russell Glaue  wrote:
> I am running ceph jewel on 5 nodes with SSD OSDs.
> I have an LVM image on a local RAID of spinning disks.
> I have an RBD image on in a pool of SSD disks.
> Both disks are used to run an almost identical CentOS 7 system.
> Both systems were installed with the same kickstart, though the disk
> partitioning is different.
>
> I want to make writes on the the ceph image faster. For example, lots of
> writes to MySQL (via MySQL replication) on a ceph SSD image are about 10x
> slower than on a spindle RAID disk image. The MySQL server on ceph rbd image
> has a hard time keeping up in replication.
>
> So I wanted to test writes on these two systems
> I have a 10GB compressed (gzip) file on both servers.
> I simply gunzip the file on both systems, while running iostat.
>
> The primary difference I see in the results is the average size of the
> request to the disk.
> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of the
> request is about 40x, but the number of writes per second is about the same
> This makes me want to conclude that the smaller size of the request for
> CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>
>
> How can I make the size of the request larger for ceph rbd images, so I can
> increase the write throughput?
> Would this be related to having jumbo packets enabled in my ceph storage
> network?
>
>
> Here is a sample of the results:
>
> [CentOS7-lvm-raid-sata]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_var -d 5 -m -N
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> ...
> vg_root-lv_var 0.00 0.00   30.60  452.2013.60   222.15  1000.04
> 8.69   14.050.99   14.93   2.07 100.04
> vg_root-lv_var 0.00 0.00   88.20  182.0039.2089.43   974.95
> 4.659.820.99   14.10   3.70 100.00
> vg_root-lv_var 0.00 0.00   75.45  278.2433.53   136.70   985.73
> 4.36   33.261.34   41.91   0.59  20.84
> vg_root-lv_var 0.00 0.00  111.60  181.8049.6089.34   969.84
> 2.608.870.81   13.81   0.13   3.90
> vg_root-lv_var 0.00 0.00   68.40  109.6030.4053.63   966.87
> 1.518.460.84   13.22   0.80  14.16
> ...
>
> [CentOS7-ceph-rbd-ssd]
> $ gunzip large10gFile.gz &
> $ iostat -x vg_root-lv_data -d 5 -m -N
> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> ...
> vg_root-lv_data 0.00 0.00   46.40  167.80 0.88 1.4622.36
> 1.235.662.476.54   4.52  96.82
> vg_root-lv_data 0.00 0.00   16.60   55.20 0.36 0.1414.44
> 0.99   13.919.12   15.36  13.71  98.46
> vg_root-lv_data 0.00 0.00   69.00  173.80 1.34 1.3222.48
> 1.255.193.775.75   3.94  95.68
> vg_root-lv_data 0.00 0.00   74.40  293.40 1.37 1.4715.83
> 1.223.312.063.63   2.54  93.26
> vg_root-lv_data 0.00 0.00   90.80  359.00 1.96 3.4124.45
> 1.633.631.944.05   2.10  94.38
> ...
>
> [iostat key]
> w/s == The number (after merges) of write requests completed per second for
> the device.
> wMB/s == The number of sectors (kilobytes, megabytes) written to the device
> per second.
> avgrq-sz == The average size (in kilobytes) of the requests that were issued
> to the device.
> avgqu-sz == The average queue length of the requests that were issued to the
> device.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Jason Dillaman
Did you restart the librbd client application after updating the config value?

On Tue, Oct 17, 2017 at 5:29 PM, Josy  wrote:
> Thanks for the reply.
>
> I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin file
> to all nodes.
>
> -
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/client.log
> debug rbd = 20
> debug librbd = 20
> rbd_non_blocking_aio = false
> --
>
>
> However the config show command still shows it as true.
> ---
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
> /var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep
> "rbd_non_blocking_aio"
> "rbd_non_blocking_aio": "true",
> ---
>
> Did I miss something ?
>
>
> On 18-10-2017 01:22, Jean-Charles Lopez wrote:
>
> Hi
>
> syntax uses the admin socket file : ceph --admin-daemon
> /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache
>
> Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your
> connection is using client.admin to connect to the cluster and your cluster
> name is set to the default of ceph. But obviously can’t know from here the
> PID and the CCTID you will have to identify.
>
> You can actually do a ls /var/run/ceph to find the correct admin socket file
>
> Regards
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com
> +1 408-680-6959
>
> On Oct 17, 2017, at 12:50, Josy  wrote:
>
> Hi,
>
>
> I am following this article :
>
> http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/
>
> I have enabled this flag in ceph.conf
>
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/
>
>
> But the command to show the conf is not working :
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
> /etc/ceph/ceph.client.admin.keyring config show
> admin_socket: exception getting command descriptions: [Errno 111] Connection
> refused
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Efficient storage of small objects / bulk erasure coding

2017-10-17 Thread Gregory Farnum
On Tue, Oct 17, 2017 at 12:42 PM Jiri Horky  wrote:

> Hi list,
>
> we are thinking of building relatively big CEPH-based object storage for
> storage of our sample files - we have about 700M files ranging from very
> small (1-4KiB) files to pretty big ones (several GiB). Median of file
> size is 64KiB. Since the required space is relatively large (1PiB of
> usable storage), we are thinking of utilizing erasure coding for this
> case. On the other hand, we need to achieve at least 1200MiB/s
> throughput on reads. The working assumption is 4+2 EC (thus 50% overhead).
>
> Since the EC is per-object, the small objects will be stripped to even
> smaller ones. With 4+2 EC, one needs (at least) 4 IOs to read a single
> object in this scenario -> number of required IOPS when using EC is
> relatively high. Some vendors (such as Hitachi, but I believe EMC as
> well) do offline, predefined-chunk size EC instead. The idea is to first
> write objects with replication factor of 3, wait for enough objects to
> fill 4x 64MiB chunks and only do EC on that. This not only makes the EC
> less computationally intensive, and repairs much faster, but it also
> allows reading majority of the small objects directly by reading just
> part of one of the chunk from it (assuming non degraded state) - one
> chunk actually contains the whole object.
> I wonder if something similar is already possible with CEPH and/or is
> planned. For our use case of very small objects, it would mean near 3-4x
> performance boosts in terms of required IOPS performance.
>
> Another option how to get out of this situation is to be able to specify
> different storage pools/policies based on file size - i.e. to do 3x
> replication of the very small files and only use EC for bigger files,
> where the performance hit with 4x IOPS won't be that painful. But I I am
> afraid this is not possible...
>
>
Unfortunately any logic like this would need to be handled in your
application layer. Raw RADOS does not do object sharding or aggregation on
its own.
CERN did contribute the libradosstriper, which will break down your
multi-gigabyte objects into more typical sizes, but a generic system for
packing many small objects into larger ones is tough — the choices depend
so much on likely access patterns and such.
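
A bare-bones sketch of what that application-side routing could look like
with the plain rados CLI (pool names, PG counts and the 64 KiB cutoff are
invented for the example):

# ceph osd erasure-code-profile set ec42 k=4 m=2
# ceph osd pool create samples-ec 1024 1024 erasure ec42
# ceph osd pool create samples-rep 512 512 replicated

SIZE=$(stat -c%s "$FILE")
if [ "$SIZE" -lt 65536 ]; then
    rados -p samples-rep put "$NAME" "$FILE"
else
    rados -p samples-ec put "$NAME" "$FILE"
fi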

I would definitely recommend working out something like that, though!
-Greg


> Any other hint is sincerely welcome.
>
> Thank you
> Jiri Horky
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crashed while repairing inconsistent PG luminous

2017-10-17 Thread Gregory Farnum
On Tue, Oct 17, 2017 at 9:51 AM Ana Aviles  wrote:

> Hello all,
>
> We had an inconsistent PG on our cluster. While performing PG repair
> operation the OSD crashed. The OSD was not able to start again anymore,
> and there was no hardware failure on the disk itself. This is the log
> output
>
> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 3 errors, 1 fixed
> 2017-10-17 17:48:56.047896 7f234930d700 -1
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const
> ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)'
> thread 7f234930d700 time 2017-10-17 17:48:55.924115
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
> recovery_info.ss.clone_snaps.end())
>

Hmm. The OSD got a push op containing a snapshot it doesn't think should
exist. I also see that there's a comment "// hmm, should we warn?" on that
assert.

Can you take a full log with "debug osd = 20" set, post it with
ceph-post-file, and create a ticket on tracker.ceph.com?
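
For reference, one way to capture that when the daemon dies at startup
(OSD id and log path below are placeholders): add

[osd]
    debug osd = 20

to ceph.conf on that host, start the OSD once more to reproduce the
assert, and then

# ceph-post-file /var/log/ceph/ceph-osd.12.log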

Are all your OSDs running that same version?
-Greg


>
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x56236c8ff3f2]
>  2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
> const&, std::shared_ptr, bool,
> ObjectStore::Transaction*)+0xd63) [0x56236c476213]
>  3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&,
> PullOp*, std::__cxx11::list std::allocator >*,
> ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
>  4:
>
> (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr)+0x2b5)
> [0x56236c60dd75]
>  5:
> (ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x20c)
> [0x56236c61196c]
>  6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50)
> [0x56236c521aa0]
>  7: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
>  8: (OSD::dequeue_op(boost::intrusive_ptr,
> boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9)
> [0x56236c3091a9]
>  9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
> const&)+0x57) [0x56236c5a2ae7]
>  10: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
>  11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884)
> [0x56236c9041e4]
>  12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
>  13: (()+0x76ba) [0x7f2366be96ba]
>  14: (clone()+0x6d) [0x7f2365c603dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Thanks!
>
> Ana
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous : 3 clients failing to respond to cache pressure

2017-10-17 Thread Gregory Farnum
On Tue, Oct 17, 2017 at 6:36 AM Yoann Moulin  wrote:

> Hello,
>
> I have a luminous (12.2.1) cluster with 3 nodes for cephfs (no rbd or rgw)
> and we hit the "X clients failing to respond to cache pressure" message.
> I have 3 mds servers active.
>
> Is this something I have to worry about ?
>

This message means
* the MDS has exceeded the size of its cache, and
* the MSD has asked clients to reduce the number of files they hold
capabilities on (so the MDS can trim them out of cache), and
* the clients are not returning capabilities

It's entirely possible this is because the clients are actually holding
references to all those files. If you haven't configured your cache size
explicitly, you can probably increase it by a lot, and perhaps put this
warning to bed.
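
For example, with the memory-based limit in luminous (4 GiB below is just
an example value; if I remember right the default is 1 GiB):

# ceph --cluster container daemon mds.iccluster054.iccluster.epfl.ch config get mds_cache_memory_limit
# ceph --cluster container daemon mds.iccluster054.iccluster.epfl.ch config set mds_cache_memory_limit 4294967296

and make it persistent by setting mds_cache_memory_limit in the [mds]
section of ceph.conf.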
-Greg


>
> here some information about the cluster :
>
> > root@iccluster054:~# ceph --cluster container -s
> >   cluster:
> > id: a294a95a-0baa-4641-81c1-7cd70fd93216
> > health: HEALTH_WARN
> > 3 clients failing to respond to cache pressure
> >
> >   services:
> > mon: 3 daemons, quorum iccluster041.iccluster.epfl.ch,
> iccluster042.iccluster.epfl.ch,iccluster054.iccluster.epfl.ch
> > mgr: iccluster042(active), standbys: iccluster054
> > mds: cephfs-3/3/3 up  {0=iccluster054.iccluster.epfl.ch=up:active,1=
> iccluster041.iccluster.epfl.ch=up:active,2=iccluster042.iccluster.epfl.ch
> =up:active}
> > osd: 18 osds: 18 up, 18 in
> >
> >   data:
> > pools:   3 pools, 544 pgs
> > objects: 2357k objects, 564 GB
> > usage:   2011 GB used, 65055 GB / 67066 GB avail
> > pgs: 544 active+clean
> >
>
>
>
> > root@iccluster041:~# ceph --cluster container daemon
> mds.iccluster041.iccluster.epfl.ch perf dump mds
> > {
> > "mds": {
> > "request": 193508283,
> > "reply": 192815355,
> > "reply_latency": {
> > "avgcount": 192815355,
> > "sum": 457371.475011160,
> > "avgtime": 0.002372069
> > },
> > "forward": 692928,
> > "dir_fetch": 1717132,
> > "dir_commit": 43521,
> > "dir_split": 4197,
> > "dir_merge": 4244,
> > "inode_max": 2147483647,
> > "inodes": 11098,
> > "inodes_top": 7668,
> > "inodes_bottom": 3404,
> > "inodes_pin_tail": 26,
> > "inodes_pinned": 143,
> > "inodes_expired": 138623,
> > "inodes_with_caps": 87,
> > "caps": 239,
> > "subtrees": 15,
> > "traverse": 195425369,
> > "traverse_hit": 192867085,
> > "traverse_forward": 692723,
> > "traverse_discover": 476,
> > "traverse_dir_fetch": 1714684,
> > "traverse_remote_ino": 0,
> > "traverse_lock": 6,
> > "load_cent": 19465322425,
> > "q": 0,
> > "exported": 1211,
> > "exported_inodes": 845556,
> > "imported": 1082,
> > "imported_inodes": 1209280
> > }
> > }
>
>
> > root@iccluster054:~# ceph --cluster container daemon
> mds.iccluster054.iccluster.epfl.ch perf dump mds
> > {
> > "mds": {
> > "request": 267620366,
> > "reply": 255792944,
> > "reply_latency": {
> > "avgcount": 255792944,
> > "sum": 42256.407340600,
> > "avgtime": 0.000165197
> > },
> > "forward": 11827411,
> > "dir_fetch": 183,
> > "dir_commit": 2607,
> > "dir_split": 27,
> > "dir_merge": 19,
> > "inode_max": 2147483647,
> > "inodes": 3740,
> > "inodes_top": 2517,
> > "inodes_bo

Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Jean-Charles Lopez
Hi Josy,

just a doubt but it looks like your ASOK file is the one from a Ceph Manager. 
So my suspicion is that you may be running the command from the wrong machine.

To run this command, you need to ssh into the machine where the client 
connection is being initiated.

But may be I am wrong regarding your exact connection point.

As Jason points out, you also need to make sure that you restart the client
connection for the changes in the ceph.conf file to take effect.

Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

> On Oct 17, 2017, at 14:29, Josy  wrote:
> 
> Thanks for the reply.
> 
> I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin file 
> to all nodes.
> 
> -
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/client.log
> debug rbd = 20
> debug librbd = 20
> rbd_non_blocking_aio = false
> --
> 
> 
> However the config show command still shows it as true. 
> ---
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon 
> /var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep 
> "rbd_non_blocking_aio"
> "rbd_non_blocking_aio": "true",
> ---
> 
> Did I miss something ? 
> 
> On 18-10-2017 01:22, Jean-Charles Lopez wrote:
>> Hi 
>> 
>> syntax uses the admin socket file : ceph --admin-daemon 
>> /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache
>> 
>> Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your 
>> connection is using client.admin to connect to the cluster and your cluster 
>> name is set to the default of ceph. But obviously can’t know from here the 
>> PID and the CCTID you will have to identify.
>> 
>> You can actually do a ls /var/run/ceph to find the correct admin socket file
>> 
>> Regards
>> JC Lopez
>> Senior Technical Instructor, Global Storage Consulting Practice
>> Red Hat, Inc.
>> jelo...@redhat.com 
>> +1 408-680-6959
>> 
>>> On Oct 17, 2017, at 12:50, Josy wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> I am following this article :
>>> 
>>> http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/ 
>>> 
>>> I have enabled this flag in ceph.conf
>>> 
>>> [client]
>>> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>>> log file = /var/log/ceph/
>>> 
>>> But the command to show the conf is not working : 
>>> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon  
>>> /etc/ceph/ceph.client.admin.keyring config show
>>> admin_socket: exception getting command descriptions: [Errno 111] 
>>> Connection refused
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Josy

Hi,

I am running the command from the admin server,
because there is no asok file on the client server:
ls /var/run/ceph/ lists no files there.


>> As Jason points it out you also need to make sure that your restart 
the client connection for the changes in the ceph.conf file to take effect.


You mean restart the client server ?

(I am sorry, this is something new for me. I have just started learning 
ceph.)



On 18-10-2017 03:32, Jean-Charles Lopez wrote:

Hi Josy,

just a doubt but it looks like your ASOK file is the one from a Ceph 
Manager. So my suspicion is that you may be running the command from 
the wrong machine.


To run this command, you need to ssh into the machine where the client 
connection is being initiated.


But may be I am wrong regarding your exact connection point.

As Jason points it out you also need to make sure that your restart 
the client connection for the changes in the ceph.conf file to take 
effect.


Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

On Oct 17, 2017, at 14:29, Josy wrote:


Thanks for the reply.

I added rbd_non_blocking_aio = false in ceph.conf and pushed the 
admin file to all nodes.


-
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/client.log
debug rbd = 20
debug librbd = 20
rbd_non_blocking_aio = false
--


However the config show command still shows it as true.
---
[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon 
/var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep 
"rbd_non_blocking_aio"

    "rbd_non_blocking_aio": "true",
---

Did I miss something ?


On 18-10-2017 01:22, Jean-Charles Lopez wrote:

Hi

syntax uses the admin socket file : ceph 
--admin-daemon /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok 
config get rbd_cache


Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your 
connection is using client.admin to connect to the cluster and your 
cluster name is set to the default of ceph. But obviously can’t know 
from here the PID and the CCTID you will have to identify.


You can actually do a ls /var/run/ceph to find the correct admin 
socket file


Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

On Oct 17, 2017, at 12:50, Josy wrote:


Hi,


I am following this article :

http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/

I have enabled this flag in ceph.conf

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/



But the command to show the conf is not working :

[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon 
/etc/ceph/ceph.client.admin.keyring config show
admin_socket: exception getting command descriptions: [Errno 111] 
Connection refused


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com








___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Jason Dillaman
On Tue, Oct 17, 2017 at 6:30 PM, Josy  wrote:
> Hi,
>
> I am running the command  from the admin server.
>
> Because there are no asok file in the client server
> ls /var/run/ceph/ lists no files in the client server.

Most likely a permissions or SELinux/AppArmor issue where the librbd
client application cannot write to the directory.

>
>>> As Jason points it out you also need to make sure that your restart the
>>> client connection for the changes in the ceph.conf file to take effect.
>
> You mean restart the client server ?
>
> (I am sorry, this is something new for me. I have just started learning
> ceph.)

Assuming this is for QEMU, QEMU is the librbd client so you would have
to stop/start the VM to pick up any configuration changes (or perform
a live migration to another server).
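
A quick way to check (the QEMU/libvirt user name below is an assumption;
adjust it for your distribution):

ls -ld /var/run/ceph
sudo -u qemu touch /var/run/ceph/asok-test && sudo rm /var/run/ceph/asok-test
getenforce                    # SELinux mode, if installed
sudo aa-status | head -n 5    # AppArmor status, if installed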

> On 18-10-2017 03:32, Jean-Charles Lopez wrote:
>
> Hi Josy,
>
> just a doubt but it looks like your ASOK file is the one from a Ceph
> Manager. So my suspicion is that you may be running the command from the
> wrong machine.
>
> To run this command, you need to ssh into the machine where the client
> connection is being initiated.
>
> But may be I am wrong regarding your exact connection point.
>
> As Jason points it out you also need to make sure that your restart the
> client connection for the changes in the ceph.conf file to take effect.
>
> Regards
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com
> +1 408-680-6959
>
> On Oct 17, 2017, at 14:29, Josy  wrote:
>
> Thanks for the reply.
>
> I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin file
> to all nodes.
>
> -
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/client.log
> debug rbd = 20
> debug librbd = 20
> rbd_non_blocking_aio = false
> --
>
>
> However the config show command still shows it as true.
> ---
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
> /var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep
> "rbd_non_blocking_aio"
> "rbd_non_blocking_aio": "true",
> ---
>
> Did I miss something ?
>
>
> On 18-10-2017 01:22, Jean-Charles Lopez wrote:
>
> Hi
>
> syntax uses the admin socket file : ceph --admin-daemon
> /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache
>
> Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your
> connection is using client.admin to connect to the cluster and your cluster
> name is set to the default of ceph. But obviously can’t know from here the
> PID and the CCTID you will have to identify.
>
> You can actually do a ls /var/run/ceph to find the correct admin socket file
>
> Regards
> JC Lopez
> Senior Technical Instructor, Global Storage Consulting Practice
> Red Hat, Inc.
> jelo...@redhat.com
> +1 408-680-6959
>
> On Oct 17, 2017, at 12:50, Josy  wrote:
>
> Hi,
>
>
> I am following this article :
>
> http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/
>
> I have enabled this flag in ceph.conf
>
> [client]
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> log file = /var/log/ceph/
>
>
> But the command to show the conf is not working :
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
> /etc/ceph/ceph.client.admin.keyring config show
> admin_socket: exception getting command descriptions: [Errno 111] Connection
> refused
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crashed while reparing inconsistent PG luminous

2017-10-17 Thread Mart van Santen

Hi Greg,

(I'm a colleague of Ana), Thank you for your reply


On 10/17/2017 11:57 PM, Gregory Farnum wrote:
>
>
> On Tue, Oct 17, 2017 at 9:51 AM Ana Aviles wrote:
>
> Hello all,
>
> We had an inconsistent PG on our cluster. While performing PG repair
> operation the OSD crashed. The OSD was not able to start again anymore,
> and there was no hardware failure on the disk itself. This is the log 
> output
>
> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log
> [ERR] : 2.2fc repair 3 errors, 1 fixed
> 2017-10-17 17:48:56.047896 7f234930d700 -1
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void
> PrimaryLogPG::on_local_recover(const hobject_t&, const
> ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)'
> thread 7f234930d700 time 2017-10-17 17:48:55.924115
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
> recovery_info.ss.clone_snaps.end())
>
>
> Hmm. The OSD got a push op containing a snapshot it doesn't think should 
> exist. I also see that there's a comment "// hmm, should we warn?" on that 
> assert.


We also caught those log entries, which indeed point to a clone/snapshot
problem:

 -9877> 2017-10-17 17:46:16.044077 7f234db16700 10 log_client  will send 
2017-10-17 17:46:13.367842 osd.78 osd.78 [:::::203]:6880/9116 
483 : cluster [ERR] 2.2fc shard 78 missing 
2:3f72b543:::rbd_data.332d5a836bcc485.fcf6:466a7
 -9876> 2017-10-17 17:46:16.044105 7f234db16700 10 log_client  will send 
2017-10-17 17:46:13.368026 osd.78 osd.78 [:::::203]:6880/9116 
484 : cluster [ERR] repair 2.2fc 
2:3f72b543:::rbd_data.332d5a836bcc485.fcf6:466a7 is an unexpected 
clone
 -9868> 2017-10-17 17:46:16.324112 7f2354b24700 10 log_client  logged 
2017-10-17 17:46:13.367842 osd.78 osd.78 [:::::203]:6880/9116 
483 : cluster [ERR] 2.2fc shard 78 missing 
2:3f72b543:::rbd_data.332d5a836bcc485.fcf6:466a7
 -9867> 2017-10-17 17:46:16.324128 7f2354b24700 10 log_client  logged 
2017-10-17 17:46:13.368026 osd.78 osd.78 [:::::203]:6880/9116 
484 : cluster [ERR] repair 2.2fc 
2:3f72b543:::rbd_data.332d5a836bcc485.fcf6:466a7 is an unexpected 
clone
   -36> 2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log 
[ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
   -35> 2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log 
[ERR] : 2.2fc repair 3 errors, 1 fixed
-4> 2017-10-17 17:48:56.046071 7f234db16700 10 log_client  will send 
2017-10-17 17:48:55.771390 osd.78 osd.78 [:::::203]:6880/9116 
485 : cluster [ERR] 2.2fc repair 1 missing, 0 inconsistent objects
-3> 2017-10-17 17:48:56.046088 7f234db16700 10 log_client  will send 
2017-10-17 17:48:55.771419 osd.78 osd.78 [:::::203]:6880/9116 
486 : cluster [ERR] 2.2fc repair 3 errors, 1 fixed

>
> Can you take a full log with "debug osd = 20" set, post it with
> ceph-post-file, and create a ticket on tracker.ceph.com?

We will submit the ticket tomorrow (we are in CEST). We want to have more pairs
of eyes on it when we start the OSD again.
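
A rough sketch of how that log can be captured (osd.78 as in the entries
below; since the daemon crashes shortly after start, the debug setting is
easiest to put in its ceph.conf section rather than injected at runtime):

# in /etc/ceph/ceph.conf on the OSD host
[osd.78]
debug osd = 20

# then reproduce the crash and upload the log for the tracker ticket
systemctl start ceph-osd@78
ceph-post-file /var/log/ceph/ceph-osd.78.log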

After this crash we marked the OSD as out. The cluster rebalanced itself, but
unfortunately the same issue appeared on another OSD (same PG). After several
crashes of that OSD it came back up, but now with one PG down. I assume the
cluster decided it had 'finished' the ceph pg repair command and removed the
'repair' state, leaving a broken PG behind. If you have any hints on how we
can get the PG online again, we would be very grateful, so we can work on
that tomorrow.
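
For reference, the read-only commands that can be used to inspect the PG
further (PG 2.2fc as in the log above):

ceph health detail | grep 2.2fc
ceph pg 2.2fc query        # may not return while the PG is down
rados list-inconsistent-obj 2.2fc --format=json-pretty
rados list-inconsistent-snapset 2.2fc --format=json-pretty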


Thanks,

Mart




Some general info about this cluster:

- all OSD runs the same version, also monitors are all 12.2.1 (ubuntu xenial)

- the cluster is a backup cluster and has min_size 1 and size 2 (replica
count 2), so only 2 copies.

- the cluster was recently upgraded from jewel to luminous (3 weeks ago)

- the cluster was recently upgraded from straw to straw2 (1 week ago)

- it was in HEALTH_OK until this happened.

- we use filestore only

- the cluster was installed with hammer originally. upgraded to infernalis, 
jewel and now luminous



health:
(noup/noout set on purpose while we are trying to recover)


$ ceph -s
  cluster:
id: ----x
health: HEALTH_WARN
noup,noout flag(s) set
Reduced data availability: 1 pg inactive, 1 pg down
Degraded data redundancy: 2892/31621143 objects degraded (0.009%), 
2 pgs unclean, 1 pg degraded, 1 pg undersized
 
  services:
mon: 3 daemons, quorum ds2-mon1,ds2-mon2,ds2-mon3
mgr: ds2-mon1(active)
osd: 93 osds: 92 up, 92 in; 1 remapped pgs
 flags noup,noout

Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
Jamie, thanks for replying, info is as follow:

1)

[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Sense Key : Medium
Error [current]
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Add. Sense: No
additional sense information
[Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 CDB: Read(10) 28 00 00
00 09 10 00 00 f0 00
[Fri Oct 13 10:21:24 2017] blk_update_request: I/O error, dev sdx, sector
2320

2)

ndc-cl-mon1:~# ceph status
  cluster:
id: 48158350-ba8a-420b-9c09-68da57205924
health: HEALTH_OK

  services:
mon: 3 daemons, quorum ndc-cl-mon1,ndc-cl-mon2,ndc-cl-mon3
mgr: ndc-cl-mon1(active), standbys: ndc-cl-mon3, ndc-cl-mon2
osd: 161 osds: 160 up, 160 in

  data:
pools:   4 pools, 12288 pgs
objects: 663k objects, 2650 GB
usage:   9695 GB used, 258 TB / 267 TB avail
pgs: 12288 active+clean

  io:
client:   0 B/s rd, 1248 kB/s wr, 49 op/s rd, 106 op/s wr

3)

https://pastebin.com/MeCKqvp1


On Tue, Oct 17, 2017 at 5:59 PM, Jamie Fargen  wrote:

> Alejandro-
> Please provide the following information:
> 1) Include an example of an actual message you are seeing in dmesg.
> 2) Provide the output of # ceph status
> 3) Provide the output of # ceph osd tree
>
> Regards,
> Jamie Fargen
>
>
>
> On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> hi guys, any tip or help ?
>>
>> On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
>> alejan...@nubeliu.com> wrote:
>>
>>> Hi all, I have to hot-swap a failed OSD on a Luminous cluster with BlueStore
>>> (the disk is SATA, WAL and DB are on NVMe).
>>>
>>> I've issued a:
>>> * ceph osd crush reweight osd_id 0
>>> * systemctl stop (osd id daemon)
>>> * umount /var/lib/ceph/osd/osd_id
>>> * ceph osd destroy osd_id
>>>
>>> everything seems OK, but if I leave everything as is (until I wait for
>>> the replacement disk) I can see that dmesg errors on writes to the device
>>> are still appearing.
>>>
>>> The OSD is of course down and out of the crush map.
>>> Am I missing something? Like a step to execute, or something else?
>>>
>>> hoping to get help.
>>> best.
>>>
>>> ​alejandrito
>>>
>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Jamie Fargen
> Consultant
> jfar...@redhat.com
> 813-817-4430
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Josy
I think it is a permission error, because when running ceph -s it shows
this error at the top:


-

$ ceph -s
2017-10-17 15:53:26.132180 7f7698834700 -1 asok(0x7f76940017a0) 
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed 
to bind the UNIX domain socket to 
'/var/run/ceph/ceph-client.admin.29983.140147265902928.asok': (13) 
Permission denied

  cluster:
    id: de296604-d85c-46ab-a3af-add3367f0e6d
    health: HEALTH_OK


SELinux is disabled on the server. Also, I changed ownership of
/var/run/ceph to the ceph user.

Still no luck. ls /var/run/ceph/ lists no files on the client server.



On 18-10-2017 04:07, Jason Dillaman wrote:

On Tue, Oct 17, 2017 at 6:30 PM, Josy  wrote:

Hi,

I am running the command  from the admin server.

Because there are no asok file in the client server
ls /var/run/ceph/ lists no files in the client server.

Most likely a permissions or SElinux/AppArmor issue where the librbd
client application cannot write to the directory.


As Jason points it out you also need to make sure that your restart the
client connection for the changes in the ceph.conf file to take effect.

You mean restart the client server ?

(I am sorry, this is something new for me. I have just started learning
ceph.)

Assuming this is for QEMU, QEMU is the librbd client so you would have
to stop/start the VM to pick up any configuration changes (or perform
a live migration to another server).


On 18-10-2017 03:32, Jean-Charles Lopez wrote:

Hi Josy,

just a doubt but it looks like your ASOK file is the one from a Ceph
Manager. So my suspicion is that you may be running the command from the
wrong machine.

To run this command, you need to ssh into the machine where the client
connection is being initiated.

But may be I am wrong regarding your exact connection point.

As Jason points it out you also need to make sure that your restart the
client connection for the changes in the ceph.conf file to take effect.

Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com
+1 408-680-6959

On Oct 17, 2017, at 14:29, Josy  wrote:

Thanks for the reply.

I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin file
to all nodes.

-
[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/client.log
debug rbd = 20
debug librbd = 20
rbd_non_blocking_aio = false
--


However the config show command still shows it as true.
---
[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
/var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep
"rbd_non_blocking_aio"
 "rbd_non_blocking_aio": "true",
---

Did I miss something ?


On 18-10-2017 01:22, Jean-Charles Lopez wrote:

Hi

syntax uses the admin socket file : ceph --admin-daemon
/var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache

Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your
connection is using client.admin to connect to the cluster and your cluster
name is set to the default of ceph. But obviously can’t know from here the
PID and the CCTID you will have to identify.

You can actually do a ls /var/run/ceph to find the correct admin socket file

Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com
+1 408-680-6959

On Oct 17, 2017, at 12:50, Josy  wrote:

Hi,


I am following this article :

http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/

I have enabled this flag in ceph.conf

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/


But the command to show the conf is not working :

[cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
/etc/ceph/ceph.client.admin.keyring config show
admin_socket: exception getting command descriptions: [Errno 111] Connection
refused

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Jamie Fargen
Alejandro-

Those are kernel messages indicating that an error was encountered when
data was sent to the storage device; they are not directly related to the
operation of Ceph. The messages you sent also appear to have happened 4
days ago, on Friday. If they have since subsided, it probably means nothing
further has tried to read from or write to the disk, but the messages will
remain in dmesg until the kernel ring buffer is overwritten or the system
is restarted.
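
If the old messages get in the way, the ring buffer can be inspected and
cleared by hand (sdx as in your output):

dmesg -T | grep sdx | tail    # human-readable timestamps of the old errors
sudo dmesg -C                 # clear the buffer so only new errors show up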

-Jamie


On Tue, Oct 17, 2017 at 6:47 PM, Alejandro Comisario 
wrote:

> Jamie, thanks for replying, info is as follow:
>
> 1)
>
> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Sense Key : Medium
> Error [current]
> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Add. Sense: No
> additional sense information
> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 CDB: Read(10) 28 00 00
> 00 09 10 00 00 f0 00
> [Fri Oct 13 10:21:24 2017] blk_update_request: I/O error, dev sdx, sector
> 2320
>
> 2)
>
> ndc-cl-mon1:~# ceph status
>   cluster:
> id: 48158350-ba8a-420b-9c09-68da57205924
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ndc-cl-mon1,ndc-cl-mon2,ndc-cl-mon3
> mgr: ndc-cl-mon1(active), standbys: ndc-cl-mon3, ndc-cl-mon2
> osd: 161 osds: 160 up, 160 in
>
>   data:
> pools:   4 pools, 12288 pgs
> objects: 663k objects, 2650 GB
> usage:   9695 GB used, 258 TB / 267 TB avail
> pgs: 12288 active+clean
>
>   io:
> client:   0 B/s rd, 1248 kB/s wr, 49 op/s rd, 106 op/s wr
>
> 3)
>
> https://pastebin.com/MeCKqvp1
>
>
> On Tue, Oct 17, 2017 at 5:59 PM, Jamie Fargen  wrote:
>
>> Alejandro-
>> Please provide the following information:
>> 1) Include an example of an actual message you are seeing in dmesg.
>> 2) Provide the output of # ceph status
>> 3) Provide the output of # ceph osd tree
>>
>> Regards,
>> Jamie Fargen
>>
>>
>>
>> On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario <
>> alejan...@nubeliu.com> wrote:
>>
>>> hi guys, any tip or help ?
>>>
>>> On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
>>> alejan...@nubeliu.com> wrote:
>>>
 Hi all, i have to hot-swap a failed osd on a Luminous Cluster with Blue
 store (the disk is SATA, WAL and DB are on NVME).

 I've issued a:
 * ceph osd crush reweight osd_id 0
 * systemctl stop (osd I'd daemon)
 * umount /var/lib/ceph/osd/osd_id
 * ceph osd destroy osd_id

 everything seems of, but if I left everything as is ( until I wait for
 the replaced disk ) I can see that dmesg errors on writing on the device
 are still appearing.

 The osd is of course down and out the crushmap.
 am I missing something ? like a step to execute or something else ?

 hoping to get help.
 best.

 ​alejandrito

>>>
>>>
>>>
>>> --
>>> *Alejandro Comisario*
>>> *CTO | NUBELIU*
>>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>>> _
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>>
>> --
>> Jamie Fargen
>> Consultant
>> jfar...@redhat.com
>> 813-817-4430 <(813)%20817-4430>
>>
>
>
>
> --
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
> _
>



-- 
Jamie Fargen
Consultant
jfar...@redhat.com
813-817-4430
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-17 Thread Alejandro Comisario
I believe you are absolutely right.
It was my fault for not checking the dates before posting, my bad.

Thanks for your help.
Best.

On Tue, Oct 17, 2017 at 8:14 PM, Jamie Fargen  wrote:

> Alejandro-
>
> Those are kernel messages indicating that the an error was encountered
> when data was sent to the storage device and are not related directly to
> the operation of Ceph. The messages you sent also appear to have happened 4
> days ago on Friday and if they have subsided then it probably means nothing
> further has tried to read/write to the disk, but the messages will be
> present in dmesg until the kernel ring buffer is overwritten or the system
> is restarted.
>
> -Jamie
>
>
> On Tue, Oct 17, 2017 at 6:47 PM, Alejandro Comisario <
> alejan...@nubeliu.com> wrote:
>
>> Jamie, thanks for replying, info is as follow:
>>
>> 1)
>>
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 FAILED Result:
>> hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Sense Key : Medium
>> Error [current]
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 Add. Sense: No
>> additional sense information
>> [Fri Oct 13 10:21:24 2017] sd 0:2:23:0: [sdx] tag#0 CDB: Read(10) 28 00
>> 00 00 09 10 00 00 f0 00
>> [Fri Oct 13 10:21:24 2017] blk_update_request: I/O error, dev sdx, sector
>> 2320
>>
>> 2)
>>
>> ndc-cl-mon1:~# ceph status
>>   cluster:
>> id: 48158350-ba8a-420b-9c09-68da57205924
>> health: HEALTH_OK
>>
>>   services:
>> mon: 3 daemons, quorum ndc-cl-mon1,ndc-cl-mon2,ndc-cl-mon3
>> mgr: ndc-cl-mon1(active), standbys: ndc-cl-mon3, ndc-cl-mon2
>> osd: 161 osds: 160 up, 160 in
>>
>>   data:
>> pools:   4 pools, 12288 pgs
>> objects: 663k objects, 2650 GB
>> usage:   9695 GB used, 258 TB / 267 TB avail
>> pgs: 12288 active+clean
>>
>>   io:
>> client:   0 B/s rd, 1248 kB/s wr, 49 op/s rd, 106 op/s wr
>>
>> 3)
>>
>> https://pastebin.com/MeCKqvp1
>>
>>
>> On Tue, Oct 17, 2017 at 5:59 PM, Jamie Fargen  wrote:
>>
>>> Alejandro-
>>> Please provide the following information:
>>> 1) Include an example of an actual message you are seeing in dmesg.
>>> 2) Provide the output of # ceph status
>>> 3) Provide the output of # ceph osd tree
>>>
>>> Regards,
>>> Jamie Fargen
>>>
>>>
>>>
>>> On Tue, Oct 17, 2017 at 4:34 PM, Alejandro Comisario <
>>> alejan...@nubeliu.com> wrote:
>>>
 hi guys, any tip or help ?

 On Mon, Oct 16, 2017 at 1:50 PM, Alejandro Comisario <
 alejan...@nubeliu.com> wrote:

> Hi all, i have to hot-swap a failed osd on a Luminous Cluster with
> Blue store (the disk is SATA, WAL and DB are on NVME).
>
> I've issued a:
> * ceph osd crush reweight osd_id 0
> * systemctl stop (osd I'd daemon)
> * umount /var/lib/ceph/osd/osd_id
> * ceph osd destroy osd_id
>
> everything seems of, but if I left everything as is ( until I wait for
> the replaced disk ) I can see that dmesg errors on writing on the device
> are still appearing.
>
> The osd is of course down and out the crushmap.
> am I missing something ? like a step to execute or something else ?
>
> hoping to get help.
> best.
>
> ​alejandrito
>



 --
 *Alejandro Comisario*
 *CTO | NUBELIU*
 E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
 _

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>>
>>> --
>>> Jamie Fargen
>>> Consultant
>>> jfar...@redhat.com
>>> 813-817-4430 <(813)%20817-4430>
>>>
>>
>>
>>
>> --
>> *Alejandro Comisario*
>> *CTO | NUBELIU*
>> E-mail: alejandro@nubeliu.comCell: +54911 3770 1857
>> _
>>
>
>
>
> --
> Jamie Fargen
> Consultant
> jfar...@redhat.com
> 813-817-4430
>



-- 
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@nubeliu.com  Cell: +54911 3770 1857
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To check RBD cache enabled

2017-10-17 Thread Jean-Charles Lopez
Hi Josy,

this is correct.

Just make sure that your current user as well as the user for your VMs (if you 
are using a VM environment) are allowed to write to this directory.

Also make sure that /var/run/ceph exists.

Once you have fixed the permissions problem and made sure that the path where 
you want to create the socket file exists it will be OK.

Note that the socket file can be created anywhere, so you could actually
set the parameter to admin_socket = "./my-client.asok" just for the
purpose of a test.
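
A minimal sketch of the kind of test I mean (assuming the client runs as
your current user; tighten the permissions again afterwards):

sudo mkdir -p /var/run/ceph
sudo chown $(whoami) /var/run/ceph
ceph -s                # the bind_and_listen error should be gone
ls /var/run/ceph/      # a ceph-client.admin.<pid>.<cctid>.asok should now exist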

Regards
JC Lopez
Senior Technical Instructor, Global Storage Consulting Practice
Red Hat, Inc.
jelo...@redhat.com 
+1 408-680-6959

> On Oct 17, 2017, at 16:07, Josy  wrote:
> 
> I think it is permission error, because when running ceph -s it shows this 
> error at the top
> 
> -
> 
> $ ceph -s
> 2017-10-17 15:53:26.132180 7f7698834700 -1 asok(0x7f76940017a0) 
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to 
> bind the UNIX domain socket to 
> '/var/run/ceph/ceph-client.admin.29983.140147265902928.asok': (13) Permission 
> denied
>   cluster:
> id: de296604-d85c-46ab-a3af-add3367f0e6d
> health: HEALTH_OK
> 
> 
> Selinux is disabled in the server. Also I changed ownership of /var/run/ceph 
> to the ceph user.
> Still no luck. 'ls /var/run/ceph/ lists' no files in the client server
> 
> 
> 
> On 18-10-2017 04:07, Jason Dillaman wrote:
>> On Tue, Oct 17, 2017 at 6:30 PM, Josy  wrote:
>>> Hi,
>>> 
>>> I am running the command  from the admin server.
>>> 
>>> Because there are no asok file in the client server
>>> ls /var/run/ceph/ lists no files in the client server.
>> Most likely a permissions or SElinux/AppArmor issue where the librbd
>> client application cannot write to the directory.
>> 
> As Jason points it out you also need to make sure that your restart the
> client connection for the changes in the ceph.conf file to take effect.
>>> You mean restart the client server ?
>>> 
>>> (I am sorry, this is something new for me. I have just started learning
>>> ceph.)
>> Assuming this is for QEMU, QEMU is the librbd client so you would have
>> to stop/start the VM to pick up any configuration changes (or perform
>> a live migration to another server).
>> 
>>> On 18-10-2017 03:32, Jean-Charles Lopez wrote:
>>> 
>>> Hi Josy,
>>> 
>>> just a doubt but it looks like your ASOK file is the one from a Ceph
>>> Manager. So my suspicion is that you may be running the command from the
>>> wrong machine.
>>> 
>>> To run this command, you need to ssh into the machine where the client
>>> connection is being initiated.
>>> 
>>> But may be I am wrong regarding your exact connection point.
>>> 
>>> As Jason points it out you also need to make sure that your restart the
>>> client connection for the changes in the ceph.conf file to take effect.
>>> 
>>> Regards
>>> JC Lopez
>>> Senior Technical Instructor, Global Storage Consulting Practice
>>> Red Hat, Inc.
>>> jelo...@redhat.com
>>> +1 408-680-6959
>>> 
>>> On Oct 17, 2017, at 14:29, Josy  wrote:
>>> 
>>> Thanks for the reply.
>>> 
>>> I added rbd_non_blocking_aio = false in ceph.conf and pushed the admin file
>>> to all nodes.
>>> 
>>> -
>>> [client]
>>> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>>> log file = /var/log/ceph/client.log
>>> debug rbd = 20
>>> debug librbd = 20
>>> rbd_non_blocking_aio = false
>>> --
>>> 
>>> 
>>> However the config show command still shows it as true.
>>> ---
>>> [cephuser@ceph-las-admin-a1 ceph-cluster]$ sudo ceph --admin-daemon
>>> /var/run/ceph/ceph-mgr.ceph-las-admin-a1.asok config show | grep
>>> "rbd_non_blocking_aio"
>>> "rbd_non_blocking_aio": "true",
>>> ---
>>> 
>>> Did I miss something ?
>>> 
>>> 
>>> On 18-10-2017 01:22, Jean-Charles Lopez wrote:
>>> 
>>> Hi
>>> 
>>> syntax uses the admin socket file : ceph --admin-daemon
>>> /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok config get rbd_cache
>>> 
>>> Should be /var/run/ceph/ceph.client.admin.$pid.$cctid.asok if your
>>> connection is using client.admin to connect to the cluster and your cluster
>>> name is set to the default of ceph. But obviously can’t know from here the
>>> PID and the CCTID you will have to identify.
>>> 
>>> You can actually do a ls /var/run/ceph to find the correct admin socket file
>>> 
>>> Regards
>>> JC Lopez
>>> Senior Technical Instructor, Global Storage Consulting Practice
>>> Red Hat, Inc.
>>> jelo...@redhat.com
>>> +1 408-680-6959
>>> 
>>> On Oct 17, 2017, at 12:50, Josy  wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> I am following this article :
>>> 
>>> http://ceph.com/geen-categorie/ceph-validate-that-the-rbd-cache-is-active/
>>> 
>>> I have enabled this flag in ceph.conf
>>> 
>>> [client]
>>> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>>> log file = /var/log/ceph/
>>> 
>>> 
>>> But the command to show the conf is not working :
>>> 
>>> [cephuser@ceph-las-admin-a1 ceph-

Re: [ceph-users] [MONITOR SEGFAULT] Luminous cluster stuck when adding monitor

2017-10-17 Thread Nico Schottelius

Hello everyone,

is there any solution in sight for this problem? Currently our cluster
is stuck with a two-monitor configuration, as every time we restart the one
on server2 it crashes after some minutes (and in the meantime the cluster is stuck).

Should we consider downgrading to kraken to fix that problem?

Best,

Nico


--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Thick provisioning

2017-10-17 Thread Wido den Hollander

> On 17 October 2017 at 19:38, Jason Dillaman wrote:
> 
> 
> There is no existing option to thick provision images within RBD. When
> an image is created or cloned, the only actions that occur are some
> small metadata updates to describe the image. This allows image
> creation to be a quick, constant time operation regardless of the
> image size. To thick provision the entire image would require writing
> data to the entire image and ensuring discard support is disabled to
> prevent the OS from releasing space back (and thus re-sparsifying the
> image).
> 

Indeed. It makes me wonder why anybody would want it. It will:

- Impact recovery performance
- Impact scrubbing performance
- Utilize more space than needed

Why would you want to do this Sinan?
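
That said, if someone really wants it, a rough manual sketch (example pool
and image names, kernel RBD client assumed, and discard must stay disabled
in the guest or the image becomes sparse again) is simply to write out the
whole image once:

rbd create --size 10240 mypool/myimage            # 10 GiB image
DEV=$(sudo rbd map mypool/myimage)
sudo dd if=/dev/zero of="$DEV" bs=4M count=2560 oflag=direct
sudo rbd unmap "$DEV"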

Wido

> On Mon, Oct 16, 2017 at 10:49 AM,   wrote:
> > Hi,
> >
> > I have deployed a Ceph cluster (Jewel). By default all block devices that
> > are created are thin provisioned.
> >
> > Is it possible to change this setting? I would like to have that all
> > created block devices are thick provisioned.
> >
> > In front of the Ceph cluster, I am running Openstack.
> >
> > Thanks!
> >
> > Sinan
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Wido den Hollander

> On 17 October 2017 at 14:21, Mark Nelson wrote:
> 
> 
> 
> 
> On 10/17/2017 01:54 AM, Wido den Hollander wrote:
> >
> >> On 16 October 2017 at 18:14, Richard Hesketh wrote:
> >>
> >>
> >> On 16/10/17 13:45, Wido den Hollander wrote:
>  On 26 September 2017 at 16:39, Mark Nelson wrote:
>  On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. To bad that there is no
> > estimate/method to calculate the db partition size.
> 
>  It's possible that we might be able to get ranges for certain kinds of
>  scenarios.  Maybe if you do lots of small random writes on RBD, you can
>  expect a typical metadata size of X per object.  Or maybe if you do lots
>  of large sequential object writes in RGW, it's more like Y.  I think
>  it's probably going to be tough to make it accurate for everyone though.
> >>>
> >>> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> >>> 75085
> >>> root@alpha:~#
> >>>
> >>> I then saw the RocksDB database was 450MB in size:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> >>> 459276288
> >>> root@alpha:~#
> >>>
> >>> 459276288 / 75085 = 6116
> >>>
> >>> So about 6kb of RocksDB data per object.
> >>>
> >>> Let's say I want to store 1M objects in a single OSD I would need ~6GB of 
> >>> DB space.
> >>>
> >>> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> >>>
> >>> There aren't many of these numbers out there for BlueStore right now so 
> >>> I'm trying to gather some numbers.
> >>>
> >>> Wido
> >>
> >> If I check for the same stats on OSDs in my production cluster I see 
> >> similar but variable values:
> >>
> >> root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per 
> >> object: " ; expr `ceph daemon osd.$i perf dump | jq 
> >> '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq 
> >> '.bluestore.bluestore_onodes'` ; done
> >> osd.0 db per object: 7490
> >> osd.1 db per object: 7523
> >> osd.2 db per object: 7378
> >> osd.3 db per object: 7447
> >> osd.4 db per object: 7233
> >> osd.5 db per object: 7393
> >> osd.6 db per object: 7074
> >> osd.7 db per object: 7967
> >> osd.8 db per object: 7253
> >> osd.9 db per object: 7680
> >>
> >> root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.10 db per object: 5168
> >> osd.11 db per object: 5291
> >> osd.12 db per object: 5476
> >> osd.13 db per object: 4978
> >> osd.14 db per object: 5252
> >> osd.15 db per object: 5461
> >> osd.16 db per object: 5135
> >> osd.17 db per object: 5126
> >> osd.18 db per object: 9336
> >> osd.19 db per object: 4986
> >>
> >> root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.20 db per object: 5115
> >> osd.21 db per object: 4844
> >> osd.22 db per object: 5063
> >> osd.23 db per object: 5486
> >> osd.24 db per object: 5228
> >> osd.25 db per object: 4966
> >> osd.26 db per object: 5047
> >> osd.27 db per object: 5021
> >> osd.28 db per object: 5321
> >> osd.29 db per object: 5150
> >>
> >> root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.30 db per object: 6658
> >> osd.31 db per object: 6445
> >> osd.32 db per object: 6259
> >> osd.33 db per object: 6691
> >> osd.34 db per object: 6513
> >> osd.35 db per object: 6628
> >> osd.36 db per object: 6779
> >> osd.37 db per object: 6819
> >> osd.38 db per object: 6677
> >> osd.39 db per object: 6689
> >>
> >> root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.40 db per object: 5335
> >> osd.41 db per object: 5203
> >> osd.42 db per object: 5552
> >> osd.43 db per object: 5188
> >> osd.44 db per object: 5218
> >> osd.45 db per object: 5157
> >> osd.46 db per object: 4956
> >> osd.47 db per object: 5370
> >> osd.48 db per object: 5117
> >> osd.49 db per object: 5313
> >>
> >> I'm not sure why so much variance (these nodes are basically identical) 
> >> and I think that the db_used_bytes includes the WAL at least in my case, 
> >> as I don't have a separate WAL device. I'm not sure how big the WAL is 
> >> relative to metadata and hence how much this might be thrown off, but 
> >> ~6kb/object seems like a reasonable value to take for back