[ceph-users] Re: cephfs file layouts, empty objects in first data pool

2020-02-10 Thread Håkan T Johansson


On Mon, 10 Feb 2020, Gregory Farnum wrote:


On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson  wrote:

  Hi,

  running 14.2.6, debian buster (backports).

  Have set up a cephfs with 3 data pools and one metadata pool:
  myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.

  The data of all files are with the use of ceph.dir.layout.pool either
  stored in the pools myfs_data_hdd or myfs_data_ssd.  This has also been
  checked by dumping the ceph.file.layout.pool attributes of all files.

  The filesystem has 1617949 files and 36042 directories.

  There are however approximately as many objects in the first pool created
  for the cephfs, myfs_data, as there are files.  They also become more or
  fewer as files are created or deleted (so cannot be some leftover from
  earlier exercises).  Note how the USED size is reported as 0 bytes,
  correctly reflecting that no file data is stored in them.

  POOL_NAME        USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD   WR_OPS      WR USED COMPR UNDER COMPR
  myfs_data         0 B 1618229      0 4854687                  0       0        0 2263590 129 GiB 23312479 124 GiB        0 B         0 B
  myfs_data_hdd 831 GiB  136309      0  408927                  0       0        0  106046 200 GiB   269084 277 GiB        0 B         0 B
  myfs_data_ssd  43 GiB 1552412      0 4657236                  0       0        0  181468 2.3 GiB  4661935  12 GiB        0 B         0 B
  myfs_metadata 1.2 GiB   36096      0  108288                  0       0        0 4828623  82 GiB  1355102 143 GiB        0 B         0 B

  Is this expected?

  I was assuming that in this scenario all objects, both their data and any
  keys, would be either in the metadata pool or in the two pools where the
  objects are stored.

  Are some additional metadata keys stored in the first-created data pool
  for cephfs?  This would not be so nice in case the OSD selection rules
  for it are using worse disks than the data itself...


https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-to-the-mds
 notes there is “a small amount of metadata” kept in the primary pool. 


Thanks!  This I managed to miss, probably as it was at the bottom of the 
page.  In case one wants to use layouts to separate fast (likely many) 
from slow (likely large) files, it then sounds as if the primary pool should 
be of the fast kind too, due to the large number of objects.  Thus this needs 
to be highlighted early in that documentation.
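
For anyone setting up the same kind of separation, a minimal sketch of the 
commands involved (the pool names match the ones above; the filesystem name 
"myfs" and the mount point /mnt/myfs are just assumptions):

  # add the extra data pools to the filesystem (once)
  ceph fs add_data_pool myfs myfs_data_hdd
  ceph fs add_data_pool myfs myfs_data_ssd

  # direct new files below a directory to a specific pool
  setfattr -n ceph.dir.layout.pool -v myfs_data_ssd /mnt/myfs/fast
  setfattr -n ceph.dir.layout.pool -v myfs_data_hdd /mnt/myfs/slow

  # verify where an individual file's data actually went
  getfattr -n ceph.file.layout.pool /mnt/myfs/fast/somefile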



That’s not terribly clear; what is actually stored is a per-file location 
backtrace (its location in the directory tree) used for hardlink lookups and 
disaster recovery
scenarios.


This info would be nice to add to the manual page.  It is nice to know 
what kind of information is stored there.


Again thanks for the clarification!


  Btw: is there any tool to see the amount of key value data size associated
  with a pool?  'ceph osd df' gives omap and meta for osds, but not broken
  down per pool.


I think this is in the newest master code, but I’m not certain which release 
it’s in...


Would it then (when available) also be in the 'rados df' command?

Best regards,
Håkan



-Greg



  Best regards,
  Håkan
  ___
  ceph-users mailing list -- ceph-users@ceph.io
  To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Write i/o in CephFS metadata pool

2020-02-10 Thread Samy Ascha



> On 6 Feb 2020, at 11:23, Stefan Kooman  wrote:
> 
>> Hi!
>> 
>> I've confirmed that the write IO to the metadata pool is coming from active 
>> MDSes.
>> 
>> I'm experiencing very poor write performance on clients and I would like to 
>> see if there's anything I can do to optimise the performance.
>> 
>> Right now, I'm specifically focussing on speeding up this use case:
>> 
>> In CephFS mounted dir:
>> 
>> $ time unzip -q wordpress-seo.12.9.1.zip 
>> 
>> real 0m47.596s
>> user 0m0.218s
>> sys  0m0.157s
>> 
>> On RBD mount:
>> 
>> $ time unzip -q wordpress-seo.12.9.1.zip 
>> 
>> real 0m0.176s
>> user 0m0.131s
>> sys  0m0.045s
>> 
>> The difference is just too big. I'm having real trouble finding a good 
>> reference to check my setup for bad configuration etc.
>> 
>> I have network bandwidth, RAM and CPU to spare, but I'm unsure on how to put 
>> it to work to help my case.
> 
> Are there a lot of directories to be created from that zip file? I think
> it boils down to the directory operations that need to be performed
> synchrously. See
> https://fosdem.org/2020/schedule/event/sds_ceph_async_directory_ops/
> https://fosdem.org/2020/schedule/event/sds_ceph_async_directory_ops/attachments/slides/3962/export/events/attachments/sds_ceph_async_directory_ops/slides/3962/async_dirops_cephfs.pdf
> https://video.fosdem.org/2020/H.1308/sds_ceph_async_directory_ops.webm

Hi!

Last Friday, I did a round of updates that were pending and planned for 
installation.

After the updates, all server and client components were running their latest 
version and systems were rebooted to latest kernel versions.

Ceph: Mimic
Kernel: 5.3 (Ubuntu HWE)

The write IO to the metadata pool is down by a factor of 10 and performance 
seems much improved.
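
For anyone wanting to keep an eye on this, the per-pool client I/O rates can 
be watched with something like the following (the metadata pool name here is 
an assumption):

  ceph osd pool stats cephfs_metadata
  # or continuously:
  watch -n 5 'ceph osd pool stats cephfs_metadata'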

Though this does not give me a lot of intel on what the problem was, I'm glad 
that it is now pretty much resolved ;)

Before the updates, I was running different (minor) versions of Ceph and kernel 
clients. This may not have been ideal, but I'm not sure about the details of 
possible issues with that.

Rebooting everything may have also eliminated some issues. I did not have the 
opportunity to do much analysis on that, since I was working in a production 
environment.

Well, maybe some of you have extra insights. I'm happy to close this issue and 
will be monitoring and recording related info in case this happens again.

Thanks much for your inputs, and have a good week,

Samy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] extract disk usage stats from running ceph cluster

2020-02-10 Thread lists

Hi,

We would like to replace the current Seagate ST4000NM0034 HDDs in our 
ceph cluster with SSDs, and before doing that, we would like to check out 
the typical usage of our current drives over the last years, so we can 
select the best (price/performance/endurance) SSD to replace them with.


I am trying to extract this info from the fields "Blocks received from 
initiator" / "blocks sent to initiator", as these are the fields 
smartctl gets from the seagate disks. But the numbers seem strange, and 
I would like to request feedback here.


Three nodes, all equal, 8 OSDs per node, all 4TB ST4000NM0034 
(filestore) HDDs with SSD-based journals:



root@node1:~# ceph osd crush tree
ID CLASS WEIGHT   TYPE NAME
-1   87.35376 root default
-2   29.11688 host node1
 0   hdd  3.64000 osd.0
 1   hdd  3.64000 osd.1
 2   hdd  3.63689 osd.2
 3   hdd  3.64000 osd.3
12   hdd  3.64000 osd.12
13   hdd  3.64000 osd.13
14   hdd  3.64000 osd.14
15   hdd  3.64000 osd.15
-3   29.12000 host node2
 4   hdd  3.64000 osd.4
 5   hdd  3.64000 osd.5
 6   hdd  3.64000 osd.6
 7   hdd  3.64000 osd.7
16   hdd  3.64000 osd.16
17   hdd  3.64000 osd.17
18   hdd  3.64000 osd.18
19   hdd  3.64000 osd.19
-4   29.11688 host node3
 8   hdd  3.64000 osd.8
 9   hdd  3.64000 osd.9
10   hdd  3.64000 osd.10
11   hdd  3.64000 osd.11
20   hdd  3.64000 osd.20
21   hdd  3.64000 osd.21
22   hdd  3.64000 osd.22
23   hdd  3.63689 osd.23


We are looking at the numbers from smartctl, and basing our calculations 
on this output for each individual OSD:

Vendor (Seagate) cache information
  Blocks sent to initiator = 3783529066
  Blocks received from initiator = 3121186120
  Blocks read from cache and sent to initiator = 545427169
  Number of read and write commands whose size <= segment size = 93877358
  Number of read and write commands whose size > segment size = 2290879


I created the following spreadsheet:


          blocks sent    blocks received   total blocks
          to initiator   from initiator    calculated    read%   write%  aka
node1
osd0       905060564      1900663448       2805724012   32,26%   67,74%  sda
osd1      2270442418      3756215880       6026658298   37,67%   62,33%  sdb
osd2      3531938448      3940249192       7472187640   47,27%   52,73%  sdc
osd3      2824808123      3130655416       5955463539   47,43%   52,57%  sdd
osd12     1956722491      1294854032       3251576523   60,18%   39,82%  sdg
osd13     3410188306      1265443936       4675632242   72,94%   27,06%  sdh
osd14     3765454090      3115079112       6880533202   54,73%   45,27%  sdi
osd15     2272246730      2218847264       4491093994   50,59%   49,41%  sdj

node2
osd4      3974937107       740853712       4715790819   84,29%   15,71%  sda
osd5      1181377668      2109150744       3290528412   35,90%   64,10%  sdb
osd5      1903438106       608869008       2512307114   75,76%   24,24%  sdc
osd7      3511170043       724345936       4235515979   82,90%   17,10%  sdd
osd16     2642731906      3981984640       6624716546   39,89%   60,11%  sdg
osd17     3994977805      3703856288       7698834093   51,89%   48,11%  sdh
osd18     3992157229      2096991672       6089148901   65,56%   34,44%  sdi
osd19      279766405      1053039640       1332806045   20,99%   79,01%  sdj

node3
osd8      3711322586       234696960       3946019546   94,05%    5,95%  sda
osd9      1203912715      3132990000       4336902715   27,76%   72,24%  sdb
osd10      912356010      1681434416       2593790426   35,17%   64,83%  sdc
osd11      810488345      2626589896       3437078241   23,58%   76,42%  sdd
osd20     1506879946      2421596680       3928476626   38,36%   61,64%  sdg
osd21     2991526593         7525120       2999051713   99,75%    0,25%  sdh
osd22       29560337      3226114552       3255674889    0,91%   99,09%  sdi
osd23     2019195656      2563506320       4582701976   44,06%   55,94%  sdj


But as can be seen above, this results in some very strange numbers; for 
example for node3/osd21, node2/osd19 and node3/osd8 the numbers seem unlikely.


So, probably we're doing something wrong in our logic here.

Can someone explain what we're doing wrong, and is it possible to obtain 
stats like these from ceph directly? Does ceph keep historical 
stats like the above?
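
As a rough sketch of what I mean by getting this from ceph directly -- assuming 
the op_in_bytes / op_out_bytes perf counters exist in our version and that jq 
is available -- something like this, run on each OSD host, should print byte 
counters per OSD (note these reset when an OSD restarts, so they are not 
historical over the lifetime of the drive):

  for id in $(ls /var/lib/ceph/osd/ | sed 's/ceph-//'); do
    ceph daemon osd.$id perf dump 2>/dev/null \
      | jq -r --arg id "$id" \
        '"osd.\($id)  read_bytes=\(.osd.op_out_bytes)  write_bytes=\(.osd.op_in_bytes)"'
  done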


MJ
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Fwd: PrimaryLogPG.cc: 11550: FAILED ceph_assert(head_obc)

2020-02-10 Thread Jake Grimmett
Dear All,

Following a clunky* cluster restart, we had

23 "objects unfound"
14 pg recovery_unfound

We could see no way to recover the unfound objects, so we decided to mark
the unfound objects in one pg as lost...

[root@ceph1 bad_oid]# ceph pg 5.f2f mark_unfound_lost delete
pg has 2 objects unfound and apparently lost marking

Unfortunately, this immediately crashed the primary OSD for this PG:

OSD log showing the osd crashing 3 times here: 

the assert was :>

2020-02-10 13:38:45.003 7fa713ef3700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
In function 'int PrimaryLogPG::recover_missing(const hobject_t&,
eversion_t, int, PGBackend::RecoveryHandle*)' thread 7fa713ef3700 time
2020-02-10 13:38:45.000875
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/osd/PrimaryLogPG.cc:
11550: FAILED ceph_assert(head_obc)


Questions..

1) Is it possible to recover the flapping OSD? or should we fail out the
flapping OSD and hope the cluster recovers?

2) We have 13 other pgs with unfound objects. Do we need to mark_unfound
these one at a time, and then fail out their primary OSD? (allowing the
cluster to recover before marking unfound the next pg & failing its
primary OSD)
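
For reference, the unfound objects in each pg can be inspected with e.g. (the 
pg id is just the example from above):

  # which pgs currently report unfound objects
  ceph health detail | grep unfound

  # which objects a given pg cannot find, and why
  ceph pg 5.f2f list_unfound
  ceph pg 5.f2f query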



* thread describing the bad restart :>


many thanks!

Jake

-- 
Dr Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.


-- 
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Running cephadm as a nonroot user

2020-02-10 Thread Jason Borden
We have been using ceph-deploy in our existing cluster running as a non root 
user with sudo permissions. I've been working on getting an octopus cluster 
working using cephadm. During bootstrap I ran into a 
"execnet.gateway_bootstrap.HostNotFound" issue. It turns out that the problem 
was caused by an sshd setting we use: "PermitRootLogin no". Since we do not 
allow root ssh login directly, is there a way to make cephadm use ssh as a 
nonroot user with sudo permissions like we did with ceph-deploy?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Running cephadm as a nonroot user

2020-02-10 Thread Sage Weil
There is a 'packaged' mode that does this, but it's a bit different:

- you have to install the cephadm package on each host
- the package sets up a cephadm user and sudoers.d file
- mgr/cephadm will ssh in as that user and sudo as needed

The net is that you have to make sure cephadm is installed and up 
to date (vs the 'root' mode, which pipes the mgr's copy to the 
remote python interpreter).

On Mon, 10 Feb 2020, Jason Borden wrote:
> We have been using ceph-deploy in our existing cluster running as a non root 
> user with sudo permissions. I've been working on getting an octopus cluster 
> working using cephadm. During bootstrap I ran into a 
> "execnet.gateway_bootstrap.HostNotFound" issue. It turns out that the problem 
> was caused by an sshd setting we use: "PermitRootLogin no". Since we do not 
> allow root ssh login directly, is there a way to make cephadm use ssh as a 
> nonroot user with sudo permissions like we did with ceph-deploy?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs file layouts, empty objects in first data pool

2020-02-10 Thread Gregory Farnum
On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson  wrote:
>
>
> On Mon, 10 Feb 2020, Gregory Farnum wrote:
>
> > On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson  
> > wrote:
> >
> >   Hi,
> >
> >   running 14.2.6, debian buster (backports).
> >
> >   Have set up a cephfs with 3 data pools and one metadata pool:
> >   myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.
> >
> >   The data of all files are with the use of ceph.dir.layout.pool either
> >   stored in the pools myfs_data_hdd or myfs_data_ssd.  This has also 
> > been
> >   checked by dumping the ceph.file.layout.pool attributes of all files.
> >
> >   The filesystem has 1617949 files and 36042 directories.
> >
> >   There are however approximately as many objects in the first pool 
> > created
> >   for the cephfs, myfs_data, as there are files.  They also becomes 
> > more or
> >   fewer as files are created or deleted (so cannot be some leftover from
> >   earlier exercises).  Note how the USED size is reported as 0 bytes,
> >   correctly reflecting that no file data is stored in them.
> >
> >   POOL_NAME        USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD   WR_OPS      WR USED COMPR UNDER COMPR
> >   myfs_data         0 B 1618229      0 4854687                  0       0        0 2263590 129 GiB 23312479 124 GiB        0 B         0 B
> >   myfs_data_hdd 831 GiB  136309      0  408927                  0       0        0  106046 200 GiB   269084 277 GiB        0 B         0 B
> >   myfs_data_ssd  43 GiB 1552412      0 4657236                  0       0        0  181468 2.3 GiB  4661935  12 GiB        0 B         0 B
> >   myfs_metadata 1.2 GiB   36096      0  108288                  0       0        0 4828623  82 GiB  1355102 143 GiB        0 B         0 B
> >
> >   Is this expected?
> >
> >   I was assuming that in this scenario, all objects, both their data 
> > and any
> >   keys would be either in the metadata pool, or the two pools where the
> >   objects are stored.
> >
> >   Is it some additional metadata keys that are stored in the first
> >   created data pool for cephfs?  This would not be so nice in case the 
> > osd
> >   selection rules for it are using worse disks than the data itself...
> >
> >
> > https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-to-the-mds
> >  notes there is “a small amount of metadata” kept in the primary pool.
>
> Thanks!  This I managed to miss, probably as it was at the bottom of the
> page.  In case one wants to use layouts to separate fast (likely many)
> from slow (likely large) files, it then sounds as the primary pool should
> the fast kind too, due to the large amount of objects.  Thus this needs to
> be highlighted early in that documentation.
>
> > That’s not terribly clear; what is actually stored is a per-file location 
> > backtrace (its location in the directory tree) used for hardlink lookups 
> > and disaster recovery
> > scenarios.
>
> This info would be nice to add to the manual page.  It is nice to know
> what kind of information is stored there.

Yeah, PRs welcome. :p
Just to be clear though, that shouldn't be performance-critical. It's
lazily updated by the MDS when the directory location changes, but not
otherwise.
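
(If anyone wants to look at one of these backtrace objects, a rough sketch: the 
object name is the file's inode number in hex followed by ".00000000" -- the one 
below is just an example -- and the backtrace lives in the "parent" xattr of 
that object in the primary data pool:

  rados -p myfs_data getxattr 10000000001.00000000 parent > /tmp/parent
  ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json
)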

>
> Again thanks for the clarification!
>
> >   Btw: is there any tool to see the amount of key value data size 
> > associated
> >   with a pool?  'ceph osd df' gives omap and meta for osds, but not 
> > broken
> >   down per pool.
> >
> >
> > I think this is in the newest master code, but I’m not certain which 
> > release it’s in...
>
> Would it then (when available) also be in the 'rados df' command?

I really don't remember how everything is shared out but I think so?

>
> Best regards,
> Håkan
>
>
> > -Greg
> >
> >
> >
> >   Best regards,
> >   Håkan
> >   ___
> >   ceph-users mailing list -- ceph-users@ceph.io
> >   To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: extract disk usage stats from running ceph cluster

2020-02-10 Thread ceph
Hello MJ,

Perhaps your PGs are unbalanced?

ceph osd df tree

Greetz
Mehmet 

Am 10. Februar 2020 14:58:25 MEZ schrieb lists :
>Hi,
>
>We would like to replace the current seagate ST4000NM0034 HDDs in our 
>ceph cluster with SSDs, and before doing that, we would like to
>checkout 
>the typical usage of our current drives, over the last years, so we can
>
>select the best (price/performance/endurance) SSD to replace them with.
>
>I am trying to extract this info from the fields "Blocks received from 
>initiator" / "blocks sent to initiator", as these are the fields 
>smartctl gets from the seagate disks. But the numbers seem strange, and
>
>I would like to request feedback here.
>
>Three nodes, all equal, 8 OSDs per node, all 4TB ST4000NM0034 
>(filestore) HDDs with SSD-based journals:
>
>> root@node1:~# ceph osd crush tree
>> ID CLASS WEIGHT   TYPE NAME
>> -1   87.35376 root default
>> -2   29.11688 host node1
>>  0   hdd  3.64000 osd.0
>>  1   hdd  3.64000 osd.1
>>  2   hdd  3.63689 osd.2
>>  3   hdd  3.64000 osd.3
>> 12   hdd  3.64000 osd.12
>> 13   hdd  3.64000 osd.13
>> 14   hdd  3.64000 osd.14
>> 15   hdd  3.64000 osd.15
>> -3   29.12000 host node2
>>  4   hdd  3.64000 osd.4
>>  5   hdd  3.64000 osd.5
>>  6   hdd  3.64000 osd.6
>>  7   hdd  3.64000 osd.7
>> 16   hdd  3.64000 osd.16
>> 17   hdd  3.64000 osd.17
>> 18   hdd  3.64000 osd.18
>> 19   hdd  3.64000 osd.19
>> -4   29.11688 host node3
>>  8   hdd  3.64000 osd.8
>>  9   hdd  3.64000 osd.9
>> 10   hdd  3.64000 osd.10
>> 11   hdd  3.64000 osd.11
>> 20   hdd  3.64000 osd.20
>> 21   hdd  3.64000 osd.21
>> 22   hdd  3.64000 osd.22
>> 23   hdd  3.63689 osd.23
>
>We are looking at the numbers from smartctl, and basing our
>calculations 
>on this output for each individual various OSD:
>> Vendor (Seagate) cache information
>>   Blocks sent to initiator = 3783529066
>>   Blocks received from initiator = 3121186120
>>   Blocks read from cache and sent to initiator = 545427169
>>   Number of read and write commands whose size <= segment size =
>93877358
>>   Number of read and write commands whose size > segment size =
>2290879
>
>I created the following spreadsheet:
>
>>           blocks sent    blocks received   total blocks
>>           to initiator   from initiator    calculated    read%   write%  aka
>> node1
>> osd0       905060564      1900663448       2805724012   32,26%   67,74%  sda
>> osd1      2270442418      3756215880       6026658298   37,67%   62,33%  sdb
>> osd2      3531938448      3940249192       7472187640   47,27%   52,73%  sdc
>> osd3      2824808123      3130655416       5955463539   47,43%   52,57%  sdd
>> osd12     1956722491      1294854032       3251576523   60,18%   39,82%  sdg
>> osd13     3410188306      1265443936       4675632242   72,94%   27,06%  sdh
>> osd14     3765454090      3115079112       6880533202   54,73%   45,27%  sdi
>> osd15     2272246730      2218847264       4491093994   50,59%   49,41%  sdj
>>
>> node2
>> osd4      3974937107       740853712       4715790819   84,29%   15,71%  sda
>> osd5      1181377668      2109150744       3290528412   35,90%   64,10%  sdb
>> osd5      1903438106       608869008       2512307114   75,76%   24,24%  sdc
>> osd7      3511170043       724345936       4235515979   82,90%   17,10%  sdd
>> osd16     2642731906      3981984640       6624716546   39,89%   60,11%  sdg
>> osd17     3994977805      3703856288       7698834093   51,89%   48,11%  sdh
>> osd18     3992157229      2096991672       6089148901   65,56%   34,44%  sdi
>> osd19      279766405      1053039640       1332806045   20,99%   79,01%  sdj
>>
>> node3
>> osd8      3711322586       234696960       3946019546   94,05%    5,95%  sda
>> osd9      1203912715      3132990000       4336902715   27,76%   72,24%  sdb
>> osd10      912356010      1681434416       2593790426   35,17%   64,83%  sdc
>> osd11      810488345      2626589896       3437078241   23,58%   76,42%  sdd
>> osd20     1506879946      2421596680       3928476626   38,36%   61,64%  sdg
>> osd21     2991526593         7525120       2999051713   99,75%    0,25%  sdh
>> osd22       29560337      3226114552       3255674889    0,91%   99,09%  sdi
>> osd23     2019195656      2563506320       4582701976   44,06%   55,94%  sdj

[ceph-users] ERROR: osd init failed: (1) Operation not permitted

2020-02-10 Thread Ml Ml
Hello List,

first of all: Yes - I made mistakes. Now I am trying to recover :-/

I had a healthy 3 node cluster which I wanted to convert to a single one.
My goal was to reinstall a fresh 3 node cluster and start with 2 nodes.

I was able to turn it from a healthy 3 node cluster into a healthy 2 node
cluster. Then the problems began.

I started to change size=1 and min_size=1. (I know, I know, I will
never ever do that again!)
Health was okay until here. Then all of a sudden both nodes got
fenced... one node refused to boot, mons were missing, etc... to make a
long story short, here is where I am right now:


root@node03:~ # ceph -s
cluster b3be313f-d0ef-42d5-80c8-6b41380a47e3
 health HEALTH_WARN
53 pgs stale
53 pgs stuck stale
 monmap e4: 2 mons at {0=10.15.15.3:6789/0,1=10.15.15.2:6789/0}
election epoch 298, quorum 0,1 1,0
 osdmap e6097: 14 osds: 9 up, 9 in
  pgmap v93644673: 512 pgs, 1 pools, 1193 GB data, 304 kobjects
1088 GB used, 32277 GB / 33366 GB avail
 459 active+clean
  53 stale+active+clean

root@node03:~ # ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 32.56990 root default
-2 25.35992 host node03
 0  3.57999     osd.0      up      1.0          1.0
 5  3.62999     osd.5      up      1.0          1.0
 6  3.62999     osd.6      up      1.0          1.0
 7  3.62999     osd.7      up      1.0          1.0
 8  3.62999     osd.8      up      1.0          1.0
19  3.62999     osd.19     up      1.0          1.0
20  3.62999     osd.20     up      1.0          1.0
-3  7.20998 host node02
 3  3.62999     osd.3      up      1.0          1.0
 4  3.57999     osd.4      up      1.0          1.0
 1  0           osd.1    down        0          1.0
 9  0           osd.9    down        0          1.0
10  0           osd.10   down        0          1.0
17  0           osd.17   down        0          1.0
18  0           osd.18   down        0          1.0



my main mistakes seemed to be:

ceph osd out osd.1
ceph auth del osd.1
systemctl stop ceph-osd@1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
ceph osd crush remove osd.1

As far as I can tell, ceph is waiting for and needs data from that osd.1 (which
I removed).



root@node03:~ # ceph health detail
HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state
stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state
stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state
stale+active+clean, last acting [1]
pg 0.e0 is stuck stale for 5086.552855, current state
stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5086.552822, current state
stale+active+clean, last acting [1]
pg 0.13c is stuck stale for 5086.552791, current state
stale+active+clean, last acting [1]
[...] SNIP [...]
pg 0.e9 is stuck stale for 5086.552955, current state
stale+active+clean, last acting [1]
pg 0.87 is stuck stale for 5086.552939, current state
stale+active+clean, last acting [1]
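
The mapping the cluster still has for these stale pgs can be checked from the
monitors, e.g.:

  ceph pg dump_stuck stale
  ceph pg map 0.1a6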


When I try to start OSD.1 manually, I get:

2020-02-10 18:48:26.107444 7f9ce31dd880  0 ceph version 0.94.10
(b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid
10210
2020-02-10 18:48:26.134417 7f9ce31dd880  0
filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2020-02-10 18:48:26.184202 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is supported and appears to work
2020-02-10 18:48:26.184209 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2020-02-10 18:48:26.184526 7f9ce31dd880  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2020-02-10 18:48:26.184585 7f9ce31dd880  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize
is disabled by conf
2020-02-10 18:48:26.309755 7f9ce31dd880  0
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2020-02-10 18:48:26.633926 7f9ce31dd880  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.642185 7f9ce31dd880  1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.664273 7f9ce31dd880  0 
cls/hello/cls_hello.cc:271: loading cls_hello
2020-02-10 18:48:26.732154 7f9ce31dd880  0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for clients
2020-02-10 18:48:26.732163 7f9ce31dd880  0 osd.1 6002 crush map has
features 1107558400 was 8705, ad

[ceph-users] Re: Running cephadm as a nonroot user

2020-02-10 Thread Jason Borden
Thanks for the quick reply! I am using the cephadm package. I just wasn't aware 
of the user that was created as part of the package install. My 
/etc/sudoers.d/cephadm seems to be incorrect. It gives root permission to 
/usr/bin/cephadm, but cephadm is installed in /usr/sbin. That is easily fixed 
though. I added a "cephadm ALL=NOPASSWD: /usr/sbin/cephadm --image * bootstrap 
*", and ran "sudo /usr/sbin/cephadm --image 
ceph/daemon-base:latest-master-devel bootstrap --mon-ip 10.8.13.121" as the 
cephadm user, but still got the "execnet.gateway_bootstrap.HostNotFound" error. 
Is there documentation somewhere on how to bootstrap using the cephadm user?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: extract disk usage stats from running ceph cluster

2020-02-10 Thread Joe Comeau
try from admin node
 
ceph osd df
ceph osd status
thanks Joe
 

>>>  2/10/2020 10:44 AM >>>
Hello MJ,

Perhaps your PGs are a unbalanced?

Ceph osd df tree

Greetz
Mehmet 

Am 10. Februar 2020 14:58:25 MEZ schrieb lists :
>Hi,
>
>We would like to replace the current seagate ST4000NM0034 HDDs in our 
>ceph cluster with SSDs, and before doing that, we would like to
>checkout 
>the typical usage of our current drives, over the last years, so we can
>
>select the best (price/performance/endurance) SSD to replace them with.
>
>I am trying to extract this info from the fields "Blocks received from 
>initiator" / "blocks sent to initiator", as these are the fields 
>smartctl gets from the seagate disks. But the numbers seem strange, and
>
>I would like to request feedback here.
>
>Three nodes, all equal, 8 OSDs per node, all 4TB ST4000NM0034 
>(filestore) HDDs with SSD-based journals:
>
>> root@node1:~# ceph osd crush tree
>> ID CLASS WEIGHT   TYPE NAME
>> -1   87.35376 root default
>> -2   29.11688 host node1
>>  0   hdd  3.64000 osd.0
>>  1   hdd  3.64000 osd.1
>>  2   hdd  3.63689 osd.2
>>  3   hdd  3.64000 osd.3
>> 12   hdd  3.64000 osd.12
>> 13   hdd  3.64000 osd.13
>> 14   hdd  3.64000 osd.14
>> 15   hdd  3.64000 osd.15
>> -3   29.12000 host node2
>>  4   hdd  3.64000 osd.4
>>  5   hdd  3.64000 osd.5
>>  6   hdd  3.64000 osd.6
>>  7   hdd  3.64000 osd.7
>> 16   hdd  3.64000 osd.16
>> 17   hdd  3.64000 osd.17
>> 18   hdd  3.64000 osd.18
>> 19   hdd  3.64000 osd.19
>> -4   29.11688 host node3
>>  8   hdd  3.64000 osd.8
>>  9   hdd  3.64000 osd.9
>> 10   hdd  3.64000 osd.10
>> 11   hdd  3.64000 osd.11
>> 20   hdd  3.64000 osd.20
>> 21   hdd  3.64000 osd.21
>> 22   hdd  3.64000 osd.22
>> 23   hdd  3.63689 osd.23
>
>We are looking at the numbers from smartctl, and basing our
>calculations 
>on this output for each individual various OSD:
>> Vendor (Seagate) cache information
>>   Blocks sent to initiator = 3783529066
>>   Blocks received from initiator = 3121186120
>>   Blocks read from cache and sent to initiator = 545427169
>>   Number of read and write commands whose size <= segment size =
>93877358
>>   Number of read and write commands whose size > segment size =
>2290879
>
>I created the following spreadsheet:
>
>>           blocks sent    blocks received   total blocks
>>           to initiator   from initiator    calculated    read%   write%  aka
>> node1
>> osd0       905060564      1900663448       2805724012   32,26%   67,74%  sda
>> osd1      2270442418      3756215880       6026658298   37,67%   62,33%  sdb
>> osd2      3531938448      3940249192       7472187640   47,27%   52,73%  sdc
>> osd3      2824808123      3130655416       5955463539   47,43%   52,57%  sdd
>> osd12     1956722491      1294854032       3251576523   60,18%   39,82%  sdg
>> osd13     3410188306      1265443936       4675632242   72,94%   27,06%  sdh
>> osd14     3765454090      3115079112       6880533202   54,73%   45,27%  sdi
>> osd15     2272246730      2218847264       4491093994   50,59%   49,41%  sdj
>>
>> node2
>> osd4      3974937107       740853712       4715790819   84,29%   15,71%  sda
>> osd5      1181377668      2109150744       3290528412   35,90%   64,10%  sdb
>> osd5      1903438106       608869008       2512307114   75,76%   24,24%  sdc
>> osd7      3511170043       724345936       4235515979   82,90%   17,10%  sdd
>> osd16     2642731906      3981984640       6624716546   39,89%   60,11%  sdg
>> osd17     3994977805      3703856288       7698834093   51,89%   48,11%  sdh
>> osd18     3992157229      2096991672       6089148901   65,56%   34,44%  sdi
>> osd19      279766405      1053039640       1332806045   20,99%   79,01%  sdj
>>
>> node3
>> osd8      3711322586       234696960       3946019546   94,05%    5,95%  sda
>> osd9      1203912715      3132990000       4336902715   27,76%   72,24%  sdb
>> osd10      912356010      1681434416       2593790426   35,17%   64,83%  sdc
>> osd11      810488345      2626589896       3437078241   23,58%   76,42%  sdd
>> osd20     1506879946      2421596680       3928476626   38,36%   61,64%  sdg
>> osd21     2991526593         7525120       2999051713   99,75%    0,25%  sdh
>> osd22       29560337      3226114552       3255674889    0,91%   99,09%  sdi

[ceph-users] Re: Benefits of high RAM on a metadata server?

2020-02-10 Thread Marco Mühlenbeck

Hi together,
I am new here. I am a little bit confused about the discussion about the 
amount of RAM for the metadata server.
In the SUSE Deployment Guide for SUSE Enterprise Storage 6 (release 
2020-01-27), in the chapter "2.2 Minimum Cluster Configuration", there 
is this sentence:

"... Metadata Servers require incremental 4 GB RAM and four cores."
Your discussion is about 128 GB and 256 GB. This is far away from the 
SUSE minimum requirements.
Can you explain that, or give any hint as to why the values are so 
different?
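
If it helps frame the question: as far as I understand, the setting that 
controls how much RAM the MDS will actually use for its cache is 
mds_cache_memory_limit, e.g. (the 16 GiB value is just an example):

  ceph config get mds.0 mds_cache_memory_limit
  ceph config set mds mds_cache_memory_limit 17179869184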

Marco

Am 07.02.2020 um 09:05 schrieb Stefan Kooman:

Quoting Wido den Hollander (w...@42on.com):


On 2/6/20 11:01 PM, Matt Larson wrote:

Hi, we are planning out a Ceph storage cluster and were choosing
between 64GB, 128GB, or even 256GB on metadata servers. We are
considering having 2 metadata servers overall.

Does going to high levels of RAM possibly yield any performance
benefits? Is there a size beyond which there are just diminishing
returns vs cost?


The MDS will try to cache as many inodes as you allow it to.

So neither the number of users nor the total amount of bytes matters,
it's the number of inodes, thus: files and directories.

If clients are using unique datasets (files / directories) then the
number of clients does matter. If that is the case you might also ask
yourself why you need a clustered filesystem, as it will definitely not
speed things up compared to a local fs (metadata operations that is).


The more you have of those, the more memory it requires.

To clarify: in (active) use. Just having a lot of data around does not
necessarily require a lot of memory.


A lot of small files? A lot of memory!

The expected use case would be for a cluster where there might be
10-20 concurrent users working on individual datasets of 5TB in size.
I expect there would be lots of reads of the 5TB datasets matched with
the creation of hundreds to thousands of smaller files during
processing of the images.

Hundreds to thousands of files is not a lot. Are these datasets to be
stored permanently, or only temporarily? I guess it is convenient to
just configure one fs for all clients to use, but it might not be the
best fit / best performing solution in your case.

Gr. Stefan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6

2020-02-10 Thread Joe Bardgett
Has anyone attempted to use gdbpmp since 14.2.6 to grab data?  I have not been 
able to do it successfully on my clusters.  It just hangs while 
attaching to the process.

If you have been able to, would you be available for a discussion regarding 
your configuration?

Thanks,

Joe Bardgett
Storage Operations
jbardg...@godaddy.com

Cell - 480-221-7337
Office - 602-420-4403

This email message and any attachments hereto is intended for use only by the 
addressee(s) named herein and may contain legally privileged and/or 
confidential information. If you have received this email in error, please 
immediately notify the sender and permanently delete the original and any copy 
of this message and its attachments.

-Original Message-
From: Neha Ojha  
Sent: Wednesday, January 29, 2020 12:29 PM
To: Joe Bardgett 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6

Notice: This email is from an external sender.



Hi Joe,

Can you grab a wallclock profiler dump from the mgr process and share it with 
us? This was useful for us to get to the root cause of the issue in 14.2.5.

Quoting Mark's suggestion from "[ceph-users] High CPU usage by ceph-mgr in 
14.2.5" below.

If you can get a wallclock profiler on the mgr process we might be able to 
figure out specifics of what's taking so much time (ie processing pg_summary or 
something else).  Assuming you have gdb with the python bindings and the ceph 
debug packages installed, if you (or anyone) could try gdbpmp on the 100% mgr 
process that would be fantastic.


https://github.com/markhpc/gdbpmp


gdbpmp.py -p`pidof ceph-mgr` -n 1000 -o mgr.gdbpmp


If you want to view the results:


gdbpmp.py -i mgr.gdbpmp -t 1

Thanks,
Neha



On Wed, Jan 29, 2020 at 7:35 AM  wrote:
>
> Modules that are normally enabled:
>
> ceph mgr module ls | jq -r '.enabled_modules'
> [
>   "dashboard",
>   "prometheus",
>   "restful"
> ]
>
> We did test with all modules disabled, restarted the mgrs and saw no 
> difference.
>
> Joe
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs file layouts, empty objects in first data pool

2020-02-10 Thread Dave Hall
I was also confused by this topic and had intended to post a question 
this week.  The documentation I recall reading said something like 'if 
you want to use erasure coding on a CephFS, you should use a small 
replicated data pool as the first pool, and your erasure coded pool as 
the second.'  I did not see any obvious indication of how this would 
'auto-magically' put the small files in the replicated pool and the 
large files in the erasure pool, although this sounds like a desirable 
behavior.  Instead I found the notes on 'file layouts', which don't seem 
to be able to use size as a criterion.
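
A rough sketch of the only mechanism I have found so far -- layouts are 
assigned per directory (or file), not by size; the pool, filesystem and path 
names here are made up:

  # an EC pool must allow overwrites before cephfs can use it for data
  ceph osd pool set my_ec_pool allow_ec_overwrites true
  ceph fs add_data_pool myfs my_ec_pool

  # new files created under this directory get their data in the EC pool
  setfattr -n ceph.dir.layout.pool -v my_ec_pool /mnt/myfs/bulk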


Does anybody have anything further to add that would help clarify this?

Thanks.

-Dave

Dave Hall
Binghamton University

On 2/10/20 1:26 PM, Gregory Farnum wrote:

On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson  wrote:


On Mon, 10 Feb 2020, Gregory Farnum wrote:


On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson  wrote:

   Hi,

   running 14.2.6, debian buster (backports).

   Have set up a cephfs with 3 data pools and one metadata pool:
   myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.

   The data of all files are with the use of ceph.dir.layout.pool either
   stored in the pools myfs_data_hdd or myfs_data_ssd.  This has also been
   checked by dumping the ceph.file.layout.pool attributes of all files.

   The filesystem has 1617949 files and 36042 directories.

   There are however approximately as many objects in the first pool created
   for the cephfs, myfs_data, as there are files.  They also becomes more or
   fewer as files are created or deleted (so cannot be some leftover from
   earlier exercises).  Note how the USED size is reported as 0 bytes,
   correctly reflecting that no file data is stored in them.

   POOL_NAME        USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD   WR_OPS      WR USED COMPR UNDER COMPR
   myfs_data         0 B 1618229      0 4854687                  0       0        0 2263590 129 GiB 23312479 124 GiB        0 B         0 B
   myfs_data_hdd 831 GiB  136309      0  408927                  0       0        0  106046 200 GiB   269084 277 GiB        0 B         0 B
   myfs_data_ssd  43 GiB 1552412      0 4657236                  0       0        0  181468 2.3 GiB  4661935  12 GiB        0 B         0 B
   myfs_metadata 1.2 GiB   36096      0  108288                  0       0        0 4828623  82 GiB  1355102 143 GiB        0 B         0 B

   Is this expected?

   I was assuming that in this scenario, all objects, both their data and 
any
   keys would be either in the metadata pool, or the two pools where the
   objects are stored.

   Is it some additional metadata keys that are stored in the first
   created data pool for cephfs?  This would not be so nice in case the osd
   selection rules for it are using worse disks than the data itself...


https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-to-the-mds
 notes there is “a small amount of metadata” kept in the primary pool.

Thanks!  This I managed to miss, probably as it was at the bottom of the
page.  In case one wants to use layouts to separate fast (likely many)
from slow (likely large) files, it then sounds as the primary pool should
the fast kind too, due to the large amount of objects.  Thus this needs to
be highlighted early in that documentation.


That’s not terribly clear; what is actually stored is a per-file location 
backtrace (its location in the directory tree) used for hardlink lookups and 
disaster recovery
scenarios.

This info would be nice to add to the manual page.  It is nice to know
what kind of information is stored there.

Yeah, PRs welcome. :p
Just to be clear though, that shouldn't be performance-critical. It's
lazily updated by the MDS when the directory location changes, but not
otherwise.


Again thanks for the clarification!


   Btw: is there any tool to see the amount of key value data size 
associated
   with a pool?  'ceph osd df' gives omap and meta for osds, but not broken
   down per pool.


I think this is in the newest master code, but I’m not certain which release 
it’s in...

Would it then (when available) also be in the 'rados df' command?

I really don't remember how everything is shared out but I think so?


Best regards,
Håkan



-Greg



   Best regards,
   Håkan
   ___
   ceph-users mailing list -- ceph-users@ceph.io
   To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to monitor Ceph MDS operation latencies when slow cephfs performance

2020-02-10 Thread jalagam . ceph
Hello , 

CephFS operations are slow in our cluster. I see a low number of operations and 
low throughput in the pools, and on all other resources as well. I think it is 
MDS operations that are causing the issue. I increased mds_cache_memory_limit 
to 3 GB from 1 GB but am not seeing any improvements in the user access times.

How do I monitor the MDS operations, e.g. metadata operation latencies, 
including inode access/update times and directory operation latencies? 
We are using Ceph version 14.2.3.

I have increased mds_cache_memory_limit but am not sure how to check what is 
being used and how effectively we are using it.
# ceph config get mds.0 mds_cache_memory_limit
3221225472
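
For context, the commands I know of for poking at the MDS are along these 
lines (they need access to the admin socket on the host running the active 
MDS, and 'mds.x' is a placeholder for the real daemon name):

  # cache usage versus the configured limit
  ceph daemon mds.x cache status

  # raw MDS performance counters, including request latencies
  ceph daemon mds.x perf dump

  # requests currently being processed (useful for spotting slow ops)
  ceph daemon mds.x dump_ops_in_flight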

I also see this: we are managing PGs using the autoscaler, however I see BIAS as 
4.0 whereas all other pools have 1.0. I am not sure what this number is exactly 
and how it affects the cluster.
# ceph osd pool autoscale-status | egrep "cephfs|POOL"
 POOL                SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
 cephfs01-metadata  1775M               3.0         167.6T  0.0000                 4.0       8              on
 cephfs01-data0    739.5G               3.0         167.6T  0.0129                 1.0      32              on

There is one large OMAP. 
[root@knode25 /]# ceph health detail
HEALTH_WARN 1 large omap objects
LARGE_OMAP_OBJECTS 1 large omap objects
1 large objects found in pool 'cephfs01-metadata'
Search the cluster log for 'Large omap object found' for more details.


I recently had a similar one and I was able to remove it by running a deep 
scrub, but I am not sure why they keep forming and how to solve this for good.
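
A rough sketch of how I located it last time (assuming the default cluster log 
location on a monitor host):

  # find which pg/object triggered the warning
  grep 'Large omap object found' /var/log/ceph/ceph.log

  # re-check after cleanup by deep-scrubbing that pg
  ceph pg deep-scrub <pgid>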


Thanks,
Uday.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Running cephadm as a nonroot user

2020-02-10 Thread Jason Borden
Ok, I've been digging around a bit with the code and made progress, but haven't 
got it all working yet. Here's what I've done:

# yum install cephadm
# ln -s ../sbin/cephadm /usr/bin/cephadm  #Needed to reference the correct path
# cephadm bootstrap --output-config /etc/ceph/ceph.conf --output-keyring 
/etc/ceph/ceph.client.admin.keyring --skip-ssh --mon-ip 10.8.13.121
# ceph mgr module enable cephadm
# ceph config set mgr mgr/cephadm/mode cephadm-package
# ceph cephadm generate-key
# ceph cephadm get-pub-key > ~cephadm/.ssh/authorized_keys
# echo "cephadm ALL=NOPASSWD: /usr/bin/cephadm --image * check-host *" >> 
/etc/sudoers.d/cephadm  #Needed to run check-host as part of joining other nodes

At this point it seems to be configured to use the cephadm user. Distribute the 
public key to other hosts with cephadm package installed.

However I am now getting errors when trying to add a host with orchestrator or 
when doing a host-check:
# ceph orchestrator host add k8shost3.acedatacenter.com 10.8.13.123
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1069, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 309, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator.py", line 142, in wrapper
return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator_cli/module.py", line 168, in _add_host
orchestrator.raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator.py", line 639, in raise_if_exception
raise e
orchestrator.OrchestratorError: New host k8shost3.acedatacenter.com 
(10.8.13.123) failed check: ['usage: cephadm [-h] [--image IMAGE] [--docker] 
[--data-dir DATA_DIR]', '   [--log-dir LOG_DIR] [--logrotate-dir 
LOGROTATE_DIR]', '   [--unit-dir UNIT_DIR] [--verbose] [--timeout 
TIMEOUT]', '   
{version,pull,ls,adopt,rm-daemon,rm-cluster,run,shell,enter,ceph-volume,unit,bootstrap,deploy,check-host}',
 '   ...', 'cephadm: error: unrecognized arguments: 
--expect-hostname k8shost3.acedatacenter.com']

Again, I'm not sure I've done everything that I need to get things set up. Any 
thoughts?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Running cephadm as a nonroot user

2020-02-10 Thread Jason Borden
I missed a line while pasting the previous message:
# ceph orchestrator set backend cephadm
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs file layouts, empty objects in first data pool

2020-02-10 Thread Håkan T Johansson


On Mon, 10 Feb 2020, Gregory Farnum wrote:


On Mon, Feb 10, 2020 at 12:29 AM Håkan T Johansson  wrote:



On Mon, 10 Feb 2020, Gregory Farnum wrote:


On Sun, Feb 9, 2020 at 3:24 PM Håkan T Johansson  wrote:

  Hi,

  running 14.2.6, debian buster (backports).

  Have set up a cephfs with 3 data pools and one metadata pool:
  myfs_data, myfs_data_hdd, myfs_data_ssd, and myfs_metadata.

  The data of all files are with the use of ceph.dir.layout.pool either
  stored in the pools myfs_data_hdd or myfs_data_ssd.  This has also been
  checked by dumping the ceph.file.layout.pool attributes of all files.

  The filesystem has 1617949 files and 36042 directories.

  There are however approximately as many objects in the first pool created
  for the cephfs, myfs_data, as there are files.  They also becomes more or
  fewer as files are created or deleted (so cannot be some leftover from
  earlier exercises).  Note how the USED size is reported as 0 bytes,
  correctly reflecting that no file data is stored in them.

  POOL_NAME        USED OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD   WR_OPS      WR USED COMPR UNDER COMPR
  myfs_data         0 B 1618229      0 4854687                  0       0        0 2263590 129 GiB 23312479 124 GiB        0 B         0 B
  myfs_data_hdd 831 GiB  136309      0  408927                  0       0        0  106046 200 GiB   269084 277 GiB        0 B         0 B
  myfs_data_ssd  43 GiB 1552412      0 4657236                  0       0        0  181468 2.3 GiB  4661935  12 GiB        0 B         0 B
  myfs_metadata 1.2 GiB   36096      0  108288                  0       0        0 4828623  82 GiB  1355102 143 GiB        0 B         0 B

  Is this expected?

  I was assuming that in this scenario, all objects, both their data and any
  keys would be either in the metadata pool, or the two pools where the
  objects are stored.

  Is it some additional metadata keys that are stored in the first
  created data pool for cephfs?  This would not be so nice in case the osd
  selection rules for it are using worse disks than the data itself...


https://docs.ceph.com/docs/master/cephfs/file-layouts/#adding-a-data-pool-to-the-mds
 notes there is “a small amount of metadata” kept in the primary pool.


Thanks!  This I managed to miss, probably as it was at the bottom of the
page.  In case one wants to use layouts to separate fast (likely many)
from slow (likely large) files, it then sounds as the primary pool should
the fast kind too, due to the large amount of objects.  Thus this needs to
be highlighted early in that documentation.


That’s not terribly clear; what is actually stored is a per-file location 
backtrace (its location in the directory tree) used for hardlink lookups and 
disaster recovery
scenarios.


This info would be nice to add to the manual page.  It is nice to know
what kind of information is stored there.


Yeah, PRs welcome. :p
Just to be clear though, that shouldn't be performance-critical. It's
lazily updated by the MDS when the directory location changes, but not
otherwise.


The sheer number of objects seems to make a big difference when a pg is 
rebalanced between drives though, in particular from an HDD (with SSD DB), 
despite the objects holding no data?  (Also compared to the metadata pool, 
which does not have one object per file.)


Best regards,
Håkan






Again thanks for the clarification!


  Btw: is there any tool to see the amount of key value data size associated
  with a pool?  'ceph osd df' gives omap and meta for osds, but not broken
  down per pool.


I think this is in the newest master code, but I’m not certain which release 
it’s in...


Would it then (when available) also be in the 'rados df' command?


I really don't remember how everything is shared out but I think so?



Best regards,
Håkan



-Greg



  Best regards,
  Håkan
  ___
  ceph-users mailing list -- ceph-users@ceph.io
  To unsubscribe send an email to ceph-users-le...@ceph.io







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io