[ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi folks,

I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far leads me to think that this information is only available as
long as the requests are actually blocked. Is this
correct?

The MON logs only show that those events occur and how many requests are in a
blocked state, but give no indication of which OSD is
affected. Is there a way to identify blocking requests from the OSD log files?


On a side note: I was trying to write a small Python script that would extract
this kind of information in real time. While I was able to register a MonitorLog
callback that receives the same messages as you would get with "ceph -w", I
haven't seen anything in the librados Python bindings documentation that would do
the equivalent of "ceph health detail". Any suggestions on how to get the
blocking OSDs via librados?
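
Roughly, this sketch is what I have in mind (assuming python-rados is installed
and /etc/ceph/ceph.conf plus a keyring are readable; the REQUEST_SLOW field
names are what I see on Luminous and may differ between releases):

#!/usr/bin/env python
# Sketch: poll "health detail" via librados instead of parsing "ceph -w".
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({"prefix": "health", "detail": "detail", "format": "json"})
    ret, outbuf, errs = cluster.mon_command(cmd, b'')
    if ret != 0:
        raise RuntimeError("mon_command failed: %s" % errs)
    health = json.loads(outbuf)
    # On Luminous the per-check details live under checks/<NAME>/detail.
    slow = health.get("checks", {}).get("REQUEST_SLOW", {})
    for entry in slow.get("detail", []):
        # e.g. "osds 16,27,29 have blocked requests > 65.536 sec"
        print(entry.get("message"))
finally:
    cluster.shutdown()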


Thanks,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked

2018-05-16 Thread Grigory Murashov

Hello Paul!

Thanks for your answer.

How did you determine that it's RGW metadata stuff?

No, I don't use any SSDs. Where can I find out more about metadata
pools, using SSDs, etc.?


Thanks.

Grigory Murashov
Voximplant

15.05.2018 23:42, Paul Emmerich wrote:
Looks like it's mostly RGW metadata stuff; are you running your 
non-data RGW pools on SSDs (you should, that can help *a lot*)?



Paul

2018-05-15 18:49 GMT+02:00 Grigory Murashov >:


Hello guys!

I collected output of ceph daemon osd.16 dump_ops_in_flight and
ceph daemon osd.16 dump_historic_ops.

Here is the output of ceph heath details in the moment of problem

HEALTH_WARN 20 slow requests are blocked > 32 sec
REQUEST_SLOW 20 slow requests are blocked > 32 sec
    20 ops are blocked > 65.536 sec
    osds 16,27,29 have blocked requests > 65.536 sec

So I grab logs from osd.16.

The file is attached.  Could you please help to translate?

Thanks in advance.

Grigory Murashov
Voximplant

14.05.2018 18:14, Grigory Murashov wrote:


Hello David!

2. I set it up 10/10

3. Thanks, my problem was I did it on host where was no osd.15
daemon.

Could you please help to read osd logs?

Here is a part from ceph.log

2018-05-14 13:46:32.644323 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553895 :
cluster [INF] Cluster is now healthy
2018-05-14 13:46:43.741921 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553896 :
cluster [WRN] Health check failed: 21 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:46:49.746994 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553897 :
cluster [WRN] Health check update: 23 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:46:55.752314 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553900 :
cluster [WRN] Health check update: 3 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:47:01.030686 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553901 :
cluster [WRN] Health check update: 4 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:47:07.764236 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553903 :
cluster [WRN] Health check update: 32 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:47:13.770833 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553904 :
cluster [WRN] Health check update: 21 slow requests are blocked >
32 sec (REQUEST_SLOW)
2018-05-14 13:47:17.774530 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553905 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 12 slow
requests are blocked > 32 sec)
2018-05-14 13:47:17.774582 mon.storage-ru1-osd1 mon.0
185.164.149.2:6789/0  553906 :
cluster [INF] Cluster is now healthy

At 13-47 I had a problem with osd.21

1. Ceph Health (storage-ru1-osd1.voximplant.com:ceph.health): HEALTH_WARN
{u'REQUEST_SLOW': {u'severity': u'HEALTH_WARN', u'summary': {u'message': u'4 
slow requests are blocked > 32 sec'}}}
HEALTH_WARN 4 slow requests are blocked > 32 sec
REQUEST_SLOW 4 slow requests are blocked > 32 sec
 2 ops are blocked > 65.536 sec
 2 ops are blocked > 32.768 sec
 osd.21 has blocked requests > 65.536 sec

Here is a part from ceph-osd.21.log

2018-05-14 13:47:06.891399 7fb806dd6700 10 osd.21 pg_epoch: 236
pg[2.0( v 236'297 (0'0,236'297] local-lis/les=223/224 n=1
ec=119/119 lis/c 223/223 les/c/f 224/224/0 223/223/212) [21,29,15]
r=0 lpr=223 crt=236'297 lcod 236'296 mlcod 236'296 active+clean] 
dropping ondisk_read_lock
2018-05-14 13:47:06.891435 7fb806dd6700 10 osd.21 236 dequeue_op
0x56453b753f80 finish
2018-05-14 13:47:07.111388 7fb8185f9700 10 osd.21 236 tick
2018-05-14 13:47:07.111398 7fb8185f9700 10 osd.21 236 do_waiters
-- start
2018-05-14 13:47:07.111401 7fb8185f9700 10 osd.21 236 do_waiters
-- finish
2018-05-14 13:47:07.800421 7fb817df8700 10 osd.21 236
tick_without_osd_lock
2018-05-14 13:47:07.800444 7fb817df8700 10 osd.21 236
promote_throttle_recalibrate 0 attempts, promoted 0 objects and
0  bytes; target 25 obj/sec or 5120 k bytes/sec
2018-05-14 13:47:07.800449 7fb817df8700 10 osd.21 236
promote_throttle_recalibrate  actual 0, actual/prob ratio 1,
adjusted new_prob 1000, prob 1000 -> 1000
2018-05-14 13:47:08.111470 7fb8185f9700 10 osd.21 236 tick
2018-05-14 13:47:08.111483 7fb8185f9700 10 osd.21 236 do_waiters
-- start
2018-05-14 13:47:08.111485 7fb8185f9700 10 osd.21 236 do_waiters
-- finish
2018-05-14 13:47:08.181070 7fb8055d3700 10

Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai
Hi,

On 05/16/2018 04:16 AM, Uwe Sauter wrote:
> Hi folks,
>
> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
> like to identify the OSD that is causing those events
> once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
> get this info in realtime).
>
> Collecting this information could help identify aging disks if you were able 
> to accumulate and analyze which OSD had blocking
> requests in the past and how often those events occur.
>
> My research so far let's me think that this information is only available as 
> long as the requests are actually blocked. Is this
> correct?

I think this is what you're looking for:

$> ceph daemon osd.X dump_historic_slow_ops

which gives you recent slow operations, as opposed to

$> ceph daemon osd.X dump_blocked_ops

which returns current blocked operations. You can also add a filter to
those commands.
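
For example, a quick sketch that summarizes what one OSD reports, run on the
host where that OSD lives (assuming the usual "ops" list with "duration" and
"description" fields, as on Luminous):

#!/usr/bin/env python
# Sketch: list recent slow ops reported by one OSD over its admin socket.
import json
import subprocess
import sys

osd_id = sys.argv[1] if len(sys.argv) > 1 else "16"
out = subprocess.check_output(
    ["ceph", "daemon", "osd." + osd_id, "dump_historic_slow_ops"])
dump = json.loads(out)

for op in dump.get("ops", []):
    # Each entry records how long the op took and what it was.
    print("%8.3fs  %s" % (op.get("duration", 0.0), op.get("description", "")))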

Mohamad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi Mohamad,


>> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
>> like to identify the OSD that is causing those events
>> once the cluster is back to HEALTH_OK (as I have no monitoring yet that 
>> would get this info in realtime).
>>
>> Collecting this information could help identify aging disks if you were able 
>> to accumulate and analyze which OSD had blocking
>> requests in the past and how often those events occur.
>>
>> My research so far let's me think that this information is only available as 
>> long as the requests are actually blocked. Is this
>> correct?
> 
> I think this is what you're looking for:
> 
> $> ceph daemon osd.X dump_historic_slow_ops
> 
> which gives you recent slow operations, as opposed to
> 
> $> ceph daemon osd.X dump_blocked_ops
> 
> which returns current blocked operations. You can also add a filter to
> those commands.

Thanks for these commands. I'll have a look into those. If I understand these
correctly, it means that I need to run them on each
server for each OSD instead of from a central location, is that correct?

Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread Blair Bethwaite
On 15 May 2018 at 08:45, Wido den Hollander  wrote:
>
> > We've got some Skylake Ubuntu based hypervisors that we can look at to
> > compare tomorrow...
> >
>
> Awesome!


Ok, so results still inconclusive I'm afraid...

The Ubuntu machines we're looking at (Dell R740s and C6420s running with
Performance BIOS power profile, which amongst other things disables cstates
and enables turbo) are currently running either a 4.13 or a 4.15 HWE kernel
- we needed 4.13 to support PERC10 and even get them booting from local
storage, then 4.15 to get around a prlimit bug that was breaking Nova
snapshots, so here we are. Where are you getting 4.16,
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ?

So interestingly in our case we seem to have no cpufreq driver loaded.
After installing linux-generic-tools (cause cpupower is supposed to
supersede cpufrequtils I think?):

rr42-03:~$ uname -a
Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar 17
03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

rr42-03:~$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro
intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1

rr42-03:~$ lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):36
On-line CPU(s) list:   0-35
Thread(s) per core:1
Core(s) per socket:18
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 85
Model name:Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:  4
CPU MHz:   3400.956
BogoMIPS:  5401.45
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  1024K
L3 cache:  25344K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin
mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap
clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local ibpb ibrs stibp
dtherm ida arat pln pts pku ospke

rr42-03:~$ sudo cpupower frequency-info
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  CPUs which run at the same hardware frequency: Not Available
  CPUs which need to have their frequency coordinated by software: Not
Available
  maximum transition latency:  Cannot determine or is not supported.
Not Available
  available cpufreq governors: Not Available
  Unable to determine current policy
  current CPU frequency: Unable to call hardware
  current CPU frequency:  Unable to call to kernel
  boost state support:
Supported: yes
Active: yes


And of course there is nothing under sysfs (/sys/devices/system/cpu*). But
/proc/cpuinfo and cpupower-monitor show that we seem to be hitting turbo
freqs:

rr42-03:~$ sudo cpupower monitor
  |Nehalem|| Mperf
PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq
   0|   0|   0|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3391
   0|   1|   4|  0.00|  0.00|  0.00|  0.00||  0.02| 99.98|  3389
   0|   2|   8|  0.00|  0.00|  0.00|  0.00||  0.14| 99.86|  3067
   0|   3|   6|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3385
   0|   4|   2|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  3119
   0|   8|  12|  0.00|  0.00|  0.00|  0.00||  0.03| 99.97|  3312
   0|   9|  16|  0.00|  0.00|  0.00|  0.00||  0.11| 99.89|  3157
   0|  10|  14|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3352
   0|  11|  10|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3390
   0|  16|  20|  0.00|  0.00|  0.00|  0.00||  0.00|100.00|  3387
   0|  17|  24|  0.00|  0.00|  0.00|  0.00||  0.22| 99.78|  3115
   0|  18|  26|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3389
   0|  19|  22|  0.00|  0.00|  0.00|  0.00||  0.00|100.00|  3366
   0|  20|  18|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3392
   0|  24|  28|  0.00|  0.00|  0.00|  0.00||  0.00|100.00|  3376
   0|  25|  32|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3390
   0|  26|  34|  0.00|  0.00|  0.00|  0.00||  0.03| 99.97|  3391
   0|  27|  30|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3392
   1|   0|   1|  0.00|  0.00|  0.00|  0.00||  0.00|100.00|  3394
   1|   1|   5|  0.00|  0.00|  0.00|  0.

Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread John Hearns
Blair,
   methinks someone is doing bitcoin mining on your systems when they are
idle   :-)

I WAS going to say that maybe the cpupower utility needs an update to cope
with that generation of CPUs.
But /proc/cpuinfo never lies  (does it ?)




On 16 May 2018 at 13:22, Blair Bethwaite  wrote:

> On 15 May 2018 at 08:45, Wido den Hollander  wrote:
>>
>> > We've got some Skylake Ubuntu based hypervisors that we can look at to
>> > compare tomorrow...
>> >
>>
>> Awesome!
>
>
> Ok, so results still inconclusive I'm afraid...
>
> The Ubuntu machines we're looking at (Dell R740s and C6420s running with
> Performance BIOS power profile, which amongst other things disables cstates
> and enables turbo) are currently running either a 4.13 or a 4.15 HWE kernel
> - we needed 4.13 to support PERC10 and even get them booting from local
> storage, then 4.15 to get around a prlimit bug that was breaking Nova
> snapshots, so here we are. Where are you getting 4.16,
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ?
>
> So interestingly in our case we seem to have no cpufreq driver loaded.
> After installing linux-generic-tools (cause cpupower is supposed to
> supersede cpufrequtils I think?):
>
> rr42-03:~$ uname -a
> Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar 17
> 03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> rr42-03:~$ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro
> intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1
>
> rr42-03:~$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):36
> On-line CPU(s) list:   0-35
> Thread(s) per core:1
> Core(s) per socket:18
> Socket(s): 2
> NUMA node(s):  2
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 85
> Model name:Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> Stepping:  4
> CPU MHz:   3400.956
> BogoMIPS:  5401.45
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  1024K
> L3 cache:  25344K
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
> monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca
> sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3
> invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a
> avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw
> avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
> cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku ospke
>
> rr42-03:~$ sudo cpupower frequency-info
> analyzing CPU 0:
>   no or unknown cpufreq driver is active on this CPU
>   CPUs which run at the same hardware frequency: Not Available
>   CPUs which need to have their frequency coordinated by software: Not
> Available
>   maximum transition latency:  Cannot determine or is not supported.
> Not Available
>   available cpufreq governors: Not Available
>   Unable to determine current policy
>   current CPU frequency: Unable to call hardware
>   current CPU frequency:  Unable to call to kernel
>   boost state support:
> Supported: yes
> Active: yes
>
>
> And of course there is nothing under sysfs (/sys/devices/system/cpu*). But
> /proc/cpuinfo and cpupower-monitor show that we seem to be hitting turbo
> freqs:
>
> rr42-03:~$ sudo cpupower monitor
>   |Nehalem|| Mperf
> PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq
>0|   0|   0|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3391
>0|   1|   4|  0.00|  0.00|  0.00|  0.00||  0.02| 99.98|  3389
>0|   2|   8|  0.00|  0.00|  0.00|  0.00||  0.14| 99.86|  3067
>0|   3|   6|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3385
>0|   4|   2|  0.00|  0.00|  0.00|  0.00||  0.09| 99.91|  3119
>0|   8|  12|  0.00|  0.00|  0.00|  0.00||  0.03| 99.97|  3312
>0|   9|  16|  0.00|  0.00|  0.00|  0.00||  0.11| 99.89|  3157
>0|  10|  14|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3352
>0|  11|  10|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3390
>0|  16|  20|  0.00|  0.00|  0.00|  0.00||  0.00|100.00|  3387
>0|  17|  24|  0.00|  0.00|  0.00|  0.00||  0.22| 99.78|  3115
>0|  18|  26|  0.00|  0.00|  0.00|  0.00||  0.01| 99.99|  3389
>0|  1

Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai

On 05/16/2018 07:18 AM, Uwe Sauter wrote:
> Hi Mohamad,
>
>>
>> I think this is what you're looking for:
>>
>> $> ceph daemon osd.X dump_historic_slow_ops
>>
>> which gives you recent slow operations, as opposed to
>>
>> $> ceph daemon osd.X dump_blocked_ops
>>
>> which returns current blocked operations. You can also add a filter to
>> those commands.
> Thanks for these commands. I'll have a look into those. If I understand these 
> correctly it means that I need to run these at each
> server for each OSD instead of at a central location, is that correct?
>

That's the case, as it uses the admin socket.
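
You can still drive it from one box over SSH if you want a central view; a
rough sketch (the host-to-OSD mapping below is made up, you would build yours
from "ceph osd tree"):

#!/usr/bin/env python
# Sketch: gather blocked-op dumps for every OSD from one central host via SSH.
import json
import subprocess

# Example mapping only -- generate it from "ceph osd tree" in a real script.
OSDS_BY_HOST = {
    "osd-host-1": [16, 21],
    "osd-host-2": [27, 29],
}

for host, osds in sorted(OSDS_BY_HOST.items()):
    for osd in osds:
        cmd = ["ssh", host, "ceph", "daemon", "osd.%d" % osd, "dump_blocked_ops"]
        try:
            dump = json.loads(subprocess.check_output(cmd))
        except subprocess.CalledProcessError:
            continue  # daemon not on that host or admin socket unavailable
        for op in dump.get("ops", []):
            print("%s osd.%d: %s" % (host, osd, op.get("description", "")))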

Mohamad

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread Blair Bethwaite
Possibly, but I think they'd be using the V100s rather than the CPUs.

For reference:

rr42-03:~$ sudo cpupower monitor -l
Monitor "Nehalem" (4 states) - Might overflow after 92200 s
C3  [C] -> Processor Core C3
C6  [C] -> Processor Core C6
PC3 [P] -> Processor Package C3
PC6 [P] -> Processor Package C6
Monitor "Mperf" (3 states) - Might overflow after 92200 s
C0  [T] -> Processor Core not idle
Cx  [T] -> Processor Core in an idle state
Freq[T] -> Average Frequency (including boost) in MHz

So that node is doing nothing much right now.


On 16 May 2018 at 04:29, John Hearns  wrote:

> Blair,
>methinks someone is doing bitcoin mining on your systems when they are
> idle   :-)
>
> I WAS going to say that maybe the cpupower utility needs an update to cope
> with that generation of CPUs.
> But 7proc/cpuinfo never lies  (does it ?)
>
>
>
>
> On 16 May 2018 at 13:22, Blair Bethwaite 
> wrote:
>
>> On 15 May 2018 at 08:45, Wido den Hollander  wrote:
>>>
>>> > We've got some Skylake Ubuntu based hypervisors that we can look at to
>>> > compare tomorrow...
>>> >
>>>
>>> Awesome!
>>
>>
>> Ok, so results still inconclusive I'm afraid...
>>
>> The Ubuntu machines we're looking at (Dell R740s and C6420s running with
>> Performance BIOS power profile, which amongst other things disables cstates
>> and enables turbo) are currently running either a 4.13 or a 4.15 HWE kernel
>> - we needed 4.13 to support PERC10 and even get them booting from local
>> storage, then 4.15 to get around a prlimit bug that was breaking Nova
>> snapshots, so here we are. Where are you getting 4.16,
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ?
>>
>> So interestingly in our case we seem to have no cpufreq driver loaded.
>> After installing linux-generic-tools (cause cpupower is supposed to
>> supersede cpufrequtils I think?):
>>
>> rr42-03:~$ uname -a
>> Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar 17
>> 03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> rr42-03:~$ cat /proc/cmdline
>> BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro
>> intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1
>>
>> rr42-03:~$ lscpu
>> Architecture:  x86_64
>> CPU op-mode(s):32-bit, 64-bit
>> Byte Order:Little Endian
>> CPU(s):36
>> On-line CPU(s) list:   0-35
>> Thread(s) per core:1
>> Core(s) per socket:18
>> Socket(s): 2
>> NUMA node(s):  2
>> Vendor ID: GenuineIntel
>> CPU family:6
>> Model: 85
>> Model name:Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
>> Stepping:  4
>> CPU MHz:   3400.956
>> BogoMIPS:  5401.45
>> Virtualization:VT-x
>> L1d cache: 32K
>> L1i cache: 32K
>> L2 cache:  1024K
>> L3 cache:  25344K
>> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
>> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
>> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
>> monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca
>> sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
>> rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3
>> invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid
>> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a
>> avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw
>> avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>> cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku ospke
>>
>> rr42-03:~$ sudo cpupower frequency-info
>> analyzing CPU 0:
>>   no or unknown cpufreq driver is active on this CPU
>>   CPUs which run at the same hardware frequency: Not Available
>>   CPUs which need to have their frequency coordinated by software: Not
>> Available
>>   maximum transition latency:  Cannot determine or is not supported.
>> Not Available
>>   available cpufreq governors: Not Available
>>   Unable to determine current policy
>>   current CPU frequency: Unable to call hardware
>>   current CPU frequency:  Unable to call to kernel
>>   boost state support:
>> Supported: yes
>> Active: yes
>>
>>
>> And of course there is nothing under sysfs (/sys/devices/system/cpu*).
>> But /proc/cpuinfo and cpupower-monitor show that we seem to be hitting
>> turbo freqs:
>>
>> rr42-03:~$ sudo cpupower monitor
>>   |Nehalem|| Mperf
>> PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq
>>0|   0|   0|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3391
>>

[ceph-users] ceph as storage for docker registry

2018-05-16 Thread Tomasz Płaza

Hi All,

We are running ceph 12.2.3 as storage for a docker registry with the swift API.
This is the only workload using the swift API on our ceph cluster. We need to run
radosgw-admin bucket check --fix --check-objects --bucket docker-registry from
time to time to fix the issue described at
https://github.com/docker/distribution/issues/2430.
Last time I had to push some tags manually with an s3 client because the swift
client showed
err.detail="swift: Timeout expired while waiting for segments of /docker/"

In ceph conf We have:

[client]
 rgw resolve cname = true
 rgw thread pool size = 256
 rgw num rados handles = 8
 rgw override bucket index max shards = 8
 log file = /dev/null
 rgw swift url = http://someurl.com

Is there any recommended configuration for ceph and docker-registry on the swift
API?

Thanks,
Tom


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] multi site with cephfs

2018-05-16 Thread Up Safe
Hi,

I'm trying to build a multi site setup.
But the only guides I've found on the net were about building it with
object storage or rbd.
What I need is cephfs.

I.e. I need to have 2 synced file storages at 2 geographical locations.
Is this possible?

Also, if I understand correctly - cephfs is just a component on top of the
object storage.
Following this logic - it should be possible, right?

Or am I totally off here?

Thanks,
Leon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-16 Thread John Hearns
Leon,
I was at a Lenovo/SuSE seminar yesterday and asked a similar question
regarding separated sites.
How far apart are these two geographical locations?   It does matter.

On 16 May 2018 at 15:07, Up Safe  wrote:

> Hi,
>
> I'm trying to build a multi site setup.
> But the only guides I've found on the net were about building it with
> object storage or rbd.
> What I need is cephfs.
>
> I.e. I need to have 2 synced file storages at 2 geographical locations.
> Is this possible?
>
> Also, if I understand correctly - cephfs is just a component on top of the
> object storage.
> Following this logic - it should be possible, right?
>
> Or am I totally off here?
>
> Thanks,
> Leon
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-16 Thread Up Safe
Hi,

About a 100 km.
I have a 2-4ms latency between them.

Leon

On Wed, May 16, 2018, 16:13 John Hearns  wrote:

> Leon,
> I was at a Lenovo/SuSE seminar yesterday and asked a similar question
> regarding separated sites.
> How far apart are these two geographical locations?   It does matter.
>
> On 16 May 2018 at 15:07, Up Safe  wrote:
>
>> Hi,
>>
>> I'm trying to build a multi site setup.
>> But the only guides I've found on the net were about building it with
>> object storage or rbd.
>> What I need is cephfs.
>>
>> I.e. I need to have 2 synced file storages at 2 geographical locations.
>> Is this possible?
>>
>> Also, if I understand correctly - cephfs is just a component on top of
>> the object storage.
>> Following this logic - it should be possible, right?
>>
>> Or am I totally off here?
>>
>> Thanks,
>> Leon
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread Wido den Hollander


On 05/16/2018 01:22 PM, Blair Bethwaite wrote:
> On 15 May 2018 at 08:45, Wido den Hollander  > wrote:
> 
> > We've got some Skylake Ubuntu based hypervisors that we can look at to
> > compare tomorrow...
> > 
> 
> Awesome!
> 
> 
> Ok, so results still inconclusive I'm afraid...
> 
> The Ubuntu machines we're looking at (Dell R740s and C6420s running with
> Performance BIOS power profile, which amongst other things disables
> cstates and enables turbo) are currently running either a 4.13 or a 4.15
> HWE kernel - we needed 4.13 to support PERC10 and even get them booting
> from local storage, then 4.15 to get around a prlimit bug that was
> breaking Nova snapshots, so here we are. Where are you getting 4.16,
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ?
> 

Yes, that's where I got 4.16: 4.16.7-041607-generic

I also tried 4.16.8, but that didn't change anything either.

The servers I am testing with are these:
https://www.supermicro.nl/products/system/1U/1029/SYS-1029U-TN10RT.cfm

> So interestingly in our case we seem to have no cpufreq driver loaded.
> After installing linux-generic-tools (cause cpupower is supposed to
> supersede cpufrequtils I think?):
> 
> rr42-03:~$ uname -a
> Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar
> 17 03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> rr42-03:~$ cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro
> intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1
> 

I have those settings as well, intel_idle and processor.max_cstate.

[1.776036] intel_idle: disabled

That works, the CPUs stay in C0 or C1 according to i7z, but they are
clocking down in Mhz, for example:

processor   : 23
vendor_id   : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
stepping: 4
microcode   : 0x243
cpu MHz : 799.953

/sys/devices/system/cpu/intel_pstate/min_perf_pct is set to 100, but
that setting doesn't seem to do anything.

I'm running out of ideas :-)

Wido

> rr42-03:~$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                36
> On-line CPU(s) list:   0-35
> Thread(s) per core:    1
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> Stepping:              4
> CPU MHz:               3400.956
> BogoMIPS:              5401.45
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              25344K
> NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
> NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid
> dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
> f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3
> invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx
> rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
> cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku
> ospke
> 
> rr42-03:~$ sudo cpupower frequency-info
> analyzing CPU 0:
>   no or unknown cpufreq driver is active on this CPU
>   CPUs which run at the same hardware frequency: Not Available
>   CPUs which need to have their frequency coordinated by software: Not
> Available
>   maximum transition latency:  Cannot determine or is not supported.
> Not Available
>   available cpufreq governors: Not Available
>   Unable to determine current policy
>   current CPU frequency: Unable to call hardware
>   current CPU frequency:  Unable to call to kernel
>   boost state support:
>     Supported: yes
>     Active: yes
> 
> 
> And of course there is nothing under sysfs (/sys/devices/system/cpu*).
> But /proc/cpuinfo and cpupower-monitor show that we seem to be hitting
> turbo freqs:
> 
> rr42-03:~$ sudo cpupower monitor
>               |Nehalem                    || Mperf
> PKG |CORE|CPU | C3   | C6   | PC3  | PC6  || C0   | Cx   | Freq
>    0|   0|   0|  0.00|  0.00|  0.00|  0.00||  0.05| 99.95|  3391
>    0|   1|   4|  0.00|  0.00|  0.00|  0.00||  0.02| 99.98|  3389
>    0|   2|   8|  0.00|  0.00|  0.00|  0.00||  0.14| 99.86|  

Re: [ceph-users] multi site with cephfs

2018-05-16 Thread John Hearns
The answer given at the seminar yesterday was that a practical limit was
around 60km.
I don't think 100km is that much longer.  I defer to the experts here.






On 16 May 2018 at 15:24, Up Safe  wrote:

> Hi,
>
> About a 100 km.
> I have a 2-4ms latency between them.
>
> Leon
>
> On Wed, May 16, 2018, 16:13 John Hearns  wrote:
>
>> Leon,
>> I was at a Lenovo/SuSE seminar yesterday and asked a similar question
>> regarding separated sites.
>> How far apart are these two geographical locations?   It does matter.
>>
>> On 16 May 2018 at 15:07, Up Safe  wrote:
>>
>>> Hi,
>>>
>>> I'm trying to build a multi site setup.
>>> But the only guides I've found on the net were about building it with
>>> object storage or rbd.
>>> What I need is cephfs.
>>>
>>> I.e. I need to have 2 synced file storages at 2 geographical locations.
>>> Is this possible?
>>>
>>> Also, if I understand correctly - cephfs is just a component on top of
>>> the object storage.
>>> Following this logic - it should be possible, right?
>>>
>>> Or am I totally off here?
>>>
>>> Thanks,
>>> Leon
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi site with cephfs

2018-05-16 Thread Up Safe
But this is not the question here.
The question is whether I can configure multi site for CephFS.
Will I be able to do so by following the guide to set up the multi site for
object storage?

Thanks

On Wed, May 16, 2018, 16:45 John Hearns  wrote:

> The answer given at the seminar yesterday was that a practical limit was
> around 60km.
> I don't think 100km is that much longer.  I defer to the experts here.
>
>
>
>
>
>
> On 16 May 2018 at 15:24, Up Safe  wrote:
>
>> Hi,
>>
>> About a 100 km.
>> I have a 2-4ms latency between them.
>>
>> Leon
>>
>> On Wed, May 16, 2018, 16:13 John Hearns  wrote:
>>
>>> Leon,
>>> I was at a Lenovo/SuSE seminar yesterday and asked a similar question
>>> regarding separated sites.
>>> How far apart are these two geographical locations?   It does matter.
>>>
>>> On 16 May 2018 at 15:07, Up Safe  wrote:
>>>
 Hi,

 I'm trying to build a multi site setup.
 But the only guides I've found on the net were about building it with
 object storage or rbd.
 What I need is cephfs.

 I.e. I need to have 2 synced file storages at 2 geographical locations.
 Is this possible?

 Also, if I understand correctly - cephfs is just a component on top of
 the object storage.
 Following this logic - it should be possible, right?

 Or am I totally off here?

 Thanks,
 Leon

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Single ceph cluster for the object storage service of 2 OpenStack clouds

2018-05-16 Thread Massimo Sgaravatto
Thanks a lot !

On Tue, May 15, 2018 at 7:44 PM, David Turner  wrote:

> Yeah, that's how we do multiple zones.  I find following the documentation
> for multi-site (but not actually setting up a second site) to work well for
> setting up multiple realms in a single cluster.
>
> On Tue, May 15, 2018 at 9:29 AM Massimo Sgaravatto <
> massimo.sgarava...@gmail.com> wrote:
>
>> Hi
>>
>> I have been using for a while a single ceph cluster for the image and
>> block storage services of two Openstack clouds.
>>
>> Now I want to use this ceph cluster also for the object storage services
>> of the two OpenStack clouds and I want to implement that having a clear
>> separation between the two clouds. In particular I want different ceph
>> pools for the two Clouds.
>>
>> My understanding is that this can be done:
>>
>> - creating 2 realms (one for each cloud)
>> - creating one zonegroup for each realm
>> - creating one zone for each zonegroup
>> - having 1 or more rgw instances for each zone
>>
>> Did I get it right ?
>>
>> Thanks, Massimo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread Alexandre DERUMIER
Hi,

I'm able to have fixed frequency with

intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1 

Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

# cat /proc/cpuinfo |grep MHz
cpu MHz : 3400.002
cpu MHz : 3399.994
cpu MHz : 3399.995
cpu MHz : 3399.994
cpu MHz : 3399.997
cpu MHz : 3399.998
cpu MHz : 3399.992
cpu MHz : 3399.989
cpu MHz : 3399.998
cpu MHz : 3399.994
cpu MHz : 3399.988
cpu MHz : 3399.987
cpu MHz : 3399.990
cpu MHz : 3399.990
cpu MHz : 3399.994
cpu MHz : 3399.996
cpu MHz : 3399.996
cpu MHz : 3399.985
cpu MHz : 3399.991
cpu MHz : 3399.981
cpu MHz : 3399.979
cpu MHz : 3399.993
cpu MHz : 3399.985
cpu MHz : 3399.985


- Original Message -
From: "Wido den Hollander" 
To: "Blair Bethwaite" 
Cc: "ceph-users" , "Nick Fisk" 
Sent: Wednesday 16 May 2018 15:34:35
Subject: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on
NVMe/SSD Ceph OSDs

On 05/16/2018 01:22 PM, Blair Bethwaite wrote: 
> On 15 May 2018 at 08:45, Wido den Hollander  > wrote: 
> 
> > We've got some Skylake Ubuntu based hypervisors that we can look at to 
> > compare tomorrow... 
> > 
> 
> Awesome! 
> 
> 
> Ok, so results still inconclusive I'm afraid... 
> 
> The Ubuntu machines we're looking at (Dell R740s and C6420s running with 
> Performance BIOS power profile, which amongst other things disables 
> cstates and enables turbo) are currently running either a 4.13 or a 4.15 
> HWE kernel - we needed 4.13 to support PERC10 and even get them booting 
> from local storage, then 4.15 to get around a prlimit bug that was 
> breaking Nova snapshots, so here we are. Where are you getting 4.16, 
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ? 
> 

Yes, that's where I got 4.16: 4.16.7-041607-generic 

I also tried 4.16.8, but that didn't change anything either. 

Server I am testing with are these: 
https://www.supermicro.nl/products/system/1U/1029/SYS-1029U-TN10RT.cfm 

> So interestingly in our case we seem to have no cpufreq driver loaded. 
> After installing linux-generic-tools (cause cpupower is supposed to 
> supersede cpufrequtils I think?): 
> 
> rr42-03:~$ uname -a 
> Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar 
> 17 03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux 
> 
> rr42-03:~$ cat /proc/cmdline 
> BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro 
> intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1 
> 

I have those settings as well, intel_idle and processor.max_cstate. 

[ 1.776036] intel_idle: disabled 

That works, the CPUs stay in C0 or C1 according to i7z, but they are 
clocking down in Mhz, for example: 

processor : 23 
vendor_id : GenuineIntel 
cpu family : 6 
model : 85 
model name : Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz 
stepping : 4 
microcode : 0x243 
cpu MHz : 799.953 

/sys/devices/system/cpu/intel_pstate/min_perf_pct is set to 100, but 
that setting doesn't seem to do anything. 

I'm running out of ideas :-) 

Wido 

> rr42-03:~$ lscpu 
> Architecture: x86_64 
> CPU op-mode(s): 32-bit, 64-bit 
> Byte Order: Little Endian 
> CPU(s): 36 
> On-line CPU(s) list: 0-35 
> Thread(s) per core: 1 
> Core(s) per socket: 18 
> Socket(s): 2 
> NUMA node(s): 2 
> Vendor ID: GenuineIntel 
> CPU family: 6 
> Model: 85 
> Model name: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz 
> Stepping: 4 
> CPU MHz: 3400.956 
> BogoMIPS: 5401.45 
> Virtualization: VT-x 
> L1d cache: 32K 
> L1i cache: 32K 
> L2 cache: 1024K 
> L3 cache: 25344K 
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34 
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35 
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr 
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq 
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid 
> dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx 
> f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 
> invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid 
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx 
> rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
> cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku 
> ospke 
> 
> rr42-03:~$ sudo cpupower frequency-info 
> analyzing CPU 0: 
> no or unknown cpufreq driver is active on this CPU 
> CPUs which run at the same hardware frequency: Not Available 
> CPUs which need to have their frequency coordinated by software: Not 
> Available 
> maximum transition lat

Re: [ceph-users] multi site with cephfs

2018-05-16 Thread David Turner
Object storage multi-site is very specific to using object storage. It
uses the RGW APIs to sync S3 uploads between each site. For CephFS you
might be able to do a sync of the rados pools, but I don't think that's
actually a thing yet. RBD mirror is also a layer on top of things to sync
between sites. Basically I think you need to do something on top of the
filesystem, as opposed to within Ceph, to sync it between sites.
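
The crude version of "something on top of the filesystem" would be a periodic
one-way rsync between the two mounted CephFS trees, e.g. this sketch (the paths
and remote host are examples, and there is no conflict handling):

#!/usr/bin/env python
# Sketch: one-way sync of a mounted CephFS tree to a second site via rsync/SSH.
import subprocess

SRC = "/mnt/cephfs-site-a/"          # CephFS mounted at the primary site
DST = "site-b-host:/mnt/cephfs/"     # CephFS mounted at the secondary site

subprocess.check_call([
    "rsync", "-aHAX", "--delete",    # keep attrs/xattrs/hardlinks, mirror deletes
    SRC, DST,
])
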
On Wed, May 16, 2018 at 9:51 AM Up Safe  wrote:

> But this is not the question here.
> The question is whether I can configure multi site for CephFS.
> Will I be able to do so by following the guide to set up the multi site
> for object storage?
>
> Thanks
>
> On Wed, May 16, 2018, 16:45 John Hearns  wrote:
>
>> The answer given at the seminar yesterday was that a practical limit was
>> around 60km.
>> I don't think 100km is that much longer.  I defer to the experts here.
>>
>>
>>
>>
>>
>>
>> On 16 May 2018 at 15:24, Up Safe  wrote:
>>
>>> Hi,
>>>
>>> About a 100 km.
>>> I have a 2-4ms latency between them.
>>>
>>> Leon
>>>
>>> On Wed, May 16, 2018, 16:13 John Hearns  wrote:
>>>
 Leon,
 I was at a Lenovo/SuSE seminar yesterday and asked a similar question
 regarding separated sites.
 How far apart are these two geographical locations?   It does matter.

 On 16 May 2018 at 15:07, Up Safe  wrote:

> Hi,
>
> I'm trying to build a multi site setup.
> But the only guides I've found on the net were about building it with
> object storage or rbd.
> What I need is cephfs.
>
> I.e. I need to have 2 synced file storages at 2 geographical locations.
> Is this possible?
>
> Also, if I understand correctly - cephfs is just a component on top of
> the object storage.
> Following this logic - it should be possible, right?
>
> Or am I totally off here?
>
> Thanks,
> Leon
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-16 Thread Oliver Freyermuth
Hi David,

did you already manage to check your librados2 version and pin down the
issue?

Cheers,
Oliver

On 11.05.2018 at 17:15, Oliver Freyermuth wrote:
> Hi David,
> 
> On 11.05.2018 at 16:55, David C wrote:
>> Hi Oliver
>>
>> Thanks for the detailed reponse! I've downgraded my libcephfs2 to 12.2.4 and 
>> still get a similar error:
>>
>> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
>> undefined symbol: 
>> _Z14common_preinitRK18CephInitParameters18code_environment_ti
>> load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
>> library
>>
>> I'm on CentOS 7.4, using the following package versions:
>>
>> # rpm -qa | grep ganesha
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>>
>> # rpm -qa | grep ceph
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> 
> Mhhhm - that sounds like a messup in the dependencies. 
> The symbol you are missing should be provided by
> librados2-12.2.4-0.el7.x86_64
> which contains
> /usr/lib64/ceph/ceph/libcephfs-common.so.0
> Do you have a different version of librados2 installed? If so, I wonder how 
> yum / rpm allowed that ;-). 
> 
> Thinking again, it might also be (if you indeed have a different version 
> there) that this is the cause also for the previous error. 
> If the problematic symbol is indeed not exposed, but can be resolved only if 
> both libraries (libcephfs-common and libcephfs) are loaded in unison with 
> matching versions,
> it might be that also 12.2.5 works fine... 
> 
> First thing, in any case, is to checkout which version of librados2 you are 
> using ;-). 
> 
> Cheers,
>   Oliver
> 
>>
>> I don't have the ceph user space components installed, assuming they're not 
>> nesscary apart from libcephfs2? Any idea why it's giving me this error?
>>
>> Thanks,
>>
>> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth 
>> mailto:freyerm...@physik.uni-bonn.de>> wrote:
>>
>> Hi David,
>>
>> for what it's worth, we are running with nfs-ganesha 2.6.1 from Ceph 
>> repos on CentOS 7.4 with the following set of versions:
>> libcephfs2-12.2.4-0.el7.x86_64
>> nfs-ganesha-2.6.1-0.1.el7.x86_64
>> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
>> Of course, we plan to upgrade to 12.2.5 soon-ish...
>>
>> Am 11.05.2018 um 00:05 schrieb David C:
>> > Hi All
>> > 
>> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package from 
>> http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/ 
>> 
>> > 
>> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
>> > 
>> > With libcephfs-12.2.1 installed I get the following error in my 
>> ganesha log:
>> > 
>> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:
>> >     /usr/lib64/ganesha/libfsalceph.so: undefined symbol: 
>> ceph_set_deleg_timeout
>> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because
>> >     : Can not access a needed shared library
>>
>> That looks like an ABI incompatibility, probably the nfs-ganesha 
>> packages should block this libcephfs2-version (and older ones).
>>
>> > 
>> > 
>> > With libcephfs-12.2.5 installed I get:
>> > 
>> >     load_fsal :NFS STARTUP :CRIT :Could not dlopen 
>> module:/usr/lib64/ganesha/libfsalceph.so Error:
>> >     /lib64/libcephfs.so.2: undefined symbol: 
>> _ZNK5FSMap10parse_roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
>> >     load_fsal :NFS STARTUP :MAJ :Failed to load module 
>> (/usr/lib64/ganesha/libfsalceph.so) because
>> >     : Can not access a needed shared library
>>
>> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
>> Interestingly, we do not have that symbol on 12.2.4:
>> # nm -D /lib64/libcephfs.so.2 | grep FSMap
>>                  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
>>                  U _ZNK5FSMap13print_summaryEPN4ceph9FormatterEPSo
>> and NFS-Ganesha works fine.
>>
>> Looking at:
>> https://github.com/ceph/ceph/blob/v12.2.4/src/mds/FSMap.h 
>> 
>> versus
>> https://github.com/ceph/ceph/blob/v12.2.5/src/mds/FSMap.h 
>> 
>> it seems this commit:
>> 
>> https://github.com/ceph/ceph/commit/7d8b3c1082b6b870710989773f3cd98a472b9a3d 
>> 
>> changed libcephfs2 ABI.
>>
>> I've no idea how that's usually handled and whether ABI breakage should 
>> occur within point releases (I would not have expec

Re: [ceph-users] a big cluster or several small

2018-05-16 Thread Alexandre DERUMIER
Hi,


>>Our main reason for using multiple clusters is that Ceph has a bad 
>>reliability history when scaling up and even now there are many issues 
>>unresolved (https://tracker.ceph.com/issues/21761 for example) so by 
>>dividing single, large cluster into few smaller ones, we reduce the impact 
>>for customers when things go fatally wrong - when one cluster goes down or 
>>it's performance is on single ESDI drive level due to recovery, other 
>>clusters - and their users - are unaffected. For us this already proved 
>>useful in the past.


We are also doing multiple small clusters here (3 nodes, 18 OSDs, SSD or NVMe),
mainly VMs and RBD, so it's not a problem.

Mainly to avoid lag for all clients when an OSD goes down, for example, or to
make upgrades easier.


We only have a bigger cluster for radosgw and object storage.


Alexandre


- Original Message -
From: "Piotr Dałek" 
To: "ceph-users" 
Sent: Tuesday 15 May 2018 09:14:53
Subject: Re: [ceph-users] a big cluster or several small

On 18-05-14 06:49 PM, Marc Boisis wrote: 
> 
> Hi, 
> 
> Hello, 
> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients 
> only, 1 single pool (size=3). 
> 
> We want to divide this cluster into several to minimize the risk in case of 
> failure/crash. 
> For example, a cluster for the mail, another for the file servers, a test 
> cluster ... 
> Do you think it's a good idea ? 

If reliability and data availability is your main concern, and you don't 
share data between clusters - yes. 

> Do you have experience feedback on multiple clusters in production on the 
> same hardware: 
> - containers (LXD or Docker) 
> - multiple cluster on the same host without virtualization (with ceph-deploy 
> ... --cluster ...) 
> - multilple pools 
> ... 
> 
> Do you have any advice? 

We're using containers to host OSDs, but we don't host multiple clusters on 
same machine (in other words, single physical machine hosts containers for 
one and the same cluster). We're using Ceph for RBD images, so having 
multiple clusters isn't a problem for us. 

Our main reason for using multiple clusters is that Ceph has a bad 
reliability history when scaling up and even now there are many issues 
unresolved (https://tracker.ceph.com/issues/21761 for example) so by 
dividing single, large cluster into few smaller ones, we reduce the impact 
for customers when things go fatally wrong - when one cluster goes down or 
it's performance is on single ESDI drive level due to recovery, other 
clusters - and their users - are unaffected. For us this already proved 
useful in the past. 

-- 
Piotr Dałek 
piotr.da...@corp.ovh.com 
https://www.ovhcloud.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public network faster than cluster network

2018-05-16 Thread Gandalf Corvotempesta
No more advice for a new cluster?

Sorry for the multiple posts, but I had some trouble with the ML. I'm getting
"Access Denied".
On Fri, 11 May 2018 at 10:21, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:

> no more advices for a new cluster ?
> On Thu, 10 May 2018 at 10:38, Gandalf Corvotempesta <
> gandalf.corvotempe...@gmail.com> wrote:

> > On Thu, 10 May 2018 at 09:48, Christian Balzer  > wrote:
> > > Without knowing what your use case is (lots of large reads or writes,
or
> > > the more typical smallish I/Os) it's hard to give specific advice.

> > 99% VM hosting.
> > Everything else would be negligible and I don't care if not optimized.

> > > Which would give you 24 servers with up to 20Gb/s per server when both
> > > switches are working, something that's likely to be very close to 100%
> > > of the time.

> > 24 servers between hypervisors and storages, right ?
> > Thus, are you saying to split in this way:

> > switch0.port0 to port 12 as hypervisor, network1
> > switch0.port13 to 24 as storage, network1

> > switch0.port0 to port 12 as hypervisor, network2
> > switch0.port13 to 24 as storage, network2

> > In this case, with 2 switches I can have a fully redundant network,
> > but I also need a ISL to aggregate bandwidth.

> > > That's a very optimistic number, assuming journal/WAL/DB on SSDs _and_
> no
> > > concurrent write activity.
> > > Since you said hypervisors up there one assumes VMs on RBDs and a
mixed
> > > I/O pattern, saturating your disks with IOPS long before bandwidth
> becomes
> > > an issue.

> > Based on a real use-case, how much bandwidth should I expect with 12
SATA
> > spinning disks (7200rpm)
> > in mixed workload ? Obviously, a sequential read would need about
> > 12*100MB/s*8 mbit/s

> > > The biggest argument against the 1GB/s links is the latency as
> mentioned.

> > 10GBe should have 1/10 latency, right ?

> > Now, as I'm evaluating many SDS and Ceph, on the paper, is the most
> > expensive in terms of needed hardware,
> > what do you suggest for a small (scalable) storage, starting with just 3
> > storage servers (12 disks each but not fully populated),
> > 1x 16ports 10GBaseT switch, (many) 24ports Gigabit switch and about 5
> > hypervisors servers ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked

2018-05-16 Thread Paul Emmerich
By looking at the operations that are slow in your dump_*_ops output.
We've found that it's best to move all the metadata stuff for RGW onto
SSDs, i.e., all pools except the actual data pool.

But that depends on your use case and whether the slow requests you are
seeing are actually a problem for you. Maybe you
don't care too much about the metadata operations.
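
If you do want to try it on Luminous, device classes make it fairly painless:
create an SSD-only CRUSH rule and point the non-data pools at it, roughly like
this sketch (it shells out to the CLI; the pool names are the defaults, adjust
them to your zone, and changing the rule will rebalance those pools):

#!/usr/bin/env python
# Sketch: move the RGW metadata/index pools onto an SSD-only CRUSH rule.
import subprocess

RULE = "rgw-meta-ssd"
META_POOLS = [
    ".rgw.root",
    "default.rgw.control",
    "default.rgw.meta",
    "default.rgw.log",
    "default.rgw.buckets.index",
]

# Replicated rule that only selects OSDs with device class "ssd".
subprocess.check_call(["ceph", "osd", "crush", "rule", "create-replicated",
                       RULE, "default", "host", "ssd"])

for pool in META_POOLS:
    # Reassigning the rule triggers a rebalance of that pool's data.
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "crush_rule", RULE])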

Paul

2018-05-16 12:25 GMT+02:00 Grigory Murashov :

> Hello Paul!
>
> Thanks for your answer.
>
> How did you understand it's RGW Metadata stuff?
>
> No, I don't use any SSDs. Where I can find out more about Metadata pools,
> using SSD etc?..
>
> Thanks.
>
> Grigory Murashov
> Voximplant
>
> 15.05.2018 23:42, Paul Emmerich wrote:
>
> Looks like it's mostly RGW metadata stuff; are you running your non-data
> RGW pools on SSDs (you should, that can help *a lot*)?
>
>
> Paul
>
> 2018-05-15 18:49 GMT+02:00 Grigory Murashov :
>
>> Hello guys!
>>
>> I collected output of ceph daemon osd.16 dump_ops_in_flight and ceph
>> daemon osd.16 dump_historic_ops.
>>
>> Here is the output of ceph heath details in the moment of problem
>>
>> HEALTH_WARN 20 slow requests are blocked > 32 sec
>> REQUEST_SLOW 20 slow requests are blocked > 32 sec
>> 20 ops are blocked > 65.536 sec
>> osds 16,27,29 have blocked requests > 65.536 sec
>>
>> So I grab logs from osd.16.
>>
>> The file is attached.  Could you please help to translate?
>>
>> Thanks in advance.
>>
>> Grigory Murashov
>> Voximplant
>>
>> 14.05.2018 18:14, Grigory Murashov пишет:
>>
>> Hello David!
>>
>> 2. I set it up 10/10
>>
>> 3. Thanks, my problem was I did it on host where was no osd.15 daemon.
>>
>> Could you please help to read osd logs?
>>
>> Here is a part from ceph.log
>>
>> 2018-05-14 13:46:32.644323 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553895 : cluster [INF] Cluster is now healthy
>> 2018-05-14 13:46:43.741921 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553896 : cluster [WRN] Health check failed: 21 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:46:49.746994 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553897 : cluster [WRN] Health check update: 23 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:46:55.752314 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553900 : cluster [WRN] Health check update: 3 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:47:01.030686 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553901 : cluster [WRN] Health check update: 4 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:47:07.764236 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553903 : cluster [WRN] Health check update: 32 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:47:13.770833 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553904 : cluster [WRN] Health check update: 21 slow
>> requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-05-14 13:47:17.774530 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553905 : cluster [INF] Health check cleared:
>> REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
>> 2018-05-14 13:47:17.774582 mon.storage-ru1-osd1 mon.0
>> 185.164.149.2:6789/0 553906 : cluster [INF] Cluster is now healthy
>>
>> At 13-47 I had a problem with osd.21
>>
>> 1. Ceph Health (storage-ru1-osd1.voximplant.com:ceph.health): HEALTH_WARN
>> {u'REQUEST_SLOW': {u'severity': u'HEALTH_WARN', u'summary': {u'message': u'4 
>> slow requests are blocked > 32 sec'}}}
>> HEALTH_WARN 4 slow requests are blocked > 32 sec
>> REQUEST_SLOW 4 slow requests are blocked > 32 sec
>> 2 ops are blocked > 65.536 sec
>> 2 ops are blocked > 32.768 sec
>> osd.21 has blocked requests > 65.536 sec
>>
>> Here is a part from ceph-osd.21.log
>>
>> 2018-05-14 13:47:06.891399 7fb806dd6700 10 osd.21 pg_epoch: 236 pg[2.0( v
>> 236'297 (0'0,236'297] local-lis/les=223/224 n=1 ec=119/119 lis/c 223/223
>> les/c/f 224/224/0 223/223/212) [21,29,15]
>> r=0 lpr=223 crt=236'297 lcod 236'296 mlcod 236'296 active+clean]
>> dropping ondisk_read_lock
>> 2018-05-14 13:47:06.891435 7fb806dd6700 10 osd.21 236 dequeue_op
>> 0x56453b753f80 finish
>> 2018-05-14 13:47:07.111388 7fb8185f9700 10 osd.21 236 tick
>> 2018-05-14 13:47:07.111398 7fb8185f9700 10 osd.21 236 do_waiters -- start
>> 2018-05-14 13:47:07.111401 7fb8185f9700 10 osd.21 236 do_waiters -- finish
>> 2018-05-14 13:47:07.800421 7fb817df8700 10 osd.21 236
>> tick_without_osd_lock
>> 2018-05-14 13:47:07.800444 7fb817df8700 10 osd.21 236
>> promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0  bytes;
>> target 25 obj/sec or 5120 k bytes/sec
>> 2018-05-14 13:47:07.800449 7fb817df8700 10 osd.21 236
>> promote_throttle_recalibrate  actual 0, actual/prob ratio 1, adjusted
>> new_prob 1000, prob 1000 -> 1000
>> 2018-05-14 13:47:08.111470 7fb8185f9700 10 osd.21 236 tick
>> 2018-05-14 13:47:08.111483 7fb8185f9700 10 osd.21 236 do_waiters -- start
>> 2018-05-14 13:47:08.111485 7fb8185f9700 10 osd.21 236 do_waiters -- f

Re: [ceph-users] a big cluster or several small

2018-05-16 Thread Matthew Vernon
Hi,

On 14/05/18 17:49, Marc Boisis wrote:

> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients
> only, 1 single pool (size=3).

That's not a large cluster.

> We want to divide this cluster into several to minimize the risk in case
> of failure/crash.
> For example, a cluster for the mail, another for the file servers, a
> test cluster ...
> Do you think it's a good idea ?

I'd venture the opinion that your cluster isn't yet big enough to be
thinking about that; you get increased reliability with a larger cluster
(each disk failure is a smaller % of the whole, for example); our
largest cluster here is 3060 OSDs...

We've grown this from a start of 540 OSDs.

> Do you have experience feedback on multiple clusters in production on
> the same hardware:

I think if you did want to have multiple clusters, you'd want to have
each cluster on different hardware.

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD features and feature journaling performance

2018-05-16 Thread Jorge Pinilla López
I'm trying to better understand RBD features, but I have only found the
information on the RBD page. Is there any further documentation on RBD
features and how they are implemented?

I would also like to know about the journaling feature; it seems to destroy
RBD performance:
Without journaling feature:
rbd bench --io-type write nmve/test --io-size 4M
bench  type write io_size 4194304 io_threads 16 bytes 1073741824 pattern 
sequential
  SEC   OPS   OPS/SEC   BYTES/SEC
    1   165    167.70   703388666.65
elapsed: 1  ops:  256  ops/sec:   147.34  bytes/sec: 617979315.69

With journaling feature:
rbd bench --io-type write nmve/test --io-size 4M
bench  type write io_size 4194304 io_threads 16 bytes 1073741824 pattern 
sequential
  SEC   OPS   OPS/SEC   BYTES/SEC
    1    39     40.07   168083012.60
    2    75     39.57   165958196.86
    3   111     39.29   164800587.00
    4   146     40.45   169644582.55
    5   181     38.99   163537471.86
    6   215     35.93   150720311.94
    7   240     33.29   139648167.70
elapsed: 7  ops:  256  ops/sec:34.36  bytes/sec: 144134536.70

What is the reason for that behavior? What does journaling do?
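
For reference, this is roughly how journaling can be toggled per image between
runs (a sketch using the image from the benchmark above; journaling requires
the exclusive-lock feature to be enabled on the image):

rbd feature enable nmve/test journaling    # needs exclusive-lock on the image
rbd bench --io-type write nmve/test --io-size 4M

rbd feature disable nmve/test journaling   # back to the non-journaling case
rbd info nmve/test                         # shows the currently enabled features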


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-16 Thread Jack
For what it worth, yahoo published their setup some years ago:
https://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at

54 nodes per cluster for 3.2PB of raw storage, I guess this leads to 16
* 4TB hdd per node, thus 896 per cluster

(they may have used ssd as journal, anyway)
This setup is only viable because they are using HTTP, whose load is
easily and highly balanceable

On 05/16/2018 06:20 PM, Matthew Vernon wrote:
> Hi,
> 
> On 14/05/18 17:49, Marc Boisis wrote:
> 
>> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients
>> only, 1 single pool (size=3).
> 
> That's not a large cluster.
> 
>> We want to divide this cluster into several to minimize the risk in case
>> of failure/crash.
>> For example, a cluster for the mail, another for the file servers, a
>> test cluster ...
>> Do you think it's a good idea ?
> 
> I'd venture the opinion that you cluster isn't yet big enough to be
> thinking about that; you get increased reliability with a larger cluster
> (each disk failure is a smaller % of the whole, for example); our
> largest cluster here is 3060 OSDs...
> 
> We've grown this from a start of 540 OSDs.
> 
>> Do you have experience feedback on multiple clusters in production on
>> the same hardware:
> 
> I think if you did want to have multiple clusters, you'd want to have
> each cluster on different hardware.
> 
> Regards,
> 
> Matthew
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Poor CentOS 7.5 client performance

2018-05-16 Thread Donald "Mac" McCarthy
Recently upgraded a Ceph client to CentOS 7.5.  Upon doing so, read and write
performance became intolerably slow: ~2.5 MB/s.  When booted back to a CentOS
7.4 kernel, performance went back to a normal 200 MB/s read and write. I have
not seen any mention of this issue in all of the normal places, including here.
For now the solution is clear - don’t use the 7.5 update. Has anyone else
experienced this issue?

Please excuse any typos.  Autocorrect is evil!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor CentOS 7.5 client performance

2018-05-16 Thread Jason Dillaman
What is your client (librbd, krbd, CephFS, ceph-client, ...) and how
are you testing performance?

On Wed, May 16, 2018 at 11:14 AM, Donald "Mac" McCarthy
 wrote:
> Recently upgraded a CEPH client to CentOS 7.5.  Upon doing so read and write 
> performance became intolerably slow.  ~2.5 MB/s.  When booted back to a 
> CentOS 7.4 kernel, performance went back to a normal 200 MB/s read and write. 
> I have not seen any mention of this issue in all of the normal places 
> including here. For now the solution is clear - don’t use the 7.5 update. Has 
> anyone else experienced this issue?
>
> Please excuse any typos.  Autocorrect is evil!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor CentOS 7.5 client performance

2018-05-16 Thread Donald "Mac" McCarthy
CephFS.  8-core Atom C2758, 16 GB RAM, 256 GB SSD, 2.5 Gb NIC (Supermicro
MicroBlade node).

Read test:
dd if=/ceph/1GB.test of=/dev/null bs=1M

Write
dd if=/dev/zero of=/ceph/out.test bs=1M count=1024


The tests are identical on both kernels - the results... well that is a 
different story.
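
For completeness, the same tests can also be run with the page cache bypassed
(a sketch, not what produced the numbers above):

dd if=/ceph/1GB.test of=/dev/null bs=1M iflag=direct
dd if=/dev/zero of=/ceph/out.test bs=1M count=1024 oflag=direct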

CEPH servers are on release 12.2.4

Client updated to 12.2.5 - in both kernels, ceph client is 12.2.5

Please excuse any typos.  Autocorrect is evil!

> On May 16, 2018, at 14:18, Jason Dillaman  wrote:
> 
> What is your client (librbd, krbd, CephFS, ceph-client, ...) and how
> are you testing performance?
> 
> On Wed, May 16, 2018 at 11:14 AM, Donald "Mac" McCarthy
>  wrote:
>> Recently upgraded a CEPH client to CentOS 7.5.  Upon doing so read and write 
>> performance became intolerably slow.  ~2.5 MB/s.  When booted back to a 
>> CentOS 7.4 kernel, performance went back to a normal 200 MB/s read and 
>> write. I have not seen any mention of this issue in all of the normal places 
>> including here. For now the solution is clear - don’t use the 7.5 update. 
>> Has anyone else experienced this issue?
>> 
>> Please excuse any typos.  Autocorrect is evil!
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
I'm sending this message to both dovecot and ceph-users ML so please don't
mind if something seems too obvious for you.

Hi,

I have a question for both dovecot and ceph lists and below I'll explain
what's going on.

Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when
using sdbox, a new file is stored for each email message.
When using mdbox, multiple messages are appended to a single file until it
reaches/passes the rotate limit.

I would like to understand better how the mdbox format impacts IO
performance.
I think it's generally expected that fewer, larger files translate to less IO
and more throughput compared to many small files, but how does dovecot
handle that with mdbox?
If dovecot flushes data to storage every time a new email arrives and is
appended to the corresponding file, would that mean it generates the same
amount of IO as it would with one file per message?
Also, when using mdbox, many messages will be appended to a given file before
a new file is created. That should mean that a file descriptor is kept open
for some time by the dovecot process.
Using cephfs as the backend, how would this impact cluster performance
regarding MDS caps and cached inodes when files from thousands of users are
opened and appended all over?

I would like to understand this better.

Why?
We are a small Business Email Hosting provider with bare-metal, self-hosted
systems, using dovecot for servicing mailboxes and cephfs for email storage.

We are currently working on dovecot and storage redesign to be in
production ASAP. The main objective is to serve more users with better
performance, high availability and scalability.
* high availability and load balancing is extremely important to us *

On our current model, we're using mdbox format with dovecot, having
dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored
in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
All using cephfs / filestore backend.

Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
(10.2.9-4).
 - ~25K users from a few thousands of domains per cluster
 - ~25TB of email data per cluster
 - ~70GB of dovecot INDEX [meta]data per cluster
 - ~100MB of cephfs metadata per cluster

Our goal is to build a single ceph cluster for storage that could expand in
capacity, be highly available and perform well enough. I know, that's what
everyone wants.

Cephfs is an important choice because:
 - there can be multiple mountpoints, thus multiple dovecot instances on
different hosts
 - the same storage backend is used for all dovecot instances
 - no need of sharding domains
 - dovecot is easily load balanced (with director sticking users to the
same dovecot backend)

In the upcoming upgrade we intend to:
 - upgrade ceph to 12.X (Luminous)
 - drop the SSD Cache Tier (because it's deprecated; see the sketch below)
 - use the bluestore engine
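
For reference, the cache-tier removal we have in mind is roughly the following
(a sketch with placeholder pool names; the cache has to be flushed first):

ceph osd tier cache-mode mail-cache forward --yes-i-really-mean-it
rados -p mail-cache cache-flush-evict-all
ceph osd tier remove-overlay mail-data
ceph osd tier remove mail-data mail-cache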

I was told on freenode/#dovecot that there are many cases where SDBOX would
perform better with NFS sharing.
In the case of cephfs, at first I wouldn't think that would be true because
more files == more generated IO, but considering what I said in the
beginning regarding sdbox vs mdbox, that could be wrong.

Any thoughts will be highly appreciated.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-16 Thread David C
Hi Oliver

Thanks for following up. I just picked this up again today and it was
indeed librados2... the package wasn't installed! It's working now; I haven't
tested much, but I haven't noticed any problems yet. This is with
nfs-ganesha-2.6.1-0.1.el7.x86_64, libcephfs2-12.2.5-0.el7.x86_64 and
librados2-12.2.5-0.el7.x86_64. Thanks for the pointer on that.

I'd be interested to hear about your experience with ganesha on cephfs if
you're happy to share some insights. Any tuning you would recommend?

Thanks,

On Wed, May 16, 2018 at 4:14 PM, Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Hi David,
>
> did you already manage to check your librados2 version and manage to pin
> down the issue?
>
> Cheers,
> Oliver
>
> Am 11.05.2018 um 17:15 schrieb Oliver Freyermuth:
> > Hi David,
> >
> > Am 11.05.2018 um 16:55 schrieb David C:
> >> Hi Oliver
> >>
> >> Thanks for the detailed reponse! I've downgraded my libcephfs2 to
> 12.2.4 and still get a similar error:
> >>
> >> load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2:
> undefined symbol: _Z14common_preinitRK18CephInitParameters1
> 8code_environment_ti
> >> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> >> (/usr/lib64/ganesha/libfsalceph.so)
> because: Can not access a needed shared library
> >>
> >> I'm on CentOS 7.4, using the following package versions:
> >>
> >> # rpm -qa | grep ganesha
> >> nfs-ganesha-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >>
> >> # rpm -qa | grep ceph
> >> libcephfs2-12.2.4-0.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >
> > Mhhhm - that sounds like a messup in the dependencies.
> > The symbol you are missing should be provided by
> > librados2-12.2.4-0.el7.x86_64
> > which contains
> > /usr/lib64/ceph/ceph/libcephfs-common.so.0
> > Do you have a different version of librados2 installed? If so, I wonder
> how yum / rpm allowed that ;-).
> >
> > Thinking again, it might also be (if you indeed have a different version
> there) that this is the cause also for the previous error.
> > If the problematic symbol is indeed not exposed, but can be resolved
> only if both libraries (libcephfs-common and libcephfs) are loaded in
> unison with matching versions,
> > it might be that also 12.2.5 works fine...
> >
> > First thing, in any case, is to checkout which version of librados2 you
> are using ;-).
> >
> > Cheers,
> >   Oliver
> >
> >>
> >> I don't have the ceph user space components installed, assuming they're
> not nesscary apart from libcephfs2? Any idea why it's giving me this error?
> >>
> >> Thanks,
> >>
> >> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth <
> freyerm...@physik.uni-bonn.de >
> wrote:
> >>
> >> Hi David,
> >>
> >> for what it's worth, we are running with nfs-ganesha 2.6.1 from
> Ceph repos on CentOS 7.4 with the following set of versions:
> >> libcephfs2-12.2.4-0.el7.x86_64
> >> nfs-ganesha-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >> Of course, we plan to upgrade to 12.2.5 soon-ish...
> >>
> >> Am 11.05.2018 um 00:05 schrieb David C:
> >> > Hi All
> >> >
> >> > I'm testing out the nfs-ganesha-2.6.1-0.1.el7.x86_64.rpm package
> from http://download.ceph.com/nfs-ganesha/rpm-V2.6-stable/luminous/x86_64/
> 
> >> >
> >> > It's failing to load /usr/lib64/ganesha/libfsalceph.so
> >> >
> >> > With libcephfs-12.2.1 installed I get the following error in my
> ganesha log:
> >> >
> >> > load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >> > /usr/lib64/ganesha/libfsalceph.so: undefined symbol:
> ceph_set_deleg_timeout
> >> > load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> >> > : Can not access a needed shared library
> >>
> >> That looks like an ABI incompatibility, probably the nfs-ganesha
> packages should block this libcephfs2-version (and older ones).
> >>
> >> >
> >> >
> >> > With libcephfs-12.2.5 installed I get:
> >> >
> >> > load_fsal :NFS STARTUP :CRIT :Could not dlopen
> module:/usr/lib64/ganesha/libfsalceph.so Error:
> >> > /lib64/libcephfs.so.2: undefined symbol: _ZNK5FSMap10parse_
> roleEN5boost17basic_string_viewIcSt11char_traitsIcEEEP10mds_role_tRSo
> >> > load_fsal :NFS STARTUP :MAJ :Failed to load module
> (/usr/lib64/ganesha/libfsalceph.so) because
> >> > : Can not access a needed shared library
> >>
> >> That looks ugly and makes me fear for our planned 12.2.5-upgrade.
> >> Interestingly, we do not have that symbol on 12.2.4:
> >> # nm -D /lib64/libcephfs.so.2 | grep FSMap
> >>  U _ZNK5FSMap10parse_roleERKSsP10mds_role_tRSo
> >> 

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Jack
Hi,

Many (most?) filesystems do not store multiple files in the same block.

Thus, with sdbox, every single mail (you know, the kind of mail with 10
lines in it) will eat an inode and a block (4k here);
mdbox is more compact in this way.

Another difference: on deletion, sdbox removes the message file while mdbox
does not: only a single metadata update is performed, which may be batched
with others if many files are deleted at once.

That said, I do not have experience with dovecot + cephfs, nor have I made
tests of sdbox vs mdbox.

However, and this is a bit off topic, I recommend you look at the
following dovecot features (if you haven't already), as they are awesome
and will help you a lot:
- Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
- Single-Instance-Storage (aka sis, aka "attachment deduplication" :
https://www.dovecot.org/list/dovecot/2013-December/094276.html)

Regards,
On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> I'm sending this message to both dovecot and ceph-users ML so please don't
> mind if something seems too obvious for you.
> 
> Hi,
> 
> I have a question for both dovecot and ceph lists and below I'll explain
> what's going on.
> 
> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when
> using sdbox, a new file is stored for each email message.
> When using mdbox, multiple messages are appended to a single file until it
> reaches/passes the rotate limit.
> 
> I would like to understand better how the mdbox format impacts on IO
> performance.
> I think it's generally expected that fewer larger file translate to less IO
> and more troughput when compared to more small files, but how does dovecot
> handle that with mdbox?
> If dovecot does flush data to storage upon each and every new email is
> arrived and appended to the corresponding file, would that mean that it
> generate the same ammount of IO as it would do with one file per message?
> Also, if using mdbox many messages will be appended to a said file before a
> new file is created. That should mean that a file descriptor is kept open
> for sometime by dovecot process.
> Using cephfs as backend, how would this impact cluster performance
> regarding MDS caps and inodes cached when files from thousands of users are
> opened and appended all over?
> 
> I would like to understand this better.
> 
> Why?
> We are a small Business Email Hosting provider with bare metal, self hosted
> systems, using dovecot for servicing mailboxes and cephfs for email storage.
> 
> We are currently working on dovecot and storage redesign to be in
> production ASAP. The main objective is to serve more users with better
> performance, high availability and scalability.
> * high availability and load balancing is extremely important to us *
> 
> On our current model, we're using mdbox format with dovecot, having
> dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored
> in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> All using cephfs / filestore backend.
> 
> Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> (10.2.9-4).
>  - ~25K users from a few thousands of domains per cluster
>  - ~25TB of email data per cluster
>  - ~70GB of dovecot INDEX [meta]data per cluster
>  - ~100MB of cephfs metadata per cluster
> 
> Our goal is to build a single ceph cluster for storage that could expand in
> capacity, be highly available and perform well enough. I know, that's what
> everyone wants.
> 
> Cephfs is an important choise because:
>  - there can be multiple mountpoints, thus multiple dovecot instances on
> different hosts
>  - the same storage backend is used for all dovecot instances
>  - no need of sharding domains
>  - dovecot is easily load balanced (with director sticking users to the
> same dovecot backend)
> 
> On the upcoming upgrade we intent to:
>  - upgrade ceph to 12.X (Luminous)
>  - drop the SSD Cache Tier (because it's deprecated)
>  - use bluestore engine
> 
> I was said on freenode/#dovecot that there are many cases where SDBOX would
> perform better with NFS sharing.
> In case of cephfs, at first, I wouldn't think that would be true because
> more files == more generated IO, but thinking about what I said in the
> beginning regarding sdbox vs mdbox that could be wrong.
> 
> Any thoughts will be highlt appreciated.
> 
> Regards,
> 
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Jack,

yes, I imagine I'll have to do some work on tuning the block size on
cephfs. Thanks for the advice.
I knew that with mdbox messages are not removed, but I thought that was
true of sdbox too. Thanks again.

We'll soon do benchmarks of sdbox vs mdbox over cephfs with a bluestore
backend.
We'll have to do some work on how to simulate user traffic, for writes
and reads. That seems troublesome.

Thanks for the plugin recommendations. I'll take the chance and ask you:
how is the SIS status? We have used it in the past and we've had some
problems with it.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:19 PM Jack  wrote:

> Hi,
>
> Many (most ?) filesystems does not store multiple files on the same block
>
> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> lines in it) will eat an inode, and a block (4k here)
> mdbox is more compact on this way
>
> Another difference: sdbox removes the message, mdbox does not : a single
> metadata update is performed, which may be packed with others if many
> files are deleted at once
>
> That said, I do not have experience with dovecot + cephfs, nor have made
> tests for sdbox vs mdbox
>
> However, and this is a bit out of topic, I recommend you look at the
> following dovecot's features (if not already done), as they are awesome
> and will help you a lot:
> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>
> Regards,
> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can be multiple mountpoints, thus multiple dovecot instances on
> > different hosts
> >  - the same storage backend is used for all dovecot instances
> >  - no need of sharding domains
> >  - dovecot is easily load balanced (with director sticking users to the
> > same dovecot backend)
> >
> > On the upcoming upgrade we intent to:
> >  - upgrade ceph to 12.X (Luminous)
> >  - drop the SSD Cache Tier (because it's deprecated)
> >  - use bluestore engine
> >
> > I was said on freenode/#dovecot that there are many cases where SDBOX
> would
> > perform better with NFS sh

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Danny Al-Gaaf
Hi,

some time back we had similar discussions when we, as an email provider,
discussed moving away from traditional NAS/NFS storage to Ceph.

The problem with POSIX file systems and dovecot is that e.g. with mdbox
only around ~20% of the IO operations are READ/WRITE; the rest are
metadata IOs. You will not change this by using CephFS, since it will
basically behave the same way as e.g. NFS.

We decided to develop librmb to store emails as objects directly in
RADOS instead of CephFS. The project is still under development, so you
should not use it in production, but you can try it to run a POC.
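
Just to illustrate the objects-in-RADOS idea (this is not librmb's actual
object naming or layout, and the pool name is made up), storing and reading a
message directly with the rados CLI looks roughly like this:

rados -p mail put u12345/INBOX/0001 message.eml       # store a message object
rados -p mail ls | grep u12345                        # list it
rados -p mail get u12345/INBOX/0001 message-copy.eml  # fetch it back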

For more information check out my slides from Ceph Day London 2018:
https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page

The project can be found on github:
https://github.com/ceph-dovecot/

-Danny

On 16.05.2018 20:37, Webert de Souza Lima wrote:
> I'm sending this message to both dovecot and ceph-users ML so please don't
> mind if something seems too obvious for you.
> 
> Hi,
> 
> I have a question for both dovecot and ceph lists and below I'll explain
> what's going on.
> 
> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when
> using sdbox, a new file is stored for each email message.
> When using mdbox, multiple messages are appended to a single file until it
> reaches/passes the rotate limit.
> 
> I would like to understand better how the mdbox format impacts on IO
> performance.
> I think it's generally expected that fewer larger file translate to less IO
> and more troughput when compared to more small files, but how does dovecot
> handle that with mdbox?
> If dovecot does flush data to storage upon each and every new email is
> arrived and appended to the corresponding file, would that mean that it
> generate the same ammount of IO as it would do with one file per message?
> Also, if using mdbox many messages will be appended to a said file before a
> new file is created. That should mean that a file descriptor is kept open
> for sometime by dovecot process.
> Using cephfs as backend, how would this impact cluster performance
> regarding MDS caps and inodes cached when files from thousands of users are
> opened and appended all over?
> 
> I would like to understand this better.
> 
> Why?
> We are a small Business Email Hosting provider with bare metal, self hosted
> systems, using dovecot for servicing mailboxes and cephfs for email storage.
> 
> We are currently working on dovecot and storage redesign to be in
> production ASAP. The main objective is to serve more users with better
> performance, high availability and scalability.
> * high availability and load balancing is extremely important to us *
> 
> On our current model, we're using mdbox format with dovecot, having
> dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored
> in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> All using cephfs / filestore backend.
> 
> Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> (10.2.9-4).
>  - ~25K users from a few thousands of domains per cluster
>  - ~25TB of email data per cluster
>  - ~70GB of dovecot INDEX [meta]data per cluster
>  - ~100MB of cephfs metadata per cluster
> 
> Our goal is to build a single ceph cluster for storage that could expand in
> capacity, be highly available and perform well enough. I know, that's what
> everyone wants.
> 
> Cephfs is an important choise because:
>  - there can be multiple mountpoints, thus multiple dovecot instances on
> different hosts
>  - the same storage backend is used for all dovecot instances
>  - no need of sharding domains
>  - dovecot is easily load balanced (with director sticking users to the
> same dovecot backend)
> 
> On the upcoming upgrade we intent to:
>  - upgrade ceph to 12.X (Luminous)
>  - drop the SSD Cache Tier (because it's deprecated)
>  - use bluestore engine
> 
> I was said on freenode/#dovecot that there are many cases where SDBOX would
> perform better with NFS sharing.
> In case of cephfs, at first, I wouldn't think that would be true because
> more files == more generated IO, but thinking about what I said in the
> beginning regarding sdbox vs mdbox that could be wrong.
> 
> Any thoughts will be highlt appreciated.
> 
> Regards,
> 
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Jack
On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
> We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
> backend.
> We'll have to do some some work on how to simulate user traffic, for writes
> and readings. That seems troublesome.
I would appreciate seeing these results !

> Thanks for the plugins recommendations. I'll take the change and ask you
> how is the SIS status? We have used it in the past and we've had some
> problems with it.

I have been using it since Dec 2016 with mdbox, with no issue at all (I am
currently using Dovecot 2.2.27-3 from Debian Stretch).
The only config I set is mail_attachment_dir; the rest is left at the defaults
(mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
mail_attachment_hash = %{sha1}).
The backend storage is a local filesystem, and there is only one Dovecot
instance.

> 
> Regards,
> 
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
> 
> 
> On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
> 
>> Hi,
>>
>> Many (most ?) filesystems does not store multiple files on the same block
>>
>> Thus, with sdbox, every single mail (you know, that kind of mail with 10
>> lines in it) will eat an inode, and a block (4k here)
>> mdbox is more compact on this way
>>
>> Another difference: sdbox removes the message, mdbox does not : a single
>> metadata update is performed, which may be packed with others if many
>> files are deleted at once
>>
>> That said, I do not have experience with dovecot + cephfs, nor have made
>> tests for sdbox vs mdbox
>>
>> However, and this is a bit out of topic, I recommend you look at the
>> following dovecot's features (if not already done), as they are awesome
>> and will help you a lot:
>> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
>> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
>> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>>
>> Regards,
>> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
>>> I'm sending this message to both dovecot and ceph-users ML so please
>> don't
>>> mind if something seems too obvious for you.
>>>
>>> Hi,
>>>
>>> I have a question for both dovecot and ceph lists and below I'll explain
>>> what's going on.
>>>
>>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
>> when
>>> using sdbox, a new file is stored for each email message.
>>> When using mdbox, multiple messages are appended to a single file until
>> it
>>> reaches/passes the rotate limit.
>>>
>>> I would like to understand better how the mdbox format impacts on IO
>>> performance.
>>> I think it's generally expected that fewer larger file translate to less
>> IO
>>> and more troughput when compared to more small files, but how does
>> dovecot
>>> handle that with mdbox?
>>> If dovecot does flush data to storage upon each and every new email is
>>> arrived and appended to the corresponding file, would that mean that it
>>> generate the same ammount of IO as it would do with one file per message?
>>> Also, if using mdbox many messages will be appended to a said file
>> before a
>>> new file is created. That should mean that a file descriptor is kept open
>>> for sometime by dovecot process.
>>> Using cephfs as backend, how would this impact cluster performance
>>> regarding MDS caps and inodes cached when files from thousands of users
>> are
>>> opened and appended all over?
>>>
>>> I would like to understand this better.
>>>
>>> Why?
>>> We are a small Business Email Hosting provider with bare metal, self
>> hosted
>>> systems, using dovecot for servicing mailboxes and cephfs for email
>> storage.
>>>
>>> We are currently working on dovecot and storage redesign to be in
>>> production ASAP. The main objective is to serve more users with better
>>> performance, high availability and scalability.
>>> * high availability and load balancing is extremely important to us *
>>>
>>> On our current model, we're using mdbox format with dovecot, having
>>> dovecot's INDEXes stored in a replicated pool of SSDs, and messages
>> stored
>>> in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
>>> All using cephfs / filestore backend.
>>>
>>> Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
>>> (10.2.9-4).
>>>  - ~25K users from a few thousands of domains per cluster
>>>  - ~25TB of email data per cluster
>>>  - ~70GB of dovecot INDEX [meta]data per cluster
>>>  - ~100MB of cephfs metadata per cluster
>>>
>>> Our goal is to build a single ceph cluster for storage that could expand
>> in
>>> capacity, be highly available and perform well enough. I know, that's
>> what
>>> everyone wants.
>>>
>>> Cephfs is an important choise because:
>>>  - there can be multiple mountpoints, thus multiple dovecot instances on
>>> different hosts
>>>  - the same storage backend is used for all dovecot instances
>>>  - no need of sharding domains
>>>  - dovecot is easily load balanced (with director sticking user

Re: [ceph-users] Nfs-ganesha 2.6 packages in ceph repo

2018-05-16 Thread Oliver Freyermuth
Hi David,

thanks for the reply! 

Interesting that the package was not installed - it was for us, but the 
machines we run the nfs-ganesha servers on are also OSDs, so it might have been 
pulled in via ceph-packages for us. 
In any case, I'd say this means librados2 is missing as a dependency in either
the libcephfs2 or the nfs-ganesha packages.

Also, good news that things work fine with 12.2.5 - so I hope our upgrade will 
also go without bumps ;-). 

My experience sadly only covers a few months. We started with nfs-ganesha
2.5 from the Ceph repos, but hit a bad locking issue, which I also reported to
this list.
After upgrading to 2.6, we have not observed any further hard issues. There are
sometimes issues with slow locks if processes are running with a working
directory in ceph and other ceph-fuse clients want to access files in the same
directory, but there are no "deadlock" situations anymore.

In terms of tuning, I did not do anything special yet. I'm running with some 
basic NFS / Fileserver kernel tunables (sysctl):
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
net.ipv4.tcp_rmem = 10240 87380 12582912
net.ipv4.tcp_wmem = 10240 87380 12582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 25
net.core.default_qdisc = fq_codel

However, I did not do explicit testing of different values, but just followed 
general recommendations here. 

It seems ACLs and quotas are honoured by the NFS server (as expected, since it 
uses libcephfs behind the scenes). 
Right now, throughput for bulk data is close to perfect (we manage to saturate 
our 1 GBit/s link) and for metadata access it seems close to what ceph-fuse 
achieves,
which is sufficient for us. 

Cheers and thanks for the feedback,
Oliver

On 16.05.2018 21:06, David C wrote:
> Hi Oliver
> 
> Thanks for following up. I just picked this up again today and it was indeed 
> librados2...the package wasn't installed! It's working now, haven't tested 
> much but I haven't noticed any problems yet. This is with 
> nfs-ganesha-2.6.1-0.1.el7.x86_64, libcephfs2-12.2.5-0.el7.x86_64 and 
> librados2-12.2.5-0.el7.x86_64. Thanks for the pointer on that.
> 
> I'd be interested to hear your experience with ganesha with cephfs if you're 
> happy to share some insights. Any tuning you would recommend?
> 
> Thanks,
> 
> On Wed, May 16, 2018 at 4:14 PM, Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de>> wrote:
> 
> Hi David,
> 
> did you already manage to check your librados2 version and manage to pin 
> down the issue?
> 
> Cheers,
>         Oliver
> 
> Am 11.05.2018 um 17:15 schrieb Oliver Freyermuth:
> > Hi David,
> >
> > Am 11.05.2018 um 16:55 schrieb David C:
> >> Hi Oliver
> >>
> >> Thanks for the detailed reponse! I've downgraded my libcephfs2 to 
> 12.2.4 and still get a similar error:
> >>
> >> load_fsal :NFS STARTUP :CRIT :Could not dlopen 
> module:/usr/lib64/ganesha/libfsalceph.so Error:/lib64/libcephfs.so.2: 
> undefined symbol: 
> _Z14common_preinitRK18CephInitParameters18code_environment_ti
> >> load_fsal :NFS STARTUP :MAJ :Failed to load module 
> (/usr/lib64/ganesha/libfsalceph.so) because: Can not access a needed shared 
> library
> >>
> >> I'm on CentOS 7.4, using the following package versions:
> >>
> >> # rpm -qa | grep ganesha
> >> nfs-ganesha-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-vfs-2.6.1-0.1.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >>
> >> # rpm -qa | grep ceph
> >> libcephfs2-12.2.4-0.el7.x86_64
> >> nfs-ganesha-ceph-2.6.1-0.1.el7.x86_64
> >
> > Mhhhm - that sounds like a messup in the dependencies.
> > The symbol you are missing should be provided by
> > librados2-12.2.4-0.el7.x86_64
> > which contains
> > /usr/lib64/ceph/ceph/libcephfs-common.so.0
> > Do you have a different version of librados2 installed? If so, I wonder 
> how yum / rpm allowed that ;-).
> >
> > Thinking again, it might also be (if you indeed have a different 
> version there) that this is the cause also for the previous error.
> > If the problematic symbol is indeed not exposed, but can be resolved 
> only if both libraries (libcephfs-common and libcephfs) are loaded in unison 
> with matching versions,
> > it might be that also 12.2.5 works fine...
> >
> > First thing, in any case, is to checkout which version of librados2 you 
> are using ;-).
> >
> > Cheers,
> >       Oliver
> >
> >>
> >> I don't have the ceph user space components installed, assuming 
> they're not nesscary apart from libcephfs2? Any idea why it's giving me this 
> error?
> >>
> >> Thanks,
> >>
> >> On Fri, May 11, 2018 at 2:17 AM, Oliver Freyermuth 
> mailto:freyerm...@physik.uni-bonn.de> 
>  

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Danny,

I actually saw that thread and I was very excited about it. I thank you all
for that idea and all the effort being put into it.
I haven't yet tried to play around with your plugin, but I intend to, and to
contribute back. I think when it's ready for production it will be
unbeatable.

I have watched your talk at Cephalocon (on YouTube). I'll look at your slides;
maybe they'll give me more insights into our infrastructure architecture.

As you can see, our business is still taking baby steps compared to Deutsche
Telekom's, but we have faced infrastructure challenges every day since the
beginning.
For now, I think we could still make cephfs fit, but we definitely need
some improvements.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:42 PM Danny Al-Gaaf 
wrote:

> Hi,
>
> some time back we had similar discussions when we, as an email provider,
> discussed to move away from traditional NAS/NFS storage to Ceph.
>
> The problem with POSIX file systems and dovecot is that e.g. with mdbox
> only around ~20% of the IO operations are READ/WRITE, the rest are
> metadata IOs. You will not change this with using CephFS since it will
> basically behave the same way as e.g. NFS.
>
> We decided to develop librmb to store emails as objects directly in
> RADOS instead of CephFS. The project is still under development, so you
> should not use it in production, but you can try it to run a POC.
>
> For more information check out my slides from Ceph Day London 2018:
> https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page
>
> The project can be found on github:
> https://github.com/ceph-dovecot/
>
> -Danny
>
> Am 16.05.2018 um 20:37 schrieb Webert de Souza Lima:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can be multiple mountpoints, thus multiple dovecot instances on
> > different hosts
> >  - the same storage backend is used for all dovecot instances
> >  - no need of sharding domains
> >  - dovecot is easily load balanced (with director sticking users to the
> > same dovecot backend)
> >
> > On the upcoming upgrade we intent to:
> >  - upgrade ceph to 12.X (Luminous)
> >  - drop the SSD Cache Tier (because it's deprecated)
> >  - use bluestore engine
> >
> > I was said on freenode/#dovecot that there are many c

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Thanks Jack.

That's good to know. It is definitely something to consider.
In a distributed storage scenario we might build a dedicated pool for that
and tune the pool as more capacity or performance is needed.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:45 PM Jack  wrote:

> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
> > backend.
> > We'll have to do some some work on how to simulate user traffic, for
> writes
> > and readings. That seems troublesome.
> I would appreciate seeing these results !
>
> > Thanks for the plugins recommendations. I'll take the change and ask you
> > how is the SIS status? We have used it in the past and we've had some
> > problems with it.
>
> I am using it since Dec 2016 with mdbox, with no issue at all (I am
> currently using Dovecot 2.2.27-3 from Debian Stretch)
> The only config I use is mail_attachment_dir, the rest lies as default
> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
> ail_attachment_hash = %{sha1})
> The backend storage is a local filesystem, and there is only one Dovecot
> instance
>
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > *Belo Horizonte - Brasil*
> > *IRC NICK - WebertRLZ*
> >
> >
> > On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
> >
> >> Hi,
> >>
> >> Many (most ?) filesystems does not store multiple files on the same
> block
> >>
> >> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> >> lines in it) will eat an inode, and a block (4k here)
> >> mdbox is more compact on this way
> >>
> >> Another difference: sdbox removes the message, mdbox does not : a single
> >> metadata update is performed, which may be packed with others if many
> >> files are deleted at once
> >>
> >> That said, I do not have experience with dovecot + cephfs, nor have made
> >> tests for sdbox vs mdbox
> >>
> >> However, and this is a bit out of topic, I recommend you look at the
> >> following dovecot's features (if not already done), as they are awesome
> >> and will help you a lot:
> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
> >>
> >> Regards,
> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> >>> I'm sending this message to both dovecot and ceph-users ML so please
> >> don't
> >>> mind if something seems too obvious for you.
> >>>
> >>> Hi,
> >>>
> >>> I have a question for both dovecot and ceph lists and below I'll
> explain
> >>> what's going on.
> >>>
> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> >> when
> >>> using sdbox, a new file is stored for each email message.
> >>> When using mdbox, multiple messages are appended to a single file until
> >> it
> >>> reaches/passes the rotate limit.
> >>>
> >>> I would like to understand better how the mdbox format impacts on IO
> >>> performance.
> >>> I think it's generally expected that fewer larger file translate to
> less
> >> IO
> >>> and more troughput when compared to more small files, but how does
> >> dovecot
> >>> handle that with mdbox?
> >>> If dovecot does flush data to storage upon each and every new email is
> >>> arrived and appended to the corresponding file, would that mean that it
> >>> generate the same ammount of IO as it would do with one file per
> message?
> >>> Also, if using mdbox many messages will be appended to a said file
> >> before a
> >>> new file is created. That should mean that a file descriptor is kept
> open
> >>> for sometime by dovecot process.
> >>> Using cephfs as backend, how would this impact cluster performance
> >>> regarding MDS caps and inodes cached when files from thousands of users
> >> are
> >>> opened and appended all over?
> >>>
> >>> I would like to understand this better.
> >>>
> >>> Why?
> >>> We are a small Business Email Hosting provider with bare metal, self
> >> hosted
> >>> systems, using dovecot for servicing mailboxes and cephfs for email
> >> storage.
> >>>
> >>> We are currently working on dovecot and storage redesign to be in
> >>> production ASAP. The main objective is to serve more users with better
> >>> performance, high availability and scalability.
> >>> * high availability and load balancing is extremely important to us *
> >>>
> >>> On our current model, we're using mdbox format with dovecot, having
> >>> dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> >> stored
> >>> in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> >>> All using cephfs / filestore backend.
> >>>
> >>> Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> >>> (10.2.9-4).
> >>>  - ~25K users from a few thousands of domains per cluster
> >>>  - ~25TB of email data per cluster
> >>> 

[ceph-users] ceph-volume and systemd troubles

2018-05-16 Thread Andras Pataki

Dear ceph users,

I've been experimenting with setting up a new node with ceph-volume and
bluestore.  Most of the setup works right, but I'm running into a
strange interaction between ceph-volume and systemd when starting OSDs.


After preparing/activating the OSD, a systemd unit instance is created 
with a symlink in /etc/systemd/system/multi-user.target.wants
    ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service -> 
/usr/lib/systemd/system/ceph-volume@.service


I've moved this dependency to ceph-osd.target.wants, since I'd like to
be able to start/stop all OSDs on the same node with one command (let me
know if there is a better way).  Stopping works without this, since
ceph-osd@.service is marked as part of ceph-osd.target, but starting
does not, since these new ceph-volume units aren't grouped under a
target of their own.
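
For reference, the relocation was along these lines (a sketch, using the unit
name from the example above):

UNIT=ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service
mkdir -p /etc/systemd/system/ceph-osd.target.wants
mv /etc/systemd/system/multi-user.target.wants/$UNIT \
   /etc/systemd/system/ceph-osd.target.wants/$UNIT
systemctl daemon-reload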


However, when I run 'systemctl start ceph-osd.target' multiple times, 
the systemctl command hangs, even though the OSD starts up fine.  
Interestingly, 'systemctl start 
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service' does 
not hang, however.
Troubleshooting further, I see that the ceph-volume@.service unit calls
'ceph-volume lvm trigger 121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb', 
which in turn calls 'Activate', running a few systemd commands:


Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev 
/dev/H900D00/H900D00 --path /var/lib/ceph/osd/ceph-121
Running command: ln -snf /dev/H900D00/H900D00 
/var/lib/ceph/osd/ceph-121/block

Running command: chown -R ceph:ceph /dev/dm-0
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-121
Running command: systemctl enable 
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb

Running command: systemctl start ceph-osd@121
--> ceph-volume lvm activate successful for osd ID: 121

The problem seems to be the 'systemctl enable' command, which 
essentially tries to enable the unit that is currently being executed 
(for the case when running systemctl start ceph-osd.target).  Somehow 
systemd (in CentOS) isn't very happy with that.  If I edit the python 
scripts to check that the unit is not enabled before enabling it - the 
hangs stop.
For example, replacing in 
/usr/lib/python2.7/site-packages/ceph_volume/systemd/systemd.py


    def enable(unit):
        process.run(['systemctl', 'enable', unit])


with

    def enable(unit):
        stdout, stderr, retcode = process.call(
            ['systemctl', 'is-enabled', unit], show_command=True)
        if retcode != 0:
            process.run(['systemctl', 'enable', unit])


fixes the issue.

Has anyone run into this, or has any ideas on how to proceed?

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume and systemd troubles

2018-05-16 Thread Alfredo Deza
On Wed, May 16, 2018 at 4:50 PM, Andras Pataki
 wrote:
> Dear ceph users,
>
> I've been experimenting setting up a new node with ceph-volume and
> bluestore.  Most of the setup works right, but I'm running into a strange
> interaction between ceph-volume and systemd when starting OSDs.
>
> After preparing/activating the OSD, a systemd unit instance is created with
> a symlink in /etc/systemd/system/multi-user.target.wants
> ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service ->
> /usr/lib/systemd/system/ceph-volume@.service
>
> I've moved this dependency to ceph-osd.target.wants, since I'd like to be
> able to start/stop all OSDs on the same node with one command (let me know
> if there is a better way).  The stopping works without this, since
> ceph-osd@.service is marked as part of ceph-osd.target, but starting does
> not since these new ceph-volume units aren't together in a separate target.
>
> However, when I run 'systemctl start ceph-osd.target' multiple times, the
> systemctl command hangs, even though the OSD starts up fine.  Interestingly,
> 'systemctl start
> ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service' does not
> hang, however.
> Troubleshooting further, I see that the ceph-volume@.target unit calls
> 'ceph-volume lvm trigger 121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb', which in
> turn calls 'Activate', running a few systemd commands:
>
> Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
> /dev/H900D00/H900D00 --path /var/lib/ceph/osd/ceph-121
> Running command: ln -snf /dev/H900D00/H900D00
> /var/lib/ceph/osd/ceph-121/block
> Running command: chown -R ceph:ceph /dev/dm-0
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-121
> Running command: systemctl enable
> ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb
> Running command: systemctl start ceph-osd@121
> --> ceph-volume lvm activate successful for osd ID: 121
>
> The problem seems to be the 'systemctl enable' command, which essentially
> tries to enable the unit that is currently being executed (for the case when
> running systemctl start ceph-osd.target).  Somehow systemd (in CentOS) isn't
> very happy with that.  If I edit the python scripts to check that the unit
> is not enabled before enabling it - the hangs stop.
> For example, replacing in
> /usr/lib/python2.7/site-packages/ceph_volume/systemd/systemd.py
>
>     def enable(unit):
>         process.run(['systemctl', 'enable', unit])
>
>
> with
>
>     def enable(unit):
>         stdout, stderr, retcode = process.call(
>             ['systemctl', 'is-enabled', unit], show_command=True)
>         if retcode != 0:
>             process.run(['systemctl', 'enable', unit])
>
>
> fixes the issue.
>
> Has anyone run into this, or has any ideas on how to proceed?

This looks like an oversight on our end. We don't run into this
because we haven't tried to start/stop all OSDs at once in our tests.

Can you create a ticket so that we can fix this? Your changes look
correct to me.

http://tracker.ceph.com/projects/ceph-volume/issues/new

>
> Andras
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Interpreting reason for blocked request

2018-05-16 Thread Gregory Farnum
On Sat, May 12, 2018 at 3:22 PM Bryan Henderson 
wrote:

> I recently had some requests blocked indefinitely; I eventually cleared it
> up by recycling the OSDs, but I'd like some help interpreting the log
> messages
> that supposedly give clue as to what caused the blockage:
>
> (I reformatted for easy email reading)
>
> 2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
>   cluster [WRN] 7 slow requests, 2 included below;
>   oldest blocked for > 961.596517 secs
>
> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>   cluster [WRN] slow request 961.557151 seconds old,
>   received at 2018-05-03 01:40:33.689191:
> pg_query(4.f epoch 490) currently wait for new map
>
> 2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
>   cluster [WRN] slow request 961.556655 seconds old,
>   received at 2018-05-03 01:40:33.689686:
> pg_query(1.d epoch 490) currently wait for new map
>
> 2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
>   cluster [WRN] 2 slow requests, 2 included below;
>   oldest blocked for > 960.677480 secs
>
> 2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
>   cluster [WRN] slow request 960.677480 seconds old,
>   received at 2018-05-03 01:40:31.238642:
> osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
>   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> 2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
>   cluster [WRN] slow request 960.663817 seconds old,
>   received at 2018-05-03 01:40:31.252305:
> osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
>   RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg
>
> "wait for new map": what map would that be, and where is the OSD expecting
> it
> to come from?
>

The OSD is waiting for a new OSD map, which it will get from one of its
peers or the monitor (by request). This tends to happen if the client sees
a newer version than the OSD does.
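
If you want to check whether an OSD really is behind on maps, you can compare
the newest osdmap epoch it knows about with the cluster's current epoch. A
minimal sketch, assuming the Luminous field names ("newest_map" from
'ceph daemon osd.N status' and "epoch" from 'ceph osd dump --format json');
other releases may differ:

# Sketch: compare one OSD's newest known osdmap epoch with the cluster's
# current epoch. Run it on the host carrying the OSD (it needs the admin
# socket); "newest_map" and "epoch" are the Luminous field names.
import json
import subprocess
import sys

def osd_newest_map(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.{}'.format(osd_id), 'status'])
    return json.loads(out)['newest_map']

def cluster_epoch():
    out = subprocess.check_output(
        ['ceph', 'osd', 'dump', '--format', 'json'])
    return json.loads(out)['epoch']

if __name__ == '__main__':
    osd_id = int(sys.argv[1])
    print('osd.{}: newest_map={}, cluster epoch={}'.format(
        osd_id, osd_newest_map(osd_id), cluster_epoch()))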


>
> "reached_pg"?
>

The request has been delivered into a queue for the PG to process, but it
hasn't been picked up and worked on yet. There's nothing about the request
that is blocking it here, but some other kind of back pressure is going on
— either the PG is working or waiting on another request that prevents it
from picking up new ones, or there's some kind of throttler preventing it
from picking up new work, or there's no CPU time available for some reason.
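
While requests are still blocked you can also see which stage each one is
stuck at by dumping the in-flight ops on the affected OSD and looking at the
flag point. A minimal sketch, assuming the Luminous layout of
dump_ops_in_flight where "type_data" is a dict carrying "flag_point" (older
releases put a list there):

# Sketch: list the ops currently in flight on one OSD and the stage
# ("flag point") each one is stuck at. Field names follow the Luminous
# JSON layout; the list fallback is for older releases.
import json
import subprocess
import sys

def dump_ops_in_flight(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.{}'.format(osd_id), 'dump_ops_in_flight'])
    return json.loads(out)

if __name__ == '__main__':
    osd_id = int(sys.argv[1])
    for op in dump_ops_in_flight(osd_id).get('ops', []):
        type_data = op.get('type_data', {})
        if isinstance(type_data, dict):
            flag_point = type_data.get('flag_point', 'unknown')
        else:
            flag_point = type_data[0]   # pre-Luminous list layout
        print('{:8.1f}s  {:<25} {}'.format(
            op.get('age', 0.0), flag_point, op.get('description', '')))

The same parsing works against dump_historic_ops, which keeps a short window
of recently completed slow ops.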

With what you've shown here, it looks like either your cluster is
dramatically overloaded, or else something is going on with the mds tables
that is killing the OSD whenever it tries to access them. I think there
were some past issues with them if they grew too large in older releases?
-Greg


>
> You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
> pair.
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume and systemd troubles

2018-05-16 Thread Andras Pataki

Done: tracker #24152

Thanks,

Andras


On 05/16/2018 04:58 PM, Alfredo Deza wrote:

On Wed, May 16, 2018 at 4:50 PM, Andras Pataki
 wrote:

Dear ceph users,

I've been experimenting setting up a new node with ceph-volume and
bluestore.  Most of the setup works right, but I'm running into a strange
interaction between ceph-volume and systemd when starting OSDs.

After preparing/activating the OSD, a systemd unit instance is created with
a symlink in /etc/systemd/system/multi-user.target.wants
 ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service ->
/usr/lib/systemd/system/ceph-volume@.service

I've moved this dependency to ceph-osd.target.wants, since I'd like to be
able to start/stop all OSDs on the same node with one command (let me know
if there is a better way).  The stopping works without this, since
ceph-osd@.service is marked as part of ceph-osd.target, but starting does
not since these new ceph-volume units aren't together in a separate target.

However, when I run 'systemctl start ceph-osd.target' multiple times, the
systemctl command hangs, even though the OSD starts up fine.  Interestingly,
'systemctl start
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb.service' does not
hang, however.
Troubleshooting further, I see that the ceph-volume@.target unit calls
'ceph-volume lvm trigger 121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb', which in
turn calls 'Activate', running a few systemd commands:

Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev
/dev/H900D00/H900D00 --path /var/lib/ceph/osd/ceph-121
Running command: ln -snf /dev/H900D00/H900D00
/var/lib/ceph/osd/ceph-121/block
Running command: chown -R ceph:ceph /dev/dm-0
Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-121
Running command: systemctl enable
ceph-volume@lvm-121-7a9aceb3-ac01-4c2e-97f7-94954004e2fb
Running command: systemctl start ceph-osd@121
--> ceph-volume lvm activate successful for osd ID: 121

The problem seems to be the 'systemctl enable' command, which essentially
tries to enable the unit that is currently being executed (for the case when
running systemctl start ceph-osd.target).  Somehow systemd (in CentOS) isn't
very happy with that.  If I edit the python scripts to check that the unit
is not enabled before enabling it - the hangs stop.
For example, replacing in
/usr/lib/python2.7/site-packages/ceph_volume/systemd/systemd.py

    def enable(unit):
        process.run(['systemctl', 'enable', unit])


with

    def enable(unit):
        stdout, stderr, retcode = process.call(
            ['systemctl', 'is-enabled', unit], show_command=True)
        if retcode != 0:
            process.run(['systemctl', 'enable', unit])


fixes the issue.

Has anyone run into this, or has any ideas on how to proceed?

This looks like an oversight on our end. We don't run into this
because we haven't tried to start/stop all OSDs at once in our tests.

Can you create a ticket so that we can fix this? Your changes look
correct to me.

http://tracker.ceph.com/projects/ceph-volume/issues/new


Andras


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous - OSD constantly crashing caused by corrupted placement group

2018-05-16 Thread Gregory Farnum
On Wed, May 16, 2018 at 6:49 AM Siegfried Höllrigl <
siegfried.hoellr...@xidras.com> wrote:

> Hi Greg !
>
> Thank you for your fast reply.
>
> We have now deleted the PG on OSD.130 like you suggested and started it :
>
> ceph-s-06 # ceph-objectstore-tool --data-path
> /var/lib/ceph/osd/ceph-130/ --pgid 5.9b --op remove --force
>   marking collection for removal
> setting '_remove' omap key
> finish_remove_pgs 5.9b_head removing 5.9b
> Remove successful
> ceph-s-06 # systemctl start ceph-osd@130.service
>
> The cluster recovered again until it came to the PG 5.9b. Then OSD.130
> crashed again. -> No Change
>
> So we wanted to start the other way and export the PG from the primary
> (healthy) OSD. (OSD.19) but that fails:
>
> root@ceph-s-03:/tmp5.9b# ceph-objectstore-tool --op export --pgid 5.9b
> --data-path /var/lib/ceph/osd/ceph-19 --file /tmp5.9b/5.9b.export
> OSD has the store locked
>
> But we don't want to stop OSD.19 on this server because this Pool has
> size=3 and size_min=2.
> (this would make pg5.9b inaccessable)
>

I'm a bit confused. Are you saying that
1) the ceph-objectstore-tool you pasted there successfully removed pg 5.9b
from osd.130 (as it appears), AND
2) pg 5.9b was active with one of the other nodes as primary, so all data
remained available, AND
3) when pg 5.9b got backfilled into osd.130, osd.130 crashed again? (But
the other OSDs kept the PG fully available, without crashing?)

That sequence of events is *deeply* confusing and I really don't understand
how it might happen.

Sadly I don't think you can grab a PG for export without stopping the OSD
in question.


>
> When we query the pg, we can see a lot of "snap_trimq".
> Can this be cleaned somehow, even if the pg is undersized and degraded ?


I *think* the PG will keep trimming snapshots even if undersized+degraded
(though I don't remember for sure), but snapshot trimming is often heavily
throttled and I'm not aware of any way to specifically push one PG to the
front. If you're interested in speeding snaptrimming up you can search the
archives or check the docs for the appropriate config options.
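
If you want to see what throttling is currently in effect on an OSD, its
admin socket shows the relevant settings. A minimal sketch that just filters
the daemon's config dump for the snap-trim options; the option names it will
turn up (e.g. osd_snap_trim_sleep, osd_snap_trim_priority,
osd_pg_max_concurrent_snap_trims) are the Luminous ones:

# Sketch: show the snap-trim related settings of one OSD via its admin
# socket. 'ceph daemon osd.N config show' dumps the full config as JSON;
# the filter below just picks the options with "snap_trim" in the name.
import json
import subprocess
import sys

def snap_trim_settings(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.{}'.format(osd_id), 'config', 'show'])
    cfg = json.loads(out)
    return {k: v for k, v in sorted(cfg.items()) if 'snap_trim' in k}

if __name__ == '__main__':
    for name, value in snap_trim_settings(int(sys.argv[1])).items():
        print('{} = {}'.format(name, value))
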
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increasing number of PGs by not a factor of two?

2018-05-16 Thread Oliver Schulz

Dear all,

we have a Ceph cluster that has slowly evolved over several
years and Ceph versions (started with 18 OSDs and 54 TB
in 2013, now about 200 OSDs and 1.5 PB, still the same
cluster, with data continuity). So there are some
"early sins" in the cluster configuration, left over from
the early days.

One of these sins is the number of PGs in our CephFS "data"
pool, which is 7200 and therefore not (as recommended)
a power of two. Pretty much all of our data is in the
"data" pool, the only other pools are "rbd" and "metadata",
both contain little data (and they have way too many PGs
already, another early sin).

Is it possible - and safe - to change the number of "data"
pool PGs from 7200 to 8192 or 16384? As we recently added
more OSDs, I guess it would be time to increase the number
of PGs anyhow. Or would we have to go to 14400 instead of
16384?


Thanks for any advice,

Oliver
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Brad Hubbard
On Wed, May 16, 2018 at 6:16 PM, Uwe Sauter  wrote:
> Hi folks,
>
> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
> like to identify the OSD that is causing those events
> once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
> get this info in realtime).
>
> Collecting this information could help identify aging disks if you were able 
> to accumulate and analyze which OSD had blocking
> requests in the past and how often those events occur.
>
> My research so far let's me think that this information is only available as 
> long as the requests are actually blocked. Is this
> correct?

You don't give any indication of what version you are running, but see
https://tracker.ceph.com/issues/23205

>
> MON logs only show that those events occure and how many requests are in 
> blocking state but no indication of which OSD is
> affected. Is there a way to identify blocking requests from the OSD log files?
>
>
> On a side note: I was trying to write a small Python script that would 
> extract this kind of information in realtime but while I
> was able to register a MonitorLog callback that would receive the same 
> messages as you would get with "ceph -w" I haven's seen in
> the librados Python bindings documentation the possibility to do the 
> equivalent of "ceph health detail". Any suggestions on how to
> get the blocking OSDs via librados?
>
>
> Thanks,
>
> Uwe
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
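
On the side note about librados: the Python bindings let you send the same
JSON commands the CLI uses via Rados.mon_command(), so the equivalent of
"ceph health detail" is reachable from a script. A minimal sketch, assuming
python-rados as shipped with Luminous and the current JSON layout of the
health checks (the 'checks'/'REQUEST_SLOW'/'detail' keys may differ between
releases):

# Sketch: fetch the equivalent of "ceph health detail" through librados
# and print the messages of the REQUEST_SLOW check, which name the
# blocking OSDs. The JSON keys used below reflect the Luminous layout.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = json.dumps({'prefix': 'health', 'detail': 'detail',
                      'format': 'json'})
    ret, outbuf, outs = cluster.mon_command(cmd, b'')
    if ret != 0:
        raise RuntimeError(outs)
    health = json.loads(outbuf)
    slow = health.get('checks', {}).get('REQUEST_SLOW')
    if slow:
        for entry in slow.get('detail', []):
            # e.g. "osds 16,27,29 have blocked requests > 65.536 sec"
            print(entry.get('message'))
finally:
    cluster.shutdown()

Run that periodically and log the output, and you get roughly the per-OSD
history you were after.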



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD features and feature journaling performance

2018-05-16 Thread Konstantin Shalygin

I'm trying to better understand rbd features, but I have only found the
information on the RBD page. Is there any further information on RBD
features and their implementation?



http://tracker.ceph.com/issues/15000




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OpenStack Summit Vancouver 2018

2018-05-16 Thread Leonardo Vaz
Hey Cephers,

As many of you know the OpenStack Summit Vancouver starts on next
Monday, May 21st and the vibrant Ceph Community will be present!

We created the following pad to organize the Ceph activities during
the conference:

 http://pad.ceph.com/p/openstack-summit-vancouver-2018

If you're planning to attend the OpenStack Summit feel free to include
your name on the list, as well as join us at the RDO booth,
presentations, workshops, lightning talks and the social events!

See you in Vancouver!

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous - OSD constantly crashing caused by corrupted placement group

2018-05-16 Thread Siegfried Höllrigl

On 17.05.2018 at 00:12, Gregory Farnum wrote:



I'm a bit confused. Are you saying that
1) the ceph-objectstore-tool you pasted there successfully removed pg 
5.9b from osd.130 (as it appears), AND

Yes. The ceph-osd process for osd.130 was not running in that phase.
2) pg 5.9b was active with one of the other OSDs as primary, so all 
data remained available, AND
Yes. pg 5.9b is active all of the time (on two other OSDs). I think 
OSD.19 is the primary for that pg.

"ceph pg 5.9b query" tells me:
.
    "up": [
        19,
        166
    ],
    "acting": [
        19,
        166
    ],
    "actingbackfill": [
        "19",
        "166"
    ],


3) when pg 5.9b got backfilled into osd.130, osd.130 crashed again? 
(But the other OSDs kept the PG fully available, without crashing?)

Yes.

It crashes again with the following lines in the osd log :
    -2> 2018-05-16 11:11:59.639980 7fe812ffd700  5 -- 
10.7.2.141:6800/173031 >> 10.7.2.49:6836/3920 conn(0x5619ed76c000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=24047 cs=1 l=0). rx 
osd.19 seq 24 0x5619eebd6d00 pg_backfill(progress 5.9b e 505567/505567 
lb 5:d97d84eb:::rbd_data.112913b238e1f29.0ba3:56c06) v3
    -1> 2018-05-16 11:11:59.639995 7fe812ffd700  1 -- 
10.7.2.141:6800/173031 <== osd.19 10.7.2.49:6836/3920 24  
pg_backfill(progress 5.9b e 505567/505567 lb 
5:d97d84eb:::rbd_data.112913b238e1f29.0ba3:56c06) v3  
955+0+0 (3741758263 0 0) 0x5619eebd6d00 con 0x5619ed76c000
 0> 2018-05-16 11:11:59.645952 7fe7fe7eb700 -1 
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: In function 'virtual void 
PrimaryLogPG::on_local_recover(const hobject_t&, const 
ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' 
thread 7fe7fe7eb700 time 2018-05-16 11:11:59.640238
/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != 
recovery_info.ss.clone_snaps.end())


 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x5619c11b1a02]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, 
ObjectRecoveryInfo const&, std::shared_ptr, bool, 
ObjectStore::Transaction*)+0xd63) [0x5619c0d1f873]
 3: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, 
PushReplyOp*, ObjectStore::Transaction*)+0x2da) [0x5619c0eb15ca]
 4: 
(ReplicatedBackend::_do_push(boost::intrusive_ptr)+0x12e) 
[0x5619c0eb17fe]
 5: 
(ReplicatedBackend::_handle_message(boost::intrusive_ptr)+0x2c1) 
[0x5619c0ec0d71]
 6: (PGBackend::handle_message(boost::intrusive_ptr)+0x50) 
[0x5619c0dcc440]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x543) [0x5619c0d30853]
 8: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3a9) 
[0x5619c0ba7539]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x57) [0x5619c0e50f37]
 10: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x1047) [0x5619c0bd5847]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) 
[0x5619c11b67f4]

 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5619c11b9830]
 13: (()+0x76ba) [0x7fe8173746ba]
 14: (clone()+0x6d) [0x7fe8163eb41d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.




That sequence of events is *deeply* confusing and I really don't 
understand how it might happen.


Sadly I don't think you can grab a PG for export without stopping the 
OSD in question.



When we query the pg, we can see a lot of "snap_trimq".
Can this be cleaned somehow, even if the pg is undersized and
degraded ?


I *think* the PG will keep trimming snapshots even if 
undersized+degraded (though I don't remember for sure), but snapshot 
trimming is often heavily throttled and I'm not aware of any way to 
specifically push one PG to the front. If you're interested in 
speeding snaptrimming up you can search the archives or check the docs 
for the appropriate config options.

-Greg


Ok. I think we should try that next.

Thank you !





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter

Hi,


I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far let's me think that this information is only available as 
long as the requests are actually blocked. Is this
correct?


You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205


the cluster is a Proxmox installation, which is based on an Ubuntu kernel.

# ceph -v
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)

The mystery is that these blocked requests occur in large numbers when at least one of the 6 servers is booted with kernel 
4.15.17; if all are running 4.13.16, blocked requests are infrequent and few.



Regards,

Uwe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com