Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Frank Schilder
> So hdparm -W 0 /dev/sdx doesn't work or it makes no difference?

I wrote "We found the raw throughput in fio benchmarks to be very different for 
write-cache enabled and disabled, exactly as explained in the performance 
article.", so yes, it makes a huge difference.

> Also I am not sure I understand why it should happen before the OSDs have been 
> started. 
> At least in my experience hdparm applies it to the hardware regardless.

I'm not sure I understand this question. Ideally it happens at boot time and if 
this doesn't work, at least sometime before the OSD is started. Why and how 
else would one want this to happen?
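For example, one way to make this happen at boot, before any OSD starts, is a 
udev rule that switches the cache off as each SSD appears. A rough sketch only 
(not our exact setup; the rule path, device match and hdparm location differ per distro):

# /etc/udev/rules.d/99-disable-write-cache.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"

# reload the rules so newly appearing devices pick this up
udevadm control --reload-rules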

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Understand ceph df details

2020-01-21 Thread CUZA Frédéric
Hi everyone,

I'm trying to understand the difference between the output of the command:
ceph df detail

And the result I'm getting when I run this script:
total_bytes=0
while read -r user; do
  echo "$user"
  # per-user byte count as reported by RGW (the rounded value)
  bytes=$(radosgw-admin user stats --uid="${user}" | grep total_bytes_rounded | tr -dc "0-9")
  if [ -n "${bytes}" ]; then
    total_bytes=$((total_bytes + bytes))
    # note: dividing by 1000^4 gives decimal TB rather than binary TiB (1024^4)
    pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
    echo "  ($bytes B) $pretty_bytes TiB"
  fi
  pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
done <<< "$(radosgw-admin user list | jq -r '.[]')"
echo ""
echo "Total : ($total_bytes B) $pretty_total_bytes TiB"


When I run `ceph df detail` I get this:
default.rgw.buckets.data   70   N/A   N/A   226TiB   89.23   27.2TiB   61676992   61.68M   2.05GiB   726MiB   677TiB

And when I use my script I don't have the same result :
Total : (207579728699392 B) 207.57 TiB

It means that I have 20 TiB somewhere that I can't find and, most of all, I can't 
understand where this 20 TiB comes from.
Does anyone have an explanation?


FYI:
[root@ceph_monitor01 ~]# radosgw-admin gc list --include-all | grep oid | wc -l
23


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Sasha Litvak
Frank,

Sorry for the confusion.  I thought that turning off the cache using hdparm -W
0 /dev/sdx takes effect right away, and that in the case of non-RAID controllers and
Seagate or Micron SSDs I would see a difference when starting an fio benchmark
right after executing hdparm.  So I wonder whether it makes a difference whether
the cache is turned off before the OSD is started or after.



On Tue, Jan 21, 2020, 2:07 AM Frank Schilder  wrote:

> > So hdparm -W 0 /dev/sdx doesn't work or it makes no difference?
>
> I wrote "We found the raw throughput in fio benchmarks to be very
> different for write-cache enabled and disabled, exactly as explained in the
> performance article.", so yes, it makes a huge difference.
>
> > Also I am not sure I understand why it should happen before the OSDs have
> been started.
> > At least in my experience hdparm applies it to the hardware regardless.
>
> I'm not sure I understand this question. Ideally it happens at boot time
> and if this doesn't work, at least sometime before the OSD is started. Why
> and how else would one want this to happen?
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>


Re: [ceph-users] Luminous Bluestore OSDs crashing with ASSERT

2020-01-21 Thread Stefan Priebe - Profihost AG
Hello Igor,

thanks for all your feedback and all your help.

The first thing I'll try is to upgrade a bunch of systems from
the 4.19.66 kernel to 4.19.97 and see what happens.

I'll report back in 7-10 days to verify whether this helps.
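For reference, this is roughly how I check the bluestore_reads_with_retries
counter Igor mentioned across the OSDs of one node (just a sketch; admin socket
paths may differ):

# query every OSD admin socket on this host for the retry counter
for sock in /var/run/ceph/ceph-osd.*.asok; do
  printf '%s: ' "$sock"
  ceph daemon "$sock" perf dump bluestore | grep -m1 reads_with_retries
done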

Greets,
Stefan

On 20.01.20 at 13:12, Igor Fedotov wrote:
> Hi Stefan,
> 
> these lines are result of transaction dump performed on a failure during
> transaction submission (which is shown as
> 
> "submit_transaction error: Corruption: block checksum mismatch code = 2"
> 
> Most probably they are of no interest (checksum errors are unlikely to
> be caused by transaction content) and hence we need earlier events to
> learn what caused that checksum mismatch.
> 
> It's hard to give any formal overview of what you should look for; from
> my troubleshooting experience one may generally try to find:
> 
> - some previous error/warning indications (e.g. allocation, disk access,
> etc)
> 
> - prior OSD crashes (sometimes they might have different causes/stack
> traces/assertion messages)
> 
> - any timeout or retry indications
> 
> - any uncommon log patterns which aren't present during regular running
> but happen each time before the crash/failure.
> 
> 
> Anyway I think the inspection depth should be much(?) deeper than
> presumably it is (from what I can see from your log snippets).
> 
> Ceph keeps the last 10000 log events at an increased log level and dumps
> them on crash, with a negative index starting at -10000 up to -1 as a prefix:
> 
> -1> 2020-01-16 01:10:13.404090 7f3350a14700 -1 rocksdb:
> 
> 
> It would be great if you could share several log snippets for different
> crashes containing these last 10000 lines.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 1/19/2020 9:42 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Igor,
>>
>> there's absolutely nothing in the logs before.
>>
>> What do those lines mean:
>> Put( Prefix = O key =
>> 0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe6f0012'x'
>>
>> Value size = 480)
>> Put( Prefix = O key =
>> 0x7f8001cc45c881217262'd_data.4303206b8b4567.9632!='0xfffe'o'
>>
>> Value size = 510)
>>
>> on the right side I always see 0xfffe on all
>> failed OSDs.
>>
>> greets,
>> Stefan
>> On 19.01.20 at 14:07, Stefan Priebe - Profihost AG wrote:
>>> Yes, except that this happens on 8 different clusters with different
>>> hw but same ceph version and same kernel version.
>>>
>>> Greets,
>>> Stefan
>>>
 On 19.01.2020 at 11:53, Igor Fedotov wrote:

 So the intermediate summary is:

 Any OSD in the cluster can experience interim RocksDB checksum
 failure. Which isn't present after OSD restart.

 No HW issues observed, no persistent artifacts (except OSD log)
 afterwards.

 And looks like the issue is rather specific to the cluster as no
 similar reports from other users seem to be present.


 Sorry, I'm out of ideas other then collect all the failure logs and
 try to find something common in them. May be this will shed some
 light..

 BTW from my experience it might make sense to inspect OSD log prior
 to failure (any error messages and/or prior restarts, etc) sometimes
 this might provide some hints.


 Thanks,

 Igor


> On 1/17/2020 2:30 PM, Stefan Priebe - Profihost AG wrote:
> HI Igor,
>
>> On 17.01.20 at 12:10, Igor Fedotov wrote:
>> hmmm..
>>
>> Just in case - suggest to check H/W errors with dmesg.
> this happens on around 80 nodes - I don't expect all of those to have
> unidentified HW errors. Also all of them are monitored - no dmesg
> output contains any errors.
>
>> Also there are some (not very much though) chances this is another
>> incarnation of the following bug:
>> https://tracker.ceph.com/issues/22464
>> https://github.com/ceph/ceph/pull/24649
>>
>> The corresponding PR works around it for main device reads (user data
>> only!) but theoretically it might still happen
>>
>> either for DB device or DB data at main device.
>>
>> Can you observe any bluefs spillovers? Are there any correlation
>> between
>> failing OSDs and spillover presence if any, e.g. failing OSDs always
>> have a spillover. While OSDs without spillovers never face the
>> issue...
>>
>> To validate this hypothesis one can try to monitor/check (e.g. once a
>> day for a week or something) "bluestore_reads_with_retries"
>> counter over
>> OSDs to learn if the issue is happening
>>
>> in the system.  Non-zero values mean it's there for user data/main
>> device and hence is likely to happen for DB ones as well (which
>> doesn't
>> have any workaround yet).
> OK, I checked bluestore_reads_with_retries on 360 OSDs but all of
> them say 0.
>
>
>> Additional

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Frank Schilder
OK, now I understand. Yes, the cache setting will take effect immediately. It's 
more a question of whether you trust the disk firmware to apply the change correctly 
in all situations while production IO is active at the same time (will the volatile 
cache be flushed correctly or not)? I would not, and would rather change the setting 
while the OSD is down.

During benchmarks on raw disks I just switched the cache on and off as needed. 
There was nothing running on the disks and the fio benchmark is destructive 
anyway.
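In practice that was just something like the following between fio runs (a sketch; 
/dev/sdX is the raw benchmark disk and gets overwritten):

hdparm -W 0 /dev/sdX   # disable the volatile write cache
hdparm -W   /dev/sdX   # query the current setting to confirm
# ... run the destructive fio job against the raw device, then re-enable if wanted:
hdparm -W 1 /dev/sdX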

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sasha Litvak 
Sent: 21 January 2020 10:19
To: Frank Schilder
Cc: ceph-users
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

Frank,

Sorry for the confusion.  I thought that turning off the cache using hdparm -W 0 
/dev/sdx takes effect right away, and that in the case of non-RAID controllers and 
Seagate or Micron SSDs I would see a difference when starting an fio benchmark right 
after executing hdparm.  So I wonder whether it makes a difference whether the cache 
is turned off before the OSD is started or after.



On Tue, Jan 21, 2020, 2:07 AM Frank Schilder 
mailto:fr...@dtu.dk>> wrote:
> So hdparm -W 0 /dev/sdx doesn't work or it makes no difference?

I wrote "We found the raw throughput in fio benchmarks to be very different for 
write-cache enabled and disabled, exactly as explained in the performance 
article.", so yes, it makes a huge difference.

> Also I am not sure I understand why it should happen before the OSDs have been 
> started.
> At least in my experience hdparm applies it to the hardware regardless.

I'm not sure I understand this question. Ideally it happens at boot time and if 
this doesn't work, at least sometime before the OSD is started. Why and how 
else would one want this to happen?

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Martin Mlynář
Hi,


I'm having trouble changing osd_memory_target on my test cluster. I've
upgraded the whole cluster from Luminous to Nautilus, and all OSDs are running
BlueStore. Because this testlab is short on RAM, I wanted to lower
osd_memory_target to save some memory.

# ceph version
ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus
(stable)

# ceph config set osd osd_memory_target 2147483648
# ceph config dump
WHO   MASK LEVEL    OPTION    VALUE    RO
  mon  advanced auth_client_required  cephx    *
  mon  advanced auth_cluster_required cephx    *
  mon  advanced auth_service_required cephx    *
  mon  advanced mon_allow_pool_delete true
  mon  advanced mon_max_pg_per_osd    500
  mgr  advanced mgr/balancer/active   true
  mgr  advanced mgr/balancer/mode crush-compat
  osd  advanced osd_crush_update_on_start true
  osd  advanced osd_max_backfills 4
*  osd  basic    osd_memory_target 2147483648*

Now any OSD is unable to start/restart:

# /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

LOG /var/log/ceph/ceph-osd.0.log:

min_mon_release 14 (nautilus)
0: [v2:10.0.92.69:3300/0,v1:10.0.92.69:6789/0] mon.testlab-ceph-03
1: [v2:10.0.92.72:3300/0,v1:10.0.92.72:6789/0] mon.testlab-ceph-04
2: [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0] mon.testlab-ceph-01
3: [v2:10.0.92.68:3300/0,v1:10.0.92.68:6789/0] mon.testlab-ceph-02

   -54> 2020-01-21 11:45:19.289 7f6aa5d78700  1 monclient:  mon.2 has
(v2) addrs [v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0] but i'm connected
to v1:10.0.92.67:6789/0, reconnecting
   -53> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient:
_reopen_session rank -1
   -52> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
picked mon.testlab-ceph-01 con 0x563319682880 addr
[v2:10.0.92.67:3300/0,v1:10.0.92.67:6789/0]
   -51> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
picked mon.testlab-ceph-04 con 0x563319682d00 addr
[v2:10.0.92.72:3300/0,v1:10.0.92.72:6789/0]
   -50> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
picked mon.testlab-ceph-02 con 0x563319683180 addr
[v2:10.0.92.68:3300/0,v1:10.0.92.68:6789/0]
   -49> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
start opening mon connection
   -48> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
start opening mon connection
   -47> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
start opening mon connection
   -46> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient(hunting):
_renew_subs
   -45> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
get_auth_request con 0x563319682880 auth_method 0
   -44> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
get_auth_request method 2 preferred_modes [1,2]
   -43> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
_init_auth method 2
   -42> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
handle_auth_reply_more payload 9
   -41> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
handle_auth_reply_more payload_len 9
   -40> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
handle_auth_reply_more responding with 36 bytes
   -39> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting):
get_auth_request con 0x563319682d00 auth_method 0
   -38> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting):
get_auth_request method 2 preferred_modes [1,2]
   -37> 2020-01-21 11:45:19.289 7f6aa6579700 10 monclient(hunting):
_init_auth method 2
   -36> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting):
get_auth_request con 0x563319683180 auth_method 0
   -35> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting):
get_auth_request method 2 preferred_modes [1,2]
   -34> 2020-01-21 11:45:19.289 7f6aa757b700 10 monclient(hunting):
_init_auth method 2
   -33> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient(hunting):
handle_auth_done global_id 5638238 payload 386
   -32> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _finish_hunting 0
   -31> 2020-01-21 11:45:19.289 7f6aa6d7a700  1 monclient: found
mon.testlab-ceph-01
   -30> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient:
_send_mon_message to mon.testlab-ceph-01 at v2:10.0.92.67:3300/0
   -29> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient: _finish_auth 0
   -28> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient:
_check_auth_rotating renewing rotating keys (they expired before
2020-01-21 11:44:49.293059)
   -27> 2020-01-21 11:45:19.289 7f6aa6d7a700 10 monclient:
_send_mon_message to mon.testlab-ceph-01 at v2:10.0.92.67:3300/0
   -26> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: handle_monmap
mon_map magic: 0 v1
   -25> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient:  got monmap
17 from mon.testlab-ceph-01 (according to old e17)
   -24> 2020-01-21 11:45:19.289 7f6aa5d78700 10 monclient: dump:
epoch 17
fsid f42082cc-c35a-44fe-b7ef-c2eb2ff1fe43
last_changed 20

Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net):

> 
> When I remove this option:
> # ceph config rm osd osd_memory_target
> 
> OSD starts without any trouble. I've seen same behaviour when I wrote
> this parameter into /etc/ceph/ceph.conf
> 
> Is this a known bug? Am I doing something wrong?

I wonder if they would still crash if the OSDs dropped their caches
beforehand. There is support for this in master, but it doesn't look
like it's backported to nautilus: https://tracker.ceph.com/issues/24176

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


[ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread John Madden
On 14.2.5 but also present in Luminous, buffer_anon memory use spirals
out of control when scanning many thousands of files. The use case is
more or less "look up this file and if it exists append this chunk to
it, otherwise create it with this chunk." The memory is recovered as
soon as the workload stops, and at most only 20-100 files are ever
open at one time.

Cache gets oversized but that's more or less expected, it's pretty
much always/immediately in some warn state, which makes me wonder if a
much larger cache might help buffer_anon use, looking for advice
there. This is on a deeply-hashed directory, but overall very little
data (<20GB), lots of tiny files.

As I typed this post the pool went from ~60GB to ~110GB. I've resorted
to a cronjob that restarts the active MDS when it reaches swap just to
keep the cluster alive.

~$ ceph daemon mds.mds1 dump_mempools
{
  "mempool": {
"by_pool": {
  "bloom_filter": {
"items": 4631659,
"bytes": 4631659
  },
  "bluestore_alloc": {
"items": 0,
"bytes": 0
  },
  "bluestore_cache_data": {
"items": 0,
"bytes": 0
  },
  "bluestore_cache_onode": {
"items": 0,
"bytes": 0
  },
  "bluestore_cache_other": {
"items": 0,
"bytes": 0
  },
  "bluestore_fsck": {
"items": 0,
"bytes": 0
  },
  "bluestore_txc": {
"items": 0,
"bytes": 0
  },
  "bluestore_writing_deferred": {
"items": 0,
"bytes": 0
  },
  "bluestore_writing": {
"items": 0,
"bytes": 0
  },
  "bluefs": {
"items": 0,
"bytes": 0
  },
  "buffer_anon": {
"items": 67791,
"bytes": 85598497506
  },
  "buffer_meta": {
"items": 57987,
"bytes": 5102856
  },
  "osd": {
"items": 0,
"bytes": 0
  },
  "osd_mapbl": {
"items": 0,
"bytes": 0
  },
  "osd_pglog": {
"items": 0,
"bytes": 0
  },
  "osdmap": {
"items": 582,
"bytes": 12248
  },
  "osdmap_mapping": {
"items": 0,
"bytes": 0
  },
  "pgmap": {
"items": 0,
"bytes": 0
  },
  "mds_co": {
"items": 284739975,
"bytes": 6883426437
  },
  "unittest_1": {
"items": 0,
"bytes": 0
  },
  "unittest_2": {
"items": 0,
"bytes": 0
  }
},
"total": {
  "items": 289497994,
  "bytes": 92491670706
}
  }
}


~$ ceph daemon mds.mds0 perf dump
{
  "AsyncMessenger::Worker-0": {
"msgr_recv_messages": 1360700,
"msgr_send_messages": 2298283,
"msgr_recv_bytes": 17915475859,
"msgr_send_bytes": 2024853049,
"msgr_created_connections": 2031,
"msgr_active_connections": 18446744073709552000,
"msgr_running_total_time": 96.2125937,
"msgr_running_send_time": 38.268843421,
"msgr_running_recv_time": 44.299468018,
"msgr_running_fast_dispatch_time": 17.303765523
  },
  "AsyncMessenger::Worker-1": {
"msgr_recv_messages": 971844,
"msgr_send_messages": 1266589,
"msgr_recv_bytes": 14435001275,
"msgr_send_bytes": 1755800874,
"msgr_created_connections": 213,
"msgr_active_connections": 18446744073709552000,
"msgr_running_total_time": 60.745883284,
"msgr_running_send_time": 17.694164502,
"msgr_running_recv_time": 24.300171049,
"msgr_running_fast_dispatch_time": 14.947038849
  },
  "AsyncMessenger::Worker-2": {
"msgr_recv_messages": 1742305,
"msgr_send_messages": 2163916,
"msgr_recv_bytes": 30829094382,
"msgr_send_bytes": 2915900257,
"msgr_created_connections": 233,
"msgr_active_connections": 18446744073709552000,
"msgr_running_total_time": 137.913631549,
"msgr_running_send_time": 41.234654308,
"msgr_running_recv_time": 40.918463152,
"msgr_running_fast_dispatch_time": 36.512891479
  },
  "cct": {
"total_workers": 1,
"unhealthy_workers": 0
  },
  "finisher-PurgeQueue": {
"queue_len": 0,
"complete_latency": {
  "avgcount": 47756,
  "sum": 217.373554326,
  "avgtime": 0.004551753
}
  },
  "mds": {
"request": 1178430,
"reply": 1178373,
"reply_latency": {
  "avgcount": 1178373,
  "sum": 60810.239426392,
  "avgtime": 0.051605255
},
"forward": 0,
"dir_fetch": 49751,
"dir_commit": 44312,
"dir_split": 0,
"dir_merge": 0,
"inode_max": 10,
"inodes": 2759030,
"inodes_top": 1919408,
"inodes_bottom": 836395,
"inodes_pin_tail": 3227,
"inodes_pinned": 17019,
"inodes_expired": 42387174,
"inodes_with_caps": 5485,
"caps": 11773,
"subtrees": 2,
"traverse": 1878329,
"traverse_hit": 1675078,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 42538,
"traverse_remote_ino": 0,
"traverse_lock": 25,
"load_cent": 1294614,
"q": 29,
"exported": 0

[ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Hayashida, Mami
I am trying to set up a CephFS with a Cache Tier (for data) on a mini test
cluster, but a kernel-mount CephFS client is unable to write.  Cache tier
setup alone seems to be working fine (I tested it with `rados put` and `osd
map` commands to verify on which OSDs the objects are placed) and setting
up CephFS without the cache-tiering also worked fine on the same cluster
with the same client, but combining the two fails.  Here is what I have
tried:

Ceph version: 14.2.6

Set up Cache Tier:
$ ceph osd crush rule create-replicated highspeedpool default host ssd
$ ceph osd crush rule create-replicated highcapacitypool default host hdd

$ ceph osd pool create cephfs-data 256 256 highcapacitypool
$ ceph osd pool create cephfs-metadata 128 128 highspeedpool
$ ceph osd pool create cephfs-data-cache 256 256 highspeedpool

$ ceph osd tier add cephfs-data cephfs-data-cache
$ ceph osd tier cache-mode cephfs-data-cache writeback
$ ceph osd tier set-overlay cephfs-data cephfs-data-cache

$ ceph osd pool set cephfs-data-cache hit_set_type bloom

###
All the cache tier configs set (hit_set_count, hit_set period,
target_max_bytes etc.)
###

$ ceph-deploy mds create 
$ ceph fs new cephfs_test cephfs-metadata cephfs-data

$ ceph fs authorize cephfs_test client.testuser / rw
$ ceph auth ls
client.testuser
key: XXX
caps: [mds] allow rw
caps: [mon] allow r
caps: [osd] allow rw tag cephfs data=cephfs_test

### Confirm the pool setting
$ ceph osd pool ls detail
pool 1 'cephfs-data' replicated size 3 min_size 2 crush_rule 2 object_hash
rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 63 lfor
53/53/53 flags hashpspool tiers 3 read_tier 3 write_tier 3 stripe_width 0
application cephfs
pool 2 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change
63 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
recovery_priority 5 application cephfs
pool 3 'cephfs-data-cache' replicated size 3 min_size 2 crush_rule 1
object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change
63 lfor 53/53/53 flags hashpspool,incomplete_clones tier_of 1 cache_mode
writeback target_bytes 800 hit_set
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s x2
decay_rate 0 search_last_n 0 stripe_width 0

 Set up the client side (kernel mount)
$ sudo vim /etc/ceph/fsclient_secret
$ sudo mkdir /mnt/cephfs
$ sudo mount -t ceph :6789:/  /mnt/cephfs -o
name=testuser,secretfile=/etc/ceph/fsclient_secret // no errors at this
point

$ sudo vim /mnt/cephfs/file1   // Writing attempt fails

"file1" E514: write error (file system full?)
WARNING: Original file may be lost or damaged
don't quit the editor until the file is successfully written!

$ ls -l /mnt/cephfs
total 0
-rw-r--r-- 1 root root 0 Jan 21 16:25 file1

Any help will be appreciated.

*Mami Hayashida*
*Research Computing Associate*
Univ. of Kentucky ITS Research Computing Infrastructure


Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin and external metering tools.

2020-01-21 Thread Robin H. Johnson
On Mon, Jan 20, 2020 at 12:57:51PM +, EDH - Manuel Rios wrote:
> Hi Cephs
> 
> Several nodes of our Ceph 14.2.5 are fully dedicated to host cold storage / 
> backups information.
> 
> Today checking the data usage with a customer found that rgw-admin is 
> reporting:
...
> That's near 5TB used space in CEPH, and the external tools are reporting just 
> 1.42TB.
- What are the external tools?
- How many objects do the external tools report as existing?
- Do the external tools include incomplete multipart uploads in their
  size data?
- If bucket versioning is enabled, do the tools include all versions in the
  size data?
- Are there leftover multipart pieces without a multipart head?  (this
  is a Ceph bug that I think is fixed in your release, but old pieces
  might still exist).
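(For example, hedged ways to check the multipart and versioning questions from
the S3 side; endpoint and bucket names are placeholders:)

aws --endpoint=http://RGW_HOST s3api list-multipart-uploads --bucket BUCKET
aws --endpoint=http://RGW_HOST s3api get-bucket-versioning --bucket BUCKET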

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


Re: [ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Ilya Dryomov
On Tue, Jan 21, 2020 at 6:02 PM Hayashida, Mami  wrote:
>
> I am trying to set up a CephFS with a Cache Tier (for data) on a mini test 
> cluster, but a kernel-mount CephFS client is unable to write.  Cache tier 
> setup alone seems to be working fine (I tested it with `rados put` and `osd 
> map` commands to verify on which OSDs the objects are placed) and setting up 
> CephFS without the cache-tiering also worked fine on the same cluster with 
> the same client, but combining the two fails.  Here is what I have tried:
>
> Ceph version: 14.2.6
>
> Set up Cache Tier:
> $ ceph osd crush rule create-replicated highspeedpool default host ssd
> $ ceph osd crush rule create-replicated highcapacitypool default host hdd
>
> $ ceph osd pool create cephfs-data 256 256 highcapacitypool
> $ ceph osd pool create cephfs-metadata 128 128 highspeedpool
> $ ceph osd pool create cephfs-data-cache 256 256 highspeedpool
>
> $ ceph osd tier add cephfs-data cephfs-data-cache
> $ ceph osd tier cache-mode cephfs-data-cache writeback
> $ ceph osd tier set-overlay cephfs-data cephfs-data-cache
>
> $ ceph osd pool set cephfs-data-cache hit_set_type bloom
>
> ###
> All the cache tier configs set (hit_set_count, hit_set period, 
> target_max_bytes etc.)
> ###
>
> $ ceph-deploy mds create 
> $ ceph fs new cephfs_test cephfs-metadata cephfs-data
>
> $ ceph fs authorize cephfs_test client.testuser / rw
> $ ceph auth ls
> client.testuser
> key: XXX
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs_test
>
> ### Confirm the pool setting
> $ ceph osd pool ls detail
> pool 1 'cephfs-data' replicated size 3 min_size 2 crush_rule 2 object_hash 
> rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 63 lfor 
> 53/53/53 flags hashpspool tiers 3 read_tier 3 write_tier 3 stripe_width 0 
> application cephfs
> pool 2 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 
> 63 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
> recovery_priority 5 application cephfs
> pool 3 'cephfs-data-cache' replicated size 3 min_size 2 crush_rule 1 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change 
> 63 lfor 53/53/53 flags hashpspool,incomplete_clones tier_of 1 cache_mode 
> writeback target_bytes 800 hit_set bloom{false_positive_probability: 
> 0.05, target_size: 0, seed: 0} 120s x2 decay_rate 0 search_last_n 0 
> stripe_width 0
>
>  Set up the client side (kernel mount)
> $ sudo vim /etc/ceph/fsclient_secret
> $ sudo mkdir /mnt/cephfs
> $ sudo mount -t ceph :6789:/  /mnt/cephfs -o 
> name=testuser,secretfile=/etc/ceph/fsclient_secret // no errors at this 
> point
>
> $ sudo vim /mnt/cephfs/file1   // Writing attempt fails
>
> "file1" E514: write error (file system full?)
> WARNING: Original file may be lost or damaged
> don't quit the editor until the file is successfully written!
>
> $ ls -l /mnt/cephfs
> total 0
> -rw-r--r-- 1 root root 0 Jan 21 16:25 file1
>
> Any help will be appreciated.

Hi Mami,

Is there anything in dmesg?

What happens if you mount without involving testuser (i.e. using
client.admin and the admin key)?

Thanks,

Ilya


Re: [ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Hayashida, Mami
Ilya,

Thank you for your suggestions!

`dmesg` (on the client node) only had `libceph: mon0 10.33.70.222:6789
socket error on write`.  No further detail.  But using the admin key
(client.admin) for mounting CephFS solved my problem.  I was able to write
successfully! :-)

$ sudo mount -t ceph 10.33.70.222:6789:/  /mnt/cephfs -o
name=admin,secretfile=/etc/ceph/fsclient_secret // with the
corresponding client.admin key

$ sudo vim /mnt/cephfs/file4
$ sudo ls -l /mnt/cephfs
total 1
-rw-r--r-- 1 root root  0 Jan 21 16:25 file1
-rw-r--r-- 1 root root  0 Jan 21 16:45 file2
-rw-r--r-- 1 root root  0 Jan 21 18:35 file3
-rw-r--r-- 1 root root 22 Jan 21 18:42 file4

Now, here is the difference between the two keys. client.testuser was
obviously generated with the command `ceph fs authorize cephfs_test
client.testuser / rw`, but something in there is obviously interfering with
CephFS with a Cache Tier pool.  Do I need to edit the `tag` or the `data`
part?  Now, I should mention the same type of key (like client.testuser)
worked just fine when I was testing CephFS without a Cache Tier pool.

client.admin
key: XXXZZZ
caps: [mds] allow *
caps: [mgr] allow *
caps: [mon] allow *
caps: [osd] allow *

client.testuser
key: XXXZZZ
caps: [mds] allow rw
caps: [mon] allow r
caps: [osd] allow rw tag cephfs data=cephfs_test


*Mami Hayashida*
*Research Computing Associate*
Univ. of Kentucky ITS Research Computing Infrastructure



On Tue, Jan 21, 2020 at 1:26 PM Ilya Dryomov  wrote:

> On Tue, Jan 21, 2020 at 6:02 PM Hayashida, Mami 
> wrote:
> >
> > I am trying to set up a CephFS with a Cache Tier (for data) on a mini
> test cluster, but a kernel-mount CephFS client is unable to write.  Cache
> tier setup alone seems to be working fine (I tested it with `rados put` and
> `osd map` commands to verify on which OSDs the objects are placed) and
> setting up CephFS without the cache-tiering also worked fine on the same
> cluster with the same client, but combining the two fails.  Here is what I
> have tried:
> >
> > Ceph version: 14.2.6
> >
> > Set up Cache Tier:
> > $ ceph osd crush rule create-replicated highspeedpool default host ssd
> > $ ceph osd crush rule create-replicated highcapacitypool default host hdd
> >
> > $ ceph osd pool create cephfs-data 256 256 highcapacitypool
> > $ ceph osd pool create cephfs-metadata 128 128 highspeedpool
> > $ ceph osd pool create cephfs-data-cache 256 256 highspeedpool
> >
> > $ ceph osd tier add cephfs-data cephfs-data-cache
> > $ ceph osd tier cache-mode cephfs-data-cache writeback
> > $ ceph osd tier set-overlay cephfs-data cephfs-data-cache
> >
> > $ ceph osd pool set cephfs-data-cache hit_set_type bloom
> >
> > ###
> > All the cache tier configs set (hit_set_count, hit_set period,
> target_max_bytes etc.)
> > ###
> >
> > $ ceph-deploy mds create 
> > $ ceph fs new cephfs_test cephfs-metadata cephfs-data
> >
> > $ ceph fs authorize cephfs_test client.testuser / rw
> > $ ceph auth ls
> > client.testuser
> > key: XXX
> > caps: [mds] allow rw
> > caps: [mon] allow r
> > caps: [osd] allow rw tag cephfs data=cephfs_test
> >
> > ### Confirm the pool setting
> > $ ceph osd pool ls detail
> > pool 1 'cephfs-data' replicated size 3 min_size 2 crush_rule 2
> object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change
> 63 lfor 53/53/53 flags hashpspool tiers 3 read_tier 3 write_tier 3
> stripe_width 0 application cephfs
> > pool 2 'cephfs-metadata' replicated size 3 min_size 2 crush_rule 1
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change
> 63 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16
> recovery_priority 5 application cephfs
> > pool 3 'cephfs-data-cache' replicated size 3 min_size 2 crush_rule 1
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode warn last_change
> 63 lfor 53/53/53 flags hashpspool,incomplete_clones tier_of 1 cache_mode
> writeback target_bytes 800 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s x2
> decay_rate 0 search_last_n 0 stripe_width 0
> >
> >  Set up the client side (kernel mount)
> > $ sudo vim /etc/ceph/fsclient_secret
> > $ sudo mkdir /mnt/cephfs
> > $ sudo mount -t ceph :6789:/  /mnt/cephfs -o
> name=testuser,secretfile=/etc/ceph/fsclient_secret // no errors at this
> point
> >
> > $ sudo vim /mnt/cephfs/file1   // Writing attempt fails
> >
> > "file1" E514: write error (file system full?)
> > WARNING: Original file may be lost or damaged
> > don't quit the editor until the file is successfully written!
> >
> > $ ls -l /mnt/cephfs
> > total 0
> > -rw-r--r-- 1 root root 0 Jan 21 16:25 file1
> >
> > Any help will be appreciated.
>
> Hi Mami,
>
> Is there anything in dmesg?
>
> What happens if you mount without involving testuser (i.e. using
> client.admin and the admin key)?
>
> Thanks,
>
> Ilya
>

Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin and external metering tools.

2020-01-21 Thread EDH - Manuel Rios
Hi Robin,

- What are the external tools? CloudBerry S3 Explorer and S3 Browser.
- How many objects do the external tools report as existing? The tools report 72142 
keys (approx. 6 TB) vs the CEPH num_objects of 180981 (9 TB).
- Do the external tools include incomplete multipart uploads in their size 
data? I don't think any external software includes incomplete objects in the size, 
since a recursive S3 API list doesn't include them.
Checking for incomplete multiparts, I got a 404 NoSuchKeys response.
- If bucket versioning is enabled, do the tools include all versions in the
  size data? Versioning is not enabled.
- Are there leftover multipart pieces without a multipart head? How can we 
check it?
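Would something like this be a valid way to look for them at the rados level?
(Just a sketch; it assumes the default RGW multipart/shadow object naming and can
be very slow on a large pool.)

rados -p default.rgw.buckets.data ls | grep -E '__multipart_|__shadow_' | head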

Specific bucket information:
{
"bucket": "XX",
"tenant": "",
"zonegroup": "4d8c7c5f-ca40-4ee3-b5bb-b2cad90bd007",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_pool": "default.rgw.buckets.index"
},
"id": "48efb8c3-693c-4fe0-bbe4-fdc16f590a82.132873679.2",
"marker": "48efb8c3-693c-4fe0-bbe4-fdc16f590a82.3886182.52",
"index_type": "Normal",
"owner": "XXX",
"ver": "0#89789,1#60165,2#80652,3#76367",
"master_ver": "0#0,1#0,2#0,3#0",
"mtime": "2020-01-05 19:29:59.360574Z",
"max_marker": "0#,1#,2#,3#",
"usage": {
"rgw.main": {
"size": 9050249319344,
"size_actual": 9050421526528,
"size_utilized": 9050249319344,
"size_kb": 8838134101,
"size_kb_actual": 8838302272,
"size_kb_utilized": 8838134101,
"num_objects": 180981
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 3861,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 4,
"num_objects": 143
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1024,
"max_size_kb": 0,
"max_objects": -1
}
}

-----Original Message-----
From: ceph-users  On behalf of Robin H. Johnson
Sent: Tuesday, January 21, 2020 18:58
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin 
and external metering tools.

On Mon, Jan 20, 2020 at 12:57:51PM +, EDH - Manuel Rios wrote:
> Hi Cephs
> 
> Several nodes of our Ceph 14.2.5 are fully dedicated to host cold storage / 
> backups information.
> 
> Today checking the data usage with a customer found that rgw-admin is 
> reporting:
...
> That's near 5TB used space in CEPH, and the external tools are reporting just 
> 1.42TB.
- What are the external tools?
- How many objects do the external tools report as existing?
- Do the external tools include incomplete multipart uploads in their
  size data?
- If bucket versioning is enabled, do the tools include all versions in the
  size data?
- Are there leftover multipart pieces without a multipart head?  (this
  is a Ceph bug that I think is fixed in your release, but old pieces
  might still exist).

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


Re: [ceph-users] CephFS with cache-tier kernel-mount client unable to write (Nautilus)

2020-01-21 Thread Ilya Dryomov
On Tue, Jan 21, 2020 at 7:51 PM Hayashida, Mami  wrote:
>
> Ilya,
>
> Thank you for your suggestions!
>
> `dmsg` (on the client node) only had `libceph: mon0 10.33.70.222:6789 socket 
> error on write`.  No further detail.  But using the admin key (client.admin) 
> for mounting CephFS solved my problem.  I was able to write successfully! :-)
>
> $ sudo mount -t ceph 10.33.70.222:6789:/  /mnt/cephfs -o 
> name=admin,secretfile=/etc/ceph/fsclient_secret // with the corresponding 
> client.admin key
>
> $ sudo vim /mnt/cephfs/file4
> $ sudo ls -l /mnt/cephfs
> total 1
> -rw-r--r-- 1 root root  0 Jan 21 16:25 file1
> -rw-r--r-- 1 root root  0 Jan 21 16:45 file2
> -rw-r--r-- 1 root root  0 Jan 21 18:35 file3
> -rw-r--r-- 1 root root 22 Jan 21 18:42 file4
>
> Now, here is the difference between the two keys. client.testuser was 
> obviously generated with the command `ceph fs authorize cephfs_test 
> client.testuser / rw`, but something in there is obviously interfering with 
> CephFS with a Cache Tier pool.  Do I need to edit the `tag` or the `data` 
> part?  Now, I should mention the same type of key (like client.testuser) 
> worked just fine when I was testing CephFS without a Cache Tier pool.
>
> client.admin
> key: XXXZZZ
> caps: [mds] allow *
> caps: [mgr] allow *
> caps: [mon] allow *
> caps: [osd] allow *
>
> client.testuser
> key: XXXZZZ
> caps: [mds] allow rw
> caps: [mon] allow r
> caps: [osd] allow rw tag cephfs data=cephfs_test

Right.  I think this is because with cache tiering you have two data
pools involved, but "ceph fs authorize" generates an OSD cap that ends
up restricting the client to the data pool that the filesystem
"knows" about.

You will probably need to create your client users by hand instead of
generating them with "ceph fs authorize".  CCing Patrick who might know
more.
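Something along these lines might work (an untested sketch; pool names taken
from your output, and you may want "ceph auth caps" instead if the user already
exists):

ceph auth get-or-create client.testuser \
  mds 'allow rw' \
  mon 'allow r' \
  osd 'allow rw pool=cephfs-data, allow rw pool=cephfs-data-cache'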

Thanks,

Ilya


Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Martin Mlynář
On Tue, 21 Jan 2020 at 17:09, Stefan Kooman  wrote:

> Quoting Martin Mlynář (nexus+c...@smoula.net):
>
> >
> > When I remove this option:
> > # ceph config rm osd osd_memory_target
> >
> > OSD starts without any trouble. I've seen same behaviour when I wrote
> > this parameter into /etc/ceph/ceph.conf
> >
> > Is this a known bug? Am I doing something wrong?
>
> I wonder if they would still crash if the OSD would drop their caches
> beforehand. There is support for this in master, but it doesn't look
> like it's backported to nautilus: https://tracker.ceph.com/issues/24176
>
> Gr. Stefan
>

Do you think this could help? The OSD does not even start, so I'm getting a little
lost as to how flushing caches could help.

According to trace I suspect something around processing config values.
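The only extra checks I can think of so far (a sketch):

ceph config help osd_memory_target      # option type/range as the cluster sees it
ceph config get osd osd_memory_target   # value stored in the mon config database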



>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>


Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net):

> Do you think this could help? OSD does not even start, I'm getting a little
> lost how flushing caches could help.

I might have misunderstood. I thought the OSDs crashed when you set the
config setting.

> According to trace I suspect something around processing config values.

I've just set the same config setting on a test cluster and restarted an
OSD without problem. So, not sure what is going on there.
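For reference, this is roughly what I did (a sketch; the OSD id is just an example):

ceph config set osd osd_memory_target 2147483648
systemctl restart ceph-osd@0
ceph daemon osd.0 config show | grep osd_memory_target   # confirm the running value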

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Eric K. Miller
We were able to isolate an individual Micron 5200 and perform Vitaliy's
tests in his spreadsheet.

An interesting item - write cache changes do NOT require a power cycle
to take effect, at least on a Micron 5200.

The complete results from fio are included at the end of this message
for the individual tests, for both write enabled and disabled.

The shortened version of the results:

Journal IOPS (sync=1)

-
Write cache ON:  
  write: IOPS=19.7k, BW=76.0MiB/s (80.7MB/s)(4618MiB/60001msec)
 lat (usec): min=42, max=1273, avg=50.18, stdev= 6.40

Write cache OFF:
  write: IOPS=32.3k, BW=126MiB/s (132MB/s)(7560MiB/60001msec)
 lat (usec): min=25, max=7079, avg=30.55, stdev= 7.94


Journal IOPS (fsync=1)

-
Write cache ON:
  write: IOPS=16.9k, BW=66.2MiB/s (69.4MB/s)(3971MiB/60001msec)
 lat (usec): min=24, max=5068, avg=31.77, stdev= 7.82

Write cache OFF:
  write: IOPS=32.1k, BW=126MiB/s (132MB/s)(7533MiB/60001msec)
 lat (usec): min=24, max=7076, avg=29.41, stdev= 7.52


Parallel random (sync)

-
Write cache ON:
  write: IOPS=43.9k, BW=172MiB/s (180MB/s)(10.1GiB/60001msec)
 lat (usec): min=220, max=14767, avg=727.61, stdev=313.36

Write cache OFF:
  write: IOPS=44.3k, BW=173MiB/s (181MB/s)(10.1GiB/60001msec)
 lat (usec): min=134, max=4941, avg=721.96, stdev=311.46


Parallel random (fsync)

-
Write cache ON:
  write: IOPS=44.4k, BW=173MiB/s (182MB/s)(10.2GiB/60001msec)
 lat (usec): min=109, max=4349, avg=703.01, stdev=303.69

Write cache OFF:
  write: IOPS=44.6k, BW=174MiB/s (183MB/s)(10.2GiB/60001msec)
 lat (usec): min=26, max=7288, avg=716.32, stdev=300.48


Non-txn random

-
Write cache ON:
  write: IOPS=43.1k, BW=168MiB/s (177MB/s)(9.87GiB/60004msec)
 lat (usec): min=350, max=41703, avg=2967.89, stdev=1682.28

Write cache OFF:
  write: IOPS=43.4k, BW=170MiB/s (178MB/s)(9.93GiB/60004msec)
 lat (usec): min=177, max=42795, avg=2947.52, stdev=1666.24


Linear write

-
Write cache ON:
  write: IOPS=126, BW=505MiB/s (530MB/s)(29.6GiB/60027msec)
 lat (msec): min=226, max=281, avg=253.26, stdev= 3.51

Write cache OFF:
  write: IOPS=126, BW=507MiB/s (531MB/s)(29.8GiB/60254msec)
 lat (msec): min=7, max=492, avg=252.52, stdev=13.16


So, while some improvement can be seen with the write cache disabled
(specifically on a Micron 5200), it is likely not enough to change much in
terms of Ceph's performance unless journal latency, IOPS, and bandwidth are
a bottleneck.

The "Journal IOPS (sync=1)" test shows the most dramatic difference,
where disabling the write cache reduces the I/O latency by 39% (a
reduction from 50.18us to 30.55us, a difference of about 0.02ms), which
correspondingly raises the IOPS and throughput of synchronous I/O.

The "Journal IOPS (fsync=1)" test also shows a dramatic difference, but
in terms of IOPS and throughput (approximately +90%), not latency.

Hope this helps!  I would love to hear feedback.

Eric



###
# Journal IOPS (sync=1)
###

# Write cache ENABLED
hdparm -W 1 /dev/sde

fio --ioengine=libaio -sync=1 --direct=1 --name=test --bs=4k --iodepth=1 --readwrite=write --runtime 60 --filename=/dev/sde

test: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T)
4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=75.6MiB/s][r=0,w=19.3k
IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=38269: Tue Jan 21 14:20:01 2020
  write: IOPS=19.7k, BW=76.0MiB/s (80.7MB/s)(4618MiB/60001msec)
slat (usec): min=2, max=180, avg= 4.43, stdev= 1.86
clat (nsec): min=1950, max=1262.3k, avg=45662.55, stdev=5778.88
 lat (usec): min=42, max=1273, avg=50.18, stdev= 6.40
clat percentiles (usec):
 |  1.00th=[   42],  5.00th=[   42], 10.00th=[   43], 20.00th=[
43],
 | 30.00th=[   43], 40.00th=[   44], 50.00th=[   44], 60.00th=[
45],
 | 70.00th=[   47], 80.00th=[   48], 90.00th=[   51], 95.00th=[
55],
 | 99.00th=[   66], 99.50th=[   74], 99.90th=[   91], 99.95th=[
104],
 | 99.99th=[  167]
   bw (  KiB/s): min=70152, max=81704, per=100.00%, avg=78835.97,
stdev=2929.71, samples=119
   iops: min=17538, max=20426, avg=19708.98, stdev=732.40,
samples=119
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=88.53%
  lat (usec)   : 100=11.41%, 250=0.06%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu  : usr=3.07%, sys=13.62%, ctx=1182324, majf=0, minf=27
  IO depths:

Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Виталий Филиппов
Hi! Thanks.

The parameter gets reset when you reconnect the SSD, so in fact the requirement is 
not to power cycle it after changing the parameter :-)

Ok, this case seems lucky, ~2x change isn't a lot. Can you tell the exact model 
and capacity of this Micron, and what controller was used in this test? I'll 
add it to the spreadsheet.
-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

2020-01-21 Thread Eric K. Miller
Hi Vitaliy,

The drive is a Micron 5200 ECO 3.84TB

This is from the msecli utility:

Device Name  : /dev/sde
Model No : Micron_5200_MTFDDAK3T8TDC
Serial No: 
FW-Rev   : D1MU404
Total Size   : 3840.00GB
Drive Status : Drive is in good health
Sata Link Speed  : Gen3 (6.0 Gbps)
Sata Link Max Speed  : Gen3 (6.0 Gbps)
Temp(C)  : 26

The server motherboard is: SuperMicro X10DRU-i+

Drives are connected to SATA connectors on the motherboard.

Processors are:  Xeon E5-2690v4

Eric


From: Виталий Филиппов [mailto:vita...@yourcmc.ru] 
Sent: Tuesday, January 21, 2020 3:43 PM
To: Eric K. Miller
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] low io with enterprise SSDs ceph luminous - can we 
expect more? [klartext]

Hi! Thanks.

The parameter gets reset when you reconnect the SSD, so in fact the requirement is 
not to power cycle it after changing the parameter :-)

Ok, this case seems lucky, ~2x change isn't a lot. Can you tell the exact model 
and capacity of this Micron, and what controller was used in this test? I'll 
add it to the spreadsheet.
-- 
With best regards,
Vitaliy Filippov


Re: [ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread Patrick Donnelly
On Tue, Jan 21, 2020 at 8:32 AM John Madden  wrote:
>
> On 14.2.5 but also present in Luminous, buffer_anon memory use spirals
> out of control when scanning many thousands of files. The use case is
> more or less "look up this file and if it exists append this chunk to
> it, otherwise create it with this chunk." The memory is recovered as
> soon as the workload stops, and at most only 20-100 files are ever
> open at one time.
>
> Cache gets oversized but that's more or less expected, it's pretty
> much always/immediately in some warn state, which makes me wonder if a
> much larger cache might help buffer_anon use, looking for advice
> there. This is on a deeply-hashed directory, but overall very little
> data (<20GB), lots of tiny files.
>
> As I typed this post the pool went from ~60GB to ~110GB. I've resorted
> to a cronjob that restarts the active MDS when it reaches swap just to
> keep the cluster alive.

This looks like it will be fixed by

https://tracker.ceph.com/issues/42943

That will be available in v14.2.7.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin and external metering tools.

2020-01-21 Thread EDH - Manuel Rios
Hi Cbodley  ,

As you requested on IRC, we tested directly with the AWS CLI.

Results:
aws --endpoint=http://XX --profile=ceph s3api list-multipart-uploads 
--bucket Evol6

It reports nearly 170 uploads.

We used the last one:
{
"Initiator": {
"DisplayName": "x",
"ID": "xx"
},
"Initiated": "2019-12-03T01:23:06.007Z",
"UploadId": "2~r0BMPPs8CewVZ6Qheu1s9WzaBn7bBvU",
"StorageClass": "STANDARD",
"Key": 
"MBS-da43656f-2b8c-464f-b341-03fdbdf446ae/CBB_SRV2K12/CBB_VM/192.168.0.197/SRV2K12/Hard
 disk 1$/20191203010516/431.cbrevision",
"Owner": {
"DisplayName": "x",
"ID": ""
}
}

aws --endpoint=http://x --profile=ceph s3api abort-multipart-upload 
--bucket Evol6 --key 
'MBS-da43656f-2b8c-464f-b341-03fdbdf446ae/CBB_SRV2K12/CBB_VM/192.168.0.197/SRV2K12/Hard
 disk 1$/20191203010516/431.cbrevision' --upload-id 
2~r0BMPPs8CewVZ6Qheu1s9WzaBn7bBvU

Return: An error occurred (NoSuchUpload) when calling the AbortMultipartUpload 
operation: Unknown

The same error is reported by S3CMD.
Maybe there is something wrong with parsing the "1$" inside the key.

Best Regards, 

Regards
Manuel

-----Original Message-----
From: ceph-users  On behalf of EDH - Manuel Rios
Sent: Tuesday, January 21, 2020 20:09
To: Robin H. Johnson 
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin 
and external metering tools.

Hi Robin,

- What are the external tools? CloudBerry S3 Explorer  and S3 Browser
- How many objects do the external tools report as existing?  Tool report 72142 
keys (Aprox 6TB) vs  CEPH num_objects  180981 (9TB)
- Do the external tools include incomplete multipart uploads in their  size 
data? I think no one external software include incomplete objects in the size, 
due S3 api list recursive don't include it.
Checking for incomplete multiparts , I got a response 404 NoSuchKeys.
- If bucket versioning is enabled, do the tools include all versions in the
  size data? Versioning is not enabled
- Are there leftover multipart pieces without a multipart head?   How can we 
check it?

Specific bucket information:
{
"bucket": "XX",
"tenant": "",
"zonegroup": "4d8c7c5f-ca40-4ee3-b5bb-b2cad90bd007",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "default.rgw.buckets.data",
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_pool": "default.rgw.buckets.index"
},
"id": "48efb8c3-693c-4fe0-bbe4-fdc16f590a82.132873679.2",
"marker": "48efb8c3-693c-4fe0-bbe4-fdc16f590a82.3886182.52",
"index_type": "Normal",
"owner": "XXX",
"ver": "0#89789,1#60165,2#80652,3#76367",
"master_ver": "0#0,1#0,2#0,3#0",
"mtime": "2020-01-05 19:29:59.360574Z",
"max_marker": "0#,1#,2#,3#",
"usage": {
"rgw.main": {
"size": 9050249319344,
"size_actual": 9050421526528,
"size_utilized": 9050249319344,
"size_kb": 8838134101,
"size_kb_actual": 8838302272,
"size_kb_utilized": 8838134101,
"num_objects": 180981
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 3861,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 4,
"num_objects": 143
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1024,
"max_size_kb": 0,
"max_objects": -1
}
}

-----Original Message-----
From: ceph-users  On behalf of Robin H. Johnson
Sent: Tuesday, January 21, 2020 18:58
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] S3 Bucket usage up 150% difference between rgw-admin 
and external metering tools.

On Mon, Jan 20, 2020 at 12:57:51PM +, EDH - Manuel Rios wrote:
> Hi Cephs
> 
> Several nodes of our Ceph 14.2.5 are fully dedicated to host cold storage / 
> backups information.
> 
> Today checking the data usage with a customer found that rgw-admin is 
> reporting:
...
> That's near 5TB used space in CEPH, and the external tools are reporting just 
> 1.42TB.
- What are the external tools?
- How many objects do the external tools report as existing?
- Do the external tools include incomplete multipart uploads in their
  size data?
- If bucket versioning is enabled, do the tools include all versions in the
  size data?
- Are there leftover multipart pieces without a multipart head?  (this
  is a Ceph bug that I think is fixed in your release, but old pieces
  might still exist).

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136

[ceph-users] Unable to track different ceph client version connections

2020-01-21 Thread Pardhiv Karri
Hi,

We upgraded our Ceph cluster from Hammer to Luminous and it is running
fine. Post upgrade we live-migrated all our OpenStack instances (not 100%
sure). Currently we see 1658 clients still on the Hammer version. To track the
clients we increased the debugging to debug_mon=10/10, debug_ms=1/5,
debug_monc=5/20 on all three monitors and are looking at all three monitor logs
at /var/log/ceph/mon..log and grepping for hammer and 0x81dff8eeacfffb, but
we are not seeing anything in the logs even after hours of waiting.

Earlier, in our other clusters, the logs used to show which OpenStack
compute node a client connection originated from. Am I missing something, or do
I need to add more logging, or do I need to check a different log on all three
Ceph monitor nodes?

Ceph Features Output:
==
{
"mon": {
"group": {
"features": "0x3ffddff8eeacfffb",
"release": "luminous",
"num": 3
}
},
"osd": {
"group": {
"features": "0x3ffddff8eeacfffb",
"release": "luminous",
"num": 1049
}
},
"client": {
"group": {
"features": "0x81dff8eeacfffb",
"release": "hammer",
"num": 1658
},
"group": {
"features": "0x3ffddff8eeacfffb",
"release": "luminous",
"num": 8712
}
}
}
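For reference, what I am ultimately after is a per-session view like the one from
the monitor admin socket (a sketch; I have not verified the exact output fields on
Luminous, and the mon id may not be the hostname):

# on each monitor host, list sessions and look for hammer-feature clients
ceph daemon mon.$(hostname -s) sessions | grep -i hammer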

-- 
*Pardhiv K*


Re: [ceph-users] MDS: obscene buffer_anon memory use when scanning lots of files

2020-01-21 Thread Dan van der Ster
On Wed, Jan 22, 2020 at 12:24 AM Patrick Donnelly 
wrote:

> On Tue, Jan 21, 2020 at 8:32 AM John Madden  wrote:
> >
> > On 14.2.5 but also present in Luminous, buffer_anon memory use spirals
> > out of control when scanning many thousands of files. The use case is
> > more or less "look up this file and if it exists append this chunk to
> > it, otherwise create it with this chunk." The memory is recovered as
> > soon as the workload stops, and at most only 20-100 files are ever
> > open at one time.
> >
> > Cache gets oversized but that's more or less expected, it's pretty
> > much always/immediately in some warn state, which makes me wonder if a
> > much larger cache might help buffer_anon use, looking for advice
> > there. This is on a deeply-hashed directory, but overall very little
> > data (<20GB), lots of tiny files.
> >
> > As I typed this post the pool went from ~60GB to ~110GB. I've resorted
> > to a cronjob that restarts the active MDS when it reaches swap just to
> > keep the cluster alive.
>
> This looks like it will be fixed by
>
> https://tracker.ceph.com/issues/42943
>
> That will be available in v14.2.7.
>

Couldn't John confirm that this is the issue by checking the heap stats and
triggering the release via

  ceph tell mds.mds1 heap stats
  ceph tell mds.mds1 heap release

(this would be much less disruptive than restarting the MDS)

-- Dan



>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Senior Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>