I decided to restart osd.0, and the CephFS load on all OSD nodes dropped. After this I still have the following on the first server:
[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3574310/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3584224/osdc
REQUESTS 2 homeless 0
317841 osd0 20.d6ec44c1 20.1 [0,28,5]/0 [0,28,5]/0 e65040 10001b44a70.00000000 0x40001c 102023 read
317853 osd0 20.5956d31b 20.1b [0,5,10]/0 [0,5,10]/0 e65040 10001ad8962.00000000 0x40001c 40731 read
LINGER REQUESTS
BACKOFFS
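(For reference, this is roughly how I dump the osdc/mdsc/monc state for every kernel client mount on a host in one go; just a small loop over the same debugfs paths as above:)

# dump in-flight OSD/MDS/mon state for every kernel ceph client on this host
for d in /sys/kernel/debug/ceph/*; do
    echo "== $d =="
    for f in osdc mdsc monc; do
        echo "-- $f --"
        cat "$d/$f"
    done
done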
And dmesg -T keeps giving me these (again with wrong timestamps):
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
What to do now? Restarting the monitor did not help.
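If someone can tell me what to look for, I can also raise the messenger debug level on the monitors while the io errors are happening, along these lines (just a sketch, I have not left this enabled):

# temporarily raise messenger debugging on all mons, then lower it again
ceph tell mon.* injectargs '--debug_ms 1'
# ... reproduce the io errors, check /var/log/ceph/ceph-mon.*.log ...
ceph tell mon.* injectargs '--debug_ms 0'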
-----Original Message-----
Subject: Re: [ceph-users] Luminous cephfs maybe not as stable as expected?
Forgot to add these:
[@ ~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3574310/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS
[@~]# cat /sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client3584224/osdc
REQUESTS 38 homeless 0
317841 osd0 20.d6ec44c1 20.1 [0,28,5]/0 [0,28,5]/0 e65040 10001b44a70.00000000 0x40001c 101139 read
317853 osd0 20.5956d31b 20.1b [0,5,10]/0 [0,5,10]/0 e65040 10001ad8962.00000000 0x40001c 39847 read
317835 osd3 20.ede889de 20.1e [3,12,27]/3 [3,12,27]/3 e65040 10001ad80f6.00000000 0x40001c 87758 read
317838 osd3 20.7b730a4e 20.e [3,31,9]/3 [3,31,9]/3 e65040 10001ad89d8.00000000 0x40001c 83444 read
317844 osd3 20.feead84c 20.c [3,13,18]/3 [3,13,18]/3 e65040 10001ad8733.00000000 0x40001c 77267 read
317852 osd3 20.bd2658e 20.e [3,31,9]/3 [3,31,9]/3 e65040 10001ad7e00.00000000 0x40001c 39331 read
317830 osd4 20.922e6d04 20.4 [4,16,27]/4 [4,16,27]/4 e65040 10001ad80f2.00000000 0x40001c 86326 read
317837 osd4 20.fe93d4ab 20.2b [4,14,25]/4 [4,14,25]/4 e65040 10001ad80fb.00000000 0x40001c 78951 read
317839 osd4 20.d7af926b 20.2b [4,14,25]/4 [4,14,25]/4 e65040 10001ad80ee.00000000 0x40001c 77556 read
317849 osd5 20.5fcb95c5 20.5 [5,18,29]/5 [5,18,29]/5 e65040 10001ad7f75.00000000 0x40001c 61147 read
317857 osd5 20.28764e9a 20.1a [5,7,28]/5 [5,7,28]/5 e65040 10001ad8a10.00000000 0x40001c 30369 read
317859 osd5 20.7bb79985 20.5 [5,18,29]/5 [5,18,29]/5 e65040 10001ad7fe8.00000000 0x40001c 27942 read
317836 osd8 20.e7bf5cf4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad7d79.00000000 0x40001c 133699 read
317842 osd8 20.abbb9df4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001d5903f.00000000 0x40001c 125308 read
317850 osd8 20.ecd0034 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad89b2.00000000 0x40001c 68348 read
317854 osd8 20.cef50134 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad8728.00000000 0x40001c 57431 read
317861 osd8 20.3e859bb4 20.34 [8,5,10]/8 [8,5,10]/8 e65040 10001ad8108.00000000 0x40001c 50642 read
317847 osd9 20.fc9e9f43 20.3 [9,29,17]/9 [9,29,17]/9 e65040 10001ad8101.00000000 0x40001c 88464 read
317848 osd9 20.d32b6ac3 20.3 [9,29,17]/9 [9,29,17]/9 e65040 10001ad8100.00000000 0x40001c 85929 read
317862 osd11 20.ee6cc689 20.9 [11,0,12]/11 [11,0,12]/11 e65040 10001ad7d64.00000000 0x40001c 40266 read
317843 osd12 20.a801f0e9 20.29 [12,26,8]/12 [12,26,8]/12 e65040 10001ad7f07.00000000 0x40001c 86610 read
317851 osd12 20.8bb48de9 20.29 [12,26,8]/12 [12,26,8]/12 e65040 10001ad7e4f.00000000 0x40001c 46746 read
317860 osd12 20.47815f36 20.36 [12,0,28]/12 [12,0,28]/12 e65040 10001ad8035.00000000 0x40001c 35249 read
317831 osd15 20.9e3acb53 20.13 [15,0,1]/15 [15,0,1]/15 e65040 10001ad8978.00000000 0x40001c 85329 read
317840 osd15 20.2a40efdf 20.1f [15,4,17]/15 [15,4,17]/15 e65040 10001ad7ef8.00000000 0x40001c 76282 read
317846 osd15 20.8143f15f 20.1f [15,4,17]/15 [15,4,17]/15 e65040 10001ad89d1.00000000 0x40001c 61297 read
317864 osd15 20.c889a49c 20.1c [15,0,31]/15 [15,0,31]/15 e65040 10001ad89fb.00000000 0x40001c 24385 read
317832 osd18 20.f76227a 20.3a [18,6,15]/18 [18,6,15]/18 e65040 10001ad8020.00000000 0x40001c 82852 read
317833 osd18 20.d8edab31 20.31 [18,29,14]/18 [18,29,14]/18 e65040 10001ad8952.00000000 0x40001c 82852 read
317858 osd18 20.8f69d231 20.31 [18,29,14]/18 [18,29,14]/18 e65040 10001ad8176.00000000 0x40001c 32400 read
317855 osd22 20.b3342c0f 20.f [22,18,31]/22 [22,18,31]/22 e65040 10001ad8146.00000000 0x40001c 51024 read
317863 osd23 20.cde0ce7b 20.3b [23,1,6]/23 [23,1,6]/23 e65040 10001ad856c.00000000 0x40001c 34521 read
317865 osd23 20.702d2dfe 20.3e [23,9,22]/23 [23,9,22]/23 e65040 10001ad8a5e.00000000 0x40001c 30664 read
317866 osd23 20.cb4a32fe 20.3e [23,9,22]/23 [23,9,22]/23 e65040 10001ad8575.00000000 0x40001c 29683 read
317867 osd23 20.9a008910 20.10 [23,12,6]/23 [23,12,6]/23 e65040 10001ad7d24.00000000 0x40001c 29683 read
317834 osd25 20.6efd4911 20.11 [25,4,0]/25 [25,4,0]/25 e65040 10001ad8023.00000000 0x40001c 147589 read
317856 osd26 20.febb382a 20.2a [26,0,18]/26 [26,0,18]/26 e65040 10001ad8145.00000000 0x40001c 65169 read
317845 osd27 20.5b433067 20.27 [27,7,14]/27 [27,7,14]/27 e65040 10001ad8965.00000000 0x40001c 124461 read
LINGER REQUESTS
BACKOFFS
-----Original Message-----
Subject: [ceph-users] Luminous cephfs maybe not as stable as expected?
Maybe this requires some attention. I have a default CentOS 7 setup (maybe not the most recent kernel though) running Ceph Luminous, i.e. no custom kernels.
This is the 2nd or 3rd time that a VM has gone into high load (151) and stopped its services. I have two VMs, both mounting the same two CephFS 'shares'. After the last incident I unmounted the shares on the 2nd server. (We are migrating to a new environment, so this 2nd server is not doing anything.) Last time I thought it might be related to my switch from the stupid allocator to the bitmap allocator.
Anyway, yesterday I thought: let's mount the two shares on the 2nd server again and see what happens. And this morning the high load was back. AFAIK the 2nd server is only running a cron job on the CephFS mounts, creating snapshots.
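(The cron job is nothing special; it essentially just creates a dated snapshot through the special .snap directory of each mount, roughly like the following, with the path and naming being examples:)

# create a cephfs snapshot by mkdir inside the special .snap directory
mkdir /home/mail-archive/.snap/snap-$(date +%Y%m%d)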
1) I still have increased load on the OSD nodes from CephFS. How can I see which client is causing this? I don't seem to get this from 'ceph daemon mds.c session ls'; however, 'ceph osd pool stats | grep client -B 1' indicates it is CephFS.
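The closest I get is the MDS session list; something like this shows per-client caps and hostname, but no obvious load number (field names may differ per release, and jq is just for readability):

ceph daemon mds.c session ls | jq '.[] | {id, num_caps, hostname: .client_metadata.hostname}'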
2) ceph osd blacklist ls
No blacklist entries
3) The first server keeps generating messages like these, while there is no issue with connectivity:
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd25 192.168.10.114:6804 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd18 192.168.10.112:6802 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session established
PS: dmesg -T gives me strange times; as you can see, these are in the future. The OS time is 2 minutes behind (which is the correct one, ntpd-synced).
[@ ]# uptime
10:39:17 up 50 days, 13:31, 2 users, load average: 3.60, 3.02, 2.57
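(My understanding is that dmesg -T reconstructs wall-clock times from the boot time plus the kernel's internal timestamp, which is not NTP-corrected, hence the skew. A quick way to see the offset:)

date              # wall clock (ntpd-synced)
cat /proc/uptime  # seconds since boot
dmesg | tail -1   # raw [seconds.micros] since boot, without -T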
4) Unmounting the filesystem on the first server fails.
5) Evicting the CephFS sessions of the first server does not change the CephFS load on the OSD nodes.
6) Unmounting all CephFS clients still leaves me with CephFS activity on the data pool and on the OSD nodes:
[@c03 ~]# ceph daemon mds.c session ls
[]
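(For point 5, the eviction I did was along these lines, with the session id taken from the client name in the debugfs path above:)

ceph tell mds.c client evict id=3584224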
7) On the first server:
[@~]# ps -auxf| grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 6716 3.0 0.0 0 0 ? D 10:18 0:59 \_ [kworker/0:2]
root 20039 0.0 0.0 123520 1212 pts/0 D+ 10:28 0:00 | \_ umount /home/mail-archive/
[@ ~]# cat /proc/6716/stack
[<ffffffff8385e110>] __wait_on_freeing_inode+0xb0/0xf0
[<ffffffff8385e1e9>] find_inode+0x99/0xc0
[<ffffffff8385e281>] ilookup5_nowait+0x71/0x90
[<ffffffff8385f09f>] ilookup5+0xf/0x60
[<ffffffffc060fb35>] remove_session_caps+0xf5/0x1d0 [ceph]
[<ffffffffc06158fc>] dispatch+0x39c/0xb00 [ceph]
[<ffffffffc052afb4>] try_read+0x514/0x12c0 [libceph]
[<ffffffffc052bf64>] ceph_con_workfn+0xe4/0x1530 [libceph]
[<ffffffff836b9e3f>] process_one_work+0x17f/0x440
[<ffffffff836baed6>] worker_thread+0x126/0x3c0
[<ffffffff836c1d21>] kthread+0xd1/0xe0
[<ffffffff83d75c37>] ret_from_fork_nospec_end+0x0/0x39
[<ffffffffffffffff>] 0xffffffffffffffff
[@ ~]# cat /proc/20039/stack
[<ffffffff837b5e14>] __lock_page+0x74/0x90
[<ffffffff837c744c>] truncate_inode_pages_range+0x6cc/0x700
[<ffffffff837c74ef>] truncate_inode_pages_final+0x4f/0x60
[<ffffffff8385f02c>] evict+0x16c/0x180
[<ffffffff8385f87c>] iput+0xfc/0x190
[<ffffffff8385aa18>] shrink_dcache_for_umount_subtree+0x158/0x1e0
[<ffffffff8385c3bf>] shrink_dcache_for_umount+0x2f/0x60
[<ffffffff8384426f>] generic_shutdown_super+0x1f/0x100
[<ffffffff838446b2>] kill_anon_super+0x12/0x20
[<ffffffffc05ea130>] ceph_kill_sb+0x30/0x80 [ceph]
[<ffffffff83844a6e>] deactivate_locked_super+0x4e/0x70
[<ffffffff838451f6>] deactivate_super+0x46/0x60
[<ffffffff8386373f>] cleanup_mnt+0x3f/0x80
[<ffffffff838637d2>] __cleanup_mnt+0x12/0x20
[<ffffffff836be88b>] task_work_run+0xbb/0xe0
[<ffffffff8362bc65>] do_notify_resume+0xa5/0xc0
[<ffffffff83d76134>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
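(In case it is useful, this is how I grab the kernel stacks of everything stuck in D state, instead of cat-ing each PID by hand; a small sketch:)

# dump kernel stacks of all processes in uninterruptible sleep
for p in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $p =="
    cat /proc/$p/stack
done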
What to do now? In ceph.conf I have only these entries; I am not sure whether I should still keep them:
# 100k+ files in 2 folders
mds bal fragment size max = 120000
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false
mds_cache_memory_limit = 8000000000
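(To verify what the MDS is actually running with, I check the values over the admin socket, e.g.:)

ceph daemon mds.c config get mds_cache_memory_limit
ceph daemon mds.c config get mds_session_blacklist_on_evict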
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com