[ceph-users] mds damaged with preallocated inodes that are inconsistent with inotable

2024-08-07 Thread zxcs
Hi, Experts,

We are running CephFS v16.2.* with multiple active MDS daemons. Currently we are
hitting an "fs cephfs mds.* is damaged" state, and this MDS keeps complaining:


“client  *** loaded with preallocated inodes that are inconsistent with 
inotable”


The MDS also keeps committing suicide during replay. Could anyone please help here?
We really need you to shed some light!
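
For reference, the recovery path we are looking at is the journal/table reset procedure
from the CephFS disaster-recovery documentation. The commands below are only a sketch
(filesystem name "cephfs" and rank 0 are assumptions on our side), they discard journal
and session state, and we have not run them yet — so please confirm whether this is the
right direction first; we would export a journal backup before anything else:

  # back up the journal of the damaged rank before touching anything
  cephfs-journal-tool --rank=cephfs:0 journal export /root/mds0-journal-backup.bin

  # recover what can be recovered from the journal, then reset it
  cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
  cephfs-journal-tool --rank=cephfs:0 journal reset

  # reset the session table (and, if needed, the inode table that the
  # preallocated inodes disagree with)
  cephfs-table-tool all reset session
  cephfs-table-tool all reset inode

  # then tell the monitors the rank may be retried
  ceph mds repaired cephfs:0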


Thanks a lot!


xz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
Hi, 

I have a Ceph cluster on Nautilus 14.2.10; each node has 3 SSDs and 4 HDDs, plus
two NVMes used as cache (nvme0n1 caches SSDs 0-2 and nvme1n1 caches HDDs 3-7).

On one node, however, nvme0n1 keeps hitting the issues below (nvme ... I/O ... timeout,
aborting), and then the device suddenly disappears.
After that I have to reboot the node to recover.
Has anyone hit the same issue, and how can I solve it? Any suggestions are welcome.
Thanks in advance!
I googled the issue and found the link below, but it did not help:
https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd
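
One workaround we are considering, based on reports of Samsung NVMe drives dropping off
the bus during power-state transitions (this is an assumption, not confirmed for our
hardware), is to disable the deepest APST power states via a kernel parameter and reboot:

  # append to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
  # then run update-grub and reboot
  nvme_core.default_ps_max_latency_us=0

  # verify after reboot
  cat /sys/module/nvme_core/parameters/default_ps_max_latency_us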


From syslog
Feb 19 01:31:52 ip kernel: [1275313.393211] nvme :03:00.0: I/O 949 QID 12 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389232] nvme :03:00.0: I/O 728 QID 5 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389247] nvme :03:00.0: I/O 515 QID 7 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389252] nvme :03:00.0: I/O 516 QID 7 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389257] nvme :03:00.0: I/O 517 QID 7 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389263] nvme :03:00.0: I/O 82 QID 9 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389271] nvme :03:00.0: I/O 853 QID 13 
timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389275] nvme :03:00.0: I/O 854 QID 13 
timeout, aborting
Feb 19 01:32:23 ip kernel: [1275344.401708] nvme :03:00.0: I/O 728 QID 5 
timeout, reset controller
Feb 19 01:32:52 ip kernel: [1275373.394112] nvme :03:00.0: I/O 0 QID 0 
timeout, reset controller
Feb 19 01:33:53 ip ceph-osd[3179]: 
/build/ceph-14.2.10/src/common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 2021-02-19 
01:33:53.436018
Feb 19 01:33:53 ip ceph-osd[3179]: 
/build/ceph-14.2.10/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide 
timeout")
Feb 19 01:33:53 ip ceph-osd[3179]:  ceph version 14.2.10 
(b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Feb 19 01:33:53 ip ceph-osd[3179]:  1: (ceph::__ceph_abort(char const*, int, 
char const*, std::__cxx11::basic_string, 
std::allocator > const&)+0xdf) [0x83eb8c]
Feb 19 01:33:53 ip ceph-osd[3179]:  2: 
(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, 
unsigned long)+0x4a5) [0xec56f5]
Feb 19 01:33:53 ip ceph-osd[3179]:  3: (ceph::HeartbeatMap::is_healthy()+0x106) 
[0xec6846]
Feb 19 01:33:53 ip ceph-osd[3179]:  4: (OSD::handle_osd_ping(MOSDPing*)+0x67c) 
[0x8aaf0c]
Feb 19 01:33:53 ip ceph-osd[3179]:  5: 
(OSD::heartbeat_dispatch(Message*)+0x1eb) [0x8b3f4b]
Feb 19 01:33:53 ip ceph-osd[3179]:  6: 
(DispatchQueue::fast_dispatch(boost::intrusive_ptr const&)+0x27d) 
[0x12456bd]
Feb 19 01:33:53 ip ceph-osd[3179]:  7: (ProtocolV2::handle_message()+0x9d6) 
[0x129b4e6]
Feb 19 01:33:53 ip ceph-osd[3179]:  8: 
(ProtocolV2::handle_read_frame_dispatch()+0x160) [0x12ad330]
Feb 19 01:33:53 ip ceph-osd[3179]:  9: 
(ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr&&, int)+0x178) [0x12ad598]
Feb 19 01:33:53 ip ceph-osd[3179]:  10: 
(ProtocolV2::run_continuation(Ct&)+0x34) [0x12956b4]
Feb 19 01:33:53 ip ceph-osd[3179]:  11: (AsyncConnection::process()+0x186) 
[0x126f446]
Feb 19 01:33:53 ip ceph-osd[3179]:  12: (EventCenter::process_events(unsigned 
int, std::chrono::duration 
>*)+0x7cd) [0x10b14cd]
Feb 19 01:33:53 ip ceph-osd[3179]:  13: /usr/bin/ceph-osd() [0x10b3fd8]
Feb 19 01:33:53 ip ceph-osd[3179]:  14: /usr/bin/ceph-osd() [0x162b59f]
Feb 19 01:33:53 ip ceph-osd[3179]:  15: (()+0x76ba) [0x7f36c2ed46ba]
Feb 19 01:33:53 ip ceph-osd[3179]:  16: (clone()+0x6d) [0x7f36c24db4dd]
Feb 19 01:33:53 ip ceph-osd[3179]: *** Caught signal (Aborted) **

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
Thank you very much, Konstantin!

Here is the output of `nvme smart-log /dev/nvme0n1`

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 27 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 602,417,903
data_units_written  : 24,350,864
host_read_commands  : 5,610,227,794
host_write_commands : 519,030,512
controller_busy_time: 14,356
power_cycles: 7
power_on_hours  : 4,256
unsafe_shutdowns: 5
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 27 C
Temperature Sensor 2: 41 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C
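
In addition to the smart-log above, next time the drive fails I also plan to pull the
controller error log and firmware revision (commands below; whether they show more than
the num_err_log_entries = 0 above is an open question):

  nvme error-log /dev/nvme0n1
  nvme id-ctrl /dev/nvme0n1 | grep '^fr '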


Thanks,

zx

> 在 2021年2月19日,下午6:01,Konstantin Shalygin  写道:
> 
> Please paste your `nvme smart-log /dev/nvme0n1` output
> 
> 
> 
> k
> 
>> On 19 Feb 2021, at 12:53, zxcs <zhuxion...@163.com> wrote:
>> 
>> I have one ceph cluster with nautilus 14.2.10 and one node has 3 SSD and 4 
>> HDD each. 
>> Also has two nvmes as cache.  (Means nvme0n1 cache for 0-2 SSD  and Nvme1n1 
>> cache for 3-7 HDD)
>> 
>> but there is one nodes’ nvme0n1 always hit below issues(see 
>> name..I/O…timeout, aborting), and sudden this nvme0n1 disappear . 
>> After that i need reboot this node to recover.
>> Any one hit same issue ? and how to slow it? Any suggestion are welcome. 
>> Thanks in advance!
>> I am once googled the issue, and see below link, but not see any help 
>> https://askubuntu.com/questions/981657/cannot-suspend-with-nvme-m-2-ssd
>> 
>> From syslog
>> Feb 19 01:31:52 ip kernel: [1275313.393211] nvme :03:00.0: I/O 949 QID 
>> 12 timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389232] nvme :03:00.0: I/O 728 QID 5 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389247] nvme :03:00.0: I/O 515 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389252] nvme :03:00.0: I/O 516 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389257] nvme :03:00.0: I/O 517 QID 7 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389263] nvme :03:00.0: I/O 82 QID 9 
>> timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389271] nvme :03:00.0: I/O 853 QID 
>> 13 timeout, aborting
>> Feb 19 01:31:53 ip kernel: [1275314.389275] nvme :03:00.0: I/O 854 QID 
>> 13 timeout, aborting
>> Feb 19 01:32:23 ip kernel: [1275344.401708] nvme :03:00.0: I/O 728 QID 5 
>> timeout, reset controller
>> Feb 19 01:32:52 ip kernel: [1275373.394112] nvme :03:00.0: I/O 0 QID 0 
>> timeout, reset controller
>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 
>> In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const 
>> char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 
>> 2021-02-19 01:33:53.436018
>> Feb 19 01:33:53 ip ceph-osd[3179]: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 
>> 82: ceph_abort_msg("hit suicide timeout")
>> Feb 19 01:33:53 ip ceph-osd[3179]:  ceph version 14.2.10 
>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
>> Feb 19 01:33:53 ip ceph-osd[3179]:  1: (ceph::__ceph_abort(char const*, int, 
>> char const*, std::__cxx11::basic_string, 
>> std::allocator 

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
BTW, I actually have two nodes with the same issue; the other affected node's NVMe
output is below.

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 29 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 592,340,175
data_units_written  : 26,443,352
host_read_commands  : 5,341,278,662
host_write_commands : 515,730,885
controller_busy_time: 14,052
power_cycles: 8
power_on_hours  : 4,294
unsafe_shutdowns: 6
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 29 C
Temperature Sensor 2: 46 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C


For comparison, here is the output from a healthy node's NVMe:

Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning: 0
temperature : 27 C
available_spare : 100%
available_spare_threshold   : 10%
percentage_used : 1%
data_units_read : 579,829,652
data_units_written  : 28,271,336
host_read_commands  : 5,237,750,233
host_write_commands : 518,979,861
controller_busy_time: 14,166
power_cycles: 3
power_on_hours  : 4,252
unsafe_shutdowns: 1
media_errors: 0
num_err_log_entries : 0
Warning Temperature Time: 0
Critical Composite Temperature Time : 0
Temperature Sensor 1: 27 C
Temperature Sensor 2: 39 C
Temperature Sensor 3: 0 C
Temperature Sensor 4: 0 C
Temperature Sensor 5: 0 C
Temperature Sensor 6: 0 C
Temperature Sensor 7: 0 C
Temperature Sensor 8: 0 C


Thanks,
zx


> On 19 Feb 2021, at 18:08, zxcs wrote:
> 
> Thank you very much, Konstantin!
> 
> Here is the output of `nvme smart-log /dev/nvme0n1`
> 
> Smart Log for NVME device:nvme0n1 namespace-id:
> critical_warning: 0
> temperature : 27 C
> available_spare : 100%
> available_spare_threshold   : 10%
> percentage_used : 1%
> data_units_read : 602,417,903
> data_units_written  : 24,350,864
> host_read_commands  : 5,610,227,794
> host_write_commands : 519,030,512
> controller_busy_time: 14,356
> power_cycles: 7
> power_on_hours  : 4,256
> unsafe_shutdowns: 5
> media_errors: 0
> num_err_log_entries : 0
> Warning Temperature Time: 0
> Critical Composite Temperature Time : 0
> Temperature Sensor 1: 27 C
> Temperature Sensor 2: 41 C
> Temperature Sensor 3: 0 C
> Temperature Sensor 4: 0 C
> Temperature Sensor 5: 0 C
> Temperature Sensor 6: 0 C
> Temperature Sensor 7: 0 C
> Temperature Sensor 8: 0 C
> 
> 
> Thanks,
> 
> zx
> 
>> On 19 Feb 2021, at 18:01, Konstantin Shalygin <k0...@k0ste.ru> wrote:
>> 
>> Please paste your `nvme smart-log /dev/nvme0n1` output
>> 
>> 
>> 
>> k
>> 
>>> On 19 Feb 2021, at 12:53, zxcs <zhuxion...@163.com> wrote:
>>> 
>>> I have one ceph cluster with nautilus 14.2.10 and one node has 3 SSD and 4 
>>> HDD each. 
>>> Also has two nvmes as cache.  (Means nvme0n1 cache for 0-2 SSD  and Nvme1n1 
>>> cache for 3-7 HDD)
>>> 
>>> but there is one nodes’ nvme0n1 always hit below issues(see 
>>> name..I/O…timeout, aborting), and sudden this nvme0n1 disappear . 
>>> After that i need reboot this node to recover.
>>> Any one hit same issue ? and how to slow it? Any suggestion are welcome. 
>>> Thanks in advance!
>>> I am once googled the issue, and see below link, but not see any help 
>>> https

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
You mean the OS? It is Ubuntu 16.04, and the NVMe is a Samsung 970 PRO 1TB.

Thanks,
zx

> On 19 Feb 2021, at 18:56, Konstantin Shalygin <k0...@k0ste.ru> wrote:
> 
> Looks good, what is your hardware? Server model & NVMes?
> 
> 
> 
> k
> 
>> On 19 Feb 2021, at 13:22, zxcs <zhuxion...@163.com> wrote:
>> 
>> BTW, actually i have two nodes has same issues, and another error node's 
>> nvme output as below 
>> 
>> Smart Log for NVME device:nvme0n1 namespace-id:
>> critical_warning: 0
>> temperature : 29 C
>> available_spare : 100%
>> available_spare_threshold   : 10%
>> percentage_used : 1%
>> data_units_read : 592,340,175
>> data_units_written  : 26,443,352
>> host_read_commands  : 5,341,278,662
>> host_write_commands : 515,730,885
>> controller_busy_time: 14,052
>> power_cycles: 8
>> power_on_hours  : 4,294
>> unsafe_shutdowns: 6
>> media_errors: 0
>> num_err_log_entries : 0
>> Warning Temperature Time: 0
>> Critical Composite Temperature Time : 0
>> Temperature Sensor 1: 29 C
>> Temperature Sensor 2: 46 C
>> Temperature Sensor 3: 0 C
>> Temperature Sensor 4: 0 C
>> Temperature Sensor 5: 0 C
>> Temperature Sensor 6: 0 C
>> Temperature Sensor 7: 0 C
>> Temperature Sensor 8: 0 C
>> 
>> 
>> For compare, i get one healthy node’s nvme output as below:
>> 
>> Smart Log for NVME device:nvme0n1 namespace-id:
>> critical_warning: 0
>> temperature : 27 C
>> available_spare : 100%
>> available_spare_threshold   : 10%
>> percentage_used : 1%
>> data_units_read : 579,829,652
>> data_units_written  : 28,271,336
>> host_read_commands  : 5,237,750,233
>> host_write_commands : 518,979,861
>> controller_busy_time: 14,166
>> power_cycles: 3
>> power_on_hours  : 4,252
>> unsafe_shutdowns: 1
>> media_errors: 0
>> num_err_log_entries : 0
>> Warning Temperature Time: 0
>> Critical Composite Temperature Time : 0
>> Temperature Sensor 1: 27 C
>> Temperature Sensor 2: 39 C
>> Temperature Sensor 3: 0 C
>> Temperature Sensor 4: 0 C
>> Temperature Sensor 5: 0 C
>> Temperature Sensor 6: 0 C
>> Temperature Sensor 7: 0 C
>> Temperature Sensor 8: 0 C
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
One NVMe suddenly crashed again. Could anyone please help shed some light here?
Thanks a ton!!!
Below are the syslog and ceph logs.

From  /var/log/syslog
Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
timeout, aborting
Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
timeout, aborting
Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
timeout, aborting
Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
timeout, aborting
Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
timeout, aborting
Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
timeout, aborting
Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
Feb 21 19:38:45 ip ceph-osd[3241]: 2021-02-21 19:38:45.286 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:46 ip ceph-osd[3241]: 2021-02-21 19:38:46.254 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:38:47 ip ceph-osd[3241]: 2021-02-21 19:38:47.226 7f023b58f700 -1 
osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
[create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
ondisk+write+known_if_redirected+full_force e7868)
Feb 21 19:39:04 ip kernel: [232593.860464] nvme :03:00.0: I/O 943 QID 7 
timeout, reset controller
Feb 21 19:39:33 ip kernel: [232622.868975] nvme :03:00.0: I/O 0 QID 0 
timeout, reset controller
Feb 21 19:40:35 ip ceph-osd[3241]: 2021-02-21 19:

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
Thanks for your reply!

Yes, it is an NVMe; each node has two NVMes as db/wal, one for the SSDs (0-2) and
another for the HDDs (3-6).
I have no spare to try.
It's very strange: the load was not very high at that time, and both the SSDs and the
NVMe seem healthy.

If I cannot fix it, I am afraid I will need to set up more nodes and then mark out and
remove the OSDs that use this NVMe.
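
If it does come to removing them, the draining steps we would follow are roughly these
(a sketch only; the osd ids are placeholders for the OSDs backed by this NVMe):

  # mark the affected OSDs out and let data rebalance away
  ceph osd out 16 17 18

  # once backfill has finished, confirm each one can be removed safely
  ceph osd safe-to-destroy osd.16

  # stop the daemon and purge the OSD from the cluster
  systemctl stop ceph-osd@16
  ceph osd purge 16 --yes-i-really-mean-it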

Thanks,
zx


> On 22 Feb 2021, at 10:07, Mark Lehrer wrote:
> 
>> One nvme  sudden crash again. Could anyone please help shed some light here?
> 
> It looks like a flaky NVMe drive.  Do you have a spare to try?
> 
> 
> On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
>> 
>> One nvme  sudden crash again. Could anyone please help shed some light here? 
>> Thank a ton!!!
>> Below are syslog and ceph log.
>> 
>> From  /var/log/syslog
>> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 
>> timeout, aborting
>> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 18 
>> timeout, aborting
>> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 28 
>> timeout, aborting
>> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 2 
>> timeout, aborting
>> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 18 
>> timeout, aborting
>> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:39 ip ceph-osd[3241]: 2021-02-21 19:38:39.334 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:40 ip ceph-osd[3241]: 2021-02-21 19:38:40.322 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:41 ip ceph-osd[3241]: 2021-02-21 19:38:41.326 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:41 ip kernel: [232570.852035] nvme :03:00.0: I/O 860 QID 9 
>> timeout, aborting
>> Feb 21 19:38:42 ip ceph-osd[3241]: 2021-02-21 19:38:42.298 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:43 ip ceph-osd[3241]: 2021-02-21 19:38:43.258 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:44 ip ceph-osd[3241]: 2021-02-21 19:38:44.258 7f023b58f700 -1 
>> osd.16 7868 get_health_metrics reporting 2 slow ops, oldest is 
>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>> ondisk+write+known_if_redirected+full_force e7868)
>> Feb 21 19:38:45 ip ntpd[3480]: Soliciting pool server 84.16.67.12
>> Feb 21 19:38:

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
I haven't done a fio test on any single disk, but I did run fio against the ceph
cluster. The cluster has 12 nodes, and each node has the same disks (2 NVMes for cache,
3 SSDs as OSDs and 4 HDDs also as OSDs).
Only two nodes have this problem, and these two nodes have crashed many times (at least
4 times). The others are fine, so it is strange.
This cluster has been running for more than half a year.


Thanks,
zx

> On 22 Feb 2021, at 18:37, Marc wrote:
> 
> Don't you have problems, just because the Samsung 970 PRO is not suitable for 
> this? Have you run fio tests to make sure it would work ok?
> 
> https://yourcmc.ru/wiki/Ceph_performance
> https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
> 
> 
> 
>> -Original Message-
>> Sent: 22 February 2021 03:16
>> us...@ceph.io>
>> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
>> 
>> Thanks for you reply!
>> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
>> and another for hdd(3-6).
>> I have no spare to try.
>> It’s  very strange, the load not very high at that time. and both ssd
>> and nvme seems healthy.
>> 
>> If cannot fix it.  I am afraid I need to setup more nodes and set out
>> remove these OSDs which using this Nvme?
>> 
>> Thanks,
>> zx
>> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
Thanks a lot, Marc!

I will try to run a fio test on the crashing disks when there is no traffic in our
cluster.
We are using Samsung NVMe 970 PRO as wal/db and Samsung SSD 860 PRO as the SSDs. The
NVMe disappeared after the SSDs hit timeouts; maybe we also need to throw the 970 PRO
away?
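
For the record, the single-disk test I plan to run is the usual 4k sync write test from
the Ceph performance wiki linked earlier — just a sketch, and since it writes to the raw
device it is destructive, so the OSDs using that NVMe have to be removed (or a scratch
partition used) first:

  fio --name=sync-write-test --filename=/dev/nvme0n1 --ioengine=libaio \
      --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 \
      --runtime=60 --time_based --group_reporting
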
Thanks,
zx 

> On 22 Feb 2021, at 21:25, Marc wrote:
> 
> So on the disks that crash anyway, do the fio test. If it crashes, you will 
> know it has nothing to do with ceph. If it does not crash you will probably 
> get poor fio result, which would explain the problems with ceph.
> 
> This is what someone wrote in the past. If you did not do your research on 
> drives, I think it is probably your drives.
> 
> " just throw away your crappy Samsung SSD 860 Pro "
> https://www.mail-archive.com/ceph-users@ceph.io/msg06820.html
> 
> 
> 
>> -Original Message-
>> From: zxcs 
>> Sent: 22 February 2021 13:10
>> To: Marc 
>> Cc: Mark Lehrer ; Konstantin Shalygin
>> ; ceph-users 
>> Subject: Re: [ceph-users] Ceph nvme timeout and then aborting
>> 
>> Haven’t do any fio test for single  disk , but did fio for the ceph
>> cluster, actually the cluster has 12 nodes, and each node has same
>> disks(means, 2 nvmes for cache, and 3 ssds as osd, 4 hdds also as osd).
>> Only two nodes has such problem. And these two nodes are crash many
>> times(at least 4 times). The others are good.  So it strange.
>> This cluster has run more than half years.
>> 
>> 
>> Thanks,
>> zx
>> 
>>> On 22 Feb 2021, at 18:37, Marc wrote:
>>> 
>>> Don't you have problems, just because the Samsung 970 PRO is not
>> suitable for this? Have you run fio tests to make sure it would work ok?
>>> 
>>> https://yourcmc.ru/wiki/Ceph_performance
>>> https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-
>> 0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
>>> 
>>> 
>>> 
>>>> -Original Message-
>>>> Sent: 22 February 2021 03:16
>>>> us...@ceph.io>
>>>> Subject: [ceph-users] Re: Ceph nvme timeout and then aborting
>>>> 
>>>> Thanks for you reply!
>>>> 
>>>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2)
>>>> and another for hdd(3-6).
>>>> I have no spare to try.
>>>> It’s  very strange, the load not very high at that time. and both ssd
>>>> and nvme seems healthy.
>>>> 
>>>> If cannot fix it.  I am afraid I need to setup more nodes and set out
>>>> remove these OSDs which using this Nvme?
>>>> 
>>>> Thanks,
>>>> zx
>>>> 
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
From the Ceph documentation
(https://docs.ceph.com/en/latest/start/hardware-recommendations/) I see that using a
fast device for wal/db can improve performance, so we use one 2TB or two 1TB Samsung
NVMe 970 PRO drives as wal/db here. We have two data pools, an SSD pool and an HDD pool;
the SSD pool uses Samsung 860 PRO drives, and the NVMe 970 serves as wal/db for both
pools.
I haven't run a test comparing the SSD pool's performance WITH an NVMe wal/db against
WITHOUT one; we chose this only because the documentation recommends a fast device and
NVMe is faster than a normal SSD.

I also have another question: some documents say we only need to put the DB on the fast
device and do not need to create a separate WAL (i.e. use NVMe or SSD as DB for the HDD
pool with no explicit WAL). Do you agree?
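
My understanding (please correct me if this is wrong) is that when only a DB device is
specified, BlueStore keeps the WAL on the DB device automatically, so an explicit
--block.wal is only needed when the WAL should live on yet another, faster device. A
sketch of how we would create such an OSD (device paths are just examples):

  # HDD as data, NVMe partition as DB; the WAL is co-located on the DB device
  ceph-volume lvm create --bluestore --data /dev/sdd --block.db /dev/nvme0n1p3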

We will scale out the cluster soon (to replace the two crashing nodes) and haven't made
a decision about devices yet. One option may be:
1 NVMe (Samsung 980 PRO) as DB (no WAL) for the HDD pool;
no NVMe for the SSD pool, with Samsung 883 DCT as the SSD disks.

Would you experts please help shed some light here? Thanks a ton!

Thanks,
zx

> On 23 Feb 2021, at 05:32, Mark Lehrer wrote:
> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one
>> for ssd(0-2) and another for hdd(3-6).  I have no spare to try.
>> ...
>> I/O 517 QID 7 timeout, aborting
>> Input/output error
> 
> If you are seeing errors like these, it is almost certainly a bad
> drive unless you are using fabric.
> 
> Why are you putting the wal on an SSD in the first place?  Are you
> sure it is even necessary, especially when one of your pools is
> already SSD?
> 
> Adding this complexity just means that there are more things to break
> when you least expect it. Putting the db/wal on a separate drive is
> usually premature optimization that is only useful for benchmarkers.
> My opinion of course.
> 
> Mark
> 
> 
> 
> 
> 
> 
> 
> 
> On Sun, Feb 21, 2021 at 7:16 PM zxcs  wrote:
>> 
>> Thanks for you reply!
>> 
>> Yes, it a Nvme, and on node has two Nvmes as db/wal, one for ssd(0-2) and 
>> another for hdd(3-6).
>> I have no spare to try.
>> It’s  very strange, the load not very high at that time. and both ssd and 
>> nvme seems healthy.
>> 
>> If cannot fix it.  I am afraid I need to setup more nodes and set out remove 
>> these OSDs which using this Nvme?
>> 
>> Thanks,
>> zx
>> 
>> 
>>> On 22 Feb 2021, at 10:07, Mark Lehrer wrote:
>>> 
>>>> One nvme  sudden crash again. Could anyone please help shed some light 
>>>> here?
>>> 
>>> It looks like a flaky NVMe drive.  Do you have a spare to try?
>>> 
>>> 
>>> On Mon, Feb 22, 2021 at 1:56 AM zxcs  wrote:
>>>> 
>>>> One nvme  sudden crash again. Could anyone please help shed some light 
>>>> here? Thank a ton!!!
>>>> Below are syslog and ceph log.
>>>> 
>>>> From  /var/log/syslog
>>>> Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 
>>>> 7 timeout, aborting
>>>> Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:00.0: I/O 911 QID 
>>>> 18 timeout, aborting
>>>> Feb 21 19:38:34 ip kernel: [232563.847964] nvme :03:00.0: I/O 776 QID 
>>>> 28 timeout, aborting
>>>> Feb 21 19:38:36 ip ceph-osd[3241]: 2021-02-21 19:38:36.218 7f023b58f700 -1 
>>>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>>>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>>>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>>>> ondisk+write+known_if_redirected+full_force e7868)
>>>> Feb 21 19:38:36 ip kernel: [232565.851961] nvme :03:00.0: I/O 442 QID 
>>>> 2 timeout, aborting
>>>> Feb 21 19:38:36 ip kernel: [232565.851982] nvme :03:00.0: I/O 912 QID 
>>>> 18 timeout, aborting
>>>> Feb 21 19:38:37 ip ceph-osd[3241]: 2021-02-21 19:38:37.254 7f023b58f700 -1 
>>>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>>>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>>>> [create,setxattr parent (357),setxattr layout (30)] snapc 0=[] 
>>>> ondisk+write+known_if_redirected+full_force e7868)
>>>> Feb 21 19:38:38 ip ceph-osd[3241]: 2021-02-21 19:38:38.286 7f023b58f700 -1 
>>>> osd.16 7868 get_health_metrics reporting 1 slow ops, oldest is 
>>>> osd_op(mds.1.51327:1034954064 2.80 2:018429c8:::2002458b0ca.:head 
>>>> [create,setxattr parent (35

[ceph-users] how to disable ceph version check?

2023-11-07 Thread zxcs
Hi, Experts,

We have a Ceph cluster reporting HEALTH_ERR because daemons are running multiple old
versions:

health: HEALTH_ERR
There are daemons running multiple old versions of ceph

After running `ceph versions`, we see three different 16.2.* versions; the affected
daemons are OSDs.

Our question is: how can we stop this version check? We cannot upgrade all of the old
daemons.
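
What we are considering (the health-check and option names are taken from the Pacific
docs as we understand them — please correct us if these are not the right knobs) is
either muting this specific health check or raising the age threshold that triggers it:

  # mute just this health check (a TTL such as 30d can be appended)
  ceph health mute DAEMON_OLD_VERSION

  # or raise the grace period before mixed versions are reported (seconds)
  ceph config set mon mon_warn_older_version_max_age 31536000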



Thanks,
Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mds hit find_exports balancer runs too long

2023-11-09 Thread zxcs
Hi, Experts,

We have a CephFS cluster running 16.2.* with multiple active MDS daemons enabled, and we
found the MDS complaining with the message below:

mds.*.bal find_exports balancer runs too long


We have already set the config below:

 mds_bal_interval = 30
 mds_bal_sample_interval = 12

We can also see slow MDS requests in `ceph -s`.

Our question is: why does the MDS complain about this, and how can we prevent the
problem from happening again?
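
The mitigations we have seen suggested (both are assumptions on our side, not yet
verified in our production cluster) are to disable the automatic balancer entirely and
to pin busy directory trees to fixed ranks, so find_exports has nothing left to migrate:

  # stop the automatic balancer from running
  ceph config set mds mds_bal_interval 0

  # pin a heavy top-level directory to a specific rank (repeat per directory/rank)
  setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/path/to/busy_dir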


Thanks a ton,


xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] really need help how to save old client out of hang?

2023-11-16 Thread zxcs
Hi, Experts,

We have a CephFS cluster on 16.2.* running with multiple active MDS daemons, and some
old machines run Ubuntu 16.04, so those clients are mounted with ceph-fuse.

After a full restart of the MDS processes, none of these old Ubuntu 16.04 clients can
reconnect to Ceph; `ls -lrth` or `df -hT` hangs on the client nodes.

In the MDS log we can see `evicting unresponsive client *** after waiting ** seconds
during restart`.

Checking a client node, there is no ceph-fuse process left (`ps -ef | grep ceph` or
`ps -ef | grep fuse`).

But there is a remount process (`ps -ef | grep mount`) on the client node:

root 176516  1 0  01:38 ?  00:00:00 mount -i -o remount  /data

We cannot kill this process with `sudo kill -9 176516`.

We really need the experts' help to get this client out of the hang. We cannot reboot
the client node because it runs other critical services.
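
The least disruptive recovery we can think of (only a sketch; whether a lazy unmount
really releases the dead mount on Ubuntu 16.04 without a reboot is an assumption we have
not verified) is to force-detach the stale FUSE mountpoint and mount it again:

  # detach the stale ceph-fuse mountpoint even though processes still reference it
  sudo umount -l /data        # or: sudo fusermount -uz /data

  # then mount it again with a fresh ceph-fuse process
  sudo ceph-fuse -n client.<name> /data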



Thanks a ton.

zx 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] mds slow request with “failed to authpin, subtree is being exported"

2023-11-22 Thread zxcs
Hi, Experts,

We are using CephFS 16.2.* with multiple active MDS daemons, and recently we have two
nodes mounted with ceph-fuse because of their old OS.

One node runs a Python script calling `glob.glob(path)`, while another client runs `cp`
on the same path.

Then we see `mds slow request` messages, and the logs complain "failed to authpin,
subtree is being exported".

We then need to restart the MDS.


Our question is: is there a deadlock somewhere? How can we avoid this, and how can we
fix it without restarting the MDS (which affects other users)?


Thanks a ton!


xz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"

2023-11-22 Thread zxcs
Thanks a ton, Xiubo!

It does not disappear,

even after we unmount the Ceph directory on these two old OS nodes.

After dumping the ops in flight, we can see some requests, and the earliest one
complains "failed to authpin, subtree is being exported".

How can we avoid this? Would you please help shed some light here?

Thanks,
xz


> On 22 Nov 2023, at 19:44, Xiubo Li wrote:
> 
> 
> On 11/22/23 16:02, zxcs wrote:
>> HI, Experts,
>> 
>> we are using cephfs with  16.2.* with multi active mds, and recently, we 
>> have two nodes mount with ceph-fuse due to the old os system.
>> 
>> and  one nodes run a python script with `glob.glob(path)`, and another 
>> client doing `cp` operation on the same path.
>> 
>> then we see some log about `mds slow request`, and logs complain “failed to 
>> authpin, subtree is being exported"
>> 
>> then need to restart mds,
>> 
>> 
>> our question is, does there any dead lock?  how can we avoid this and how to 
>> fix it without restart mds(it will influence other users) ?
> 
> BTW, won't the slow requests disappear themself later ?
> 
> It looks like the exporting is slow or there too many exports are going on.
> 
> Thanks
> 
> - Xiubo
> 
>> 
>> Thanks a ton!
>> 
>> 
>> xz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"

2023-11-26 Thread zxcs
Currently we use `ceph config set mds mds_bal_interval 3600` to set a fixed interval
(1 hour).

We also have a question about how to disable balancing for multiple active MDS daemons.

That is, we want to keep multiple active MDS daemons (to improve throughput) but with no
balancing between them.

And if we set mds_bal_interval to a very large number, does that avoid this issue?



Thanks,
xz

> On 27 Nov 2023, at 10:56, Ben wrote:
> 
> with the same mds configuration, we see exactly the same(problem, log and
> solution) with 17.2.5, constantly happening again and again in couples days
> intervals. MDS servers are stuck somewhere, ceph status reports no issue
> however. We need to restart some of the mds (if not all of them) to restore
> them back. Hopefully this could be fixed soon or get docs updated with
> warning for the balancer's usage in production environment.
> 
> thanks and regards
> 
> On Thu, 23 Nov 2023 at 15:47, Xiubo Li wrote:
> 
>> 
>> On 11/23/23 11:25, zxcs wrote:
>>> Thanks a ton, Xiubo!
>>> 
>>> it not disappear.
>>> 
>>> even we umount the ceph directory on these two old os node.
>>> 
>>> after dump ops flight , we can see some request, and the earliest
>> complain “failed to authpin, subtree is being exported"
>>> 
>>> And how to avoid this, would you please help to shed some light here?
>> 
>> Okay, as Frank mentioned you can try to disable the balancer by pining
>> the directories. As I remembered the balancer is buggy.
>> 
>> And also you can raise one ceph tracker and provide the debug logs if
>> you have.
>> 
>> Thanks
>> 
>> - Xiubo
>> 
>> 
>>> Thanks,
>>> xz
>>> 
>>> 
>>>> On 22 Nov 2023, at 19:44, Xiubo Li wrote:
>>>> 
>>>> 
>>>> On 11/22/23 16:02, zxcs wrote:
>>>>> HI, Experts,
>>>>> 
>>>>> we are using cephfs with  16.2.* with multi active mds, and recently,
>> we have two nodes mount with ceph-fuse due to the old os system.
>>>>> 
>>>>> and  one nodes run a python script with `glob.glob(path)`, and another
>> client doing `cp` operation on the same path.
>>>>> 
>>>>> then we see some log about `mds slow request`, and logs complain
>> “failed to authpin, subtree is being exported"
>>>>> 
>>>>> then need to restart mds,
>>>>> 
>>>>> 
>>>>> our question is, does there any dead lock?  how can we avoid this and
>> how to fix it without restart mds(it will influence other users) ?
>>>> BTW, won't the slow requests disappear themself later ?
>>>> 
>>>> It looks like the exporting is slow or there too many exports are going
>> on.
>>>> 
>>>> Thanks
>>>> 
>>>> - Xiubo
>>>> 
>>>>> Thanks a ton!
>>>>> 
>>>>> 
>>>>> xz
>>>>> ___
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"

2023-12-04 Thread zxcs
Thanks a lot, Xiubo!

We have already set 'mds_bal_interval' to 0, and the slow MDS requests seem to have
decreased.

But somehow we still see the MDS complaining about slow requests, and in the MDS log we
can see:

"slow request *** seconds old, received at 2023-12-04T…: internal op exportdir:mds.*
currently acquired locks"

So our question is: why do we still see "internal op exportdir"? Does any other config
also need to be set to 0? Could you please shed some light on which config we need to
set?
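
To narrow it down, what we plan to check next (a sketch; the admin-socket command name
is from memory, so please correct it if wrong) is which subtrees each rank still
considers migratable and whether they carry an export pin, and then pin the ones that
keep moving:

  # on the MDS host: list the subtrees this rank knows about,
  # with their auth rank and export pin
  ceph daemon mds.<name> get subtrees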


Thanks,
xz 

> On 27 Nov 2023, at 13:19, Xiubo Li wrote:
> 
> 
> On 11/27/23 13:12, zxcs wrote:
>> current, we using `ceph config set mds mds_bal_interval 3600` to set a fixed 
>> time(1 hour).
>> 
>> we also have a question about how to set no balance for multi active mds.
>> 
>> means, we will enable multi active mds(to improve throughput) and no balance 
>> for these mds.
>> 
>> and if we set mds_bal_interval as big number seems can void this issue?
>> 
> You can just set 'mds_bal_interval' to 0.
> 
> 
>> 
>> 
>> Thanks,
>> xz
>> 
>>> On 27 Nov 2023, at 10:56, Ben wrote:
>>> 
>>> with the same mds configuration, we see exactly the same(problem, log and
>>> solution) with 17.2.5, constantly happening again and again in couples days
>>> intervals. MDS servers are stuck somewhere, ceph status reports no issue
>>> however. We need to restart some of the mds (if not all of them) to restore
>>> them back. Hopefully this could be fixed soon or get docs updated with
>>> warning for the balancer's usage in production environment.
>>> 
>>> thanks and regards
>>> 
>>> On Thu, 23 Nov 2023 at 15:47, Xiubo Li wrote:
>>> 
>>>> 
>>>> On 11/23/23 11:25, zxcs wrote:
>>>>> Thanks a ton, Xiubo!
>>>>> 
>>>>> it not disappear.
>>>>> 
>>>>> even we umount the ceph directory on these two old os node.
>>>>> 
>>>>> after dump ops flight , we can see some request, and the earliest
>>>> complain “failed to authpin, subtree is being exported"
>>>>> 
>>>>> And how to avoid this, would you please help to shed some light here?
>>>> 
>>>> Okay, as Frank mentioned you can try to disable the balancer by pining
>>>> the directories. As I remembered the balancer is buggy.
>>>> 
>>>> And also you can raise one ceph tracker and provide the debug logs if
>>>> you have.
>>>> 
>>>> Thanks
>>>> 
>>>> - Xiubo
>>>> 
>>>> 
>>>>> Thanks,
>>>>> xz
>>>>> 
>>>>> 
>>>>>> On 22 Nov 2023, at 19:44, Xiubo Li wrote:
>>>>>> 
>>>>>> 
>>>>>> On 11/22/23 16:02, zxcs wrote:
>>>>>>> HI, Experts,
>>>>>>> 
>>>>>>> we are using cephfs with  16.2.* with multi active mds, and recently,
>>>> we have two nodes mount with ceph-fuse due to the old os system.
>>>>>>> 
>>>>>>> and  one nodes run a python script with `glob.glob(path)`, and another
>>>> client doing `cp` operation on the same path.
>>>>>>> 
>>>>>>> then we see some log about `mds slow request`, and logs complain
>>>> “failed to authpin, subtree is being exported"
>>>>>>> 
>>>>>>> then need to restart mds,
>>>>>>> 
>>>>>>> 
>>>>>>> our question is, does there any dead lock?  how can we avoid this and
>>>> how to fix it without restart mds(it will influence other users) ?
>>>>>> BTW, won't the slow requests disappear themself later ?
>>>>>> 
>>>>>> It looks like the exporting is slow or there too many exports are going
>>>> on.
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> - Xiubo
>>>>>> 
>>>>>>> Thanks a ton!
>>>>>>> 
>>>>>>> 
>>>>>>> xz
>>>>>>> ___
>>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>> ___
>>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephfs too many repaired copies on osds

2023-12-12 Thread zxcs
Hi, Experts,

We are using CephFS 16.2.* with multiple active MDS daemons, and recently we see an OSD
reporting:

“full object read crc *** != expected ox on :head”
“missing primary copy of ***: will try to read copies on **”

From `ceph -s` we can see:

OSD_TOO_MANY_REPAIRS: Too many repaired reads on ** OSDs.


We don't know how to fix this; could you please help shed some light here? Thanks a ton!
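
The checks we are running so far (a sketch; pool, pg and device names are placeholders)
are to find the inconsistent PGs, ask Ceph to repair them, and look at the physical
drive behind the OSD, since repeated read repairs usually point at failing media:

  ceph health detail                         # which OSDs have too many repaired reads
  rados list-inconsistent-pg <pool>          # PGs with scrub-detected inconsistencies
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg repair <pgid>

  # check the drive behind the affected OSD
  smartctl -a /dev/sdX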


Thanks
xz 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs too many repaired copies on osds

2023-12-12 Thread zxcs
Also, the OSD frequently reports these ERROR logs, which leads to slow requests on that
OSD. How can we stop these errors?

> “full object read crc *** != expected ox on :head”
> “missing primary copy of ***: will try to read copies on **”



Thanks
xz

> On 13 Dec 2023, at 01:20, zxcs wrote:
> 
> Hi, Experts,
> 
> we are using cephfs with  16.2.* with multi active mds, and recently we see 
> an osd report
> 
> “full object read crc *** != expected ox on :head”
> “missing primary copy of ***: will try to read copies on **”
> 
> from `ceph -s`, could see 
> 
> OSD_TOO_MANY_REPAIRS: Too many repaired reads on ** OSDs.
> 
> 
> we don’t know how to fix this , could you are please help shed some light 
> here?  Thanks a ton!
> 
> 
> Thanks
> xz 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs read hang after cluster stuck, but need attach the process to continue

2023-12-13 Thread zxcs
Hi, experts,

We are using CephFS 16.2.* with multiple active MDS daemons, and recently we saw a
strange thing.

We have some C++ code that reads files from CephFS; the client code just calls a plain
read().

When the cluster hit MDS slow requests and later returned to normal, the read hung.

Then we had to attach gdb to the process (`gdb -p <pid of the C++ process>`); just
attaching, without doing anything else, made the code continue running.

Our question is why this happens. Is there a config we could tune, or do we need to
change our C++ read code?
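
Next time it happens, what we plan to capture before attaching gdb (a sketch; the
debugfs paths assume a kernel CephFS mount and a mounted debugfs) is where the reader is
blocked and whether the kernel client still has requests outstanding to the MDS or OSDs:

  # where in the kernel is the reading process blocked?
  cat /proc/<pid>/stack

  # outstanding MDS and OSD requests of the kernel client
  cat /sys/kernel/debug/ceph/*/mdsc
  cat /sys/kernel/debug/ceph/*/osdc

Our current guess (unverified) is that attaching the debugger interrupts the blocked
call, which is then restarted and succeeds — but we would like to understand the real
cause.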


Thanks
xz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] binary file cannot execute in cephfs directory

2022-08-22 Thread zxcs
Hi, experts, 


We are using CephFS 15.2.13. After mounting Ceph on one node we copied a binary into the
Ceph directory, see below (cmake-3.22 is a binary).

But when I run `./cmake-3.22` it reports permission denied. Why? The file has the "x"
permission bit and "ld" is the file's owner.

Could anyone please help explain what is going on here? Thanks a ton!!!



Thanks

Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: binary file cannot execute in cephfs directory

2022-08-22 Thread zxcs
In case someone cannot see the picture, here is the text:


ld@***ceph dir**$ ls -lrth
total 13M
-rwxr-xr-x 1 ld ld 13M Nov 29  2021 cmake-3.22
lrwxrwxrwx 1 ld ld  10 Jul 26 10:03 cmake -> cmake-3.22
-rwxrwxr-x 1 ld ld  25 Aug 19 15:52 test.sh


ld@***ceph dir**$./cmake-3.22
bash: ./cmake-3.22: Permission denied


> On 23 Aug 2022, at 08:57, zxcs wrote:
> 
> Hi, experts, 
> 
> 
> We are using cephfs 15.2.13, and after mount ceph on one node, copy a binary 
> into the ceph dir, see below (cmake-3.22 is a binary), 
> 
> but when i using `./cmake-3.22` it report permission denied, why? this file 
> has “x” permission, and “ld" is the binary file owner. 
> 
> could anyone please help to tell the story here? Thanks a ton!!!  
> 
> 
> 
> Thanks
> 
> Xiong
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: binary file cannot execute in cephfs directory

2022-08-23 Thread zxcs
Oh yes, there is a "noexec" option in the mount command. Thanks a ton!
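
For the archives, the fix on our side is simply to remount with exec allowed (the mount
point below is an example; for a permanent fix the noexec option also has to be dropped
from fstab or wherever the mount is defined):

  # re-enable execution on the existing mount without unmounting
  sudo mount -o remount,exec /path/to/cephfs-mountpoint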

Thanks,
Xiong

> On 23 Aug 2022, at 22:01, Daniel Gryniewicz wrote:
> 
> Does the mount have the "noexec" option on it?
> 
> Daniel
> 
> On 8/22/22 21:02, zxcs wrote:
>> In case someone missing the picture. Just copy the text as below:
>> 1d@***ceph dir**$ 1s -lrth
>> total 13M
>> -rwxr-xr-x 1 ld ld 13M Nov 29 2021 cmake-3.22
>> 1rwxrwxrwx 1 ld ld 10 Jul 26 10:03 cmake > cmake-3.22
>> -rwxrwxr-x 1 ld ld 25 Aug 19 15:52 test.sh
>> ld@***ceph dir**$./cmake-3.22
>> bash: ./cmake-3.22: Permission denied
>>> On 23 Aug 2022, at 08:57, zxcs wrote:
>>> 
>>> Hi, experts,
>>> 
>>> 
>>> We are using cephfs 15.2.13, and after mount ceph on one node, copy a 
>>> binary into the ceph dir, see below (cmake-3.22 is a binary),
>>> 
>>> but when i using `./cmake-3.22` it report permission denied, why? this file 
>>> has “x” permission, and “ld" is the binary file owner.
>>> 
>>> could anyone please help to tell the story here? Thanks a ton!!!
>>> 
>>> 
>>> 
>>> Thanks
>>> 
>>> Xiong
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to fix slow request without remote or restart mds

2022-08-26 Thread zxcs
Hi, experts

We have a CephFS cluster on 15.2.* with kernel mounts. Today the cluster health reported
MDS slow requests as shown below; I checked the MDS log, and it has been reporting slow
requests for a long time.

mds report: 
1 MDSs report slow requests

mds log:
log_channel(cluster) log [WRN] : slow request 34616.878139 seconds old, 
received at 2022-08-26T08:49:16.400430+0800: 
client_request(client.100807545:2601765 getattr

I know restarting the MDS may fix this, but it seems only one directory hangs, call it A
(i.e. `ls -lrth /ceph/path/A` gets stuck), while listing other directories is fine.

My question is how we can fix this without remounting Ceph on that node or restarting
the MDS (which would impact other users).
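
The least disruptive approach we can think of (a sketch; the MDS name and client id are
placeholders, and evicting a session does of course disrupt that one client) is to find
which client the stuck request belongs to and evict only that session instead of
restarting the whole MDS:

  # on the MDS host: see the stuck requests and the client they belong to
  ceph daemon mds.<name> dump_ops_in_flight

  # list client sessions, then evict only the offending one
  ceph tell mds.<name> client ls
  ceph tell mds.<name> client evict id=<client_id>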


Thanks in advance!

Thanks,
Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to fix slow request without remote or restart mds

2022-08-30 Thread zxcs
Thanks a ton!

Yes, restarting the MDS fixed this. But I can't confirm that we hit bug 50840; it seems
we hit it when reading huge numbers of small files (more than 10,000 small files in one
directory).



Thanks
Xiong

> On 26 Aug 2022, at 19:13, Stefan Kooman wrote:
> 
> On 8/26/22 12:33, zxcs wrote:
>> Hi, experts
>> we have a cephfs cluster with 15.2.* version and kernel mount, today there 
>> is a health report mds slow request as below, i checked this mds log, seems 
>> it report some slow request for a long time.
>> mds report:
>> 1 MDSs report slow requests
>> mds log:
>> log_channel(cluster) log [WRN] : slow request 34616.878139 seconds old, 
>> received at 2022-08-26T08:49:16.400430+0800: 
>> client_request(client.100807545:2601765 getattr
>> i know we can restart mds to fix this(may be), but seems there only one 
>> directory hang, called A, (means when i ls -lrth /ceph/path/A, it stuck), 
>> and list other directory no issue.
> Might be this bug: https://tracker.ceph.com/issues/50840
> We hit this bug. A restart of the MDS is necessary. What version of Octopus 
> do you run? This is fixed in Octopus 15.2.17 [1,2].
> 
> So if you hit this bug (which might be difficult to tell) you can update the 
> MDS and restart the MDS with the new version.
> 
>> my question is how can we fix this without remount ceph on this node or 
>> restart mds (this will impact other uses).
> 
> Gr. Stefan
> 
> [1]: https://docs.ceph.com/en/latest/releases/octopus/#changelog
> [2]: https://tracker.ceph.com/issues/51202
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to fix mds stuck at dispatched without restart mds

2022-08-30 Thread zxcs


Hi, experts

We have a CephFS (15.2.13) cluster with kernel mounts. When we read from 2000+ processes
against one Ceph path (call it /path/to/A/), all of the processes hang and
`ls -lrth /path/to/A/` gets stuck, while listing other directories (e.g. /path/to/B/)
stays healthy.

`ceph health detail` keeps reporting that the MDS has slow requests, and then we need to
restart the MDS to fix the issue.

How can we fix this without restarting the MDS (restarting it always impacts other
users)?

Any suggestions are welcome! Thanks a ton!

From dump_ops_in_flight:

"description": "client_request(client.100807215:2856632 getattr AsLsXsFs #0x200978a3326 2022-08-31T09:36:30.444927+0800 caller_id=2049, caller_gid=2049})",
"initiated_at": "2022-08-31T09:36:30.454570+0800",
"age": 17697.012491966001,
"duration": 17697.012805568,
"type_data": {
    "flag_point": "dispatched",
    "reqid": "client.100807215:2856632",
    "op_type": "client_request",
    "client_info": {
        "client": "client.100807215",
        "tid": 2856632
    },
    "events": [
        { "time": "2022-08-31T09:36:30.454570+0800", "event": "initiated" },
        { "time": "2022-08-31T09:36:30.454572+0800", "event": "throttled" },
        { "time": "2022-08-31T09:36:30.454570+0800", "event": "header_read" },
        { "time": "2022-08-31T09:36:30.454580+0800", "event": "all_read" },
        { "time": "2022-08-31T09:36:30.454604+0800", "event": "dispatched" }
    ]
}



Thanks,
Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to fix mds stuck at dispatched without restart mds

2022-09-01 Thread zxcs
Thanks a lot, Xiubo!!!


This time we still restarted the MDS to fix it, because users urgently needed to list
/path/to/A/. I will try to capture MDS debug logs if we hit it again.

Also, we haven't tried flushing the MDS journal before; are there any side effects? This
CephFS cluster is a production environment, so we need to be very careful with anything
we do.
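
For reference, the invocation we understand "flush the MDS journal" to mean (an
assumption on our part; run against the admin socket on the host of that MDS) is:

  ceph daemon mds.<name> flush journal

Please correct us if that is not the right command.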

I also read the bug details at https://tracker.ceph.com/issues/50840; the previous mail
mentions it is fixed in 15.2.17. We use ceph-deploy to deploy (and upgrade) CephFS, and
it seems the latest version it offers is 15.2.16? Will report back whether the upgrade
fixes it.

We will try the flush-MDS-journal option when we hit this bug next time (if no user
urgently needs to list the directory). It seems we can reproduce it 100% these days.
Thanks all!




Thanks,
zx



> On 31 Aug 2022, at 15:23, Xiubo Li wrote:
> 
> 
> On 8/31/22 2:43 PM, zxcs wrote:
>> Hi, experts
>> 
>> we have a cephfs(15.2.13) cluster with kernel mount, and when we read from 
>> 2000+ processes to one ceph path(called /path/to/A/), then all of the 
>> process hung, and ls -lrth /path/to/A/ always stuck, but list other 
>> directory are health( /path/to/B/),
>> 
>> health detail always report mds has slow request.  And then we need to 
>> restart the mds fix this issue.
>> 
>> How can we fix this without restart mds(restart mds always impact other 
>> users)?
>> 
>> Any suggestions are welcome! Thanks a ton!
>> 
>> from the dump_ops_in_flight output:
>> 
>> {
>>     "description": "client_request(client.100807215:2856632 getattr AsLsXsFs #0x200978a3326 2022-08-31T09:36:30.444927+0800 caller_id=2049, caller_gid=2049})",
>>     "initiated_at": "2022-08-31T09:36:30.454570+0800",
>>     "age": 17697.012491966001,
>>     "duration": 17697.012805568,
>>     "type_data": {
>>         "flag_point": "dispatched",
>>         "reqid": "client.100807215:2856632",
>>         "op_type": "client_request",
>>         "client_info": {
>>             "client": "client.100807215",
>>             "tid": 2856632
>>         },
>>         "events": [
>>             {
>>                 "time": "2022-08-31T09:36:30.454570+0800",
>>                 "event": "initiated"
>>             },
>>             {
>>                 "time": "2022-08-31T09:36:30.454572+0800",
>>                 "event": "throttled"
>>             },
>>             {
>>                 "time": "2022-08-31T09:36:30.454570+0800",
>>                 "event": "header_read"
>>             },
>>             {
>>                 "time": "2022-08-31T09:36:30.454580+0800",
>>                 "event": "all_read"
>>             },
>>             {
>>                 "time": "2022-08-31T09:36:30.454604+0800",
>>                 "event": "dispatched"
>>             }
>>         ]
>>     }
>> }
>> 
> AFAIK there is no easy way to do this. At least we need to know why it gets 
> stuck and where. From the output above and the previous mail thread, it seems 
> to be stuck in a getattr request and sounds like a similar issue to 
> https://tracker.ceph.com/issues/50840 <https://tracker.ceph.com/issues/50840>.
> 
> If it's not, it could be a new bug; could you create a tracker and 
> provide the MDS-side debug logs?
> 
> Maybe you can try to flush the mds journal to see what will happen ?
> 
> - Xiubo
> 
> 
>> 
>> Thanks,
>> Xiong
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> <mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to speed up hundreds of millions of small files read based on cephfs?

2022-09-01 Thread zxcs
Hi, experts,

We are using CephFS (15.2.*) with kernel mounts in our production environment. 
These days, when we do massive reads from the cluster (multiple processes), ceph 
health always reports slow ops for some OSDs (built on 8TB HDDs that use SSD as 
their DB cache). 

Our cluster has more read than write requests. 

The health log looks like below:
100 slow ops, oldest one blocked for 114 sec, [osd.* ...] has slow ops (SLOW_OPS)
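
(When this shows up, a rough way to see what a flagged OSD is stuck on is its admin socket, run on the host that carries the OSD; the osd id below is only an example:)

ceph health detail
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops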

 
My question is: are there any best practices for handling hundreds of millions of 
small files (100kb-300kb each file, 1+ files in each directory, and more than 
5000 directories)?
 
Is there any config we can tune, or any patch we can apply, to try to speed up reads 
(more important than writes)? And is there any other file system we could try? (We are 
also not sure CephFS is the best choice for storing such a huge number of small files.)

Please shed some light here, experts! We really need your help!

Any suggestions are welcome! Thanks in advance!~

Thanks,
zx

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to sync data on two site CephFS

2023-02-16 Thread zxcs
Hi, Experts,

We already have a CephFS cluster, called A, and now we want to set up another 
CephFS cluster (called B) at another site.
We need the two to synchronize data for some directories (if all directories can 
be synchronized, even better). That means when we write a file in cluster A, the file 
should automatically sync to cluster B, and when we create a file or directory in 
cluster B, it should automatically sync to cluster A.

Our question is: are there any best practices to do that with CephFS?

Thanks in advance!


Thanks,
zx
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to set load balance on multi active mds?

2023-08-09 Thread zxcs
Hi, experts,

We have a production environment built with ceph version 16.2.11 pacific, using 
CephFS. 
We also enabled multiple active MDS (more than 10), but we usually see the client 
request load unbalanced across these MDS daemons. 
See the picture below: the top MDS has 32.2k client requests, while the last one 
has only 331. 

This always puts our cluster in a very bad situation, e.g. many MDSs report slow 
requests…  
...
  7 MDSs report slow requests
  1 MDSs behind on trimming
…


So our question is: how can we balance the load across those MDSs? Could anyone 
please help shed some light here?
Thanks a ton!


Thanks,
xz

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to set load balance on multi active mds?

2023-08-09 Thread zxcs
Thanks a lot, Eugen!

We are using dynamic subtree pinning. We have another cluster using manual 
pinning, but we have many directories and would need to pin each one per 
request, so in our new cluster we want to try dynamic subtree pinning; we 
don't want a human to kick in every time, because sometimes directory A is hot 
and sometimes directory B is hot, and each directory has many subdirectories 
and sub-subdirectories...

But we found the load is not balanced across all MDSs when using dynamic subtree 
pinning. So we want to know if there is any config we can tune for dynamic subtree 
pinning. Thanks again! 
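
(A rough sketch of the two pinning controls being compared here; the paths and rank are placeholders, not our real layout:)

# manual pinning: pin this directory tree to MDS rank 2
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/path/to/dir

# ephemeral distributed pinning: spread the immediate children of a directory across ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/path/to/parent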

Thanks,
xz

> On 9 Aug 2023, at 17:40, Eugen Block wrote:
> 
> Hi,
> 
> you could benefit from directory pinning [1] or dynamic subtree pinning [2]. 
> We had great results with manual pinning in an older Nautilus cluster, didn't 
> have a chance to test the dynamic subtree pinning yet though. It's difficult 
> to tell in advance which option would suit best your use-case, so you'll 
> probably have to try.
> 
> Regards,
> Eugen
> 
> [1] 
> https://docs.ceph.com/en/reef/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
> [2] 
> https://docs.ceph.com/en/reef/cephfs/multimds/#dynamic-subtree-partitioning-with-balancer-on-specific-ranks
> 
> Quoting zxcs mailto:zhuxion...@163.com>>:
> 
>> Hi, experts,
>> 
>> We have a production environment built with ceph version 16.2.11 pacific, using 
>> CephFS.
>> We also enabled multiple active MDS (more than 10), but we usually see the client 
>> request load unbalanced across these MDS daemons.
>> See the picture below: the top MDS has 32.2k client requests, while the last one 
>> has only 331.
>> 
>> This always puts our cluster in a very bad situation, e.g. many MDSs report 
>> slow requests…
>>  ...
>>  7 MDSs report slow requests
>>  1 MDSs behind on trimming
>>  …
>> 
>> 
>> So our question is: how can we balance the load across those MDSs? Could anyone 
>> please help shed some light here?
>> Thanks a ton!
>> 
>> 
>> Thanks,
>> xz
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
>> To unsubscribe send an email to ceph-users-le...@ceph.io 
>> <mailto:ceph-users-le...@ceph.io>
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io <mailto:ceph-users@ceph.io>
> To unsubscribe send an email to ceph-users-le...@ceph.io 
> <mailto:ceph-users-le...@ceph.io>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to list ceph file size on ubuntu 20.04

2021-11-16 Thread zxcs
Hi, 

I want to list CephFS directory sizes on Ubuntu 20.04, but when I use ls -alh 
[directory], it shows the number of files and directories under the directory 
(it only counts entries, not size). I remember that when I used ls -alh 
[directory] on Ubuntu 16.04, it showed the size of the directory (including 
its subdirectories). I know du -sh can list the directory size, but it is very 
slow since our Ceph directories have tons of small files.

Would anyone please help to shed some light here? Thanks a ton!


Thanks,
Xiong 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: how to list ceph file size on ubuntu 20.04

2021-11-17 Thread zxcs
Thanks a ton!!! Very helpful!

Thanks,
Xiong

> On 17 Nov 2021, at 11:16, 胡 玮文 wrote:
> 
> There is an rbytes mount option [1]. Besides, you can use "getfattr -n 
> ceph.dir.rbytes /path/in/cephfs".
> 
> [1]: https://docs.ceph.com/en/latest/man/8/mount.ceph/#advanced
> 
> Weiwen Hu
> 
>> On 17 Nov 2021, at 10:26, zxcs wrote:
>> 
>> Hi,
>> 
>> I want to list CephFS directory sizes on Ubuntu 20.04, but when I use ls -alh 
>> [directory], it shows the number of files and directories under the directory 
>> (it only counts entries, not size). I remember that when I used ls -alh 
>> [directory] on Ubuntu 16.04, it showed the size of the directory (including 
>> its subdirectories). I know du -sh can list the directory size, but it is very 
>> slow since our Ceph directories have tons of small files.
>> 
>> Would anyone please help to shed some light here? Thanks a ton!
>> 
>> Thanks,
>> Xiong
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
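
(Concretely, the two approaches mentioned in the reply above look roughly like this; the monitor address, auth options, and paths are placeholders:)

# mount CephFS with recursive directory sizes, so ls/stat on a directory reports rbytes
mount -t ceph <mon-host>:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rbytes

# or query the recursive size of one directory via the virtual xattr
getfattr -n ceph.dir.rbytes /mnt/cephfs/path/to/dir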
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to use Alluxio with CephFS as backend storage?

2021-11-25 Thread zxcs
Hi,

I want to use Alluxio to speed up CephFS reads/writes, so I want to ask whether 
anyone has already done this? Any wiki or experience to share on how to set up the 
environment?
I know there is a wiki about Alluxio using CephFS as backend storage, 
https://docs.alluxio.io/os/user/stable/en/ufs/CephFS.html , but it seems very 
brief and I can't get the environment set up.

Thanks in advance!


Thanks,
Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to use Alluxio with CephFS as backend storage?

2021-11-25 Thread zxcs
Wow, what a pleasant surprise! Words cannot express my thanks to you, yantao!
I sent you a mail with my detailed questions; would you please help check it? 
Thanks a ton!

Thanks,
Xiong


> On 26 Nov 2021, at 10:47, xueyantao2114 wrote:
> 
> First, thanks for your question. The Alluxio underfs for ceph and cephfs are both 
> contributed and maintained by my team, and 
> the wiki https://docs.alluxio.io/os/user/stable/en/ufs/CephFS.html was written by 
> me. Can you describe the problem in detail? 
> Maybe I can help.
> 
> 
>  

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] how to change system time without cephfs losing connection

2022-01-04 Thread zxcs
Hi, 

Recently we needed to run some timezone-change tests on our Ubuntu node, and this node 
mounts a CephFS with the kernel driver. When I changed the system time (for 
example, the current time is 2022-01-05 09:00:00 and we set it to 2022-01-03 
08:00:00 using the date command), after about 30m~1h this node can no longer connect to 
CephFS. The dmesg log keeps reporting messages like "libceph: osd*** 
*.*.*.*:port bad authorize reply".

We mount this CephFS with "-o rbytes,noatime,async,noexec,nodev,nodiratime".

Any suggestions on how we can change the time without CephFS losing its connection?  
One way is to umount CephFS before changing the time, change the time, and then mount 
CephFS again (roughly as sketched below), but that seems a little cumbersome. 
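
(Roughly what that workaround looks like; the mount point, monitor address, and auth options are placeholders:)

umount /mnt/cephfs
date -s "2022-01-03 08:00:00"
mount -t ceph <mon-host>:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rbytes,noatime,async,noexec,nodev,nodiratime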

Thanks in advance!


Thanks,
Xiong
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io