[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-23 Thread Marc
I don't think anyone here is advising the use of consumer-grade SSDs/NVMes. Enterprise SSDs usually have a higher endurance rating (DWPD) and simply stay stable under a high constant load. My 1.5-year-old SM863a still reports a normalized wear level of 099 and power-on hours of 097; another SM863a of 3.8 years has 099 wear level and
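For reference only, a minimal sketch of how one might read the same wear figures on a SATA SSD such as the SM863a; the device path is a placeholder and smartmontools is assumed to be installed:
    # normalized wear level and power-on hours (Samsung SATA SMART attribute names)
    smartctl -A /dev/sda | grep -Ei 'Wear_Leveling_Count|Power_On_Hours'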

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
From the Ceph documentation I see that using a fast device for WAL/DB can improve performance. So we are using one (2TB) or two (1TB) Samsung NVMe 970 Pro drives as WAL/DB here, and yes, we have two data pools, an SSD pool and an HDD pool; the SSD pool also uses Samsung 860 Pro drives. The NVMe 970 serves as WAL/DB for both the SSD pool and the HDD
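As a hedged sketch only (the device names and partition layout below are assumptions, not the poster's actual setup), an OSD with its DB/WAL on a shared NVMe partition is typically created with ceph-volume like this:
    # HDD data device, RocksDB/WAL on a pre-created NVMe partition
    ceph-volume lvm create --data /dev/sdd --block.db /dev/nvme0n1p4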

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
> Have you run fio tests to make sure it would work ok? > https://yourcmc.ru/wiki/Ceph_performance > https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0
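The test the linked wiki recommends for a DB/WAL candidate is a single-job 4k sync write; a representative fio invocation might look like the sketch below (the target device is a placeholder, and writing to it destroys any data on it):
    fio --name=db-latency --filename=/dev/nvme0n1 --ioengine=libaio \
        --direct=1 --fsync=1 --rw=randwrite --bs=4k --iodepth=1 \
        --runtime=60 --time_based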

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Mark Lehrer
> Yes, it is an NVMe, and one node has two NVMes as db/wal, one > for ssd (0-2) and another for hdd (3-6). I have no spare to try. > ... > I/O 517 QID 7 timeout, aborting > Input/output error If you are seeing errors like these, it is almost certainly a bad drive unless you are using fabric. Why are yo
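A non-destructive way to back up the bad-drive suspicion is to check the controller's own error and health counters; the device name here is assumed:
    nvme error-log /dev/nvme0n1
    nvme smart-log /dev/nvme0n1 | grep -Ei 'critical_warning|media_errors|num_err_log_entries'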

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Marc

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread zxcs
> …work ok? > https://yourcmc.ru/wiki/Ceph_performance > https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc/edit#gid=0

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-22 Thread Marc

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
Thanks for your reply! Yes, it is an NVMe, and one node has two NVMes as db/wal, one for ssd (0-2) and another for hdd (3-6). I have no spare to try. It's very strange: the load was not very high at that time, and both the SSDs and the NVMe seem healthy. If I cannot fix it, I am afraid I need to set up more nodes a

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread Mark Lehrer
> One NVMe suddenly crashed again. Could anyone please help shed some light here? It looks like a flaky NVMe drive. Do you have a spare to try? On Mon, Feb 22, 2021 at 1:56 AM zxcs wrote: > One NVMe suddenly crashed again. Could anyone please help shed some light here? > Thanks a ton!!! > Below ar

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-21 Thread zxcs
One NVMe suddenly crashed again. Could anyone please help shed some light here? Thanks a ton!!! Below are the syslog and the Ceph log. From /var/log/syslog:
Feb 21 19:38:33 ip kernel: [232562.847916] nvme :03:00.0: I/O 943 QID 7 timeout, aborting
Feb 21 19:38:34 ip kernel: [232563.847946] nvme :03:0
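Those "I/O ... QID ... timeout, aborting" lines come from the kernel nvme driver when a command exceeds its timeout. As a hedged troubleshooting sketch only, two settings that are commonly inspected are the driver's I/O timeout and the APST latency limit; disabling aggressive power-state transitions (nvme_core.default_ps_max_latency_us=0 on the kernel command line) is an often-cited workaround for consumer Samsung NVMe timeouts, though whether it applies here is an assumption:
    # current kernel nvme driver settings
    cat /sys/module/nvme_core/parameters/io_timeout
    cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
    # recent timeout events from the kernel log
    dmesg | grep -i 'nvme.*timeout'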

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
You mean the OS? It is Ubuntu 16.04, and the NVMe is a Samsung 970 PRO 1TB. Thanks, zx > On 19 February 2021 at 18:56, Konstantin Shalygin wrote: > Looks good, what is your hardware? Server model & NVMes? > k >> On 19 Feb 2021, at 13:22, zxcs wrote:

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread Konstantin Shalygin
Looks good, what is your hardware? Server model & NVMes? k > On 19 Feb 2021, at 13:22, zxcs wrote: > BTW, I actually have two nodes with the same issue, and the other problem node's nvme > output is as below: > Smart Log for NVME device:nvme0n1 namespace-id: critical_warning

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
BTW, I actually have two nodes with the same issue, and the other problem node's nvme output is as below:
Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning          : 0
temperature               : 29 C
available_spare           : 100%
available_spare_thre

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread zxcs
Thank you very much, Konstantin! Here is the output of `nvme smart-log /dev/nvme0n1`:
Smart Log for NVME device:nvme0n1 namespace-id:
critical_warning          : 0
temperature               : 27 C
available_spare           : 100%
available_spare_threshold

[ceph-users] Re: Ceph nvme timeout and then aborting

2021-02-19 Thread Konstantin Shalygin
Please paste your `nvme smart-log /dev/nvme0n1` output k > On 19 Feb 2021, at 12:53, zxcs wrote: > I have one Ceph cluster on Nautilus 14.2.10, and each node has 3 SSDs and 4 > HDDs. > It also has two NVMes as cache (meaning nvme0n1 is the cache for SSD OSDs 0-2 and nvme1n1 is the > cache for HDD OSDs 3-7). >
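A non-destructive way to confirm which OSDs actually keep their block.db on which NVMe is to list the LVM metadata on the node; this is a Nautilus-era sketch, and the exact output layout and metadata field names may differ between releases:
    # per-OSD devices, including any [block.db] entries
    ceph-volume lvm list
    # the same information from the cluster's point of view (OSD id 0 as an example)
    ceph osd metadata 0 | grep -i bluefs_db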