> Am 12.04.2016 um 21:24 schrieb Robert Mustacchi <[email protected]>:
>
> Hi Dirk,
>
> Apologies that you're running into so many issues on the bleeding edge.
>
> On 4/12/16 11:58 , Dirk Steinberg wrote:
>> The PM951 is an NVMe 1.1 device, therefore I uncommented "strict-version=0;"
>> in /kernel/drv/nvme.conf and did "update_drv -vf nvme". That was not enough.
>>
>> "grep nvme /etc/driver_aliases" only yields: nvme "pciex8086,953"
>>
>> I tried:
>> [root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme
>>
>> That actually attached the driver and made the SSD visible. I could
>> install the zones pool on that SSD.
>
> Can you run prtconf -v for that device and share the compatible line if
> possible? With that I should be able to look at what aliases beyond that
> one we should be using. I've also cc'd the original developer of the NVMe
> driver, who may have more insight.
pci144d,a801, instance #0
System software properties:
name='strict-version' type=int items=1
value=00000000
Hardware properties:
name='compatible' type=string items=13
value='pciex144d,a802.144d.a801.1' +
'pciex144d,a802.144d.a801' + 'pciex144d,a802.1' + 'pciex144d,a802' +
'pciexclass,010802' + 'pciexclass,0108' + 'pci144d,a802.144d.a801.1' +
'pci144d,a802.144d.a801' + 'pci144d,a801' + 'pci144d,a802.1' + 'pci144d,a802' +
'pciclass,010802' + 'pciclass,0108'
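For completeness, here is roughly how the binding could be set up from that compatible list (a sketch only; whether to bind on the device-specific ID or on the generic NVMe class code is a judgement call for the driver maintainers):

```
# Device-specific alias, as I used above:
update_drv -a -i '"pci144d,a802"' nvme

# Alternatively, a class-code alias from the compatible list, which
# would catch any NVMe controller (class 01, subclass 08, interface 02):
update_drv -a -i '"pciexclass,010802"' nvme

# The resulting entries in /etc/driver_aliases would then look like:
#   nvme "pci144d,a802"
#   nvme "pciexclass,010802"
```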
>> That did the trick. I could boot from USB straight into the zones pool on
>> the NVMe.
>>
>> To test things out, I copied some large files into the pool and did a scrub
>> afterwards.
>> OMG!! This was the outcome:
>>
>> [root@nuc6 ~]# zpool status
>> pool: zones
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>> entire pool from backup.
>> see: http://illumos.org/msg/ZFS-8000-8A
>> scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> zones ONLINE 0 0 5
>> c0t1d0 ONLINE 0 0 14
>>
>> errors: 1 data errors, use '-v' for a list
>>
>> Obviously something is very wrong here. I do NOT like to see this.
>>
>> I looked at the file that was supposed to be corrupted. Since I had copied
>> that file over from another machine, I could verify by means of the
>> md5/sha256 checksums that the file was NOT actually corrupt. The checksums
>> on both machines were identical! I played around a little more, copied a
>> few files and ran a few scrubs. I got many more checksum errors on
>> scrubbing (how can it be that there are fewer errors for the pool in total
>> than for the one disk that makes up the pool?) and also a second file with
>> permanent errors (let's say the first file is "a" and the second "b"). I
>> could verify that file "b" was also not corrupted.
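(For the record, the comparison I did amounts to something like the following sketch. The paths and the reference sum are placeholders; on illumos/SmartOS the native hashing tool is `digest -a sha256`, with GNU `sha256sum` as a fallback on the other machine.)

```shell
#!/bin/sh
# Compute a SHA-256 checksum for a file, using whichever tool is available.
hash_file() {
    if command -v digest >/dev/null 2>&1; then
        digest -a sha256 "$1"                 # illumos/SmartOS
    else
        sha256sum "$1" | awk '{print $1}'     # GNU coreutils
    fi
}

# Compare the local file against a checksum obtained on the source machine.
verify_copy() {
    local_sum=$(hash_file "$1")
    remote_sum="$2"   # e.g. pasted from the other host
    if [ "$local_sum" = "$remote_sum" ]; then
        echo "MATCH: $1 is intact despite the zpool report"
    else
        echo "MISMATCH: $1 really is corrupt"
    fi
}
```

Usage would be along the lines of `verify_copy /zones/somefile.iso <sum-from-other-host>`; a match means the "permanent error" report did not correspond to actual on-disk corruption.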
>>
>> Now I had both files a and b in the list of files with permanent errors.
>> After some time, file a disappeared from that list and only b remained!!!
>> Eventually that list of files was empty again! And that pool does not have
>> redundancy, and I also did not use copies=2 or such.
>>
>> [root@nuc6 ~]# zpool status -v
>> pool: zones
>> state: ONLINE
>> status: One or more devices has experienced an unrecoverable error. An
>> attempt was made to correct the error. Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>> using 'zpool clear' or replace the device with 'zpool replace'.
>> see: http://illumos.org/msg/ZFS-8000-9P
>> scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> zones ONLINE 0 0 5
>> c0t1d0 ONLINE 0 0 15
>>
>> errors: No known data errors
>> [root@nuc6 ~]#
>>
>> This output at the same time says "unrecoverable error" AND "No known data
>> errors"! And the files with "Permanent errors" magically disappeared. How
>> is that possible?
>>
>> BTW, the memory is good. I did run a memory test overnight.
>>
>> It appears that these errors only occur randomly in the read path of the
>> scrubbing code. I have no explanation for this.
>>
>> Does anyone have any clues to
>> a) what is going on?
>> b) what can be done about it?
>>
>> Although I could prove that all reported permanent errors during scrubbing
>> were not really errors after all, the entire thing feels really bad and
>> cannot be used like this IMHO.
>>
>> Does anyone have NVMe running reliably? If so, what devices?
>
> We have some Intel P3600s that have been working.
>
> Hans, have you seen something like this?
>
> Dirk, it sounds like this is reproducible, if so, if we were able to
> figure out some follow up questions, could we ask you to do some DTrace
> or other debugging?
Sure, as long as I have this "playground" system. I am currently thinking
about returning the NVMe device for a refund if I conclude that I will not
be able to use it productively.
If you send me a DTrace script, I am happy to run it.
Thanks for your help.
Dirk
> Thanks,
> Robert
>
-------------------------------------------
smartos-discuss