> Am 12.04.2016 um 21:24 schrieb Robert Mustacchi <[email protected]>:
>
> Hi Dirk,
>
> Apologies that you're running into so many issues on the bleeding edge.
>
> On 4/12/16 11:58 , Dirk Steinberg wrote:
>> The PM951 is an NVMe 1.1 device, therefore I uncommented "strict-version=0;"
>> in /kernel/drv/nvme.conf and did "update_drv -vf nvme". That was not enough.
>>
>> "grep nvme /etc/driver_aliases" only yields: nvme "pciex8086,953"
>>
>> I tried:
>> [root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme
>>
>> That actually attached the driver and made the SSD visible. I could
>> install the zones pool on that SSD.
>
> Can you run prtconf -v for that device and share the compatible line if
> possible? With that I should be able to look at what aliases beyond that
> one we should be using. I've also cc'd the original developer of the NVMe
> driver, who may have more insight.
pci144d,a801, instance #0
System software properties:
name='strict-version' type=int items=1
value=00000000
Hardware properties:
name='compatible' type=string items=13
value='pciex144d,a802.144d.a801.1' +
'pciex144d,a802.144d.a801' + 'pciex144d,a802.1' + 'pciex144d,a802' +
'pciexclass,010802' + 'pciexclass,0108' + 'pci144d,a802.144d.a801.1' +
'pci144d,a802.144d.a801' + 'pci144d,a801' + 'pci144d,a802.1' + 'pci144d,a802' +
'pciclass,010802' + 'pciclass,0108'
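For completeness, here is roughly how the binding could be set up from that compatible list (a sketch only; whether to bind on the device-specific ID or on the generic NVMe class code is a judgement call for the driver maintainers):

```
# Device-specific alias, as I used above:
update_drv -a -i '"pci144d,a802"' nvme

# Alternatively, a class-code alias from the compatible list, which
# would catch any NVMe controller (class 01, subclass 08, interface 02):
update_drv -a -i '"pciexclass,010802"' nvme

# The resulting entries in /etc/driver_aliases would then look like:
#   nvme "pci144d,a802"
#   nvme "pciexclass,010802"
```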
>> That did the trick. I could boot from USB straight into the zones pool on
>> the NVMe.
>>
>> To test things out, I copied some large files into the pool and did a scrub
>> afterwards.
>> OMG!! This was the outcome:
>>
>> [root@nuc6 ~]# zpool status
>> pool: zones
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>> entire pool from backup.
>> see: http://illumos.org/msg/ZFS-8000-8A
>> scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> zones ONLINE 0 0 5
>> c0t1d0 ONLINE 0 0 14
>>
>> errors: 1 data errors, use '-v' for a list
>>
>> Obviously something is very wrong here. I do NOT like to see this.
>>
>> I looked at the file that was supposed to be corrupted. Since I had copied
>> that file over from another machine, I could verify by means of the
>> md5/sha256 checksums that the file was NOT actually corrupt. The checksums
>> on both machines were identical! I played around a little more, copied a
>> few files and ran a few scrubs. I got many more checksum errors on
>> scrubbing (how can it be that there are fewer errors for the pool in total
>> than for the one disk that makes up the pool?) and also a second file with
>> permanent errors (let's say the first file is "a" and the second "b"). I
>> could verify that file "b" was also not corrupted.
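(For the record, the comparison I did amounts to something like the following sketch. The paths and the reference sum are placeholders; on illumos/SmartOS the native hashing tool is `digest -a sha256`, with GNU `sha256sum` as a fallback on the other machine.)

```shell
#!/bin/sh
# Compute a SHA-256 checksum for a file, using whichever tool is available.
hash_file() {
    if command -v digest >/dev/null 2>&1; then
        digest -a sha256 "$1"                 # illumos/SmartOS
    else
        sha256sum "$1" | awk '{print $1}'     # GNU coreutils
    fi
}

# Compare the local file against a checksum obtained on the source machine.
verify_copy() {
    local_sum=$(hash_file "$1")
    remote_sum="$2"   # e.g. pasted from the other host
    if [ "$local_sum" = "$remote_sum" ]; then
        echo "MATCH: $1 is intact despite the zpool report"
    else
        echo "MISMATCH: $1 really is corrupt"
    fi
}
```

Usage would be along the lines of `verify_copy /zones/somefile.iso <sum-from-other-host>`; a match means the "permanent error" report did not correspond to actual on-disk corruption.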
>>
>> Now I had both files a and b in the list of files with permanent errors.
>> After some time, file a disappeared from that list and only b remained!!!
>> Eventually that list of files was empty again! And that pool does not have
>> redundancy, and I also did not use copies=2 or such.
>>
>> [root@nuc6 ~]# zpool status -v
>> pool: zones
>> state: ONLINE
>> status: One or more devices has experienced an unrecoverable error. An
>> attempt was made to correct the error. Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>> using 'zpool clear' or replace the device with 'zpool replace'.
>> see: http://illumos.org/msg/ZFS-8000-9P
>> scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> zones ONLINE 0 0 5
>> c0t1d0 ONLINE 0 0 15
>>
>> errors: No known data errors
>> [root@nuc6 ~]#
>>
>> This output at the same time says "unrecoverable error" AND "No known data
>> errors"! And the files with "Permanent errors" magically disappeared. How
>> is that possible?
>>
>> BTW, the memory is good. I did run a memory test overnight.
>>
>> It appears that these errors only occur randomly in the read path of the
>> scrubbing code. I have no explanation for this.
>>
>> Does anyone have any clues to
>> a) what is going on?
>> b) what can be done about it?
>>
>> Although I could prove that all reported permanent errors during scrubbing
>> were not really errors after all, the entire thing feels really bad and
>> cannot be used like this IMHO.
>>
>> Does anyone have NVMe running reliably? If so, what devices?
>
> We have some Intel P3600s that have been working.
>
> Hans, have you seen something like this?
>
> Dirk, it sounds like this is reproducible, if so, if we were able to
> figure out some follow up questions, could we ask you to do some DTrace
> or other debugging?
Sure, as long as I have this "playground" system. I am currently thinking
about returning the NVMe device for a refund if I conclude that I will not
be able to use it productively.
If you send me a DTrace script, I am happy to run it.
Thanks for your help.
Dirk
> Thanks,
> Robert
>
-------------------------------------------
smartos-discuss