Hi Dirk,

Apologies that you're running into so many issues on the bleeding edge.

On 4/12/16 11:58, Dirk Steinberg wrote:
> The PM951 is an NVMe 1.1 device, therefore I uncommented "strict-version=0;"
> in /kernel/drv/nvme.conf and did "update_drv -vf nvme". That was not enough.
>
> "grep nvme /etc/driver_aliases" only yields: nvme "pciex8086,953"
>
> I tried:
> [root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme
>
> That actually attached the driver and made the SSD visible. I could
> install the zones pool on that SSD.

Can you run prtconf -v for that device and share the compatible line if
possible? With that I should be able to work out which additional aliases
we should be using. I've also cc'd the original developer of the NVMe
driver, who may have more insight.

> That did the trick. I could boot from USB straight into the zones pool on the
> NVMe.
>
> To test things out, I copied some large files into the pool and did a scrub
> afterwards.
> OMG!! This was the outcome:
>
> [root@nuc6 ~]# zpool status
>   pool: zones
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://illumos.org/msg/ZFS-8000-8A
>   scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zones       ONLINE       0     0     5
>           c0t1d0    ONLINE       0     0    14
>
> errors: 1 data errors, use '-v' for a list
>
> Obviously something is very wrong here. I do NOT like to see this.
>
> I looked at the file that was supposed to be corrupted. Since I had copied
> that file over from another machine I could verify by means of the md5/sha256
> checksums that the file was NOT actually corrupt. The checksums on both
> machines were identical! I played around a little more, copied a few files
> and ran a few scrubs.
> I got many more checksum errors on scrubbing
> (how can it be that there are fewer errors for the pool in total than for
> the one disk that makes up the pool?) and also a second file with permanent
> errors (let's say the first file is "a" and the second "b"). I could verify
> that file "b" was also not corrupted.
>
> Now I had both files a and b in the list of files with permanent errors.
> After some time, file a disappeared from that list and only b remained!!!
> Eventually that list of files was empty again! And that pool does not have
> redundancy, and I also did not use copies=2 or such.
>
> [root@nuc6 ~]# zpool status -v
>   pool: zones
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         zones       ONLINE       0     0     5
>           c0t1d0    ONLINE       0     0    15
>
> errors: No known data errors
> [root@nuc6 ~]#
>
> This output at the same time says "unrecoverable error" AND "No known data
> errors"!
> And the files with "Permanent errors" magically disappeared. How is that
> possible?
>
> BTW, the memory is good. I did run a memory test overnight.
>
> It appears that these errors only occur randomly in the read path of the
> scrubbing code.
> I have no explanation for this.
>
> Does anyone have any clues to
> a) what is going on?
> b) what can be done about it?
>
> Although I could prove that all reported permanent errors during scrubbing
> were not really errors after all, the entire thing feels really bad and
> cannot be used like this IMHO.
>
> Does anyone have NVMe running reliably? If so, what devices?

We have some Intel P3600s that have been working.
Hans, have you seen something like this?

Dirk, it sounds like this is reproducible. If so, and if we can work out
some follow-up questions, could we ask you to do some DTrace or other
debugging?

Thanks,
Robert
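
For anyone repeating the checksum comparison described in the quoted
message, here is a minimal sketch in Python of one way to do it. It hashes
a file several times and reports whether every read returned the same
bytes. The function names (sha256_of, repeated_digests) and the default of
five passes are illustrative assumptions, not something from the thread.
Note that repeated reads of a recently read file may be served from the
ARC rather than from the device, so a stable result here does not by
itself clear the device or the driver.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, passes=5):
    """Hash the same file `passes` times and return the distinct digests.

    A set with one element means every pass read identical bytes; more
    than one element means at least one pass read different data.
    """
    return {sha256_of(path) for _ in range(passes)}


if __name__ == "__main__":
    # Hypothetical usage: pass one or more file paths on the command line.
    for name in sys.argv[1:]:
        digests = repeated_digests(name)
        status = "stable" if len(digests) == 1 else "MISMATCH"
        print(status, name, " ".join(sorted(digests)))
```

The digest printed for a stable file can be compared directly against the
sha256 value computed on the machine the file was copied from.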
