Hi,
I am having problems getting NVMe to work on SmartOS and hope that
someone can enlighten me on what I need to do to get it flying.
I am running the newest SmartOS (joyent_20160330T234717Z) on
a new Skylake-based Intel NUC6.
My problem: the Samsung PM951 M.2 NVMe SSD is properly recognized
and supported in Linux; I could even boot Ubuntu from that SSD.
On SmartOS, the device is not recognized and is not listed by diskinfo,
sysinfo, or format.
The relevant device info (from Linux) is:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)
02:00.0 0108: 144d:a802 (rev 01) (prog-if 02 [NVM Express])
Subsystem: 144d:a801
Flags: bus master, fast devsel, latency 0
Memory at df000000 (64-bit, non-prefetchable) [size=16K]
I/O ports at e000 [size=256]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158] Power Budgeting <?>
Capabilities: [168] #19
Capabilities: [188] Latency Tolerance Reporting
Capabilities: [190] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
root@lubuntu:~# nvme list
Node             Model                Version  Namepace Usage                      Format           FW Rev
---------------- -------------------- -------- -------- -------------------------- ---------------- --------
/dev/nvme0n1     SAMSUNG MZVLV512HCJH 1.1      1        0.00 B / 512.11 GB         512 B + 0 B      BXV7000Q
root@dws-desktop:~# nvme fw-log /dev/nvme0n1
Firmware Log for device:nvme0n1
afi : 0x1
frs1 : 0x5130303037565842 (BXV7000Q)
The PM951 is an NVMe 1.1 device, so I uncommented "strict-version=0;"
in /kernel/drv/nvme.conf and ran "update_drv -vf nvme". That was not enough.
"grep nvme /etc/driver_aliases" yields only: nvme "pciex8086,953"
I tried:
[root@nuc6 ~]# update_drv -a -i '"pci144d,a802"' nvme
That actually attached the driver and made the SSD visible. I could
install the zones pool on that SSD.
To have SmartOS find the zones pool on boot I had to patch the boot_archive:
mount /zones/smartos/platform/i86pc/amd64/boot_archive /mnt
sed -i '' -e '/#strict-version=0;/{s:.*:strict-version=0;:;}' \
    /mnt/kernel/drv/nvme.conf
sed -i '' -e '/nvme/{p;s:.*:nvme "pci144d,a802":;}' /mnt/etc/driver_aliases
umount /mnt
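For anyone who wants to try the two edits on scratch copies first, here is a minimal sketch of the same changes as small shell functions. The function names relax_strict_version and add_nvme_alias are made up for illustration, and the files to edit are whatever you pass in. Unlike my sed one-liner, the alias function appends a line and skips the append if the alias is already there, so it can be run twice without harm.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not part of SmartOS.

# Uncomment the "strict-version=0;" line in an nvme.conf-style file.
relax_strict_version() {
    conf=$1
    sed 's/^#[[:space:]]*strict-version=0;.*$/strict-version=0;/' "$conf" > "$conf.tmp" &&
        mv "$conf.tmp" "$conf"
}

# Append an nvme alias for the given PCI ID unless it is already listed.
add_nvme_alias() {
    aliases=$1
    id=$2
    grep -q "^nvme \"$id\"" "$aliases" ||
        printf 'nvme "%s"\n' "$id" >> "$aliases"
}
```

With the boot_archive mounted on /mnt as above, this would be called as
relax_strict_version /mnt/kernel/drv/nvme.conf and
add_nvme_alias /mnt/etc/driver_aliases pci144d,a802.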
That did the trick. I could boot from USB straight into the zones pool on the
NVMe.
To test things out, I copied some large files into the pool and did a scrub
afterwards.
OMG!! This was the outcome:
[root@nuc6 ~]# zpool status
pool: zones
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://illumos.org/msg/ZFS-8000-8A
scan: scrub repaired 0 in 0h0m with 1 errors on Tue Apr 12 18:04:37 2016
config:
NAME STATE READ WRITE CKSUM
zones ONLINE 0 0 5
c0t1d0 ONLINE 0 0 14
errors: 1 data errors, use '-v' for a list
Obviously something is very wrong here. I do NOT like to see this.
I looked at the file that was supposed to be corrupted. Since I had copied
that file over from another machine I could verify by means of the md5/sha256
checksums that the file was NOT actually corrupt. The checksums on both
machines were identical! I played around a little more, copied a few files
and ran a few scrubs. I got many more checksum errors on scrubbing
(how can there be fewer errors for the pool in total than for the one
disk that makes up the pool?) and also a second file with permanent errors
(let's say the first file is "a" and the second "b"). I could verify that
file "b" was also not corrupted.
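The check itself can be sketched roughly like this. The helper names sha256_of and matches_digest are made up, and I am assuming that at least one of sha256sum, digest (illumos) or shasum is installed and that "digest -a sha256 FILE" prints only the hex digest.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for comparing a file against a known digest.

# Print the SHA-256 hex digest of a file using whichever tool is available.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    elif command -v digest >/dev/null 2>&1; then
        digest -a sha256 "$1"
    else
        shasum -a 256 "$1" | awk '{print $1}'
    fi
}

# Succeed if the file's digest equals the expected hex string.
matches_digest() {
    [ "$(sha256_of "$1")" = "$2" ]
}
```

Run sha256_of on the source machine, then call matches_digest with the copy on the NVMe pool and that digest. A zero exit status means the contents are identical.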
Now I had both files a and b in the list of files with permanent errors.
After some time, file a disappeared from that list and only b remained!!!
Eventually that list of files was empty again! And that pool does not have
redundancy, and I also did not use copies=2 or such.
[root@nuc6 ~]# zpool status -v
pool: zones
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 0 in 0h0m with 0 errors on Tue Apr 12 18:20:28 2016
config:
NAME STATE READ WRITE CKSUM
zones ONLINE 0 0 5
c0t1d0 ONLINE 0 0 15
errors: No known data errors
[root@nuc6 ~]#
This output says "unrecoverable error" AND "No known data errors" at the
same time! And the files with "Permanent errors" magically disappeared.
How is that possible?
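To follow how the CKSUM counters move from one scrub to the next, something like this small filter could be used. It is only a sketch: cksum_counts is a made-up name, it reads "zpool status" text on stdin, and it assumes the plain five-column config table shown above with purely numeric counts.

```shell
#!/bin/sh
# Sketch only: print "<name> <cksum-count>" for each row of the config table.
cksum_counts() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }
        # Data rows have at least five columns, the fifth being the count.
        in_cfg && NF >= 5 && $5 ~ /^[0-9]+$/ { print $1, $5 }
        # A shorter line (for example a blank one) ends the table.
        in_cfg && NF < 5 { in_cfg = 0 }
    '
}
```

Usage would be "zpool status zones | cksum_counts" after each scrub, comparing the numbers by hand.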
BTW, the memory is good. I ran a memory test overnight.
It appears that these errors only occur randomly in the read path of the
scrubbing code.
I have no explanation for this.
Does anyone have any clues to
a) what is going on?
b) what can be done about it?
Although I could prove that all reported permanent errors during scrubbing
were not really errors after all, the entire thing feels really bad and cannot
be used like this IMHO.
Does anyone have NVMe running reliably? If so, what devices?
Thanks for any help.
Cheers
Dirk
-------------------------------------------
smartos-discuss