Re: [PERFORM] New server: SSD/RAID recommendations?

Graeme B. Bell Tue, 07 Jul 2015 05:05:15 -0700

1. Does the Samsung NVMe have *complete* power loss protection though, for all 
fsync'd data?
I was very badly burned by my experiences with Crucial SSDs and their 'power 
loss protection', which doesn't actually ensure all fsync'd data gets into flash.
It certainly looks pretty with all those capacitors on top in the photos, but 
we need some plug-pull tests to be sure.
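
For anyone who wants to try this, the core of a plug-pull test can be sketched 
in a few lines of Python. This is only a rough sketch, not a finished tool: the 
function names, the record size and the file layout are my own assumptions, 
and it assumes the test file does not exist (or is empty) before you start.

```python
import os
import struct

RECORD_SIZE = 4096  # assumed record size: 8-byte sequence number plus padding


def write_acknowledged_records(path, count):
    """Append numbered records, yielding each number only after fsync returns."""
    with open(path, "ab") as f:
        for seq in range(count):
            f.write(struct.pack("<Q", seq) + bytes(RECORD_SIZE - 8))
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS and drive to make it durable
            yield seq             # only now is seq "acknowledged"


def last_intact_record(path):
    """After the restart: highest sequence number in an unbroken run from 0."""
    last = -1
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break  # end of file or a torn final record
            (seq,) = struct.unpack("<Q", record[:8])
            if seq != last + 1:
                break  # gap or garbage: stop counting here
            last = seq
    return last


# Intended use (hypothetical file name), before pulling the plug:
#
#     for seq in write_acknowledged_records("plugpull.dat", 100_000):
#         print(seq, flush=True)
#
# Run it over ssh so the printed numbers survive on another machine.
```

After the machine comes back, last_intact_record() on the same file must be at 
least the last number you saw printed. If it is lower, the drive acknowledged 
an fsync for data that never made it into flash.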


2. Apologies for the typo in the previous post, raidz5 should have been raidz1. 

3. Also, something to think about when you start having single disk solutions 
(or non-ZFS raid, for that matter).

SSDs are so unlike HDDs. 

The Samsung NVMe has a UBER (uncorrectable bit error rate) measured at 1 in 
10^17. That's one bit gone bad in 12500 TB, a good number.  Chances are the 
drive fails before you hit a bit error, and if not, ZFS would catch it.

Whereas current HDDs are at the 1 in 10^14 level. That means one expected 
error per 12.5TB read, by the specs. That means, every time you fill your cheap 
6-8TB Seagate drive and read it back, there is a good chance it corrupted some 
of your data *even if it performed according to the spec*. (That's also why 
RAID5 isn't viable for rebuilding large arrays, incidentally).
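
To make the arithmetic above concrete, here is a rough Python sketch. It 
assumes bit errors are independent and occur at exactly the quoted UBER, which 
is a simplification of how real drives fail. The function names and the 
4 x 6TB RAID5 example are mine, purely for illustration.

```python
import math

BITS_PER_BYTE = 8
TB = 10**12  # decimal terabyte, as used on drive labels


def bytes_per_expected_error(uber):
    """Average number of bytes read per unrecoverable bit error."""
    return 1.0 / uber / BITS_PER_BYTE


def p_at_least_one_error(bytes_read, uber):
    """Chance of one or more bit errors, assuming independent errors."""
    bits = bytes_read * BITS_PER_BYTE
    # 1 - (1 - uber)**bits, computed so it stays accurate for tiny uber
    return -math.expm1(bits * math.log1p(-uber))


if __name__ == "__main__":
    specs = (("HDD, 1 in 10^14", 1e-14), ("NVMe SSD, 1 in 10^17", 1e-17))
    for name, uber in specs:
        per_error_tb = bytes_per_expected_error(uber) / TB
        print(f"{name}: one expected error per {per_error_tb:,.1f} TB read")
        for size_tb in (6, 8):
            p = p_at_least_one_error(size_tb * TB, uber)
            print(f"  reading {size_tb} TB once: {p:.2%} chance of an error")

    # Hypothetical RAID5 rebuild: 4 x 6 TB drives, one failed, so the
    # three surviving drives (18 TB) must be read back without an error.
    p = p_at_least_one_error(3 * 6 * TB, 1e-14)
    print(f"RAID5 rebuild reading 18 TB of HDD: {p:.0%} chance of an error")
```

Under this simple model, one full read of a 6-8TB drive at 1 in 10^14 comes 
out at somewhere around a 40-50% chance of hitting an error, and the 
hypothetical 18TB rebuild is worse still. At 1 in 10^17 the same reads are 
well under 0.1%.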

Graeme Bell


On 07 Jul 2015, at 12:56, Mkrtchyan, Tigran <tigran.mkrtch...@desy.de> wrote:

> 
> 
> ----- Original Message -----
>> From: "Graeme B. Bell" <graeme.b...@nibio.no>
>> To: "Mkrtchyan, Tigran" <tigran.mkrtch...@desy.de>
>> Cc: "Graeme B. Bell" <graeme.b...@nibio.no>, "Steve Crawford" 
>> <scrawf...@pinpointresearch.com>, "Wes Vaske (wvaske)"
>> <wva...@micron.com>, "pgsql-performance" <pgsql-performance@postgresql.org>
>> Sent: Tuesday, July 7, 2015 12:38:10 PM
>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
> 
>> I am unsure about the performance side, but ZFS is generally very attractive
>> to me.
>> 
>> Key advantages:
>> 
>> 1) Checksumming and automatic fixing-of-broken-things on every file (not just
>> postgres pages, but your scripts, O/S, program files).
>> 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact
>> may slow them down, but helpful for other things). This may actually be a net
>> negative for pg so maybe turn it off.
>> 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that
>> it's safe to replace a RAID array with a single drive... you can use a couple
>> of NVMe SSDs with ZFS mirror or zraid, and get the same availability you'd get
>> from a RAID controller. Slightly better, arguably, since they claim to have
>> fixed the raid write-hole problem.
>> 4) filesystem snapshotting
>> 
>> Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
>> with multiple NVMe drives will outperform quite a lot of the alternatives,
>> with great data integrity guarantees.
> 
> 
> We are planning to have a test setup as well. For now I have a single NVMe
> SSD on my test system:
> 
> # lspci | grep NVM
> 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)
> 
> # mount | grep nvm
> /dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered)
> 
> 
> and I am quite happy with it. We run a write-heavy workload on it to see when
> it will break. Postgres performs very well: about 2.5x faster than with
> regular disks with a single client, and scaling almost linearly with multiple
> clients (picture attached: the Y axis is the number of high-level op/s our
> application does, the X axis is the number of clients). The setup has been in
> use for the last 3 months. It looks promising, but for production we need a
> disk twice as big as the one in the test system. Until today, I was planning
> to use a RAID10 with a HW controller...
> 
> Related to ZFS: we use ZFSonLinux, and its behaviour is not as good as on
> Solaris. Let me rephrase that: performance is unpredictable. We run RAIDZ2
> with 30x3TB disks.
> 
> Tigran.
> 
>> 
>> Haven't built one yet. Hope to, later this year. Steve, I would love to know
>> more about how you're getting on with your NVMe disk in postgres!
>> 
>> Graeme.
>> 
>> On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran.mkrtch...@desy.de> wrote:
>> 
>>> Thanks for the info.
>>> 
>>> So if RAID controllers are not an option, what should one use to build
>>> big databases? LVM with xfs? BtrFs? Zfs?
>>> 
>>> Tigran.
>>> 
>>> ----- Original Message -----
>>>> From: "Graeme B. Bell" <graeme.b...@nibio.no>
>>>> To: "Steve Crawford" <scrawf...@pinpointresearch.com>
>>>> Cc: "Wes Vaske (wvaske)" <wva...@micron.com>, "pgsql-performance"
>>>> <pgsql-performance@postgresql.org>
>>>> Sent: Tuesday, July 7, 2015 12:22:00 PM
>>>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>>> 
>>>> Completely agree with Steve.
>>>> 
>>>> 1. Intel NVMe looks like the best bet if you have modern enough hardware
>>>> for NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>>>> 
>>>> 2. RAID controllers.
>>>> 
>>>> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various
>>>> machines. This might give people an idea about where the risk lies in the
>>>> path from disk to CPU.
>>>> 
>>>> We've had 2 RAID card failures in the last 12 months that nuked the array
>>>> with days of downtime, and 2 problems with batteries suddenly becoming
>>>> useless or suddenly reporting wildly varying temperatures/overheating.
>>>> There may have been other RAID problems I don't know about.
>>>> 
>>>> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week
>>>> (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST
>>>> HDDs.
>>>> 
>>>> So by my estimates:
>>>> 30% annual problem rate with RAID controllers
>>>> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
>>>> 0% failure rate with HGST HDDs.
>>>> 0% failure in our SSDs. (To be fair, our one Samsung SSD apparently has a
>>>> bug in TRIM under Linux, which I'll need to investigate to see if we have
>>>> been affected by it.)
>>>> 
>>>> Also, RAID controllers aren't free - not just the money but also the
>>>> management of them (ever tried writing a complex install script that
>>>> interacts with MegaCLI? It can be done but it's not much fun.). Just take a
>>>> look at the MegaCLI manual and ask yourself... is this even worth it (if
>>>> you have a good MTBF on an enterprise SSD)?
>>>> 
>>>> RAID was meant to be about ensuring availability of data. I have trouble
>>>> believing that these days....
>>>> 
>>>> Graeme Bell
>>>> 
>>>> 
>>>> On 06 Jul 2015, at 18:56, Steve Crawford <scrawf...@pinpointresearch.com> wrote:
>>>> 
>>>>> 
>>>>> 2. We don't typically have redundant electronic components in our servers.
>>>>> Sure, we have dual power supplies and dual NICs (though generally to handle
>>>>> external failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks
>>>>> and...no backup RAID card. Intel Enterprise SSD already have power-fail
>>>>> protection so I don't need a RAID card to give me BBU. Given the MTBF of
>>>>> good enterprise SSD I'm left to wonder if placing a RAID card in front
>>>>> merely adds a new point of failure and scheduled-downtime-inducing hands-on
>>>>> maintenance (I'm looking at you, RAID backup battery).
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-performance
> <pg-with-ssd.png>


