On Thu, May 3, 2018 at 6:54 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> -----Original Message-----
> From: Alex Gorbachev <a...@iss-integration.com>
> Sent: 02 May 2018 22:05
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences
>
> Hi Nick,
>
> On Tue, May 1, 2018 at 4:50 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> Hi all,
>>
>>
>>
>> I am slowly getting round to migrating clusters to Bluestore, but I am
>> interested in how people are handling the potential change in write
>> latency compared with Filestore. Or maybe nobody is really seeing much
>> difference?
>>
>>
>>
>> As we all know, in Bluestore writes are not double written and in most
>> cases go straight to disk. Whilst this is great for pure SSD or pure
>> HDD clusters, since the overhead is drastically reduced, for those of
>> us with HDD+SSD journals in Filestore land the double write had the
>> side effect of acting like a battery-backed cache, accelerating writes
>> when the cluster was not saturated.
>>
>>
>>
>> In some brief testing I am seeing Filestore OSDs with an NVMe journal
>> show an average apply latency of around 1-2ms, whereas some new
>> Bluestore OSDs in the same cluster are showing 20-40ms. I am fairly
>> certain this is because writes are exhibiting the latency of the
>> underlying 7.2k disk. Note that the cluster is very lightly loaded;
>> nothing is being driven into saturation.
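
For anyone wanting to compare the same numbers on their own cluster, the
per-OSD latencies can be pulled straight from Ceph. A minimal sketch
(osd.12 is just a placeholder id, and counter names vary a little
between releases):

    # cluster-wide commit/apply latency per OSD, in ms
    ceph osd perf

    # full counter dump from one OSD; under Bluestore the "commit_lat"
    # and "state_*_lat" counters break the write path down further
    ceph daemon osd.12 perf dump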
>>
>>
>>
>> I know there is a deferred write tuning knob which adjusts the cutover
>> for when an object is double written, but at the default of 32KB I
>> suspect a lot of IOs, even in the 1MB range, are still drastically
>> slower going straight to disk than if they were double written to NVMe
>> first. Has anybody else done any investigation in this area? Is there
>> any long-term harm in running a cluster that defers writes up to 1MB+
>> in size to mimic the Filestore double write approach?
>>
>>
>>
>> I also suspect, after looking through GitHub, that deferred writes
>> only happen when overwriting an existing object or blob (not sure
>> which case applies), so new allocations are still written straight to
>> disk. Can anyone confirm?
>>
>>
>>
>> PS. If your spinning disks are connected via a RAID controller with
>> BBWC then you are not affected by this.
>
> We saw this behavior even on an Areca 1883, which does buffer HDD writes.
> The way out was to put the WAL and DB on NVMe drives, and that solved the
> performance problems.
>
> Just to confirm, our problem is not poor RocksDB performance when
> running on HDD, but the direct write of data to disk. Or have I
> misunderstood your comment?

Correct, the write latencies were quite high; once we moved the WAL and
DB to NVMe PCIe devices, the latencies improved greatly, almost like
Filestore journal behavior.
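
For reference, a sketch of that sort of layout (device names are
examples only; if --block.wal is omitted, the WAL is placed with the
DB):

    # data on the spinner, RocksDB and WAL on NVMe partitions
    ceph-volume lvm create --bluestore \
        --data /dev/sdb \
        --block.db /dev/nvme0n1p1 \
        --block.wal /dev/nvme0n1p2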

Regards,
Alex

>>
>>
>>
>> Thanks,
>>
>> Nick
>>
>>
>