[ceph-users] Re: any experience on using Bcache on top of HDD OSD

2021-04-20 Thread Matthias Ferdinand
On Tue, Apr 20, 2021 at 08:27:50AM +0200, huxia...@horebdata.cn wrote:
> Dear Matthias,
> 
> Very glad to know that your setup with Bcache works well in production.
> 
> How long have you been putting XFS on bcache on HDD in production?  Which 
> bcache version (I mean the kernel) do you use? Or do you use a special 
> version of bcache?

Hi,

This installation has been running for at least 4 years, on Ubuntu 16.04
with the Ubuntu 4.15 kernel (from the linux-image-generic-hwe-16.04 package).

Matthias
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v14.2.20 Nautilus released

2021-04-20 Thread Ilya Dryomov
On Tue, Apr 20, 2021 at 2:01 AM David Galloway  wrote:
>
> This is the 20th bugfix release in the Nautilus stable series.  It
> addresses a security vulnerability in the Ceph authentication framework.
> We recommend that users update to this release. For detailed release
> notes with links and a changelog, please refer to the official blog entry at
> https://ceph.io/releases/v14-2-20-nautilus-released
>
> Security Fixes
> --
>
> * This release includes a security fix that ensures the global_id value
> (a numeric value that should be unique for every authenticated client or
> daemon in the cluster) is reclaimed after a network disconnect or ticket
> renewal in a secure fashion.  Two new health alerts may appear during
> the upgrade indicating that there are clients or daemons that are not
> yet patched with the appropriate fix.

The link in the blog entry should point at

https://docs.ceph.com/en/latest/security/CVE-2021-20288/

Please refer there for details and recommendations.
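
In short, from my reading of that advisory (the linked page is authoritative,
so please verify the exact names there): the new warnings show up in
"ceph health detail", and once every client and daemon is patched the insecure
behaviour can be disallowed, along these lines:

    ceph health detail   # look for the AUTH_INSECURE_GLOBAL_ID_RECLAIM* warnings
    # only after all clients and daemons are upgraded/patched:
    ceph config set mon auth_allow_insecure_global_id_reclaim false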

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v15.2.11 Octopus released

2021-04-20 Thread Ilya Dryomov
On Tue, Apr 20, 2021 at 1:56 AM David Galloway  wrote:
>
> This is the 11th bugfix release in the Octopus stable series.  It
> addresses a security vulnerability in the Ceph authentication framework.
> We recommend that users update to this release. For detailed release
> notes with links and a changelog, please refer to the official blog entry at
> https://ceph.io/releases/v15-2-11-octopus-released
>
> Security Fixes
> --
>
> * This release includes a security fix that ensures the global_id value
> (a numeric value that should be unique for every authenticated client or
> daemon in the cluster) is reclaimed after a network disconnect or ticket
> renewal in a secure fashion. Two new health alerts may appear during the
> upgrade indicating that there are clients or daemons that are not yet
> patched with the appropriate fix.

The link in the blog entry should point at

https://docs.ceph.com/en/latest/security/CVE-2021-20288/

Please refer there for details and recommendations.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v16.2.1 Pacific released

2021-04-20 Thread Ilya Dryomov
On Tue, Apr 20, 2021 at 2:02 AM David Galloway  wrote:
>
> This is the first bugfix release in the Pacific stable series. It
> addresses a security vulnerability in the Ceph authentication framework.
> We recommend that users update to this release. For detailed release
> notes with links and a changelog, please refer to the official blog entry at
> https://ceph.io/releases/v16-2-1-pacific-released
>
> Security Fixes
> --
>
> * This release includes a security fix that ensures the global_id value
> (a numeric value that should be unique for every authenticated client or
> daemon in the cluster) is reclaimed after a network disconnect or ticket
> renewal in a secure fashion.  Two new health alerts may appear during
> the upgrade indicating that there are clients or daemons that are not
> yet patched with the appropriate fix.

The link in the blog entry should point at

https://docs.ceph.com/en/latest/security/CVE-2021-20288/

Please refer there for details and recommendations.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v14.2.20 Nautilus released

2021-04-20 Thread Dan van der Ster
On Tue, Apr 20, 2021 at 11:26 AM Ilya Dryomov  wrote:
>
> On Tue, Apr 20, 2021 at 2:01 AM David Galloway  wrote:
> >
> > This is the 20th bugfix release in the Nautilus stable series.  It
> > addresses a security vulnerability in the Ceph authentication framework.
> > We recommend that users update to this release. For detailed release
> > notes with links and a changelog, please refer to the official blog entry at
> > https://ceph.io/releases/v14-2-20-nautilus-released
> >
> > Security Fixes
> > --
> >
> > * This release includes a security fix that ensures the global_id value
> > (a numeric value that should be unique for every authenticated client or
> > daemon in the cluster) is reclaimed after a network disconnect or ticket
> > renewal in a secure fashion.  Two new health alerts may appear during
> > the upgrade indicating that there are clients or daemons that are not
> > yet patched with the appropriate fix.
>
> The link in the blog entry should point at
>
> https://docs.ceph.com/en/latest/security/CVE-2021-20288/
>
> Please refer there for details and recommendations.

Thanks Ilya.

Is there any potential issue if clients upgrade before the cluster daemons?
(Our clients will likely get 14.2.20 before all the clusters have been
upgraded).

Cheers, Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v14.2.20 Nautilus released

2021-04-20 Thread Ilya Dryomov
On Tue, Apr 20, 2021 at 11:30 AM Dan van der Ster  wrote:
>
> On Tue, Apr 20, 2021 at 11:26 AM Ilya Dryomov  wrote:
> >
> > On Tue, Apr 20, 2021 at 2:01 AM David Galloway  wrote:
> > >
> > > This is the 20th bugfix release in the Nautilus stable series.  It
> > > addresses a security vulnerability in the Ceph authentication framework.
> > > We recommend that users update to this release. For detailed release
> > > notes with links and a changelog, please refer to the official blog entry at
> > > https://ceph.io/releases/v14-2-20-nautilus-released
> > >
> > > Security Fixes
> > > --
> > >
> > > * This release includes a security fix that ensures the global_id value
> > > (a numeric value that should be unique for every authenticated client or
> > > daemon in the cluster) is reclaimed after a network disconnect or ticket
> > > renewal in a secure fashion.  Two new health alerts may appear during
> > > the upgrade indicating that there are clients or daemons that are not
> > > yet patched with the appropriate fix.
> >
> > The link in the blog entry should point at
> >
> > https://docs.ceph.com/en/latest/security/CVE-2021-20288/
> >
> > Please refer there for details and recommendations.
>
> Thanks Ilya.
>
> Is there any potential issue if clients upgrade before the cluster daemons?
> (Our clients will likely get 14.2.20 before all the clusters have been
> upgraded).

No issue.  Userspace clients would just start doing what is expected
by the protocol, same as kernel clients.

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BlueFS spillover detected (Nautilus 14.2.16)

2021-04-20 Thread by morphin
There are a lot of RGW bug fixes between 14.2.16 and 14.2.19, and this is a
prod environment. I always stay a few versions behind to minimize the
risk. I won't take a risk on the RGW side just for an OSD improvement.

It's better to play with the RocksDB options.
Thanks for the advice.
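
For the record, what I have in mind is roughly the following (a sketch only:
bluestore_rocksdb_options replaces the whole option string, so the current
defaults, as shown by "ceph config help bluestore_rocksdb_options", would need
to be carried over, and the OSDs restarted afterwards):

    # <defaults> stands for the existing default option string
    ceph config set osd bluestore_rocksdb_options \
      "<defaults>,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=10"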

Konstantin Shalygin  wrote the following on Mon, 19 Apr 2021 at 22:57:
>
> The multiplier is already 10, so it doesn't need to change, just the base. Or 
> change only the multiplier.
> I would rather suggest upgrading to 14.2.19 and using the new BlueStore policy 
> that lets RocksDB levels use extra space (it will be activated by default).
>
>
>
>
> k
>
> On 19 Apr 2021, at 21:09, by morphin  wrote:
>
> Are you trying to say I should add these (below) options to the config?
>
> - options.max_bytes_for_level_base = 536870912; // 512MB
> - options.max_bytes_for_level_multiplier = 10;
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Mark Lehrer
> One server has LSI SAS3008 [0] instead of the Perc H800,
> which comes with 512MB RAM + BBU. On most servers latencies are around
> 4-12ms (average 6ms), on the system with the LSI controller we see
> 20-60ms (average 30ms) latency.

Are these reads, writes, or a mixed workload?  I would expect an
improvement in writes, but 512MB of cache isn't likely to help much on
reads with such a large data set.

Just as a test, you could remove the battery on one of the H800s to
disable the write cache -- or else disable write caching with the MegaRAID
tools or equivalent.
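
Something along these lines should do it (syntax from memory, so double-check
against your controller CLI's help before running it):

    storcli /c0/vall set wrcache=wt      # force write-through on all virtual disks
    # or with the older MegaCli:
    MegaCli64 -LDSetProp WT -LAll -aAll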





On Mon, Apr 19, 2021 at 12:21 PM Nico Schottelius
 wrote:
>
>
> Good evening,
>
> I have to tackle an old, probably recurring topic: HBAs vs. RAID
> controllers. While, generally speaking, many people in the Ceph field
> recommend going with HBAs, it seems the only server in our infrastructure
> that we phased in with an HBA instead of a RAID controller is actually
> doing worse in terms of latency.
>
> For the background: we have many Perc H800+MD1200 [1] systems running with
> 10TB HDDs (raid0, read ahead, writeback cache).
> One server has LSI SAS3008 [0] instead of the Perc H800,
> which comes with 512MB RAM + BBU. On most servers latencies are around
> 4-12ms (average 6ms), on the system with the LSI controller we see
> 20-60ms (average 30ms) latency.
>
> Now, my question is: are we doing something inherently wrong with the
> SAS3008, or does the cache in fact help to possibly reduce seek time?
>
> We were considering moving more towards LSI HBAs to reduce maintenance
> effort, however if we have a factor of 5 in latency between the two
> different systems, it might be better to stay on the H800 path for
> disks.
>
> Any input/experiences appreciated.
>
> Best regards,
>
> Nico
>
> [0]
> 05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 
> PCI-Express Fusion-MPT SAS-3 (rev 02)
> Subsystem: Dell 12Gbps HBA
> Kernel driver in use: mpt3sas
> Kernel modules: mpt3sas
>
> [1]
> 08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 
> [Liberator] (rev 05)
> Subsystem: Dell PERC H800 Adapter
> Kernel driver in use: megaraid_sas
> Kernel modules: megaraid_sas
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Ceph-maintainers] v14.2.20 Nautilus released

2021-04-20 Thread Mike Perez
I've updated these entries with the appropriate link. Thanks Ilya.

On Tue, Apr 20, 2021 at 2:27 AM Ilya Dryomov  wrote:
>
> On Tue, Apr 20, 2021 at 2:01 AM David Galloway  wrote:
> >
> > This is the 20th bugfix release in the Nautilus stable series.  It
> > addresses a security vulnerability in the Ceph authentication framework.
> > We recommend that users update to this release. For detailed release
> > notes with links and a changelog, please refer to the official blog entry at
> > https://ceph.io/releases/v14-2-20-nautilus-released
> >
> > Security Fixes
> > --
> >
> > * This release includes a security fix that ensures the global_id value
> > (a numeric value that should be unique for every authenticated client or
> > daemon in the cluster) is reclaimed after a network disconnect or ticket
> > renewal in a secure fashion.  Two new health alerts may appear during
> > the upgrade indicating that there are clients or daemons that are not
> > yet patched with the appropriate fix.
>
> The link in the blog entry should point at
>
> https://docs.ceph.com/en/latest/security/CVE-2021-20288/
>
> Please refer there for details and recommendations.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Mike Perez
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Issues upgrading to 16.2.1

2021-04-20 Thread Radoslav Milanov

Hello

Tried a cephadm upgrade from 16.2.0 to 16.2.1.

The managers were updated first, then the process halted on the first monitor 
being upgraded. The monitor fails to start:



root@host3:/var/lib/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6/mon.host3#
/usr/bin/docker run --rm --ipc=host --net=host --entrypoint /usr/bin/ceph-mon \
  --privileged --group-add=disk --init \
  --name ceph-c8ee2878-9d54-11eb-bbca-1c34da4b9fb6-mon.host3 \
  -e CONTAINER_IMAGE=ceph/ceph@sha256:9b04c0f15704c49591640a37c7adfd40ffad0a4b42fecb950c3407687cb4f29a \
  -e NODE_NAME=host3 -e CEPH_USE_RANDOM_NONCE=1 \
  -v /var/run/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6:/var/run/ceph:z \
  -v /var/log/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6:/var/log/ceph:z \
  -v /var/lib/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6/crash:/var/lib/ceph/crash:z \
  -v /var/lib/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6/mon.host3:/var/lib/ceph/mon/ceph-host3:z \
  -v /var/lib/ceph/c8ee2878-9d54-11eb-bbca-1c34da4b9fb6/mon.host3/config:/etc/ceph/ceph.conf:z \
  -v /dev:/dev -v /run/udev:/run/udev \
  ceph/ceph@sha256:9b04c0f15704c49591640a37c7adfd40ffad0a4b42fecb950c3407687cb4f29a \
  -n mon.host3 -f --setuser ceph --setgroup ceph \
  --default-log-to-file=false --default-log-to-stderr=true \
  '--default-log-stderr-prefix=debug ' \
  --default-mon-cluster-log-to-file=false \
  --default-mon-cluster-log-to-stderr=true
debug 2021-04-20T14:06:19.437+ 7f11e080a700  0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-04-20T14:06:19.437+ 7f11e080a700  0 ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable), process ceph-mon, pid 7
debug 2021-04-20T14:06:19.437+ 7f11e080a700  0 pidfile_write: ignore empty --pid-file
debug 2021-04-20T14:06:19.441+ 7f11e080a700  0 load: jerasure load: lrc load: isa
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: RocksDB version: 6.8.1
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Git sha rocksdb_build_git_sha:@0@
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Compile date Mar 30 2021
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: DB SUMMARY
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: CURRENT file:  CURRENT
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: IDENTITY file:  IDENTITY
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: MANIFEST file:  MANIFEST-000152 size: 221 Bytes
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: SST files in /var/lib/ceph/mon/ceph-host3/store.db dir, Total Num: 2, files: 000137.sst 000139.sst
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Write Ahead Log file in /var/lib/ceph/mon/ceph-host3/store.db: 000153.log size: 0 ;
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.error_if_exists: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.create_if_missing: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.paranoid_checks: 1
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.env: 0x5641244cf1c0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.fs: Posix File System
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.info_log: 0x564126753220
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.max_file_opening_threads: 16
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.statistics: (nil)
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.use_fsync: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.max_log_file_size: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.max_manifest_file_size: 1073741824
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.log_file_time_to_roll: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.keep_log_file_num: 1000
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.recycle_log_file_num: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.allow_fallocate: 1
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.allow_mmap_reads: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.allow_mmap_writes: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.use_direct_reads: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.use_direct_io_for_flush_and_compaction: 0
debug 2021-04-20T14:06:19.441+ 7f11e080a700  4 rocksdb: Options.create_missing_column_families: 0
debug 2021-04-

[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Reed Dier
I don't have any performance bits to offer, but I do have one bit of experience 
to share.

My initial Ceph deployment was on existing servers that had LSI RAID 
controllers (3108 specifically).
We created R0 VDs for each disk, and had BBUs, so we were using write-back caching.
The big problem that arose was the pdcache value, which in my case defaulted to 
on.

We had a lightning strike that took out the datacenter, and we lost 21/24 OSDs.
Granted, this was back in XFS-on-filestore days, but this was a painful lesson 
learned.
It was narrowed down to the pdcache and not to the raid controller caching 
functions after carrying out some power-loss scenarios after the incident.

So, make sure you turn your pdcache off in perccli.
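
From memory the invocation is roughly the following (controller number and
exact spelling should be checked against your perccli version's help):

    perccli /c0/eall/sall set pdcache=off
    perccli /c0/eall/sall show all       # verify the per-drive cache state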

Reed

> On Apr 19, 2021, at 1:20 PM, Nico Schottelius  
> wrote:
> 
> 
> Good evening,
> 
> I have to tackle an old, probably recurring topic: HBAs vs. RAID
> controllers. While, generally speaking, many people in the Ceph field
> recommend going with HBAs, it seems the only server in our infrastructure
> that we phased in with an HBA instead of a RAID controller is actually
> doing worse in terms of latency.
> 
> For the background: we have many Perc H800+MD1200 [1] systems running with
> 10TB HDDs (raid0, read ahead, writeback cache).
> One server has LSI SAS3008 [0] instead of the Perc H800,
> which comes with 512MB RAM + BBU. On most servers latencies are around
> 4-12ms (average 6ms), on the system with the LSI controller we see
> 20-60ms (average 30ms) latency.
> 
> Now, my question is: are we doing something inherently wrong with the
> SAS3008, or does the cache in fact help to possibly reduce seek time?
> 
> We were considering moving more towards LSI HBAs to reduce maintenance
> effort, however if we have a factor of 5 in latency between the two
> different systems, it might be better to stay on the H800 path for
> disks.
> 
> Any input/experiences appreciated.
> 
> Best regards,
> 
> Nico
> 
> [0]
> 05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 
> PCI-Express Fusion-MPT SAS-3 (rev 02)
>   Subsystem: Dell 12Gbps HBA
>   Kernel driver in use: mpt3sas
>   Kernel modules: mpt3sas
> 
> [1]
> 08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 
> [Liberator] (rev 05)
>   Subsystem: Dell PERC H800 Adapter
>   Kernel driver in use: megaraid_sas
>   Kernel modules: megaraid_sas
> 
> --
> Sustainable and modern Infrastructures by ungleich.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC Backfill Observations

2021-04-20 Thread Josh Durgin

Hey Josh, adding the dev list where you may get more input.

Generally I think your analysis is correct about the current behavior.

In particular if another copy of a shard is available, backfill or
recovery will read from just that copy, not the rest of the OSDs.

Otherwise, k shards must be read to reconstruct the data (for
Reed-Solomon family erasure codes).

IIRC it doesn't matter whether it's a data or parity shard, the
path is the same.

With respect to reservations, it seems like an oversight that
we don't reserve other shards for backfilling. We reserve all
shards for recovery [0].

On the other hand, overload from recovery is handled better in
pacific and beyond with mclock-based QoS, which provides much
more effective control of recovery traffic [1][2].

In prior versions, the osd_recovery_sleep option was the best
way to get more fine-grained control of recovery and backfill
traffic, but this was not dynamic at all. osd_max_backfills
allowed setting a maximum limit on parallelism. mclock supersedes
both of these when it's enabled, since it can handle bursting and
throttling itself.
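
For example, these can all be adjusted at runtime (the values below are
purely illustrative):

    # pre-mclock style throttling
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_sleep_hdd 0.1
    # with mclock enabled (Pacific onwards), pick a profile instead
    ceph config set osd osd_mclock_profile high_client_ops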

Josh

[0] 
https://github.com/ceph/ceph/blob/v16.2.1/src/osd/PeeringState.cc#L5914-L5921
[1] 
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#dmclock-qos

[2] https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/

On 4/19/21 12:24 PM, Josh Baergen wrote:

Hey all,

I wanted to confirm my understanding of some of the mechanics of
backfill in EC pools. I've yet to find a document that outlines this
in detail; if there is one, please send it my way. :) Some of what I
write below is likely in the "well, duh" category, but I tended
towards completeness.

First off, I understand that backfill reservations work the same way
between replicated pools and EC pools. A local reservation is taken on
the primary OSD, then a remote reservation on the backfill target(s),
before the backfill is allowed to begin. Until this point, the
backfill is in the backfill_wait state.

When the backfill begins, though, is when the differences begin. Let's
say we have an EC 3:2 PG that's backfilling from OSD 2 to OSD 5
(formatted here like pgs_brief):

 1.1  active+remapped+backfilling   [0,1,5,3,4]  0   [0,1,2,3,4]  0

The question in my mind was: Where is the data for this backfill
coming from? In replicated pools, all reads come from the primary.
However, in this case, the primary does not have the data in question;
the primary has to either read the EC chunk from OSD 2, or it has to
reconstruct it by reading from 3 of the OSDs in the acting set.

Based on observation, I _think_ this is what happens:
1. As long as the PG is not degraded, the backfill read is simply
forwarded by the primary to OSD 2.
2. Once the PG becomes degraded, the backfill read needs to use the
reconstructing path, and begins reading from 3 of the OSDs in the
acting set.

Questions:
1. Can anyone confirm or correct my description of how EC backfill
operates? In particular, in case 2 above, does it matter whether OSD 2
is the cause of degradation, for example? Does the read still get
forwarded to a single OSD when it's parity chunks that are being moved
via backfill?
2. I'm curious as to why a 3rd reservation, for the source OSD, wasn't
introduced as a part of EC in Ceph. We've occasionally seen an OSD
become overloaded because several backfills were reading from it
simultaneously, and there's no way to control this via the normal
osd_max_backfills mechanism. Is anyone aware of discussions to this
effect?

Thanks!
Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Anthony D'Atri

I don’t have the firmware versions handy, but at one point around the 2014-2015 
timeframe I found that both LSI’s firmware and storcli claimed that the default 
setting was DiskDefault, i.e. leave whatever the drive has alone.  It turned 
out, though, that for the 9266 and 9271 at least, behind the scenes it was 
claiming DiskDefault but was actually turning on the drive’s volatile cache, 
which resulted in the power-loss behavior you describe.

There were also hardware and firmware issues that resulted in preserved / 
pinned cache not being properly restored - in one case, even when a drive simply 
failed hard, I had to replace the HBA in order to boot.

I posted a list of RoC HBA thoughts including these to the list back around 
late-summer 2017.  

> 
> I don't have any performance bits to offer, but I do have one experiential 
> bit to offer.
> 
> My initial ceph deployment was on existing servers, that had LSI raid 
> controllers (3108 specifically).
> We created R0 vd's for each disk, and had BBUs so were using write back 
> caching.
> The big problem that arose was the pdcache value, which in my case defaults 
> to on.
> 
> We had a lightning strike that took out the datacenter, and we lost 21/24 
> OSDs.
> Granted, this was back in XFS-on-filestore days, but this was a painful 
> lesson learned.
> It was narrowed down to the pdcache and not to the raid controller caching 
> functions after carrying out some power-loss scenarios after the incident.
> 
> So, make sure you turn your pdcache off in perccli.
> 
> Reed
> 
>> On Apr 19, 2021, at 1:20 PM, Nico Schottelius  
>> wrote:
>> 
>> 
>> Good evening,
>> 
>> I have to tackle an old, probably recurring topic: HBAs vs. RAID
>> controllers. While, generally speaking, many people in the Ceph field
>> recommend going with HBAs, it seems the only server in our infrastructure
>> that we phased in with an HBA instead of a RAID controller is actually
>> doing worse in terms of latency.
>> 
>> For the background: we have many Perc H800+MD1200 [1] systems running with
>> 10TB HDDs (raid0, read ahead, writeback cache).
>> One server has LSI SAS3008 [0] instead of the Perc H800,
>> which comes with 512MB RAM + BBU. On most servers latencies are around
>> 4-12ms (average 6ms), on the system with the LSI controller we see
>> 20-60ms (average 30ms) latency.
>> 
>> Now, my question is: are we doing something inherently wrong with the
>> SAS3008, or does the cache in fact help to possibly reduce seek time?
>> 
>> We were considering moving more towards LSI HBAs to reduce maintenance
>> effort, however if we have a factor of 5 in latency between the two
>> different systems, it might be better to stay on the H800 path for
>> disks.
>> 
>> Any input/experiences appreciated.
>> 
>> Best regards,
>> 
>> Nico
>> 
>> [0]
>> 05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 
>> PCI-Express Fusion-MPT SAS-3 (rev 02)
>>  Subsystem: Dell 12Gbps HBA
>>  Kernel driver in use: mpt3sas
>>  Kernel modules: mpt3sas
>> 
>> [1]
>> 08:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 
>> [Liberator] (rev 05)
>>  Subsystem: Dell PERC H800 Adapter
>>  Kernel driver in use: megaraid_sas
>>  Kernel modules: megaraid_sas
>> 
>> --
>> Sustainable and modern Infrastructures by ungleich.ch
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Nico Schottelius



Marc  writes:

> This is what I have when I query Prometheus; most HDDs are still SATA 
> 5400rpm, and there are also some SSDs. I also did not optimize CPU frequency 
> settings. (Forget about the instance=c03 label; that is just because the data 
> comes from mgr c03 - these drives are on different hosts.)
>
> ceph_osd_apply_latency_ms
>
> ceph_osd_apply_latency_ms{ceph_daemon="osd.12", instance="c03", job="ceph"}   
> 42
> ...
> ceph_osd_apply_latency_ms{ceph_daemon="osd.19", instance="c03", job="ceph"}   
> 1

I assume this looks somewhat normal, with a bit of variance due to
access.

> avg (ceph_osd_apply_latency_ms)
> 9.336

I see something similar: around 9ms average latency for HDD-based OSDs,
best case an average of around 3ms.
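
(For reference, queries along these lines against the mgr prometheus module's
metrics are what I use for the comparison - nothing fancy:)

    avg(ceph_osd_apply_latency_ms)
    quantile(0.95, ceph_osd_apply_latency_ms)
    topk(5, ceph_osd_apply_latency_ms)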

> So I guess it is possible for you to get lower values on the lsi hba

Can you let me know which exact model you have?

> Maybe you can tune read-ahead on the LSI with something like this.
> echo 8192 > /sys/block/$line/queue/read_ahead_kb
> echo 1024 > /sys/block/$line/queue/nr_requests

I tried both of them, even going up to 16MB read ahead cache, but
besides a short burst when changing the values, the average stays +/-
the same on that host.
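
(Applied roughly like this - a sketch; the selection over /sys/block is
simplified and would also catch non-rotational disks:)

    for q in /sys/block/sd*/queue; do
        echo 8192 > "$q/read_ahead_kb"
        echo 1024 > "$q/nr_requests"
        cat "$q/read_ahead_kb" "$q/nr_requests"
    done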

I also checked the CPU speed (same as the rest) and the I/O scheduler (using
"none" really drives the disks crazy). What I observed is that the avq value in
atop is lower than on the other servers, which are around 15; this
server is more in the range of 1-3.

> Also check for PCIe 3; those have higher bus speeds.

True, even though PCIe 2.0 x8 should be able to deliver 4 GB/s, if I am
not mistaken.



--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Nico Schottelius


Mark Lehrer  writes:

>> One server has LSI SAS3008 [0] instead of the Perc H800,
>> which comes with 512MB RAM + BBU. On most servers latencies are around
>> 4-12ms (average 6ms), on the system with the LSI controller we see
>> 20-60ms (average 30ms) latency.
>
> Are these reads, writes, or a mixed workload?  I would expect an
> improvement in writes, but 512MB of cache isn't likely to help much on
> reads with such a large data set.

It's a mostly-write (~20MB/s), little-read (1-5 MB/s) workload. This is
probably due to many people using this storage for backups.

> Just as a test, you could remove the battery on one of the H800s to
> disable the write cache -- or else disable write caching with the MegaRAID
> tools or equivalent.

That is certainly an interesting idea - and rereading your message and
my statement above might actually explain the behaviour:

- The pattern is mainly write-centric, so write latency is probably the
  real factor
- The HDD OSDs behind the RAID controllers can cache / reorder writes
  and potentially reduce seeks

So while "a RAID controller" per se probably does not improve or reduce
speed for Ceph, "a (disk/RAID) controller with a battery-backed cache"
might actually.

In this context: is anyone here using HBAs with battery backed cache,
and if yes, which controllers do you tend to use?

Nico


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Nico Schottelius



Reed Dier  writes:

> I don't have any performance bits to offer, but I do have one experiential 
> bit to offer.
>
> My initial ceph deployment was on existing servers, that had LSI raid 
> controllers (3108 specifically).
> We created R0 vd's for each disk, and had BBUs so were using write back 
> caching.
> The big problem that arose was the pdcache value, which in my case defaults 
> to on.
>
> We had a lightning strike that took out the datacenter, and we lost 21/24 
> OSDs.
> Granted, this was back in XFS-on-filestore days, but this was a painful 
> lesson learned.
> It was narrowed down to the pdcache and not to the raid controller caching 
> functions after carrying out some power-loss scenarios after the incident.

It's not 100% clear to me, but is the pdcache the same as the disk's
internal (non-battery-backed) cache?

As we are located very near the hydropower plant, we actually connect
each server individually to a UPS. Our original motivation was mainly
to cut off overvoltage, but it has the nice side effect of giving the
servers another battery buffer.


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Anthony D'Atri

> It's not 100% clear to me, but is the pdcache the same as the disk's
> internal (non-battery-backed) cache?

Yes, AIUI.

> As we are located very near the hydropower plant, we actually connect
> each server individually to a UPS.

Lucky you. I’ve seen an entire DC go dark in a power outage thanks to a 
transfer switch not kicking in the generators.  Then it happened again two weeks 
later.  That taught me to be paranoid about power loss, and, with dual power 
supplies, to proactively check their status, so that I don’t find out only when 
it’s too late that a cord unseated or a PSU died, resulting in a silent loss of 
redundancy.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HBA vs caching Raid controller

2021-04-20 Thread Mark Lehrer
> - The pattern is mainly write centric, so write latency is
>   probably the real factor
> - The HDD OSDs behind the raid controllers can cache / reorder
>   writes and reduce seeks potentially

OK that makes sense.

Unfortunately, re-ordering HDD writes without a battery backup is kind
of dangerous -- writes need to happen in order or the filesystem will
punish you when you least expect it.  This is the whole point of the
battery backup - to make sure that out-of-order writes get written to
disk even if there is a power loss in the middle of writing the
controller-write-cache data in an HDD-optimized order.

Your use case is ideal for an SSD-based WAL -- though it may be
difficult to beat the cost of H800s these days.


> In this context: is anyone here using HBAs with battery
> backed cache, and if yes, which controllers do you tend to use?

I almost always use MegaRAID-based controllers (such as the H800).


Good luck,
Mark


On Tue, Apr 20, 2021 at 2:28 PM Nico Schottelius
 wrote:
>
>
> Mark Lehrer  writes:
>
> >> One server has LSI SAS3008 [0] instead of the Perc H800,
> >> which comes with 512MB RAM + BBU. On most servers latencies are around
> >> 4-12ms (average 6ms), on the system with the LSI controller we see
> >> 20-60ms (average 30ms) latency.
> >
> > Are these reads, writes, or a mixed workload?  I would expect an
> > improvement in writes, but 512MB of cache isn't likely to help much on
> > reads with such a large data set.
>
> It's mostly write (~20MB/s), little read (1-5 MB/s) work load. This is
> probably due to many people using this storage for backup.
>
> > Just as a test, you could remove the battery on one of the H800s to
> > disable the write cache -- or else disable write caching with the MegaRAID
> > tools or equivalent.
>
> That is certainly an interesting idea - and rereading your message and
> my statement above might actually explain the behaviour:
>
> - The pattern is mainly write centric, so write latency is probably the
>   real factor
> - The HDD OSDs behind the raid controllers can cache / reorder writes
>   and reduce seeks potentially
>
> So while "a raid controller" per se does probably not improve or reduce
> speed for ceph, "a (disk/raid) controller with a battery backed cache",
> might actually.
>
> In this context: is anyone here using HBAs with battery backed cache,
> and if yes, which controllers do you tend to use?
>
> Nico
>
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Single OSD crash/restarting during scrub operation on specific PG

2021-04-20 Thread Mark Johnson
We've recently recovered from a bit of a disaster where we had some power 
outages (a combination of data centre power maintenance and us not having our 
redundant power supplies connected to the correct redundant power circuits - 
lesson learnt).  We ended up with one OSD that wouldn't start - seemingly 
filesystem corruption, as an fsck found and fixed a couple of errors but the OSD 
still wouldn't start - so we ended up marking it lost and letting the data 
backfill to other OSDs.

That left us with a handful of 'incomplete' or 'incomplete/down' pgs, which was 
causing radosgw to stop accepting connections.  We found a useful blog post from 
somebody that got us to the point of using ceph-objectstore-tool to determine 
the correct remaining copy and mark the pgs as complete.  Backfill operations 
then wrote out the pgs to different OSDs, the cluster returned to a HEALTH_OK 
state, and radosgw started working normally.  At least, that's what I thought.
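
(For reference, the kind of invocation involved looked roughly like the
following - from memory, with a purely illustrative OSD id, and the exact flags
should be checked against the Jewel ceph-objectstore-tool man page; it was run
on the OSD holding the copy we decided to keep:)

    systemctl stop ceph-osd@12
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 30.65 --op info
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid 30.65 --op mark-complete
    systemctl start ceph-osd@12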

Now, I'm occasionally seeing one OSD crashing every now and then - sometimes 
after a few hours, sometimes after only 10 minutes.  It always starts itself up 
again, the queued-up backfills cancel, and the cluster returns to OK until the 
next time.  It's always the same OSD, and going through the logs just now, it 
seems to always occur when performing a scrub operation on the same pg 
(although I haven't checked every single instance to be completely sure).

We're running Jewel (yes, I know it's old but we can't upgrade).

Here are the last couple of lines from the OSD log when it crashes, from two 
different occasions - I've used the hex code from the "Caught signal" line to 
reference events that match that same code in both instances.  It looks roughly 
the same on both occasions, in that it's always the same pg; however, the last 
object shown in the log prior to the crash always seems to be different.

   -91> 2021-04-21 02:37:32.118290 7fed046e6700  5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 3110'3174946, trimmed:
   -90> 2021-04-21 02:37:32.118380 7fed046e6700  5 -- op tracker -- seq: 2219, time: 2021-04-21 02:37:32.118379, event: commit_queued_for_journal_write, op: osd_repop(client.3095191.0:19172420 30.65 30:a78d321d:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.18432929_1149756827.ogg:head v 3110'3174946)
    -1> 2021-04-21 02:37:32.748831 7fed046e6700  5 -- op tracker -- seq: 2221, time: 2021-04-21 02:37:32.748830, event: reached_pg, op: replica scrub(pg: 30.65,from:0'0,to:2923'3171906,epoch:3110,start:30:a63a08df:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.22382834__shadow_.cqXthfu1litKEyNZ53I_voGLwuhonVX_1:0,end:30:a63a13fd:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.26065941_1234799607.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 02:37:32.797826 7fed046e6700 -1 os/filestore/FileStore.cc: In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, IndexedPath*)' thread 7fed046e6700 time 2021-04-21 02:37:32.790356
2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) **
 in thread 7fed046e6700 thread_name:tp_osd_tp
     0> 2021-04-21 02:37:32.859265 7fed046e6700 -1 *** Caught signal (Aborted) **
 in thread 7fed046e6700 thread_name:tp_osd_tp


   -17> 2021-04-21 03:55:09.090430 7f43382c7700  5 -- op tracker -- seq: 1596, time: 2021-04-21 03:55:09.090430, event: done, op: replica scrub(pg: 30.65,from:0'0,to:2979'3174652,epoch:3122,start:30:a639eb4a:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.17157485_1132337117.ogg:0,end:30:a639f7f1:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.604098.21768594_1188151257.ogg:0,chunky:1,deep:1,seed:4294967295,version:6)
    -5> 2021-04-21 03:55:09.777503 7f43382c7700  5 -- op tracker -- seq: 1598, time: 2021-04-21 03:55:09.777476, event: reached_pg, op: replica scrub(pg: 30.65,from:0'0,to:2929'3172006,epoch:3122,start:30:a63a047c:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.18649542__shadow_.tKOWzKIibnLhX3Bu32FiiuG0FH1lIl4_1:0,end:30:a63a0ea6:::71b98f4a-2ef6-466a-bb1d-eb477b317c78.577500.14411965_1101425097.gsm:0,chunky:1,deep:1,seed:4294967295,version:6)
     0> 2021-04-21 03:55:10.089217 7f43382c7700 -1 os/filestore/FileStore.cc: In function 'int FileStore::lfn_find(const ghobject_t&, const Index&, IndexedPath*)' thread 7f43382c7700 time 2021-04-21 03:55:10.081373
2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) **
 in thread 7f43382c7700 thread_name:tp_osd_tp
     0> 2021-04-21 03:55:10.157208 7f43382c7700 -1 *** Caught signal (Aborted) **
 in thread 7f43382c7700 thread_name:tp_osd_tp

Any ideas what to do next?

Regards,
Mark Johnson

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io