Re: [ceph-users] HA for Vms with Ceph and KVM

2018-03-19 Thread Gregory Farnum
You can explore the rbd exclusive lock functionality if you want to do
this, but it’s not typically advised because using it makes moving live VMs
across hosts harder, IIUC.
-Greg
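A minimal sketch of the commands involved, assuming an image named vm-disk in a
pool named rbd (note that exclusive-lock only serializes writers between clients,
it is not hard fencing on its own):

    # enable the exclusive-lock feature on an existing image
    rbd feature enable rbd/vm-disk exclusive-lock
    # check which clients are currently watching/locking the image
    rbd status rbd/vm-disk
    rbd lock ls rbd/vm-disk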

On Sat, Mar 17, 2018 at 7:47 PM Egoitz Aurrekoetxea 
wrote:

> Good morning,
>
>
> Does some kind of config param exist in Ceph to avoid two hosts accessing
> the same VM pool, or at least the same image inside it? Can this be done at the
> pool or image level?
>
>
> Best regards,
> --
>
>
> [image: sarenet]
> *Egoitz Aurrekoetxea*
> Departamento de sistemas
> 944 209 470
> Parque Tecnológico. Edificio 103
> 48170 Zamudio (Bizkaia)
> ego...@sarenet.es
> www.sarenet.es
>
> Before printing this e-mail, please consider whether it is really necessary.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Syslog logging date/timestamp

2018-03-19 Thread Gregory Farnum
Mostly, this exists because syslog is just receiving our raw strings, and
those embed timestamps generated deep in the code.

So we *could* strip them out for syslog, but we’d still have paid the cost
of generating them. As you can see, we have much higher precision than
the syslog output, plus it’s a “real” time stamp instead of one that just
depends on what happened when transmitting a log message over the network to
the syslog server. :)
-Greg
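If it helps, a quick way to see which logging options a daemon is actually
running with (a sketch; log_to_syslog / clog_to_syslog are the Ceph-side
switches, but lines prefixed "ceph-osd:" can also reach syslog via journald
forwarding of the daemon's stderr):

    # run on the host where osd.0 lives
    ceph daemon osd.0 config show | grep -i syslog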
On Sat, Mar 17, 2018 at 7:27 AM Marc Roos  wrote:

>
>
> I have no logging options configured in ceph.conf, yet I get syslog
> entries like these
>
> Mar 16 23:55:42 c01 ceph-osd: 2018-03-16 23:55:42.535796 7f2f5c53a700 -1
> osd.0 pg_epoch: 18949 pg[17.21( v 18949'4044827
> (18949'4043279,18949'4044827] local-lis/les=18910/18911 n=3125
> ec=3636/3636 lis/c 18910/18910 les/c/f 18911/18912/0 18910/18910/18910)
> [13,0,9] r=1 lpr=18910 luod=0'0 crt=18949'4044827 lcod 18949'4044826
> active] _scan_snaps no head for
> 17:846274ce:::rbd_data.239f5274b0dc51.1d75:39 (have MIN)
> Mar 16 23:55:42 c01 ceph-osd: 2018-03-16 23:55:42.535823 7f2f5c53a700 -1
> osd.0 pg_epoch: 18949 pg[17.21( v 18949'4044827
> (18949'4043279,18949'4044827] local-lis/les=18910/18911 n=3125
> ec=3636/3636 lis/c 18910/18910 les/c/f 18911/18912/0 18910/18910/18910)
> [13,0,9] r=1 lpr=18910 luod=0'0 crt=18949'4044827 lcod 18949'4044826
> active] _scan_snaps no head for
> 17:846274ce:::rbd_data.239f5274b0dc51.1d75:26 (have MIN)
>
> Should the date/timestamp not be omitted here? We already have this from
> the syslog server.
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing pg_num for a pool

2018-03-19 Thread Gregory Farnum
Maybe (likely?) in Mimic. Certainly the next release.

Some code has been written but the reason we haven’t done this before is
the number of edge cases involved, and it’s not clear how long rounding
those off will take.
-Greg
On Fri, Mar 16, 2018 at 2:38 PM Ovidiu Poncea 
wrote:

> Hi All,
>
> Is there any news on when/if support for decreasing pg_num will be
> available?
>
> Thank you,
> Ovidiu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the week-end to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test : I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots of tiny files (1-3kB) on
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations : I initially saw write performance on the destination
disk from a few 100kBps (during exploration of branches with tiny files)
to a few 10MBps (while copying large files), essentially seeing the
file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link started to saturate at 200Mbps, while
destination disk writes are down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at 30MBps rate on
the metadata pool, which correlates to the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata (I'm mounting
on the clients with "noatime"), while I'm just reading data from it...
What could I tune to reduce that write-load-while-reading-only ?

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Gregory Farnum
The MDS has to write to its local journal when clients open files, in
case of certain kinds of failures.

I guess it doesn't distinguish between read-only (when it could
*probably* avoid writing them down? Although it's not as simple a
thing as it sounds) and writeable file opens. So every file you're
opening requires the MDS to commit to disk, and it apparently filled
up its allowable mds log size and now you're stuck on that inter-DC
link. A temporary workaround might be to just keep turning up the mds
log sizes, but I'm sort of surprised it was absorbing stuff at a
useful rate before, so I don't know if changing those will help or
not.
-Greg
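As a sketch, "turning up the mds log sizes" at runtime would look something
like this (the values are only illustrative, not a recommendation):

    # allow more journal segments before the MDS is flagged as behind on trimming
    ceph tell 'mds.*' injectargs '--mds_log_max_segments 120 --mds_log_max_expiring 40'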

On Mon, Mar 19, 2018 at 5:01 PM, Nicolas Huillard  wrote:
> Hi all,
>
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the week-end to copy all data (1TB, 10M objects) from the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
>
> Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
>
> Test : I'm using a single rsync on one of the client servers (the other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just to
> use a potentially more recent cephfs client than on stock 4.9), to the
> SATA disk. The rsync'ed tree consists of lots a tiny files (1-3kB) on
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
>
> Observations : I initially saw write performance on the destination
> disk from a few 100kBps (during exploration of branches with tiny file)
> to a few 10MBps (while copying large files), essentially seeing the
> file names scrolling at a relatively fixed rate, unrelated to their
> individual size.
> After 5 hours, the fibre link stated to saturate at 200Mbps, while
> destination disk writes is down to a few 10kBps.
>
> Using the dashboard, I see lots of metadata writes, at 30MBps rate on
> the metadata pool, which correlates to the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".
>
> I wonder why cephfs would write anything to the metadata (I'm mounting
> on the clients with "noatime"), while I'm just reading data from it...
> What could I tune to reduce that write-load-while-reading-only ?
>
> --
> Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prometheus RADOSGW usage exporter

2018-03-19 Thread Konstantin Shalygin

Hi Berant



I've created prometheus exporter that scrapes the RADOSGW Admin Ops API and
exports the usage information for all users and buckets. This is my first
prometheus exporter so if anyone has feedback I'd greatly appreciate it.
I've tested it against Hammer, and will shortly test against Jewel; though
looking at the docs it should work fine for Jewel as well.

https://github.com/blemmenes/radosgw_usage_exporter



It would be nice if you could take a look at the PRs.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
I experienced the same issue and was able to reduce metadata writes by raising
mds_log_events_per_segment to several times its original value.
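For reference, a sketch of how that can be changed on a running MDS (the value
is only an example):

    # default is 1024; a larger value means fewer, larger journal segments
    ceph tell 'mds.*' injectargs '--mds_log_events_per_segment 8192'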

From: ceph-users  on behalf of Nicolas 
Huillard 
Sent: Monday, March 19, 2018 12:01:09 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Huge amount of cephfs metadata writes while only reading 
data (rsync from storage, to single disk)

Hi all,

I'm experimenting with a new little storage cluster. I wanted to take
advantage of the week-end to copy all data (1TB, 10M objects) from the
cluster to a single SATA disk. I expected to saturate the SATA disk
while writing to it, but the storage cluster actually saturates its
network links, while barely writing to the destination disk (63GB
written in 20h, that's less than 1MBps).

Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each, Luminous
12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
between datacenters (12ms latency). 4 clients using a single cephfs
storing data + metadata on the same spinning disks with bluestore.

Test : I'm using a single rsync on one of the client servers (the other
3 are just sitting there). rsync is local to the client, copying from
the cephfs mount (kernel client on 4.14 from stretch-backports, just to
use a potentially more recent cephfs client than on stock 4.9), to the
SATA disk. The rsync'ed tree consists of lots a tiny files (1-3kB) on
deep directory branches, along with some large files (10-100MB) in a
few directories. There is no other activity on the cluster.

Observations : I initially saw write performance on the destination
disk from a few 100kBps (during exploration of branches with tiny file)
to a few 10MBps (while copying large files), essentially seeing the
file names scrolling at a relatively fixed rate, unrelated to their
individual size.
After 5 hours, the fibre link stated to saturate at 200Mbps, while
destination disk writes is down to a few 10kBps.

Using the dashboard, I see lots of metadata writes, at 30MBps rate on
the metadata pool, which correlates to the 200Mbps link rate.
It also shows regular "Health check failed: 1 MDSs behind on trimming
(MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming (64/30)".

I wonder why cephfs would write anything to the metadata (I'm mounting
on the clients with "noatime"), while I'm just reading data from it...
What could I tune to reduce that write-load-while-reading-only ?

--
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-19 Thread Konstantin Shalygin



We don't run compression as far as I know, so that wouldn't be it. We do
actually run a mix of bluestore & filestore - due to the rest of the
cluster predating a stable bluestore by some amount.



12.2.2 -> 12.2.4 on 2018/03/10: I don't see any increase in memory usage. No
compression of course.



http://storage6.static.itmages.com/i/18/0319/h_1521453809_9131482_859b1fb0a5.png




k
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk write cache - safe?

2018-03-19 Thread Frédéric Nass

Hi Steven,

On 16/03/2018 at 17:26, Steven Vacaroaia wrote:

Hi All,

Can someone please confirm that, for a perfect performance/safety
compromise, the following would be the best settings (id 0 is SSD,
id 1 is HDD)?
Alternatively, any suggestions / sharing configuration / advice would 
be greatly appreciated

Note
server is a DELL R620 with PERC 710, 1GB cache
SSD is an enterprise Toshiba PX05SMB040Y
HDD is an enterprise Seagate ST600MM0006


 megacli -LDGetProp  -DskCache -Lall -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Enabled
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disabled


Sounds good to me as Toshiba PX05SMB040Y SSDs include power-loss 
protection 
(https://toshiba.semicon-storage.com/eu/product/storage-products/enterprise-ssd/px05smbxxx.html)


megacli -LDGetProp  -Cache -Lall -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteBack, ReadAdaptive, 
Direct, No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, 
Cached, Write Cache OK if bad BBU


I've always wondered about ReadAdaptive with no real answer. This would 
need clarification from RHCS / Ceph performance team.


With a 1GB PERC cache, my guess is that you should set SSDs to 
writethrough whatever your workload is, so that the whole cache is 
dedicated to HDDs only, and your nodes don't hit a PERC cache full issue 
that would be hard to diagnose. Besides, write caching should always be 
avoided with a bad BBU.
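In case it is useful, a sketch of the corresponding MegaCLI calls (VD numbers
follow the listing above; flag spelling varies a little between MegaCLI
versions, so check megacli -h first):

    # put the SSD virtual disk (VD 0) in write-through, leaving the PERC cache to the HDDs
    megacli -LDSetProp WT -L0 -a0
    # keep the HDD virtual disk (VD 1) in write-back
    megacli -LDSetProp WB -L1 -a0
    # toggle the on-disk write cache per virtual disk
    megacli -LDSetProp -EnDskCache -L0 -a0
    megacli -LDSetProp -DisDskCache -L1 -a0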


Regards,

Frédéric.


Many thanks

Steven





On 16 March 2018 at 06:20, Frédéric Nass 
> wrote:


Hi Tim,

I wanted to share our experience here as we've been in a situation
    in the past (on a Friday afternoon of course...) where injecting a
snaptrim priority of 40 to all OSDs in the cluster (to speed up
    snaptrimming) resulted in all OSD nodes crashing at the same time,
in all 3 datacenters. My first thought at that particular moment
was : call your wife and tell her you'll be late home. :-D

And this event was not related to a power outage.

Fortunately I had spent some time (when building the cluster)
thinking how each option should be set along the I/O path for #1
data consistency and #2 best possible performance, and that was :

- Single SATA disks Raid0 with writeback PERC caching on each
virtual disk
- write barriers kept enabled on XFS mounts (I had measured a 1.5
    % performance gap so disabling barriers was no good choice, and it
    never is, actually)
- SATA disks write buffer disabled (as volatile)
- SSD journal disks write buffer enabled (as persistent)

We hardly believed it but when all nodes came back online, all
OSDs rejoined the cluster and service was back as it was before.
We didn't face any XFS errors nor did we have any further scrub or
deep-scrub errors.

    My assumption was that the extra power demand for snaptrimming may
have led to node power instability or that we hit a SATA firmware
or maybe a kernel bug.

We also had SSDs as Raid0 with writeback PERC cache ON but changed
that to write-through as we could get more IOPS from them
regarding our workloads.

Thanks for sharing the information about DELL changing the default
    disk buffer policy. What's odd is that all buffers were
    disabled after the node rebooted, including SSDs!
I am now changing them back to enabled for SSDs only.

As said by others, you'd better keep the disks buffers disabled
and rebuild the OSDs after setting the disks as Raid0 with
writeback enabled.

Best,

Frédéric.

    On 14/03/2018 at 20:42, Tim Bishop wrote:

I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A
recent [1]
update to the PERC firmware disabled the disk write cache by
default
which made a noticable difference to the latency on my disks
(spinning
disks, not SSD) - by as much as a factor of 10.

For reference their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to
disabled.
This is to align with the industry for SATA drives. This may
result in a
performance degradation especially in non-Raid mode. You must
perform an
AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in
the PERC
BIOS, or by using hdparm, and doing so returns the latency
back to what
it was before.

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But
given I'm
using a newer kernel, and there's no mention of this in the
Luminous
docs, is it safe to assume it's ok to enable the disk write
cache now?

If it makes a difference, I'm using a mixture of filestore and
bluestore

Re: [ceph-users] Growing an SSD cluster with different disk sizes

2018-03-19 Thread Christian Balzer

Hello,

On Sun, 18 Mar 2018 10:59:15 -0400 Mark Steffen wrote:

> Hello,
> 
> I have a Ceph newb question I would appreciate some advice on
> 
> Presently I have 4 hosts in my Ceph cluster, each with 4 480GB eMLC drives
> in them.  These 4 hosts have 2 more empty slots each.
> 
A lot of the answers would become clearer and more relevant if you could
tell us foremost the exact SSD models (old and new) and the rest of the
cluster HW config (controllers, network).

When I read 480GB, the only DC-level SSDs with 3 DWPD are Samsungs; those 3
DWPD may or may not be sufficient for your use case, of course.

I frequently managed to wear out SSDs more during testing and burn-in (i.e.
several RAID rebuilds) than in a year of actual usage. 
A full level data balancing with Ceph (or more than one depending on how
you bring those new SSDs and hosts online) is a significant write storm. 

> Also, I have some new servers that could also become hosts in the cluster
> (I deploy Ceph in a 'hyperconverged' configuration with KVM hypervisor; I
> find that I usually tend to run out of disk and RAM before I run out of CPU
> so why not make the most of it, at least for now).
> 
> The new hosts have only 4 available drive slots each (there are 3 of them).
> 
> Am I ok (since this is SSDs and so I'm doubting a major IO bottleneck that
> I undoubtedly would see with spinners) to just go ahead and add additional
> two 1TB drives to each of the first 4 hosts, as well as put 4 x 1TB SSDs in
> the 3 new hosts?  This would give each host a similar amount of storage,
> though an unequal amount of OSDs each.
> 
Some SSDs tend to react much worse to being written to at full speed than
others, so tuning Ceph to not use all bandwidth might still be a good idea.
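As a sketch, that kind of throttling usually comes down to these knobs (values
are illustrative only):

    # slow down backfill/recovery so the SSDs are not hammered at full speed
    ceph tell 'osd.*' injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_sleep 0.1'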

> Since the failure domain is by host, and the OSDs are SSD (with 1TB drives
> typically being faster than 480GB drives anyway) is this reasonable?  Or do
> I really need to keep the configuration identical across the board and just
> add additiona 480GB drives to the new hosts and have it all match?
> 
Larger SSDs are not always faster (have more parallelism) than smaller
ones, thus the question for your models. 

Having differently sized OSDs is not a problem per se, but needs a full
understanding of what is going on. 
Your larger OSDs will see twice the action; are they
a) really twice as fast or
b) is your load never going to be an issue anyway?

Christian

> I'm also using Luminous/Bluestore if it matters.
> 
> Thanks in advance!
> 
> *Mark Steffen*
> *"Don't believe everything you read on the Internet." -Abraham Lincoln*


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Failed to add new OSD with bluestores

2018-03-19 Thread ST Wong (ITSC)
Hi,

I tried to extend my experimental cluster with more OSDs running CentOS 7, but
it failed with the warnings and errors shown in the following steps:

$ ceph-deploy install --release luminous newosd1# 
no error
$ ceph-deploy osd create newosd1 --data /dev/sdb

 cut here -
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/cephuser/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.0): /bin/ceph-deploy osd create newosd1 
--data /dev/sdb
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  bluestore : None
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : xfs
[ceph_deploy.cli][INFO  ]  block_wal : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  journal   : None
[ceph_deploy.cli][INFO  ]  subcommand: create
[ceph_deploy.cli][INFO  ]  host  : newosd1
[ceph_deploy.cli][INFO  ]  filestore : None
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.cli][INFO  ]  data  : /dev/sdb
[ceph_deploy.cli][INFO  ]  block_db  : None
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   : 
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  debug : False
[ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device /dev/sdb
[newosd1][DEBUG ] connection detected need for sudo
[newosd1][DEBUG ] connected to host: newosd1
[newosd1][DEBUG ] detect platform information from remote host
[newosd1][DEBUG ] detect machine type
[newosd1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.4.1708 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to newosd1
[newosd1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[newosd1][WARNIN] osd keyring does not exist yet, creating one
[newosd1][DEBUG ] create a keyring file
[newosd1][DEBUG ] find the location of an executable
[newosd1][INFO  ] Running command: sudo /usr/sbin/ceph-volume --cluster ceph 
lvm create --bluestore --data /dev/sdb
[newosd1][WARNIN] -->  RuntimeError: Unable to create a new OSD id
[newosd1][DEBUG ] Running command: ceph-authtool --gen-print-key
[newosd1][DEBUG ] Running command: ceph --cluster ceph --name 
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - 
osd new 9683df7f-78f7-47d9-bfa2-c143002175c0
[newosd1][DEBUG ]  stderr: 2018-03-19 19:15:20.129046 7f30c520c700  0 librados: 
client.bootstrap-osd authentication error (1) Operation not permitted
[newosd1][DEBUG ]  stderr: [errno 1] error connecting to the cluster
[newosd1][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-volume 
--cluster ceph lvm create --bluestore --data /dev/sdb
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
 cut here -


And got some error when running ceph-deploy disk list:

 cut here -
[cephuser@sc001 ~]$ ceph-deploy disk list newosd1
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/home/cephuser/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.0): /bin/ceph-deploy disk list newosd1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  debug : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: list
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  host  : ['newosd1']
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[newosd1][DEBUG ] connection detected need for sudo
[newosd1][DEBUG ] connected to host: newosd1
[newosd1][DEBUG ] detect platform information from remote host
[newosd1][DEBUG ] detect machine type
[newosd1][DEBUG ] find the location of an executable
[newos

Re: [ceph-users] Failed to add new OSD with bluestores

2018-03-19 Thread Alfredo Deza
On Mon, Mar 19, 2018 at 7:29 AM, ST Wong (ITSC)  wrote:
> Hi,
>
>
>
> I tried to extend my experimental cluster with more OSDs running CentOS 7
> but failed with warning and error with following steps:
>
>
>
> $ ceph-deploy install --release luminous newosd1
> # no error
>
> $ ceph-deploy osd create newosd1 --data /dev/sdb
>
>
>
>  cut here -
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/cephuser/.cephdeploy.conf
>
> [ceph_deploy.cli][INFO  ] Invoked (2.0.0): /bin/ceph-deploy osd create
> newosd1 --data /dev/sdb
>
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>
> [ceph_deploy.cli][INFO  ]  verbose   : False
>
> [ceph_deploy.cli][INFO  ]  bluestore : None
>
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
>
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
>
> [ceph_deploy.cli][INFO  ]  fs_type   : xfs
>
> [ceph_deploy.cli][INFO  ]  block_wal : None
>
> [ceph_deploy.cli][INFO  ]  default_release   : False
>
> [ceph_deploy.cli][INFO  ]  username  : None
>
> [ceph_deploy.cli][INFO  ]  journal   : None
>
> [ceph_deploy.cli][INFO  ]  subcommand: create
>
> [ceph_deploy.cli][INFO  ]  host  : newosd1
>
> [ceph_deploy.cli][INFO  ]  filestore : None
>
> [ceph_deploy.cli][INFO  ]  func  :  0x1bd7578>
>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
>
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
>
> [ceph_deploy.cli][INFO  ]  data  : /dev/sdb
>
> [ceph_deploy.cli][INFO  ]  block_db  : None
>
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
>
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
>
> [ceph_deploy.cli][INFO  ]  quiet : False
>
> [ceph_deploy.cli][INFO  ]  debug : False
>
> [ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device
> /dev/sdb
>
> [newosd1][DEBUG ] connection detected need for sudo
>
> [newosd1][DEBUG ] connected to host: newosd1
>
> [newosd1][DEBUG ] detect platform information from remote host
>
> [newosd1][DEBUG ] detect machine type
>
> [newosd1][DEBUG ] find the location of an executable
>
> [ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.4.1708 Core
>
> [ceph_deploy.osd][DEBUG ] Deploying osd to newosd1
>
> [newosd1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>
> [newosd1][WARNIN] osd keyring does not exist yet, creating one
>
> [newosd1][DEBUG ] create a keyring file
>
> [newosd1][DEBUG ] find the location of an executable
>
> [newosd1][INFO  ] Running command: sudo /usr/sbin/ceph-volume --cluster ceph
> lvm create --bluestore --data /dev/sdb
>
> [newosd1][WARNIN] -->  RuntimeError: Unable to create a new OSD id
>
> [newosd1][DEBUG ] Running command: ceph-authtool --gen-print-key
>
> [newosd1][DEBUG ] Running command: ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i -
> osd new 9683df7f-78f7-47d9-bfa2-c143002175c0
>
> [newosd1][DEBUG ]  stderr: 2018-03-19 19:15:20.129046 7f30c520c700  0
> librados: client.bootstrap-osd authentication error (1) Operation not
> permitted
>
> [newosd1][DEBUG ]  stderr: [errno 1] error connecting to the cluster
>
> [newosd1][ERROR ] RuntimeError: command returned non-zero exit status: 1
>
> [ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-volume
> --cluster ceph lvm create --bluestore --data /dev/sdb
>
> [ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

This error will happen when the monitor is refusing to accept a
connection from the OSD due to authentication failing. There are
numerous reasons why this may
fail, for example incorrect information in the ceph configuration
file on the new OSD node.
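A sketch of what to compare in that situation (run the first command on the
new OSD node, the second on a node with an admin keyring):

    # key the new node will use for "osd new"
    cat /var/lib/ceph/bootstrap-osd/ceph.keyring
    # key the monitors actually expect
    ceph auth get client.bootstrap-osd
    # if they differ, refreshing the keys on the admin node usually helps
    ceph-deploy gatherkeys <mon-host>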

>
>  cut here -
>
>
>
>
>
> And got some error when running ceph-deploy disk list:
>
>
>
>  cut here -
>
> [cephuser@sc001 ~]$ ceph-deploy disk list newosd1
>
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/cephuser/.cephdeploy.conf
>
> [ceph_deploy.cli][INFO  ] Invoked (2.0.0): /bin/ceph-deploy disk list
> newosd1
>
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>
> [ceph_deploy.cli][INFO  ]  username  : None
>
> [ceph_deploy.cli][INFO  ]  verbose   : False
>
> [ceph_deploy.cli][INFO  ]  debug : False
>
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>
> [ceph_deploy.cli][INFO  ]  subcommand: list
>
> [ceph_deploy.cli][INFO  ]  quiet : False
>
> [ceph_deploy.cli][INFO  ]  cd_conf   

Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday 19 March 2018 at 10:01 +, Sergey Malinin wrote:
> I experienced the same issue and was able to reduce metadata writes
> by raising mds_log_events_per_segment to
> it’s original value multiplied several times.

I changed it from 1024 to 4096 :
* rsync status (1 line per file) scrolls much quicker
* OSD writes on the dashboard are much lower than reads now (they were much
higher before)
* metadata pool write rate is in the 20-800kBps range now, while metadata
reads are in the 20-80kBps range
* data pool reads are in the hundreds of kBps, which still seems very
low
* destination disk write rate is a bit larger than the data pool read
rate (expected for btrfs), but still low
* inter-DC network load is now 1-50Mbps

I'll monitor the Munin graphs in the long run.

I can't find any doc about that mds_log_events_per_segment setting,
especially on how to choose a good value.
Can you elaborate on "original value multiplied several times"?

I'm just seeing more MDS_TRIM warnings now. Maybe restarting the MDSs
just delayed re-emergence of the initial problem.

> 
> From: ceph-users  on behalf of
> Nicolas Huillard 
> Sent: Monday, March 19, 2018 12:01:09 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Huge amount of cephfs metadata writes while
> only reading data (rsync from storage, to single disk)
> 
> Hi all,
> 
> I'm experimenting with a new little storage cluster. I wanted to take
> advantage of the week-end to copy all data (1TB, 10M objects) from
> the
> cluster to a single SATA disk. I expected to saturate the SATA disk
> while writing to it, but the storage cluster actually saturates its
> network links, while barely writing to the destination disk (63GB
> written in 20h, that's less than 1MBps).
> 
> Setup : 2 datacenters × 3 storage servers × 2 disks/OSD each,
> Luminous
> 12.2.4 on Debian stretch, 1Gbps shared network, 200Mbps fibre link
> between datacenters (12ms latency). 4 clients using a single cephfs
> storing data + metadata on the same spinning disks with bluestore.
> 
> Test : I'm using a single rsync on one of the client servers (the
> other
> 3 are just sitting there). rsync is local to the client, copying from
> the cephfs mount (kernel client on 4.14 from stretch-backports, just
> to
> use a potentially more recent cephfs client than on stock 4.9), to
> the
> SATA disk. The rsync'ed tree consists of lots a tiny files (1-3kB) on
> deep directory branches, along with some large files (10-100MB) in a
> few directories. There is no other activity on the cluster.
> 
> Observations : I initially saw write performance on the destination
> disk from a few 100kBps (during exploration of branches with tiny
> file)
> to a few 10MBps (while copying large files), essentially seeing the
> file names scrolling at a relatively fixed rate, unrelated to their
> individual size.
> After 5 hours, the fibre link stated to saturate at 200Mbps, while
> destination disk writes is down to a few 10kBps.
> 
> Using the dashboard, I see lots of metadata writes, at 30MBps rate on
> the metadata pool, which correlates to the 200Mbps link rate.
> It also shows regular "Health check failed: 1 MDSs behind on trimming
> (MDS_TRIM)" / "MDS health message (mds.2): Behind on trimming
> (64/30)".
> 
> I wonder why cephfs would write anything to the metadata (I'm
> mounting
> on the clients with "noatime"), while I'm just reading data from
> it...
> What could I tune to reduce that write-load-while-reading-only ?
> 
> --
> Nicolas Huillard
-- 
Nicolas Huillard
Associé fondateur - Directeur Technique - Dolomède

nhuill...@dolomede.fr
Fixe : +33 9 52 31 06 10
Mobile : +33 6 50 27 69 08
http://www.dolomede.fr/

https://reseauactionclimat.org/planetman/
http://climat-2020.eu/
http://www.350.org/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
The default for mds_log_events_per_segment is 1024; in my setup I ended up with
8192.
I calculated that value as IOPS / log segments * 5 seconds (AFAIK the MDS
performs journal maintenance once every 5 seconds by default).


On Monday, March 19, 2018 at 15:20, Nicolas Huillard wrote:

> I can't find any doc about that mds_log_events_per_segment setting,
> specially on how to choose a good value.
> Can you elaborate on "original value multiplied several times" ?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Growing an SSD cluster with different disk sizes

2018-03-19 Thread Mark Steffen
At the moment I'm just testing things out and have no critical data on
Ceph.  I'm using some Intel DC S3510 drives at the moment; these may not be
optimal but I'm just trying to do some testing and get my feet wet with
Ceph (since trying it out with 9 OSDs on 2TB spinners about 4 years ago).
I had experimented with some of the Crucial M500 240GB drives in a
relatively high volume LAMP stack server in a RAID5 configuration that ran
for about 4 years with a fairly heavy load (WordPress sites and all that)
and no issues.  Other than 3x the number of writes and heavy IO during a
rebalance, is Ceph "harder" on an SSD than regular RAID would be?  I'm not
using these in a cache tier, so a lot of the data that gets written to them
in many cases with "stay" on the drives for some time.

Ok, try to keep the amount of storage the same per failure
domain/host, but I should aim to be using 1TB drives that are twice as fast
(to help with IO balance) if I'm mixing drive sizes on the same server (if
I have high IO load, which TBH I really don't and don't expect to).
Understood, thank you!

*Mark Steffen*
*"Don't believe everything you read on the Internet." -Abraham Lincoln*



On Mon, Mar 19, 2018 at 7:11 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Sun, 18 Mar 2018 10:59:15 -0400 Mark Steffen wrote:
>
> > Hello,
> >
> > I have a Ceph newb question I would appreciate some advice on
> >
> > Presently I have 4 hosts in my Ceph cluster, each with 4 480GB eMLC
> drives
> > in them.  These 4 hosts have 2 more empty slots each.
> >
> A lot of the answers would become clearer and more relevant if you could
> tell us foremost the exact SSD models (old and new) and the rest of the
> cluster HW config (controllers, network).
>
> When I read 480GB the only DC level SSDs with 3 DWPD are Samsungs, those 3
> DWPD may or may not be sufficient of course for your use case.
>
> I frequently managed to wear out SSDs more during testing and burn-in (i.e.
> several RAID rebuilds) than in a year of actual usage.
> A full level data balancing with Ceph (or more than one depending on how
> you bring those new SSDs and hosts online) is a significant write storm.
>
> > Also, I have some new servers that could also become hosts in the cluster
> > (I deploy Ceph in a 'hyperconverged' configuration with KVM hypervisor; I
> > find that I usually tend to run out of disk and RAM before I run out of
> CPU
> > so why not make the most of it, at least for now).
> >
> > The new hosts have only 4 available drive slots each (there are 3 of
> them).
> >
> > Am I ok (since this is SSDs and so I'm doubting a major IO bottleneck
> that
> > I undoubtedly would see with spinners) to just go ahead and add
> additional
> > two 1TB drives to each of the first 4 hosts, as well as put 4 x 1TB SSDs
> in
> > the 3 new hosts?  This would give each host a similar amount of storage,
> > though an unequal amount of OSDs each.
> >
> Some SSDs tend to react much worse to being written to at full speed than
> others, so tuning Ceph to not use all bandwidth might be still a good idea.
>
> > Since the failure domain is by host, and the OSDs are SSD (with 1TB
> drives
> > typically being faster than 480GB drives anyway) is this reasonable?  Or
> do
> > I really need to keep the configuration identical across the board and
> just
> > add additiona 480GB drives to the new hosts and have it all match?
> >
> Larger SSDs are not always faster (have more parallelism) than smaller
> ones, thus the question for your models.
>
> Having differently sized OSDs is not a problem per se, but needs a full
> understanding of what is going on.
> Your larger OSDs will see twice the action, are they
> a) really twice as fast or
> b) is your load never going to be an issue anyway?
>
> Christian
>
> > I'm also using Luminous/Bluestore if it matters.
> >
> > Thanks in advance!
> >
> > *Mark Steffen*
> > *"Don't believe everything you read on the Internet." -Abraham Lincoln*
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deep Scrub distribution

2018-03-19 Thread Jonathan Proulx
On Mon, Mar 05, 2018 at 12:55:52PM -0500, Jonathan D. Proulx wrote:
:Hi All,
:
:I've recently noticed my deep scrubs are EXTREMELY poorly
:distributed.  They are starting within the 18->06 local time start/stop
:window but are not distributed over enough days, nor well distributed
:over the range of days they have.
:
:root@ceph-mon0:~# for date in `ceph pg dump | awk '/active/{print $20}'`; do 
date +%D -d $date; done | sort | uniq -c
:dumped all
:  1 03/01/18
:  6 03/03/18
:   8358 03/04/18
:   1875 03/05/18
:
:So very nearly all 10240 pgs scrubbed lastnight/this morning.  I've
:been kicking this around for a while since I noticed poor distribution
:over a 7 day range when I was really pretty sure I'd changed that from
:the 7d default to 28d.

So for posterity I was looking at the wrong field, D'oH!

The unstructured output of `ceph pg dump` has spaces in the date
string fields "2018-03-19 01:13:45.997550" so counting off the header
line does not get the field you think it does.

In my case $20 is actually part of the "SCRUB_STAMP" not
"DEEP_SCRUB_STAMP" so it *looks* like hwat I expect "%Y-%m-%d" but
isn't the right "%Y-%m-%d"...

The rightest thing would be to use structured output like `ceph pg
dump -f json-pretty`.

ceph pg dump -f json-pretty | jq '.pg_stats[].last_deep_scrub_stamp'

Now if only I could get jq's strptime to swallow the fractional seconds
in the string this spits out (a unix timestamp would have been better
than formatted time, at least in the json and xml outputs).
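For the record, a one-liner that produces the per-day deep scrub counts from
the structured output (no strptime needed, just split on the space):

    ceph pg dump -f json-pretty | jq -r '.pg_stats[].last_deep_scrub_stamp' \
      | cut -d' ' -f1 | sort | uniq -c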

Ooops,
-Jon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw ldap user authentication issues

2018-03-19 Thread Benjeman Meekhof
Hi Marc,

You mentioned following the instructions 'except for doing this ldap
token'.  Do I read that correctly that you did not generate / use an
LDAP token with your client?  I think that is a necessary part of
triggering the LDAP authentication (Section 3.2 and 3.3 of the doc you
linked).  I can verify it works if you do that.  Pass the base64 token
(ewogICAgIlJHetc) to the 'access key' param of your client leaving
the secret blank (it is ignored).

You can use the referenced command line tool or any method you like to
generate a base64 string which encodes a json struct that looks like
this (this is the decoded ldap token string from the docs):

{
    "RGW_TOKEN": {
        "version": 1,
        "type": "ldap",
        "id": "ceph",
        "key": "800#Gorilla"
    }
}
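Any method of producing that base64 string works; a sketch with plain shell,
using the example id/key values from the docs above:

    # build the RGW_TOKEN json and base64-encode it in one go
    printf '%s' '{"RGW_TOKEN":{"version":1,"type":"ldap","id":"ceph","key":"800#Gorilla"}}' | base64 -w0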

thanks,
Ben


On Sun, Mar 18, 2018 at 12:04 AM, Konstantin Shalygin  wrote:
> Hi Marc
>
>
>> looks like no search is being done there.
>
>
>> rgw::auth::s3::AWSAuthStrategy denied with reason=-13
>
>
>
> The same for me, http://tracker.ceph.com/issues/23091
>
>
> But Yehuda closed this.
>
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday 19 March 2018 at 15:30 +0300, Sergey Malinin wrote:
> Default for mds_log_events_per_segment is 1024, in my set up I ended
> up with 8192.
> I calculated that value like IOPS / log segments * 5 seconds (afaik
> MDS performs journal maintenance once in 5 seconds by default).

I tried 4096 from the initial 1024, then 8192 at the time of your
answer, then 16384, without much improvement...

Then I tried to reduce the number of MDS, from 4 to 1, which definitely
works (sorry if my initial mail didn't make it very clear that I was
using many MDSs, even though it mentioned mds.2).
I now have low rate of metadata write (40-50kBps), and the inter-DC
link load reflects the size and direction of the actual data.

I'll now try to reduce mds_log_events_per_segment back to its original
value (1024), because performance is not optimal, and stutters a bit
too much.

Thanks for your advice!

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Sergey Malinin
Forgot to mention that in my setup the issue was gone when I reverted back to
a single MDS and switched dirfrag off.


On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:

> Then I tried to reduce the number of MDS, from 4 to 1, 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What about Petasan?

2018-03-19 Thread Max Cuttins

Hi everybody,

has anybody used Petasan?
On the website it claims to use Ceph with ready-to-use iSCSI.
Has anybody tried it already?

Experience?
Thoughts?
Reviews?
Doubts?
Pros?
Cons?

Thanks for any thoughts.
Max

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling on Luminous

2018-03-19 Thread David Turner
Sorry for being away. I set all of my backfilling to VERY slow settings
over the weekend and things have been stable, but incredibly slow (1%
recovery from 3% misplaced to 2% all weekend).  I'm back on it now and well
rested.

@Caspar, SWAP isn't being used on these nodes and all of the affected OSDs
have been filestore.

@Dan, I think you hit the nail on the head.  I didn't know that logging was
added for subfolder splitting in Luminous!!! That's AMAZING  We are
seeing consistent subfolder splitting all across the cluster.  The majority
of the crashed OSDs have a split that started before the crash and then a
comment about it in the crash dump.  Looks like I just need to write a
daemon to watch for splitting to start and throttle recovery until it's
done.

I had injected the following timeout settings, but it didn't seem to affect
anything.  I may need to have placed them in ceph.conf and let them pick up
the new settings as the OSDs crashed, but I didn't really want different
settings on some OSDs in the cluster.

osd_op_thread_suicide_timeout=1200 (from 180)
osd-recovery-thread-timeout=300  (from 30)

My game plan for now is to watch for splitting in the log, increase
recovery sleep, decrease osd_recovery_max_active, and watch for splitting
to finish before setting them back to more aggressive settings.  After this
cluster is done backfilling I'm going to do my best to reproduce this
scenario in a test environment and open a ticket to hopefully fix why this
is happening so detrimentally.
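A rough sketch of the kind of watcher I have in mind (treat it as pseudo-shell:
the grep pattern is a guess at the wording of the split messages and the
injected values are only an example):

    # tail all OSD logs on a node and throttle recovery whenever a split is logged
    tail -Fn0 /var/log/ceph/ceph-osd.*.log | grep --line-buffered -i 'split' | \
    while read -r line; do
        ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hdd 1.0 --osd_recovery_max_active 1'
    done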


On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit  wrote:

> Hi David,
>
> What about memory usage?
>
> 1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on
> Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB
> RAM.
>
> If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~
> 150GB RAM needed especially in recovery/backfilling scenario's like these.
>
> Kind regards,
> Caspar
>
>
> 2018-03-15 21:53 GMT+01:00 Dan van der Ster :
>
>> Did you use perf top or iotop to try to identify where the osd is stuck?
>> Did you try increasing the op thread suicide timeout from 180s?
>>
>> Splitting should log at the beginning and end of an op, so it should be
>> clear if it's taking longer than the timeout.
>>
>> .. Dan
>>
>>
>>
>> On Mar 15, 2018 9:23 PM, "David Turner"  wrote:
>>
>> I am aware of the filestore splitting happening.  I manually split all of
>> the subfolders a couple weeks ago on this cluster, but every time we have
>> backfilling the newly moved PGs have a chance to split before the
>> backfilling is done.  When that has happened in the past it causes some
>> blocked requests and will flap OSDs if we don't increase the
>> osd_heartbeat_grace, but it has never consistently killed the OSDs during
>> the task.  Maybe that's new in Luminous due to some of the priority and
>> timeout settings.
>>
>> This problem in general seems unrelated to the subfolder splitting,
>> though, since it started to happen very quickly into the backfilling
>> process.  Definitely before many of the recently moved PGs would have
>> reached that point.  I've also confirmed that the OSDs that are dying are
>> not just stuck on a process (like it looks like with filestore splitting),
>> but actually segfaulting and restarting.
>>
>> On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
>> wrote:
>>
>>> Hi,
>>>
>>> Do you see any split or merge messages in the osd logs?
>>> I recall some surprise filestore splitting on a few osds after the
>>> luminous upgrade.
>>>
>>> .. Dan
>>>
>>>
>>> On Mar 15, 2018 6:04 PM, "David Turner"  wrote:
>>>
>>> I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last
>>> week I added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.
>>> I have OSDs consistently [2] segfaulting during recovery.  There's no
>>> pattern of which OSDs are segfaulting, which hosts have segfaulting OSDs,
>>> etc... It's all over the cluster.  I have been trying variants on all of
>>> these following settings with different levels of success, but I cannot
>>> eliminate the blocked requests and segfaulting
>>> OSDs.  osd_heartbeat_grace, osd_max_backfills, 
>>> osd_op_thread_suicide_timeout, osd_recovery_max_active, 
>>> osd_recovery_sleep_hdd, osd_recovery_sleep_hybrid, 
>>> osd_recovery_thread_timeout,
>>> and osd_scrub_during_recovery.  Except for setting nobackfilling on the
>>> cluster I can't stop OSDs from segfaulting during recovery.
>>>
>>> Does anyone have any ideas for this?  I've been struggling with this for
>>> over a week now.  For the first couple days I rebalanced the cluster and
>>> had this exact same issue prior to adding new storage.  Even setting
>>> osd_max_backfills to 1 and recovery_sleep to 1.0, with everything else on
>>> defaults, doesn't help.
>>>
>>> Backfilling caused things to slow down on Jewel, but I wasn't having
>>> OSDs segfault multiple times/hour like I am on Luminous.  So many OSDs are
>>> going down that I had to set nodo

[ceph-users] Multi Networks Ceph

2018-03-19 Thread Lazuardi Nasution
Hi,

What is the best approach if the network segments differ between OSD to
OSD, OSD to MON and OSD to client due to some networking policy? What
should I put for public_addr and cluster_addr? Is it simply "as is",
depending on the connected network segments of each OSD and MON? If that is
not recommended, what if I use an overlay such as VXLAN or MPLS, with and
without hardware virtual switch/router offload (for example: Netronome,
Mellanox)?
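For what it's worth, the relevant ceph.conf options look like this (the subnets
are hypothetical; the per-daemon public_addr/cluster_addr entries are only
needed when the per-network auto-selection doesn't fit your policy):

    [global]
    public network  = 192.168.10.0/24
    cluster network = 192.168.20.0/24

    [osd.0]
    public addr  = 192.168.10.11
    cluster addr = 192.168.20.11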

Best regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-19 Thread Nicolas Huillard
On Monday, March 19, 2018 at 18:45, Nicolas Huillard wrote:
> > Then I tried to reduce the number of MDS, from 4 to 1, 

> On Monday 19 March 2018 at 19:15 +0300, Sergey Malinin wrote:
> Forgot to mention, that in my setup the issue gone when I had
> reverted back to single MDS and switched dirfrag off. 

So it appears we had the same problem, and applied the same solution ;-)
I reverted mds_log_events_per_segment back to 1024 without problems.

Bandwidth utilisation is OK, destination (single SATA disk) throughput
depends on file sizes (lots of tiny files = 1MBps; big files = 30MBps),
and running 2 rsyncs in parallel only improves things.

Thanks!

-- 
Nicolas Huillard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling on Luminous

2018-03-19 Thread Pavan Rallabhandi
David,

Pretty sure you must be aware of the filestore random split on existing OSD 
PGs, `filestore split rand factor`, maybe you could try that too.
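For reference, a sketch of where those live in ceph.conf (values are only an
example; the rand factor spreads out the split points so PGs don't all split at
the same threshold):

    [osd]
    filestore merge threshold = 40
    filestore split multiple = 8
    filestore split rand factor = 20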

Thanks,
-Pavan.

From: ceph-users  on behalf of David Turner 

Date: Monday, March 19, 2018 at 1:36 PM
To: Caspar Smit 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Backfilling on Luminous

Sorry for being away. I set all of my backfilling to VERY slow settings over 
the weekend and things have been stable, but incredibly slow (1% recovery from 
3% misplaced to 2% all weekend).  I'm back on it now and well rested.

@Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have 
been filestore.

@Dan, I think you hit the nail on the head.  I didn't know that logging was 
added for subfolder splitting in Luminous!!! That's AMAZING  We are seeing 
consistent subfolder splitting all across the cluster.  The majority of the 
crashed OSDs have a split started before the crash and then commenting about it 
in the crash dump.  Looks like I just need to write a daemon to watch for 
splitting to start and throttle recovery until it's done.

I had injected the following timeout settings, but it didn't seem to affect 
anything.  I may need to have placed them in ceph.conf and let them pick up the 
new settings as the OSDs crashed, but I didn't really want different settings 
on some OSDs in the cluster.

osd_op_thread_suicide_timeout=1200 (from 180)
osd-recovery-thread-timeout=300  (from 30)

My game plan for now is to watch for splitting in the log, increase recovery 
sleep, decrease osd_recovery_max_active, and watch for splitting to finish 
before setting them back to more aggressive settings.  After this cluster is 
done backfilling I'm going to do my best to reproduce this scenario in a test 
environment and open a ticket to hopefully fix why this is happening so 
detrimentally.


On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit 
mailto:caspars...@supernas.eu>> wrote:
Hi David,

What about memory usage?

1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC 
P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~ 
150GB RAM needed especially in recovery/backfilling scenario's like these.

Kind regards,
Caspar


2018-03-15 21:53 GMT+01:00 Dan van der Ster 
mailto:d...@vanderster.com>>:
Did you use perf top or iotop to try to identify where the osd is stuck?
Did you try increasing the op thread suicide timeout from 180s?

Splitting should log at the beginning and end of an op, so it should be clear 
if it's taking longer than the timeout.

.. Dan



On Mar 15, 2018 9:23 PM, "David Turner" 
mailto:drakonst...@gmail.com>> wrote:
I am aware of the filestore splitting happening.  I manually split all of the 
subfolders a couple weeks ago on this cluster, but every time we have 
backfilling the newly moved PGs have a chance to split before the backfilling 
is done.  When that has happened in the past it causes some blocked requests 
and will flap OSDs if we don't increase the osd_heartbeat_grace, but it has 
never consistently killed the OSDs during the task.  Maybe that's new in 
Luminous due to some of the priority and timeout settings.

This problem in general seems unrelated to the subfolder splitting, though, 
since it started to happen very quickly into the backfilling process.  
Definitely before many of the recently moved PGs would have reached that point. 
 I've also confirmed that the OSDs that are dying are not just stuck on a 
process (like it looks like with filestore splitting), but actually segfaulting 
and restarting.

On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
mailto:d...@vanderster.com>> wrote:
Hi,

Do you see any split or merge messages in the osd logs?
I recall some surprise filestore splitting on a few osds after the luminous 
upgrade.

.. Dan


On Mar 15, 2018 6:04 PM, "David Turner" 
mailto:drakonst...@gmail.com>> wrote:
I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I 
added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.  I have OSDs 
consistently [2] segfaulting during recovery.  There's no pattern of which OSDs 
are segfaulting, which hosts have segfaulting OSDs, etc... It's all over the 
cluster.  I have been trying variants on all of these following settings with 
different levels of success, but I cannot eliminate the blocked requests and 
segfaulting OSDs.  osd_heartbeat_grace, osd_max_backfills, 
osd_op_thread_suicide_timeout, osd_recovery_max_active, osd_recovery_sleep_hdd, 
osd_recovery_sleep_hybrid, osd_recovery_thread_timeout, and 
osd_scrub_during_recovery.  Except for setting nobackfilling on the cluster I 
can't stop OSDs from segfaulting during recovery.

Does anyone have any ideas for this?  I've been struggling with this for over a 
week now.  For the first couple days I rebalanced the cluster and had this 
exact same issue prior to adding new storag

Re: [ceph-users] Growing an SSD cluster with different disk sizes

2018-03-19 Thread Christian Balzer
Hello,


On Mon, 19 Mar 2018 10:39:02 -0400 Mark Steffen wrote:

> At the moment I'm just testing things out and have no critical data on
> Ceph.  I'm using some Intel DC S3510 drives at the moment; these may not be
> optimal but I'm just trying to do some testing and get my feet with with
> Ceph (since trying it out with 9 OSDs on 2TB spinners about 4 years ago).
>
At 1 DWPD these are most likely _not_ a good fit for anything but the most
read-heavy, write-little type of clusters. 
Get the smart values for them, do some realistic and extensive testing, get
the smart values again and then extrapolate. 
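A sketch of the kind of check I mean (attribute names differ per vendor, so
grep loosely and note the raw values):

    # endurance/wear counters before and after the test run
    smartctl -A /dev/sda | grep -i -e wear -e written -e media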

> I had experimented with some of the Crucial M500 240GB drives in a
> relatively high volume LAMP stack server in a RAID5 configuration that ran
> for about 4 years with a fairly heavy load (WordPress sites and all that)
> and no issues.  
Less than 0.2 DWPD, one guesses that was not very write heavy at all or
their warranted endurance is very conservative.
But the latter is not something you can bank on, either with regard to
your data safety or getting replacement SSDs of course.

>Other than 3x the number of writes and heavy IO during a
> rebalance, is Ceph "harder" on an SSD than regular RAID would be?  I'm not
> using these in a cache tier, so a lot of the data that gets written to them
> in many cases with "stay" on the drives for some time.
> 
If you're having small writes, they will get "journaled" in the WAL/DB
akin to the journal with filestore, so depending on your use case you may
see up to a 2x amplification.
Of course any write will also cause (at least one) other write to the
RocksDB, but that's more or less on par with plain filesystem journals and
their metadata.

Christian

> Ok, try to keep the amount of storage per TB the same in a failure
> domain/host, but I should aim to be using 1TB drives that are twice as fast
> (to help with IO balance) if I'm mixing drive sizes on the same server (if
> I have high IO load, which TBH I really don't and don't expect to).
> Understood, thank you!
> 
> *Mark Steffen*
> *"Don't believe everything you read on the Internet." -Abraham Lincoln*
> 
> 
> 
> On Mon, Mar 19, 2018 at 7:11 AM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Sun, 18 Mar 2018 10:59:15 -0400 Mark Steffen wrote:
> >  
> > > Hello,
> > >
> > > I have a Ceph newb question I would appreciate some advice on
> > >
> > > Presently I have 4 hosts in my Ceph cluster, each with 4 480GB eMLC  
> > drives  
> > > in them.  These 4 hosts have 2 more empty slots each.
> > >  
> > A lot of the answers would become clearer and more relevant if you could
> > tell us foremost the exact SSD models (old and new) and the rest of the
> > cluster HW config (controllers, network).
> >
> > When I read 480GB the only DC level SSDs with 3 DWPD are Samsungs, those 3
> > DWPD may or may not be sufficient of course for your use case.
> >
> > I frequently managed to wear out SSDs more during testing and burn-in (i.e.
> > several RAID rebuilds) than in a year of actual usage.
> > A full level data balancing with Ceph (or more than one depending on how
> > you bring those new SSDs and hosts online) is a significant write storm.
> >  
> > > Also, I have some new servers that could also become hosts in the cluster
> > > (I deploy Ceph in a 'hyperconverged' configuration with KVM hypervisor; I
> > > find that I usually tend to run out of disk and RAM before I run out of  
> > CPU  
> > > so why not make the most of it, at least for now).
> > >
> > > The new hosts have only 4 available drive slots each (there are 3 of  
> > them).  
> > >
> > > Am I ok (since this is SSDs and so I'm doubting a major IO bottleneck  
> > that  
> > > I undoubtedly would see with spinners) to just go ahead and add  
> > additional  
> > > two 1TB drives to each of the first 4 hosts, as well as put 4 x 1TB SSDs  
> > in  
> > > the 3 new hosts?  This would give each host a similar amount of storage,
> > > though an unequal amount of OSDs each.
> > >  
> > Some SSDs tend to react much worse to being written to at full speed than
> > others, so tuning Ceph to not use all bandwidth might be still a good idea.
> >  
> > > Since the failure domain is by host, and the OSDs are SSD (with 1TB  
> > drives  
> > > typically being faster than 480GB drives anyway) is this reasonable?  Or  
> > do  
> > > I really need to keep the configuration identical across the board and  
> > just  
> > > add additiona 480GB drives to the new hosts and have it all match?
> > >  
> > Larger SSDs are not always faster (have more parallelism) than smaller
> > ones, thus the question for your models.
> >
> > Having differently sized OSDs is not a problem per se, but needs a full
> > understanding of what is going on.
> > Your larger OSDs will see twice the action, are they
> > a) really twice as fast or
> > b) is your load never going to be an issue anyway?
> >
> > Christian
> >  
> > > I'm also using Luminous/Bluestore if it matters.
> > >
> > > Thanks in a