Re: [ceph-users] Backfilling on Luminous

2018-03-16 Thread Caspar Smit
Hi David,

What about memory usage?

1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel
DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase: 15x 10TB ~
150GB of RAM needed, especially in recovery/backfilling scenarios like these.
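
If memory is tight, the BlueStore cache can be capped per OSD. A rough sketch
of what that could look like in ceph.conf (the values below are only an
example, not a recommendation for this particular cluster):

[osd]
# BlueStore cache per OSD; the Luminous defaults are roughly 1GB for
# HDD-backed and 3GB for SSD-backed OSDs.
bluestore cache size hdd = 1073741824
# Keep (cache size + OSD overhead) x number of OSDs well below the RAM
# available on the node, leaving headroom for recovery/backfill peaks.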

Kind regards,
Caspar


2018-03-15 21:53 GMT+01:00 Dan van der Ster :

> Did you use perf top or iotop to try to identify where the osd is stuck?
> Did you try increasing the op thread suicide timeout from 180s?
>
> Splitting should log at the beginning and end of an op, so it should be
> clear if it's taking longer than the timeout.
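> 
> For reference, a quick way to do that kind of check (generic tooling, the
> pid below is a placeholder for the stuck ceph-osd process):
> 
>   perf top -p 12345     # CPU profile of a single OSD process
>   iotop -o              # show only processes currently doing I/O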
>
> .. Dan
>
>
>
> On Mar 15, 2018 9:23 PM, "David Turner"  wrote:
>
> I am aware of the filestore splitting happening.  I manually split all of
> the subfolders a couple of weeks ago on this cluster, but every time we have
> backfilling, the newly moved PGs have a chance to split before the
> backfilling is done.  When that has happened in the past, it causes some
> blocked requests and will flap OSDs if we don't increase
> osd_heartbeat_grace, but it has never consistently killed the OSDs during
> the task.  Maybe that's new in Luminous due to some of the priority and
> timeout settings.
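> 
> (For anyone following along: the split point is governed by the
> filestore_merge_threshold and filestore_split_multiple options — if I
> recall correctly a subdirectory splits around
> 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects.
> The values below are commonly quoted examples, not a recommendation:)
> 
>   [osd]
>   # a negative merge threshold disables merging; a larger multiple delays splits
>   filestore merge threshold = -10
>   filestore split multiple = 8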
>
> This problem in general seems unrelated to the subfolder splitting,
> though, since it started to happen very quickly into the backfilling
> process.  Definitely before many of the recently moved PGs would have
> reached that point.  I've also confirmed that the OSDs that are dying are
> not just stuck on a process (like it looks like with filestore splitting),
> but actually segfaulting and restarting.
>
> On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
> wrote:
>
>> Hi,
>>
>> Do you see any split or merge messages in the osd logs?
>> I recall some surprise filestore splitting on a few osds after the
>> luminous upgrade.
>>
>> .. Dan
>>
>>
>> On Mar 15, 2018 6:04 PM, "David Turner"  wrote:
>>
>> I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last
>> week I added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.
>> I have OSDs consistently [2] segfaulting during recovery.  There's no
>> pattern of which OSDs are segfaulting, which hosts have segfaulting OSDs,
>> etc... It's all over the cluster.  I have been trying variants on all of
>> the following settings with different levels of success, but I cannot
>> eliminate the blocked requests and segfaulting OSDs: osd_heartbeat_grace,
>> osd_max_backfills, osd_op_thread_suicide_timeout, osd_recovery_max_active,
>> osd_recovery_sleep_hdd, osd_recovery_sleep_hybrid, osd_recovery_thread_timeout,
>> and osd_scrub_during_recovery.  Except for setting nobackfilling on the
>> cluster I can't stop OSDs from segfaulting during recovery.
>>
>> Does anyone have any ideas for this?  I've been struggling with this for
>> over a week now.  For the first couple days I rebalanced the cluster and
>> had this exact same issue prior to adding new storage.  Even setting
>> osd_max_backfills to 1 and recovery_sleep to 1.0, with everything else on
>> defaults, doesn't help.
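>> 
>> (For reference, those options can be changed at runtime with injectargs; a
>> sketch with placeholder values only, not the values tried here:)
>> 
>>   ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
>>   ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.5 --osd_recovery_sleep_hybrid 0.2'
>>   ceph tell osd.* injectargs '--osd_heartbeat_grace 60 --osd_op_thread_suicide_timeout 600'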
>>
>> Backfilling caused things to slow down on Jewel, but I wasn't having OSDs
>> segfault multiple times/hour like I am on Luminous.  So many OSDs are going
>> down that I had to set nodown to prevent potential data instability of OSDs
>> on multiple hosts going up and down all the time.  That blocks IO for every
>> OSD that dies either until it comes back up or I manually mark it down.  I
>> hope someone has some ideas for me here.  Our plan moving forward is to
>> only use half of the capacity of the drives by pretending they're 5TB
>> instead of 10TB to increase the spindle speed per TB.  Also migrating to
>> bluestore will hopefully help.
>>
>>
>> [1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on
>> Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB
>> RAM.
>>
>> [2]-19> 2018-03-15 16:42:17.998074 7fe661601700  5 --
>> 10.130.115.25:6811/2942118 >> 10.130.115.48:0/372681 conn(0x55e3ea087000
>> :6811 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pg
>> s=1920 cs=1 l=1). rx osd.254 seq 74507 0x55e3eb8e2e00 osd_ping(ping
>> e93182 stamp 2018-03-15 16:42:17.990698) v4
>>-18> 2018-03-15 16:42:17.998091 7fe661601700  1 --
>> 10.130.115.25:6811/2942118 <== osd.254 10.130.115.48:0/372681 74507 
>> osd_ping(ping e93182 stamp 2018-03-15 16:42:17.990698)
>>  v4  2004+0+0 (492539280 0 0) 0x55e3eb8e2e00 con 0x55e3ea087000
>>-17> 2018-03-15 16:42:17.998109 7fe661601700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7fe639772700' had timed out after 60
>>-16> 2018-03-15 16:42:17.998111 7fe661601700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7fe639f73700' had timed out after 60
>>-15> 2018-03-15 16:42:17.998120 7fe661601700  1 heartbeat_map
>> is_healthy 'OSD::osd_op_tp thread 0x7fe63a774700' had timed out after 60
>>-14> 2018-03-15 16:42:

Re: [ceph-users] Disk write cache - safe?

2018-03-16 Thread Frédéric Nass

Hi Tim,

I wanted to share our experience here, as we've been in a situation in 
the past (on a Friday afternoon, of course...) where injecting a snaptrim 
priority of 40 into all OSDs in the cluster (to speed up snaptrimming) 
resulted in all OSD nodes crashing at the same time, in all 3 
datacenters. My first thought at that particular moment was: call your 
wife and tell her you'll be late home. :-D


And this event was not related to a power outage.

Fortunately I had spent some time (when building the cluster) thinking 
about how each option should be set along the I/O path for #1 data 
consistency and #2 best possible performance, and that was:


- Single SATA disks Raid0 with writeback PERC caching on each virtual disk
- write barriers kept enabled on XFS mounts (I had measured only a 1.5 % 
performance gap, so disabling barriers was not a good choice, and never 
is, actually)

- SATA disks write buffer disabled (as volatile)
- SSD journal disks write buffer enabled (as persistent)

We could hardly believe it, but when all nodes came back online, all OSDs 
rejoined the cluster and service was back as it was before. We didn't 
face any XFS errors, nor did we have any further scrub or deep-scrub errors.


My assumption was that the extra power demand from snaptrimming may have 
led to node power instability, or that we hit a SATA firmware or maybe a 
kernel bug.


We also had the SSDs as Raid0 with writeback PERC cache ON, but changed that 
to write-through as we could get more IOPS from them with our 
workloads.


Thanks for sharing the information about DELL changing the default disk 
buffer policy. What's odd is that all buffers were disabled after the 
node rebooted, including the SSDs!

I am now changing them back to enabled for SSDs only.

As said by others, you'd better keep the disk buffers disabled and 
rebuild the OSDs after setting the disks up as Raid0 with writeback enabled.


Best,

Frédéric.

On 14/03/2018 at 20:42, Tim Bishop wrote:

I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
update to the PERC firmware disabled the disk write cache by default,
which made a noticeable difference to the latency on my disks (spinning
disks, not SSD) - by as much as a factor of 10.

For reference their change list says:

"Changes default value of drive cache for 6 Gbps SATA drive to disabled.
This is to align with the industry for SATA drives. This may result in a
performance degradation especially in non-Raid mode. You must perform an
AC reboot to see existing configurations change."

It's fairly straightforward to re-enable the cache either in the PERC
BIOS, or by using hdparm, and doing so returns the latency back to what
it was before.
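
For reference, the hdparm check/toggle looks like this (sdX is a placeholder,
and the setting does not persist across reboots on its own):

  hdparm -W /dev/sdX      # show the current write-caching state
  hdparm -W1 /dev/sdX     # enable the drive write cache
  hdparm -W0 /dev/sdX     # disable the drive write cache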

Checking the Ceph documentation I can see that older versions [2]
recommended disabling the write cache for older kernels. But given I'm
using a newer kernel, and there's no mention of this in the Luminous
docs, is it safe to assume it's ok to enable the disk write cache now?

If it makes a difference, I'm using a mixture of filestore and bluestore
OSDs - migration is still ongoing.

Thanks,

Tim.

[1] - 
https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
[2] - 
http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Berlin Ceph MeetUp March 26 - openATTIC

2018-03-16 Thread Robert Sander
Hi,

I am happy to announce our next meetup on March 26, we will have a talk
about openATTIC presented by Jan from SuSE.

Please RSVP at https://www.meetup.com/Ceph-Berlin/events/qbpxrhyxfbjc/

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush Bucket move crashes mons

2018-03-16 Thread Paul Emmerich
Hi,

the error looks like there might be something wrong with the device classes
(which are managed via separate trees with magic names behind the scenes).

Can you post your crush map and the command that you are trying to run?
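
In case it's useful, the decompiled map and the device-class shadow trees can
be dumped like this (the --show-shadow flag assumes a reasonably recent
Luminous build):

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  ceph osd crush tree --show-shadow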

Paul

2018-03-15 16:27 GMT+01:00 :

> Hi All,
>
>
>
> Having some interesting challenges.
>
>
>
> I am trying to move 2 new nodes + 2 new racks into my default root, I have
> added them to the cluster outside of the Root=default.
>
>
>
> They are all in and up – happy it seems. The new nodes have all 12 OSDs in
> them and they are all ‘UP’
>
>
>
> So when going to move them into the correct room bucket under the
> default root they fail.
>
>
>
> This is the error log at the time: https://pastebin.com/mHfkEp3X
>
>
>
> I can create another host in the crush and move that in and out of rack
> buckets – all while being outside of the default root. Trying to move an
> empty Rack bucket into the default root fails too.
>
>
>
> All of the cluster is on 12.2.4. I do have 2 backfillfull OSDs, which is
> the reason for needing these disks in the cluster asap.
>
>
>
> Any thoughts?
>
>
>
> Cheers
>
>
>
> Warren Jeffs
>
>
>
> ISIS Infrastructure Services
>
> STFC Rutherford Appleton Laboratory
>
> e-mail:  warren.je...@stfc.ac.uk
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
-- 
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo,
I am on Jewel 10.2.10 and want to upgrade to Luminous. I thought 
I'd proceed the same as for the upgrade to Jewel, by running ceph-ansible on 
the OSD nodes one by one, then on the MON nodes one by one.

---> Is this a sensible way to upgrade to Luminous?

  Problem: on the first OSD node I see that "ceph-disk activate" fails as 
shown at the end of this message.


Note that I am using a slightly modified version of ceph-ansible, which is 
capable of handling my FibreChannel devices: I just aligned it to official 
ceph-ansible. My changes (https://github.com/fgal/ceph-ansible.git) 
merely create a "devices" list, and as long as I set

  ceph_stable_release: jewel
ceph-ansible works OK, so this should exclude both the 
/dev/disk/by-part* stuff and my changes as the cause.
When I change it to "luminous" I see the problem. I guess the behaviour 
of ceph-disk has changed in the meantime... I also tried going back to 12.2.1, 
the last release before ceph-disk was superseded by ceph-volume, and observed 
the same problem.

It looks to me like the problematic line could be (note the '-' after -i):
ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6


  Does anyone have any idea what the problem could be?
  Thanks for your help!

Fulvio


[root@r3srv05.pa1 ~]# ceph-disk -v activate /dev/mapper/3600a0980005da3a2136058a22992p1
main_activate: path = /dev/mapper/3600a0980005da3a2136058a22992p1
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
command: Running command: /usr/sbin/blkid -o udev -p /dev/mapper/3600a0980005da3a2136058a22992p1
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid path is /sys/dev/block/253:25/dm/uuid
get_dm_uuid: get_dm_uuid /dev/mapper/3600a0980005da3a2136058a22992p1 uuid is part1-mpath-3600a0980005da3a2136058a22992
command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/3600a0980005da3a2136058a22992p1
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
mount: Mounting /dev/mapper/3600a0980005da3a2136058a22992p1 on /var/lib/ceph/tmp/mnt.aCTRx9 with options noatime,nodiratime,largeio,inode64,swalloc,logbsize=256k,allocsize=4M
command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,nodiratime,largeio,inode64,swalloc,logbsize=256k,allocsize=4M -- /dev/mapper/3600a0980005da3a2136058a22992p1 /var/lib/ceph/tmp/mnt.aCTRx9

command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.aCTRx9
activate: Cluster uuid is 9a9eedd0-9400-488e-96de-c349fffad7c4
command: Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=fsid

activate: Cluster name is ceph
activate: OSD uuid is 2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6
allocate_osd_id: Allocating OSD id...
command: Running command: /usr/bin/ceph-authtool --gen-print-key
__init__: stderr
command_with_stdin: Running command with stdin: ceph --cluster ceph 
--name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6

command_with_stdin:
command_with_stdin: no valid command found; 10 closest matches:
osd setmaxosd 
osd pause
osd crush rule rm 
osd crush tree
osd crush rule create-simple{firstn|indep}
osd crush rule create-erasure  {}
osd crush get-tunable straw_calc_version
osd crush show-tunables
osd crush tunables 
legacy|argonaut|bobtail|firefly|hammer|jewel|optimal|default

osd crush set-tunable straw_calc_version 
Error EINVAL: invalid command

mount_activate: Failed to activate
unmount: Unmounting /var/lib/ceph/tmp/mnt.aCTRx9
command_check_call: Running command: /bin/umount -- 
/var/lib/ceph/tmp/mnt.aCTRx9
'['ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', 
'--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', '-i', '-', 
'osd', 'new', u'2ceb1f9f-5cf8-46fc-bf8c-2a905e5238b6']' failed with 
status code 22




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous "ceph-disk activate" issue

2018-03-16 Thread Paul Emmerich
Hi,

2018-03-16 15:18 GMT+01:00 Fulvio Galeazzi :

> Hallo,
> I am on Jewel 10.2.10 and willing to upgrade to Luminous. I thought
> I'd proceed same as for the upgrade to Jewel, by running ceph-ansible on
> OSD nodes one by one, then on MON nodes one by one.
> ---> Is this a sensible way to upgrade to Luminous?
>

no, that's the wrong order. See the Luminous release notes for the upgrade
instructions. You'll need to start with the mons, otherwise the OSDs won't
be able to start.
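
For reference, the rough sequence from the release notes looks like this (a
sketch only — package upgrade commands vary by distro, so check the official
upgrade notes before relying on it):

  ceph osd set noout
  # upgrade packages and restart all mons first, one at a time
  systemctl restart ceph-mon.target
  ceph versions                # once the mons are Luminous, shows per-daemon versions
  # then upgrade packages and restart OSDs, host by host
  systemctl restart ceph-osd.target
  # only after every OSD runs Luminous:
  ceph osd require-osd-release luminous
  ceph osd unset noout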


Paul


-- 
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush Bucket move crashes mons

2018-03-16 Thread warren.jeffs
Hi Paul

Many thanks for the reply.

The command is: crush move rack04  room=R80-Upper

Crush map is here: https://pastebin.com/CX7GKtBy

I’ve done some more testing, and the following all work:

- Moving machines between the racks under the default root.

- Renaming racks/hosts under the default root

- Renaming the default root

- Creating a new root

- Adding rack05 and rack04 + hosts nina408 and nina508 into the new root

But when trying to move anything into the default root it fails.

I have tried moving the following into the default root:

- nina408 – with hosts in and without

- nina508 – with hosts in and without

- rack04

- rack05

- rack03 – which I created with nothing in it to try and move.


Since my first email, I have got the cluster to HEALTH_OK by reweighting/remapping 
drives, so everything cluster-wise appears to be functioning fine.

I have not tried manually editing the crush map and reimporting it, given the risk 
that it makes the cluster fall over, as this is currently in production. With 
the CLI I can at least cancel the command and the monitor comes back up fine.

Many thanks.

Warren


From: Paul Emmerich [mailto:paul.emmer...@croit.io]
Sent: 16 March 2018 13:54
To: Jeffs, Warren (STFC,RAL,ISIS) 
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,

the error looks like there might be something wrong with the device classes 
(which are managed via separate trees with magic names behind the scenes).

Can you post your crush map and the command that you are trying to run?
Paul

2018-03-15 16:27 GMT+01:00 <warren.je...@stfc.ac.uk>:
Hi All,

Having some interesting challenges.

I am trying to move 2 new nodes + 2 new racks into my default root, I have 
added them to the cluster outside of the Root=default.

They are all in and up – happy it seems. The new nodes have all 12 OSDs in them 
and they are all ‘UP’

So when going to move them into the correct room bucket under the default 
root they fail.

This is the error log at the time: https://pastebin.com/mHfkEp3X

I can create another host in the crush and move that in and out of rack buckets 
– all while being outside of the default root. Trying to move an empty Rack 
bucket into the default root fails too.

All of the cluster is on 12.2.4. I do have 2 backfill full osds which is the 
reason for needing these disks in the cluster asap.

Any thoughts?

Cheers

Warren Jeffs

ISIS Infrastructure Services
STFC Rutherford Appleton Laboratory
e-mail:  warren.je...@stfc.ac.uk


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo Paul,
You're correct of course, thanks!

  Ok tried to upgrade one MON (out of 3) to Luminous by:
 - removing the MON from the cluster
 - wiping ceph-common
 - running ceph-ansible with "ceph_stable_release: luminous"

but I am now stuck at "[ceph-mon : collect admin and bootstrap keys]". 
If I execute the command on the machine I am installing, I see "machine 
is not in quorum: probing".


  I am a bit confused now: should I upgrade all 3 monitors at once? What 
if anything goes wrong during the upgrade? Or should I do a manual 
upgrade rather than using ceph-ansible?


  Thanks for your time and help!

Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Paul Emmerich 
To: Fulvio Galeazzi 
CC: Ceph Users 
Date: 03/16/2018 03:23 PM


Hi,

2018-03-16 15:18 GMT+01:00 Fulvio Galeazzi >:


Hallo,
     I am on Jewel 10.2.10 and willing to upgrade to Luminous. I
thought I'd proceed same as for the upgrade to Jewel, by running
ceph-ansible on OSD nodes one by one, then on MON nodes one by one.
         ---> Is this a sensible way to upgrade to Luminous?


no, that's the wrong order. See the Luminous release notes for the 
upgrade instructions. You'll need to start with the mons, otherwise the 
OSDs won't be able to start.



Paul


--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Moving OSDs between hosts

2018-03-16 Thread Jon Light
Hi all,

I have a very small cluster consisting of 1 overloaded OSD node and a
couple MON/MGR/MDS nodes. I will be adding new OSD nodes to the cluster and
need to move 36 drives from the existing node to a new one. I'm running
Luminous 12.2.2 on Ubuntu 16.04 and everything was created with ceph-deploy.

What is the best course of action for moving these drives? I have read some
posts that suggest I can simply move the drive and once the new OSD node
sees the drive it will update the cluster automatically.

Time isn't a problem and I want to minimize risk so I want to move 1 OSD at
a time. I was planning on stopping the OSD, moving it to the new host, and
waiting for the OSD to become up and in and the cluster to be healthy. Are
there any other steps I need to take? Should I do anything different?

Thanks in advance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk write cache - safe?

2018-03-16 Thread Steven Vacaroaia
Hi All,

Can someone please confirm that, for a good performance/safety
compromise, the following would be the best settings (id 0 is SSD, id 1
is HDD)?
Alternatively, any suggestions, shared configurations, or advice would be
greatly appreciated.

Note:
the server is a DELL R620 with PERC 710, 1GB cache
the SSD is an enterprise Toshiba PX05SMB040Y
the HDD is an enterprise Seagate ST600MM0006


 megacli -LDGetProp  -DskCache -Lall -a0

Adapter 0-VD 0(target id: 0): Disk Write Cache : Enabled
Adapter 0-VD 1(target id: 1): Disk Write Cache : Disabled

megacli -LDGetProp  -Cache -Lall -a0

Adapter 0-VD 0(target id: 0): Cache Policy:WriteBack, ReadAdaptive, Direct,
No Write Cache if bad BBU
Adapter 0-VD 1(target id: 1): Cache Policy:WriteBack, ReadAdaptive, Cached,
Write Cache OK if bad BBU
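
In case it's useful, the corresponding set commands are roughly the following
(from memory — double-check the syntax against your MegaCLI version; -L0/-L1
are the VD ids shown above):

 megacli -LDSetProp -EnDskCache -L0 -a0     # enable the physical drive cache on a VD
 megacli -LDSetProp -DisDskCache -L1 -a0    # disable the physical drive cache on a VD
 megacli -LDSetProp WT -L0 -a0              # set the PERC cache policy to write-through
 megacli -LDSetProp WB -L1 -a0              # set the PERC cache policy to write-back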

Many thanks

Steven





On 16 March 2018 at 06:20, Frédéric Nass 
wrote:

> Hi Tim,
>
> I wanted to share our experience here, as we've been in a situation in the
> past (on a Friday afternoon, of course...) where injecting a snaptrim
> priority of 40 into all OSDs in the cluster (to speed up snaptrimming)
> resulted in all OSD nodes crashing at the same time, in all 3 datacenters.
> My first thought at that particular moment was: call your wife and tell
> her you'll be late home. :-D
>
> And this event was not related to a power outage.
>
> Fortunately I had spent some time (when building the cluster) thinking how
> each option should be set along the I/O path for #1 data consistency and #2
> best possible performance, and that was :
>
> - Single SATA disks Raid0 with writeback PERC caching on each virtual disk
> - write barriers kept enabled on XFS mounts (I had measured only a 1.5 %
> performance gap, so disabling barriers was not a good choice, and never
> is, actually)
> - SATA disks write buffer disabled (as volatile)
> - SSD journal disks write buffer enabled (as persistent)
>
> We could hardly believe it, but when all nodes came back online, all OSDs
> rejoined the cluster and service was back as it was before. We didn't face
> any XFS errors, nor did we have any further scrub or deep-scrub errors.
>
> My assumption was that the extra power demand from snaptrimming may have
> led to node power instability, or that we hit a SATA firmware or maybe a
> kernel bug.
>
> We also had the SSDs as Raid0 with writeback PERC cache ON, but changed that to
> write-through as we could get more IOPS from them with our workloads.
>
> Thanks for sharing the information about DELL changing the default disk
> buffer policy. What's odd is that all buffers were disabled after the
> node rebooted, including the SSDs!
> I am now changing them back to enabled for SSDs only.
>
> As said by others, you'd better keep the disk buffers disabled and
> rebuild the OSDs after setting the disks up as Raid0 with writeback enabled.
>
> Best,
>
> Frédéric.
>
> On 14/03/2018 at 20:42, Tim Bishop wrote:
>
>> I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
>> update to the PERC firmware disabled the disk write cache by default,
>> which made a noticeable difference to the latency on my disks (spinning
>> disks, not SSD) - by as much as a factor of 10.
>>
>> For reference their change list says:
>>
>> "Changes default value of drive cache for 6 Gbps SATA drive to disabled.
>> This is to align with the industry for SATA drives. This may result in a
>> performance degradation especially in non-Raid mode. You must perform an
>> AC reboot to see existing configurations change."
>>
>> It's fairly straightforward to re-enable the cache either in the PERC
>> BIOS, or by using hdparm, and doing so returns the latency back to what
>> it was before.
>>
>> Checking the Ceph documentation I can see that older versions [2]
>> recommended disabling the write cache for older kernels. But given I'm
>> using a newer kernel, and there's no mention of this in the Luminous
>> docs, is it safe to assume it's ok to enable the disk write cache now?
>>
>> If it makes a difference, I'm using a mixture of filestore and bluestore
>> OSDs - migration is still ongoing.
>>
>> Thanks,
>>
>> Tim.
>>
>> [1] - https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
>> [2] - http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush Bucket move crashes mons

2018-03-16 Thread Paul Emmerich
Hi,

looks like it fails to adjust the number of weight set entries when moving
the entries. The good news is that this is 100% reproducible with your
crush map:
you should open a bug at http://tracker.ceph.com/ to get this fixed.

Deleting the weight set fixes the problem. Moving the item manually with
manual adjustment of the weight set also works in my quick test.

Paul


2018-03-16 16:03 GMT+01:00 :

> Hi Paul
>
>
>
> Many thanks for the reply.
>
>
>
> The command is: crush move rack04  room=R80-Upper
>
>
>
> Crush map is here: https://pastebin.com/CX7GKtBy
>
> I’ve done some more testing, and the following all work:
>
> · Moving machines between the racks under the default root.
>
> · Renaming racks/hosts under the default root
>
> · Renaming the default root
>
> · Creating a new root
>
> · Adding rack05 and rack04 + hosts nina408 and nina508 into the
> new root
>
>
>
> But when trying to move  anything into the default root it fails.
>
>
>
> I have tried moving the following into default root:
>
> · Nina408 – with hosts in and without
>
> · Nina508 – with hosts in and without
>
> · Rack04
>
> · Rack05
>
> · Rack03 – which I created with nothing in it to try and move.
>
>
>
>
>
> Since first email, I have got the cluster to HEALTH_OK with reweight
> mapping drives, so everything cluster wise appears to be functioning fine.
>
>
>
> I have not tried manually editing the crush map and reimporting for the
> risk that it makes the cluster fall over, as this is currently in
> production. With the CLI I can at least cancel the command the monitor
> comes back up fine.
>
>
>
> Many thanks.
>
>
>
> Warren
>
>
>
>
>
> *From:* Paul Emmerich [mailto:paul.emmer...@croit.io]
> *Sent:* 16 March 2018 13:54
> *To:* Jeffs, Warren (STFC,RAL,ISIS) 
> *Cc:* ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] Crush Bucket move crashes mons
>
>
>
> Hi,
>
> the error looks like there might be something wrong with the device
> classes (which are managed via separate trees with magic names behind the
> scenes).
>
>
> Can you post your crush map and the command that you are trying to run?
>
> Paul
>
>
>
> 2018-03-15 16:27 GMT+01:00 :
>
> Hi All,
>
>
>
> Having some interesting challenges.
>
>
>
> I am trying to move 2 new nodes + 2 new racks into my default root, I have
> added them to the cluster outside of the Root=default.
>
>
>
> They are all in and up – happy it seems. The new nodes have all 12 OSDs in
> them and they are all ‘UP’
>
>
>
> So when going to move them into the correct room bucket under the
> default root they fail.
>
>
>
> This is the error log at the time: https://pastebin.com/mHfkEp3X
>
>
>
> I can create another host in the crush and move that in and out of rack
> buckets – all while being outside of the default root. Trying to move an
> empty Rack bucket into the default root fails too.
>
>
>
> All of the cluster is on 12.2.4. I do have 2 backfill full osds which is
> the reason for needing these disks in the cluster asap.
>
>
>
> Any thoughts?
>
>
>
> Cheers
>
>
>
> Warren Jeffs
>
>
>
> ISIS Infrastructure Services
>
> STFC Rutherford Appleton Laboratory
>
> e-mail:  warren.je...@stfc.ac.uk
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> --
> Paul Emmerich
>
> croit GmbH
> Freseniusstr. 31h
> 
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>



-- 
-- 
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SOLVED Re: Luminous "ceph-disk activate" issue

2018-03-16 Thread Fulvio Galeazzi

Hallo Paul, thanks for your tip, which guided me to success.
  I just needed to manually update via yum and restart the services: MONs 
first, then OSDs. I am happily running Luminous now, and have verified that 
ceph-ansible can add new disks.


  Thanks

Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Fulvio Galeazzi 
To: Paul Emmerich 
CC: Ceph Users 
Date: 03/16/2018 04:58 PM


Hallo Paul,
     You're correct of course, thanks!

   Ok tried to upgrade one MON (out of 3) to Luminous by:
  - removing the MON from the cluster
  - wiping ceph-common
  - running ceph-ansible with "ceph_stable_release: luminous"

but I am now stuck at "[ceph-mon : collect admin and bootstrap keys]". 
If I execute the command in the machine I am installing I see "machine 
is not in quorum: probing".


   Am a bit confused now: should I upgrade all 3 monitors at once? What 
if anything goes wrong during the upgrade? Or should I do a manual 
upgrade rather than using ceph-ansible?


   Thanks for your time and help!

     Fulvio

 Original Message 
Subject: Re: [ceph-users] Luminous "ceph-disk activate" issue
From: Paul Emmerich 
To: Fulvio Galeazzi 
CC: Ceph Users 
Date: 03/16/2018 03:23 PM


Hi,

2018-03-16 15:18 GMT+01:00 Fulvio Galeazzi >:


    Hallo,
     I am on Jewel 10.2.10 and willing to upgrade to Luminous. I
    thought I'd proceed same as for the upgrade to Jewel, by running
    ceph-ansible on OSD nodes one by one, then on MON nodes one by one.
         ---> Is this a sensible way to upgrade to Luminous?


no, that's the wrong order. See the Luminous release notes for the 
upgrade instructions. You'll need to start with the mons, otherwise 
the OSDs won't be able to start.



Paul


--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crush Bucket move crashes mons

2018-03-16 Thread warren.jeffs
Hi Paul,

Many thanks for the super quick replys and analysis on this.

Is it a case of removing the weights from the new hosts and their OSDs, then 
moving them and reweighting them correctly afterwards?

I already have a bug open; I will get this email chain added to it.

Warren

From: Paul Emmerich [paul.emmer...@croit.io]
Sent: 16 March 2018 16:48
To: Jeffs, Warren (STFC,RAL,ISIS)
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,

looks like it fails to adjust the number of weight set entries when moving the 
entries. The good news is that this is 100% reproducible with your crush map:
you should open a bug at http://tracker.ceph.com/ to get this fixed.

Deleting the weight set fixes the problem. Moving the item manually with manual 
adjustment of the weight set also works in my quick test.

Paul


2018-03-16 16:03 GMT+01:00 <warren.je...@stfc.ac.uk>:
Hi Paul

Many thanks for the reply.

The command is: crush move rack04  room=R80-Upper

Crush map is here: https://pastebin.com/CX7GKtBy

I’ve done some more testing, and the following all work:

• Moving machines between the racks under the default root.

• Renaming racks/hosts under the default root

• Renaming the default root

• Creating a new root

• Adding rack05 and rack04 + hosts nina408 and nina508 into the new root

But when trying to move  anything into the default root it fails.

I have tried moving the following into default root:

• Nina408 – with hosts in and without

• Nina508 – with hosts in and without

• Rack04

• Rack05

• Rack03 – which I created with nothing in it to try and move.


Since first email, I have got the cluster to HEALTH_OK with reweight mapping 
drives, so everything cluster wise appears to be functioning fine.

I have not tried manually editing the crush map and reimporting for the risk 
that it makes the cluster fall over, as this is currently in production. With 
the CLI I can at least cancel the command the monitor comes back up fine.

Many thanks.

Warren


From: Paul Emmerich 
[mailto:paul.emmer...@croit.io]
Sent: 16 March 2018 13:54
To: Jeffs, Warren (STFC,RAL,ISIS) <warren.je...@stfc.ac.uk>
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,

the error looks like there might be something wrong with the device classes 
(which are managed via separate trees with magic names behind the scenes).

Can you post your crush map and the command that you are trying to run?
Paul

2018-03-15 16:27 GMT+01:00 <warren.je...@stfc.ac.uk>:
Hi All,

Having some interesting challenges.

I am trying to move 2 new nodes + 2 new racks into my default root, I have 
added them to the cluster outside of the Root=default.

They are all in and up – happy it seems. The new nodes have all 12 OSDs in them 
and they are all ‘UP’

So when going to move them into the correct room bucket under the default 
root they fail.

This is the error log at the time: https://pastebin.com/mHfkEp3X

I can create another host in the crush and move that in and out of rack buckets 
– all while being outside of the default root. Trying to move an empty Rack 
bucket into the default root fails too.

All of the cluster is on 12.2.4. I do have 2 backfill full osds which is the 
reason for needing these disks in the cluster asap.

Any thoughts?

Cheers

Warren Jeffs

ISIS Infrastructure Services
STFC Rutherford Appleton Laboratory
e-mail:  warren.je...@stfc.ac.uk


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



--
--
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving OSDs between hosts

2018-03-16 Thread ceph
Hi jon,

Am 16. März 2018 17:00:09 MEZ schrieb Jon Light :
>Hi all,
>
>I have a very small cluster consisting of 1 overloaded OSD node and a
>couple MON/MGR/MDS nodes. I will be adding new OSD nodes to the cluster
>and
>need to move 36 drives from the existing node to a new one. I'm running
>Luminous 12.2.2 on Ubuntu 16.04 and everything was created with
>ceph-deploy.
>
>What is the best course of action for moving these drives? I have read
>some
>posts that suggest I can simply move the drive and once the new OSD
>node
>sees the drive it will update the cluster automatically.

I would give this a try. I had tested this scenario at the beginning of my 
cluster (Jewel/ceph-deploy/ceph-disk) and I was able to remove one OSD and put 
it in another node - udev did its magic.
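
If it helps, the conservative one-at-a-time sequence could look roughly like
this (a sketch — the OSD id and device name are placeholders):

ceph osd set noout                 # avoid the OSD being marked out while it moves
systemctl stop ceph-osd@12         # on the old host
# physically move the drive (and its journal, if separate) to the new host;
# udev/ceph-disk should activate it automatically, otherwise:
ceph-disk activate /dev/sdX1
ceph osd tree                      # the OSD should now sit under the new host bucket
ceph osd unset noout               # once the cluster is healthy again
# expect some data movement, since the OSD's CRUSH host has changed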

- Mehmet  

>
>Time isn't a problem and I want to minimize risk so I want to move 1
>OSD at
>a time. I was planning on stopping the OSD, moving it to the new host,
>and
>waiting for the OSD to become up and in and the cluster to be healthy.
>Are
>there any other steps I need to take? Should I do anything different?
>
>Thanks in advance
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Syslog logging date/timestamp

2018-03-16 Thread Marc Roos
 

I have no logging options configured in ceph.conf, yet I get syslog 
entries like these:

Mar 16 23:55:42 c01 ceph-osd: 2018-03-16 23:55:42.535796 7f2f5c53a700 -1 
osd.0 pg_epoch: 18949 pg[17.21( v 18949'4044827 
(18949'4043279,18949'4044827] local-lis/les=18910/18911 n=3125 
ec=3636/3636 lis/c 18910/18910 les/c/f 18911/18912/0 18910/18910/18910) 
[13,0,9] r=1 lpr=18910 luod=0'0 crt=18949'4044827 lcod 18949'4044826 
active] _scan_snaps no head for 
17:846274ce:::rbd_data.239f5274b0dc51.1d75:39 (have MIN)
Mar 16 23:55:42 c01 ceph-osd: 2018-03-16 23:55:42.535823 7f2f5c53a700 -1 
osd.0 pg_epoch: 18949 pg[17.21( v 18949'4044827 
(18949'4043279,18949'4044827] local-lis/les=18910/18911 n=3125 
ec=3636/3636 lis/c 18910/18910 les/c/f 18911/18912/0 18910/18910/18910) 
[13,0,9] r=1 lpr=18910 luod=0'0 crt=18949'4044827 lcod 18949'4044826 
active] _scan_snaps no head for 
17:846274ce:::rbd_data.239f5274b0dc51.1d75:26 (have MIN)

Should the date/timestamp not be omitted here? We already have this from 
the syslog server.
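
For comparison, the explicit syslog-related options look like this (a sketch
of what I'd expect to be relevant, not verified against this setup — the
entries above look more like daemon stderr being captured by journald/rsyslog
than Ceph's own syslog logging):

[global]
# off by default; when enabled, Ceph logs to syslog in addition to its log files
log to syslog = true
err to syslog = true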




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd recovery sleep helped us with limiting recovery impact

2018-03-16 Thread Alex Gorbachev
Hope this helps someone if your recovery is impacting client traffic.

We have been migrating OSD hosts and experiencing massive client
timeouts due to overwhelming recovery traffic in Jewel (3-4 GB/s), to
the point where Areca HBAs would seize up and crash the hosts.

Setting osd_recovery_sleep = 0.5 immediately relieved the problem.  I
tried the value of 1, but it slowed recovery too much.

This seems like a very important operational parameter to note.
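
(For anyone wanting to apply this: it can be injected at runtime or persisted
in ceph.conf — a sketch with the value from this mail; note that on Luminous
the option is split into per-device variants such as osd_recovery_sleep_hdd:)

ceph tell osd.* injectargs '--osd_recovery_sleep 0.5'

# or persistently:
[osd]
osd recovery sleep = 0.5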

--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com