[ceph-users] Hammer upgrade stuck all OSDs down

2017-04-12 Thread Siniša Denić
Hi to all, my cluster got stuck after upgrading from hammer 0.94.5 to luminous. It seems the osds are somehow stuck at the hammer version despite $ceph-osd --version reporting ceph version 12.0.1 (5456408827a1a31690514342624a4ff9b66be1d5). All OSDs are down in preboot state, and every osd log says "osdmap SORTBITWISE OS
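A rough checklist for this situation, assuming the OSDs really are running a newer binary but refusing to join (the osd id below is only an example):

  $ ceph-osd --version                 # version of the installed binary
  $ ceph daemon osd.0 version          # version of the running daemon, if it is up at all
  $ ceph osd dump | grep flags         # is sortbitwise already set on the osdmap?
  # sortbitwise should only be set once every OSD in the cluster runs jewel or newer
  $ ceph osd set sortbitwise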

Re: [ceph-users] Socket errors, CRC, lossy con messages

2017-04-12 Thread Ilya Dryomov
On Tue, Apr 11, 2017 at 3:10 PM, Alex Gorbachev wrote: > Hi Ilya, > > On Tue, Apr 11, 2017 at 4:06 AM, Ilya Dryomov wrote: >> On Tue, Apr 11, 2017 at 4:01 AM, Alex Gorbachev >> wrote: >>> On Mon, Apr 10, 2017 at 2:16 PM, Alex Gorbachev >>> wrote: I am trying to understand the cause of a

[ceph-users] Mon not starting after upgrading to 10.2.7

2017-04-12 Thread Nick Fisk
Hi, I just upgraded one of my mons to 10.2.7 and it is now failing to start properly. What's really odd is that all the mon-specific commands are now missing from the admin socket. ceph --admin-daemon /var/run/ceph/ceph-mon.gp-ceph-mon2.asok help { "config diff": "dump diff of current config and d
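If the mon-specific commands are missing from the admin socket, the daemon may simply not have finished starting; a couple of checks that should be safe to run (socket path taken from the post above, log path assumed to follow the default naming):

  $ ceph --admin-daemon /var/run/ceph/ceph-mon.gp-ceph-mon2.asok mon_status
  $ ceph --admin-daemon /var/run/ceph/ceph-mon.gp-ceph-mon2.asok quorum_status
  $ tail -f /var/log/ceph/ceph-mon.gp-ceph-mon2.log    # watch for store conversion or sync progress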

Re: [ceph-users] Mon not starting after upgrading to 10.2.7

2017-04-12 Thread Dan van der Ster
Can't help, but just wanted to say that the upgrade worked for us: # ceph health HEALTH_OK # ceph tell mon.* version mon.p01001532077488: ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) mon.p01001532149022: ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185) mon.p01001532

Re: [ceph-users] Mon not starting after upgrading to 10.2.7

2017-04-12 Thread Nick Fisk
Thanks Dan, I've just managed to fix it. It looks like the upgrade process required some extra RAM; the mon node was heavily swapping, so I think it was just stalled rather than broken. Once it came back up, RAM usage dropped by a lot. Nick > -Original Message- > From: Dan van der Ste
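For what it's worth, this kind of stall is usually visible with plain OS tools while the mon is working through its store (nothing ceph-specific about these):

  $ free -m                        # how much swap is in use
  $ vmstat 5                       # non-zero si/so columns mean active swapping
  $ top -p $(pgrep ceph-mon)       # resident memory of the mon process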

Re: [ceph-users] Hammer upgrade stuck all OSDs down

2017-04-12 Thread Richard Hesketh
On 12/04/17 09:47, Siniša Denić wrote: > Hi to all, my cluster got stuck after upgrade from hammer 0.94.5 to luminous. > It seems the osds are somehow stuck at the hammer version despite > > Can I somehow overcome this situation and what could have happened during the > upgrade? > I performed upgrade from hammer by

[ceph-users] failed lossy con, dropping message

2017-04-12 Thread Laszlo Budai
Hello, yesterday one of our compute nodes recorded the following message for one of the ceph connections: submit_message osd_op(client.28817736.0:690186 rbd_data.15c046b11ab57b7.00c4 [read 2097152~380928] 3.6f81364a ack+read+known_if_redirected e3617) v5 remote, 10.12.68.71:68
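When trying to make sense of a one-off message like this, it can help to first rule out plain network trouble on the client node; these are generic checks, and the interface name is only an example:

  $ dmesg | grep -i 'link'                       # NIC flaps around the time of the message
  $ ethtool -S eth0 | grep -i -e err -e drop     # low-level error and drop counters
  $ netstat -s | grep -i retrans                 # TCP retransmission counters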

[ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread Jogi Hofmüller
Dear all, we run a small cluster [1] that is exclusively used for virtualisation (kvm/libvirt). Recently we started to run into performance problems (slow requests, failing OSDs) for no *obvious* reason (at least not for us). We do nightly snapshots of VM images and keep the snapshots for 14 days
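A couple of commands that usually help narrow down where slow requests are sitting while they happen (the osd id is only an example):

  $ ceph health detail                       # names the OSDs with blocked requests
  $ ceph daemon osd.3 dump_ops_in_flight     # ops currently stuck on that OSD
  $ ceph daemon osd.3 dump_historic_ops      # recently completed slow ops, with per-step timings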

[ceph-users] PG calculator improvement

2017-04-12 Thread Frédéric Nass
Hi, I wanted to share a bad experience we had due to how the PG calculator works. When we set up our production cluster months ago, we had to decide on the number of PGs to give each pool in the cluster. As you know, the PG calc recommends giving a lot of PGs to pools that are heavy in size,
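For context, the arithmetic the PG calc is based on, as I understand it: it targets roughly 100 PGs per OSD, divides them across pools by the share of data each pool is expected to hold, then rounds to a nearby power of two. A worked example with made-up numbers:

  # 100 OSDs, pool expected to hold 40% of the data, replica size 3
  $ awk 'BEGIN { print (100 * 100 * 0.40) / 3 }'    # ~1333, so 1024 or 2048 depending on rounding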

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread David Turner
I can almost guarantee what you're seeing is PG subfolder splitting. When the subfolders in a PG reach X number of objects, they split into 16 subfolders. Every cluster I manage has blocked requests and OSDs that get marked down while this is happening. To stop the OSDs getting marked down, I incre
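For reference, the filestore options that govern this behaviour; I'm assuming these are the knobs meant here, and the values below are only a commonly quoted example, not a recommendation:

  [osd]
  # jewel defaults are merge threshold 10 and split multiple 2;
  # larger values postpone subfolder splits (and make each split bigger when it finally happens)
  filestore merge threshold = 40
  filestore split multiple = 8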

Re: [ceph-users] Socket errors, CRC, lossy con messages

2017-04-12 Thread Alex Gorbachev
Hi Ilya, On Wed, Apr 12, 2017 at 4:58 AM Ilya Dryomov wrote: > On Tue, Apr 11, 2017 at 3:10 PM, Alex Gorbachev > wrote: > > Hi Ilya, > > > > On Tue, Apr 11, 2017 at 4:06 AM, Ilya Dryomov > wrote: > >> On Tue, Apr 11, 2017 at 4:01 AM, Alex Gorbachev > wrote: > >>> On Mon, Apr 10, 2017 at 2:16

Re: [ceph-users] Socket errors, CRC, lossy con messages

2017-04-12 Thread Ilya Dryomov
On Wed, Apr 12, 2017 at 4:28 PM, Alex Gorbachev wrote: > Hi Ilya, > > On Wed, Apr 12, 2017 at 4:58 AM Ilya Dryomov wrote: >> >> On Tue, Apr 11, 2017 at 3:10 PM, Alex Gorbachev >> wrote: >> > Hi Ilya, >> > >> > On Tue, Apr 11, 2017 at 4:06 AM, Ilya Dryomov >> > wrote: >> >> On Tue, Apr 11, 2017

Re: [ceph-users] python3-rados

2017-04-12 Thread Gerald Spencer
Ah, I'm running Jewel. Is there any information online about python3-rados with Kraken? I'm having difficulties finding more than I initially posted. On Mon, Apr 10, 2017 at 10:37 PM, Wido den Hollander wrote: > > > On 8 April 2017 at 04:03, Gerald Spencer wrote: > > > > > > Do the rados binding
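A quick way to check what a given release actually ships, assuming a Debian/Ubuntu host pointed at the upstream ceph repos (the package name may differ elsewhere):

  $ apt-cache policy python3-rados                      # is the package available in your configured repo?
  $ python3 -c 'import rados; print(rados.__file__)'    # does the binding import at all?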

[ceph-users] Adding a new rack to crush map without pain?

2017-04-12 Thread Matthew Vernon
Hi, Our current (jewel) CRUSH map has rack / host / osd (and the default replication rule does step chooseleaf firstn 0 type rack). We're shortly going to be adding some new hosts in new racks, and I'm wondering what the least-painful way of getting the new osds associated with the correct (new) r
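One approach that I believe keeps the pain down is to create the rack buckets ahead of time and have the new OSDs register under the right rack via their crush location; a sketch, with bucket and host names as examples only:

  $ ceph osd crush add-bucket rack3 rack
  $ ceph osd crush move rack3 root=default
  # on the new hosts, so their OSDs land under the new rack at creation time:
  [osd]
  crush location = root=default rack=rack3 host=newhost1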

Re: [ceph-users] rbd iscsi gateway question

2017-04-12 Thread Cédric Lemarchand
On Mon, 2017-04-10 at 12:13 -0500, Mike Christie wrote: > > > LIO-TCMU+librbd-iscsi [1] [2] looks really promising and seems to > > be the > > way to go. It would be great if somebody has insight about the > > maturity > > of the project, is it ready for testing purposes? > > > > It is not matur

Re: [ceph-users] failed lossy con, dropping message

2017-04-12 Thread Alex Gorbachev
Hi Laszlo, On Wed, Apr 12, 2017 at 6:26 AM Laszlo Budai wrote: > Hello, > > yesterday one of our compute nodes has recorded the following message for > one of the ceph connections: > > submit_message osd_op(client.28817736.0:690186 > rbd_data.15c046b11ab57b7.00c4 [read 2097152~380928

[ceph-users] Recurring OSD crash on bluestore

2017-04-12 Thread Musee Ullah
Hi, One of the OSDs in my bluestore-enabled ceph cluster (on 11.2.0) started crashing. I uploaded a log at https://up.lae.is/p/1492018430-6d126.txt (was too large to attach). It looks like there's a failed assert before the abort, but I couldn't tell what was being asserted. I tried str
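To pull the actual assert text out of the log and get a more verbose trace, something along these lines usually works (the osd id is an example, and the debug levels are just a starting point):

  $ grep -B2 -A10 'FAILED assert' /var/log/ceph/ceph-osd.12.log
  # then restart the affected OSD with extra logging in ceph.conf:
  [osd.12]
  debug bluestore = 20
  debug bdev = 20
  debug osd = 10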

[ceph-users] saving file on cephFS mount using vi takes pause/time

2017-04-12 Thread Deepak Naidu
Folks, this is a bit of a weird issue. I am using the cephFS volume to read and write files etc. and it's quick, less than a second. But when editing a file on the cephFS volume using vi, saving the file takes a couple of seconds, something like a sync (flush). The same doesn't happen on a local filesystem
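My guess is that vi fsync()s the file when saving, and on CephFS an fsync has to wait for the OSDs to acknowledge the write, while a plain buffered write returns immediately. A quick way to test that theory on the mount (the path is only an example):

  $ time dd if=/dev/zero of=/mnt/cephfs/fsync-test bs=4k count=1 conv=fsync   # forces a flush, like a save in vi
  $ time dd if=/dev/zero of=/mnt/cephfs/plain-test bs=4k count=1              # buffered write, no flush
  $ rm /mnt/cephfs/fsync-test /mnt/cephfs/plain-test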

Re: [ceph-users] failed lossy con, dropping message

2017-04-12 Thread Laszlo Budai
Hi Alex, I saw your thread, but I think mine is a little bit different. I have only one message so far, and I want to better understand the issue. I would like to see whether there are any tunable parameters that could be adjusted to have an influence on this behavior. Kind regards, Laszlo On

Re: [ceph-users] failed lossy con, dropping message

2017-04-12 Thread Gregory Farnum
On Wed, Apr 12, 2017 at 3:00 AM, Laszlo Budai wrote: > Hello, > > yesterday one of our compute nodes has recorded the following message for > one of the ceph connections: > > submit_message osd_op(client.28817736.0:690186 > rbd_data.15c046b11ab57b7.00c4 [read 2097152~380928] 3.6f81364a

Re: [ceph-users] Socket errors, CRC, lossy con messages

2017-04-12 Thread Alex Gorbachev
On Wed, Apr 12, 2017 at 10:51 AM, Ilya Dryomov wrote: > On Wed, Apr 12, 2017 at 4:28 PM, Alex Gorbachev > wrote: >> Hi Ilya, >> >> On Wed, Apr 12, 2017 at 4:58 AM Ilya Dryomov wrote: >>> >>> On Tue, Apr 11, 2017 at 3:10 PM, Alex Gorbachev >>> wrote: >>> > Hi Ilya, >>> > >>> > On Tue, Apr 11, 2