Re: [ceph-users] After OSD Flap - FAILED assert(oi.version == i->first)

2016-12-01 Thread Paweł Sadowski
Hi, We see this error on Hammer 0.94.6. Bug report updated with logs. Thanks, On 11/15/2016 07:30 PM, Samuel Just wrote: > http://tracker.ceph.com/issues/17916 > > I just pushed a branch wip-17916-jewel based on v10.2.3 with some > additional debugging. Once it builds, would you be able to st

[ceph-users] node and its OSDs down...

2016-12-01 Thread M Ranga Swami Reddy
Hello, One of my ceph nodes with 20 OSDs went down... After a couple of hours, ceph health is in OK state. Now, I tried to remove those OSDs, which were in down state, from the ceph cluster... using the "ceph osd remove osd." Then the ceph cluster started rebalancing... which is strange, because those OSDs are down f
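
For reference, the usual sequence for taking a dead OSD out of the cluster looks roughly like the sketch below (the OSD id 12 is only a placeholder, and the OSD is assumed to already be down):

    ceph osd out osd.12            # no-op if it is already marked out
    ceph osd crush remove osd.12   # drop it from the CRUSH map (this changes weights)
    ceph auth del osd.12           # remove its cephx key
    ceph osd rm 12                 # remove it from the OSD map

The crush remove step alters the host's CRUSH weight, so some data movement at that point is expected; the replies later in this digest come back to that.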

Re: [ceph-users] Mount of CephFS hangs

2016-12-01 Thread John Spray
(Copying list back in) On Thu, Dec 1, 2016 at 10:22 AM, John Spray wrote: > On Wed, Nov 30, 2016 at 3:48 PM, Jens Offenbach wrote: >> Thanks a lot... "ceph daemon mds.<id> session ls" was a good starting point. >> >> What is happening: >> I am in an OpenStack environment and start a VM. Afterwards,
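
The admin-socket command referenced above, spelled out as a sketch (the daemon name "a" is a placeholder; run it on the machine hosting the active MDS):

    ceph daemon mds.a session ls   # lists client sessions with id, hostname and state

Matching the hostname or client id in that output against the hung mount is usually the quickest way to see whether the client still holds a session.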

[ceph-users] osd crash

2016-12-01 Thread VELARTIS Philipp Dürhammer
Hello! Tonight I had an OSD crash. See the dump below. Also this OSD is still mounted. What's the cause? A bug? What to do next? Thank You! Dec 1 00:31:30 ceph2 kernel: [17314369.493029] divide error: [#1] SMP Dec 1 00:31:30 ceph2 kernel: [17314369.493062] Modules linked in: act_police cl

Re: [ceph-users] osd crash

2016-12-01 Thread Nick Fisk
Are you using Ubuntu 16.04? (Guessing from your kernel version.) There was a NUMA bug in early kernels; try updating to the latest in the 4.4 series. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of VELARTIS Philipp Dürhammer Sent: 01 December 2016 12:04 To: 'ceph-us...

[ceph-users] osd crash - disk hangs

2016-12-01 Thread VELARTIS Philipp Dürhammer
Hello! Tonight I had an OSD crash. See the dump below. Also this OSD is still mounted. What's the cause? A bug? What to do next? I can't do an lsof or ps ax because it hangs. Thank You! Dec 1 00:31:30 ceph2 kernel: [17314369.493029] divide error: [#1] SMP Dec 1 00:31:30 ceph2 kernel: [17314

Re: [ceph-users] osd crash

2016-12-01 Thread VELARTIS Philipp Dürhammer
I am using Proxmox so I guess it's Debian. I will update the kernel; there are newer versions. But generally, if an OSD crashes like this - can it be hardware related? How do I dismount the disk? I can't even run ps ax or lsof - it hangs because my OSD is still mounted and blocks everything... i canno

[ceph-users] Deep-scrub cron job

2016-12-01 Thread Eugen Block
Hi list, I use the script from [1] to control the deep-scrubs myself in a cronjob. It seems to work fine, I get the "finished batch" message in /var/log/messages, but in every run I get an email from cron daemon with at least one line saying: 2016-11-30 21:40:59.271854 7f3d5700 0 mon
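
The script itself isn't reproduced in the archive, but the general idea can be sketched like this (assuming the PGs with the oldest deep-scrub stamp should be scrubbed first; the column layout of ceph pg dump varies between releases, so the awk fields may need adjusting):

    #!/bin/sh
    # deep-scrub the 10 PGs whose deep_scrub_stamp is oldest
    ceph pg dump pgs 2>/dev/null \
      | awk '/^[0-9]+\./ {print $1, $(NF-1), $NF}' \
      | sort -k2,3 \
      | head -n 10 \
      | while read pg date time; do
          ceph pg deep-scrub "$pg"
        done

If the extra lines cron mails out are only informational output on stderr from the ceph CLI, redirecting stderr (2>/dev/null) in the crontab entry will silence them.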

Re: [ceph-users] node and its OSDs down...

2016-12-01 Thread David Turner
I assume you also did ceph osd crush remove osd.. When you removed the OSD that was down/out and had been balanced off of, you changed the weight of the host it was on, which triggers additional backfilling to rebalance the CRUSH map.
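
One way to avoid that second wave of data movement, sketched here for the same hypothetical osd.12 as above, is to push the CRUSH weight to zero first, let the single rebalance finish, and only then remove the OSD:

    ceph osd crush reweight osd.12 0   # host weight drops now; one rebalance
    # wait for HEALTH_OK, then:
    ceph osd out osd.12
    ceph osd crush remove osd.12       # weight is already 0, so no further movement
    ceph auth del osd.12
    ceph osd rm 12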

Re: [ceph-users] pgs unfound

2016-12-01 Thread Xabier Elkano
Hi, I managed to remove the warning by reweighting the crashed OSD: ceph osd crush reweight osd.33 0.8 After the recovery, the cluster is no longer showing the warning. Xabier On 29/11/16 11:18, Xabier Elkano wrote: > Hi all, > > my cluster is in WARN state because apparently there are some
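
For anyone checking their own cluster against this, the weight can be inspected before it is set back (osd.33 and 0.8 are the values from the thread; the CRUSH weight normally reflects the disk size in TB):

    ceph osd tree | grep osd.33          # shows the current CRUSH weight and up/down state
    ceph osd crush reweight osd.33 0.8   # restore the intended weight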

Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Frédéric Nass
Hi Sage, Sam, We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 and is no longer able to scrub or deep-scrub. [1] http://tracker.ceph.com/issues/17859 [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007 [3] https://github.com/ceph/ceph/pull/11898 I'm worri

Re: [ceph-users] osd crash - disk hangs

2016-12-01 Thread Warren Wang - ISD
You’ll need to upgrade your kernel. It’s a terrible div by zero bug that occurs while trying to calculate load. You can still use “top -b -n1” instead of ps, but ultimately the kernel update fixed it for us. You can’t kill procs that are in uninterruptible wait. Here’s the Ubuntu version: http

Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Yoann Moulin
Hello, > We're impacted by this bug (case 01725311). Our cluster is running RHCS 2.0 > and is no more capable to scrub neither deep-scrub. > > [1] http://tracker.ceph.com/issues/17859 > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1394007 > [3] https://github.com/ceph/ceph/pull/11898 > > I'm

Re: [ceph-users] Adding second interface to storage network - issue

2016-12-01 Thread Warren Wang - ISD
Jumbo frames for the cluster network have been run by quite a few operators without any problems. Admittedly, I’ve not run it that way in a year now, but we plan on switching back to jumbo for the cluster. I do agree that jumbo on the public network could result in poor behavior from clients, if you’re
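
For anyone wanting to confirm jumbo frames end to end before moving the cluster network over, a quick sketch (interface name and peer address are placeholders):

    ip link set dev eth1 mtu 9000          # cluster-network interface on each node
    ping -M do -s 8972 -c 3 192.168.10.2   # 8972 = 9000 - 20 (IP) - 8 (ICMP); must not fragment

If these pings fail while smaller ones pass, a switch or host in the path is still at MTU 1500.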

[ceph-users] Wrong pg count when pg number is large

2016-12-01 Thread Craig Chi
Hi list, I am testing the Ceph cluster with impractical pg numbers to do some experiments. But when I use ceph -w to watch my cluster status, I see the pg numbers doubled. From my ceph -w: root@mon1:~# ceph -w cluster 1c33bf75-e080-4a70-9fd8-860ff216f595 health HEALTH_WARN too many PGs per OSD (514
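
The warning itself comes from a simple rule of thumb: every pool contributes pg_num x replica size placements, and the per-OSD total is compared against mon_pg_warn_max_per_osd (300 by default in this era of releases). A made-up example of the arithmetic:

    # 2 pools of 4096 PGs, size 3, spread over 48 OSDs
    echo $(( 2 * 4096 * 3 / 48 ))   # -> 512 placements per OSD, well over the 300 warning threshold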

Re: [ceph-users] - cluster stuck and undersized if at least one osd is down

2016-12-01 Thread Piotr Dzionek
Ok, you convinced me to increase size to 3 and min_size to 2. During my time running ceph I have only had issues like single disk or host failures - nothing exotic, but I think it is better to be safe than sorry. Kind regards, Piotr Dzionek On 30.11.2016 at 12:16, Nick Fisk wrote: -Origina
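
For reference, switching an existing pool over is two commands (the pool name rbd is a placeholder; expect data movement as the third copies are created):

    ceph osd pool set rbd size 3       # three copies of every object
    ceph osd pool set rbd min_size 2   # keep serving I/O with one copy missing

With size 3 / min_size 2, a single disk or host failure leaves the pool fully available, and with a second overlapping failure the affected PGs pause I/O rather than accepting writes on a single remaining copy.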

Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Vasu Kulkarni
On Thu, Dec 1, 2016 at 7:24 AM, Frédéric Nass < frederic.n...@univ-lorraine.fr> wrote: > > Hi Sage, Sam, > > We're impacted by this bug (case 01725311). Our cluster is running RHCS > 2.0 and is no more capable to scrub neither deep-scrub. > > [1] http://tracker.ceph.com/issues/17859 > [2] https://

Re: [ceph-users] stalls caused by scrub on jewel

2016-12-01 Thread Frédéric Nass
Hi Yoann, Thank you for your input. I was just told by RH support that it’s gonna make it to RHCS 2.0 (10.2.3). Thank you guys for the fix ! We thought about increasing the number of PGs just after changing the merge/split threshold values but this would have led to a _lot_ of data movements (

[ceph-users] rbd_default_features

2016-12-01 Thread Tomas Kukral
Hi, I was using Hammer on some clients and Jewel on others, even though it is NOT recommended. I'd like to recommend you triple check your rbd_default_features in case you are mixing versions. This option isn't documented well and it is easy to miss. I understand the reasons for this chang
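
A concrete way to pin this down when Hammer-era clients are still around, sketched with placeholder names (3 = layering + striping, the old default; Jewel's default of 61 adds features that older kernels and Hammer librbd cannot handle):

    # ceph.conf on the clients that create RBD images
    [client]
    rbd default features = 3

    # inspect what an existing image actually has
    rbd info rbd/myimage

Images already created with the Jewel defaults can have the extra features stripped with rbd feature disable (mind the dependencies: fast-diff and object-map have to come off before exclusive-lock).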

[ceph-users] Migrate OSD Journal to SSD

2016-12-01 Thread Reed Dier
Apologies if this has been asked dozens of times before, but most answers are from pre-Jewel days, and I want to double check that the methodology still holds. Currently have 16 OSDs across 8 machines with on-disk journals, created using ceph-deploy. These machines have NVMe storage (Intel P3600
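
Assuming FileStore OSDs created by ceph-deploy as described above, the journal move is usually sketched like this per OSD; the id 12 and the partition path are placeholders, and on Jewel the journal partition must be owned by the ceph user:

    ceph osd set noout                     # avoid rebalancing while the OSD is briefly down
    systemctl stop ceph-osd@12             # or the upstart/sysvinit equivalent
    ceph-osd -i 12 --flush-journal         # drain the old on-disk journal
    rm /var/lib/ceph/osd/ceph-12/journal
    ln -s /dev/disk/by-partuuid/<uuid> /var/lib/ceph/osd/ceph-12/journal
    ceph-osd -i 12 --mkjournal             # initialize the journal on the new partition
    systemctl start ceph-osd@12
    ceph osd unset noout

Using the by-partuuid path keeps the symlink valid if device names shuffle on reboot.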

Re: [ceph-users] Migrate OSD Journal to SSD

2016-12-01 Thread Christian Balzer
On Thu, 1 Dec 2016 18:06:38 -0600 Reed Dier wrote: > Apologies if this has been asked dozens of times before, but most answers are > from pre-Jewel days, and want to double check that the methodology still > holds. > It does. > Currently have 16 OSD’s across 8 machines with on-disk journals, c

[ceph-users] ceph - even filling disks

2016-12-01 Thread Мобилон
Good day. I have set up a Ceph storage cluster and created several pools on 4 TB HDDs. My problem is that the disks are filling unevenly. root@ceph-node1:~# df -H Filesystem Size Used Avail Use% Mounted on /dev/sda1 236G 2.7G 221G 2% / none 4.1k 0 4.1k 0% /sys/fs/cgroup

Re: [ceph-users] ceph - even filling disks

2016-12-01 Thread John Petrini
You can reweight the OSDs either automatically based on utilization (ceph osd reweight-by-utilization) or by hand. See: https://ceph.com/planet/ceph-osd-reweight/ http://docs.ceph.com/docs/master/rados/operations/control/#osd-subsystem It's probably not ideal to have OSDs of such different size
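
Both forms in one place, for reference (the 120 threshold and osd id are examples; note that ceph osd reweight changes the temporary 0-1 override weight, while ceph osd crush reweight changes the permanent CRUSH weight):

    ceph osd reweight-by-utilization 120   # only touch OSDs more than 20% above average utilization
    ceph osd reweight 7 0.9                # manually lower one overfull OSD's override weight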

Re: [ceph-users] node and its OSDs down...

2016-12-01 Thread M Ranga Swami Reddy
Hi David - Yep, I did the "ceph osd crush remove osd.", which started the recovery. My worry is - why is Ceph doing recovery if the OSD is already down and no longer in the cluster? That means Ceph had already copied the down OSD's objects to other OSDs... here is the ceph osd tree o/p: ===