Test complete. Civet still shows the same problem:
https://gist.github.com/spjmurray/88203f564389294b3774
"/admin/user?uid=admin" is fine
"/admin/user?quota&uid=admin"a-type=user" is not so good. Upgrade to
0.94.2 didn't solve the problem nor 9.0.2. Unless anyone knows anything
more I'll go a
Hi all,
I've read in the documentation that OSDs use around 512MB on a healthy
cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram).
Now, our OSDs are all using around 2GB of RAM while the cluster
is healthy.
PID USER PR NI VIRT RES SHR S %CPU %MEM
On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman wrote:
> Hi all,
>
> I've read in the documentation that OSDs use around 512MB on a healthy
> cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram).
> Now, our OSDs are all using around 2GB of RAM while the cluster is
> healthy.
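For comparison on your own nodes, two quick ways to see per-OSD memory use (the heap command is only available when the OSDs are built against tcmalloc):

  ps -C ceph-osd -o pid,rss,vsz,args   # resident/virtual size per OSD process
  ceph tell osd.0 heap stats           # allocator view for one OSD (tcmalloc builds only)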
Hi Greg + list,
Sorry to reply to this old'ish thread, but today one of these PGs bit
us in the ass.
Running hammer 0.94.2, we are deleting pool 36, and OSDs 30, 171,
and 69 all crash when trying to delete pg 36.10d. They all crash with
"ENOTEMPTY suggests garbage data in osd data dir"
(ful
I think you'll need to use the ceph-objectstore-tool to remove the
PG/data consistently, but I've not done this — David or Sam will need
to chime in.
-Greg
On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster wrote:
> Hi Greg + list,
>
> Sorry to reply to this old'ish thread, but today one of these
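For reference, a rough sketch of the ceph-objectstore-tool invocation Greg is hinting at (default paths and osd.30 are assumptions; only run it with the OSD stopped, and treat it as last-resort surgery):

  service ceph stop osd.30
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
      --journal-path /var/lib/ceph/osd/ceph-30/journal \
      --op remove --pgid 36.10d
  service ceph start osd.30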
Thanks for the quick reply.
We /could/ just wipe these OSDs and start from scratch (the only other
pools were 4+2 ec and recovery already brought us to 100%
active+clean).
But it'd be good to understand and prevent this kind of crash...
Cheers, Dan
On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum wrote:
A bit of progress: rm'ing everything from inside current/36.10d_head/
actually let the OSD start and continue deleting other PGs.
Cheers, Dan
On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster wrote:
> Thanks for the quick reply.
>
> We /could/ just wipe these OSDs and start from scratch (the onl
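In other words, roughly (a sketch of the workaround described above, with osd.30 and the default mount point as assumptions):

  service ceph stop osd.30
  rm -rf /var/lib/ceph/osd/ceph-30/current/36.10d_head/*
  service ceph start osd.30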
This is the same cluster I posted about back in April. Since then,
the situation has gotten significantly worse.
Here is what iostat looks like for the one active RBD image on this cluster:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await
On 07/17/2015 08:38 AM, J David wrote:
This is the same cluster I posted about back in April. Since then,
the situation has gotten significantly worse.
Here is what iostat looks like for the one active RBD image on this cluster:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
What does "ceph status" say? I had a problem with similar symptoms some
months ago that was accompanied by OSDs getting marked out for no apparent
reason and the cluster going into a HEALTH_WARN state intermittently.
Ultimately the root of the problem ended up being a faulty NIC. Once I took
that o
On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson wrote:
> rados -p 30 bench write
>
> just to see how it handles 4MB object writes.
Here's that, from the VM host:
Total time run: 52.062639
Total writes made: 66
Write size: 4194304
Bandwidth (MB/sec): 5.071
Stddev Ban
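For anyone who wants to repeat the test, the benchmark being discussed is of roughly this form (pool name and runtime are placeholders):

  # roughly what is being run above: 30 seconds of 4MB object writes
  rados bench -p <pool> 30 write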
Hi all,
I'm trying to rebuild ceph deb packages using 'dpkg-buildpackage -nc'.
Without '-nc' the compilation works fine but obviously takes a long time.
When I add the '-nc' option, I end up with following issues:
> ..
> ./check_version ./.git_version
> ./.git_version is up to date.
> CXXL
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman wrote:
> What does "ceph status" say?
Usually it says everything is cool. However just now it gave this:
cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
health HEALTH_WARN 2 requests are blocked > 32 sec
monmap e3: 3 mons at
{f16=192.
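To narrow down where the blocked requests are sitting, something like the following usually helps (osd.12 is just an example id):

  ceph health detail                       # lists the OSDs with blocked requests
  ceph daemon osd.12 dump_ops_in_flight    # on that OSD's host: show the stuck ops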
That looks a lot like what I was seeing initially. The OSDs getting marked
out was relatively rare and it took a bit before I saw it. I ended up
digging into the logs on the OSDs themselves to discover that they were
getting marked out. The messages were like "So-and-so incorrectly marked us
out" I
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman wrote:
> That looks a lot like what I was seeing initially. The OSDs getting marked
> out was relatively rare and it took a bit before I saw it.
Our problem occurs "most of the time" and does not appear confined to a
specific ceph cluster node or OSD:
David - I'm new to Ceph myself, so can't point out any smoking guns - but
your problem "feels" like a network issue. I suggest you check all of
your OSD/Mon/Client network interfaces. Check for errors, check that
they are negotiating the same link speed/type with your switches (if you
have LLD
Disclaimer: I'm relatively new to ceph, and haven't moved into
production with it.
Did you run your bench for 30 seconds?
For reference, my bench from a VM bridged to a 10Gig card with 90x4TB
at 30 seconds is:
Total time run: 30.766596
Total writes made: 1979
Write size:
I would say use the admin socket to find out which part is causing most of
the latency; also, don't rule out disk anomalies.
Thanks & Regards
Somnath
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: Friday, July 17, 2015 8:07 AM
To: Quent
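A sketch of the admin-socket digging Somnath is referring to (default socket path; osd.12 is just an example):

  ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump          # per-stage latency counters
  ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops  # slowest recent ops with per-step timings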
On 07/17/2015 09:55 AM, J David wrote:
On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson wrote:
rados -p 30 bench write
just to see how it handles 4MB object writes.
Here's that, from the VM host:
Total time run: 52.062639
Total writes made: 66
Write size: 4194304
On 7/16/15, 9:51 PM, "ceph-users on behalf of Goncalo Borges" wrote:
>Once I substituted the fqdn by simply the hostname (without the domain)
>it worked.
Goncalo,
I ran into the same problems too - and ended up bailing on the
"ceph-deploy" tools and manually building my clusters ... eventual
Thanks for your answers,
we will also experiment with osd recovery max active / threads and
will come back to you
Regards,
Kostis
On 16 July 2015 at 12:29, Jan Schermer wrote:
> For me setting recovery_delay_start helps during the OSD bootup _sometimes_,
> but it clearly does something differen
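For reference, those settings can be changed at runtime with injectargs; the values below are only examples, not recommendations:

  ceph tell osd.* injectargs \
      '--osd-recovery-max-active 1 --osd-recovery-threads 1 --osd-recovery-delay-start 10'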
Also, by running ceph osd perf, I see that fs_apply_latency is larger
than fs_commit_latency. Shouldn't that be the opposite? Apply latency
is AFAIK the time that it takes to apply updates to the file system
in page cache. Commitcycle latency is the time it takes to flush cache
on disks, right?
Yes, you will need to change 'osd' to 'host' in your CRUSH rule, as you
thought, so that copies are separated between hosts. You will keep running
into the problems you see until that is changed. It will cause data movement.
-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4
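Concretely, the change Robert describes is the chooseleaf type in the CRUSH rule; a rough sketch of the edit cycle (file names are arbitrary):

  ceph osd getcrushmap -o cm.bin
  crushtool -d cm.bin -o cm.txt
  # in cm.txt, change:  step chooseleaf firstn 0 type osd
  #               to:   step chooseleaf firstn 0 type host
  crushtool -c cm.txt -o cm.new
  ceph osd setcrushmap -i cm.new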
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson wrote:
> Maybe try some iperf tests between the different OSD nodes in your
> cluster and also the client to the OSDs.
This proved to be an excellent suggestion. One of these is not like the others:
f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound
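(The tests were of the usual iperf form, roughly:)

  iperf -s            # on the node under test
  iperf -c f18 -t 10  # from each of the other nodes; swap roles to test the other direction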
Other than those errors, do you find RBDs will not be unmapped on
system restart/shutdown on a machine using systemd, leaving the system
hanging, with no network connections, while trying to unmap the RBDs?
That's been my experience thus far, so I wrote an (overly simple)
systemd file to handle this on a pe
Yes, the RBDs are not remapped at system boot time. I haven't run into a VM or
system hang because of this, since I ran into it as part of investigating the
use of RHEL 7.1 as a client distro. Yes, remapping the RBDs in a startup script
worked around the issue.
> -Original Message-
> From: Stev
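A minimal sketch of the kind of unit file mentioned above (unit name, ordering, and the naive unmap loop are all assumptions, not a tested recipe):

  # /etc/systemd/system/rbd-unmap.service
  [Unit]
  Description=Unmap RBD devices before the network goes away
  After=network-online.target

  [Service]
  Type=oneshot
  RemainAfterExit=yes
  ExecStart=/bin/true
  # at shutdown, ExecStop runs before network-online.target is torn down
  ExecStop=/bin/sh -c 'for d in /dev/rbd[0-9]*; do [ -b "$d" ] && rbd unmap "$d"; done'

  [Install]
  WantedBy=multi-user.target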
Glad we were able to point you in the right direction! I would suspect a
borderline cable at this point. Did you happen to notice if the interface
had negotiated down to some dumb speed? If it had, I've seen cases where a
dodgy cable has caused an intermittent problem that causes it to negotiate
th
On Fri, 17 Jul 2015, J David wrote:
f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
f17 outbound: 6Gbps
f18 inbound: 6Gbps
f18 outbound: 1.2Mbps
Unless the network was very busy when you did this, I think that 6 Gb/s
may not be very good either. Usually iperf will give you much more
On 07/15/2015 11:48 AM, Shane Gibson wrote:
Somnath - thanks for the reply ...
:-) Haven't tried anything yet - just starting to gather
info/input/direction for this solution.
Looking at the S3 API info [2] - there is no mention of support for the
"S3a" API extensions - namely "rename" suppor
May I suggest also checking the error counters on your network switch?
Check speed and duplex. Is bonding in use? Is flow control on? Can you
swap the network cable? Can you swap a NIC with another node and does the
problem follow?
Hth, Alex
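On the host side, a couple of quick checks along the same lines (interface name is an assumption):

  ethtool eth0            # negotiated speed and duplex
  ip -s link show eth0    # RX/TX error and drop counters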
On Friday, July 17, 2015, Steve Thompson wrote:
>