I suggest you look at the current offload settings on the NIC.
There have been quite a few bugs when bonding, VLANs, bridges, etc. are involved
- sometimes you have to set the offload settings on the logical interface (like
bond0.vlan_id) and not on the NIC.
Even then, what I’ve seen is that if th
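As a rough sketch of checking and changing offloads (eth0, bond0 and
bond0.100 below are placeholder names, not taken from the thread):

    # show current offload settings on the physical NIC and on the logical interface
    ethtool -k eth0
    ethtool -k bond0.100

    # example: disable segmentation/receive offloads on the interface
    # that actually carries the traffic
    ethtool -K bond0.100 tso off gso off gro off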
May I also suggest checking the error counters on your network switch?
Check speed and duplex. Is bonding in use? Is flow control on? Can you
swap the network cable? Can you swap a NIC with another node, and does the
problem follow?
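For example, something along these lines on each node (eth0/bond0 are
placeholders):

    # error and drop counters on the host interfaces
    ip -s link show eth0
    ethtool -S eth0 | grep -iE 'err|drop|fifo'

    # flow control (pause frame) settings
    ethtool -a eth0

    # bonding mode and per-slave status, if bonding is in use
    cat /proc/net/bonding/bond0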
Hth, Alex
On Friday, July 17, 2015, Steve Thompson wrote:
On Fri, 17 Jul 2015, J David wrote:
f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
f17 outbound: 6Gbps
f18 inbound: 6Gbps
f18 outbound: 1.2Mbps
Unless the network was very busy when you did this, I think that 6 Gb/s
may not be very good either. Usually iperf will give you much more
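For comparison, a simple way to measure each direction between two nodes
(hostnames f17/f18 are from the thread; classic iperf assumed, iperf3 uses
-R for the reverse direction instead of -r):

    # on f18
    iperf -s

    # on f17: 30-second test f17 -> f18, then the -r run adds the reverse direction
    iperf -c f18 -t 30
    iperf -c f18 -t 30 -r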
Glad we were able to point you in the right direction! I would suspect a
borderline cable at this point. Did you happen to notice if the interface
had negotiated down to some dumb speed? If it had, I've seen cases where a
dodgy cable has caused an intermittent problem that causes it to negotiate
th
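A quick way to check for that (eth0 is a placeholder for the interface
carrying the Ceph traffic):

    # negotiated link speed and duplex
    ethtool eth0 | grep -E 'Speed|Duplex'

    # look for link flaps in the kernel log
    dmesg | grep -iE 'eth0|link is (up|down)'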
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson wrote:
> Maybe try some iperf tests between the different OSD nodes in your
> cluster and also the client to the OSDs.
This proved to be an excellent suggestion. One of these is not like the others:
f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
On 07/17/2015 09:55 AM, J David wrote:
On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson wrote:
rados -p 30 bench write
just to see how it handles 4MB object writes.
Here's that, from the VM host:
Total time run: 52.062639
Total writes made: 66
Write size: 4194304
To: Quentin Hartman
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Deadly slow Ceph cluster revisited
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
wrote:
> What does "ceph status" say?
Usually it says everything is cool. However, just now it gave this:
cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
Disclaimer: I'm relatively new to Ceph and haven't moved into
production with it.
Did you run your bench for 30 seconds?
For reference my bench from a VM bridged to a 10Gig card with 90x4TB
at 30 seconds is:
Total time run: 30.766596
Total writes made: 1979
Write size:
David - I'm new to Ceph myself, so I can't point out any smoking guns - but
your problem "feels" like a network issue. I suggest you check all of
your OSD/Mon/client network interfaces. Check for errors, and check that
they are negotiating the same link speed/type with your switches (if you
have LLD
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman
wrote:
> That looks a lot like what I was seeing initially. The OSDs getting marked
> out was relatively rare and it took a bit before I saw it.
Our problem is "most of the time" and does not appear confined to a
specific ceph cluster node or OSD:
That looks a lot like what I was seeing initially. The OSDs getting marked
out was relatively rare and it took a bit before I saw it. I ended up
digging into the logs on the OSDs themselves to discover that they were
getting marked out. The messages were like "So-and-so incorrectly marked us
out" I
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
wrote:
> What does "ceph status" say?
Usually it says everything is cool. However, just now it gave this:
cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
health HEALTH_WARN 2 requests are blocked > 32 sec
monmap e3: 3 mons at
{f16=192.
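To see where those blocked requests are sitting, something along these
lines usually helps (exact wording of the output varies by release):

    # which OSDs the slow/blocked requests are attributed to
    ceph health detail

    # per-OSD commit/apply latency, to spot a single slow disk or node
    ceph osd perf

    # watch cluster state changes as they happen
    ceph -w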
On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson wrote:
> rados -p 30 bench write
>
> just to see how it handles 4MB object writes.
Here's that, from the VM host:
Total time run: 52.062639
Total writes made: 66
Write size: 4194304
Bandwidth (MB/sec): 5.071
Stddev Ban
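For what it's worth, the usual form of that benchmark looks something like
the following (the pool name is a placeholder; --no-cleanup keeps the
objects around so a read test can follow):

    # 30-second write test with the default 4MB objects
    rados bench -p <pool> 30 write --no-cleanup

    # sequential read test against the objects just written, then clean up
    rados bench -p <pool> 30 seq
    rados -p <pool> cleanup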
What does "ceph status" say? I had a problem with similar symptoms some
months ago that was accompanied by OSDs getting marked out for no apparent
reason and the cluster going into a HEALTH_WARN state intermittently.
Ultimately the root of the problem ended up being a faulty NIC. Once I took
that o
On 07/17/2015 08:38 AM, J David wrote:
This is the same cluster I posted about back in April. Since then,
the situation has gotten significantly worse.
Here is what iostat looks like for the one active RBD image on this cluster:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s
avgrq-sz avgqu-sz await r_await w_await
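For reference, a typical way to capture that inside the VM (the device name
is a guess; iostat comes from the sysstat package):

    # extended per-device stats in MB, every 5 seconds, for the RBD-backed disk
    iostat -xm /dev/vdb 5

High await/w_await combined with low rkB/s and wkB/s generally means requests
are queued waiting on the backend (the RBD/network path) rather than on the
guest itself.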