Hi Chris,

That's an interesting point; I bet the managed switches don't have jumbo frames enabled.
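If we ever do move the nodes back across the fiber, the first thing I'll try is a do-not-fragment ping at jumbo size between an OSD node on each side, just to prove the path MTU end to end. Something along these lines (the hostname is a placeholder, and 9000-byte jumbo frames on the storage interfaces are an assumption about our setup):

  # 8972 = 9000 bytes minus the 20-byte IP header and 8-byte ICMP header
  ping -M do -s 8972 -c 3 osd-node-at-colo
  # control test at standard MTU to confirm basic reachability
  ping -M do -s 1472 -c 3 osd-node-at-colo

If the jumbo-sized ping dies on the segment where the ISP-managed switches sit, that would point straight at the frame size.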
I think I am going to leave everything at our colo for now.

Cheers,
Mike

On Tue, Oct 11, 2016 at 2:42 PM, Chris Taylor <ctay...@eyonic.com> wrote:
>
> I see on this list often that peering issues are related to networking and
> MTU sizes. Perhaps the HP 5400's or the managed switches did not have jumbo
> frames enabled?
>
> Hope that helps you determine the issue in case you want to move the nodes
> back to the other location.
>
> Chris
>
> On 2016-10-11 2:30 pm, Mike Jacobacci wrote:
>
> Hi Goncalo,
>
> Thanks for your reply! I finally figured out that our issue was with the
> physical setup of the nodes. We had one OSD and MON node in our office and
> the others are co-located at our ISP. We have an almost dark fiber going
> between our two buildings connected via HP 5400's, but it really isn't dark
> since there are some switches in between doing VLAN rewriting (ISP managed).
>
> Even though all the interfaces were communicating without issue, no data
> would move across the nodes. I ended up moving all the nodes into the same
> rack and data immediately started moving, and the cluster is now working!
> So it seems the storage traffic was being dropped/blocked by something on
> our ISP's side.
>
> Cheers,
> Mike
>
> On Mon, Oct 10, 2016 at 5:22 PM, Goncalo Borges
> <goncalo.bor...@sydney.edu.au> wrote:
>
>> Hi Mike...
>>
>> I was hoping that someone with a bit more experience would answer you,
>> since I have never had a similar situation. So I'll try to step in and help.
>>
>> The peering process means that the OSDs are agreeing on the state of
>> objects in the PGs they share. Peering can take some time and is a hard
>> operation to execute from a Ceph point of view, especially if a lot of
>> peering happens at the same time. This is also one of the reasons why pg
>> increases should be done in very small steps (normally increases of 256 pgs).
>>
>> Is your cluster slowly decreasing the number of pgs in peering, and is the
>> number of active pgs increasing? If you see no evolution at all after this
>> much time, you may have a problem.
>>
>> Pgs which do not leave the peering state usually point to:
>> - an incorrect crush map
>> - issues in the osds
>> - issues with the network
>>
>> Check that your network is working as expected and that you do not have
>> firewalls blocking traffic, and so on.
>>
>> A pg query for one of those peering pgs may provide some further
>> information about what could be wrong.
>>
>> Looking at the osd logs may also shed a bit of light.
>>
>> Cheers
>> Goncalo
>>
>> ________________________________________
>> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Mike
>> Jacobacci [mi...@flowjo.com]
>> Sent: 10 October 2016 01:55
>> To: ceph-us...@ceph.com
>> Subject: [ceph-users] New OSD Nodes, pgs haven't changed state
>>
>> Hi,
>>
>> Yesterday morning I added two more OSD nodes and changed the crushmap
>> failure domain from disk to node. It looked to me like everything went OK
>> besides some disks missing that I can re-add later, but the cluster status
>> hasn't changed since then.
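>>
>> In case it helps, these are roughly the commands I can run to pull more
>> detail on the stuck pgs (the pg id in the query is just a placeholder for
>> one of the peering pgs):
>>
>>   ceph health detail
>>   ceph pg dump_stuck inactive
>>   ceph pg <pgid> query
>>
>> The pg query output in particular should show which OSDs each pg is
>> peering with and whether it is blocked on any of them.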
>>
>> Here is the output of ceph -w:
>>
>>     cluster 395fb046-0062-4252-914c-013258c5575c
>>      health HEALTH_ERR
>>             1761 pgs are stuck inactive for more than 300 seconds
>>             1761 pgs peering
>>             1761 pgs stuck inactive
>>             8 requests are blocked > 32 sec
>>             crush map has legacy tunables (require bobtail, min is firefly)
>>      monmap e2: 3 mons at {birkeland=192.168.10.190:6789/0,immanuel=192.168.10.125:6789/0,peratt=192.168.10.187:6789/0}
>>             election epoch 14, quorum 0,1,2 immanuel,peratt,birkeland
>>      osdmap e186: 26 osds: 26 up, 26 in; 1796 remapped pgs
>>             flags sortbitwise
>>       pgmap v6599413: 1796 pgs, 4 pools, 1343 GB data, 336 kobjects
>>             4049 GB used, 92779 GB / 96829 GB avail
>>                 1761 remapped+peering
>>                   35 active+clean
>> 2016-10-09 07:00:00.000776 mon.0 [INF] HEALTH_ERR; 1761 pgs are stuck
>> inactive for more than 300 seconds; 1761 pgs peering; 1761 pgs stuck
>> inactive; 8 requests are blocked > 32 sec; crush map has legacy tunables
>> (require bobtail, min is firefly)
>>
>> I have legacy tunables on since Ceph is only backing our XenServer
>> infrastructure. The number of pgs remapping and clean hasn't changed, and
>> there doesn't seem to be that much data... Is this normal behavior?
>>
>> Here is my crushmap:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>> device 23 osd.23
>> device 24 osd.24
>> device 25 osd.25
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 chassis
>> type 3 rack
>> type 4 row
>> type 5 pdu
>> type 6 pod
>> type 7 room
>> type 8 datacenter
>> type 9 region
>> type 10 root
>>
>> # buckets
>> host tesla {
>>         id -2           # do not change unnecessarily
>>         # weight 36.369
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.5 weight 3.637
>>         item osd.0 weight 3.637
>>         item osd.2 weight 3.637
>>         item osd.4 weight 3.637
>>         item osd.8 weight 3.637
>>         item osd.3 weight 3.637
>>         item osd.6 weight 3.637
>>         item osd.1 weight 3.637
>>         item osd.9 weight 3.637
>>         item osd.7 weight 3.637
>> }
>> host faraday {
>>         id -3           # do not change unnecessarily
>>         # weight 32.732
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.23 weight 3.637
>>         item osd.18 weight 3.637
>>         item osd.17 weight 3.637
>>         item osd.25 weight 3.637
>>         item osd.20 weight 3.637
>>         item osd.22 weight 3.637
>>         item osd.21 weight 3.637
>>         item osd.19 weight 3.637
>>         item osd.24 weight 3.637
>> }
>> host hertz {
>>         id -4           # do not change unnecessarily
>>         # weight 25.458
>>         alg straw
>>         hash 0  # rjenkins1
>>         item osd.15 weight 3.637
>>         item osd.12 weight 3.637
>>         item osd.13 weight 3.637
>>         item osd.14 weight 3.637
>>         item osd.16 weight 3.637
>>         item osd.10 weight 3.637
>>         item osd.11 weight 3.637
>> }
>> root default {
>>         id -1           # do not change unnecessarily
>>         # weight 94.559
>>         alg straw
>>         hash 0  # rjenkins1
>>         item tesla weight 36.369
>>         item faraday weight 32.732
>>         item hertz weight 25.458
>> }
>>
>> # rules
>> rule replicated_ruleset {
>>         ruleset 0
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }
>> # end crush map
>>
>> Cheers,
>> Mike
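
P.S. For anyone who finds this thread in the archives: before I figured out it was the network path, the crush rule above was one of my first suspects. A quick way to sanity-check a rule offline, without touching the cluster, is roughly the following (the file names are just examples):

  # grab and decompile the current crush map
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # simulate mappings for rule 0 with 3 replicas
  crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-statistics

If the rule can't find enough distinct hosts, the result-size summary will show mappings smaller than the requested replica count, which rules the crush map in or out before you start chasing switches.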