Re: [ceph-users] unfound blocks IO or gives IO error?
On Fri, Jun 22, 2018 at 10:44 PM Gregory Farnum wrote:
> On Fri, Jun 22, 2018 at 6:22 AM Sergey Malinin wrote:
>> From http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ :
>>
>> "Now 1 knows that these object exist, but there is no live ceph-osd who has a copy. In this case, IO to those objects will block, and the cluster will hope that the failed node comes back soon; this is assumed to be preferable to returning an IO error to the user."
>
> This is definitely the default and the way I recommend you run a cluster. But do keep in mind sometimes other layers in your stack have their own timeouts and will start throwing errors if the Ceph library doesn't return an IO quickly enough. :)

Right, that's understood. This is the nice behaviour of virtio-blk vs virtio-scsi: the latter has a timeout, but blk blocks forever. On 5000 attached volumes we saw around 12 of these IO errors, and this was the first time in 5 years of upgrades that an IO error happened...

-- dan

> -Greg
>
>> On 22.06.2018, at 16:16, Dan van der Ster wrote:
>>
>> Hi all,
>>
>> Quick question: does an IO with an unfound object result in an IO error or should the IO block?
>>
>> During a jewel to luminous upgrade some PGs passed through a state with unfound objects for a few seconds. And this seems to match the times when we had a few IO errors on RBD attached volumes.
>>
>> Wondering what is the correct behaviour here...
>>
>> Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
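For reference, a hedged sketch of how one might inspect (and, as a last resort, give up on) unfound objects in a situation like the one above, following the troubleshooting-pg page quoted earlier; the PG id 2.4 is a placeholder:

# Which PGs report unfound objects, and how many
$ ceph health detail | grep unfound

# List the missing/unfound objects of one PG (placeholder PG id)
$ ceph pg 2.4 list_missing

# Check which OSDs were probed and whether a down OSD might still hold the data
$ ceph pg 2.4 query

# Last resort only, once the OSDs holding the copies are truly gone:
# "revert" rolls back to a previous version where possible, "delete" forgets the object
$ ceph pg 2.4 mark_unfound_lost revert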
[ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)
Hi all, We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active and 1 standby MDS. The active MDS crashed and now won't start again with this same error: ### 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 2018-06-25 16:11:21.133236 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p) ### Right before that point is just a bunch of client connection requests. There are also a few other inode errors such as: ### 2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x198f00a [2,head] v3426852030 at ~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already exists at ~mds0/stray2/198f00a ### We've done this for recovery: $ make sure all MDS are shut down (all crashed by this point anyway) $ ceph fs set myfs cluster_down true $ cephfs-journal-tool journal export backup.bin $ cephfs-journal-tool event recover_dentries summary Events by type: FRAGMENT: 9 OPEN: 29082 SESSION: 15 SUBTREEMAP: 241 UPDATE: 171835 Errors: 0 $ cephfs-table-tool all reset session { "0": { "data": {}, "result": 0 } } $ cephfs-table-tool all reset inode { "0": { "data": {}, "result": 0 } } $ cephfs-journal-tool --rank=myfs:0 journal reset old journal was 35714605847583~423728061 new journal start will be 35715031236608 (1660964 bytes past old end) writing journal head writing EResetJournal entry done $ ceph mds fail 0 $ ceph fs reset hpc_projects --yes-i-really-mean-it $ start up MDS again However, we keep getting the same error as above. We found this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html which has a similar issue, and some suggestions on using the cephfs-table-tool take_inos command, as our problem looks like we can't create new inodes. However, we don't quite understand the show inode or take_inos command. On our cluster, we see this: $ cephfs-table-tool 0 show inode { "0": { "data": { "version": 1, "inotable": { "projected_free": [ { "start": 1099511627776, "len": 1099511627776 } ], "free": [ { "start": 1099511627776, "len": 1099511627776 } ] } }, "result": 0 } } Our test cluster shows the exact same output, and running `cephfs-table-tool all take_inos 10` (on the test cluster) doesn't seem to do anything to the output of the above, and also the inode number from creating new files doesn't seem to jump +100K from where it was (likely we misunderstood how take_inos works). On our test cluster (no recovery nor reset has been run on it), the latest max inode, from our file creation and running `ls -li` is 1099511627792, just a tiny bit bigger than the "start" value above which seems to match the file count we've created on it. How do we find out what is our latest max inode on our production cluster, when `show inode` doesn't seem to show us anything useful? Also, FYI, over a week ago, we had a network failure, and had to perform recovery then. The recovery seemed OK, but there were some clients that were still running jobs from previously and seemed to have recovered so we were still in the process of draining and rebooting them as they finish their jobs. 
Some would come back with bad files but nothing that caused troubles until now. Very much appreciate any help! Cheers, Linh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recovery after datacenter outage
Hi Jason, your guesses were correct. Thank you for your support. Just in case, someone else stumbles upon this thread, some more links: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020722.html http://docs.ceph.com/docs/luminous/rados/operations/user-management/#authorization-capabilities http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication https://github.com/ceph/ceph/pull/15991 Jason Dillaman schrieb am Fr., 22. Juni 2018 um 22:58 Uhr: > It sounds like your OpenStack users do not have the correct caps to > blacklist dead clients. See step 6 in the upgrade section of Luminous’ > release notes or (preferably) use the new “profile rbd”-style caps if you > don’t use older clients. > > The reason why repairing the object map seemed to fix everything was > because I suspect you performed the op using the admin user, which had the > caps necessary to blacklist the dead clients and clean up the dirty > exclusive lock on the image. > > On Fri, Jun 22, 2018 at 4:47 PM Gregory Farnum wrote: > >> On Fri, Jun 22, 2018 at 2:26 AM Christian Zunker >> wrote: >> >>> Hi List, >>> >>> we are running a ceph cluster (12.2.5) as backend to our OpenStack cloud. >>> >>> Yesterday our datacenter had a power outage. As this wouldn't be enough, >>> we also had a separated ceph cluster because of networking problems. >>> >>> First of all thanks a lot to the ceph developers. After the network was >>> back to normal, ceph recovered itself. You saved us from a lot of downtime, >>> lack of sleep and insanity. >>> >>> Now to our problem/question: >>> After ceph recovered, we tried to bring up our VMs. They have cinder >>> volumes saved in ceph. All VMs didn't start because of I/O problems during >>> start: >>> [4.393246] JBD2: recovery failed >>> [4.395949] EXT4-fs (vda1): error loading journal >>> [4.400811] VFS: Dirty inode writeback failed for block device vda1 >>> (err=-5). >>> mount: mounting /dev/vda1 on /root failed: Input/output error >>> done. >>> Begin: Running /scripts/local-bottom ... done. >>> Begin: Running /scripts/init-bottom ... mount: mounting /dev on >>> /root/dev failed: No such file or directory >>> >>> We tried to recover the disk with different methods, but all failed >>> because of different reasons. What helped us at the end was a rebuild on >>> the object map of each image: >>> rbd object-map rebuild volumes/ >>> >>> From what we understood, object-map is a feature for ceph internal >>> speedup. How can this lead to I/O errors in our VMs? >>> Is this the expected way for a recovery? >>> Did we miss something? >>> Is there any documentation describing what leads to invalid object-maps >>> and how to recover? (We did not find a doc on that topic...) >>> >> >> An object map definitely shouldn't lead to IO errors in your VMs; in fact >> I thought it auto-repaired itself if necessary. Maybe the RBD guys can >> chime in here about probable causes of trouble. >> >> My *guess* is that perhaps your VMs or QEMU were configured to ignore >> barriers or some similar thing, so that when the power failed a write was >> "lost" as it got written to a new RBD object but not committed into the >> object map, but the FS or database journal recorded it as complete. I can't >> be sure about that though. 
>> -Greg >> >> >>> >>> >>> regards >>> Christian >>> ___ >>> ceph-users mailing list >>> ceph-users@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >> -- > Jason > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
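For anyone searching later, a hedged sketch of the caps change Jason describes (step 6 of the Luminous upgrade notes); the client name and pool names are placeholders for whatever an OpenStack deployment actually uses:

# Pre-Luminous style caps usually lack the "osd blacklist" permission, so a
# client that dies while holding an exclusive lock cannot be blacklisted by
# the client taking over. Inspect the current caps first:
$ ceph auth get client.cinder

# The Luminous "profile rbd" caps include the blacklist permission:
$ ceph auth caps client.cinder \
    mon 'profile rbd' \
    osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'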
Re: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)
So my colleague Sean Crosby and I were looking through the logs (with debug mds = 10) and found some references just before the crash to inode number. We converted it from HEX to decimal and got something like 109953*5*627776 (last few digits not necessarily correct). We set one digit up i.e to 109953*6*627776 and used that as the value for take_inos i.e: $ cephfs-table-tool all take_inos 1099536627776 After that, the MDS could start successfully and we have a HEALTH_OK cluster once more! It would still be useful if `show inode` in cephfs-table-tool actually shows us the max inode number at least though. And I think take_inos should be documented as well in the Disaster Recovery guide. :) We'll be monitoring the cluster for the next few days. Hopefully nothing too interesting to share after this! 😉 Cheers, Linh From: ceph-users on behalf of Linh Vu Sent: Monday, 25 June 2018 7:06:45 PM To: ceph-users Subject: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode) Hi all, We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active and 1 standby MDS. The active MDS crashed and now won't start again with this same error: ### 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 2018-06-25 16:11:21.133236 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc: 262: FAILED assert(!p) ### Right before that point is just a bunch of client connection requests. There are also a few other inode errors such as: ### 2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : loaded dup inode 0x198f00a [2,head] v3426852030 at ~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already exists at ~mds0/stray2/198f00a ### We've done this for recovery: $ make sure all MDS are shut down (all crashed by this point anyway) $ ceph fs set myfs cluster_down true $ cephfs-journal-tool journal export backup.bin $ cephfs-journal-tool event recover_dentries summary Events by type: FRAGMENT: 9 OPEN: 29082 SESSION: 15 SUBTREEMAP: 241 UPDATE: 171835 Errors: 0 $ cephfs-table-tool all reset session { "0": { "data": {}, "result": 0 } } $ cephfs-table-tool all reset inode { "0": { "data": {}, "result": 0 } } $ cephfs-journal-tool --rank=myfs:0 journal reset old journal was 35714605847583~423728061 new journal start will be 35715031236608 (1660964 bytes past old end) writing journal head writing EResetJournal entry done $ ceph mds fail 0 $ ceph fs reset hpc_projects --yes-i-really-mean-it $ start up MDS again However, we keep getting the same error as above. We found this: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html which has a similar issue, and some suggestions on using the cephfs-table-tool take_inos command, as our problem looks like we can't create new inodes. However, we don't quite understand the show inode or take_inos command. 
On our cluster, we see this: $ cephfs-table-tool 0 show inode { "0": { "data": { "version": 1, "inotable": { "projected_free": [ { "start": 1099511627776, "len": 1099511627776 } ], "free": [ { "start": 1099511627776, "len": 1099511627776 } ] } }, "result": 0 } } Our test cluster shows the exact same output, and running `cephfs-table-tool all take_inos 10` (on the test cluster) doesn't seem to do anything to the output of the above, and also the inode number from creating new files doesn't seem to jump +100K from where it was (likely we misunderstood how take_inos works). On our test cluster (no recovery nor reset has been run on it), the latest max inode, from our file creation and running `ls -li` is 1099511627792, just a tiny bit bigger than the "start" value above which seems to match the file count we've created on it. How do we find out what is our latest max inode on our production cluster, when `show inode` doesn't seem to show us anything useful? Also, FYI, over a week ago, we had a network failure, and had to perform recovery then. The recovery seemed OK, but there were some clients that were still running jobs from previously and seemed to have recovered so we were still in the proce
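A hedged sketch of the kind of workaround this thread ends up using (see the follow-up later in this digest): fish the highest inode number out of a debug-mds log, convert it from hex, round it up well past the real maximum, and hand it to take_inos before letting the rank start again. The log path and numbers are placeholders:

# One way to pull inode numbers (0x... values) out of a debug mds log
$ grep -oE '0x[0-9a-f]{8,}' /var/log/ceph/ceph-mds.*.log | sort -u | tail -5

# Convert the largest one from hex to decimal
$ printf '%d\n' 0x10000f9ebc0        # placeholder value

# Mark everything below a generously rounded-up value as used, so the MDS
# never hands out an inode number that already exists on disk
$ cephfs-table-tool all take_inos 1099536627776

# Then bring the filesystem back and let a standby take the rank
$ ceph fs set myfs cluster_down false
$ ceph mds fail 0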
Re: [ceph-users] fixing unrepairable inconsistent PG
Hi Brad, here is the output: -- root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg 18.2 query 2018-06-25 10:59:12.100302 7fe23eaa1700 2 Event(0x7fe2400e0140 nevent=5000 time_id=1).set_owner idx=0 owner=140609690670848 2018-06-25 10:59:12.100398 7fe23e2a0700 2 Event(0x7fe24010d030 nevent=5000 time_id=1).set_owner idx=1 owner=140609682278144 2018-06-25 10:59:12.100445 7fe23da9f700 2 Event(0x7fe240139ec0 nevent=5000 time_id=1).set_owner idx=2 owner=140609673885440 2018-06-25 10:59:12.100793 7fe244b28700 1 Processor -- start 2018-06-25 10:59:12.100869 7fe244b28700 1 -- - start start 2018-06-25 10:59:12.100882 7fe244b28700 5 adding auth protocol: cephx 2018-06-25 10:59:12.101046 7fe244b28700 2 auth: KeyRing::load: loaded key file /etc/ceph/ceph.client.admin.keyring 2018-06-25 10:59:12.101244 7fe244b28700 1 -- - --> 192.168.168.201:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0 2018-06-25 10:59:12.101264 7fe244b28700 1 -- - --> 192.168.168.202:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0 2018-06-25 10:59:12.101690 7fe23e2a0700 1 -- 192.168.168.201:0/3046734987 learned_addr learned my addr 192.168.168.201:0/3046734987 2018-06-25 10:59:12.101890 7fe23e2a0700 2 -- 192.168.168.201:0/3046734987 >> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0 2018-06-25 10:59:12.102030 7fe23da9f700 2 -- 192.168.168.201:0/3046734987 >> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0 2018-06-25 10:59:12.102450 7fe23e2a0700 5 -- 192.168.168.201:0/3046734987 >> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 seq 1 0x7fe234002670 mon_map magic: 0 v1 2018-06-25 10:59:12.102494 7fe23e2a0700 5 -- 192.168.168.201:0/3046734987 >> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). 
rx mon.1 seq 2 0x7fe234002b70 auth_reply(proto 2 0 (0) Success) v1 2018-06-25 10:59:12.102542 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 <== mon.1 192.168.168.202:6789/0 1 mon_map magic: 0 v1 505+0+0 (2386987630 0 0) 0x7fe234002670 con 0x7fe240176dc0 2018-06-25 10:59:12.102629 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 <== mon.1 192.168.168.202:6789/0 2 auth_reply(proto 2 0 (0) Success) v1 33+0+0 (1469975654 0 0) 0x7fe234002b70 con 0x7fe240176dc0 2018-06-25 10:59:12.102655 7fe23ca9d700 10 cephx: set_have_need_key no handler for service mon 2018-06-25 10:59:12.102657 7fe23ca9d700 10 cephx: set_have_need_key no handler for service osd 2018-06-25 10:59:12.102658 7fe23ca9d700 10 cephx: set_have_need_key no handler for service mgr 2018-06-25 10:59:12.102661 7fe23ca9d700 10 cephx: set_have_need_key no handler for service auth 2018-06-25 10:59:12.102662 7fe23ca9d700 10 cephx: validate_tickets want 53 have 0 need 53 2018-06-25 10:59:12.102666 7fe23ca9d700 10 cephx client: handle_response ret = 0 2018-06-25 10:59:12.102671 7fe23ca9d700 10 cephx client: got initial server challenge 6522ec95fb2eb487 2018-06-25 10:59:12.102673 7fe23ca9d700 10 cephx client: validate_tickets: want=53 need=53 have=0 2018-06-25 10:59:12.102674 7fe23ca9d700 10 cephx: set_have_need_key no handler for service mon 2018-06-25 10:59:12.102675 7fe23ca9d700 10 cephx: set_have_need_key no handler for service osd 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no handler for service mgr 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no handler for service auth 2018-06-25 10:59:12.102677 7fe23ca9d700 10 cephx: validate_tickets want 53 have 0 need 53 2018-06-25 10:59:12.102678 7fe23ca9d700 10 cephx client: want=53 need=53 have=0 2018-06-25 10:59:12.102680 7fe23ca9d700 10 cephx client: build_request 2018-06-25 10:59:12.102702 7fe23da9f700 5 -- 192.168.168.201:0/3046734987 >> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 seq 1 0x7fe228001490 mon_map magic: 0 v1 2018-06-25 10:59:12.102739 7fe23ca9d700 10 cephx client: get auth session key: client_challenge 80f2a24093f783c5 2018-06-25 10:59:12.102743 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 --> 192.168.168.202:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe224002080 con 0 2018-06-25 10:59:12.102737 7fe23da9f700 5 -- 192.168.168.201:0/3046734987 >> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 seq 2 0x7fe2280019c0 auth_reply(proto 2 0 (0) Success) v1 2018-06-25 10:59:12.102776 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 <== mon.0 192.168.168.201:6789/0 1 mon_map magic: 0 v1 505+0+0 (2386987630 0 0) 0x7fe228001490 con 0x7fe24017a420 2018-
[ceph-users] Move Ceph-Cluster to another Datacenter
Hey Ceph people,

I need advice on how to move a ceph cluster from one datacenter to another without any downtime :)

DC 1:
3 dedicated MON servers (also running MGR)
4 dedicated OSD servers (3x 12 OSDs, 1x 23 OSDs)
3 Proxmox nodes with a connection to our Ceph storage (not managed from Proxmox! Ceph is a standalone installation)

DC 2:
No Ceph-related servers currently

Version: Luminous (12.2.4)

Only one pool:
NAME  ID  USED    %USED  MAX AVAIL  OBJECTS
rbd   0   30638G  63.89  17318G     16778324

I need to move my Ceph installation from DC1 to DC2 and would really be happy if you could give me some advice on how to do this without any downtime and in a still performant manner. The latency from DC1 to DC2 is ~1.5ms - we could perhaps bring up a 10Gb fibre connection between DC1 and DC2.

A second Ceph cluster in DC2 is not possible for cost reasons, but I could bring a 5th OSD server online there. So "RBD mirror" isn't really a viable option right now - but I will try to make it possible ^^ ...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Balancer: change from crush-compat to upmap
Hi all,

I've been using the balancer module in crush-compat mode for quite a while now and want to switch to upmap mode, since all my clients are now luminous (v12.2.5).

I've reweighted the compat weight-set back to as close as possible to the original crush weights using 'ceph osd crush reweight-compat'.

Before I switch to upmap I presume I need to remove the compat weight-set with:

ceph osd crush weight-set rm-compat

Will this have any significant impact (rebalancing lots of PGs), or does it have very little effect since I already reweighted everything back close to the default crush weights?

Thanks in advance and kind regards,
Caspar

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
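Not an answer to the impact question, but a hedged sketch of the usual command sequence for the switch itself, assuming every client really is luminous or newer:

# upmap requires that only luminous+ clients can connect
$ ceph features
$ ceph osd set-require-min-compat-client luminous

# Pause the balancer, drop the compat weight-set, then switch modes
$ ceph balancer off
$ ceph osd crush weight-set rm-compat
$ ceph balancer mode upmap
$ ceph balancer on

# Watch how much data actually moves
$ ceph -s
$ ceph balancer status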
Re: [ceph-users] PG status is "active+undersized+degraded"
Hi,

On 06/22/2018 08:06 AM, dave.c...@dell.com wrote:
> I saw this statement at this link ( http://docs.ceph.com/docs/master/rados/operations/crush-map/ ); is that the reason that leads to the warning?
>
> "This, combined with the default CRUSH failure domain, ensures that replicas or erasure code shards are separated across hosts and a single host failure will not affect availability."
>
> Best Regards,
> Dave Chen
>
> -----Original Message-----
> From: Chen2, Dave
> Sent: Friday, June 22, 2018 1:59 PM
> To: 'Burkhard Linke'; ceph-users@lists.ceph.com
> Cc: Chen2, Dave
> Subject: RE: [ceph-users] PG status is "active+undersized+degraded"
>
> Hi Burkhard,
>
> Thanks for your explanation. I created a new 2TB OSD on another node, and it indeed solved the issue; the status of the Ceph cluster is "health HEALTH_OK" now.
>
> Another question: if three homogeneous OSDs are spread across only 2 nodes, I still get the warning message and the status is "active+undersized+degraded". So is spreading the three OSDs across 3 nodes a mandatory rule for Ceph? Is that only for the HA consideration? Does any official Ceph document give guidance on this?

The default ceph crush rules try to distribute PG replicates among hosts. With a default replication number of 3 (pool size = 3), this requires at least three hosts. The pool also defines a minimum number of PG replicates to be available for allowing I/O to a PG. This is usually set to 2 (pool min size = 2).

The above status thus means that there are enough copies for the min size (-> active), but not enough for the size (-> undersized + degraded).

Using less than three hosts requires changing the pool size to 2. But this is strongly discouraged, since a sane automatic recovery of data in case of a netsplit or other temporary node failure is not possible. Do not do this in a production setup.

For a production setup you should also consider node failures. The default setup uses 3 replicates, so to allow a node failure, you need 4 hosts. Otherwise the self healing feature of ceph cannot recover the third replicate. You also need to closely monitor your cluster's free space to avoid a full cluster due to replicated PGs in case of a node failure.

Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
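A hedged sketch of how to check the settings Burkhard is referring to; the pool name and rule name are placeholders:

# How many replicas the pool wants, and how many it needs to keep serving I/O
$ ceph osd pool get rbd size
$ ceph osd pool get rbd min_size

# Which CRUSH rule the pool uses, and its failure domain
# (look for "type host" in the chooseleaf step)
$ ceph osd pool get rbd crush_rule
$ ceph osd crush rule dump replicated_rule

# Strongly discouraged outside test setups, per the above:
# $ ceph osd pool set rbd size 2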
Re: [ceph-users] radosgw failover help
Hi,

On 06/20/2018 07:20 PM, David Turner wrote:
> We originally used pacemaker to move a VIP between our RGWs, but ultimately decided to go with an LB in front of them. With an LB you can utilize both RGWs while they're up, but the LB will shy away from either if they're down until the check starts succeeding for that host again. We do have 2 LBs with pacemaker, but the LBs are in charge of 3 prod RGW realms and 2 staging realms. Moving to the LB with pacemaker simplified our setup quite a bit for HA.

We use a similar setup, but without an extra load balancer host. Pacemaker is deployed on all hosts acting as RGW, together with haproxy as load balancer. haproxy is bound to the VIPs, does active checks on the RGWs, and distributes RGW traffic to all three RGW servers in our setup. It also takes care of SSL/TLS termination.

With this setup we are also able to use multiple VIPs (e.g. one for external traffic, one for internal traffic), and route them to different haproxy instances if possible.

Regards,
Burkhard

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
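A minimal, hedged haproxy sketch of the kind of setup described above (active checks on each RGW, with the VIP itself managed by pacemaker); addresses, ports and the check URL are placeholders:

$ cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend rgw_http
    bind 192.0.2.10:80              # VIP moved around by pacemaker
    default_backend rgw_backend

backend rgw_backend
    balance roundrobin
    option httpchk GET /            # RGW normally answers an anonymous GET / with 200
    server rgw1 10.0.0.1:7480 check
    server rgw2 10.0.0.2:7480 check
    server rgw3 10.0.0.3:7480 check
EOF
$ systemctl reload haproxy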
[ceph-users] Intel SSD DC P3520 PCIe for OSD 1480 TBW good idea?
Hello everybody, I am thinking about making a production three node Ceph cluster with 3x 1.2TB Intel SSD DC P3520 PCIe storage devices. 10.8 (7.2TB 66% for production) I am not planning on a journal on a separate ssd. I assume there is no advantage of this when using pcie storage? Network connection to an Cisco SG550XG-8F8T 10Gbe Switch with Intel X710-DA2. (if someone knows a good mainline Linux budget replacement). https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p3520-series/dc-p3520-1-2tb-aic-3d1.html Is this a good storage setup? Mainboard: Intel® Server Board S2600CW2R CPU: 2x Intel® Xeon® Processor E5-2630 v4 (25M Cache, 2.20 GHz) Memory: 1x 64GB DDR4 ECC KVR24R17D4K4/64 Disk: 2x WD Gold 4TB 7200rpm 128MB SATA3 Storage: 3x Intel SSD DC P3520 1.2TB PCIe Adapter: Intel Ethernet Converged Network Adapter X710-DA2 I want to try using NUMA to also run KVM guests besides the OSD. I should have enough cores and only have a few osd processes. Kind regards, Jelle de Jong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")
Good Morning, After removing roughly 20-some rbd shapshots, one of my OSD's has begun flapping. ERROR 1 2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738 pg[4.e8( v 44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595 n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729 44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod 44721'485586 mlcod 44721'485586 active+clean+snapt rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head 2018-06-25 06:46:41.314172 a1ce2700 -1 /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t, PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25 06:46:41.220388 /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: 217: FAILED assert(0 == "put on missing extent (nothing before)") ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1bc) [0x2a2c314] 2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned int, std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*, bool*)+0x128) [0x2893650] 3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned int, std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> >*, std::set, std::allocator >*)+0xb8) [0x2791bdc] 4: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, BlueStore::WriteContext*, std::set, std::allocator >*)+0x5c8) [0x27f3254] 5: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, unsigned long long, std::set, std::allocator >*)+0x360) [0x27f7834] 6: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr)+0xb4) [0x27f81b4] 7: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr&)+0x1dc) [0x27f9638] 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xe7c) [0x27e855c] 9: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vectorstd::allocator >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x67c) [0x27e6f80] 10: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vectorstd::allocator >&, Context*, Context*, Context*, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x118) [0x1f9ce48] 11: (PrimaryLogPG::queue_transactions(std::vectorstd::allocator >&, boost::intrusive_ptr)+0x9c) [0x22dd754] 12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t const&, std::vectorstd::allocator > const&, boost::optionalry_t>&, Context*, Context*, Context*, unsigned long long, osd_reqid_t, boost::intrusive_ptr)+0x6f4) [0x25c0568] 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x7f4) [0x228ac98] 14: (PrimaryLogPG::simple_opc_submit(std::unique_ptrstd::default_delete >)+0x1b8) [0x228bc54] 15: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x1970) [0x22c5d4c] 16: (boost::statechart::detail::reaction_result boost::statechart::custom_reaction::reactboost::statechart::event_base, void const*>(PrimaryLogPG::AwaitAsyncWork&, boost::statechart::event_base const&, void const* const&)+0x58) [0x23b245c] 17: (boost::statechart::detail::reaction_result 
boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_: :na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl _::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0> >(boost::statechart::simple_state, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x30) [0x23b0f04] 18: (boost::statechart::detail::reaction_result boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_: :na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::local_react, mpl_::na, mpl_::na, mpl_::na, mpl
Re: [ceph-users] Recovery after datacenter outage
+Paul On Mon, Jun 25, 2018 at 5:14 AM, Christian Zunker wrote: > Hi Jason, > > your guesses were correct. Thank you for your support. > > Just in case, someone else stumbles upon this thread, some more links: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020722.html > http://docs.ceph.com/docs/luminous/rados/operations/user-management/#authorization-capabilities > http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication > https://github.com/ceph/ceph/pull/15991 > > Jason Dillaman schrieb am Fr., 22. Juni 2018 um 22:58 > Uhr: >> >> It sounds like your OpenStack users do not have the correct caps to >> blacklist dead clients. See step 6 in the upgrade section of Luminous’ >> release notes or (preferably) use the new “profile rbd”-style caps if you >> don’t use older clients. >> >> The reason why repairing the object map seemed to fix everything was >> because I suspect you performed the op using the admin user, which had the >> caps necessary to blacklist the dead clients and clean up the dirty >> exclusive lock on the image. >> >> On Fri, Jun 22, 2018 at 4:47 PM Gregory Farnum wrote: >>> >>> On Fri, Jun 22, 2018 at 2:26 AM Christian Zunker >>> wrote: Hi List, we are running a ceph cluster (12.2.5) as backend to our OpenStack cloud. Yesterday our datacenter had a power outage. As this wouldn't be enough, we also had a separated ceph cluster because of networking problems. First of all thanks a lot to the ceph developers. After the network was back to normal, ceph recovered itself. You saved us from a lot of downtime, lack of sleep and insanity. Now to our problem/question: After ceph recovered, we tried to bring up our VMs. They have cinder volumes saved in ceph. All VMs didn't start because of I/O problems during start: [4.393246] JBD2: recovery failed [4.395949] EXT4-fs (vda1): error loading journal [4.400811] VFS: Dirty inode writeback failed for block device vda1 (err=-5). mount: mounting /dev/vda1 on /root failed: Input/output error done. Begin: Running /scripts/local-bottom ... done. Begin: Running /scripts/init-bottom ... mount: mounting /dev on /root/dev failed: No such file or directory We tried to recover the disk with different methods, but all failed because of different reasons. What helped us at the end was a rebuild on the object map of each image: rbd object-map rebuild volumes/ From what we understood, object-map is a feature for ceph internal speedup. How can this lead to I/O errors in our VMs? Is this the expected way for a recovery? Did we miss something? Is there any documentation describing what leads to invalid object-maps and how to recover? (We did not find a doc on that topic...) >>> >>> >>> An object map definitely shouldn't lead to IO errors in your VMs; in fact >>> I thought it auto-repaired itself if necessary. Maybe the RBD guys can chime >>> in here about probable causes of trouble. >>> >>> My *guess* is that perhaps your VMs or QEMU were configured to ignore >>> barriers or some similar thing, so that when the power failed a write was >>> "lost" as it got written to a new RBD object but not committed into the >>> object map, but the FS or database journal recorded it as complete. I can't >>> be sure about that though. 
>>> -Greg >>> regards Christian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> -- >> Jason > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Proxmox with EMC VNXe 3200
Hi all,

We're planning the migration of a VMware 5.5 cluster backed by an EMC VNXe 3200 storage appliance to Proxmox. The VNXe has about 3 years of warranty left and half its disks unprovisioned, so the current plan is to use the same VNXe as Proxmox storage. After the warranty expires we'll most probably go Ceph, but that's some years in the future.

The VNXe seems to support both iSCSI and NFS (CIFS too, but that is really outside my tech tastes). I guess the best option performance-wise would be iSCSI, but I like the simplicity of NFS.

Any idea what the performance impact of this choice (NFS vs. iSCSI) could be? Has anyone had any experience with this kind of storage appliance?

Thanks a lot
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")
Hi, Is there any information you'd like to grab off this OSD? Anything I can provide to help you troubleshoot this? I ask, because if not, I'm going to reformat / rebuild this OSD (unless there is a faster way to repair this issue). Thanks, Dyweni On 2018-06-25 07:30, Dyweni - Ceph-Users wrote: Good Morning, After removing roughly 20-some rbd shapshots, one of my OSD's has begun flapping. ERROR 1 2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738 pg[4.e8( v 44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595 n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729 44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod 44721'485586 mlcod 44721'485586 active+clean+snapt rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head 2018-06-25 06:46:41.314172 a1ce2700 -1 /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t, PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25 06:46:41.220388 /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: 217: FAILED assert(0 == "put on missing extent (nothing before)") ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1bc) [0x2a2c314] 2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned int, std::vector >*, bool*)+0x128) [0x2893650] 3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned int, std::vector >*, std::set, std::allocator >*)+0xb8) [0x2791bdc] 4: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, BlueStore::WriteContext*, std::set, std::allocator >*)+0x5c8) [0x27f3254] 5: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr, unsigned long long, std::set, std::allocator >*)+0x360) [0x27f7834] 6: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr)+0xb4) [0x27f81b4] 7: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr&, boost::intrusive_ptr&)+0x1dc) [0x27f9638] 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xe7c) [0x27e855c] 9: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector >&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x67c) [0x27e6f80] 10: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, std::vector >&, Context*, Context*, Context*, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x118) [0x1f9ce48] 11: (PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)+0x9c) [0x22dd754] 12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr >&&, eversion_t const&, eversion_t const&, std::vector > const&, boost::optional&, Context*, Context*, Context*, unsigned long long, osd_reqid_t, boost::intrusive_ptr)+0x6f4) [0x25c0568] 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x7f4) [0x228ac98] 14: (PrimaryLogPG::simple_opc_submit(std::unique_ptr >)+0x1b8) [0x228bc54] 15: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x1970) [0x22c5d4c] 16: (boost::statechart::detail::reaction_result boost::statechart::custom_reaction::react(PrimaryLogPG::AwaitAsyncWork&, boost::statechart::event_base const&, void const* const&)+0x58) [0x23b245c] 17: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state, (boost::statechart::history_mode)0> >(boost::statechart::simple_state, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x30) [0x23b0f04] 18: (boost::statechart::detail::reaction_result boost::statechart::simple_state, (boost::statechart::history_mode)0>::local_react, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl _::na, mpl_::na, mpl_::na, mpl_::na> >(boost::statechart::event_base const&, void const*)+0x28) [0x23af7cc] 19: (boost::statechart::simple_state, (boost: :statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x28) [0x23ad744] 20: (boost::statechart::detail::send_function, boost::statechart::detail::rtti_policy>, boost::statechart::event_base, void const*>::operator()()+0x40) [0x21b6000] 21: (boost::statechart::detail::reaction_result boost::statechart::null_exception_translator::operator(), boost::statechart::detail::rtti_policy>, boost::statechart::event_base,
[ceph-users] Increase queue_depth in KVM
Hello,

When I map an rbd image with -o queue_depth=1024 I see a big improvement, mostly on writes (random writes go from 3k IOPS with the default queue_depth to 24k IOPS with queue_depth=1024).

Is there any way to attach an rbd disk to a KVM instance with a custom queue depth? I can't find any information about it.

Thanks for any information.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
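Not a definitive answer, but a hedged sketch of the knobs involved. The queue_depth option belongs to the kernel rbd client (rbd map); for a librbd disk handed to QEMU, the closest equivalents are the virtio-blk queue properties, if your QEMU/libvirt build exposes them (names and availability vary by version, so treat these as assumptions to verify):

# Kernel RBD client: per-device queue depth is a map option
$ rbd map rbd_pool/image1 -o queue_depth=1024

# QEMU + librbd guest disk: newer QEMU versions expose num-queues (and, in
# recent releases, queue-size) on virtio-blk devices
$ qemu-system-x86_64 ... \
    -drive file=rbd:rbd_pool/image1:id=libvirt,format=raw,if=none,id=drive0,cache=writeback \
    -device virtio-blk-pci,drive=drive0,num-queues=4,queue-size=1024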
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
I should be able to answer this question for you if you can supply the output of the following commands. It will print out all of your pool names along with how many PGs are in that pool. My guess is that you don't have a power of 2 number of PGs in your pool. Alternatively you might have multiple pools and the PGs from the various pools are just different sizes. ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done ceph df For me the output looks like this. rbd: 64 cephfs_metadata: 64 cephfs_data: 256 rbd-ssd: 32 GLOBAL: SIZE AVAIL RAW USED %RAW USED 46053G 26751G 19301G 41.91 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd-replica 4897G 11.36 7006G 263000 cephfs_metadata 6141M 0.05 268G 11945 cephfs_data 7 10746G 43.4114012G 2795782 rbd-replica-ssd 9241G 47.30 268G 75061 On Sun, Jun 24, 2018 at 9:48 PM shadow_lin wrote: > Hi List, >The enviroment is: >Ceph 12.2.4 >Balancer module on and in upmap mode >Failure domain is per host, 2 OSD per host >EC k=4 m=2 >PG distribution is almost even before and after the rebalancing. > > >After marking out one of the osd,I noticed a lot of the data was moving > into the other osd on the same host . > >Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20 > was marked out): > > ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS > 19 hdd 9.09560 1.0 9313G 7079G 2233G 76.01 1.00 135 > 21 hdd 9.09560 1.0 9313G 8123G 1190G 87.21 1.15 135 > 22 hdd 9.09560 1.0 9313G 7026G 2287G 75.44 1.00 133 > 23 hdd 9.09560 1.0 9313G 7026G 2286G 75.45 1.00 134 > >I am using RBD only so the objects should all be 4m .I don't understand > why osd 21 got significant more data > with the same pg as other osds. >Is this behavior expected or I misconfiged something or some kind of > bug? > >Thanks > > > 2018-06-25 > shadow_lin > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] fixing unrepairable inconsistent PG
Interesing... Can I see the output of "ceph auth list" and can you test whether you can query any other pg that has osd.21 as its primary? On Mon, Jun 25, 2018 at 8:04 PM, Andrei Mikhailovsky wrote: > Hi Brad, > > here is the output: > > -- > > root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg > 18.2 query > 2018-06-25 10:59:12.100302 7fe23eaa1700 2 Event(0x7fe2400e0140 nevent=5000 > time_id=1).set_owner idx=0 owner=140609690670848 > 2018-06-25 10:59:12.100398 7fe23e2a0700 2 Event(0x7fe24010d030 nevent=5000 > time_id=1).set_owner idx=1 owner=140609682278144 > 2018-06-25 10:59:12.100445 7fe23da9f700 2 Event(0x7fe240139ec0 nevent=5000 > time_id=1).set_owner idx=2 owner=140609673885440 > 2018-06-25 10:59:12.100793 7fe244b28700 1 Processor -- start > 2018-06-25 10:59:12.100869 7fe244b28700 1 -- - start start > 2018-06-25 10:59:12.100882 7fe244b28700 5 adding auth protocol: cephx > 2018-06-25 10:59:12.101046 7fe244b28700 2 auth: KeyRing::load: loaded key > file /etc/ceph/ceph.client.admin.keyring > 2018-06-25 10:59:12.101244 7fe244b28700 1 -- - --> 192.168.168.201:6789/0 -- > auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0 > 2018-06-25 10:59:12.101264 7fe244b28700 1 -- - --> 192.168.168.202:6789/0 -- > auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0 > 2018-06-25 10:59:12.101690 7fe23e2a0700 1 -- 192.168.168.201:0/3046734987 > learned_addr learned my addr 192.168.168.201:0/3046734987 > 2018-06-25 10:59:12.101890 7fe23e2a0700 2 -- 192.168.168.201:0/3046734987 >> > 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 > s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got > newly_acked_seq 0 vs out_seq 0 > 2018-06-25 10:59:12.102030 7fe23da9f700 2 -- 192.168.168.201:0/3046734987 >> > 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 > s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got > newly_acked_seq 0 vs out_seq 0 > 2018-06-25 10:59:12.102450 7fe23e2a0700 5 -- 192.168.168.201:0/3046734987 >> > 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 > seq 1 0x7fe234002670 mon_map magic: 0 v1 > 2018-06-25 10:59:12.102494 7fe23e2a0700 5 -- 192.168.168.201:0/3046734987 >> > 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). 
rx mon.1 > seq 2 0x7fe234002b70 auth_reply(proto 2 0 (0) Success) v1 > 2018-06-25 10:59:12.102542 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 > <== mon.1 192.168.168.202:6789/0 1 mon_map magic: 0 v1 505+0+0 > (2386987630 0 0) 0x7fe234002670 con 0x7fe240176dc0 > 2018-06-25 10:59:12.102629 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 > <== mon.1 192.168.168.202:6789/0 2 auth_reply(proto 2 0 (0) Success) v1 > 33+0+0 (1469975654 0 0) 0x7fe234002b70 con 0x7fe240176dc0 > 2018-06-25 10:59:12.102655 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service mon > 2018-06-25 10:59:12.102657 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service osd > 2018-06-25 10:59:12.102658 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service mgr > 2018-06-25 10:59:12.102661 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service auth > 2018-06-25 10:59:12.102662 7fe23ca9d700 10 cephx: validate_tickets want 53 > have 0 need 53 > 2018-06-25 10:59:12.102666 7fe23ca9d700 10 cephx client: handle_response ret > = 0 > 2018-06-25 10:59:12.102671 7fe23ca9d700 10 cephx client: got initial server > challenge 6522ec95fb2eb487 > 2018-06-25 10:59:12.102673 7fe23ca9d700 10 cephx client: validate_tickets: > want=53 need=53 have=0 > 2018-06-25 10:59:12.102674 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service mon > 2018-06-25 10:59:12.102675 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service osd > 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service mgr > 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no > handler for service auth > 2018-06-25 10:59:12.102677 7fe23ca9d700 10 cephx: validate_tickets want 53 > have 0 need 53 > 2018-06-25 10:59:12.102678 7fe23ca9d700 10 cephx client: want=53 need=53 > have=0 > 2018-06-25 10:59:12.102680 7fe23ca9d700 10 cephx client: build_request > 2018-06-25 10:59:12.102702 7fe23da9f700 5 -- 192.168.168.201:0/3046734987 >> > 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 > seq 1 0x7fe228001490 mon_map magic: 0 v1 > 2018-06-25 10:59:12.102739 7fe23ca9d700 10 cephx client: get auth session > key: client_challenge 80f2a24093f783c5 > 2018-06-25 10:59:12.102743 7fe23ca9d700 1 -- 192.168.168.201:0/3046734987 > --> 192.168.168.202:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- > 0x7fe224002080 con 0 > 2018-06-25 10:59:12.102737 7fe23da9f700 5 -- 192.168.168.201:0/3046734987 >> > 192.168.168.2
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
Hi David, I am afraid I can't run the command you provide now,because I tried to remove another osd on that host to see if it would make the data distribution even and it did. The pg number of my pools are at power of 2. Below is from my note before removed another osd: pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull stripe_width 0 application rbd pg distribution of osd of all pools: https://pasteboard.co/HrBZv3s.png What I don't understand is why data distribution is uneven when pg distribution is even. 2018-06-26 shadow_lin 发件人:David Turner 发送时间:2018-06-26 01:24 主题:Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing 收件人:"shadow_lin" 抄送:"ceph-users" I should be able to answer this question for you if you can supply the output of the following commands. It will print out all of your pool names along with how many PGs are in that pool. My guess is that you don't have a power of 2 number of PGs in your pool. Alternatively you might have multiple pools and the PGs from the various pools are just different sizes. ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done ceph df For me the output looks like this. rbd: 64 cephfs_metadata: 64 cephfs_data: 256 rbd-ssd: 32 GLOBAL: SIZE AVAIL RAW USED %RAW USED 46053G 26751G 19301G 41.91 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd-replica 4897G 11.36 7006G 263000 cephfs_metadata 6141M 0.05 268G 11945 cephfs_data 7 10746G 43.4114012G 2795782 rbd-replica-ssd 9241G 47.30 268G 75061 On Sun, Jun 24, 2018 at 9:48 PM shadow_lin wrote: Hi List, The enviroment is: Ceph 12.2.4 Balancer module on and in upmap mode Failure domain is per host, 2 OSD per host EC k=4 m=2 PG distribution is almost even before and after the rebalancing. After marking out one of the osd,I noticed a lot of the data was moving into the other osd on the same host . Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20 was marked out): ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 19 hdd 9.09560 1.0 9313G 7079G 2233G 76.01 1.00 135 21 hdd 9.09560 1.0 9313G 8123G 1190G 87.21 1.15 135 22 hdd 9.09560 1.0 9313G 7026G 2287G 75.44 1.00 133 23 hdd 9.09560 1.0 9313G 7026G 2286G 75.45 1.00 134 I am using RBD only so the objects should all be 4m .I don't understand why osd 21 got significant more data with the same pg as other osds. Is this behavior expected or I misconfiged something or some kind of bug? Thanks 2018-06-25 shadow_lin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
You have 2 different pools. PGs in each pool are going to be a different size. It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's and Y's. Having equal PG counts on each osd is only balanced if you have a single pool or have a case where all PGs are identical in size. The latter is not likely. On Mon, Jun 25, 2018, 10:02 PM shadow_lin wrote: > Hi David, > I am afraid I can't run the command you provide now,because I tried to > remove another osd on that host to see if it would make the data > distribution even and it did. > The pg number of my pools are at power of 2. > Below is from my note before removed another osd: > pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 > object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags > hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd > pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 > object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags > hashpspool,nearfull stripe_width 0 application rbd > pg distribution of osd of all pools: > https://pasteboard.co/HrBZv3s.png > > What I don't understand is why data distribution is uneven when pg > distribution is even. > > 2018-06-26 > > shadow_lin > > > > 发件人:David Turner > 发送时间:2018-06-26 01:24 > 主题:Re: [ceph-users] Uneven data distribution with even pg distribution > after rebalancing > 收件人:"shadow_lin" > 抄送:"ceph-users" > > I should be able to answer this question for you if you can supply the > output of the following commands. It will print out all of your pool names > along with how many PGs are in that pool. My guess is that you don't have > a power of 2 number of PGs in your pool. Alternatively you might have > multiple pools and the PGs from the various pools are just different sizes. > > > ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read > pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done > ceph df > > > For me the output looks like this. > rbd: 64 > cephfs_metadata: 64 > cephfs_data: 256 > rbd-ssd: 32 > > > > GLOBAL: > SIZE AVAIL RAW USED %RAW USED > 46053G 26751G 19301G 41.91 > POOLS: > NAMEID USED %USED MAX AVAIL OBJECTS > rbd-replica 4897G 11.36 7006G 263000 > cephfs_metadata 6141M 0.05 268G 11945 > cephfs_data 7 10746G 43.4114012G 2795782 > rbd-replica-ssd 9241G 47.30 268G 75061 > > > On Sun, Jun 24, 2018 at 9:48 PM shadow_lin wrote: > > Hi List, >The enviroment is: >Ceph 12.2.4 >Balancer module on and in upmap mode >Failure domain is per host, 2 OSD per host >EC k=4 m=2 >PG distribution is almost even before and after the rebalancing. > > >After marking out one of the osd,I noticed a lot of the data was moving > into the other osd on the same host . > >Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20 > was marked out): > > ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS > 19 hdd 9.09560 1.0 9313G 7079G 2233G 76.01 1.00 135 > 21 hdd 9.09560 1.0 9313G 8123G 1190G 87.21 1.15 135 > 22 hdd 9.09560 1.0 9313G 7026G 2287G 75.44 1.00 133 > 23 hdd 9.09560 1.0 9313G 7026G 2286G 75.45 1.00 134 > >I am using RBD only so the objects should all be 4m .I don't understand > why osd 21 got significant more data > with the same pg as other osds. >Is this behavior expected or I misconfiged something or some kind of > bug? 
> >Thanks > > > 2018-06-25 > shadow_lin > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
Hi David,
I am sure most (if not all) of the data is in one pool. rbd_pool is only used for omap for the EC RBD images.

ceph df:

GLOBAL:
    SIZE     AVAIL       RAW USED     %RAW USED
    427T     100555G     329T         77.03
POOLS:
    NAME            ID     USED     %USED     MAX AVAIL     OBJECTS
    ec_rbd_pool     3      219T     81.40     50172G        57441718
    rbd_pool        4      144      0         37629G        19

2018-06-26
shadow_lin

From: David Turner
Sent: 2018-06-26 10:21
Subject: Re: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

You have 2 different pools. PGs in each pool are going to be a different size. It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's and Y's. Having equal PG counts on each osd is only balanced if you have a single pool, or if all PGs are identical in size. The latter is not likely.

On Mon, Jun 25, 2018, 10:02 PM shadow_lin wrote:

Hi David,
I am afraid I can't run the commands you provided now, because I tried to remove another osd on that host to see if it would make the data distribution even, and it did.
The PG numbers of my pools are powers of 2.
Below is from my notes from before I removed the other osd:

pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull stripe_width 0 application rbd

PG distribution per osd for all pools: https://pasteboard.co/HrBZv3s.png

What I don't understand is why the data distribution is uneven when the PG distribution is even.

2018-06-26
shadow_lin

From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output of the following commands. It will print out all of your pool names along with how many PGs are in that pool. My guess is that you don't have a power of 2 number of PGs in your pool. Alternatively you might have multiple pools and the PGs from the various pools are just different sizes.

ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df

For me the output looks like this:

rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32

GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    46053G     26751G     19301G       41.91
POOLS:
    NAME                ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd-replica         4      897G       11.36     7006G         263000
    cephfs_metadata     6      141M       0.05      268G          11945
    cephfs_data         7      10746G     43.41     14012G        2795782
    rbd-replica-ssd     9      241G       47.30     268G          75061

On Sun, Jun 24, 2018 at 9:48 PM shadow_lin wrote:

Hi List,
The environment is:
    Ceph 12.2.4
    Balancer module on and in upmap mode
    Failure domain is per host, 2 OSDs per host
    EC k=4 m=2
    PG distribution is almost even before and after the rebalancing.

After marking out one of the osds, I noticed a lot of the data moving onto the other osd on the same host.

ceph osd df result (osd.20 and osd.21 are on the same host and osd.20 was marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560      1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560      1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560      1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560      1.0 9313G 7026G 2286G 75.45 1.00 134

I am using RBD only, so the objects should all be 4M. I don't understand why osd.21 got significantly more data with the same PG count as the other osds.
Is this behavior expected, or did I misconfigure something, or is this some kind of bug?
Thanks

2018-06-25
shadow_lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
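One way to break the PG counts above down per pool on the two OSDs in question is a loop like the sketch below. It assumes a Luminous-style ceph pg ls-by-osd table whose first column is the pgid; adjust the awk field if the output format differs on your release.

# PG count per pool on osd.19 vs osd.21 (the pool id is the part of the pgid before the dot)
for osd in 19 21; do
    echo "osd.$osd:"
    ceph pg ls-by-osd $osd | awk '$1 ~ /^[0-9]+\./ {split($1, a, "."); print a[1]}' | sort | uniq -c
done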
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
If you look at ceph pg dump, you'll see the size ceph believes each PG is. From your ceph df, your PGs for the rbd_pool will be almost zero. So if you have an osd with 6 of those PGs and another with none of them, but both osds have the same number of PGs overall... the osd with none of them will be more full than the other. I bet the osd that was really full just had fewer of those PGs than the rest.

On Mon, Jun 25, 2018, 10:25 PM shadow_lin wrote:
> Hi David,
> I am sure most (if not all) of the data is in one pool.
> rbd_pool is only used for omap for the EC RBD images.
>
> ceph df:
>
> GLOBAL:
>     SIZE     AVAIL       RAW USED     %RAW USED
>     427T     100555G     329T         77.03
> POOLS:
>     NAME            ID     USED     %USED     MAX AVAIL     OBJECTS
>     ec_rbd_pool     3      219T     81.40     50172G        57441718
>     rbd_pool        4      144      0         37629G        19
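To see the per-PG sizes David mentions directly from ceph pg dump, a sketch like the following can help. In Luminous-era output the BYTES column is the 7th field and the pool id is the part of the pgid before the dot, so verify against the header line and adjust the field numbers for other releases.

# per-PG bytes, largest first
ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $1, $7}' | sort -k2 -rn | head -20

# rough PG count and average PG size per pool
ceph pg dump 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {split($1, a, "."); sum[a[1]] += $7; n[a[1]]++}
    END {for (p in sum) printf "pool %s: %d PGs, avg %.1f GiB per PG\n", p, n[p], sum[p]/n[p]/(1024*1024*1024)}'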
[ceph-users] FS Reclaims storage too slow
Hi,
Is it normal that, a day after I deleted files from CephFS, Ceph still had not deleted the backing objects? Only after I restarted the MDS daemon did it start to release the storage space. I noticed the doc (http://docs.ceph.com/docs/mimic/dev/delayed-delete/) says the file is marked as deleted on the MDS and deleted lazily. What is the condition that triggers deletion of the backing objects? If a delay this long is normal, is there any way to make it faster, since the cluster is near full?
I'm using jewel 10.2.3 for both ceph-fuse and the MDS.
Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
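One way to see whether the MDS is actually purging is to look at its stray counters and purge throttles via the admin socket. This is only a sketch for a Jewel-era MDS; replace <name> with your MDS id, and note that the exact counter and option names can differ slightly between releases.

# stray (deleted but not yet purged) inode counters; files still held open by
# a client stay in stray until the client drops its capabilities
ceph daemon mds.<name> perf dump | grep -i stray

# throttles that limit how fast backing objects are purged
ceph daemon mds.<name> config show | grep -i purge

# if purging itself is the bottleneck, the throttles can be raised at runtime, e.g.
ceph daemon mds.<name> config set mds_max_purge_files 256
ceph daemon mds.<name> config set mds_max_purge_ops 32768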
Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
This is the formatted pg dump result: https://pasteboard.co/HrBZv3s.png
You can see that the PG distribution of each pool on each osd is fine.

2018-06-26
shadow_lin

From: David Turner
Sent: 2018-06-26 10:32
Subject: Re: Re: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

If you look at ceph pg dump, you'll see the size ceph believes each PG is. From your ceph df, your PGs for the rbd_pool will be almost zero. So if you have an osd with 6 of those PGs and another with none of them, but both osds have the same number of PGs overall... the osd with none of them will be more full than the other. I bet the osd that was really full just had fewer of those PGs than the rest.
[ceph-users] ceph on infiniband
Hi:
We are using Ceph on InfiniBand with the default configuration, so ms_type is async+posix. I see there are 3 kinds of messenger types. Which one is the most stable and has the best performance? Which one would you suggest I use in production?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
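For reference, a sketch of the relevant settings, assuming a Luminous-era build where the RDMA and DPDK transports are still flagged experimental. The osd.0 below stands in for any running daemon, and the RDMA device name is only an example that must match your own HCA.

# check what a running daemon is using (any daemon name works)
ceph daemon osd.0 config get ms_type

# ceph.conf sketch for the three async messenger transports
[global]
    # default: TCP sockets, which run over IPoIB on an InfiniBand fabric
    ms_type = async+posix

    # native RDMA transport (experimental); needs the verbs device name and
    # raised memory-lock limits for the daemons
    #ms_type = async+rdma
    #ms_async_rdma_device_name = mlx5_0

    # DPDK userspace transport (experimental)
    #ms_type = async+dpdk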