[ceph-users] One osd crashing daily, the problem with osd.50
hello

I am running a small lab ceph cluster consisting of 6 old used servers. They have 36 slots for drives, but too little RAM (32GB max for this mainboard) to take advantage of them all. When I get to around 20 OSDs on a node, the OOM killer becomes a problem if there are incidents that require recovery.

In order to remedy some of the RAM problems I am running the OSDs on 5-disk software RAID5 sets. This gives me about 7 12TB OSDs on a node plus a global hot spare. I have tried this on one of the nodes with good success, and I am in the process of doing the migration on the other nodes as well.

I am running on Debian jessie using the 0.94.6 hammer packages from ceph's repo.

But an issue has started appearing on one of these RAID5 OSDs: osd.50 has a tendency to stop roughly daily with the error message seen in the log below. The OSD is running on a healthy software RAID5 device, and I can see nothing in dmesg or any other log that indicates a problem with this md device.

Once I restart the OSD it is up and in, and it usually stays up and in for anywhere from some hours up to a few days. The other 6 OSDs on this node do not show the same problem. I have restarted this OSD about 8-10 times, so it is fairly regular.

The RAID5 sets are 12TB, so I was hoping to be able to fix the problem rather than zapping the md device and recreating it from scratch. I was also wondering whether there is something fundamentally wrong with running OSDs on software md RAID5 devices.

kind regards
Ronny Aasen

NB: ignore osd.41 in the log below, that was a single broken disk

-- ceph-50.log
-7> 2016-05-05 04:36:04.382514 7f2758135700 1 -- 10.24.11.22:6805/22452 <== osd.66 10.24.12.24:0/5339 22940 osd_ping(ping e51484 stamp 2016-05-05 04:36:04.367488) v2 47+0+0 (1824128128 0 0) 0x41bfba00 con 0x428df1e0
-6> 2016-05-05 04:36:04.382534 7f2756932700 1 -- 10.24.12.22:6803/22452 --> 10.24.12.24:0/5339 -- osd_ping(ping_reply e51484 stamp 2016-05-05 04:36:04.367488) v2 -- ?+0 0x4df27200 con 0x428de6e0
-5> 2016-05-05 04:36:04.382576 7f2758135700 1 -- 10.24.11.22:6805/22452 --> 10.24.12.24:0/5339 -- osd_ping(ping_reply e51484 stamp 2016-05-05 04:36:04.367488) v2 -- ?+0 0x4df4a200 con 0x428df1e0
-4> 2016-05-05 04:36:04.412314 7f2756932700 1 -- 10.24.12.22:6803/22452 <== osd.19 10.24.12.25:0/5355 22879 osd_ping(ping e51484 stamp 2016-05-05 04:36:04.412495) v2 47+0+0 (1694664336 0 0) 0x57434a00 con 0x421ddb20
-3> 2016-05-05 04:36:04.412366 7f2756932700 1 -- 10.24.12.22:6803/22452 --> 10.24.12.25:0/5355 -- osd_ping(ping_reply e51484 stamp 2016-05-05 04:36:04.412495) v2 -- ?+0 0x56c67800 con 0x421ddb20
-2> 2016-05-05 04:36:04.412394 7f2758135700 1 -- 10.24.11.22:6805/22452 <== osd.19 10.24.12.25:0/5355 22879 osd_ping(ping e51484 stamp 2016-05-05 04:36:04.412495) v2 47+0+0 (1694664336 0 0) 0x4e485600 con 0x428de9a0
-1> 2016-05-05 04:36:04.412440 7f2758135700 1 -- 10.24.11.22:6805/22452 --> 10.24.12.25:0/5355 -- osd_ping(ping_reply e51484 stamp 2016-05-05 04:36:04.412495) v2 -- ?+0 0x41bfba00 con 0x428de9a0
0> 2016-05-05 04:36:04.418305 7f274c91e700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f274c91e700 time 2016-05-05 04:36:04.115448
os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)

ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc03c46]
2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xcc2) [0x90af82]
3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x31c) [0xa1c0ec]
4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2ca) [0x8cd23a]
5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7dc0ba]
6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x3be) [0x7e437e]
7: (PG::scrub(ThreadPool::TPHandle&)+0x1d7) [0x7e5a87]
8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6b3e69]
9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa77) [0xbf41f7]
10: (ThreadPool::WorkThread::entry()+0x10) [0xbf52c0]
11: (()+0x80a4) [0x7f27790c20a4]
12: (clone()+0x6d) [0x7f277761d87d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 keyvaluestore 1/ 3 journal 0/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 fin
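
A minimal way to double-check the md device and the filestore outside of Ceph could look like the following sketch -- the device name /dev/md3 and the default Hammer filestore path are assumptions:

  # Ask md to verify parity across the whole array, then look for mismatches.
  echo check > /sys/block/md3/md/sync_action
  cat /proc/mdstat                        # progress of the check
  cat /sys/block/md3/md/mismatch_cnt      # non-zero means inconsistent stripes

  # Read every object file once and report anything that returns an I/O error,
  # which is what the failed assert above (got == -5, i.e. EIO) trips over.
  find /var/lib/ceph/osd/ceph-50/current -type f -print0 | \
    while IFS= read -r -d '' f; do
      dd if="$f" of=/dev/null bs=4M 2>/dev/null || echo "read error: $f"
    done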
Re: [ceph-users] One osd crashing daily, the problem with osd.50
Hello,

On Mon, 9 May 2016 09:31:20 +0200 Ronny Aasen wrote:

> hello
>
> I am running a small lab ceph cluster consisting of 6 old used servers.

That's larger than quite a few production deployments. ^_-

> they have 36 slots for drives. but too little ram, 32GB max, for this
> mainboard, to take advantage of them all. When i get to around 20 osd's
> on a node the OOM killer becomes a problem, if there is incidents that
> require recovery.
>

No surprise there, if you're limited to that little RAM I suspect you'd run out of CPU power with a full load, too.

> In order to remedy some of the ram problems i am running the osd's on 5
> disk raid5 software sets. this gives me about 7 12TB osd's on a node and
> a global hotspare. I have tried this on one of the nodes with good
> success. and I am in the process of doing the migrations on the other
> nodes as well.
>

That's optimizing for space and nothing else. Having done something similar in the past I would strongly recommend the following:

a) Use RAID6, so that you never have to worry about an OSD failure. I've personally lost 2 RAID5 sets of similar size due to double disk failures.

b) use RAID10 for much improved performance (IOPS).

To offset the loss in space, consider running with a replication of 2, which would be safe, same for option a).

> i am running on debian jessie using the 0.94.6 hammer from ceph's repo.
>
> but a issue has started appering on one of these raid5 osd's
>
> osd.50 have a tendency to stop ~daily with the error message seen in the
> log below. The osd is running on a healthy software raid5 disk, and i
> can see nothing in dmesg or any other log that can indicate a problem
> with this md device.

The key part of that log is EIO failed assert, if you google for "FAILED assert(allow_eio" you will get hits from last year, this is FS issue and has nothing to do with the RAID per se.

Which FS are you using? If it's not BTRFS and since your other OSDs are not having issues, it might be worth going over this FS with a fine comb.

The "near full" OSD is something that you want to address, too.

> once i restart the osd it's up and in and probably stays up and in for
> some hours upto a few days. the other 6 osd's on this node does not show
> the same problem. i have restarted this osd about 8-10 times. so it's
> fairly regular.
>

Might have to bite the bullet and re-create it if you can't find the issue.

> the raid5 sets are 12TB so i was hoping to be able to fix the problem,
> rather then zapping the md and recreating from scratch. I was also
> worrying if there was something fundamentaly wrong about running osd's
> on software md raid5 devices.
>

No problem in and by itself, other than reduced performance.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
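
Going over the FS with a fine comb could be sketched roughly as below, assuming the OSD turns out to be on XFS (as confirmed later in the thread); the init commands, the device name /dev/md3 and the mount point are assumptions:

  /etc/init.d/ceph stop osd.50        # or: systemctl stop ceph-osd@50, depending on the init system
  umount /var/lib/ceph/osd/ceph-50
  xfs_repair -n /dev/md3              # -n: check only, report problems, change nothing
  mount /dev/md3 /var/lib/ceph/osd/ceph-50
  /etc/init.d/ceph start osd.50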
Re: [ceph-users] One osd crashing daily, the problem with osd.50
Thanks for your comments. Answers inline.

On 05/09/16 09:53, Christian Balzer wrote:
> Hello,
>
> On Mon, 9 May 2016 09:31:20 +0200 Ronny Aasen wrote:
>> hello
>>
>> I am running a small lab ceph cluster consisting of 6 old used servers.
>
> That's larger than quite a few production deployments. ^_-

:)

>> they have 36 slots for drives. but too little ram, 32GB max, for this
>> mainboard, to take advantage of them all. When i get to around 20 osd's
>> on a node the OOM killer becomes a problem, if there is incidents that
>> require recovery.
>
> No surprise there, if you're limited to that little RAM I suspect you'd
> run out of CPU power with a full load, too.

CPU has not been an issue yet, but I suspect it might change when I can release enough RAM to try erasure coding.

>> In order to remedy some of the ram problems i am running the osd's on 5
>> disk raid5 software sets. this gives me about 7 12TB osd's on a node and
>> a global hotspare. I have tried this on one of the nodes with good
>> success. and I am in the process of doing the migrations on the other
>> nodes as well.
>
> That's optimizing for space and nothing else. Having done something
> similar in the past I would strongly recommend the following:
> a) Use RAID6, so that you never have to worry about an OSD failure. I've
> personally lost 2 RAID5 sets of similar size due to double disk failures.
> b) use RAID10 for much improved performance (IOPS).
> To offset the loss in space, consider running with a replication of 2,
> which would be safe, same for option a).

Yes, definitively optimizing for space, and space only. That's also the reason for RAID5 and not RAID6 or 10, and why erasure coding is a must-have if I can get there.

>> i am running on debian jessie using the 0.94.6 hammer from ceph's repo.
>>
>> but a issue has started appering on one of these raid5 osd's
>>
>> osd.50 have a tendency to stop ~daily with the error message seen in the
>> log below. The osd is running on a healthy software raid5 disk, and i
>> can see nothing in dmesg or any other log that can indicate a problem
>> with this md device.
>
> The key part of that log is EIO failed assert, if you google for
> "FAILED assert(allow_eio" you will get hits from last year, this is FS
> issue and has nothing to do with the RAID per se.
> Which FS are you using? If it's not BTRFS and since your other OSDs are
> not having issues, it might be worth going over this FS with a fine comb.
> The "near full" OSD is something that you want to address, too.

The near full is already fixed. I am running on XFS. I had assumed that FS issues would also show up in the system logs. I'll google for this issue.

>> once i restart the osd it's up and in and probably stays up and in for
>> some hours upto a few days. the other 6 osd's on this node does not show
>> the same problem. i have restarted this osd about 8-10 times. so it's
>> fairly regular.
>
> Might have to bite the bullet and re-create it if you can't find the issue.

Thanks. I'll do this today if nobody (or google) has any better suggestions.

>> the raid5 sets are 12TB so i was hoping to be able to fix the problem,
>> rather then zapping the md and recreating from scratch. I was also
>> worrying if there was something fundamentaly wrong about running osd's
>> on software md raid5 devices.
>
> No problem in and by itself, other than reduced performance.

Kind regards,
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
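
Removing and re-creating a single OSD on hammer goes roughly like the sketch below; the backing device /dev/md3 and the final prepare step are assumptions and depend on how the OSD was originally deployed:

  ceph osd out 50                    # let data drain; wait until all PGs are active+clean again
  /etc/init.d/ceph stop osd.50
  ceph osd crush remove osd.50
  ceph auth del osd.50
  ceph osd rm 50
  umount /var/lib/ceph/osd/ceph-50
  ceph-disk zap /dev/md3
  ceph-disk prepare /dev/md3         # or re-run whatever tooling built the OSD in the first place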
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
I try to simplify the question to get some feedback.

Is anyone running the RadosGW in production with S3 and SWIFT API active at the same time?

thank you !

Saverio

2016-05-06 11:39 GMT+02:00 Saverio Proto :
> Hello,
>
> We have been running the Rados GW with the S3 API and we did not have
> problems for more than a year.
>
> We recently enabled also the SWIFT API for our users.
>
> radosgw --version
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>
> The idea is that each user of the system is free of choosing the S3
> client or the SWIFT client to access the same container/buckets.
>
> Please tell us if this is possible by design or if we are doing something
> wrong.
>
> We have now a problem that some files wrote in the past with S3,
> cannot be read with the SWIFT API because the md5sum always fails.
>
> I am able to reproduce the bug in this way:
>
> We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
> the correct md5 is 1c8113d2bd21232688221ec74dccff3a
> You can download the same file here:
> https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20120701-ts.gz?dl=0
>
> rclone mkdir lss3:bugreproduce
> rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
>
> The file is successfully uploaded.
>
> At this point I can succesfully download again the file:
> rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz test.gz
>
> but not with swift:
>
> swift download googlebooks-ngrams-gz
> fre/googlebooks-fre-all-2gram-20120701-ts.gz
> Error downloading object
> 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
> u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
> md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
> 1a209a31b4ac3eb923fac5e8d194d9d3-2'
>
> Also I found strange the dash character '-' at the end of the md5 that
> is trying to compare.
>
> Of course upload a file with the swift client and redownloading the
> same file just works.
>
> Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
>
> thank you
>
> Saverio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Boot from image (create a new volume) on Openstack dashboard
Hi, Calvin!

Actually it's an OpenStack question rather than a Ceph one, but you can use the "cinder quota-show" and "cinder quota-update" commands to show and set a tenant's quota.

You can look at this question/answer about setting volume quotas for certain volume types; it's somewhat relevant to your question:
https://ask.openstack.org/en/question/57912/can-you-restrict-a-user-or-tenant-to-a-cinder-volume-type/

I don't remember exactly, but as far as I recall you can set volume limits in Horizon as well (though I'm not sure you can set per-volume-type quotas from there).

Kind regards,
Vladimir

2016-05-09 11:35 GMT+05:00 박선규 [slash] :
> Hi
>
> When launch a new instance as 'Boot from image (create a new volume)' on
> Openstack dashboard,
> normal user got below error message. (admin user is okay)
>
> 'The requested instance cannot be launched. Requested volume exceeds
> quota: Available: 0, Requested: 1.'
>
> How to assign permission or quota to normal user ?
>
> Regards
> Calvin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
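
A short sketch of those commands, with the tenant ID and numbers as placeholders:

  cinder quota-show <tenant_id>                                   # current limits for the tenant
  cinder quota-usage <tenant_id>                                  # what is actually in use
  cinder quota-update --volumes 20 --gigabytes 2000 <tenant_id>   # raise volume count / capacity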
Re: [ceph-users] How to avoid kernel conflicts
On Mon, May 9, 2016 at 12:19 AM, K.C. Wong wrote:
>
>> As the tip said, you should not use rbd via kernel module on an OSD host
>>
>> However, using it with userspace code (librbd etc, as in kvm) is fine
>>
>> Generally, you should not have both:
>> - "server" in userspace
>> - "client" in kernelspace
>
> If `librbd` would help avoid this problem, then switch to `rbd-fuse`
> should do the trick, right?
>
> The reason for my line of question is that I've seen occasionl freeze
> up of `rbd map` that's resolved by a 'slight tap' by way of an strace.
> There is definitely great attractiveness to not have specialized nodes
> and make every one the same as the next one on the rack.

The problem with placing the kernel client on the OSD node is the potential deadlock under heavy I/O when memory becomes scarce. It's not recommended, but people are doing it - if you don't stress your system too much, it'll never happen.

The "rbd map" freeze is definitely not related to the above. Did the actual command hang? Could you describe what you saw in more detail, and how did strace help? It could be that you ran into

    http://tracker.ceph.com/issues/14737

Thanks,

                Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at the same time
Hi,

I'm not running a cluster like yours, but I don't think the issue is caused by using the 2 APIs at the same time.

IIRC the dash suffix is appended by S3 multipart upload, with the following digit indicating the number of parts. You may want to check this report in the s3cmd community:
https://sourceforge.net/p/s3tools/bugs/123/
and some basic info from Amazon:
http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html

Hope this helps :D

Regards,
---Sandy

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Saverio Proto
> Sent: Monday, May 09, 2016 4:42 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RadosGW - Problems running the S3 and SWIFT API at
> the same time
>
> I try to simplify the question to get some feedback.
>
> Is anyone running the RadosGW in production with S3 and SWIFT API active at
> the same time ?
>
> thank you !
>
> Saverio
>
>
> 2016-05-06 11:39 GMT+02:00 Saverio Proto :
> > Hello,
> >
> > We have been running the Rados GW with the S3 API and we did not have
> > problems for more than a year.
> >
> > We recently enabled also the SWIFT API for our users.
> >
> > radosgw --version
> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> >
> > The idea is that each user of the system is free of choosing the S3
> > client or the SWIFT client to access the same container/buckets.
> >
> > Please tell us if this is possible by design or if we are doing something
> > wrong.
> >
> > We have now a problem that some files wrote in the past with S3,
> > cannot be read with the SWIFT API because the md5sum always fails.
> >
> > I am able to reproduce the bug in this way:
> >
> > We have this file googlebooks-fre-all-2gram-20120701-ts.gz and we know
> > the correct md5 is 1c8113d2bd21232688221ec74dccff3a You can download
> > the same file here:
> > https://www.dropbox.com/s/auq16vdv2maw4p7/googlebooks-fre-all-2gram-20
> > 120701-ts.gz?dl=0
> >
> > rclone mkdir lss3:bugreproduce
> > rclone copy googlebooks-fre-all-2gram-20120701-ts.gz lss3:bugreproduce
> >
> > The file is successfully uploaded.
> >
> > At this point I can succesfully download again the file:
> > rclone copy lss3:bugreproduce/googlebooks-fre-all-2gram-20120701-ts.gz
> > test.gz
> >
> > but not with swift:
> >
> > swift download googlebooks-ngrams-gz
> > fre/googlebooks-fre-all-2gram-20120701-ts.gz
> > Error downloading object
> > 'googlebooks-ngrams-gz/fre/googlebooks-fre-all-2gram-20120701-ts.gz':
> > u'Error downloading fre/googlebooks-fre-all-2gram-20120701-ts.gz:
> > md5sum != etag, 1c8113d2bd21232688221ec74dccff3a !=
> > 1a209a31b4ac3eb923fac5e8d194d9d3-2'
> >
> > Also I found strange the dash character '-' at the end of the md5 that
> > is trying to compare.
> >
> > Of course upload a file with the swift client and redownloading the
> > same file just works.
> >
> > Should I open a bug for the radosgw on http://tracker.ceph.com/ ?
> >
> > thank you
> >
> > Saverio
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-
This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above.
Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
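
A quick way to confirm locally that the "-2" suffix just means "2 parts" is to recompute the multipart ETag -- a sketch; the part size is an assumption and has to match whatever chunk size the uploading client used:

  PART_SIZE=$((15 * 1024 * 1024))     # assumption: the client uploaded in 15 MiB parts
  split -b "$PART_SIZE" googlebooks-fre-all-2gram-20120701-ts.gz part_
  for p in part_*; do md5sum "$p" | cut -d' ' -f1; done | xxd -r -p | md5sum
  # The multipart ETag is the MD5 of the concatenated binary MD5s of the parts,
  # plus "-<number of parts>". If the assumed part size is right, this prints
  # the etag body seen above (1a209a31b4ac3eb923fac5e8d194d9d3).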
[ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
Hi All,

I've been testing an active/active Samba cluster over CephFS; performance seems really good with small files compared to Gluster. Soft reboots work beautifully with little to no interruption in file access. However, when I perform a hard shutdown/reboot of one of the Samba nodes, the remaining node detects that the other Samba node has disappeared but then eventually bans itself. If I leave everything for around 5 minutes, CTDB unbans itself and then everything continues running.

From what I can work out, it looks like, because the MDS still has a stale session from the powered-down node, it won't let the remaining node access the CTDB lock file (which is also sitting on the CephFS). CTDB meanwhile is hammering away trying to access the lock file, but it sees what it thinks is a split brain scenario because something still has a lock on the lockfile, and so bans itself.

I'm guessing the solution is to either reduce the mds session timeout or increase the amount of time/retries for CTDB, but I'm not sure what's the best approach. Does anyone have any ideas?

Nick
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
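
The roughly 5 minutes of self-imposed ban sounds like CTDB's RecoveryBanPeriod, which defaults to 300 seconds, so the CTDB side can also be experimented with at runtime -- a sketch, values for testing only:

  ctdb listvars | grep -i ban        # shows RecoveryBanPeriod (default 300)
  ctdb setvar RecoveryBanPeriod 60   # shorten the self-ban while experimenting
  # assumption: to persist it, set CTDB_SET_RecoveryBanPeriod=60 in /etc/default/ctdb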
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
On Mon, 9 May 2016, Nick Fisk wrote: > Hi All, > > I've been testing an active/active Samba cluster over CephFS, performance > seems really good with small files compared to Gluster. Soft reboots work > beautifully with little to no interruption in file access. However when I > perform a hard shutdown/reboot of one of the samba nodes, the remaining node > detects that the other Samba node has disappeared but then eventually bans > itself. If I leave everything for around 5 minutes, CTDB unbans itself and > then everything continues running. > > From what I can work out it looks like as the MDS has a stale session from > the powered down node, it won't let the remaining node access the CTDB lock > file (which is also sitting the on the CephFS). CTDB meanwhile is hammering > away trying to access the lock file, but it sees what it thinks is a split > brain scenario because something still has a lock on the lockfile, and so > bans itself. > > I'm guessing the solution is to either reduce the mds session timeout or > increase the amount of time/retries for CTDB, but I'm not sure what's the > best approach. Does anyone have any ideas? I believe Ira was looking at this exact issue, and addressed it by lowering the mds_session_timeout to 30 seconds? sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
On Mon, May 9, 2016 at 8:48 AM, Sage Weil wrote: > On Mon, 9 May 2016, Nick Fisk wrote: >> Hi All, >> >> I've been testing an active/active Samba cluster over CephFS, performance >> seems really good with small files compared to Gluster. Soft reboots work >> beautifully with little to no interruption in file access. However when I >> perform a hard shutdown/reboot of one of the samba nodes, the remaining node >> detects that the other Samba node has disappeared but then eventually bans >> itself. If I leave everything for around 5 minutes, CTDB unbans itself and >> then everything continues running. >> >> From what I can work out it looks like as the MDS has a stale session from >> the powered down node, it won't let the remaining node access the CTDB lock >> file (which is also sitting the on the CephFS). CTDB meanwhile is hammering >> away trying to access the lock file, but it sees what it thinks is a split >> brain scenario because something still has a lock on the lockfile, and so >> bans itself. >> >> I'm guessing the solution is to either reduce the mds session timeout or >> increase the amount of time/retries for CTDB, but I'm not sure what's the >> best approach. Does anyone have any ideas? > > I believe Ira was looking at this exact issue, and addressed it by > lowering the mds_session_timeout to 30 seconds? That's the default timeout. I think he lowered the beacon intervals to 5 seconds, plus whatever else flows out from that. We aren't quite sure if that's a good idea for real deployments or not, though! -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ACL support in Jewel using fuse and SAMBA
On Sat, May 7, 2016 at 9:53 PM, Eric Eastman wrote: > On Fri, May 6, 2016 at 2:14 PM, Eric Eastman > wrote: > >> As it should be working, I will increase the logging level in my >> smb.conf file and see what info I can get out of the logs, and report back. > > Setting the log level = 20 in my smb.conf file, and trying to add an > additional user to a directory on the Windows 2012 server, that has > mounted the share using a fuse mount to a Ceph file system shows the > error: "Operation not supported" in the smbd log file: > > [2016/05/07 23:41:19.213997, 10, pid=2823630, effective(2000501, > 2000514), real(2000501, 0)] > ../source3/modules/vfs_posixacl.c:92(posixacl_sys_acl_set_file) > Calling acl_set_file: New folder (4), 0 > [2016/05/07 23:41:19.214170, 10, pid=2823630, effective(2000501, > 2000514), real(2000501, 0)] > ../source3/modules/vfs_posixacl.c:111(posixacl_sys_acl_set_file) > acl_set_file failed: Operation not supported > > A simple test of setting an ACL from the command line to a fuse > mounted Ceph file system also fails: > # mkdir /cephfsFUSE/x > # setfacl -m d:o:rw /cephfsFUSE/x > setfacl: /cephfsFUSE/x: Operation not supported > > The same test to the same Ceph file system using the kernel mount > method works. > > Is there some option in my ceph.conf file or on the mount line that > needs to be used to support setting ACLs on a fuse mounted Ceph file > system? A quick check of the man page doesn't tell me what setfacl is doing, but I imagine this is another oddity of using FUSE filesystems. Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/ there's some superblock flag that needs to be set in order for the VFS to allow ACLs. I'm not sure offhand if that's something that FUSE will let us do or not; please create a tracker ticket and somebody will get to it. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to avoid kernel conflicts
The systems on which the `rbd map` hang occurred are definitely not under memory stress. I don't believe they are doing a lot of disk I/O either. Here's the basic set-up:

* all nodes in the "data-plane" are identical
* they each host an OSD instance, sharing one of the drives
* I'm running Docker containers using an RBD volume plugin and Docker Compose
* when the hang happens, the most visible behavior is that `docker ps` hangs
* then I run `systemctl status` and see an `rbd map` process spawned by the RBD volume plugin
* I then tried an `strace -f -p ` and that process promptly exits (with RC 0) and the hang resolves itself

I'll try to capture the strace output the next time I run into it and share it with the mailing list.

Thanks, Ilya.

-kc

> On May 9, 2016, at 2:21 AM, Ilya Dryomov wrote:
>
> On Mon, May 9, 2016 at 12:19 AM, K.C. Wong wrote:
>>
>>> As the tip said, you should not use rbd via kernel module on an OSD host
>>>
>>> However, using it with userspace code (librbd etc, as in kvm) is fine
>>>
>>> Generally, you should not have both:
>>> - "server" in userspace
>>> - "client" in kernelspace
>>
>> If `librbd` would help avoid this problem, then switch to `rbd-fuse`
>> should do the trick, right?
>>
>> The reason for my line of question is that I've seen occasionl freeze
>> up of `rbd map` that's resolved by a 'slight tap' by way of an strace.
>> There is definitely great attractiveness to not have specialized nodes
>> and make every one the same as the next one on the rack.
>
> The problem with placing the kernel client on the OSD node is the
> potential deadlock under heavy I/O when memory becomes scarce. It's
> not recommended, but people are doing it - if you don't stress your
> system too much, it'll never happen.
>
> "rbd map" freeze is definitely not related to the abov. Did the actual
> command hang? Could you describe what you saw in more detail and how
> did strace help? It could be that you ran into
>
>    http://tracker.ceph.com/issues/14737
>
> Thanks,
>
>    Ilya

K.C. Wong
kcw...@verseon.com
4096R/B8995EDE
E527 CBE8 023E 79EA 8BBB 5C77 23A6 92E9 B899 5EDE
hkps://hkps.pool.sks-keyservers.net
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
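
In case it helps when it next happens, a sketch of what could be captured before poking the stuck process with strace -- the pid is a placeholder:

  cat /proc/<pid>/stack            # kernel-side stack of the stuck "rbd map"
  dmesg | tail -n 50               # any rbd / libceph messages around that time
  echo w > /proc/sysrq-trigger     # dump all blocked (D-state) tasks to the kernel log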
Re: [ceph-users] ACL support in Jewel using fuse and SAMBA
On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum wrote: > On Sat, May 7, 2016 at 9:53 PM, Eric Eastman > wrote: >> On Fri, May 6, 2016 at 2:14 PM, Eric Eastman >> wrote: >> >> >> A simple test of setting an ACL from the command line to a fuse >> mounted Ceph file system also fails: >> # mkdir /cephfsFUSE/x >> # setfacl -m d:o:rw /cephfsFUSE/x >> setfacl: /cephfsFUSE/x: Operation not supported >> >> The same test to the same Ceph file system using the kernel mount >> method works. >> >> Is there some option in my ceph.conf file or on the mount line that >> needs to be used to support setting ACLs on a fuse mounted Ceph file >> system? > > A quick check of the man page doesn't tell me what setfacl is doing, > but I imagine this is another oddity of using FUSE filesystems. > > Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/ > there's some superblock flag that needs to be set in order for the VFS > to allow ACLs. I'm not sure offhand if that's something that FUSE will > let us do or not; please create a tracker ticket and somebody will get > to it. > -Greg Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783 Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
I am trying to do some similar testing with SAMBA and CTDB with the Ceph file system. Are you using the vfs_ceph SAMBA module or are you kernel mounting the Ceph file system? Thanks Eric On Mon, May 9, 2016 at 9:31 AM, Nick Fisk wrote: > Hi All, > > I've been testing an active/active Samba cluster over CephFS, performance > seems really good with small files compared to Gluster. Soft reboots work > beautifully with little to no interruption in file access. However when I > perform a hard shutdown/reboot of one of the samba nodes, the remaining node > detects that the other Samba node has disappeared but then eventually bans > itself. If I leave everything for around 5 minutes, CTDB unbans itself and > then everything continues running. > > From what I can work out it looks like as the MDS has a stale session from > the powered down node, it won't let the remaining node access the CTDB lock > file (which is also sitting the on the CephFS). CTDB meanwhile is hammering > away trying to access the lock file, but it sees what it thinks is a split > brain scenario because something still has a lock on the lockfile, and so > bans itself. > > I'm guessing the solution is to either reduce the mds session timeout or > increase the amount of time/retries for CTDB, but I'm not sure what's the > best approach. Does anyone have any ideas? > > Nick > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
> -Original Message-
> From: Ira Cooper [mailto:icoo...@redhat.com]
> Sent: 09 May 2016 17:31
> To: Sage Weil
> Cc: Nick Fisk ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on
> lockfile
> - Original Message -
> > On Mon, 9 May 2016, Nick Fisk wrote:
> > > Hi All,
> > >
> > > I've been testing an active/active Samba cluster over CephFS,
> > > performance seems really good with small files compared to Gluster.
> > > Soft reboots work beautifully with little to no interruption in file
> > > access. However when I perform a hard shutdown/reboot of one of the
> > > samba nodes, the remaining node detects that the other Samba node
> > > has disappeared but then eventually bans itself. If I leave
> > > everything for around 5 minutes, CTDB unbans itself and then
> > > everything continues running.
> > >
> > > From what I can work out it looks like as the MDS has a stale
> > > session from the powered down node, it won't let the remaining node
> > > access the CTDB lock file (which is also sitting the on the CephFS).
> > > CTDB meanwhile is hammering away trying to access the lock file, but
> > > it sees what it thinks is a split brain scenario because something
> > > still has a lock on the lockfile, and so bans itself.
> > >
> > > I'm guessing the solution is to either reduce the mds session
> > > timeout or increase the amount of time/retries for CTDB, but I'm not
> > > sure what's the best approach. Does anyone have any ideas?
> >
> > I believe Ira was looking at this exact issue, and addressed it by
> > lowering the mds_session_timeout to 30 seconds?
>
> Actually...
>
> There's a problem with the way I did it, in that there's issues in CephFS that
> start to come out. Like the fact that it doesn't ban clients properly. :(

Could you shed any more light on what these issues might be? I'm assuming they are around the locking part of ctdb?

> Greg's made comments about this not being production safe, I tend to agree.
> ;)
>
> But it is possible, to make the cluster happy, I've been testing on VMs with
> the following added to my ceph.conf for "a while" now.
>
> DISCLAIMER: THESE ARE NOT PRODUCTION SETTINGS! DO NOT USE IN
> PRODUCTION IF YOU LIKE YOUR DATA!
>
> mds_session_timeout = 5
> mds_tick_interval = 1
> mon_tick_interval = 1
> mon_session_timeout = 2
> mds_session_autoclose = 15

These all look like they make Ceph more responsive to the loss of a client. As per your warning above, what negative effects do you see potentially arising from them? Or is that more of a warning because they haven't had long-term testing?

If the problem is only around the ctdb locking to avoid split brain, I would imagine using ctdb in conjunction with pacemaker to handle the fencing would also be a workaround?

> Since I did this, there have been changes made to CTDB to allow an external
> program to be the arbitrator instead of the fcntl lockfile. I'm working on an
> etcd integration for that. Not that it is that complicated, but making sure you
> get the details right is a minor pain.
>
> Also I'll be giving a talk on all of this at SambaXP on Thursday, so if you are
> there, feel free to catch me in the hall. (That goes for anyone interested in
> this topic or ceph/samba topics in general!)

I would be really interested in the slides/video if any will be available after the event.

> Clearly my being at SambaXP will slow the etcd integration down. And I'm
> betting Greg, John or Sage will want to talk to me about using mon instead of
> etcd ;). Call it a "feeling".
>
> Cheers,
>
> -Ira
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
Hi Eric, > -Original Message- > From: Eric Eastman [mailto:eric.east...@keepertech.com] > Sent: 09 May 2016 19:21 > To: Nick Fisk > Cc: Ceph Users > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > lockfile > > I am trying to do some similar testing with SAMBA and CTDB with the Ceph > file system. Are you using the vfs_ceph SAMBA module or are you kernel > mounting the Ceph file system? I'm using the kernel client. I couldn't find any up to date information on if the vfs plugin supported all the necessary bits and pieces. How is your testing coming along? I would be very interested in any findings you may have come across. Nick > > Thanks > Eric > > On Mon, May 9, 2016 at 9:31 AM, Nick Fisk wrote: > > Hi All, > > > > I've been testing an active/active Samba cluster over CephFS, > > performance seems really good with small files compared to Gluster. > > Soft reboots work beautifully with little to no interruption in file > > access. However when I perform a hard shutdown/reboot of one of the > > samba nodes, the remaining node detects that the other Samba node has > > disappeared but then eventually bans itself. If I leave everything for > > around 5 minutes, CTDB unbans itself and then everything continues > running. > > > > From what I can work out it looks like as the MDS has a stale session > > from the powered down node, it won't let the remaining node access the > > CTDB lock file (which is also sitting the on the CephFS). CTDB > > meanwhile is hammering away trying to access the lock file, but it > > sees what it thinks is a split brain scenario because something still > > has a lock on the lockfile, and so bans itself. > > > > I'm guessing the solution is to either reduce the mds session timeout > > or increase the amount of time/retries for CTDB, but I'm not sure > > what's the best approach. Does anyone have any ideas? > > > > Nick > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
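
For reference, a minimal share stanza using the vfs_ceph module looks roughly like the sketch below -- the share name, path and cephx user are assumptions, and as noted elsewhere in the thread the module is still missing some pieces (e.g. ACL support):

  [cephshare]
      path = /shares/test
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      ceph:user_id = samba
      kernel share modes = no      # vfs_ceph cannot use kernel share modes/locks
      read only = no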
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote:
> Hi Eric,
>
>>
>> I am trying to do some similar testing with SAMBA and CTDB with the Ceph
>> file system. Are you using the vfs_ceph SAMBA module or are you kernel
>> mounting the Ceph file system?
>
> I'm using the kernel client. I couldn't find any up to date information on if
> the vfs plugin supported all the necessary bits and pieces.
>
> How is your testing coming along? I would be very interested in any findings
> you may have come across.
>
> Nick

I am also using CephFS kernel mounts, with 4 SAMBA gateways.

When, from a SAMBA client, I write a large file (about 2GB) to a gateway that is not the holder of the CTDB lock file and then kill that gateway server during the write, the IP failover works as expected, and in most cases the file ends up being the correct size after the new server finishes writing it, but the data is corrupt. The data in the file, from the point of the failover, is all zeros.

I thought the issue might be with the kernel mount, so I looked into using the SAMBA vfs_ceph module, but I need SAMBA with AD support, and the current vfs_ceph module, even in the SAMBA git master version, is lacking ACL support for CephFS, as the vfs_ceph.c patches submitted to the SAMBA mailing list are not yet available. See:
https://lists.samba.org/archive/samba-technical/2016-March/113063.html

I tried using a FUSE mount of the CephFS, and it also fails setting ACLs. See:
http://tracker.ceph.com/issues/15783

My current status is that IP failover is working, but I am seeing data corruption on writes to the share when using kernel mounts. I am also seeing the issue you reported when I kill the system holding the CTDB lock file.

Are you verifying your data after each failover?

Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
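
The kind of verification meant here is simply a checksum comparison across the failover -- a rough sketch, with share and path names as placeholders:

  md5sum bigfile.bin                                         # on the client, before the copy
  smbclient //ctdb-vip/share -U user -c 'put bigfile.bin'
  # hard-reset the active gateway while the copy is in flight, let CTDB fail over,
  # then compare what actually landed in CephFS on a surviving gateway:
  md5sum /mnt/cephfs/share/bigfile.bin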
Re: [ceph-users] ACL support in Jewel using fuse and SAMBA
On Tue, May 10, 2016 at 2:10 AM, Eric Eastman wrote: > On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum wrote: >> On Sat, May 7, 2016 at 9:53 PM, Eric Eastman >> wrote: >>> On Fri, May 6, 2016 at 2:14 PM, Eric Eastman >>> wrote: >>> > >>> >>> A simple test of setting an ACL from the command line to a fuse >>> mounted Ceph file system also fails: >>> # mkdir /cephfsFUSE/x >>> # setfacl -m d:o:rw /cephfsFUSE/x >>> setfacl: /cephfsFUSE/x: Operation not supported >>> >>> The same test to the same Ceph file system using the kernel mount >>> method works. >>> >>> Is there some option in my ceph.conf file or on the mount line that >>> needs to be used to support setting ACLs on a fuse mounted Ceph file >>> system? >> >> A quick check of the man page doesn't tell me what setfacl is doing, >> but I imagine this is another oddity of using FUSE filesystems. >> >> Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/ >> there's some superblock flag that needs to be set in order for the VFS >> to allow ACLs. I'm not sure offhand if that's something that FUSE will >> let us do or not; please create a tracker ticket and somebody will get >> to it. >> -Greg > > Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783 fuse kernel does not have ACL support. To use ACL, you need to add "--fuse_default_permission=0 --client_acl_type=posix_acl" options to ceph-fuse. The '--fuse_default_permission=0' option disables kernel file permission check and let ceph-fuse do the check. Regards Yan, Zheng > > Eric > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] thanks for a double check on ceph's config
Hi members,

We have 21 hosts for ceph OSD servers. Each host has 12 SATA disks (4TB each) and 64GB memory.
ceph version 10.2.0, Ubuntu 16.04 LTS

The whole cluster is newly installed. Can you help check whether the arguments we put in ceph.conf are reasonable or not? thanks.

[osd]
osd_data = /var/lib/ceph/osd/ceph-$id
osd_journal_size = 2
osd_mkfs_type = xfs
osd_mkfs_options_xfs = -f
filestore_xattr_use_omap = true
filestore_min_sync_interval = 10
filestore_max_sync_interval = 15
filestore_queue_max_ops = 25000
filestore_queue_max_bytes = 10485760
filestore_queue_committing_max_ops = 5000
filestore_queue_committing_max_bytes = 1048576
journal_max_write_bytes = 1073714824
journal_max_write_entries = 1
journal_queue_max_ops = 5
journal_queue_max_bytes = 1048576
osd_max_write_size = 512
osd_client_message_size_cap = 2147483648
osd_deep_scrub_stride = 131072
osd_op_threads = 8
osd_disk_threads = 4
osd_map_cache_size = 1024
osd_map_cache_bl_size = 128
osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd_recovery_op_priority = 4
osd_recovery_max_active = 10
osd_max_backfills = 4

[client]
rbd_cache = true
rbd_cache_size = 268435456
rbd_cache_max_dirty = 134217728
rbd_cache_max_dirty_age = 5
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ACL support in Jewel using fuse and SAMBA
On Mon, May 9, 2016 at 8:08 PM, Yan, Zheng wrote: > On Tue, May 10, 2016 at 2:10 AM, Eric Eastman > wrote: >> On Mon, May 9, 2016 at 10:36 AM, Gregory Farnum wrote: >>> On Sat, May 7, 2016 at 9:53 PM, Eric Eastman >>> wrote: On Fri, May 6, 2016 at 2:14 PM, Eric Eastman wrote: >> A simple test of setting an ACL from the command line to a fuse mounted Ceph file system also fails: # mkdir /cephfsFUSE/x # setfacl -m d:o:rw /cephfsFUSE/x setfacl: /cephfsFUSE/x: Operation not supported The same test to the same Ceph file system using the kernel mount method works. Is there some option in my ceph.conf file or on the mount line that needs to be used to support setting ACLs on a fuse mounted Ceph file system? >>> >>> A quick check of the man page doesn't tell me what setfacl is doing, >>> but I imagine this is another oddity of using FUSE filesystems. >>> >>> Judging by https://sourceforge.net/p/fuse/mailman/message/23787505/ >>> there's some superblock flag that needs to be set in order for the VFS >>> to allow ACLs. I'm not sure offhand if that's something that FUSE will >>> let us do or not; please create a tracker ticket and somebody will get >>> to it. >>> -Greg >> >> Thank you for your help. I have opened: http://tracker.ceph.com/issues/15783 > > > fuse kernel does not have ACL support. To use ACL, you need to add > "--fuse_default_permission=0 --client_acl_type=posix_acl" options to > ceph-fuse. The '--fuse_default_permission=0' option disables kernel > file permission check and let ceph-fuse do the check. > > Regards > Yan, Zheng > Thank you for the answer. My command line test to set the ACL: setfacl -m d:o:rw /cephfsFUSE/x Now works. I did find a minor typo, where --fuse_default_permission=0 needs an "s", so it is --fuse_default_permissions=0. The line I am using in my /etc/fstab is: id=cephfs,keyring=/etc/ceph/client.cephfs.keyring,client_acl_type=posix_acl,fuse_default_permissions=0 /cephfsFUSE fuse.ceph noatime,noauto 0 0 -Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com