Re: [ceph-users] data corruption with hammer
On Thu, 17 Mar 2016, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > I'm having trouble finding documentation about using ceph_test_rados. Can I > run this on the existing cluster and will that provide useful info? It seems > running it in the build will not have the caching set up (vstart.sh). > > I have accepted a job with another company and only have until Wednesday to > help with getting information about this bug. My new job will not be using C > eph, so I won't be able to provide any additional info after Tuesday. I want > to leave the company on a good trajectory for upgrading, so any input you c > an provide will be helpful. I'm sorry to hear it! You'll be missed. :) > I've found: > > ./ceph_test_rados --op read 100 --op write 100 --op delete 50 > - --max-ops 40 --objects 1024 --max-in-flight 64 --size 400 > - --min-stride-size 40 --max-stride-size 80 --max-seconds 600 > - --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op > rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0 > > Is that enough if I change --pool to the cached pool and do the toggling whi > le ceph_test_rados is running? I think this will run for 10 minutes. Precisely. You can probably drop copy_from and snap ops from the list since your workload wasn't exercising those. Thanks! sage > > Thanks, > -BEGIN PGP SIGNATURE- > Version: Mailvelope v1.3.6 > Comment: https://www.mailvelope.com > > wsFcBAEBCAAQBQJW6tjwCRDmVDuy+mK58QAANKgP/ia5TA/7kTUpmciVR2BW > t0MrilXAIvdikHlaWTVIxEmb4S8X+57hziEZUd6hLBMnKnuUQxsDb3yyuZX4 > iqaE8KBXDjMFjHnhTOFf7eB2JIjM1WkZxmlA23yBRMNtvlBArbwxYYnAyTXt > /fW1QmgLZIvuql1y01TdRot/owqJ3B2Ah896lySrltWj626R+1rhTLVDWYr6 > EKa1mf8BiRBeGpjEVhN6Vihb7T1IzHtCi1E6+mlSqhWGNf8AeZh8IKUT0tbm > C/JiUVGmG8/t7WFzCiQWd1w8UdkdCzms7k662CsSLIpbjNo4ouwEkpb5sZLP > ELgWxo8hvad7USqSXvXqJNzmoenUwQwdUvSjYbNk+4D+8eHqptlNXDmDfpiE > pN7dp8wbJ+yICxMPLuUe/Iqzp6rRnjPwam/CiDZu52N1ncH3X1X4u0cuAD0Z > dFjEfdAZJAJ+fqvts2zVvtOwq/q41eTuV3ZRSn5ubA6iAeKnxMtPoEcuozEp > Su1Iud2fYdma5w8MFStjp1BAV3osg1WgIM6KYzsSZI1BkCQAqU58ROZ0ZsMb > D05/AEK/A6fp0ROXUczhXDcXlXcGEWyJm1QEtg7cSu3C+9qu5qvQQxyrrwbZ > MK8C5lhVb44sqSVcSIZ+KCrPC+x8UKodDQZCz6O6NrJjZLn2g06583cMFWK8 > qLo+ > =qgB7 > -END PGP SIGNATURE- > > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil wrote: > On Thu, 17 Mar 2016, Robert LeBlanc wrote: > > We are trying to figure out how to use rados bench to > reproduce. Ceph > > itself doesn't seem to think there is any corruption, but when > you do a > > verify inside the RBD, there is. Can rados bench verify the > objects after > > they are written? It also seems to be primarily the filesystem > metadata > > that is corrupted. If we fsck the volume, there is missing > data (put into > > lost+found), but if it is there it is primarily OK. There only > seems to be > > a few cases where a file's contents are corrupted. I would > suspect on an > > object boundary. We would have to look at blockinfo to map > that out and see > > if that is what is happening. > > 'rados bench' doesn't do validation. ceph_test_rados does, > though--if you > can reproduce with that workload then it should be pretty easy > to track > down. > > Thanks! > sage > > > > We stopped all the IO and did put the tier in writeback mode > with recency > > 1, set the recency to 2 and started the test and there was > corruption, so > > it doesn't seem to be limited to changing the mode. I don't > know how that > > patch could cause the issue either. 
Unless there is a bug that > reads from > > the back tier, but writes to cache tier, then the object gets > promoted > > wiping that last write, but then it seems like it should not > be as much > > corruption since the metadata should be in the cache pretty > quick. We > > usually evited the cache before each try so we should not be > evicting on > > writeback. > > > > Sent from a mobile device, please excuse any typos. > > On Mar 17, 2016 6:26 AM, "Sage Weil" wrote: > > > > > On Thu, 17 Mar 2016, Nick Fisk wrote: > > > > There is got to be something else going on here. All that > PR does is to > > > > potentially delay the promotion to hit_set_period*recency > instead of > > > > just doing it on the 2nd read regardless, it's got to be > uncovering > > > > another bug. > > > > > > > > Do you see the same problem if the cache is in writeback > mode before you > > > > start the unpacking. Ie is it the switching mid operation > which causes > > > > the problem? If it only happens mid operati
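A rough sketch of the reproduction Sage is asking for, assuming a cache pool named ssd-cache layered over the base pool and ceph_test_rados run from a ceph-tests build; the op weights follow the command quoted above (minus the copy_from and snap ops Sage says can be dropped), while the object sizes, op counts and the 60-second toggle interval are only illustrative:

# terminal 1: drive a validating workload against the cache pool
./ceph_test_rados --op read 100 --op write 100 --op delete 50 \
    --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000 \
    --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600 \
    --op setattr 25 --op rmattr 25 --pool ssd-cache

# terminal 2: toggle the settings implicated in the corruption while it runs
while true; do
    ceph osd tier cache-mode ssd-cache forward
    ceph osd pool set ssd-cache min_read_recency_for_promote 1
    sleep 60
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd pool set ssd-cache min_read_recency_for_promote 2
    sleep 60
done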
[ceph-users] [cephfs] About feature 'snapshot'
Hi all, I've run into trouble with cephfs snapshots. The '.snap' folder seems to exist, but 'll -a' doesn't show it. And when I enter that folder and try to create a directory in it, I get an error about using snapshots. Please check: http://imgur.com/elZhQvD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] data corruption with hammer
Hi,All. I confirm the problem. When min_read_recency_for_promote> 1 data failure. С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2016-03-17 15:26 GMT+03:00 Sage Weil : > On Thu, 17 Mar 2016, Nick Fisk wrote: > > There is got to be something else going on here. All that PR does is to > > potentially delay the promotion to hit_set_period*recency instead of > > just doing it on the 2nd read regardless, it's got to be uncovering > > another bug. > > > > Do you see the same problem if the cache is in writeback mode before you > > start the unpacking. Ie is it the switching mid operation which causes > > the problem? If it only happens mid operation, does it still occur if > > you pause IO when you make the switch? > > > > Do you also see this if you perform on a RBD mount, to rule out any > > librbd/qemu weirdness? > > > > Do you know if it’s the actual data that is getting corrupted or if it's > > the FS metadata? I'm only wondering as unpacking should really only be > > writing to each object a couple of times, whereas FS metadata could > > potentially be being updated+read back lots of times for the same group > > of objects and ordering is very important. > > > > Thinking through it logically the only difference is that with recency=1 > > the object will be copied up to the cache tier, where recency=6 it will > > be proxy read for a long time. If I had to guess I would say the issue > > would lie somewhere in the proxy read + writeback<->forward logic. > > That seems reasonable. Was switching from writeback -> forward always > part of the sequence that resulted in corruption? Not that there is a > known ordering issue when switching to forward mode. I wouldn't really > expect it to bite real users but it's possible.. > > http://tracker.ceph.com/issues/12814 > > I've opened a ticket to track this: > > http://tracker.ceph.com/issues/15171 > > What would be *really* great is if you could reproduce this with a > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados > running, and then find the sequence of operations that are sufficient to > trigger a failure. > > sage > > > > > > > > > > > > -Original Message- > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > > > Mike Lovell > > > Sent: 16 March 2016 23:23 > > > To: ceph-users ; sw...@redhat.com > > > Cc: Robert LeBlanc ; William Perkins > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > just got done with a test against a build of 0.94.6 minus the two > commits that > > > were backported in PR 7207. everything worked as it should with the > cache- > > > mode set to writeback and the min_read_recency_for_promote set to 2. > > > assuming it works properly on master, there must be a commit that we're > > > missing on the backport to support this properly. > > > > > > sage, > > > i'm adding you to the recipients on this so hopefully you see it. the > tl;dr > > > version is that the backport of the cache recency fix to hammer > doesn't work > > > right and potentially corrupts data when > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > mike > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > wrote: > > > robert and i have done some further investigation the past couple days > on > > > this. we have a test environment with a hard drive tier and an ssd > tier as a > > > cache. several vms were created with volumes from the ceph cluster. 
i > did a > > > test in each guest where i un-tarred the linux kernel source multiple > times > > > and then did a md5sum check against all of the files in the resulting > source > > > tree. i started off with the monitors and osds running 0.94.5 and > never saw > > > any problems. > > > > > > a single node was then upgraded to 0.94.6 which has osds in both the > ssd and > > > hard drive tier. i then proceeded to run the same test and, while the > untar > > > and md5sum operations were running, i changed the ssd tier cache-mode > > > from forward to writeback. almost immediately the vms started > reporting io > > > errors and odd data corruption. the remainder of the cluster was > updated to > > > 0.94.6, including the monitors, and the same thing happened. > > > > > > things were cleaned up and reset and then a test was run > > > where min_read_recency_for_promote for the ssd cache pool was set to 1. > > > we previously had it set to 6. there was never an error with the > recency > > > setting set to 1. i then tested with it set to 2 and it immediately > caused > > > failures. we are currently thinking that it is related to the backport > of the fix > > > for the recency promotion and are in progress of making a .6 build > without > > > that backport to see if we can cause corruption. is anyone using a > version > > > from after the original recency fix (PR 6702) with a cache tier in > writeback > > > mode? anyone have a similar problem? > > > > > > mike > > > > > > On Mon, Mar
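For anyone else running 0.94.6 with a writeback cache tier, the practical takeaway from the tests above is to keep the read recency at 1 (the value that never produced corruption) until the backport is fixed; a minimal sketch, with ssd-cache standing in for the cache pool name:

# pin the cache pool back to the behaviour that showed no corruption
ceph osd pool set ssd-cache min_read_recency_for_promote 1

# confirm the setting and the current cache mode
ceph osd pool get ssd-cache min_read_recency_for_promote
ceph osd dump | grep ssd-cache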
Re: [ceph-users] data corruption with hammer
Also, is this ceph_test_rados rewriting objects quickly? I think that the issue is with rewriting objects so if we can tailor the ceph_test_rados to do that, it might be easier to reproduce. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc wrote: > I'll miss the Ceph community as well. There was a few things I really > wanted to work in with Ceph. > > I got this: > > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) > dirty exists > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) > 1040: finishing write tid 1 to nodez23350-256 > 1040: finishing write tid 2 to nodez23350-256 > 1040: finishing write tid 3 to nodez23350-256 > 1040: finishing write tid 4 to nodez23350-256 > 1040: finishing write tid 6 to nodez23350-256 > 1035: done (4 left) > 1037: done (3 left) > 1038: done (2 left) > 1043: read oid 430 snap -1 > 1043: expect (ObjNum 429 snap 0 seq_num 429) > 1040: finishing write tid 7 to nodez23350-256 > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) > dirty exists > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) > 1042: expect (ObjNum 664 snap 0 seq_num 664) > 1043: Error: oid 430 read returned error code -2 > ./test/osd/RadosModel.h: In function 'virtual void > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time > 2016-03-17 10:47:19.085414 > ./test/osd/RadosModel.h: 1109: FAILED assert(0) > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x76) [0x4db956] > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] > 3: (()+0x9791d) [0x7fa1d472191d] > 4: (()+0x72519) [0x7fa1d46fc519] > 5: (()+0x13c178) [0x7fa1d47c6178] > 6: (()+0x80a4) [0x7fa1d425a0a4] > 7: (clone()+0x6d) [0x7fa1d2bd504d] > NOTE: a copy of the executable, or `objdump -rdS ` is > needed to interpret this. > terminate called after throwing an instance of 'ceph::FailedAssertion' > Aborted > > I had to toggle writeback/forward and min_read_recency_for_promote a > few times to get it, but I don't know if it is because I only have one > job running. Even with six jobs running, it is not easy to trigger > with ceph_test_rados, but it is very instant in the RBD VMs. 
> > Here are the six run crashes (I have about the last 2000 lines of each > if needed): > > nodev: > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num > 1014) dirty exists > 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) > 1016: finishing write tid 1 to nodev21799-1016 > 1016: finishing write tid 2 to nodev21799-1016 > 1016: finishing write tid 3 to nodev21799-1016 > 1016: finishing write tid 4 to nodev21799-1016 > 1016: finishing write tid 6 to nodev21799-1016 > 1016: finishing write tid 7 to nodev21799-1016 > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num > 1015) dirty exists > 1016: left oid 1016 (ObjNum 1015 snap 0 seq_num 1015) > 1017: finishing write tid 1 to nodev21799-1017 > 1017: finishing write tid 2 to nodev21799-1017 > 1017: finishing write tid 3 to nodev21799-1017 > 1017: finishing write tid 5 to nodev21799-1017 > 1017: finishing write tid 6 to nodev21799-1017 > update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num > 1016) dirty exists > 1017: left oid 1017 (ObjNum 1016 snap 0 seq_num 1016) > 1018: finishing write tid 1 to nodev21799-1018 > 1018: finishing write tid 2 to nodev21799-1018 > 1018: finishing write tid 3 to nodev21799-1018 > 1018: finishing write tid 4 to nodev21799-1018 > 1018: finishing write tid 6 to nodev21799-1018 > 1018: finishing write tid 7 to nodev21799-1018 > update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num > 1017) dirty exists > 1018: left oid 1018 (ObjNum 1017 snap 0 seq_num 1017) > 1019: finishing write tid 1 to nodev21799-1019 > 1019: finishing write tid 2 to nodev21799-1019 > 1019: finishing write tid 3 to nodev21799-1019 > 1019: finishing write tid 5 to nodev21799-1019 > 1019: finishing write tid 6 to nodev21799-1019 > update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018) > dirty exists > 1019: left oid 1019 (ObjNum 1018 snap 0 seq_num 1018) > 1021: finishing write tid 1 to nodev21799-1021 > 1020: finishing write tid 1 to nodev21799-1020 > 1020: finishing write tid 2 to nodev21799-1020 > 1020: finishing write tid 3 to nodev21799-1020 > 1020: finishing write tid 5 to nodev21799-1020 > 1020: finishing write tid 6 to nodev21799-1020 > update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num > 1019) dirty exists > 1020: left oid 1020 (ObjNum 1019 snap 0 seq_num 1019) > 1021: finishing write tid 2 to nodev21799-1021 > 1021: finishing write tid 3 to nodev21799-1021 > 1021: finishing write tid 5 to nodev21799-1021 > 1021: finishing write tid 6 to nodev21799-1021 > update_object_version oid 1021 v 1077 (ObjNum 1020 snap 0 seq_num > 1020) dirty exists >
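On the "rewriting objects quickly" question: keeping --objects small relative to --max-ops forces ceph_test_rados to keep rewriting the same objects, and weighting writes over reads makes that happen sooner. A sketch against the cache pool (the weights and counts are arbitrary):

./ceph_test_rados --op write 100 --op read 10 --op delete 10 \
    --objects 128 --max-in-flight 64 --max-ops 400000 \
    --max-seconds 600 --pool ssd-cache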
Re: [ceph-users] ZFS or BTRFS for performance?
Neither of these file systems is recommended for production use underlying an OSD. The general direction for ceph is to move away from having a file system at all. That effort is called "bluestore" and is supposed to show up in the jewel release. -H > On Mar 18, 2016, at 11:15, Schlacta, Christ wrote: > > Insofar as I've been able to tell, both BTRFS and ZFS provide similar > capabilities back to CEPH, and both are sufficiently stable for the > basic CEPH use case (Single disk -> single mount point), so the > question becomes this: Which actually provides better performance? > Which is the more highly optimized single write path for ceph? Does > anybody have a handful of side-by-side benchmarks? I'm more > interested in higher IOPS, since you can always scale-out throughput, > but throughput is also important. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
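On the "side-by-side benchmarks" question: one crude way to compare the two filesystems on identical hardware is a small-block random-write fio run against a file on each mount. This ignores the OSD's journal and sync behaviour entirely, so treat it only as a rough first cut:

# run once on a BTRFS mount and once on a ZFS mount on the same disk model
fio --name=osd-sim --directory=/mnt/testfs --size=4G \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=300 --time_based --direct=1 --group_reporting
# note: ZFS on Linux does not support O_DIRECT, so drop --direct=1 there
# and add something like --fsync=32 to keep the comparison honest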
Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?
The rule of thumb is to match the journal throughput to the OSD throughout. I'm seeing ~180MB/s sequential write on my OSDs and I'm using one of the P3700 400GB units per six OSDs. The 400GB P3700 yields around 1200MB/s* and has around 1/10th the latency of any SATA SSD I've tested. I put a pair of them in a 12-drive chassis and get excellent performance. One could probably do the same in an 18-drive chassis without any issues. Failure domain for a journal starts to get pretty large at they point. I have dozens of the "Fultondale" SSDs deployed and have had zero failures. Endurance is excellent, etc. *the larger units yield much better write throughout but don't make sense financially for journals. -H On Mar 16, 2016, at 09:37, Nick Fisk wrote: >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Stephen Harker >> Sent: 16 March 2016 16:22 >> To: ceph-users@lists.ceph.com >> Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, > which is >> better? >> >>> On 2016-02-17 11:07, Christian Balzer wrote: >>> >>> On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote: >>> >> Let's consider both cases: >> Journals on SSDs - for writes, the write operation returns right >> after data lands on the Journal's SSDs, but before it's written >> to the backing HDD. So, for writes, SSD journal approach should >> be comparable to having a SSD cache tier. > Not quite, see below. Could you elaborate a bit more? Are you saying that with a Journal on a SSD writes from clients, before they can return from the operation to the client, must end up on both the SSD (Journal) *and* HDD (actual data store behind that journal)? >>> >>> No, your initial statement is correct. >>> >>> However that burst of speed doesn't last indefinitely. >>> >>> Aside from the size of the journal (which is incidentally NOT the most >>> limiting factor) there are various "filestore" parameters in Ceph, in >>> particular the sync interval ones. >>> There was a more in-depth explanation by a developer about this in >>> this ML, try your google-foo. >>> >>> For short bursts of activity, the journal helps a LOT. >>> If you send a huge number of for example 4KB writes to your cluster, >>> the speed will eventually (after a few seconds) go down to what your >>> backing storage (HDDs) are capable of sustaining. >>> > (Which SSDs do you plan to use anyway?) Intel DC S3700 >>> Good choice, with the 200GB model prefer the 3700 over the 3710 >>> (higher sequential write speed). >> >> Hi All, >> >> I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, > each >> of which has 6 4TB SATA drives within. I had my eye on these: >> >> 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0 >> >> but reading through this thread, it might be better to go with the P3700 > given >> the improved iops. So a couple of questions. >> >> * Are the PCI-E versions of these drives different in any other way than > the >> interface? > > Yes and no. Internally they are probably not much difference, but the > NVME/PCIE interface is a lot faster than SATA/SAS, both in terms of minimum > latency and bandwidth. > >> >> * Would one of these as a journal for 6 4TB OSDs be overkill (connectivity > is >> 10GE, or will be shortly anyway), would the SATA S3700 be sufficient? > > Again depends on your use case. The S3700 may suffer if you are doing large > sequential writes, it might not have a high enough sequential write speed > and will become the bottleneck. 
6 Disks could potentially take around > 500-700MB/s of writes. A P3700 will have enough and will give slightly lower > write latency as well if this is important. You may even be able to run more > than 6 disk OSD's on it if needed. > >> >> Given they're not hot-swappable, it'd be good if they didn't wear out in >> 6 months too. > > Probably won't unless you are doing some really extreme write workloads and > even then I would imagine they would last 1-2 years. > >> >> I realise I've not given you much to go on and I'm Googling around as > well, I'm >> really just asking in case someone has tried this already and has some >> feedback or advice.. > > That's ok, I'm currently running S3700 100GB's on current cluster and new > cluster that's in planning stages will be using the 400Gb P3700's. > >> >> Thanks! :) >> >> Stephen >> >> -- >> Stephen Harker >> Chief Technology Officer >> The Positive Internet Company. >> >> -- >> All postal correspondence to: >> The Positive Internet Company, 24 Ganton Street, London. W1F 7QY >> >> *Follow us on Twitter* @posipeople >> >> The Positive Internet Company Limited is registered in England and Wales. >> Registered company number: 3673639. VAT no: 726 7072 28. >> Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 > 9EE. >> ___ >> ceph-users mai
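Putting rough numbers on the advice above, using the figures quoted in this thread (~180MB/s per spinning OSD, ~1200MB/s sequential write for the 400GB P3700) and the journal sizing rule of thumb from the ceph docs; the 5-second sync interval is just an example value:

# six OSDs' worth of sequential writes the journal device must absorb
echo $((6 * 180))      # => 1080 MB/s: fine for a P3700, too much for a SATA S3700

# osd journal size >= 2 * (expected throughput * filestore max sync interval)
echo $((2 * 180 * 5))  # => 1800 MB of journal per OSD at a 5 second sync interval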
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Hi, Is there a tracker for this? We just hit the same problem on 10.0.5. Cheers, Dan # rpm -q ceph ceph-10.0.5-0.el7.x86_64 # cat /etc/redhat-release CentOS Linux release 7.2.1511 (Core) # ceph-disk -v prepare /dev/sdc DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_type DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid INFO:ceph-disk:Will colocate journal with data on /dev/sdc DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid DEBUG:ceph-disk:get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid DEBUG:ceph-disk:Creating journal partition num 2 size 20480 on /dev/sdc INFO:ceph-disk:Running command: /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:ceph journal --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc The operation has completed successfully. DEBUG:ceph-disk:Calling partprobe on prepared device /dev/sdc INFO:ceph-disk:Running command: /usr/bin/udevadm settle INFO:ceph-disk:Running command: /usr/sbin/partprobe /dev/sdc Error: Error informing the kernel about modifications to partition /dev/sdc2 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdc2 until you reboot -- so you shouldn't mount it or use it in any way before rebooting. Error: Failed to add partition 2 (Device or resource busy) Traceback (most recent call last): File "/usr/sbin/ceph-disk", line 3528, in main(sys.argv[1:]) File "/usr/sbin/ceph-disk", line 3482, in main args.func(args) File "/usr/sbin/ceph-disk", line 1817, in main_prepare luks=luks File "/usr/sbin/ceph-disk", line 1447, in prepare_journal return prepare_journal_dev(data, journal, journal_size, journal_uuid, journal_dm_keypath, cryptsetup_parameters, luks) File "/usr/sbin/ceph-disk", line 1401, in prepare_journal_dev raise Error(e) __main__.Error: Error: Command '['/usr/sbin/partprobe', '/dev/sdc']' returned non-zero exit status 1 On Tue, Mar 15, 2016 at 8:38 PM, Vasu Kulkarni wrote: > Thanks for the steps that should be enough to test it out, I hope you got > the latest ceph-deploy either from pip or throught github. > > On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord > wrote: >> >> I would have to nuke my cluster right now, and I do not have a spare one.. 
>> >> The procedure though is literally this, given a 3 node redhat 7.2 cluster, >> ceph00, ceph01 and ceph02 >> >> ceph-deploy install --testing ceph00 ceph01 ceph02 >> ceph-deploy new ceph00 ceph01 ceph02 >> >> ceph-deploy mon create ceph00 ceph01 ceph02 >> ceph-deploy gatherkeys ceph00 >> >> ceph-deploy osd create ceph00:sdb:/dev/sdi >> ceph-deploy osd create ceph00:sdc:/dev/sdi >> >> All devices have their partition tables wiped before this. They are all >> just SATA devices, no special devices in the way. >> >> sdi is an ssd and it is being carved up for journals. The first osd create >> works, the second one gets stuck in a loop in the update_partition call in >> ceph_disk for the 5 iterations before it gives up. When I look in >> /sys/block/sdi the partition for the first osd is visible, the one for the >> second is not. However looking at /proc/partitions it sees the correct >> thing. So something about partprobe is not kicking udev into doing the right >> thing when the second partition is added I suspect. >> >> If I do not use the separate journal device then it usually works, but >> occasionally I see a single retry in that same loop. >> >> There is code in ceph_deploy which uses partprobe or partx depending on >> which distro it detects, that is how I worked out what to change here. >> >> If I have to tear things down again I will reproduce and post here. >> >> Steve >> >> > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: >> > >> > Do you mind giving the full failed logs somewhere
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Basically, the lookup process is: try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/DIR_7...doesn't exist try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/DIR_9/...doesn't exist try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/...doesn't exist try DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/...does exist, object must be here If DIR_E did not exist, then it would check DIR_9/DIR_5/DIR_4/DIR_D and so on. The hash is always 32 bit (8 hex digits) -- baked into the rados object distribution algorithms. When DIR_E hits the threshhold (320 iirc), the objects (files) in that directory will be moved one more directory deeper. An object with hash 79CED459 would then be in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C/. Basically, the depth of the tree is dynamic. The file will be in the deepest existing path that matches the hash (might even be different between replicas, the tree structure is purely internal to the filestore). -Sam On Wed, Mar 16, 2016 at 10:46 AM, Jeffrey McDonald wrote: > OK, I think I have it now. I do have one more question, in this case, the > hash indicates the directory structure but how do I know from the hash how > many levels I should go down.If the hash is a 32-bit hex integer, *how > do I know how many should be included as part of the hash for the directory > structure*? > > e.g. our example: the hash is 79CED459 and the directory is then the last > five taken in reverse order, what happens if there are only 4 levels of > hierarchy?I only have this one example so far.is the 79C of the hash > constant? Would the hash pick up another hex character if the pg splits > again? > > Thanks, > Jeff > > On Wed, Mar 16, 2016 at 10:24 AM, Samuel Just wrote: >> >> There is a directory structure hash, it's just that it's at the end of >> the name and you'll have to check the xattr I mentioned to find it. >> >> I think that file is actually the one we are talking about removing. >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0 >> >> Notice that the user.cephosd.lfn3 attr has the full name, and it >> *does* have a hash 79CED459 (you referred to it as a directory hash I >> think, but it's actually the hash we used to place it on this osd to >> begin with). >> >> In specifically this case, you shouldn't find any files in the >> DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories >> (so all hash values should hash to one of those). >> >> The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's >> the actual object file, don't remove that. 
If you look at the attr: >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 >> >> The hash is 79CED459, which means that (assuming >> DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the >> right place. >> >> The ENOENT return >> >> 2016-03-07 16:11:41.828332 7ff30cdad700 10 >> filestore(/var/lib/ceph/osd/ceph-307) remove >> >> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 >> = -2 >> 2016-03-07 21:44:02.197676 7fe96b56f700 10 >> filestore(/var/lib/ceph/osd/ceph-307) remove >> >> 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 >> = -2 >> >> actually was a symptom in this case, but, in general, it's not >> indicative of anything -- the filestore can get ENOENT return values >> for legitimate reasons. >> >> To reiterate: files that end in something like >> fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need >> to check the user.cephos.lfn3 attr (as you did before) for the full >> length file name and determine whether the file is in the right place. >
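A small sketch that follows Sam's lookup rules when auditing these *_long files: read the full logical name from the user.cephos.lfn3 xattr, pull the 8-hex-digit hash out of it, then walk the hash digits in reverse order, descending only while the next DIR_* component exists. The PG head directory and file arguments are placeholders.

#!/bin/bash
# usage: ./lfn_expected_dir.sh <pg head directory> <*_long file>
pgdir="$1"; f="$2"

# the full logical object name lives in this xattr for long-filename objects
name=$(getfattr --only-values -n user.cephos.lfn3 "$f")

# the placement hash is the 8 hex digits following "_head_" in that name
hash=$(echo "$name" | sed -n 's/.*_head_\([0-9A-F]\{8\}\)_.*/\1/p')

# expected location: hash digits reversed, stopping at the deepest existing DIR_*
path="$pgdir"
for (( i=${#hash}-1; i>=0; i-- )); do
    d="$path/DIR_${hash:$i:1}"
    [ -d "$d" ] || break
    path="$d"
done
echo "hash $hash -> expected directory: $path"

If the file's actual directory differs from the printed one, it is a candidate orphan; if they match (as in the DIR_E example above), leave the file alone.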
[ceph-users] Single key delete performance against increasing bucket size
On Wed, Mar 16, 2016 at 06:36:33AM +0000, Pavan Rallabhandi wrote:
> I find this to be discussed here before, but couldn't find any solution
> hence the mail. In RGW, for a bucket holding objects in the range of ~
> millions, one can find it to take for ever to delete the bucket (via
> radosgw-admin). I understand the gc (and its parameters) that would reclaim
> the space eventually, but am looking more at the bucket deletion options
> that can possibly speed up the operation.

This ties well into a mail I had sitting in my drafts, but never got around to sending.

Whilst doing some rough benchmarking on bucket index sharding, I ran into some terrible performance for key deletion on non-existent keys. Shards did NOT alleviate this performance issue, but did help elsewhere. Numbers given below are for unsharded buckets; relatively empty buckets perform worse when sharded, before performance picks up again.

Test methodology:
- Fire single DELETE key ops to the RGW; not using multi-object delete.
- I measured the time taken for each delete, and report it here for the 99th percentile (1% of operations took longer than this).
- I took at least 1K samples for #keys up to and including 10k keys per bucket. For 50k keys/bucket I capped it to the first 100 samples instead of waiting 10 hours for the run to complete.
- The DELETE operations were run single-threaded, with no concurrency.

Test environments: Clusters were both running Hammer 0.94.5 on Ubuntu precise; the hardware is a long way from being new; there are no SSDs, the journal is the first partition on each OSD's disk. The test source host was unloaded, and approx 1ms of latency away from the RGWs.

Cluster 1 (Congress, ~1350 OSDs; production cluster; haproxy of 10 RGWs)
#keys-in-bucket    time per single key delete
0                  6.899ms
10                 7.507ms
100                13.573ms
1000               327.936ms
10000              4825.597ms
50000              33802.497ms
100000             did-not-finish

Cluster 2 (Benjamin, ~50 OSDs; test cluster, practically idle; haproxy of 2 RGWs)
#keys-in-bucket    time per single key delete
0                  4.825ms
10                 6.749ms
100                6.146ms
1000               6.816ms
10000              1233.727ms
50000              64262.764ms
100000             did-not-finish

The cases marked with did-not-finish are where the RGW seems to time out the operation even with the client having an unlimited timeout. It also occurred when connecting directly to CivetWeb rather than through HAProxy. I'm not sure why the 100-keys case on the second cluster seems to have been faster than the 10-key case, but I'm willing to put it down to statistical noise. The huge increase at the end, and the operation not returning over 100k items, is concerning.

--
Robin Hugh Johnson
Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
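A sketch of the measurement loop described above, assuming s3cmd is already configured against the RGW endpoint and the bucket has been pre-populated with keys named key-<n>; the timing includes s3cmd process startup, so the absolute numbers only make sense for comparing runs against each other:

#!/bin/bash
# delete keys one at a time and report the 99th percentile latency
bucket="s3://delete-bench"; n=1000; out=times.txt; : > "$out"

for i in $(seq 1 "$n"); do
    start=$(date +%s.%N)
    s3cmd del "$bucket/key-$i" >/dev/null
    end=$(date +%s.%N)
    echo "$end - $start" | bc >> "$out"
done

# 99th percentile of the collected per-delete times
sort -n "$out" | awk '{a[NR]=$1} END {print a[int(NR*0.99)] " seconds (p99)"}'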
Re: [ceph-users] v10.0.4 released
Hi, Because of a tiny mistake preventing deb packages to be built, v10.0.5 was released shortly after v10.0.4 and is now the current development release. The Stable release team[0] collectively decided to help by publishing development packages[1], starting with v10.0.5. The packages for v10.0.5 are available at http://ceph-releases.dachary.org/ which can be used as a replacement for http://download.ceph.com/ for both http://download.ceph.com/rpm-testing and http://download.ceph.com/debian-testing . The only difference is the key used to sign the releases which can be imported with wget -q -O- 'http://ceph-releases.dachary.org/release-key.asc' | sudo apt-key add - or rpm --import http://ceph-releases.dachary.org/release-key.asc The instructions to install development packages found at http://docs.ceph.com/docs/master/install/get-packages/ can otherwise be applied with no change. Cheers [0] Stable release team http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who [1] Publishing development releases http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126 On 08/03/2016 22:35, Sage Weil wrote: > This is the fourth and last development release before Jewel. The next > release will be a release candidate with the final set of features. Big > items include RGW static website support, librbd journal framework, fixed > mon sync of config-key data, C++11 updates, and bluestore/kstore. > > Note that, due to general developer busyness, we aren’t building official > release packages for this dev release. You can fetch autobuilt gitbuilder > packages from the usual location (http://gitbuilder.ceph.com). > > Notable Changes > --- > > http://ceph.com/releases/v10-0-4-released/ > > Getting Ceph > > > * Git at git://github.com/ceph/ceph.git > * For packages, see > http://ceph.com/docs/master/install/get-packages#add-ceph-development > * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy > -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RBD hanging on some volumes of a pool
Hi, I am facing issues with some of my rbd volumes since yesterday. Some of them completely hang at some point before eventually resuming IO, may it be a few minutes or several hours later. First and foremost, my setup : I already detailed it on the mailing list [0][1]. Some changes have been made : the 3 monitors are now VM and we are trying kernel 4.4.5 on the clients (cluster is still 3.10 centos7). Using EC pools, I already had some trouble with RBD features not supported by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the hassle. Everything has been working pretty smoothly since. All my volumes (currently 5) are on an EC pool with writeback cache. Two of them are perfectly fine. On the other 3, different story : doing IO is impossible, if I start a simple copy I get a new file of a few dozen MB (or sometimes 0) then it hangs. Doing dd with direct and sync flags has the same behaviour. I tried witching back to 3.10, no changes, on the client I rebooted I currently cannot mount the filesystem, mount hangs (the volume seems correctly mapped however). strace on the cp command freezes in the middle of a read : 11:17:56 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536 11:17:56 read(3, I tried to bump up the logging but I don't really know what to look for exactly and didn't see anything obvious. Any input or lead on how to debug this would be highly appreciated :) Adrien [0] http://www.spinics.net/lists/ceph-users/msg23990.html [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-January/007004.html [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007746.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
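Two places that usually show where a stuck krbd I/O is sitting are the kernel client's osdc list on the client (requires debugfs mounted at /sys/kernel/debug) and the in-flight op dump on the OSD it names; a sketch, with osd.12 as a placeholder:

# on the client: requests the kernel ceph/rbd client is still waiting on
cat /sys/kernel/debug/ceph/*/osdc

# each line shows a tid, the target OSD and the object; then on that OSD's host:
ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops

dump_ops_in_flight includes an age and a description of what each op is currently waiting on, which narrows down whether the stall is in the cache tier, the EC base pool, or the client.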
Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Stephen Harker > Sent: 16 March 2016 16:22 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is > better? > > On 2016-02-17 11:07, Christian Balzer wrote: > > > > On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote: > > > >> > > Let's consider both cases: > >> > > Journals on SSDs - for writes, the write operation returns right > >> > > after data lands on the Journal's SSDs, but before it's written > >> > > to the backing HDD. So, for writes, SSD journal approach should > >> > > be comparable to having a SSD cache tier. > >> > Not quite, see below. > >> > > >> > > >> Could you elaborate a bit more? > >> > >> Are you saying that with a Journal on a SSD writes from clients, > >> before they can return from the operation to the client, must end up > >> on both the SSD (Journal) *and* HDD (actual data store behind that > >> journal)? > > > > No, your initial statement is correct. > > > > However that burst of speed doesn't last indefinitely. > > > > Aside from the size of the journal (which is incidentally NOT the most > > limiting factor) there are various "filestore" parameters in Ceph, in > > particular the sync interval ones. > > There was a more in-depth explanation by a developer about this in > > this ML, try your google-foo. > > > > For short bursts of activity, the journal helps a LOT. > > If you send a huge number of for example 4KB writes to your cluster, > > the speed will eventually (after a few seconds) go down to what your > > backing storage (HDDs) are capable of sustaining. > > > >> > (Which SSDs do you plan to use anyway?) > >> > > >> > >> Intel DC S3700 > >> > > Good choice, with the 200GB model prefer the 3700 over the 3710 > > (higher sequential write speed). > > Hi All, > > I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, each > of which has 6 4TB SATA drives within. I had my eye on these: > > 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0 > > but reading through this thread, it might be better to go with the P3700 given > the improved iops. So a couple of questions. > > * Are the PCI-E versions of these drives different in any other way than the > interface? Yes and no. Internally they are probably not much difference, but the NVME/PCIE interface is a lot faster than SATA/SAS, both in terms of minimum latency and bandwidth. > > * Would one of these as a journal for 6 4TB OSDs be overkill (connectivity is > 10GE, or will be shortly anyway), would the SATA S3700 be sufficient? Again depends on your use case. The S3700 may suffer if you are doing large sequential writes, it might not have a high enough sequential write speed and will become the bottleneck. 6 Disks could potentially take around 500-700MB/s of writes. A P3700 will have enough and will give slightly lower write latency as well if this is important. You may even be able to run more than 6 disk OSD's on it if needed. > > Given they're not hot-swappable, it'd be good if they didn't wear out in > 6 months too. Probably won't unless you are doing some really extreme write workloads and even then I would imagine they would last 1-2 years. > > I realise I've not given you much to go on and I'm Googling around as well, I'm > really just asking in case someone has tried this already and has some > feedback or advice.. 
That's ok, I'm currently running S3700 100GB's on current cluster and new cluster that's in planning stages will be using the 400Gb P3700's. > > Thanks! :) > > Stephen > > -- > Stephen Harker > Chief Technology Officer > The Positive Internet Company. > > -- > All postal correspondence to: > The Positive Internet Company, 24 Ganton Street, London. W1F 7QY > > *Follow us on Twitter* @posipeople > > The Positive Internet Company Limited is registered in England and Wales. > Registered company number: 3673639. VAT no: 726 7072 28. > Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ssd only storage and ceph
> On 17 Mar 2016, at 17:28, Erik Schwalbe wrote: > > Hi, > > at the moment I do some tests with SSD's and ceph. > My Question is, how to mount an SSD OSD? With or without discard option? I recommend running without discard but running "fstrim" command every now and then (depends on how fast your SSD is - some SSDs hang for quite a while when fstrim is run on them, test it) > Where should I do the fstrim, when I mount the OSD without discard? On the > ceph storage node? Inside the vm, running on rbd? > discard on the SSD itself makes garbage collection easier - that might make the SSD faster and it can last longer (how faster and how longer depends on the SSD, generally if you use DC-class SSDs you won't notice anything) discard in the VM (assuming everything supports it) makes thin-provisioning more effective, but you (IMO) need virtio-scsi for that. I have no real-life experience whether Ceph actually frees the unneeded space even if you make it work... > What is the best practice there. > > Thanks for your answers. > > Regards, > Erik > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
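A sketch of the "mount without discard, trim periodically" approach described above, on the OSD host and inside a guest whose virtual disk is attached with discard support (e.g. virtio-scsi with discard=unmap); the mount points and schedule are only examples:

# OSD host: one-off trim of an OSD filesystem, run when the cluster is quiet
fstrim -v /var/lib/ceph/osd/ceph-0

# e.g. weekly via cron, staggered so all OSDs on a host don't trim at once
# /etc/cron.d/fstrim-osds:
# 30 3 * * 0  root  for d in /var/lib/ceph/osd/ceph-*; do fstrim "$d"; sleep 60; done

# inside the VM, same idea if you want thin-provisioning to reclaim space
fstrim -v /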
Re: [ceph-users] v0.94.6 Hammer released
Hi Chen, On Thu, Mar 17, 2016 at 12:40:28AM +, Chen, Xiaoxi wrote: > It’s already there, in > http://download.ceph.com/debian-hammer/pool/main/c/ceph/. I can only see ceph*_0.94.6-1~bpo80+1_amd64.deb there. Debian wheezy would be bpo70. Cheers, Chris > On 3/17/16, 7:20 AM, "Chris Dunlop" wrote: > >> Hi Stable Release Team for v0.94, >> >> On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote: >>> On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: I think you misread what Sage wrote : "The intention was to continue building stable releases (0.94.x) on the old list of supported platforms (which inclues 12.04 and el6)". In other words, the old OS'es are still supported. Their absence is a glitch in the release process that will be fixed. >>> >>> Any news on a release of v0.94.6 for debian wheezy? >> >> Any news on a release of v0.94.6 for debian wheezy? >> >> Cheers, >> >> Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Hi, It's true, partprobe works intermittently. I extracted the key commands to show the problem: [18:44]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc The operation has completed successfully. [18:44]# partprobe /dev/sdc Error: Error informing the kernel about modifications to partition /dev/sdc2 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdc2 until you reboot -- so you shouldn't mount it or use it in any way before rebooting. Error: Failed to add partition 2 (Device or resource busy) [18:44]# partprobe /dev/sdc [18:44]# partprobe /dev/sdc Error: Error informing the kernel about modifications to partition /dev/sdc2 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdc2 until you reboot -- so you shouldn't mount it or use it in any way before rebooting. Error: Failed to add partition 2 (Device or resource busy) [18:44]# partprobe /dev/sdc Error: Error informing the kernel about modifications to partition /dev/sdc2 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/sdc2 until you reboot -- so you shouldn't mount it or use it in any way before rebooting. Error: Failed to add partition 2 (Device or resource busy) But partx works every time: [18:46]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdd The operation has completed successfully. [18:46]# partx -u /dev/sdd [18:46]# partx -u /dev/sdd [18:46]# partx -u /dev/sdd [18:46]# -- Dan On Thu, Mar 17, 2016 at 6:31 PM, Vasu Kulkarni wrote: > I can raise a tracker for this issue since it looks like an intermittent > issue and mostly dependent on specific hardware or it would be better if you > add all the hardware/os details in tracker.ceph.com, also from your logs it > looks like you have > Resource busy issue: Error: Failed to add partition 2 (Device or resource > busy) > > From my test run logs on centos 7.2 , 10.0.5 ( > http://qa-proxy.ceph.com/teuthology/vasu-2016-03-15_15:34:41-selinux-master---basic-mira/62626/teuthology.log > ) > > 2016-03-15T18:49:56.305 > INFO:teuthology.orchestra.run.mira041.stderr:[ceph_deploy.osd][DEBUG ] > Preparing host mira041 disk /dev/sdb journal None activate True > 2016-03-15T18:49:56.305 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][DEBUG ] find the > location of an executable > 2016-03-15T18:49:56.309 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][INFO ] Running > command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- > /dev/sdb > 2016-03-15T18:49:56.546 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid > 2016-03-15T18:49:56.611 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster > ceph > 2016-03-15T18:49:56.643 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster ceph > 2016-03-15T18:49:56.708 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster ceph > 
2016-03-15T18:49:56.708 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid: > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > 2016-03-15T18:49:56.709 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] set_type: > Will colocate journal with data on /dev/sdb > 2016-03-15T18:49:56.709 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-osd --cluster=ceph > --show-config-value=osd_journal_size > 2016-03-15T18:49:56.774 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid: > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > 2016-03-15T18:49:56.774 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid: > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > 2016-03-15T18:49:56.775 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] get_dm_uuid: > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > 2016-03-15T18:49:56.775 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup > osd_mkfs_options_xfs > 2016-03-15T18:49:56.777 > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup > osd_fs_mkfs_options_xfs > 2016-03-15T18:49:56.809 > INFO
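Until ceph-disk copes with this on its own, a workaround in line with what the session above shows is to settle udev, retry partprobe a few times, and fall back to partx -u if it keeps returning "Device or resource busy"; /dev/sdc is just the example device:

#!/bin/bash
dev=/dev/sdc

udevadm settle --timeout=30
ok=0
for i in 1 2 3 4 5; do
    if partprobe "$dev"; then ok=1; break; fi
    sleep 2
    udevadm settle --timeout=30
done

# partx updates the kernel's partition view reliably where partprobe is flaky here
[ "$ok" -eq 1 ] || partx -u "$dev"
udevadm settle --timeout=30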
Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?
Thanks all for your suggestions and advice. I'll let you know how it goes :) Stephen On 2016-03-16 16:58, Heath Albritton wrote: The rule of thumb is to match the journal throughput to the OSD throughout. I'm seeing ~180MB/s sequential write on my OSDs and I'm using one of the P3700 400GB units per six OSDs. The 400GB P3700 yields around 1200MB/s* and has around 1/10th the latency of any SATA SSD I've tested. I put a pair of them in a 12-drive chassis and get excellent performance. One could probably do the same in an 18-drive chassis without any issues. Failure domain for a journal starts to get pretty large at they point. I have dozens of the "Fultondale" SSDs deployed and have had zero failures. Endurance is excellent, etc. *the larger units yield much better write throughout but don't make sense financially for journals. -H On Mar 16, 2016, at 09:37, Nick Fisk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stephen Harker Sent: 16 March 2016 16:22 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better? On 2016-02-17 11:07, Christian Balzer wrote: On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote: Let's consider both cases: Journals on SSDs - for writes, the write operation returns right after data lands on the Journal's SSDs, but before it's written to the backing HDD. So, for writes, SSD journal approach should be comparable to having a SSD cache tier. Not quite, see below. Could you elaborate a bit more? Are you saying that with a Journal on a SSD writes from clients, before they can return from the operation to the client, must end up on both the SSD (Journal) *and* HDD (actual data store behind that journal)? No, your initial statement is correct. However that burst of speed doesn't last indefinitely. Aside from the size of the journal (which is incidentally NOT the most limiting factor) there are various "filestore" parameters in Ceph, in particular the sync interval ones. There was a more in-depth explanation by a developer about this in this ML, try your google-foo. For short bursts of activity, the journal helps a LOT. If you send a huge number of for example 4KB writes to your cluster, the speed will eventually (after a few seconds) go down to what your backing storage (HDDs) are capable of sustaining. (Which SSDs do you plan to use anyway?) Intel DC S3700 Good choice, with the 200GB model prefer the 3700 over the 3710 (higher sequential write speed). Hi All, I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, each of which has 6 4TB SATA drives within. I had my eye on these: 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0 but reading through this thread, it might be better to go with the P3700 given the improved iops. So a couple of questions. * Are the PCI-E versions of these drives different in any other way than the interface? Yes and no. Internally they are probably not much difference, but the NVME/PCIE interface is a lot faster than SATA/SAS, both in terms of minimum latency and bandwidth. * Would one of these as a journal for 6 4TB OSDs be overkill (connectivity is 10GE, or will be shortly anyway), would the SATA S3700 be sufficient? Again depends on your use case. The S3700 may suffer if you are doing large sequential writes, it might not have a high enough sequential write speed and will become the bottleneck. 6 Disks could potentially take around 500-700MB/s of writes. 
A P3700 will have enough and will give slightly lower write latency as well if this is important. You may even be able to run more than 6 disk OSD's on it if needed. Given they're not hot-swappable, it'd be good if they didn't wear out in 6 months too. Probably won't unless you are doing some really extreme write workloads and even then I would imagine they would last 1-2 years. I realise I've not given you much to go on and I'm Googling around as well, I'm really just asking in case someone has tried this already and has some feedback or advice.. That's ok, I'm currently running S3700 100GB's on current cluster and new cluster that's in planning stages will be using the 400Gb P3700's. Thanks! :) Stephen -- Stephen Harker Chief Technology Officer The Positive Internet Company. -- All postal correspondence to: The Positive Internet Company, 24 Ganton Street, London. W1F 7QY *Follow us on Twitter* @posipeople The Positive Internet Company Limited is registered in England and Wales. Registered company number: 3673639. VAT no: 726 7072 28. Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://list
[ceph-users] radosgw_agent sync issues
HI i setup 2 clusters and in using radosgw_agent to sync them last week the sync stop working if on runinig the agent from command line i see its stuck on 2 files in the console im geting : 2016-03-17 21:11:57,391 14323 [radosgw_agent.worker][DEBUG ] op state is [] 2016-03-17 21:11:57,391 14323 [radosgw_agent.worker][DEBUG ] error geting op state: list index out of range Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/radosgw_agent/worker.py", line 275, in wait_for_object state = state[0]['state'] and in the log i see : 2016-03-17 21:38:53,221 30848 [boto][DEBUG ] Signature: AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk= 2016-03-17 21:38:53,221 30848 [boto][DEBUG ] url = ' http://s3-us-west.test.com/admin/opstate' params={'client-id': 'radosgw-agent', 'object': u'test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg', 'op-id': 'nyprceph1.ops.test.com:30568:135'} headers={'Date': 'Thu, 17 Mar 2016 21:38:53 GMT', 'Content-Length': '0', 'Authorization': u'AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=', 'User-Agent': 'Boto/2.38.0 Python/2.6.6 Linux/2.6.32-504.8.1.el6.x86_64'} data=None 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Method: GET 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Path: /admin/opstate?client-id=radosgw-agent&object=test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg&op-id= nyprceph1.ops.test.com%3A30568%3A135 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Data: 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Headers: {} 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Host: s3-us-west.test.com 2016-03-17 21:38:53,222 30848 [boto][DEBUG ] Port: 80 2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Params: {'client-id': 'radosgw-agent', 'object': 'test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg', 'op-id': 'nyprceph1.ops.test.com%3A30568%3A135'} 2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Token: None 2016-03-17 21:38:53,223 30848 [boto][DEBUG ] StringToSign: GET Thu, 17 Mar 2016 21:38:53 GMT /admin/opstate 2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Signature: AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk= 2016-03-17 21:38:53,223 30848 [boto][DEBUG ] Final headers: {'Date': 'Thu, 17 Mar 2016 21:38:53 GMT', 'Content-Length': '0', 'Authorization': u'AWS WOCV3FJ0KFG4E5CVHF46:b+kB03QMTXlVIAhSfkM2aW4sSmk=', 'User-Agent': 'Boto/2.38.0 Python/2.6.6 Linux/2.6.32-504.8.1.el6.x86_64'} 2016-03-17 21:38:53,298 30848 [boto][DEBUG ] Response headers: [('date', 'Thu, 17 Mar 2016 21:38:53 GMT'), ('content-length', '2'), ('x-amz-request-id', 'tx00019c09c-0056eb23ed-f149c-us-west')] 2016-03-17 21:38:53,369 30848 [radosgw_agent.worker][DEBUG ] op state is [] 2016-03-17 21:38:53,369 30848 [radosgw_agent.worker][DEBUG ] error geting op state: list index out of range Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/radosgw_agent/worker.py", line 275, in wait_for_object state = state[0]['state'] IndexError: list index out of range i can download the file from from the master i upload it to the slave and rerun the sync but still didnt work any way to skip the file and get the sync done ? (IE just remove it and re-upload it under new name ? ) is it need to be fix from the master side or slave ? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
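The agent is polling the master's opstate entries for that object and getting an empty list back, which its wait loop doesn't handle. If the radosgw-admin build in use has the opstate subcommands the sync agent relies on (check radosgw-admin --help first; the exact subcommand and flag names here are an assumption), the stuck entry can be inspected and cleared so the agent retries the object. The client-id, op-id and object below are taken straight from the log excerpt:

# list the sync-agent op state for the stuck object (run against the master zone)
radosgw-admin opstate list --client-id=radosgw-agent \
    --object='test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg'

# remove a stale entry so the agent can retry (or you can re-upload the object) 
radosgw-admin opstate rm --client-id=radosgw-agent \
    --op-id='nyprceph1.ops.test.com:30568:135' \
    --object='test/Kenny-Wormald-photo-premiere2-56b2b75d5f9b58def9c8ed52.jpg'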
Re: [ceph-users] [cephfs] About feature 'snapshot'
Hi John, How to set this feature on? Thank you 2016-03-17 21:41 GMT+08:00 Gregory Farnum : > On Thu, Mar 17, 2016 at 3:49 AM, John Spray wrote: > > Snapshots are disabled by default: > > > http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration > > Which makes me wonder if we ought to be hiding the .snaps directory > entirely in that case. I haven't previously thought about that, but it > *is* a bit weird. > -Greg > > > > > John > > > > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 wrote: > >> Hi all, > >> I encounter a trouble about cephfs sanpshot. It seems that the folder > >> '.snap' is exist. > >> But I use 'll -a' can't let it show up. And I enter that folder and > create > >> folder in it, it showed something wrong to use snapshot. > >> > >> Please check : http://imgur.com/elZhQvD > >> > >> > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
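For the archives, and assuming a hammer-era cluster as in the docs link above: snapshots are gated behind an MDS flag that has to be switched on explicitly (it is considered experimental, hence the option name). A hedged example, with placeholder mount paths:

  ceph mds set allow_new_snaps true --yes-i-really-mean-it

  # then a snapshot is taken by creating a directory inside the hidden .snap folder
  mkdir /mnt/cephfs/somedir/.snap/mysnap
  # and removed the same way
  rmdir /mnt/cephfs/somedir/.snap/mysnap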
[ceph-users] RBD/Ceph as Physical boot volume
I posted about this a while ago, and someone else has since inquired, but I am seriously wanting to know if anybody has figured out how to boot from an RBD device yet using iPXE or similar.

Last I read, loading the kernel and initrd from object storage would be theoretically easy, and would only require making an initramfs to initialize and mount the rbd.. But I couldn't find any documented instances of anybody having done this yet.

So.. has anybody done this yet? If so, which distros is it working on, and where can I find more info?

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
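Not a full answer, but the initramfs side might look roughly like this. A hedged sketch only: it assumes the kernel rbd module is included in the initramfs, the monitor addresses and key are baked into the image, and the sysfs add interface is used because the rbd CLI usually isn't available that early. The addresses, secret and image name below are made up for illustration.

  # inside the initramfs init script, before the root filesystem is mounted
  modprobe rbd
  echo "10.0.0.1,10.0.0.2,10.0.0.3 name=admin,secret=AQB_EXAMPLEKEY rbd bootvol -" > /sys/bus/rbd/add
  # the image should then appear as /dev/rbd0 and can be handed to the
  # normal root-mount logic, e.g. booting with root=/dev/rbd0

The kernel and initrd themselves would still have to come from iPXE (HTTP/TFTP); only the root filesystem lives on the RBD in this sketch.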
Re: [ceph-users] Does object map feature lock snapshots ?
Hi, I had no special logging activated. Today I re-enabled exclusive-lock object-map and fast-diff on an image in 9.2.1 As soon as I ran an rbd export-diff I had lots of these error messages on the console of the rbd export process: 2016-03-18 11:18:21.546658 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 2016-03-18 11:18:26.546750 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 2016-03-18 11:18:31.546840 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 2016-03-18 11:18:36.546928 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 2016-03-18 11:18:41.547017 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 2016-03-18 11:18:46.547105 7f77245d1700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f77137fe700' had timed out after 60 Is this a known issue ? On Tue, Mar 08, 2016 at 11:22:17AM -0500, Jason Dillaman wrote: > Is there anyway for you to provide debug logs (i.e. debug rbd = 20) from your > rbd CLI and qemu process when you attempt to create a snapshot? In v9.2.0, > there was an issue [1] where the cache flush writeback from the snap create > request was being blocked when the exclusive lock feature was enabled, but > that should have been fixed in v9.2.1. > > [1] http://tracker.ceph.com/issues/14542 > > -- > > Jason Dillaman > > > - Original Message - > > From: "Christoph Adomeit" > > To: ceph-us...@ceph.com > > Sent: Tuesday, March 8, 2016 11:13:04 AM > > Subject: [ceph-users] Does object map feature lock snapshots ? > > > > Hi, > > > > i have installed ceph 9.21 on proxmox with kernel 4.2.8-1-pve. > > > > Afterwards I have enabled the features: > > > > rbd feature enable $IMG exclusive-lock > > rbd feature enable $IMG object-map > > rbd feature enable $IMG fast-diff > > > > > > During the night I have a cronjob which does a rbd snap create on each > > of my images and then an rbd export-diff > > > > I found out that my cronjob was hanging during the rbd snap create and > > does not create the snapshot. > > > > Also more worse, sometimes also the vms were hanging. > > > > What are your experiences with object maps ? For me it looks that they > > are not yet production ready. > > > > Thanks > > Christoph > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
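If it keeps hanging, the features can be turned back off on a live image while this is investigated. A hedged example using the same $IMG convention as the commands quoted above; the features have to be removed in dependency order (fast-diff before object-map before exclusive-lock), and the object map can be rebuilt later if the features are re-enabled:

  rbd feature disable $IMG fast-diff
  rbd feature disable $IMG object-map
  rbd feature disable $IMG exclusive-lock

  # if the features are re-enabled later, rebuild the map so fast-diff is valid again
  rbd object-map rebuild $IMG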
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Great, I just recovered the first placement group from this error. To be sure, I ran a deep-scrub and that comes back clean. Thanks for all your help. Regards, Jeff On Thu, Mar 17, 2016 at 11:58 AM, Samuel Just wrote: > Oh, it's getting a stat mismatch. I think what happened is that on > one of the earlier repairs it reset the stats to the wrong value (the > orphan was causing the primary to scan two objects twice, which > matches the stat mismatch I see here). A pg repair repair will clear > that up. > -Sam > > On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald > wrote: > > Thanks Sam. > > > > Since I have prepared a script for this, I decided to go ahead with the > > checks.(patience isn't one of my extended attributes) > > > > I've got a file that searches the full erasure encoded spaces and does > your > > checklist below. I have operated only on one PG so far, the 70.459 one > > that we've been discussing.There was only the one file that I found > to > > be out of place--the one we already discussed/found and it has been > removed. > > > > The pg is still marked as inconsistent. I've scrubbed it a couple of > times > > now and what I've seen is: > > > > 2016-03-17 09:29:53.202818 7f2e816f8700 0 log_channel(cluster) log > [INF] : > > 70.459 deep-scrub starts > > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log > [ERR] : > > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, > > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, > > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. > > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log > [ERR] : > > 70.459 deep-scrub 1 errors > > 2016-03-17 09:44:23.592302 7f2e816f8700 0 log_channel(cluster) log > [INF] : > > 70.459 deep-scrub starts > > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log > [ERR] : > > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, > > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, > > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. > > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log > [ERR] : > > 70.459 deep-scrub 1 errors > > > > > > Should the scrub be sufficient to remove the inconsistent flag? I took > the > > osd offline during the repairs.I've looked at files in all of the > osds > > in the placement group and I'm not finding any more problem files.The > > vast majority of files do not have the user.cephos.lfn3 attribute. > There > > are 22321 objects that I seen and only about 230 have the > user.cephos.lfn3 > > file attribute. The files will have other attributes, just not > > user.cephos.lfn3. > > > > Regards, > > Jeff > > > > > > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just wrote: > >> > >> Ok, like I said, most files with _long at the end are *not orphaned*. > >> The generation number also is *not* an indication of whether the file > >> is orphaned -- some of the orphaned files will have > >> as the generation number and others won't. For each long filename > >> object in a pg you would have to: > >> 1) Pull the long name out of the attr > >> 2) Parse the hash out of the long name > >> 3) Turn that into a directory path > >> 4) Determine whether the file is at the right place in the path > >> 5) If not, remove it (or echo it to be checked) > >> > >> You probably want to wait for someone to get around to writing a > >> branch for ceph-objectstore-tool. Should happen in the next week or > >> two. 
> >> -Sam > >> > > > > -- > > > > Jeffrey McDonald, PhD > > Assistant Director for HPC Operations > > Minnesota Supercomputing Institute > > University of Minnesota Twin Cities > > 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu > > 117 Pleasant St SE phone: +1 612 625-6905 > > Minneapolis, MN 55455fax: +1 612 624-8861 > > > > > -- Jeffrey McDonald, PhD Assistant Director for HPC Operations Minnesota Supercomputing Institute University of Minnesota Twin Cities 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu 117 Pleasant St SE phone: +1 612 625-6905 Minneapolis, MN 55455fax: +1 612 624-8861 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
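While waiting for the ceph-objectstore-tool branch, the checklist above can be approximated with a small script run against the PG directory with the OSD stopped. This is a rough, untested sketch: it assumes the object hash can be taken from the __head_<hash> token inside the user.cephos.lfn3 attribute, and that the expected location is found by walking the reversed hash nibbles down through whatever DIR_* split directories exist (as in the paths discussed in this thread). It only echoes candidates; nothing is removed.

  pg_root=/var/lib/ceph/osd/ceph-307/current/70.459s0_head    # adjust per OSD/PG
  find "$pg_root" -name '*_long' -type f | while read -r f; do
      # 1) pull the long name out of the attr
      long=$(getfattr --only-values -n user.cephos.lfn3 "$f" 2>/dev/null) || continue
      # 2) parse the hash out of the long name
      hash=$(printf '%s' "$long" | sed -n 's/.*__head_\([0-9A-Fa-f]\{8\}\).*/\1/p')
      [ -n "$hash" ] || continue
      # 3) turn it into a directory path: descend by reversed hash nibbles
      #    for as long as the split subdirectories exist
      dir=$pg_root
      for n in $(printf '%s' "$hash" | rev | fold -w1); do
          sub=$dir/DIR_$(printf '%s' "$n" | tr 'a-f' 'A-F')
          [ -d "$sub" ] || break
          dir=$sub
      done
      # 4) + 5) if the file is not where the path says it should be, flag it
      [ "$(dirname "$f")" = "$dir" ] || echo "candidate orphan: $f (expected in $dir)"
  done

As Sam notes, most _long files are not orphaned, so anything this flags should be double-checked (and the PG deep-scrubbed) before deleting.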
Re: [ceph-users] [cephfs] About feature 'snapshot'
Snapshots are disabled by default: http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration John On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 wrote: > Hi all, > I encounter a trouble about cephfs sanpshot. It seems that the folder > '.snap' is exist. > But I use 'll -a' can't let it show up. And I enter that folder and create > folder in it, it showed something wrong to use snapshot. > > Please check : http://imgur.com/elZhQvD > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Yep, thanks for all the help tracking down the root cause! -Sam On Thu, Mar 17, 2016 at 10:50 AM, Jeffrey McDonald wrote: > Great, I just recovered the first placement group from this error. To be > sure, I ran a deep-scrub and that comes back clean. > > Thanks for all your help. > Regards, > Jeff > > On Thu, Mar 17, 2016 at 11:58 AM, Samuel Just wrote: >> >> Oh, it's getting a stat mismatch. I think what happened is that on >> one of the earlier repairs it reset the stats to the wrong value (the >> orphan was causing the primary to scan two objects twice, which >> matches the stat mismatch I see here). A pg repair repair will clear >> that up. >> -Sam >> >> On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald >> wrote: >> > Thanks Sam. >> > >> > Since I have prepared a script for this, I decided to go ahead with the >> > checks.(patience isn't one of my extended attributes) >> > >> > I've got a file that searches the full erasure encoded spaces and does >> > your >> > checklist below. I have operated only on one PG so far, the 70.459 one >> > that we've been discussing.There was only the one file that I found >> > to >> > be out of place--the one we already discussed/found and it has been >> > removed. >> > >> > The pg is still marked as inconsistent. I've scrubbed it a couple of >> > times >> > now and what I've seen is: >> > >> > 2016-03-17 09:29:53.202818 7f2e816f8700 0 log_channel(cluster) log >> > [INF] : >> > 70.459 deep-scrub starts >> > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log >> > [ERR] : >> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, >> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, >> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. >> > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log >> > [ERR] : >> > 70.459 deep-scrub 1 errors >> > 2016-03-17 09:44:23.592302 7f2e816f8700 0 log_channel(cluster) log >> > [INF] : >> > 70.459 deep-scrub starts >> > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log >> > [ERR] : >> > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, >> > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, >> > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. >> > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log >> > [ERR] : >> > 70.459 deep-scrub 1 errors >> > >> > >> > Should the scrub be sufficient to remove the inconsistent flag? I took >> > the >> > osd offline during the repairs.I've looked at files in all of the >> > osds >> > in the placement group and I'm not finding any more problem files. >> > The >> > vast majority of files do not have the user.cephos.lfn3 attribute. >> > There >> > are 22321 objects that I seen and only about 230 have the >> > user.cephos.lfn3 >> > file attribute. The files will have other attributes, just not >> > user.cephos.lfn3. >> > >> > Regards, >> > Jeff >> > >> > >> > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just wrote: >> >> >> >> Ok, like I said, most files with _long at the end are *not orphaned*. >> >> The generation number also is *not* an indication of whether the file >> >> is orphaned -- some of the orphaned files will have >> >> as the generation number and others won't. 
For each long filename >> >> object in a pg you would have to: >> >> 1) Pull the long name out of the attr >> >> 2) Parse the hash out of the long name >> >> 3) Turn that into a directory path >> >> 4) Determine whether the file is at the right place in the path >> >> 5) If not, remove it (or echo it to be checked) >> >> >> >> You probably want to wait for someone to get around to writing a >> >> branch for ceph-objectstore-tool. Should happen in the next week or >> >> two. >> >> -Sam >> >> >> > >> > -- >> > >> > Jeffrey McDonald, PhD >> > Assistant Director for HPC Operations >> > Minnesota Supercomputing Institute >> > University of Minnesota Twin Cities >> > 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu >> > 117 Pleasant St SE phone: +1 612 625-6905 >> > Minneapolis, MN 55455fax: +1 612 624-8861 >> > >> > > > > > > -- > > Jeffrey McDonald, PhD > Assistant Director for HPC Operations > Minnesota Supercomputing Institute > University of Minnesota Twin Cities > 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu > 117 Pleasant St SE phone: +1 612 625-6905 > Minneapolis, MN 55455fax: +1 612 624-8861 > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs infernalis (ceph version 9.2.1) - bonnie++
Hi, on ubuntu 14.04 client and centos 7.2 client with centos 7 Hammer its working without problems. -- Mit freundlichen Gruessen / Best regards Oliver Dzombic IP-Interactive mailto:i...@ip-interactive.de Anschrift: IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3 63571 Gelnhausen HRB 93402 beim Amtsgericht Hanau Geschäftsführung: Oliver Dzombic Steuer Nr.: 35 236 3622 1 UST ID: DE274086107 Am 19.03.2016 um 02:38 schrieb Michael Hanscho: > Hi! > > Trying to run bonnie++ on cephfs mounted via the kernel driver on a > centos 7.2.1511 machine resulted in: > > # bonnie++ -r 128 -u root -d /data/cephtest/bonnie2/ > Using uid:0, gid:0. > Writing a byte at a time...done > Writing intelligently...done > Rewriting...done > Reading a byte at a time...done > Reading intelligently...done > start 'em...done...done...done...done...done... > Create files in sequential order...done. > Stat files in sequential order...done. > Delete files in sequential order...Bonnie: drastic I/O error (rmdir): > Directory not empty > Cleaning up test directory after error. > > # ceph -w > cluster > health HEALTH_OK > monmap e3: 3 mons at > {cestor4=:6789/0,cestor5=:6789/0,cestor6=:6789/0} > election epoch 62, quorum 0,1,2 cestor4,cestor5,cestor6 > mdsmap e30: 1/1/1 up {0=cestor2=up:active}, 1 up:standby > osdmap e703: 60 osds: 60 up, 60 in > flags sortbitwise > pgmap v135437: 1344 pgs, 4 pools, 4315 GB data, 2315 kobjects > 7262 GB used, 320 TB / 327 TB avail > 1344 active+clean > > Any ideas? > > Gruesse > Michael > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] data corruption with hammer
Hi, Nick I switched between forward and writeback. (forward -> writeback) С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2016-03-17 16:10 GMT+03:00 Nick Fisk : > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Irek Fasikhov > > Sent: 17 March 2016 13:00 > > To: Sage Weil > > Cc: Robert LeBlanc ; ceph-users > us...@lists.ceph.com>; Nick Fisk ; William Perkins > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > Hi,All. > > > > I confirm the problem. When min_read_recency_for_promote> 1 data > > failure. > > But what scenario is this? Are you switching between forward and > writeback, or just running in writeback? > > > > > > > С уважением, Фасихов Ирек Нургаязович > > Моб.: +79229045757 > > > > 2016-03-17 15:26 GMT+03:00 Sage Weil : > > On Thu, 17 Mar 2016, Nick Fisk wrote: > > > There is got to be something else going on here. All that PR does is to > > > potentially delay the promotion to hit_set_period*recency instead of > > > just doing it on the 2nd read regardless, it's got to be uncovering > > > another bug. > > > > > > Do you see the same problem if the cache is in writeback mode before > you > > > start the unpacking. Ie is it the switching mid operation which causes > > > the problem? If it only happens mid operation, does it still occur if > > > you pause IO when you make the switch? > > > > > > Do you also see this if you perform on a RBD mount, to rule out any > > > librbd/qemu weirdness? > > > > > > Do you know if it’s the actual data that is getting corrupted or if > it's > > > the FS metadata? I'm only wondering as unpacking should really only be > > > writing to each object a couple of times, whereas FS metadata could > > > potentially be being updated+read back lots of times for the same group > > > of objects and ordering is very important. > > > > > > Thinking through it logically the only difference is that with > recency=1 > > > the object will be copied up to the cache tier, where recency=6 it will > > > be proxy read for a long time. If I had to guess I would say the issue > > > would lie somewhere in the proxy read + writeback<->forward logic. > > > > That seems reasonable. Was switching from writeback -> forward always > > part of the sequence that resulted in corruption? Not that there is a > > known ordering issue when switching to forward mode. I wouldn't really > > expect it to bite real users but it's possible.. > > > > http://tracker.ceph.com/issues/12814 > > > > I've opened a ticket to track this: > > > > http://tracker.ceph.com/issues/15171 > > > > What would be *really* great is if you could reproduce this with a > > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados > > running, and then find the sequence of operations that are sufficient to > > trigger a failure. > > > > sage > > > > > > > > > > > > > > > > > > > -Original Message- > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On > > Behalf Of > > > > Mike Lovell > > > > Sent: 16 March 2016 23:23 > > > > To: ceph-users ; sw...@redhat.com > > > > Cc: Robert LeBlanc ; William Perkins > > > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > > > just got done with a test against a build of 0.94.6 minus the two > commits > > that > > > > were backported in PR 7207. everything worked as it should with the > > cache- > > > > mode set to writeback and the min_read_recency_for_promote set to 2. 
> > > > assuming it works properly on master, there must be a commit that > we're > > > > missing on the backport to support this properly. > > > > > > > > sage, > > > > i'm adding you to the recipients on this so hopefully you see it. > the tl;dr > > > > version is that the backport of the cache recency fix to hammer > doesn't > > work > > > > right and potentially corrupts data when > > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > > > mike > > > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > > wrote: > > > > robert and i have done some further investigation the past couple > days > > on > > > > this. we have a test environment with a hard drive tier and an ssd > tier as a > > > > cache. several vms were created with volumes from the ceph cluster. i > > did a > > > > test in each guest where i un-tarred the linux kernel source multiple > > times > > > > and then did a md5sum check against all of the files in the resulting > > source > > > > tree. i started off with the monitors and osds running 0.94.5 and > never > > saw > > > > any problems. > > > > > > > > a single node was then upgraded to 0.94.6 which has osds in both the > ssd > > and > > > > hard drive tier. i then proceeded to run the same test and, while the > > untar > > > > and md5sum operations were running, i changed the ssd tier cache-mode > > > > from forward to writeback. almost immediately the vms started > reporting > > io > > > > errors a
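For anyone trying to reproduce this, the toggling described in the thread comes down to a couple of commands against the cache pool. A hedged example, with 'ssd-cache' as a placeholder pool name: set the recency above 1, flip the tier to forward while the guest workload is running, then flip it back to writeback and verify the data.

  ceph osd pool set ssd-cache min_read_recency_for_promote 2
  ceph osd tier cache-mode ssd-cache forward
  ceph osd tier cache-mode ssd-cache writeback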
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Hi Sam, In the 70.459 logs from the deep-scrub, there is an error: $ zgrep "= \-2$" ceph-osd.307.log.1.gz 2016-03-07 16:11:41.828332 7ff30cdad700 10 filestore(/var/lib/ceph/osd/ceph-307) remove 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 = -2 2016-03-07 21:44:02.197676 7fe96b56f700 10 filestore(/var/lib/ceph/osd/ceph-307) remove 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 = -2 I'm taking this as an indication of the error you mentioned.It looks to me as if this bug leaves two files with "issues" based upon what I see on the filesystem. First, I have a size-0 file in a directory where I expect only to have directories: root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D# ls -ltr total 320 -rw-r--r-- 1 root root 0 Jan 23 21:49 default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long drwxr-xr-x 2 root root 16384 Feb 5 15:13 DIR_6 drwxr-xr-x 2 root root 16384 Feb 5 17:26 DIR_3 drwxr-xr-x 2 root root 16384 Feb 10 00:01 DIR_C drwxr-xr-x 2 root root 16384 Mar 4 10:50 DIR_7 drwxr-xr-x 2 root root 16384 Mar 4 16:46 DIR_A drwxr-xr-x 2 root root 16384 Mar 5 02:37 DIR_2 drwxr-xr-x 2 root root 16384 Mar 5 17:39 DIR_4 drwxr-xr-x 2 root root 16384 Mar 8 16:50 DIR_F drwxr-xr-x 2 root root 16384 Mar 15 15:51 DIR_8 drwxr-xr-x 2 root root 16384 Mar 15 21:18 DIR_D drwxr-xr-x 2 root root 16384 Mar 15 22:25 DIR_0 drwxr-xr-x 2 root root 16384 Mar 15 22:35 DIR_9 drwxr-xr-x 2 root root 16384 Mar 15 22:56 DIR_E drwxr-xr-x 2 root root 16384 Mar 15 23:21 DIR_1 drwxr-xr-x 2 root root 12288 Mar 16 00:07 DIR_B drwxr-xr-x 2 root root 16384 Mar 16 00:34 DIR_5 I assume that this file is an issue as well..and needs to be removed. then, in the directory where the file should be, I have the same file: root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D/DIR_E# ls -ltr | grep -v __head_ total 64840 -rw-r--r-- 1 root root 1048576 Jan 23 21:49 default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long In the directory DIR_E here (from above), there is only one file without a __head_ in the pathname -- the file aboveShould I be deleting both these _long files without the __head_ in DIR_E and in one above .../DIR_E? Since there is no directory structure HASH in these files, is that the indication that it is an orphan? Thanks, Jeff On Tue, Mar 15, 2016 at 8:38 PM, Samuel Just wrote: > Ah, actually, I think there will be duplicates only around half the > time -- either the old link or the new link could be orphaned > depending on which xfs decides to list first. Only if the old link is > orphaned will it match the name of the object once it's recreated. 
I > should be able to find time to put together a branch in the next week > or two if you want to wait. It's still probably worth trying removing > that object in 70.459. > -Sam > > On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just wrote: > > The bug is entirely independent of hardware issues -- entirely a ceph > > bug. xfs doesn't let us specify an ordering when reading a directory, > > so we have to keep directory sizes small. That means that when one of > > those pg collection subfolders has 320 files in it, we split it into > > up to 16 smaller directories. Overwriting or removing an ec object > > requires us to rename the old version out of the way in case we need > > to roll back (that's the generation number I mentioned above). For > > crash safety, this involves first creating a link to the new name, > > then removing the old one. Both the old and new link will be in the > > same subdirectory. If creating the new link pushes the directory to > > 320 files then we do a split while both links are present. If the > > file in question is using the special long filename handling, then a > > bug in the resulting link juggling causes us to orphan the old version > > of the file. Your cluster seems to have an unusual number of objects > > with very long names, which is why it is so visible on your cluster. > > > > There are critical pool sizes where the PGs will all be close to one > > of those limits. It's possible you are not close to one of those > > limits. It's als
Re: [ceph-users] v10.0.4 released
On Wed, 16 Mar 2016, Eric Eastman wrote: > Thank you for doing this. It will make testing 10.0.x easier for all of us > in the field, and will make it easier to report bugs, as we will know that > the problems we find were not caused by our build process. Note that you can also always pull builds from the gitbuilders (which is what we run QA against). Both of these should work: ceph-deploy install --dev jewel HOST ceph-deploy install --dev v10.0.5 HOST or you can grab builds directly from gitbuilder.ceph.com. sage > Eric > > On Wed, Mar 16, 2016 at 7:14 AM, Loic Dachary wrote: > Hi, > > Because of a tiny mistake preventing deb packages to be built, > v10.0.5 was released shortly after v10.0.4 and is now the > current development release. The Stable release team[0] > collectively decided to help by publishing development > packages[1], starting with v10.0.5. > > The packages for v10.0.5 are available at > http://ceph-releases.dachary.org/ which can be used as a > replacement for http://download.ceph.com/ for both > http://download.ceph.com/rpm-testing and > http://download.ceph.com/debian-testing . The only difference is > the key used to sign the releases which can be imported with > > wget -q -O- > 'http://ceph-releases.dachary.org/release-key.asc' | sudo > apt-key add - > > or > > rpm --import > http://ceph-releases.dachary.org/release-key.asc > > The instructions to install development packages found at > http://docs.ceph.com/docs/master/install/get-packages/ can > otherwise be applied with no change. > > Cheers > > [0] Stable release team > http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who > [1] Publishing development releases > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126 > > On 08/03/2016 22:35, Sage Weil wrote: > > This is the fourth and last development release before Jewel. > The next > > release will be a release candidate with the final set of > features. Big > > items include RGW static website support, librbd journal > framework, fixed > > mon sync of config-key data, C++11 updates, and > bluestore/kstore. > > > > Note that, due to general developer busyness, we aren’t > building official > > release packages for this dev release. You can fetch autobuilt > gitbuilder > > packages from the usual location (http://gitbuilder.ceph.com). > > > > Notable Changes > > --- > > > > http://ceph.com/releases/v10-0-4-released/ > > > > Getting Ceph > > > > > > * Git at git://github.com/ceph/ceph.git > > * For packages, see > http://ceph.com/docs/master/install/get-packages#add-ceph-development > > * For ceph-deploy, see > http://ceph.com/docs/master/install/install-ceph-deploy > > > > -- > Loïc Dachary, Artisan Logiciel Libre > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] data corruption with hammer
I'll miss the Ceph community as well. There was a few things I really wanted to work in with Ceph. I got this: update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) dirty exists 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) 1040: finishing write tid 1 to nodez23350-256 1040: finishing write tid 2 to nodez23350-256 1040: finishing write tid 3 to nodez23350-256 1040: finishing write tid 4 to nodez23350-256 1040: finishing write tid 6 to nodez23350-256 1035: done (4 left) 1037: done (3 left) 1038: done (2 left) 1043: read oid 430 snap -1 1043: expect (ObjNum 429 snap 0 seq_num 429) 1040: finishing write tid 7 to nodez23350-256 update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) dirty exists 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) 1042: expect (ObjNum 664 snap 0 seq_num 664) 1043: Error: oid 430 read returned error code -2 ./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time 2016-03-17 10:47:19.085414 ./test/osd/RadosModel.h: 1109: FAILED assert(0) ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0x4db956] 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] 3: (()+0x9791d) [0x7fa1d472191d] 4: (()+0x72519) [0x7fa1d46fc519] 5: (()+0x13c178) [0x7fa1d47c6178] 6: (()+0x80a4) [0x7fa1d425a0a4] 7: (clone()+0x6d) [0x7fa1d2bd504d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' Aborted I had to toggle writeback/forward and min_read_recency_for_promote a few times to get it, but I don't know if it is because I only have one job running. Even with six jobs running, it is not easy to trigger with ceph_test_rados, but it is very instant in the RBD VMs. 
Here are the six run crashes (I have about the last 2000 lines of each if needed): nodev: update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num 1014) dirty exists 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) 1016: finishing write tid 1 to nodev21799-1016 1016: finishing write tid 2 to nodev21799-1016 1016: finishing write tid 3 to nodev21799-1016 1016: finishing write tid 4 to nodev21799-1016 1016: finishing write tid 6 to nodev21799-1016 1016: finishing write tid 7 to nodev21799-1016 update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num 1015) dirty exists 1016: left oid 1016 (ObjNum 1015 snap 0 seq_num 1015) 1017: finishing write tid 1 to nodev21799-1017 1017: finishing write tid 2 to nodev21799-1017 1017: finishing write tid 3 to nodev21799-1017 1017: finishing write tid 5 to nodev21799-1017 1017: finishing write tid 6 to nodev21799-1017 update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num 1016) dirty exists 1017: left oid 1017 (ObjNum 1016 snap 0 seq_num 1016) 1018: finishing write tid 1 to nodev21799-1018 1018: finishing write tid 2 to nodev21799-1018 1018: finishing write tid 3 to nodev21799-1018 1018: finishing write tid 4 to nodev21799-1018 1018: finishing write tid 6 to nodev21799-1018 1018: finishing write tid 7 to nodev21799-1018 update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num 1017) dirty exists 1018: left oid 1018 (ObjNum 1017 snap 0 seq_num 1017) 1019: finishing write tid 1 to nodev21799-1019 1019: finishing write tid 2 to nodev21799-1019 1019: finishing write tid 3 to nodev21799-1019 1019: finishing write tid 5 to nodev21799-1019 1019: finishing write tid 6 to nodev21799-1019 update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018) dirty exists 1019: left oid 1019 (ObjNum 1018 snap 0 seq_num 1018) 1021: finishing write tid 1 to nodev21799-1021 1020: finishing write tid 1 to nodev21799-1020 1020: finishing write tid 2 to nodev21799-1020 1020: finishing write tid 3 to nodev21799-1020 1020: finishing write tid 5 to nodev21799-1020 1020: finishing write tid 6 to nodev21799-1020 update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num 1019) dirty exists 1020: left oid 1020 (ObjNum 1019 snap 0 seq_num 1019) 1021: finishing write tid 2 to nodev21799-1021 1021: finishing write tid 3 to nodev21799-1021 1021: finishing write tid 5 to nodev21799-1021 1021: finishing write tid 6 to nodev21799-1021 update_object_version oid 1021 v 1077 (ObjNum 1020 snap 0 seq_num 1020) dirty exists 1021: left oid 1021 (ObjNum 1020 snap 0 seq_num 1020) 1022: finishing write tid 1 to nodev21799-1022 1022: finishing write tid 2 to nodev21799-1022 1022: finishing write tid 3 to nodev21799-1022 1022: finishing write tid 5 to nodev21799-1022 1022: finishing write tid 6 to nodev21799-1022 update_object_version oid 1022 v 1213 (ObjNum 1021 snap 0 seq_num 1021) dirty exists 1022: left oid 1022 (ObjNum 1021 snap 0 seq_num 1021) 1023: finishing write tid 1 to nodev21799-1023 1023: finishing write tid 2 to nodev21799-1023 1023: finishing wri
Re: [ceph-users] data corruption with hammer
Cherry-picking that commit onto v0.94.6 wasn't clean so I'm just building your branch. I'm not sure what the difference between your branch and 0.94.6 is, I don't see any commits against osd/ReplicatedPG.cc in the last 5 months other than the one you did today. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc wrote: > Yep, let me pull and build that branch. I tried installing the dbg > packages and running it in gdb, but it didn't load the symbols. > > Robert LeBlanc > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 > > > On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil wrote: >> On Thu, 17 Mar 2016, Robert LeBlanc wrote: >>> Also, is this ceph_test_rados rewriting objects quickly? I think that >>> the issue is with rewriting objects so if we can tailor the >>> ceph_test_rados to do that, it might be easier to reproduce. >> >> It's doing lots of overwrites, yeah. >> >> I was albe to reproduce--thanks! It looks like it's specific to >> hammer. The code was rewritten for jewel so it doesn't affect the >> latest. The problem is that maybe_handle_cache may proxy the read and >> also still try to handle the same request locally (if it doesn't trigger a >> promote). >> >> Here's my proposed fix: >> >> https://github.com/ceph/ceph/pull/8187 >> >> Do you mind testing this branch? >> >> It doesn't appear to be directly related to flipping between writeback and >> forward, although it may be that we are seeing two unrelated issues. I >> seemed to be able to trigger it more easily when I flipped modes, but the >> bug itself was a simple issue in the writeback mode logic. :/ >> >> Anyway, please see if this fixes it for you (esp with the RBD workload). >> >> Thanks! >> sage >> >> >> >> >>> >>> Robert LeBlanc >>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >>> >>> >>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc >>> wrote: >>> > I'll miss the Ceph community as well. There was a few things I really >>> > wanted to work in with Ceph. 
>>> > >>> > I got this: >>> > >>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) >>> > dirty exists >>> > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) >>> > 1040: finishing write tid 1 to nodez23350-256 >>> > 1040: finishing write tid 2 to nodez23350-256 >>> > 1040: finishing write tid 3 to nodez23350-256 >>> > 1040: finishing write tid 4 to nodez23350-256 >>> > 1040: finishing write tid 6 to nodez23350-256 >>> > 1035: done (4 left) >>> > 1037: done (3 left) >>> > 1038: done (2 left) >>> > 1043: read oid 430 snap -1 >>> > 1043: expect (ObjNum 429 snap 0 seq_num 429) >>> > 1040: finishing write tid 7 to nodez23350-256 >>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) >>> > dirty exists >>> > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) >>> > 1042: expect (ObjNum 664 snap 0 seq_num 664) >>> > 1043: Error: oid 430 read returned error code -2 >>> > ./test/osd/RadosModel.h: In function 'virtual void >>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time >>> > 2016-03-17 10:47:19.085414 >>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0) >>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >>> > const*)+0x76) [0x4db956] >>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] >>> > 3: (()+0x9791d) [0x7fa1d472191d] >>> > 4: (()+0x72519) [0x7fa1d46fc519] >>> > 5: (()+0x13c178) [0x7fa1d47c6178] >>> > 6: (()+0x80a4) [0x7fa1d425a0a4] >>> > 7: (clone()+0x6d) [0x7fa1d2bd504d] >>> > NOTE: a copy of the executable, or `objdump -rdS ` is >>> > needed to interpret this. >>> > terminate called after throwing an instance of 'ceph::FailedAssertion' >>> > Aborted >>> > >>> > I had to toggle writeback/forward and min_read_recency_for_promote a >>> > few times to get it, but I don't know if it is because I only have one >>> > job running. Even with six jobs running, it is not easy to trigger >>> > with ceph_test_rados, but it is very instant in the RBD VMs. >>> > >>> > Here are the six run crashes (I have about the last 2000 lines of each >>> > if needed): >>> > >>> > nodev: >>> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num >>> > 1014) dirty exists >>> > 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) >>> > 1016: finishing write tid 1 to nodev21799-1016 >>> > 1016: finishing write tid 2 to nodev21799-1016 >>> > 1016: finishing write tid 3 to nodev21799-1016 >>> > 1016: finishing write tid 4 to nodev21799-1016 >>> > 1016: finishing write tid 6 to nodev21799-1016 >>> > 1016: finishing write tid 7 to nodev21799-1016 >>> > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num >>> > 1015) dirty exists >>> > 1016: left oid 1016 (ObjNum 1015 snap 0 seq_num 1015) >>> > 1017: finishing write tid 1
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Hi Nick, Your solution requires manual configuration for each VM and cannot be setup as part of an automated OpenStack deployment. It would be really nice if it was a hypervisor based setting as opposed to a VM based setting. Thanks Daniel -Original Message- From: Nick Fisk [mailto:n...@fisk.me.uk] Sent: 16 March 2016 08:59 To: Daniel Niasoff ; 'Van Leeuwen, Robert' ; 'Jason Dillaman' Cc: ceph-users@lists.ceph.com Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of Daniel Niasoff > Sent: 16 March 2016 08:26 > To: Van Leeuwen, Robert ; Jason Dillaman > > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > Hi Robert, > > >Caching writes would be bad because a hypervisor failure would result > >in > loss of the cache which pretty much guarantees inconsistent data on > the ceph volume. > >Also live-migration will become problematic compared to running > everything from ceph since you will also need to migrate the local-storage. I tested a solution using iSCSI for the cache devices. Each VM was using flashcache with a combination of a iSCSI LUN from a SSD and a RBD. This gets around the problem of moving things around or if the hypervisor goes down. It's not local caching but the write latency is at least 10x lower than the RBD. Note I tested it, I didn't put it into production :-) > > My understanding of how a writeback cache should work is that it > should only take a few seconds for writes to be streamed onto the > network and is focussed on resolving the speed issue of small sync > writes. The writes would > be bundled into larger writes that are not time sensitive. > > So there is potential for a few seconds data loss but compared to the current > trend of using ephemeral storage to solve this issue, it's a major > improvement. Yeah, problem is a couple of seconds data loss mean different things to different people. > > > (considering the time required for setting up and maintaining the > > extra > caching layer on each vm, unless you work for free ;-) > > Couldn't agree more there. > > I am just so surprised how the openstack community haven't looked to > resolve this issue. Ephemeral storage is a HUGE compromise unless you > have built in failure into every aspect of your application but many > people use openstack as a general purpose devstack. > > (Jason pointed out his blueprint but I guess it's at least a year or 2 away - > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash- > consistent_write-back_caching_extension) > > I see articles discussing the idea such as this one > > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering- > scalable-cache/ > > but no real straightforward validated setup instructions. > > Thanks > > Daniel > > > -Original Message- > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] > Sent: 16 March 2016 08:11 > To: Jason Dillaman ; Daniel Niasoff > > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > >Indeed, well understood. > > > >As a shorter term workaround, if you have control over the VMs, you > >could > always just slice out an LVM volume from local SSD/NVMe and pass it > through to the guest. Within the guest, use dm-cache (or similar) to > add a > cache front-end to your RBD volume. > > If you do this you need to setup your cache as read-cache only. 
> Caching writes would be bad because a hypervisor failure would result > in loss > of the cache which pretty much guarantees inconsistent data on the > ceph volume. > Also live-migration will become problematic compared to running > everything from ceph since you will also need to migrate the local-storage. > > The question will be if adding more ram (== more read cache) would not > be more convenient and cheaper in the end. > (considering the time required for setting up and maintaining the > extra caching layer on each vm, unless you work for free ;-) Also > reads from ceph > are pretty fast compared to the biggest bottleneck: (small) sync writes. > So it is debatable how much performance you would win except for some > use-cases with lots of reads on very large data sets which are also > very latency sensitive. > > Cheers, > Robert van Leeuwen > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
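To make the iSCSI-SSD-plus-RBD idea above concrete, the per-VM setup with flashcache looks roughly like this. A hedged sketch with made-up device names, and using writethrough rather than writeback caching so that losing the cache device cannot leave the RBD inconsistent (the concern raised above):

  # /dev/sdc = iSCSI LUN carved from the SSD, /dev/rbd0 = the mapped RBD
  flashcache_create -p thru rbd_cache /dev/sdc /dev/rbd0
  mkfs.xfs /dev/mapper/rbd_cache
  mount /dev/mapper/rbd_cache /mnt/data

Writethrough only helps reads, of course; the small-sync-write latency problem discussed in this thread really needs an ordered, crash-consistent writeback layer like the RBD blueprint Jason mentioned.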
Re: [ceph-users] v10.0.4 released
Thank you for doing this. It will make testing 10.0.x easier for all of us in the field, and will make it easier to report bugs, as we will know that the problems we find were not caused by our build process. Eric On Wed, Mar 16, 2016 at 7:14 AM, Loic Dachary wrote: > Hi, > > Because of a tiny mistake preventing deb packages to be built, v10.0.5 was > released shortly after v10.0.4 and is now the current development release. > The Stable release team[0] collectively decided to help by publishing > development packages[1], starting with v10.0.5. > > The packages for v10.0.5 are available at > http://ceph-releases.dachary.org/ which can be used as a replacement for > http://download.ceph.com/ for both http://download.ceph.com/rpm-testing > and http://download.ceph.com/debian-testing . The only difference is the > key used to sign the releases which can be imported with > > wget -q -O- 'http://ceph-releases.dachary.org/release-key.asc' | sudo > apt-key add - > > or > > rpm --import http://ceph-releases.dachary.org/release-key.asc > > The instructions to install development packages found at > http://docs.ceph.com/docs/master/install/get-packages/ can otherwise be > applied with no change. > > Cheers > > [0] Stable release team > http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO#Whos-who > [1] Publishing development releases > http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/30126 > > On 08/03/2016 22:35, Sage Weil wrote: > > This is the fourth and last development release before Jewel. The next > > release will be a release candidate with the final set of features. Big > > items include RGW static website support, librbd journal framework, fixed > > mon sync of config-key data, C++11 updates, and bluestore/kstore. > > > > Note that, due to general developer busyness, we aren’t building official > > release packages for this dev release. You can fetch autobuilt gitbuilder > > packages from the usual location (http://gitbuilder.ceph.com). > > > > Notable Changes > > --- > > > > http://ceph.com/releases/v10-0-4-released/ > > > > Getting Ceph > > > > > > * Git at git://github.com/ceph/ceph.git > > * For packages, see > http://ceph.com/docs/master/install/get-packages#add-ceph-development > > * For ceph-deploy, see > http://ceph.com/docs/master/install/install-ceph-deploy > > > > -- > Loïc Dachary, Artisan Logiciel Libre > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] data corruption with hammer
On Thu, 17 Mar 2016, Nick Fisk wrote: > There is got to be something else going on here. All that PR does is to > potentially delay the promotion to hit_set_period*recency instead of > just doing it on the 2nd read regardless, it's got to be uncovering > another bug. > > Do you see the same problem if the cache is in writeback mode before you > start the unpacking. Ie is it the switching mid operation which causes > the problem? If it only happens mid operation, does it still occur if > you pause IO when you make the switch? > > Do you also see this if you perform on a RBD mount, to rule out any > librbd/qemu weirdness? > > Do you know if it’s the actual data that is getting corrupted or if it's > the FS metadata? I'm only wondering as unpacking should really only be > writing to each object a couple of times, whereas FS metadata could > potentially be being updated+read back lots of times for the same group > of objects and ordering is very important. > > Thinking through it logically the only difference is that with recency=1 > the object will be copied up to the cache tier, where recency=6 it will > be proxy read for a long time. If I had to guess I would say the issue > would lie somewhere in the proxy read + writeback<->forward logic. That seems reasonable. Was switching from writeback -> forward always part of the sequence that resulted in corruption? Not that there is a known ordering issue when switching to forward mode. I wouldn't really expect it to bite real users but it's possible.. http://tracker.ceph.com/issues/12814 I've opened a ticket to track this: http://tracker.ceph.com/issues/15171 What would be *really* great is if you could reproduce this with a ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados running, and then find the sequence of operations that are sufficient to trigger a failure. sage > > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Mike Lovell > > Sent: 16 March 2016 23:23 > > To: ceph-users ; sw...@redhat.com > > Cc: Robert LeBlanc ; William Perkins > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > just got done with a test against a build of 0.94.6 minus the two commits > > that > > were backported in PR 7207. everything worked as it should with the cache- > > mode set to writeback and the min_read_recency_for_promote set to 2. > > assuming it works properly on master, there must be a commit that we're > > missing on the backport to support this properly. > > > > sage, > > i'm adding you to the recipients on this so hopefully you see it. the tl;dr > > version is that the backport of the cache recency fix to hammer doesn't work > > right and potentially corrupts data when > > the min_read_recency_for_promote is set to greater than 1. > > > > mike > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > wrote: > > robert and i have done some further investigation the past couple days on > > this. we have a test environment with a hard drive tier and an ssd tier as a > > cache. several vms were created with volumes from the ceph cluster. i did a > > test in each guest where i un-tarred the linux kernel source multiple times > > and then did a md5sum check against all of the files in the resulting source > > tree. i started off with the monitors and osds running 0.94.5 and never saw > > any problems. > > > > a single node was then upgraded to 0.94.6 which has osds in both the ssd and > > hard drive tier. 
i then proceeded to run the same test and, while the untar > > and md5sum operations were running, i changed the ssd tier cache-mode > > from forward to writeback. almost immediately the vms started reporting io > > errors and odd data corruption. the remainder of the cluster was updated to > > 0.94.6, including the monitors, and the same thing happened. > > > > things were cleaned up and reset and then a test was run > > where min_read_recency_for_promote for the ssd cache pool was set to 1. > > we previously had it set to 6. there was never an error with the recency > > setting set to 1. i then tested with it set to 2 and it immediately caused > > failures. we are currently thinking that it is related to the backport of > > the fix > > for the recency promotion and are in progress of making a .6 build without > > that backport to see if we can cause corruption. is anyone using a version > > from after the original recency fix (PR 6702) with a cache tier in writeback > > mode? anyone have a similar problem? > > > > mike > > > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell > > wrote: > > something weird happened on one of the ceph clusters that i administer > > tonight which resulted in virtual machines using rbd volumes seeing > > corruption in multiple forms. > > > > when everything was fine earlier in the day, the cluster was a number of > > storage nodes spread across 3 different roots in the crush
Re: [ceph-users] data corruption with hammer
We are trying to figure out how to use rados bench to reproduce. Ceph itself doesn't seem to think there is any corruption, but when you do a verify inside the RBD, there is. Can rados bench verify the objects after they are written? It also seems to be primarily the filesystem metadata that is corrupted. If we fsck the volume, there is missing data (put into lost+found), but if it is there it is primarily OK. There only seems to be a few cases where a file's contents are corrupted. I would suspect on an object boundary. We would have to look at blockinfo to map that out and see if that is what is happening. We stopped all the IO and did put the tier in writeback mode with recency 1, set the recency to 2 and started the test and there was corruption, so it doesn't seem to be limited to changing the mode. I don't know how that patch could cause the issue either. Unless there is a bug that reads from the back tier, but writes to cache tier, then the object gets promoted wiping that last write, but then it seems like it should not be as much corruption since the metadata should be in the cache pretty quick. We usually evited the cache before each try so we should not be evicting on writeback. Sent from a mobile device, please excuse any typos. On Mar 17, 2016 6:26 AM, "Sage Weil" wrote: > On Thu, 17 Mar 2016, Nick Fisk wrote: > > There is got to be something else going on here. All that PR does is to > > potentially delay the promotion to hit_set_period*recency instead of > > just doing it on the 2nd read regardless, it's got to be uncovering > > another bug. > > > > Do you see the same problem if the cache is in writeback mode before you > > start the unpacking. Ie is it the switching mid operation which causes > > the problem? If it only happens mid operation, does it still occur if > > you pause IO when you make the switch? > > > > Do you also see this if you perform on a RBD mount, to rule out any > > librbd/qemu weirdness? > > > > Do you know if it’s the actual data that is getting corrupted or if it's > > the FS metadata? I'm only wondering as unpacking should really only be > > writing to each object a couple of times, whereas FS metadata could > > potentially be being updated+read back lots of times for the same group > > of objects and ordering is very important. > > > > Thinking through it logically the only difference is that with recency=1 > > the object will be copied up to the cache tier, where recency=6 it will > > be proxy read for a long time. If I had to guess I would say the issue > > would lie somewhere in the proxy read + writeback<->forward logic. > > That seems reasonable. Was switching from writeback -> forward always > part of the sequence that resulted in corruption? Not that there is a > known ordering issue when switching to forward mode. I wouldn't really > expect it to bite real users but it's possible.. > > http://tracker.ceph.com/issues/12814 > > I've opened a ticket to track this: > > http://tracker.ceph.com/issues/15171 > > What would be *really* great is if you could reproduce this with a > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados > running, and then find the sequence of operations that are sufficient to > trigger a failure. 
> > sage > > > > > > > > > > > > -Original Message- > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > > > Mike Lovell > > > Sent: 16 March 2016 23:23 > > > To: ceph-users ; sw...@redhat.com > > > Cc: Robert LeBlanc ; William Perkins > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > just got done with a test against a build of 0.94.6 minus the two > commits that > > > were backported in PR 7207. everything worked as it should with the > cache- > > > mode set to writeback and the min_read_recency_for_promote set to 2. > > > assuming it works properly on master, there must be a commit that we're > > > missing on the backport to support this properly. > > > > > > sage, > > > i'm adding you to the recipients on this so hopefully you see it. the > tl;dr > > > version is that the backport of the cache recency fix to hammer > doesn't work > > > right and potentially corrupts data when > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > mike > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > wrote: > > > robert and i have done some further investigation the past couple days > on > > > this. we have a test environment with a hard drive tier and an ssd > tier as a > > > cache. several vms were created with volumes from the ceph cluster. i > did a > > > test in each guest where i un-tarred the linux kernel source multiple > times > > > and then did a md5sum check against all of the files in the resulting > source > > > tree. i started off with the monitors and osds running 0.94.5 and > never saw > > > any problems. > > > > > > a single node was then upgraded to 0.94.6 which has osds in both the > ssd and >
Re: [ceph-users] RBD hanging on some volumes of a pool
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Adrien Gillard > Sent: 17 March 2016 10:23 > To: ceph-users > Subject: [ceph-users] RBD hanging on some volumes of a pool > > Hi, > > I am facing issues with some of my rbd volumes since yesterday. Some of > them completely hang at some point before eventually resuming IO, may it > be a few minutes or several hours later. > > First and foremost, my setup : I already detailed it on the mailing list > [0][1]. > Some changes have been made : the 3 monitors are now VM and we are > trying kernel 4.4.5 on the clients (cluster is still 3.10 centos7). > > Using EC pools, I already had some trouble with RBD features not supported > by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the > hassle. Everything has been working pretty smoothly since. > > All my volumes (currently 5) are on an EC pool with writeback cache. Two of > them are perfectly fine. On the other 3, different story : doing IO is > impossible, if I start a simple copy I get a new file of a few dozen MB (or > sometimes 0) then it hangs. Doing dd with direct and sync flags has the same > behaviour. I can only guess that you are having problems with your cache tier not flushing and so writes are stalling on waiting for space to become available. Can you post ceph osd dump | grep pool and ceph df detail > > I tried witching back to 3.10, no changes, on the client I rebooted I > currently > cannot mount the filesystem, mount hangs (the volume seems correctly > mapped however). > > strace on the cp command freezes in the middle of a read : > > 11:17:56 write(4, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 read(3, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 write(4, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 read(3, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 write(4, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 read(3, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 write(4, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 65536) = 65536 > 11:17:56 read(3, > > > I tried to bump up the logging but I don't really know what to look for > exactly > and didn't see anything obvious. > > Any input or lead on how to debug this would be highly appreciated :) > > Adrien > > [0] http://www.spinics.net/lists/ceph-users/msg23990.html > [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016- > January/007004.html > [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016- > February/007746.html > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
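If the dump does show the cache pool sitting at its limits, the flush/evict thresholds are the usual suspects. A hedged example of the knobs involved, with a placeholder pool name and arbitrary sizes: dirty objects start flushing at cache_target_dirty_ratio of target_max_bytes and eviction starts at cache_target_full_ratio; if the tier fills faster than it can flush, client writes stall.

  ceph osd pool set cache-pool target_max_bytes 500000000000
  ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
  ceph osd pool set cache-pool cache_target_full_ratio 0.8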
Re: [ceph-users] RBD hanging on some volumes of a pool
Hi Nick, Thank you for your feedback. The cache tiers was fine. We identified some packet loss between two switches. As usual with network, relatively easy to identify but not something that comes to mind at first :) Adrien On Thu, Mar 17, 2016 at 2:32 PM, Nick Fisk wrote: > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Adrien Gillard > > Sent: 17 March 2016 10:23 > > To: ceph-users > > Subject: [ceph-users] RBD hanging on some volumes of a pool > > > > Hi, > > > > I am facing issues with some of my rbd volumes since yesterday. Some of > > them completely hang at some point before eventually resuming IO, may it > > be a few minutes or several hours later. > > > > First and foremost, my setup : I already detailed it on the mailing list > [0][1]. > > Some changes have been made : the 3 monitors are now VM and we are > > trying kernel 4.4.5 on the clients (cluster is still 3.10 centos7). > > > > Using EC pools, I already had some trouble with RBD features not > supported > > by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the > > hassle. Everything has been working pretty smoothly since. > > > > All my volumes (currently 5) are on an EC pool with writeback cache. Two > of > > them are perfectly fine. On the other 3, different story : doing IO is > > impossible, if I start a simple copy I get a new file of a few dozen MB > (or > > sometimes 0) then it hangs. Doing dd with direct and sync flags has the > same > > behaviour. > > I can only guess that you are having problems with your cache tier not > flushing and so writes are stalling on waiting for space to become > available. Can you post > > ceph osd dump | grep pool > > and > > ceph df detail > > > > > I tried witching back to 3.10, no changes, on the client I rebooted I > currently > > cannot mount the filesystem, mount hangs (the volume seems correctly > > mapped however). > > > > strace on the cp command freezes in the middle of a read : > > > > 11:17:56 write(4, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 read(3, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 write(4, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 read(3, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 write(4, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 read(3, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 write(4, > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > > 65536) = 65536 > > 11:17:56 read(3, > > > > > > I tried to bump up the logging but I don't really know what to look for > exactly > > and didn't see anything obvious. > > > > Any input or lead on how to debug this would be highly appreciated :) > > > > Adrien > > > > [0] http://www.spinics.net/lists/ceph-users/msg23990.html > > [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016- > > January/007004.html > > [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016- > > February/007746.html > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph-deploy rgw
For clusters that were created pre-hammer, is there a way to create the bootstrap-rgw keyring so that ceph-deploy can be used to create additional rgw instances? http://docs.ceph.com/ceph-deploy/docs/rgw.html -- Derek T. Yarnell University of Maryland Institute for Advanced Computer Studies ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
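One approach that should work (a hedged sketch, not verified against every pre-hammer-created cluster; the monitor hostname is a placeholder) is to create the missing bootstrap-rgw key by hand using the standard bootstrap profile, then let ceph-deploy gather it:

    # on a monitor/admin node that has a client.admin keyring
    mkdir -p /var/lib/ceph/bootstrap-rgw
    ceph auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
        -o /var/lib/ceph/bootstrap-rgw/ceph.keyring
    # from the ceph-deploy admin node, re-gather keys so ceph-deploy can use it
    ceph-deploy gatherkeys MON_HOST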
Re: [ceph-users] data corruption with hammer
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Irek Fasikhov > Sent: 17 March 2016 13:00 > To: Sage Weil > Cc: Robert LeBlanc ; ceph-users us...@lists.ceph.com>; Nick Fisk ; William Perkins > > Subject: Re: [ceph-users] data corruption with hammer > > Hi,All. > > I confirm the problem. When min_read_recency_for_promote> 1 data > failure. But what scenario is this? Are you switching between forward and writeback, or just running in writeback? > > > С уважением, Фасихов Ирек Нургаязович > Моб.: +79229045757 > > 2016-03-17 15:26 GMT+03:00 Sage Weil : > On Thu, 17 Mar 2016, Nick Fisk wrote: > > There is got to be something else going on here. All that PR does is to > > potentially delay the promotion to hit_set_period*recency instead of > > just doing it on the 2nd read regardless, it's got to be uncovering > > another bug. > > > > Do you see the same problem if the cache is in writeback mode before you > > start the unpacking. Ie is it the switching mid operation which causes > > the problem? If it only happens mid operation, does it still occur if > > you pause IO when you make the switch? > > > > Do you also see this if you perform on a RBD mount, to rule out any > > librbd/qemu weirdness? > > > > Do you know if it’s the actual data that is getting corrupted or if it's > > the FS metadata? I'm only wondering as unpacking should really only be > > writing to each object a couple of times, whereas FS metadata could > > potentially be being updated+read back lots of times for the same group > > of objects and ordering is very important. > > > > Thinking through it logically the only difference is that with recency=1 > > the object will be copied up to the cache tier, where recency=6 it will > > be proxy read for a long time. If I had to guess I would say the issue > > would lie somewhere in the proxy read + writeback<->forward logic. > > That seems reasonable. Was switching from writeback -> forward always > part of the sequence that resulted in corruption? Not that there is a > known ordering issue when switching to forward mode. I wouldn't really > expect it to bite real users but it's possible.. > > http://tracker.ceph.com/issues/12814 > > I've opened a ticket to track this: > > http://tracker.ceph.com/issues/15171 > > What would be *really* great is if you could reproduce this with a > ceph_test_rados workload (from ceph-tests). I.e., get ceph_test_rados > running, and then find the sequence of operations that are sufficient to > trigger a failure. > > sage > > > > > > > > > > > > -Original Message- > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On > Behalf Of > > > Mike Lovell > > > Sent: 16 March 2016 23:23 > > > To: ceph-users ; sw...@redhat.com > > > Cc: Robert LeBlanc ; William Perkins > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > just got done with a test against a build of 0.94.6 minus the two commits > that > > > were backported in PR 7207. everything worked as it should with the > cache- > > > mode set to writeback and the min_read_recency_for_promote set to 2. > > > assuming it works properly on master, there must be a commit that we're > > > missing on the backport to support this properly. > > > > > > sage, > > > i'm adding you to the recipients on this so hopefully you see it. 
the > > > tl;dr > > > version is that the backport of the cache recency fix to hammer doesn't > work > > > right and potentially corrupts data when > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > mike > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > wrote: > > > robert and i have done some further investigation the past couple days > on > > > this. we have a test environment with a hard drive tier and an ssd tier > > > as a > > > cache. several vms were created with volumes from the ceph cluster. i > did a > > > test in each guest where i un-tarred the linux kernel source multiple > times > > > and then did a md5sum check against all of the files in the resulting > source > > > tree. i started off with the monitors and osds running 0.94.5 and never > saw > > > any problems. > > > > > > a single node was then upgraded to 0.94.6 which has osds in both the ssd > and > > > hard drive tier. i then proceeded to run the same test and, while the > untar > > > and md5sum operations were running, i changed the ssd tier cache-mode > > > from forward to writeback. almost immediately the vms started reporting > io > > > errors and odd data corruption. the remainder of the cluster was updated > to > > > 0.94.6, including the monitors, and the same thing happened. > > > > > > things were cleaned up and reset and then a test was run > > > where min_read_recency_for_promote for the ssd cache pool was set to > 1. > > > we previously had it set to 6. there was never an error with the recency > > > setting set to 1. i then tested with it
Re: [ceph-users] data corruption with hammer
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Possible, it looks like all the messages comes from a test suite. Is there some logging that would expose this or an assert that could be added? We are about ready to do some testing in our lab to see if we can replicate it and workaround the issue. I also can't tell which version introduced this in Hammer, it doesn't look like it has been resolved. Thanks, -BEGIN PGP SIGNATURE- Version: Mailvelope v1.3.6 Comment: https://www.mailvelope.com wsFcBAEBCAAQBQJW6bqRCRDmVDuy+mK58QAANTsP/1jceRh9zYDlm2rkVq3e F6UKgezyCWV7h1cou8/rSVkxOfyyWEDSy1nMPBTHCtfMuOHzlx9VZftmPCiY BmxbclpUhAbAbjMb/E7t0jFR7fAZylX4okjUTN1y7NII+6xMXyxb51drYrZv AJzNcXfWYL1+y0Mz/QqOgEyij27OF8vYpSTJqXFDUcXtZNPfyvTjJ1ttYtuR saFJJ6SrFXA5LliGBNQK+pTDq0ZF0Bn0soE73rpzwpQvIdiOf/Jg7hAbERCc Vqjhg34YVLdpGd8W7IvaT0RirYbz8SmRdwOw1IIkBcqe0r9Mt08OgKu5NPT3 Rm0MKYynE1E7nKgutPisJQidT9QuaSVuY40oRDBIlrFA1BxNjGjwFxZn7y8r WyNMHKqB9Y+78uWdtEZtGfiSwyxC2UZTQFI4+eLs/XOoRLWv9oxRYV55Co0W e8zPW0nL1pm9iD9J+3fCRlNEL+cyDjsLLmW005BkF2q7da1XgxkoNndUBTlM Az9RGHoCELfI6kle315/2BEGfE2aRokLngbyhQWKAWmrdTCTDZaJwDKIi4hb 69LGT2eHofTWB5KgMHoCFLUSy2lYa86GxLLsBvPuqOfAXPWHMZERGv94qH/E CppgbnchgRHuI68rNM6nFYPJa4C3MlyQhu2WmOialAGgQi+IQP/g6h70e0RR eqLX =DcjE -END PGP SIGNATURE- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Mar 16, 2016 at 1:40 PM, Gregory Farnum wrote: > This tracker ticket happened to go by my eyes today: > http://tracker.ceph.com/issues/12814 . There isn't a lot of detail > there but the headline matches. > -Greg > > On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk wrote: > > > > > >> -Original Message- > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of > >> Christian Balzer > >> Sent: 16 March 2016 07:08 > >> To: Robert LeBlanc > >> Cc: Robert LeBlanc ; ceph-users >> us...@lists.ceph.com>; William Perkins > >> Subject: Re: [ceph-users] data corruption with hammer > >> > >> > >> Hello Robert, > >> > >> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote: > >> > >> > -BEGIN PGP SIGNED MESSAGE- > >> > Hash: SHA256 > >> > > >> > There are no monitors on the new node. > >> > > >> So one less possible source of confusion. > >> > >> > It doesn't look like there has been any new corruption since we > >> > stopped changing the cache modes. Upon closer inspection, some files > >> > have been changed such that binary files are now ASCII files and visa > >> > versa. These are readable ASCII files and are things like PHP or > >> > script files. Or C files where ASCII files should be. > >> > > >> What would be most interesting is if the objects containing those > > corrupted > >> files did reside on the new OSDs (primary PG) or the old ones, or both. > >> > >> Also, what cache mode was the cluster in before the first switch > > (writeback I > >> presume from the timeline) and which one is it in now? > >> > >> > I've seen this type of corruption before when a SAN node misbehaved > >> > and both controllers were writing concurrently to the backend disks. > >> > The volume was only mounted by one host, but the writes were split > >> > between the controllers when it should have been active/passive. > >> > > >> > We have killed off the OSDs on the new node as a precaution and will > >> > try to replicate this in our lab. > >> > > >> > I suspicion is that is has to do with the cache promotion code update, > >> > but I'm not sure how it would have caused this. 
> >> > > >> While blissfully unaware of the code, I have a hard time imagining how > it > >> would cause that as well. > >> Potentially a regression in the code that only triggers in one cache > mode > > and > >> when wanting to promote something? > >> > >> Or if it is actually the switching action, not correctly promoting > things > > as it > >> happens? > >> And thus referencing a stale object? > > > > I can't think of any other reason why the recency would break things in > any > > other way. Can the OP confirm what recency setting is being used? > > > > When you switch to writeback, if you haven't reached the required recency > > yet, all reads will be proxied, previous behaviour would have pretty much > > promoted all the time regardless. So unless something is happening where > > writes are getting sent to one tier in forward mode and then read from a > > different tier in WB mode, I'm out of ideas. I'm pretty sure the code > says > > Proxy Read then check for promotion, so I'm not even convinced that there > > should be any difference anyway. > > > > I note the documentation states that in forward mode, modified objects > get > > written to the backing tier, I'm not if that sounds correct to me. But if > > that is what is happening, that could also be related to the problem??? > > > > I think this might be easyish to reproduce using the get/put commands > with a > > couple of objects on a test pool if anybody out there is running 94.6 on > the > > whole c
Re: [ceph-users] data corruption with hammer
just got done with a test against a build of 0.94.6 minus the two commits that were backported in PR 7207. everything worked as it should with the cache-mode set to writeback and the min_read_recency_for_promote set to 2. assuming it works properly on master, there must be a commit that we're missing on the backport to support this properly. sage, i'm adding you to the recipients on this so hopefully you see it. the tl;dr version is that the backport of the cache recency fix to hammer doesn't work right and potentially corrupts data when the min_read_recency_for_promote is set to greater than 1. mike On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell wrote: > robert and i have done some further investigation the past couple days on > this. we have a test environment with a hard drive tier and an ssd tier as > a cache. several vms were created with volumes from the ceph cluster. i did > a test in each guest where i un-tarred the linux kernel source multiple > times and then did a md5sum check against all of the files in the resulting > source tree. i started off with the monitors and osds running 0.94.5 and > never saw any problems. > > a single node was then upgraded to 0.94.6 which has osds in both the ssd > and hard drive tier. i then proceeded to run the same test and, while the > untar and md5sum operations were running, i changed the ssd tier cache-mode > from forward to writeback. almost immediately the vms started reporting io > errors and odd data corruption. the remainder of the cluster was updated to > 0.94.6, including the monitors, and the same thing happened. > > things were cleaned up and reset and then a test was run > where min_read_recency_for_promote for the ssd cache pool was set to 1. we > previously had it set to 6. there was never an error with the recency > setting set to 1. i then tested with it set to 2 and it immediately caused > failures. we are currently thinking that it is related to the backport of > the fix for the recency promotion and are in progress of making a .6 build > without that backport to see if we can cause corruption. is anyone using a > version from after the original recency fix (PR 6702) with a cache tier in > writeback mode? anyone have a similar problem? > > mike > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell > wrote: > >> something weird happened on one of the ceph clusters that i administer >> tonight which resulted in virtual machines using rbd volumes seeing >> corruption in multiple forms. >> >> when everything was fine earlier in the day, the cluster was a number of >> storage nodes spread across 3 different roots in the crush map. the first >> bunch of storage nodes have both hard drives and ssds in them with the hard >> drives in one root and the ssds in another. there is a pool for each and >> the pool for the ssds is a cache tier for the hard drives. the last set of >> storage nodes were in a separate root with their own pool that is being >> used for burn in testing. >> >> these nodes had run for a while with test traffic and we decided to move >> them to the main root and pools. the main cluster is running 0.94.5 and the >> new nodes got 0.94.6 due to them getting configured after that was >> released. i removed the test pool and did a ceph osd crush move to move the >> first node into the main cluster, the hard drives into the root for that >> tier of storage and the ssds into the root and pool for the cache tier. 
>> each set was done about 45 minutes apart and they ran for a couple hours >> while performing backfill without any issue other than high load on the >> cluster. >> >> we normally run the ssd tier in the forward cache-mode due to the ssds we >> have not being able to keep up with the io of writeback. this results in io >> on the hard drives slowing going up and performance of the cluster starting >> to suffer. about once a week, i change the cache-mode between writeback and >> forward for short periods of time to promote actively used data to the >> cache tier. this moves io load from the hard drive tier to the ssd tier and >> has been done multiple times without issue. i normally don't do this while >> there are backfills or recoveries happening on the cluster but decided to >> go ahead while backfill was happening due to the high load. >> >> i tried this procedure to change the ssd cache-tier between writeback and >> forward cache-mode and things seemed okay from the ceph cluster. about 10 >> minutes after the first attempt a changing the mode, vms using the ceph >> cluster for their storage started seeing corruption in multiple forms. the >> mode was flipped back and forth multiple times in that time frame and its >> unknown if the corruption was noticed with the first change or subsequent >> changes. the vms were having issues of filesystems having errors and >> getting remounted RO and mysql databases seeing corruption (both myisam and >> innodb). some of this was recoverable but on some filesystems there was >> corruption that
Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?
On 2016-02-17 11:07, Christian Balzer wrote: On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote: > > Let's consider both cases: > > Journals on SSDs - for writes, the write operation returns right > > after data lands on the Journal's SSDs, but before it's written to > > the backing HDD. So, for writes, SSD journal approach should be > > comparable to having a SSD cache tier. > Not quite, see below. > > Could you elaborate a bit more? Are you saying that with a Journal on a SSD writes from clients, before they can return from the operation to the client, must end up on both the SSD (Journal) *and* HDD (actual data store behind that journal)? No, your initial statement is correct. However that burst of speed doesn't last indefinitely. Aside from the size of the journal (which is incidentally NOT the most limiting factor) there are various "filestore" parameters in Ceph, in particular the sync interval ones. There was a more in-depth explanation by a developer about this in this ML, try your google-foo. For short bursts of activity, the journal helps a LOT. If you send a huge number of for example 4KB writes to your cluster, the speed will eventually (after a few seconds) go down to what your backing storage (HDDs) are capable of sustaining. > (Which SSDs do you plan to use anyway?) > Intel DC S3700 Good choice, with the 200GB model prefer the 3700 over the 3710 (higher sequential write speed). Hi All, I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, each of which has 6 4TB SATA drives within. I had my eye on these: 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0 but reading through this thread, it might be better to go with the P3700 given the improved iops. So a couple of questions. * Are the PCI-E versions of these drives different in any other way than the interface? * Would one of these as a journal for 6 4TB OSDs be overkill (connectivity is 10GE, or will be shortly anyway), would the SATA S3700 be sufficient? Given they're not hot-swappable, it'd be good if they didn't wear out in 6 months too. I realise I've not given you much to go on and I'm Googling around as well, I'm really just asking in case someone has tried this already and has some feedback or advice.. Thanks! :) Stephen -- Stephen Harker Chief Technology Officer The Positive Internet Company. -- All postal correspondence to: The Positive Internet Company, 24 Ganton Street, London. W1F 7QY *Follow us on Twitter* @posipeople The Positive Internet Company Limited is registered in England and Wales. Registered company number: 3673639. VAT no: 726 7072 28. Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
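To make the "filestore sync interval" point above concrete, these are the kind of ceph.conf knobs involved; the values below are purely illustrative, not recommendations:

    [osd]
    # journal partition size, e.g. 10 GB per OSD on the SSD
    osd journal size = 10240
    # how often the filestore syncs the backing filesystem with the journal
    filestore min sync interval = 0.01
    filestore max sync interval = 5
    # limits on how far the journal may run ahead of the backing store
    filestore queue max ops = 500
    filestore queue max bytes = 104857600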
[ceph-users] Radosgw (civetweb) hangs once around 850 established connections
I have a cluster of around 630 OSDs with 3 dedicated monitors and 2 dedicated gateways. The entire cluster is running hammer (0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)). (Both of my gateways have stopped responding to curl right now. root@host:~# timeout 5 curl localhost ; echo $? 124 From here I checked and it looks like radosgw has over 1 million open files: root@host:~# grep -i rados whatisopen.files.list | wc -l 1151753 And around 750 open connections: root@host:~# netstat -planet | grep radosgw | wc -l 752 root@host:~# ss -tnlap | grep rados | wc -l 752 I don't think that the backend storage is hanging based on the following dump: root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok objecter_requests | grep -i mtime "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", "mtime": "0.00", [...] "mtime": "0.00", The radosgw log is still showing lots of activity and so does strace which makes me think this is a config issue or limit of some kind that is not triggering a log. Of what I am not sure as the log doesn't seem to show any open file limit being hit and I don't see any big errors showing up in the logs. (last 500 lines of /var/log/radosgw/client.radosgw.log) http://pastebin.com/jmM1GFSA Perf dump of radosgw http://pastebin.com/rjfqkxzE Radosgw objecter requests: http://pastebin.com/skDJiyHb After restarting the gateway with '/etc/init.d/radosgw restart' the old process remains, no error is sent, and then I get connection refused via curl or netcat:: root@kh11-9:~# curl localhost curl: (7) Failed to connect to localhost port 80: Connection refused Once I kill the old radosgw via sigkill the new radosgw instance restarts automatically and starts responding:: root@kh11-9:~# curl localhost xmlns="http://s3.amazonaws.com/doc/2006-03-01/";>anonymous What is going on here? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
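One thing worth ruling out with that many open files is a descriptor limit on the radosgw process itself; a quick hedged check (assumes a single radosgw pid on the host):

    # open fds actually held by radosgw vs. the limit it was started with
    ls /proc/$(pidof radosgw)/fd | wc -l
    grep "open files" /proc/$(pidof radosgw)/limits
    # the limit can be raised via ceph.conf before restarting the gateway, e.g.
    # [global]
    # max open files = 262144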
Re: [ceph-users] v0.94.6 Hammer released
Hi Stable Release Team for v0.94, On Thu, Mar 10, 2016 at 11:00:06AM +1100, Chris Dunlop wrote: > On Wed, Mar 02, 2016 at 06:32:18PM +0700, Loic Dachary wrote: >> I think you misread what Sage wrote : "The intention was to >> continue building stable releases (0.94.x) on the old list of >> supported platforms (which inclues 12.04 and el6)". In other >> words, the old OS'es are still supported. Their absence is a >> glitch in the release process that will be fixed. > > Any news on a release of v0.94.6 for debian wheezy? Any news on a release of v0.94.6 for debian wheezy? Cheers, Chris ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RGW quota
On Wednesday, 16 March 2016, Derek Yarnell wrote: > Hi, > > We have a user with a 50GB quota and has now a single bucket with 20GB > of files. They had previous buckets created and removed but the quota > has not decreased. I understand that we do garbage collection but it > has been significantly longer than the defaults that we have not > overridden. They get 403 QuotaExceeded when trying to write additional > data to a new bucket or the existing bucket. > > # radosgw-admin user info --uid=username > ... > "user_quota": { > "enabled": true, > "max_size_kb": 52428800, > "max_objects": -1 > }, > > # radosgw-admin bucket stats --bucket=start > ... > "usage": { > "rgw.main": { > "size_kb": 21516505, > "size_kb_actual": 21516992, > "num_objects": 243 > } > }, > > # radosgw-admin user stats --uid=username > ... > { > "stats": { > "total_entries": 737, > "total_bytes": 55060794604, > "total_bytes_rounded": 55062102016 > }, > "last_stats_sync": "2016-03-16 14:16:25.205060Z", > "last_stats_update": "2016-03-16 14:16:25.190605Z" > } > > Thanks, > derek > > -- > Derek T. Yarnell > University of Maryland > Institute for Advanced Computer Studies > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > Hi, It's possible that somebody changed the owner of some bucket. But all objects in that bucket still belongs to this user. That way you can get quota exceeded. We had the same situation. -- Marius Vaitiekūnas ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
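A hedged way to check for exactly that situation, and to hand the bucket back if the owner was changed (the bucket and uid names are placeholders taken from the example above):

    # which buckets does rgw currently think the user owns?
    radosgw-admin bucket list --uid=username
    # who is recorded as the owner of a suspect bucket?
    radosgw-admin bucket stats --bucket=start | grep -i owner
    # if the owner is wrong, link the bucket back to the intended user
    radosgw-admin bucket link --bucket=start --uid=username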
Re: [ceph-users] RBD/Ceph as Physical boot volume
On 03/17/2016 03:51 AM, Schlacta, Christ wrote: I posted about this a while ago, and someone else has since inquired, but I am seriously wanting to know if anybody has figured out how to boot from a RBD device yet using ipxe or similar. Last I read. loading the kernel and initrd from object storage would be theoretically easy, and would only require making an initramfs to initialize and mount the rbd.. But I couldn't find any documented instances of anybody having done this yet.. So.. Has anybody done this yet? If so, which distros is it working on, and where can I find more info? Not sure if anyone is doing this, though there was a patch for creating an initramfs that would mount rbd: https://lists.debian.org/debian-kernel/2015/06/msg00161.html Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
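Not aware of a packaged solution either; conceptually, the early-boot script only has to do something like the following before switching root (a rough, untested sketch: the pool/image name, keyring path and mount point are made up, and the rbd tool plus a reachable monitor must be available inside the initramfs):

    # inside the initramfs, after the network is up
    modprobe rbd
    rbd map rbd/rootfs --id admin --keyring /etc/ceph/ceph.client.admin.keyring
    mount /dev/rbd0 /sysroot
    exec switch_root /sysroot /sbin/init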
Re: [ceph-users] ZFS or BTRFS for performance?
On Mar 18, 2016 4:31 PM, "Lionel Bouton" > > Will bluestore provide the same protection against bitrot than BTRFS? > Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s) > with invalid data get IO errors when trying to read corrupted data and > as such can't be used as the source for repairs even if they are primary > OSD(s). So with BTRFS you get a pretty good overall protection against > bitrot in Ceph (it allowed us to automate the repair process in the most > common cases). With XFS IIRC unless you override the default behavior > the primary OSD is always the source for repairs (even if all the > secondaries agree on another version of the data). I have a functionally identical question about bluestore, but with zfs instead of btrfs. Do you have more info on this bluestore? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ZFS or BTRFS for performance?
If you google "ceph bluestore" you'll be able to find a couple slide decks on the topic. One of them by Sage is easy to follow without the benefit of the presentation. There's also the " Redhat Ceph Storage Roadmap 2016" deck. In any case, bluestore is not intended to address bitrot. Given that ceph is a distributed file system, many of the posix file system features are not required for the underlying block storage device. Bluestore is intended to address this and reduce the disk IO required to store user data. Ceph protects against bitrot at a much higher level by validating the checksum of the entire placement group during a deep scrub. -H > On Mar 19, 2016, at 10:06, Schlacta, Christ wrote: > > > On Mar 18, 2016 4:31 PM, "Lionel Bouton" > > > > Will bluestore provide the same protection against bitrot than BTRFS? > > Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s) > > with invalid data get IO errors when trying to read corrupted data and > > as such can't be used as the source for repairs even if they are primary > > OSD(s). So with BTRFS you get a pretty good overall protection against > > bitrot in Ceph (it allowed us to automate the repair process in the most > > common cases). With XFS IIRC unless you override the default behavior > > the primary OSD is always the source for repairs (even if all the > > secondaries agree on another version of the data). > > I have a functionally identical question about bluestore, but with zfs > instead of btrfs. Do you have more info on this bluestore? > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cannot remove rbd locks
Try the following: # rbd lock remove vm-114-disk-1 "auto 140454012457856" client.71260575 -- Jason Dillaman - Original Message - > From: "Christoph Adomeit" > To: ceph-us...@ceph.com > Sent: Friday, March 18, 2016 11:14:00 AM > Subject: [ceph-users] Cannot remove rbd locks > > Hi, > > some of my rbds show they have an exclusive lock. > > I think the lock can be stale or weeks old. > > We have also once added feature exclusive lock and later removed that feature > > I can see the lock: > > root@machine:~# rbd lock list vm-114-disk-1 > There is 1 exclusive lock on this image. > Locker ID Address > client.71260575 auto 140454012457856 10.67.1.14:0/1131494432 > > iBut I cannot remove the lock: > > root@machine:~# rbd lock remove vm-114-disk-1 auto client.71260575 > rbd: releasing lock failed: (2) No such file or directory > > How can I remove the locks ? > > Thanks > Christoph > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
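If the lock comes straight back because the original client is gone but still registered, it can also help to blacklist that client address first; a hedged sketch using the address shown in the lock listing:

    # keep the stale client from re-acquiring anything (the entry expires on its own, 1h by default)
    ceph osd blacklist add 10.67.1.14:0/1131494432
    # remove the lock, quoting the two-word lock id
    rbd lock remove vm-114-disk-1 "auto 140454012457856" client.71260575
    # optionally clear the blacklist entry afterwards
    ceph osd blacklist rm 10.67.1.14:0/1131494432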
[ceph-users] reallocate when OSD down
Hello, I have a problem with the following crushmap : # begin crush map tunable choose_local_tries 0 tunable choose_local_fallback_tries 0 tunable choose_total_tries 50 tunable chooseleaf_descend_once 1 tunable straw_calc_version 1 # devices device 0 osd.0 device 1 osd.1 device 2 osd.2 device 3 osd.3 device 4 osd.4 device 5 osd.5 device 6 osd.6 device 7 osd.7 device 8 osd.8 device 9 osd.9 device 10 osd.10 device 11 osd.11 # types type 0 device type 1 host type 2 chassis type 3 rack type 4 room type 5 datacenter type 6 root # buckets host testctrcephosd1 { id -1 # do not change unnecessarily # weight 3.000 alg straw hash 0 # rjenkins1 item osd.0 weight 1.000 item osd.1 weight 1.000 item osd.2 weight 1.000 } host testctrcephosd2 { id -2 # do not change unnecessarily # weight 3.000 alg straw hash 0 # rjenkins1 item osd.3 weight 1.000 item osd.4 weight 1.000 item osd.5 weight 1.000 } host testctrcephosd3 { id -3 # do not change unnecessarily # weight 3.000 alg straw hash 0 # rjenkins1 item osd.6 weight 1.000 item osd.7 weight 1.000 item osd.8 weight 1.000 } host testctrcephosd4 { id -4 # do not change unnecessarily # weight 3.000 alg straw hash 0 # rjenkins1 item osd.9 weight 1.000 item osd.10 weight 1.000 item osd.11 weight 1.000 } chassis chassis1 { id -5 # do not change unnecessarily # weight 6.000 alg straw hash 0 # rjenkins1 item testctrcephosd1 weight 3.000 item testctrcephosd2 weight 3.000 } chassis chassis2 { id -6 # do not change unnecessarily # weight 6.000 alg straw hash 0 # rjenkins1 item testctrcephosd3 weight 3.000 item testctrcephosd4 weight 3.000 } room salle1 { id -7 # weight 6.000 alg straw hash 0 item chassis1 weight 6.000 } room salle2 { id -8 # weight 6.000 alg straw hash 0 item chassis2 weight 6.000 } root dc1 { id -9 # weight 6.000 alg straw hash 0 item salle1 weight 6.000 item salle2 weight 6.000 } # rules rule replicated_ruleset { ruleset 0 type replicated min_size 1 max_size 10 step take dc1 step chooseleaf firstn 0 type host step emit } rule dc { ruleset 1 type replicated min_size 2 max_size 10 step take dc1 step choose firstn 0 type room step chooseleaf firstn 0 type chassis step emit } ID WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY -9 12.0 root dc1 -7 6.0 room salle1 -5 6.0 chassis chassis1 -1 3.0 host testctrcephosd1 0 1.0 osd.0 up 1.0 1.0 1 1.0 osd.1 up 1.0 1.0 2 1.0 osd.2 up 1.0 1.0 -2 3.0 host testctrcephosd2 3 1.0 osd.3 up 1.0 1.0 4 1.0 osd.4 up 1.0 1.0 5 1.0 osd.5 up 1.0 1.0 -8 6.0 room salle2 -6 6.0 chassis chassis2 -3 3.0 host testctrcephosd3 6 1.0 osd.6 up 1.0 1.0 7 1.0 osd.7 up 1.0 1.0 8 1.0 osd.8 up 1.0 1.0 -4 3.0 host testctrcephosd4 9 1.0 osd.9 up 1.0 1.0 10 1.0 osd.10up 1.0 1.0 11 1.0 osd.11up 1.0 1.0 Allocating when creating is ok, my datas are replicated in 2 rooms. ceph osd map rbdnew testvol1 osdmap e127 pool 'rbdnew' (1) object 'testvol1' -> pg 1.c657d5a4 (1.a4) -> up ([9,5], p9) acting ([9,5], p9) but when one of these host is down, I want to create another replica on the other host in the same room. For example, when host "testctrcephosd2" is down, I want CRUSH to create another copy in "testctrcephosd1" (keeping another copy on one of the host in room "salle 2". In place of this, cluster stays with only one osd used (instead of 2) : ceph osd map rbdnew testvol1 osdmap e130 pool 'rbdnew' (1) object 'testvol1' -> pg 1.c657d5a4 (1.a4) -> up ([9], p9) acting ([9], p9) Do you have any idea to do this ? Regards Christophe ___ ceph-users mailing list ceph-users@lists.ceph.com http://l
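Whatever rule ends up being used, it can be tested offline with crushtool before injecting it, including simulating the failed host by weighting its OSDs to zero; note also that CRUSH only picks a replacement once the failed OSDs are actually marked out, not merely down. A sketch against rule 1 ("dc") with 2 replicas:

    # compile the edited text map and show the mappings rule 1 would produce
    crushtool -c crushmap.txt -o crushmap.bin
    crushtool --test -i crushmap.bin --rule 1 --num-rep 2 --show-mappings
    # re-run with osd.3/4/5 weighted out to mimic losing testctrcephosd2
    crushtool --test -i crushmap.bin --rule 1 --num-rep 2 --show-mappings \
        --weight 3 0 --weight 4 0 --weight 5 0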
Re: [ceph-users] data corruption with hammer
robert and i have done some further investigation the past couple days on this. we have a test environment with a hard drive tier and an ssd tier as a cache. several vms were created with volumes from the ceph cluster. i did a test in each guest where i un-tarred the linux kernel source multiple times and then did a md5sum check against all of the files in the resulting source tree. i started off with the monitors and osds running 0.94.5 and never saw any problems. a single node was then upgraded to 0.94.6 which has osds in both the ssd and hard drive tier. i then proceeded to run the same test and, while the untar and md5sum operations were running, i changed the ssd tier cache-mode from forward to writeback. almost immediately the vms started reporting io errors and odd data corruption. the remainder of the cluster was updated to 0.94.6, including the monitors, and the same thing happened. things were cleaned up and reset and then a test was run where min_read_recency_for_promote for the ssd cache pool was set to 1. we previously had it set to 6. there was never an error with the recency setting set to 1. i then tested with it set to 2 and it immediately caused failures. we are currently thinking that it is related to the backport of the fix for the recency promotion and are in progress of making a .6 build without that backport to see if we can cause corruption. is anyone using a version from after the original recency fix (PR 6702) with a cache tier in writeback mode? anyone have a similar problem? mike On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell wrote: > something weird happened on one of the ceph clusters that i administer > tonight which resulted in virtual machines using rbd volumes seeing > corruption in multiple forms. > > when everything was fine earlier in the day, the cluster was a number of > storage nodes spread across 3 different roots in the crush map. the first > bunch of storage nodes have both hard drives and ssds in them with the hard > drives in one root and the ssds in another. there is a pool for each and > the pool for the ssds is a cache tier for the hard drives. the last set of > storage nodes were in a separate root with their own pool that is being > used for burn in testing. > > these nodes had run for a while with test traffic and we decided to move > them to the main root and pools. the main cluster is running 0.94.5 and the > new nodes got 0.94.6 due to them getting configured after that was > released. i removed the test pool and did a ceph osd crush move to move the > first node into the main cluster, the hard drives into the root for that > tier of storage and the ssds into the root and pool for the cache tier. > each set was done about 45 minutes apart and they ran for a couple hours > while performing backfill without any issue other than high load on the > cluster. > > we normally run the ssd tier in the forward cache-mode due to the ssds we > have not being able to keep up with the io of writeback. this results in io > on the hard drives slowing going up and performance of the cluster starting > to suffer. about once a week, i change the cache-mode between writeback and > forward for short periods of time to promote actively used data to the > cache tier. this moves io load from the hard drive tier to the ssd tier and > has been done multiple times without issue. i normally don't do this while > there are backfills or recoveries happening on the cluster but decided to > go ahead while backfill was happening due to the high load. 
> > i tried this procedure to change the ssd cache-tier between writeback and > forward cache-mode and things seemed okay from the ceph cluster. about 10 > minutes after the first attempt a changing the mode, vms using the ceph > cluster for their storage started seeing corruption in multiple forms. the > mode was flipped back and forth multiple times in that time frame and its > unknown if the corruption was noticed with the first change or subsequent > changes. the vms were having issues of filesystems having errors and > getting remounted RO and mysql databases seeing corruption (both myisam and > innodb). some of this was recoverable but on some filesystems there was > corruption that lead to things like lots of data ending up in the > lost+found and some of the databases were un-recoverable (backups are > helping there). > > i'm not sure what would have happened to cause this corruption. the > libvirt logs for the qemu processes for the vms did not provide any output > of problems from the ceph client code. it doesn't look like any of the qemu > processes had crashed. also, it has now been several hours since this > happened with no additional corruption noticed by the vms. it doesn't > appear that we had any corruption happen before i attempted the flipping of > the ssd tier cache-mode. > > the only think i can think of that is different between this time doing > this procedure vs previous attempts was that there was the one stor
[ceph-users] Cannot remove rbd locks
Hi, some of my rbds show they have an exclusive lock. I think the lock can be stale or weeks old. We have also once added the exclusive-lock feature and later removed that feature. I can see the lock: root@machine:~# rbd lock list vm-114-disk-1 There is 1 exclusive lock on this image. Locker ID Address client.71260575 auto 140454012457856 10.67.1.14:0/1131494432 But I cannot remove the lock: root@machine:~# rbd lock remove vm-114-disk-1 auto client.71260575 rbd: releasing lock failed: (2) No such file or directory How can I remove the locks? Thanks Christoph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Radosgw (civetweb) hangs once around 850 established connections
What OS are you using? I have a lot more open connections than that. (though i have some other issues, where rgw sometimes returns 500 errors, it doesn't stop like yours) You might try tuning civetweb's num_threads and 'rgw num rados handles': rgw frontends = civetweb num_threads=125 error_log_file=/var/log/radosgw/civetweb.error.log access_log_file=/var/log/radosgw/civetweb.access.log rgw num rados handles = 32 You can also up civetweb loglevel: debug civetweb = 20 -Ben On Wed, Mar 16, 2016 at 5:03 PM, seapasu...@uchicago.edu < seapasu...@uchicago.edu> wrote: > I have a cluster of around 630 OSDs with 3 dedicated monitors and 2 > dedicated gateways. The entire cluster is running hammer (0.94.5 > (9764da52395923e0b32908d83a9f7304401fee43)). > > (Both of my gateways have stopped responding to curl right now. > root@host:~# timeout 5 curl localhost ; echo $? > 124 > > From here I checked and it looks like radosgw has over 1 million open > files: > root@host:~# grep -i rados whatisopen.files.list | wc -l > 1151753 > > And around 750 open connections: > root@host:~# netstat -planet | grep radosgw | wc -l > 752 > root@host:~# ss -tnlap | grep rados | wc -l > 752 > > I don't think that the backend storage is hanging based on the following > dump: > > root@host:~# ceph daemon /var/run/ceph/ceph-client.rgw.kh11-9.asok > objecter_requests | grep -i mtime > "mtime": "0.00", > "mtime": "0.00", > "mtime": "0.00", > "mtime": "0.00", > "mtime": "0.00", > "mtime": "0.00", > [...] > "mtime": "0.00", > > The radosgw log is still showing lots of activity and so does strace which > makes me think this is a config issue or limit of some kind that is not > triggering a log. Of what I am not sure as the log doesn't seem to show any > open file limit being hit and I don't see any big errors showing up in the > logs. > (last 500 lines of /var/log/radosgw/client.radosgw.log) > http://pastebin.com/jmM1GFSA > > Perf dump of radosgw > http://pastebin.com/rjfqkxzE > > Radosgw objecter requests: > http://pastebin.com/skDJiyHb > > After restarting the gateway with '/etc/init.d/radosgw restart' the old > process remains, no error is sent, and then I get connection refused via > curl or netcat:: > root@kh11-9:~# curl localhost > curl: (7) Failed to connect to localhost port 80: Connection refused > > Once I kill the old radosgw via sigkill the new radosgw instance restarts > automatically and starts responding:: > root@kh11-9:~# curl localhost > http://s3.amazonaws.com/doc/2006-03-01/ > ">anonymous > What is going on here? > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Hi Sam, I've written a script but I'm a little leery of unleashing it until I find a few more cases to test. The script successfully removed the file mentioned above. I took the next pg which was marked inconsistent and ran the following command over those pg directory structures: find . -name "*_long" -exec xattr -p user.cephos.lfn3 {} + | grep -v I didn't find any files that were "orphaned" by this command. All of these files should have "_long" and the grep should pull out the invalid generation, correct? I'm looking wider, but in the next pg marked inconsistent I didn't find any orphans. Thanks, Jeff -- Jeffrey McDonald, PhD Assistant Director for HPC Operations Minnesota Supercomputing Institute University of Minnesota Twin Cities 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu 117 Pleasant St SE phone: +1 612 625-6905 Minneapolis, MN 55455 fax: +1 612 624-8861 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
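For spot-checking a single suspect file, the same attribute can also be read with getfattr and compared by eye against the DIR_ nesting it sits in (a sketch; the path is whichever *_long file is being examined):

    # print the full logical name stored in the lfn3 attribute
    getfattr --only-values -n user.cephos.lfn3 ./DIR_9/DIR_5/<some-file>_long
    # the 8-hex-digit hash near the end of that value, read in reverse,
    # should match the chain of DIR_X directories the file lives under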
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
OK, I think I have it now. I do have one more question, in this case, the hash indicates the directory structure but how do I know from the hash how many levels I should go down.If the hash is a 32-bit hex integer, *how do I know how many should be included as part of the hash for the directory structure*? e.g. our example: the hash is 79CED459 and the directory is then the last five taken in reverse order, what happens if there are only 4 levels of hierarchy?I only have this one example so far.is the 79C of the hash constant? Would the hash pick up another hex character if the pg splits again? Thanks, Jeff On Wed, Mar 16, 2016 at 10:24 AM, Samuel Just wrote: > There is a directory structure hash, it's just that it's at the end of > the name and you'll have to check the xattr I mentioned to find it. > > I think that file is actually the one we are talking about removing. > > > ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: > user.cephos.lfn3: > > default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0 > > Notice that the user.cephosd.lfn3 attr has the full name, and it > *does* have a hash 79CED459 (you referred to it as a directory hash I > think, but it's actually the hash we used to place it on this osd to > begin with). > > In specifically this case, you shouldn't find any files in the > DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories > (so all hash values should hash to one of those). > > The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's > the actual object file, don't remove that. If you look at the attr: > > > ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: > user.cephos.lfn3: > > default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 > > The hash is 79CED459, which means that (assuming > DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the > right place. 
> > The ENOENT return > > 2016-03-07 16:11:41.828332 7ff30cdad700 10 > filestore(/var/lib/ceph/osd/ceph-307) remove > > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > = -2 > 2016-03-07 21:44:02.197676 7fe96b56f700 10 > filestore(/var/lib/ceph/osd/ceph-307) remove > > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > = -2 > > actually was a symptom in this case, but, in general, it's not > indicative of anything -- the filestore can get ENOENT return values > for legitimate reasons. > > To reiterate: files that end in something like > fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need > to check the user.cephos.lfn3 attr (as you did before) for the full > length file name and determine whether the file is in the right place. > -Sam > > On Wed, Mar 16, 2016 at 7:49 AM, Jeffrey McDonald > wrote: > > Hi Sam, > > > > In the 70.459 logs from the deep-scrub, there is an error: > > > > $ zgrep "= \-2$" ceph-osd.307.log.1.gz > > 2016-03-07 16:11:41.828332 7ff30cdad700 10 > > filestore(/var/lib/ceph/osd/ceph-307) remove > > > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > > = -2 > > 2016-03-07 21:44:02.197676 7fe96b56f700 10 > > filestore(/var/lib/ceph/osd/ceph-307) remove > > > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > > = -2 > > > > I'm taking this as an indication of the error you mentioned.It looks > to > > me as if t
[ceph-users] ssd only storage and ceph
Hi, at the moment I am doing some tests with SSDs and Ceph. My question is: how should an SSD OSD be mounted, with or without the discard option? And if I mount the OSD without discard, where should I run fstrim: on the Ceph storage node, or inside the VM running on RBD? What is the best practice there? Thanks for your answers. Regards, Erik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
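For what it's worth, a common pattern (hedged, not an official recommendation) is to mount the OSD filesystems without the discard option and run fstrim on the storage node periodically, for example:

    # trim a single OSD filesystem by hand; -v reports how much was discarded
    fstrim -v /var/lib/ceph/osd/ceph-0
    # or as an illustrative /etc/cron.d entry, trim all OSD mounts weekly
    0 3 * * 0  root  for d in /var/lib/ceph/osd/ceph-*; do fstrim "$d"; done

Trimming from inside a VM only reaches the RBD layer if discard is passed through the whole stack (e.g. qemu with discard=unmap on a scsi disk), which is a separate question from trimming the OSD's own SSD.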
[ceph-users] Infernalis .rgw.buckets.index objects becoming corrupted in on RHEL 7.2 during recovery
List, We have stood up a Infernalis 9.2.0 cluster on RHEL 7.2. We are using the radosGW to store potentially billions of small to medium sized objects (64k - 1MB). We have run into an issue twice thus far where .rgw.bucket.index placement groups will become corrupt during recovery after a drive failure. This corruption will cause the OSD to crash with a suicide_timeout error when trying to backfill the corrupted index file to a different OSD. Exporting the corrupted placement group using the ceph-objectstore-tool will also hang. When this first came up, we were able to simply rebuild the .rgw pools and start from scratch. There were no underlying XFS issues. Before we put this cluster into full operation, we are looking to determine what caused this and if there is a hard limit to the number of objects in a bucket. We are currently putting all objects into 1 bucket, but should probably divide these up. I have uploaded the OSD and ceph-objectstore tool debug files here: https://github.com/garignack/ceph_misc/raw/master/ceph-osd.388.zip Any help would be greatly appreciated. I'm not a ceph expert by any means, but here is where I've gotten to thus far. (And may be way off base) The PG in question only has 1 object - .dir.default.808642.1.163 | [root@node13 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-388/ --journal-path /var/lib/ceph/osd/ceph-388/journal --pgid 24.197 --op list | SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 18 00 00 00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00 | SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 18 00 00 00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00 | ["24.197",{"oid":".dir.default.808642.1.163","key":"","snapid":-2,"hash":491874711,"max":0,"pool":24,"namespace":"","max":0}] Here are the final lines of the ceph-objectstore-tool before it hangs: | e140768: 570 osds: 558 up, 542 in | Read 24/1d516997/.dir.default.808642.1.163/head | size=0 | object_info: 24/1d516997/.dir.default.808642.1.163/head(139155'2197754 client.1137891.0:20837319 dirty|omap|data_digest s 0 uv 2197754 dd ) | attrs size 2 This leads me to suspect something between line 564 and line 576 in the tool is hanging. https://github.com/ceph/ceph/blob/master/src/tools/ceph_objectstore_tool.cc#L564. Current suspect is the objectstore read command. 
| ret = store->read(cid, obj, offset, len, rawdatabl); Looking through the OSD debug logs, I also see a strange size(18446744073709551615) on the recovery operation for the 24/1d516997/.dir.default.808642.1.163/head object | 2016-03-17 12:12:29.753446 7f972ca3d700 10 osd.388 154849 dequeue_op 0x7f97580d3500 prio 2 cost 1049576 latency 0.000185 MOSDPGPull(24.197 154849 [PullOp(24/1d516997/.dir.default.808642.1.163/head, recovery_info: ObjectRecoveryInfo(24/1d516997/.dir.default.808642.1.163/head@139155'2197754, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 pg pg[24.197( v 139155'2197754 (139111'2194700,139155'2197754] local-les=154480 n=1 ec=128853 les/c/f 154268/138679/0 154649/154650/154650) [179,443,517]/[306,441] r=-1 lpr=154846 pi=138674-154649/37 crt=139155'2197752 lcod 0'0 inactive NOTIFY NIBBLEWISE] this error eventually causes the thread to hang and eventually trigger the suicide timeout | 2016-03-17 12:12:45.541528 7f973524e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15 | 2016-03-17 12:12:45.541533 7f973524e700 20 heartbeat_map is_healthy = NOT HEALTHY, total workers: 29, number of unhealthy: 1 | 2016-03-17 12:12:45.541534 7f973524e700 10 osd.388 154849 internal heartbeat not healthy, dropping ping request | 2016-03-17 12:15:02.148193 7f973524e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15 | 2016-03-17 12:15:02.148195 7f973524e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had suicide timed out after 150 | ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) | 1: (()+0x7e6ab2) [0x7f9753701ab2] | 2: (()+0xf100) [0x7f9751893100] | 3: (gsignal()+0x37) [0x7f97500705f7] | 4: (abort()+0x148) [0x7f9750071ce8] | 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f97509749d5] | 6: (()+0x5e946) [0x7f9750972946] | 7: (()+0x5e973) [0x7f9750972973] | 8: (()+0x5eb93) [0x7f9750972b93] | 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7f97537f6dda] | 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2d9) [0x7f97537363b9] | 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x7f9753736bf6] | 12: (OSD::handle_osd_ping(MOSDPing*)+0x933) [0x7f9753241593] | 13: (OSD::heartbeat_dispatch(M
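On the bucket-size question: as far as I understand, in hammer/infernalis each bucket index lives in a single omap object unless index sharding is enabled at bucket-creation time, so a bucket with a very large object count makes that one index object huge and slow to backfill, which matches the symptom above. A hedged example of enabling sharding for newly created buckets (the section name is an example; existing buckets keep their current index layout):

    [client.radosgw.gateway]
    # shard the index of newly created buckets across 64 rados objects
    rgw override bucket index max shards = 64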
Re: [ceph-users] RGW quota
On 3/17/16 1:41 PM, Marius Vaitiekunas wrote: > It's possible that somebody changed the owner of some bucket. But all > objects in that bucket still belongs to this user. That way you can get > quota exceeded. We had the same situation. Well the user says he didn't write to any other buckets than his own. The usage shows that he did have two other buckets boston_bombing, charlie_hebdo and the buckets no longer exist (and we have apache logs for the DELETE for them) but from the usage they were never deleted. I am concerned that since the delete never shows up that this is where the quota is being lost to. ceph-access.log:192.168.79.51 - - [16/Mar/2016:02:50:01 -0400] "DELETE /boston_bombing/ HTTP/1.1" 204 - "-" "Boto/2.34.0 Python/2.6.6 Linux/2.6.32-573.18.1.el6.x86_64" ceph-access.log:192.168.79.51 - - [16/Mar/2016:02:51:47 -0400] "DELETE /charlie_hebdo/ HTTP/1.1" 204 - "-" "Boto/2.34.0 Python/2.6.6 Linux/2.6.32-573.18.1.el6.x86_64" # radosgw-admin usage show --uid=username --start-date=2015-01-01 --end-date=2016-03-16 ... { "bucket": "boston", "time": "2016-03-07 23:00:00.00Z", "epoch": 1457391600, "categories": [ { "category": "create_bucket", "bytes_sent": 19, "bytes_received": 0, "ops": 1, "successful_ops": 1 }, { "category": "get_acls", "bytes_sent": 174400, "bytes_received": 0, "ops": 352, "successful_ops": 352 }, { "category": "get_obj", "bytes_sent": 86170638, "bytes_received": 0, "ops": 14, "successful_ops": 10 }, { "category": "list_bucket", "bytes_sent": 381327, "bytes_received": 0, "ops": 10, "successful_ops": 10 }, { "category": "put_acls", "bytes_sent": 3230, "bytes_received": 73031, "ops": 170, "successful_ops": 170 }, { "category": "put_obj", "bytes_sent": 0, "bytes_received": 14041021516, "ops": 169, "successful_ops": 169 }, { "category": "stat_bucket", "bytes_sent": 6688, "bytes_received": 0, "ops": 353, "successful_ops": 352 } ] } , { "bucket": "charlie_hebdo", "time": "2016-03-07 23:00:00.00Z", "epoch": 1457391600, "categories": [ { "category": "create_bucket", "bytes_sent": 19, "bytes_received": 0, "ops": 1, "successful_ops": 1 }, { "category": "get_acls", "bytes_sent": 79062, "bytes_received": 0, "ops": 159, "successful_ops": 159 }, { "category": "get_obj", "bytes_sent": 1096, "bytes_received": 0, "ops": 9, "successful_ops": 4 }, { "category": "list_bucket", "bytes_sent": 84129, "bytes_received": 0, "ops": 6, "successful_ops": 6 }, { "category": "put_acls", "bytes_sent": 1406, "bytes_received": 31655, "ops": 74, "successful_ops": 74 },
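Two hedged things worth checking here: whether the user's cached stats are simply stale, and whether the deleted objects are still queued in the garbage collector rather than actually reclaimed:

    # recompute the user's usage rather than trusting the cached counters
    radosgw-admin user stats --uid=username --sync-stats
    # see what the gc still has queued, and optionally run it by hand
    radosgw-admin gc list --include-all | head -50
    radosgw-admin gc process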
[ceph-users] Infernalis: chown ceph:ceph at runtime ?
Hi, we have upgraded our ceph cluster from hammer to infernalis. Ceph is still running as root and we are using the "setuser match path = /var/lib/ceph/$type/$cluster-$id" directive in ceph.conf. Now we would like to change the ownership of the data files and devices to ceph at runtime. What is the best way to do this? I am thinking about removing the "setuser match path" directive from ceph.conf, then stopping one OSD after the other, changing all files to ceph:ceph and restarting the daemon. Is this the best and recommended way? I also once read about a fast parallel chown invocation on this mailing list but I have not yet found the mail. Does someone remember how this was done? Thanks Christoph ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
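The parallel chown recipe that has been posted here before was roughly the following. A sketch only -- OSD id, device names and the parallelism are placeholders, and it assumes systemd (infernalis on EL7); do one OSD at a time, as you describe:
systemctl stop ceph-osd@3
find /var/lib/ceph/osd/ceph-3 -mindepth 1 -maxdepth 1 -print0 \
  | xargs -0 -P 8 -n 1 chown -R ceph:ceph    # chown the top-level entries in parallel
chown ceph:ceph /var/lib/ceph/osd/ceph-3
chown ceph:ceph /dev/sdb1 /dev/sdb2          # data and journal partitions (example names);
                                             # device-node ownership does not survive a reboot unless udev handles it
systemctl start ceph-osd@3
Only drop the "setuser match path" line for an OSD once everything under its data directory really is ceph:ceph.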
Re: [ceph-users] ZFS or BTRFS for performance?
FWIW, from purely a performance perspective Ceph usually looks pretty fantastic on a fresh BTRFS filesystem. In fact it will probably continue to look great until you do small random writes to large objects (like say to blocks in an RBD volume). Then COW starts fragmenting the objects into oblivion. I've seen sequential read performance drop by 300% after 5 minutes of 4K random writes to the same RBD blocks. Autodefrag might help. A long time ago I recall Josef told me it was dangerous to use (I think it could run the node out of memory and corrupt the FS), but it may be that it's safer now. In any event we don't really do a lot of testing with BTRFS these days as bluestore is indeed the next gen OSD backend. If you do decide to give either BTRFS or ZFS a go with filestore, let us know how it goes. ;) Mark On 03/18/2016 02:42 PM, Heath Albritton wrote: Neither of these file systems is recommended for production use underlying an OSD. The general direction for ceph is to move away from having a file system at all. That effort is called "bluestore" and is supposed to show up in the jewel release. -H On Mar 18, 2016, at 11:15, Schlacta, Christ wrote: Insofar as I've been able to tell, both BTRFS and ZFS provide similar capabilities back to CEPH, and both are sufficiently stable for the basic CEPH use case (Single disk -> single mount point), so the question becomes this: Which actually provides better performance? Which is the more highly optimized single write path for ceph? Does anybody have a handful of side-by-side benchmarks? I'm more interested in higher IOPS, since you can always scale-out throughput, but throughput is also important. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
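If someone does want to experiment with autodefrag despite those caveats, it is just a mount option plus an optional one-off defrag; a sketch, with the OSD mount point as a placeholder:
mount -o remount,autodefrag /var/lib/ceph/osd/ceph-12               # or add autodefrag to the fstab entry
btrfs filesystem defragment -r /var/lib/ceph/osd/ceph-12/current    # one-off, out-of-band defrag of the object store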
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Thanks Dan, I have raised the tracker for this issue http://tracker.ceph.com/issues/15176 On Thu, Mar 17, 2016 at 10:47 AM, Dan van der Ster wrote: > Hi, > > It's true, partprobe works intermittently. I extracted the key > commands to show the problem: > > [18:44]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph > journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d > --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- > /dev/sdc > The operation has completed successfully. > [18:44]# partprobe /dev/sdc > Error: Error informing the kernel about modifications to partition > /dev/sdc2 -- Device or resource busy. This means Linux won't know > about any changes you made to /dev/sdc2 until you reboot -- so you > shouldn't mount it or use it in any way before rebooting. > Error: Failed to add partition 2 (Device or resource busy) > [18:44]# partprobe /dev/sdc > [18:44]# partprobe /dev/sdc > Error: Error informing the kernel about modifications to partition > /dev/sdc2 -- Device or resource busy. This means Linux won't know > about any changes you made to /dev/sdc2 until you reboot -- so you > shouldn't mount it or use it in any way before rebooting. > Error: Failed to add partition 2 (Device or resource busy) > [18:44]# partprobe /dev/sdc > Error: Error informing the kernel about modifications to partition > /dev/sdc2 -- Device or resource busy. This means Linux won't know > about any changes you made to /dev/sdc2 until you reboot -- so you > shouldn't mount it or use it in any way before rebooting. > Error: Failed to add partition 2 (Device or resource busy) > > But partx works every time: > > [18:46]# /usr/sbin/sgdisk --new=2:0:20480M --change-name=2:'ceph > journal' --partition-guid=2:aa23e07d-e6b3-4261-a236-c0565971d88d > --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- > /dev/sdd > The operation has completed successfully. 
> [18:46]# partx -u /dev/sdd > [18:46]# partx -u /dev/sdd > [18:46]# partx -u /dev/sdd > [18:46]# > > -- Dan > > On Thu, Mar 17, 2016 at 6:31 PM, Vasu Kulkarni > wrote: > > I can raise a tracker for this issue since it looks like an intermittent > > issue and mostly dependent on specific hardware or it would be better if > you > > add all the hardware/os details in tracker.ceph.com, also from your > logs it > > looks like you have > > Resource busy issue: Error: Failed to add partition 2 (Device or > resource > > busy) > > > > From my test run logs on centos 7.2 , 10.0.5 ( > > > http://qa-proxy.ceph.com/teuthology/vasu-2016-03-15_15:34:41-selinux-master---basic-mira/62626/teuthology.log > > ) > > > > 2016-03-15T18:49:56.305 > > INFO:teuthology.orchestra.run.mira041.stderr:[ceph_deploy.osd][DEBUG ] > > Preparing host mira041 disk /dev/sdb journal None activate True > > 2016-03-15T18:49:56.305 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][DEBUG ] find the > > location of an executable > > 2016-03-15T18:49:56.309 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][INFO ] Running > > command: sudo /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type > xfs -- > > /dev/sdb > > 2016-03-15T18:49:56.546 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > > Running command: /usr/bin/ceph-osd --cluster=ceph > --show-config-value=fsid > > 2016-03-15T18:49:56.611 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > > Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --cluster > > ceph > > 2016-03-15T18:49:56.643 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > > Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --cluster > ceph > > 2016-03-15T18:49:56.708 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > > Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --cluster > ceph > > 2016-03-15T18:49:56.708 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] > get_dm_uuid: > > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > > 2016-03-15T18:49:56.709 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] set_type: > > Will colocate journal with data on /dev/sdb > > 2016-03-15T18:49:56.709 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] command: > > Running command: /usr/bin/ceph-osd --cluster=ceph > > --show-config-value=osd_journal_size > > 2016-03-15T18:49:56.774 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] > get_dm_uuid: > > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > > 2016-03-15T18:49:56.774 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] > get_dm_uuid: > > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > > 2016-03-15T18:49:56.775 > > INFO:teuthology.orchestra.run.mira041.stderr:[mira041][WARNING] > get_dm_uuid: > > get_dm_uuid /dev/sdb uuid path is /sys/dev/block/8:16/dm/uuid > > 2016-03-15T18:49:56.775 > > INFO:teuthology.orchestra.run.mi
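Based on Dan's transcript above, a hedged sketch of the partx-first approach (device name as in his example), rather than a bare partprobe after each sgdisk call:
udevadm settle --timeout=30
partx -u /dev/sdc || partprobe /dev/sdc    # prefer partx; fall back to partprobe only if it fails
udevadm settle --timeout=30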
Re: [ceph-users] Local SSD cache for ceph on each compute node.
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Daniel Niasoff > Sent: 16 March 2016 21:02 > To: Nick Fisk ; 'Van Leeuwen, Robert' > ; 'Jason Dillaman' > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > Hi Nick, > > Your solution requires manual configuration for each VM and cannot be > setup as part of an automated OpenStack deployment. Absolutely, potentially flaky as well. > > It would be really nice if it was a hypervisor based setting as opposed to a VM > based setting. Yes, I can't wait until we can just specify "rbd_cache_device=/dev/ssd" in the ceph.conf and get it to write to that instead. Ideally ceph would also provide some sort of lightweight replication for the cache devices, but otherwise a iSCSI SSD farm or switched SAS could be used so that the caching device is not tied to one physical host. > > Thanks > > Daniel > > -Original Message- > From: Nick Fisk [mailto:n...@fisk.me.uk] > Sent: 16 March 2016 08:59 > To: Daniel Niasoff ; 'Van Leeuwen, Robert' > ; 'Jason Dillaman' > Cc: ceph-users@lists.ceph.com > Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node. > > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > > Of Daniel Niasoff > > Sent: 16 March 2016 08:26 > > To: Van Leeuwen, Robert ; Jason Dillaman > > > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > > > Hi Robert, > > > > >Caching writes would be bad because a hypervisor failure would result > > >in > > loss of the cache which pretty much guarantees inconsistent data on > > the ceph volume. > > >Also live-migration will become problematic compared to running > > everything from ceph since you will also need to migrate the > local-storage. > > I tested a solution using iSCSI for the cache devices. Each VM was using > flashcache with a combination of a iSCSI LUN from a SSD and a RBD. This gets > around the problem of moving things around or if the hypervisor goes down. > It's not local caching but the write latency is at least 10x lower than the RBD. > Note I tested it, I didn't put it into production :-) > > > > > My understanding of how a writeback cache should work is that it > > should only take a few seconds for writes to be streamed onto the > > network and is focussed on resolving the speed issue of small sync > > writes. The writes > would > > be bundled into larger writes that are not time sensitive. > > > > So there is potential for a few seconds data loss but compared to the > current > > trend of using ephemeral storage to solve this issue, it's a major > > improvement. > > Yeah, problem is a couple of seconds data loss mean different things to > different people. > > > > > > (considering the time required for setting up and maintaining the > > > extra > > caching layer on each vm, unless you work for free ;-) > > > > Couldn't agree more there. > > > > I am just so surprised how the openstack community haven't looked to > > resolve this issue. Ephemeral storage is a HUGE compromise unless you > > have built in failure into every aspect of your application but many > > people use openstack as a general purpose devstack. 
> > > > (Jason pointed out his blueprint but I guess it's at least a year or 2 > away - > > http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash- > > consistent_write-back_caching_extension) > > > > I see articles discussing the idea such as this one > > > > http://www.sebastien-han.fr/blog/2014/06/10/ceph-cache-pool-tiering- > > scalable-cache/ > > > > but no real straightforward validated setup instructions. > > > > Thanks > > > > Daniel > > > > > > -Original Message- > > From: Van Leeuwen, Robert [mailto:rovanleeu...@ebay.com] > > Sent: 16 March 2016 08:11 > > To: Jason Dillaman ; Daniel Niasoff > > > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node. > > > > >Indeed, well understood. > > > > > >As a shorter term workaround, if you have control over the VMs, you > > >could > > always just slice out an LVM volume from local SSD/NVMe and pass it > > through to the guest. Within the guest, use dm-cache (or similar) to > > add > a > > cache front-end to your RBD volume. > > > > If you do this you need to setup your cache as read-cache only. > > Caching writes would be bad because a hypervisor failure would result > > in > loss > > of the cache which pretty much guarantees inconsistent data on the > > ceph volume. > > Also live-migration will become problematic compared to running > > everything from ceph since you will also need to migrate the local-storage. > > > > The question will be if adding more ram (== more read cache) would not > > be more convenient and cheaper in the end. > > (considering the time required for setting up and maintaining
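For the in-guest variant Jason suggested, kept read-cache only (writethrough) per Robert's warning about hypervisor failure, an lvmcache (dm-cache via LVM) sketch; /dev/vdb (the RBD-backed disk), /dev/vdc (the passed-through SSD or iSCSI LUN) and the sizes are placeholders:
pvcreate /dev/vdb /dev/vdc
vgcreate vg0 /dev/vdb /dev/vdc
lvcreate -n data -l 100%PVS vg0 /dev/vdb      # the LV the guest actually uses, on the RBD-backed disk
lvcreate -n cache -L 40G vg0 /dev/vdc         # cache data on the fast device
lvcreate -n cache_meta -L 1G vg0 /dev/vdc     # cache metadata
lvconvert --type cache-pool --poolmetadata vg0/cache_meta vg0/cache
lvconvert --type cache --cachemode writethrough --cachepool vg0/cache vg0/data
mkfs.xfs /dev/vg0/data
flashcache, as in Nick's test, is wired up differently, but the shape is the same: a local fast block device in front, the RBD behind.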
Re: [ceph-users] rgw bucket deletion woes
On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi wrote: > Hi, > > I find this to be discussed here before, but couldn¹t find any solution > hence the mail. In RGW, for a bucket holding objects in the range of ~ > millions, one can find it to take for ever to delete the bucket(via > radosgw-admin). I understand the gc(and its parameters) that would reclaim > the space eventually, but am looking more at the bucket deletion options > that can possibly speed up the operation. > > I realize, currently rgw_remove_bucket(), does it 1000 objects at a time, > serially. Wanted to know if there is a reason(that am possibly missing and > discussed) for this to be left that way, otherwise I was considering a > patch to make it happen better. > There is no real reason. You might want to have a version of that command that doesn't schedule the removal to gc, but rather removes all the object parts by itself. Otherwise, you're just going to flood the gc. You'll need to iterate through all the objects, and for each object you'll need to remove all of it's rados objects (starting with the tail, then the head). Removal of each rados object can be done asynchronously, but you'll need to throttle the operations, not send everything to the osds at once (which will be impossible, as the objecter will throttle the requests anyway, which will lead to a high memory consumption). Thanks, Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
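Until a gc-bypassing version of that command exists, one crude way to spread the load is to delete the objects from the client side in parallel batches and only then drop the bucket. A sketch assuming s3cmd is configured for the bucket owner; bucket name and parallelism are placeholders, and the object tails still go through gc -- this just avoids one long serial radosgw-admin run:
s3cmd ls --recursive s3://bigbucket | awk '{print $4}' \
  | xargs -P 8 -n 200 s3cmd del    # 200 keys per call, 8 calls in parallel
s3cmd rb s3://bigbucket            # or: radosgw-admin bucket rm --bucket=bigbucket --purge-objects
Listing millions of keys with s3cmd is itself slow, and keys containing whitespace would need null-delimited handling, so treat this purely as a shape of the workaround.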
[ceph-users] CfP 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16)
CfP 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16) CALL FOR PAPERS 11th Workshop on Virtualization in High-Performance Cloud Computing (VHPC '16) held in conjunction with the International Supercomputing Conference - High Performance, June 19-23, 2016, Frankfurt, Germany. Date: June 23, 2016 Workshop URL: http://vhpc.org Paper Submission Deadline: April 25, 2016 Call for Papers Virtualization technologies constitute a key enabling factor for flexible resource management in modern data centers, and particularly in cloud environments. Cloud providers need to manage complex infrastructures in a seamless fashion to support the highly dynamic and heterogeneous workloads and hosted applications customers deploy. Similarly, HPC environments have been increasingly adopting techniques that enable flexible management of vast computing and networking resources, close to marginal provisioning cost, which is unprecedented in the history of scientific and commercial computing. Various virtualization technologies contribute to the overall picture in different ways: machine virtualization, with its capability to enable consolidation of multiple underutilized servers with heterogeneous software and operating systems (OSes), and its capability to live-migrate a fully operating virtual machine (VM) with a very short downtime, enables novel and dynamic ways to manage physical servers; OS-level virtualization (i.e., containerization), with its capability to isolate multiple user-space environments and to allow for their coexistence within the same OS kernel, promises to provide many of the advantages of machine virtualization with high levels of responsiveness and performance; I/O Virtualization allows physical NICs/HBAs to take traffic from multiple VMs or containers; network virtualization, with its capability to create logical network overlays that are independent of the underlying physical topology and IP addressing, provides the fundamental ground on top of which evolved network services can be realized with an unprecedented level of dynamicity and flexibility; the increasingly adopted paradigm of Software-Defined Networking (SDN) promises to extend this flexibility to the control and data planes of network paths. Topics of Interest The VHPC program committee solicits original, high-quality submissions related to virtualization across the entire software stack with a special focus on the intersection of HPC and the cloud. Topics include, but are not limited to: - Virtualization in supercomputing environments, HPC clusters, cloud HPC and grids - OS-level virtualization including container runtimes (Docker, rkt et al.) 
- Lightweight compute node operating systems/VMMs - Optimizations of virtual machine monitor platforms, hypervisors - QoS and SLA in hypervisors and network virtualization - Cloud based network and system management for SDN and NFV - Management, deployment and monitoring of virtualized environments - Virtual per job / on-demand clusters and cloud bursting - Performance measurement, modelling and monitoring of virtualized/cloud workloads - Programming models for virtualized environments - Virtualization in data intensive computing and Big Data processing - Cloud reliability, fault-tolerance, high-availability and security - Heterogeneous virtualized environments, virtualized accelerators, GPUs and co-processors - Optimized communication libraries/protocols in the cloud and for HPC in the cloud - Topology management and optimization for distributed virtualized applications - Adaptation of emerging HPC technologies (high performance networks, RDMA, etc..) - I/O and storage virtualization, virtualization aware file systems - Job scheduling/control/policy in virtualized environments - Checkpointing and migration of VM-based large compute jobs - Cloud frameworks and APIs - Energy-efficient / power-aware virtualization The Workshop on Virtualization in High-Performance Cloud Computing (VHPC) aims to bring together researchers and industrial practitioners facing the challenges posed by virtualization in order to foster discussion, collaboration, mutual exchange of knowledge and experience, enabling research to ultimately provide novel solutions for virtualized computing systems of tomorrow. The workshop will be one day in length, composed of 20 min paper presentations, each followed by 10 min discussion sections, plus lightning talks that are limited to 5 minutes. Presentations may be accompanied by interactive demonstrations. Important Dates April 25, 2016 - Paper submission deadline May 30, 2016 Acceptance notification June 23, 2016 - Workshop Day July 25, 2016 - Camera-ready version due Chair Michael Alexander (chair), TU Wien, Austria Anastassios Nanos (co-chair), NTUA, Greece Balazs Gerofi (co-chair), RIKEN Advan
Re: [ceph-users] ZFS or BTRFS for performance?
On 20/03/2016 3:38 AM, Heath Albritton wrote: Ceph protects against bitrot at a much higher level by validating the checksum of the entire placement group during a deep scrub. Ceph has checksums? I didn't think it did. It's my understanding that it just compares blocks between replicas and marks the PG inconsistent when it finds a mismatch, unlike btrfs/zfs, which auto-repair the block if the mirror has a valid checksum. -- Lindsay Mathieson ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ZFS or BTRFS for performance?
Le 19/03/2016 18:38, Heath Albritton a écrit : > If you google "ceph bluestore" you'll be able to find a couple slide > decks on the topic. One of them by Sage is easy to follow without the > benefit of the presentation. There's also the " Redhat Ceph Storage > Roadmap 2016" deck. > > In any case, bluestore is not intended to address bitrot. Given that > ceph is a distributed file system, many of the posix file system > features are not required for the underlying block storage device. > Bluestore is intended to address this and reduce the disk IO required > to store user data. > > Ceph protects against bitrot at a much higher level by validating the > checksum of the entire placement group during a deep scrub. My impression is that the only protection against bitrot is provided by the underlying filesystem which means that you don't get any if you use XFS or EXT4. I can't trust Ceph on this alone until its bitrot protection (if any) is clearly documented. The situation is far from clear right now. The documentations states that deep scrubs are using checksums to validate data, but this is not good enough at least because we don't known what these checksums are supposed to cover (see below for another reason). There is even this howto by Sebastien Han about repairing a PG : http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ which clearly concludes that with only 2 replicas you can't reliably find out which object is corrupted with Ceph alone. If Ceph really stored checksums to verify all the objects it stores we could manually check which replica is valid. Even if deep scrubs would use checksums to verify data this would not be enough to protect against bitrot: there is a window between a corruption event and a deep scrub where the data on a primary can be returned to a client. BTRFS solves this problem by returning an IO error for any data read that doesn't match its checksum (or automatically rebuilds it if the allocation group is using RAID1/10/5/6). I've never seen this kind of behavior documented for Ceph. Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
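For what it's worth, on a BTRFS-backed OSD you can at least surface those csum failures yourself; the mount point below is an example:
btrfs scrub start -Bd /var/lib/ceph/osd/ceph-12    # foreground scrub with per-device stats; it can only repair from a good copy if the fs itself has redundancy
btrfs device stats /var/lib/ceph/osd/ceph-12       # cumulative corruption/checksum error counters per device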
Re: [ceph-users] [cephfs] About feature 'snapshot'
Hi John, Really thank you for your help, and sorry about that I ask such a stupid question of setting... So isn't this feature ready in Jewel? I found something info says that the features(snapshot, quota...) become stable in Jewel Thank you 2016-03-18 21:07 GMT+09:00 John Spray : > On Fri, Mar 18, 2016 at 1:33 AM, 施柏安 wrote: > > Hi John, > > How to set this feature on? > > ceph mds set allow_new_snaps true --yes-i-really-mean-it > > John > > > Thank you > > > > 2016-03-17 21:41 GMT+08:00 Gregory Farnum : > >> > >> On Thu, Mar 17, 2016 at 3:49 AM, John Spray wrote: > >> > Snapshots are disabled by default: > >> > > >> > > http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration > >> > >> Which makes me wonder if we ought to be hiding the .snaps directory > >> entirely in that case. I haven't previously thought about that, but it > >> *is* a bit weird. > >> -Greg > >> > >> > > >> > John > >> > > >> > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 > wrote: > >> >> Hi all, > >> >> I encounter a trouble about cephfs sanpshot. It seems that the folder > >> >> '.snap' is exist. > >> >> But I use 'll -a' can't let it show up. And I enter that folder and > >> >> create > >> >> folder in it, it showed something wrong to use snapshot. > >> >> > >> >> Please check : http://imgur.com/elZhQvD > >> >> > >> >> > >> >> ___ > >> >> ceph-users mailing list > >> >> ceph-users@lists.ceph.com > >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] [cephfs] About feature 'snapshot'
On Fri, Mar 18, 2016 at 1:33 AM, 施柏安 wrote: > Hi John, > How to set this feature on? ceph mds set allow_new_snaps true --yes-i-really-mean-it John > Thank you > > 2016-03-17 21:41 GMT+08:00 Gregory Farnum : >> >> On Thu, Mar 17, 2016 at 3:49 AM, John Spray wrote: >> > Snapshots are disabled by default: >> > >> > http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration >> >> Which makes me wonder if we ought to be hiding the .snaps directory >> entirely in that case. I haven't previously thought about that, but it >> *is* a bit weird. >> -Greg >> >> > >> > John >> > >> > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 wrote: >> >> Hi all, >> >> I encounter a trouble about cephfs sanpshot. It seems that the folder >> >> '.snap' is exist. >> >> But I use 'll -a' can't let it show up. And I enter that folder and >> >> create >> >> folder in it, it showed something wrong to use snapshot. >> >> >> >> Please check : http://imgur.com/elZhQvD >> >> >> >> >> >> ___ >> >> ceph-users mailing list >> >> ceph-users@lists.ceph.com >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
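Once that flag is set, usage is just the hidden .snap directory inside the mounted filesystem; the mount point and names below are examples:
mkdir /mnt/cephfs/mydir/.snap/snap1    # snapshot the tree under mydir
ls /mnt/cephfs/mydir/.snap             # list existing snapshots
rmdir /mnt/cephfs/mydir/.snap/snap1    # delete the snapshot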
Re: [ceph-users] [cephfs] About feature 'snapshot'
On Thu, Mar 17, 2016 at 3:49 AM, John Spray wrote: > Snapshots are disabled by default: > http://docs.ceph.com/docs/hammer/cephfs/early-adopters/#most-stable-configuration Which makes me wonder if we ought to be hiding the .snaps directory entirely in that case. I haven't previously thought about that, but it *is* a bit weird. -Greg > > John > > On Thu, Mar 17, 2016 at 10:02 AM, 施柏安 wrote: >> Hi all, >> I encounter a trouble about cephfs sanpshot. It seems that the folder >> '.snap' is exist. >> But I use 'll -a' can't let it show up. And I enter that folder and create >> folder in it, it showed something wrong to use snapshot. >> >> Please check : http://imgur.com/elZhQvD >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ZFS or BTRFS for performance?
Yes, I'm missing this protection in Ceph too. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007680.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
There is a directory structure hash, it's just that it's at the end of the name and you'll have to check the xattr I mentioned to find it. I think that file is actually the one we are talking about removing. ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0 Notice that the user.cephosd.lfn3 attr has the full name, and it *does* have a hash 79CED459 (you referred to it as a directory hash I think, but it's actually the hash we used to place it on this osd to begin with). In specifically this case, you shouldn't find any files in the DIR_9/DIR_5/DIR_4/DIR_D directory since there are 16 subdirectories (so all hash values should hash to one of those). The one in DIR_9/DIR_5/DIR_4/DIR_D/DIR_E is completely fine -- that's the actual object file, don't remove that. If you look at the attr: ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 The hash is 79CED459, which means that (assuming DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/DIR_C does *not* exist) it's in the right place. The ENOENT return 2016-03-07 16:11:41.828332 7ff30cdad700 10 filestore(/var/lib/ceph/osd/ceph-307) remove 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 = -2 2016-03-07 21:44:02.197676 7fe96b56f700 10 filestore(/var/lib/ceph/osd/ceph-307) remove 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 = -2 actually was a symptom in this case, but, in general, it's not indicative of anything -- the filestore can get ENOENT return values for legitimate reasons. To reiterate: files that end in something like fa202ec9b4b3b217275a_0_long are *not* necessarily orphans -- you need to check the user.cephos.lfn3 attr (as you did before) for the full length file name and determine whether the file is in the right place. 
-Sam On Wed, Mar 16, 2016 at 7:49 AM, Jeffrey McDonald wrote: > Hi Sam, > > In the 70.459 logs from the deep-scrub, there is an error: > > $ zgrep "= \-2$" ceph-osd.307.log.1.gz > 2016-03-07 16:11:41.828332 7ff30cdad700 10 > filestore(/var/lib/ceph/osd/ceph-307) remove > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > = -2 > 2016-03-07 21:44:02.197676 7fe96b56f700 10 > filestore(/var/lib/ceph/osd/ceph-307) remove > 70.459s0_head/79ced459/default.724733.17__shadow_prostate/rnaseq/8e5da6e8-8881-4813-a4e3-327df57fd1b7/UNCID_2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304_UNC14-SN744_0400_AC3LWGACXX_7_GAGTGG.tar.gz.2~_1r_FGidmpEP8GRsJkNLfAh9CokxkLf.4_156/head//70/202909/0 > = -2 > > I'm taking this as an indication of the error you mentioned.It looks to > me as if this bug leaves two files with "issues" based upon what I see on > the filesystem. > > First, I have a size-0 file in a directory where I expect only to have > directories: > > root@ceph03:/var/lib/ceph/osd/ceph-307/current/70.459s0_head/DIR_9/DIR_5/DIR_4/DIR_D# > ls -ltr > total 320 > -rw-r--r-- 1 root root 0 Jan 23 21:49 > default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long > drwxr-xr-x 2 root root 16384 Feb 5 15:13 DIR_6 > drwxr-xr-x 2 root root 16384 Feb 5 17:26 DIR_3 > drwxr-xr-x 2 root root 16384 Feb 10 00:01 DIR_C > drwxr-xr-x 2 root root 16384 Mar 4 10:50 DIR_7 > drwxr-xr-x 2 root root 16384 Mar 4 16:46 D
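A rough, read-only bash sketch of that five-step check (not the ceph-objectstore-tool branch Sam mentions); it assumes the nibble-wise DIR_* layout shown above, is run from inside the PG's _head directory on the OSD, and only echoes candidates for manual review:
find . -type f -name '*_long' | while read -r f; do
  full=$(xattr -p user.cephos.lfn3 "$f")                                  # 1) full name from the attr
  hash=$(echo "$full" | sed -n 's/.*__head_\([0-9A-F]\{8\}\)__.*/\1/p')   # 2) placement hash
  [ -n "$hash" ] || continue
  dir="."                                                                 # 3) expected dir: hash nibbles,
  for i in 7 6 5 4 3 2 1 0; do                                            #    last one first, descending
    n=${hash:$i:1}                                                        #    only while the subdir exists
    [ -d "$dir/DIR_$n" ] || break
    dir="$dir/DIR_$n"
  done
  [ "$(dirname "$f")" = "$dir" ] || echo "check: $f (hash $hash, expected $dir)"   # 4) and 5)
done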
[ceph-users] Upgrade from .94 to 10.0.5
Is there documentation on all the steps showing how to upgrade from .94 to 10.0.5? Thanks Rick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] data corruption with hammer
There is got to be something else going on here. All that PR does is to potentially delay the promotion to hit_set_period*recency instead of just doing it on the 2nd read regardless, it's got to be uncovering another bug. Do you see the same problem if the cache is in writeback mode before you start the unpacking. Ie is it the switching mid operation which causes the problem? If it only happens mid operation, does it still occur if you pause IO when you make the switch? Do you also see this if you perform on a RBD mount, to rule out any librbd/qemu weirdness? Do you know if it’s the actual data that is getting corrupted or if it's the FS metadata? I'm only wondering as unpacking should really only be writing to each object a couple of times, whereas FS metadata could potentially be being updated+read back lots of times for the same group of objects and ordering is very important. Thinking through it logically the only difference is that with recency=1 the object will be copied up to the cache tier, where recency=6 it will be proxy read for a long time. If I had to guess I would say the issue would lie somewhere in the proxy read + writeback<->forward logic. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Mike Lovell > Sent: 16 March 2016 23:23 > To: ceph-users ; sw...@redhat.com > Cc: Robert LeBlanc ; William Perkins > > Subject: Re: [ceph-users] data corruption with hammer > > just got done with a test against a build of 0.94.6 minus the two commits that > were backported in PR 7207. everything worked as it should with the cache- > mode set to writeback and the min_read_recency_for_promote set to 2. > assuming it works properly on master, there must be a commit that we're > missing on the backport to support this properly. > > sage, > i'm adding you to the recipients on this so hopefully you see it. the tl;dr > version is that the backport of the cache recency fix to hammer doesn't work > right and potentially corrupts data when > the min_read_recency_for_promote is set to greater than 1. > > mike > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > wrote: > robert and i have done some further investigation the past couple days on > this. we have a test environment with a hard drive tier and an ssd tier as a > cache. several vms were created with volumes from the ceph cluster. i did a > test in each guest where i un-tarred the linux kernel source multiple times > and then did a md5sum check against all of the files in the resulting source > tree. i started off with the monitors and osds running 0.94.5 and never saw > any problems. > > a single node was then upgraded to 0.94.6 which has osds in both the ssd and > hard drive tier. i then proceeded to run the same test and, while the untar > and md5sum operations were running, i changed the ssd tier cache-mode > from forward to writeback. almost immediately the vms started reporting io > errors and odd data corruption. the remainder of the cluster was updated to > 0.94.6, including the monitors, and the same thing happened. > > things were cleaned up and reset and then a test was run > where min_read_recency_for_promote for the ssd cache pool was set to 1. > we previously had it set to 6. there was never an error with the recency > setting set to 1. i then tested with it set to 2 and it immediately caused > failures. 
we are currently thinking that it is related to the backport of the > fix > for the recency promotion and are in progress of making a .6 build without > that backport to see if we can cause corruption. is anyone using a version > from after the original recency fix (PR 6702) with a cache tier in writeback > mode? anyone have a similar problem? > > mike > > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell > wrote: > something weird happened on one of the ceph clusters that i administer > tonight which resulted in virtual machines using rbd volumes seeing > corruption in multiple forms. > > when everything was fine earlier in the day, the cluster was a number of > storage nodes spread across 3 different roots in the crush map. the first > bunch of storage nodes have both hard drives and ssds in them with the hard > drives in one root and the ssds in another. there is a pool for each and the > pool for the ssds is a cache tier for the hard drives. the last set of storage > nodes were in a separate root with their own pool that is being used for burn > in testing. > > these nodes had run for a while with test traffic and we decided to move > them to the main root and pools. the main cluster is running 0.94.5 and the > new nodes got 0.94.6 due to them getting configured after that was > released. i removed the test pool and did a ceph osd crush move to move > the first node into the main cluster, the hard drives into the root for that > tier > of storage and the ssds into the root and pool for the cache tier. each set > was > done about 45 minutes apart and they r
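For anyone trying to reproduce this, the toggles involved are just the following; the cache pool name is a placeholder, and newer releases may also ask for --yes-i-really-mean-it on the forward switch:
ceph osd tier cache-mode ssd-cache forward
ceph osd tier cache-mode ssd-cache writeback
ceph osd pool set ssd-cache min_read_recency_for_promote 2
ceph osd pool get ssd-cache min_read_recency_for_promote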
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Oh, it's getting a stat mismatch. I think what happened is that on one of the earlier repairs it reset the stats to the wrong value (the orphan was causing the primary to scan two objects twice, which matches the stat mismatch I see here). A pg repair repair will clear that up. -Sam On Thu, Mar 17, 2016 at 9:22 AM, Jeffrey McDonald wrote: > Thanks Sam. > > Since I have prepared a script for this, I decided to go ahead with the > checks.(patience isn't one of my extended attributes) > > I've got a file that searches the full erasure encoded spaces and does your > checklist below. I have operated only on one PG so far, the 70.459 one > that we've been discussing.There was only the one file that I found to > be out of place--the one we already discussed/found and it has been removed. > > The pg is still marked as inconsistent. I've scrubbed it a couple of times > now and what I've seen is: > > 2016-03-17 09:29:53.202818 7f2e816f8700 0 log_channel(cluster) log [INF] : > 70.459 deep-scrub starts > 2016-03-17 09:36:38.436821 7f2e816f8700 -1 log_channel(cluster) log [ERR] : > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. > 2016-03-17 09:36:38.436844 7f2e816f8700 -1 log_channel(cluster) log [ERR] : > 70.459 deep-scrub 1 errors > 2016-03-17 09:44:23.592302 7f2e816f8700 0 log_channel(cluster) log [INF] : > 70.459 deep-scrub starts > 2016-03-17 09:47:01.237846 7f2e816f8700 -1 log_channel(cluster) log [ERR] : > 70.459s0 deep-scrub stat mismatch, got 22319/22321 objects, 0/0 clones, > 22319/22321 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, > 68440088914/68445454633 bytes,0/0 hit_set_archive bytes. > 2016-03-17 09:47:01.237880 7f2e816f8700 -1 log_channel(cluster) log [ERR] : > 70.459 deep-scrub 1 errors > > > Should the scrub be sufficient to remove the inconsistent flag? I took the > osd offline during the repairs.I've looked at files in all of the osds > in the placement group and I'm not finding any more problem files.The > vast majority of files do not have the user.cephos.lfn3 attribute.There > are 22321 objects that I seen and only about 230 have the user.cephos.lfn3 > file attribute. The files will have other attributes, just not > user.cephos.lfn3. > > Regards, > Jeff > > > On Wed, Mar 16, 2016 at 3:53 PM, Samuel Just wrote: >> >> Ok, like I said, most files with _long at the end are *not orphaned*. >> The generation number also is *not* an indication of whether the file >> is orphaned -- some of the orphaned files will have >> as the generation number and others won't. For each long filename >> object in a pg you would have to: >> 1) Pull the long name out of the attr >> 2) Parse the hash out of the long name >> 3) Turn that into a directory path >> 4) Determine whether the file is at the right place in the path >> 5) If not, remove it (or echo it to be checked) >> >> You probably want to wait for someone to get around to writing a >> branch for ceph-objectstore-tool. Should happen in the next week or >> two. >> -Sam >> > > -- > > Jeffrey McDonald, PhD > Assistant Director for HPC Operations > Minnesota Supercomputing Institute > University of Minnesota Twin Cities > 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu > 117 Pleasant St SE phone: +1 612 625-6905 > Minneapolis, MN 55455fax: +1 612 624-8861 > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
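For the record, with the pg id from this thread that is just:
ceph pg deep-scrub 70.459    # confirm the remaining error
ceph pg repair 70.459        # clear the stat mismatch
ceph health detail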
Re: [ceph-users] data corruption with hammer
Yep, let me pull and build that branch. I tried installing the dbg packages and running it in gdb, but it didn't load the symbols. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil wrote: > On Thu, 17 Mar 2016, Robert LeBlanc wrote: >> Also, is this ceph_test_rados rewriting objects quickly? I think that >> the issue is with rewriting objects so if we can tailor the >> ceph_test_rados to do that, it might be easier to reproduce. > > It's doing lots of overwrites, yeah. > > I was albe to reproduce--thanks! It looks like it's specific to > hammer. The code was rewritten for jewel so it doesn't affect the > latest. The problem is that maybe_handle_cache may proxy the read and > also still try to handle the same request locally (if it doesn't trigger a > promote). > > Here's my proposed fix: > > https://github.com/ceph/ceph/pull/8187 > > Do you mind testing this branch? > > It doesn't appear to be directly related to flipping between writeback and > forward, although it may be that we are seeing two unrelated issues. I > seemed to be able to trigger it more easily when I flipped modes, but the > bug itself was a simple issue in the writeback mode logic. :/ > > Anyway, please see if this fixes it for you (esp with the RBD workload). > > Thanks! > sage > > > > >> >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> >> >> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc >> wrote: >> > I'll miss the Ceph community as well. There was a few things I really >> > wanted to work in with Ceph. >> > >> > I got this: >> > >> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028) >> > dirty exists >> > 1038: left oid 13 (ObjNum 1028 snap 0 seq_num 1028) >> > 1040: finishing write tid 1 to nodez23350-256 >> > 1040: finishing write tid 2 to nodez23350-256 >> > 1040: finishing write tid 3 to nodez23350-256 >> > 1040: finishing write tid 4 to nodez23350-256 >> > 1040: finishing write tid 6 to nodez23350-256 >> > 1035: done (4 left) >> > 1037: done (3 left) >> > 1038: done (2 left) >> > 1043: read oid 430 snap -1 >> > 1043: expect (ObjNum 429 snap 0 seq_num 429) >> > 1040: finishing write tid 7 to nodez23350-256 >> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029) >> > dirty exists >> > 1040: left oid 256 (ObjNum 1029 snap 0 seq_num 1029) >> > 1042: expect (ObjNum 664 snap 0 seq_num 664) >> > 1043: Error: oid 430 read returned error code -2 >> > ./test/osd/RadosModel.h: In function 'virtual void >> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time >> > 2016-03-17 10:47:19.085414 >> > ./test/osd/RadosModel.h: 1109: FAILED assert(0) >> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403) >> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> > const*)+0x76) [0x4db956] >> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c] >> > 3: (()+0x9791d) [0x7fa1d472191d] >> > 4: (()+0x72519) [0x7fa1d46fc519] >> > 5: (()+0x13c178) [0x7fa1d47c6178] >> > 6: (()+0x80a4) [0x7fa1d425a0a4] >> > 7: (clone()+0x6d) [0x7fa1d2bd504d] >> > NOTE: a copy of the executable, or `objdump -rdS ` is >> > needed to interpret this. >> > terminate called after throwing an instance of 'ceph::FailedAssertion' >> > Aborted >> > >> > I had to toggle writeback/forward and min_read_recency_for_promote a >> > few times to get it, but I don't know if it is because I only have one >> > job running. 
Even with six jobs running, it is not easy to trigger >> > with ceph_test_rados, but it is very instant in the RBD VMs. >> > >> > Here are the six run crashes (I have about the last 2000 lines of each >> > if needed): >> > >> > nodev: >> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num >> > 1014) dirty exists >> > 1015: left oid 1015 (ObjNum 1014 snap 0 seq_num 1014) >> > 1016: finishing write tid 1 to nodev21799-1016 >> > 1016: finishing write tid 2 to nodev21799-1016 >> > 1016: finishing write tid 3 to nodev21799-1016 >> > 1016: finishing write tid 4 to nodev21799-1016 >> > 1016: finishing write tid 6 to nodev21799-1016 >> > 1016: finishing write tid 7 to nodev21799-1016 >> > update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num >> > 1015) dirty exists >> > 1016: left oid 1016 (ObjNum 1015 snap 0 seq_num 1015) >> > 1017: finishing write tid 1 to nodev21799-1017 >> > 1017: finishing write tid 2 to nodev21799-1017 >> > 1017: finishing write tid 3 to nodev21799-1017 >> > 1017: finishing write tid 5 to nodev21799-1017 >> > 1017: finishing write tid 6 to nodev21799-1017 >> > update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num >> > 1016) dirty exists >> > 1017: left oid 1017 (ObjNum 1016 snap 0 seq_num 1016) >> > 1018: finishing write tid 1 to nodev21799-1018 >> > 1018: finishing write tid 2 to nodev21799-1018 >> > 1018: finishing write tid 3 to nodev217
Re: [ceph-users] ZFS or BTRFS for performance?
Hello, On Sun, 20 Mar 2016 00:45:47 +0100 Lionel Bouton wrote: > Le 19/03/2016 18:38, Heath Albritton a écrit : > > If you google "ceph bluestore" you'll be able to find a couple slide > > decks on the topic. One of them by Sage is easy to follow without the > > benefit of the presentation. There's also the " Redhat Ceph Storage > > Roadmap 2016" deck. > > > > In any case, bluestore is not intended to address bitrot. Given that > > ceph is a distributed file system, many of the posix file system > > features are not required for the underlying block storage device. > > Bluestore is intended to address this and reduce the disk IO required > > to store user data. > > > > Ceph protects against bitrot at a much higher level by validating the > > checksum of the entire placement group during a deep scrub. > That's not protection, that's an "uh-oh, something is wrong, you better check it out" notification, after which you get to spend a lot of time figuring out which is the good replica and as Lionel wrote in the case of just 2 replicas and faced with binary data you might as well roll a dice. Completely unacceptable and my oldest pet peeve about Ceph. I'd be deeply disappointed if bluestore would go ahead ignoring that elephant in the room as well. > My impression is that the only protection against bitrot is provided by > the underlying filesystem which means that you don't get any if you use > XFS or EXT4. > Indeed. > I can't trust Ceph on this alone until its bitrot protection (if any) is > clearly documented. The situation is far from clear right now. The > documentations states that deep scrubs are using checksums to validate > data, but this is not good enough at least because we don't known what > these checksums are supposed to cover (see below for another reason). > There is even this howto by Sebastien Han about repairing a PG : > http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ > which clearly concludes that with only 2 replicas you can't reliably > find out which object is corrupted with Ceph alone. If Ceph really > stored checksums to verify all the objects it stores we could manually > check which replica is valid. > AFAIK it uses checksums created on the fly to compare the data during deep-scrubs. I also recall talks about having permanent checksums stored, but no idea what the status is. > Even if deep scrubs would use checksums to verify data this would not be > enough to protect against bitrot: there is a window between a corruption > event and a deep scrub where the data on a primary can be returned to a > client. BTRFS solves this problem by returning an IO error for any data > read that doesn't match its checksum (or automatically rebuilds it if > the allocation group is using RAID1/10/5/6). I've never seen this kind > of behavior documented for Ceph. > Ditto. And if/when Ceph has reliable checksumming (in the storage layer) it should definitely get auto-repair abilities as well. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] data corruption with hammer
On Thu, 17 Mar 2016, Robert LeBlanc wrote: > We are trying to figure out how to use rados bench to reproduce. Ceph > itself doesn't seem to think there is any corruption, but when you do a > verify inside the RBD, there is. Can rados bench verify the objects after > they are written? It also seems to be primarily the filesystem metadata > that is corrupted. If we fsck the volume, there is missing data (put into > lost+found), but if it is there it is primarily OK. There only seems to be > a few cases where a file's contents are corrupted. I would suspect on an > object boundary. We would have to look at blockinfo to map that out and see > if that is what is happening. 'rados bench' doesn't do validation. ceph_test_rados does, though--if you can reproduce with that workload then it should be pretty easy to track down. Thanks! sage > We stopped all the IO and did put the tier in writeback mode with recency > 1, set the recency to 2 and started the test and there was corruption, so > it doesn't seem to be limited to changing the mode. I don't know how that > patch could cause the issue either. Unless there is a bug that reads from > the back tier, but writes to cache tier, then the object gets promoted > wiping that last write, but then it seems like it should not be as much > corruption since the metadata should be in the cache pretty quick. We > usually evited the cache before each try so we should not be evicting on > writeback. > > Sent from a mobile device, please excuse any typos. > On Mar 17, 2016 6:26 AM, "Sage Weil" wrote: > > > On Thu, 17 Mar 2016, Nick Fisk wrote: > > > There is got to be something else going on here. All that PR does is to > > > potentially delay the promotion to hit_set_period*recency instead of > > > just doing it on the 2nd read regardless, it's got to be uncovering > > > another bug. > > > > > > Do you see the same problem if the cache is in writeback mode before you > > > start the unpacking. Ie is it the switching mid operation which causes > > > the problem? If it only happens mid operation, does it still occur if > > > you pause IO when you make the switch? > > > > > > Do you also see this if you perform on a RBD mount, to rule out any > > > librbd/qemu weirdness? > > > > > > Do you know if it’s the actual data that is getting corrupted or if it's > > > the FS metadata? I'm only wondering as unpacking should really only be > > > writing to each object a couple of times, whereas FS metadata could > > > potentially be being updated+read back lots of times for the same group > > > of objects and ordering is very important. > > > > > > Thinking through it logically the only difference is that with recency=1 > > > the object will be copied up to the cache tier, where recency=6 it will > > > be proxy read for a long time. If I had to guess I would say the issue > > > would lie somewhere in the proxy read + writeback<->forward logic. > > > > That seems reasonable. Was switching from writeback -> forward always > > part of the sequence that resulted in corruption? Not that there is a > > known ordering issue when switching to forward mode. I wouldn't really > > expect it to bite real users but it's possible.. > > > > http://tracker.ceph.com/issues/12814 > > > > I've opened a ticket to track this: > > > > http://tracker.ceph.com/issues/15171 > > > > What would be *really* great is if you could reproduce this with a > > ceph_test_rados workload (from ceph-tests). 
I.e., get ceph_test_rados > > running, and then find the sequence of operations that are sufficient to > > trigger a failure. > > > > sage > > > > > > > > > > > > > > > > > > > -Original Message- > > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > > Of > > > > Mike Lovell > > > > Sent: 16 March 2016 23:23 > > > > To: ceph-users ; sw...@redhat.com > > > > Cc: Robert LeBlanc ; William Perkins > > > > > > > > Subject: Re: [ceph-users] data corruption with hammer > > > > > > > > just got done with a test against a build of 0.94.6 minus the two > > commits that > > > > were backported in PR 7207. everything worked as it should with the > > cache- > > > > mode set to writeback and the min_read_recency_for_promote set to 2. > > > > assuming it works properly on master, there must be a commit that we're > > > > missing on the backport to support this properly. > > > > > > > > sage, > > > > i'm adding you to the recipients on this so hopefully you see it. the > > tl;dr > > > > version is that the backport of the cache recency fix to hammer > > doesn't work > > > > right and potentially corrupts data when > > > > the min_read_recency_for_promote is set to greater than 1. > > > > > > > > mike > > > > > > > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell > > > > wrote: > > > > robert and i have done some further investigation the past couple days > > on > > > > this. we have a test environment with a hard drive tier and an ssd > > tier as a > > > > cache. several vms
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Ok, like I said, most files with _long at the end are *not orphaned*. The generation number also is *not* an indication of whether the file is orphaned -- some of the orphaned files will have as the generation number and others won't. For each long filename object in a pg you would have to: 1) Pull the long name out of the attr 2) Parse the hash out of the long name 3) Turn that into a directory path 4) Determine whether the file is at the right place in the path 5) If not, remove it (or echo it to be checked) You probably want to wait for someone to get around to writing a branch for ceph-objectstore-tool. Should happen in the next week or two. -Sam On Wed, Mar 16, 2016 at 1:36 PM, Jeffrey McDonald wrote: > Hi Sam, > > I've written a script but i'm a little leary of unleasing it until I find a > few more cases to test. The script successfully removed the file mentioned > above. > I took the next pg which was marked inconsistent and ran the following > command over those pg directory structures: > > find . -name "*_long" -exec xattr -p user.cephos.lfn3 {} + | grep -v > > > I didn't find any files that "orphaned" by this command. All of these > files should have "_long" and the grep should pull out the invalid > generation, correct? > > I'm looking wider but in the next pg marked inconsistent I didn't find any > orphans. > > Thanks, > Jeff > > -- > > Jeffrey McDonald, PhD > Assistant Director for HPC Operations > Minnesota Supercomputing Institute > University of Minnesota Twin Cities > 599 Walter Library email: jeffrey.mcdon...@msi.umn.edu > 117 Pleasant St SE phone: +1 612 625-6905 > Minneapolis, MN 55455fax: +1 612 624-8861 > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rgw bucket deletion woes
We would be a big user of this. We delete large buckets often and it takes forever. Though didn't I read that 'object expiration' support is on the near-term RGW roadmap? That may do what we want.. we're creating thousands of objects a day, and thousands of objects a day will be expiring, so RGW will need to handle. -Ben On Wed, Mar 16, 2016 at 9:40 AM, Yehuda Sadeh-Weinraub wrote: > On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi > wrote: > > Hi, > > > > I find this to be discussed here before, but couldn¹t find any solution > > hence the mail. In RGW, for a bucket holding objects in the range of ~ > > millions, one can find it to take for ever to delete the bucket(via > > radosgw-admin). I understand the gc(and its parameters) that would > reclaim > > the space eventually, but am looking more at the bucket deletion options > > that can possibly speed up the operation. > > > > I realize, currently rgw_remove_bucket(), does it 1000 objects at a time, > > serially. Wanted to know if there is a reason(that am possibly missing > and > > discussed) for this to be left that way, otherwise I was considering a > > patch to make it happen better. > > > > There is no real reason. You might want to have a version of that > command that doesn't schedule the removal to gc, but rather removes > all the object parts by itself. Otherwise, you're just going to flood > the gc. You'll need to iterate through all the objects, and for each > object you'll need to remove all of it's rados objects (starting with > the tail, then the head). Removal of each rados object can be done > asynchronously, but you'll need to throttle the operations, not send > everything to the osds at once (which will be impossible, as the > objecter will throttle the requests anyway, which will lead to a high > memory consumption). > > Thanks, > Yehuda > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com