Re: [ceph-users] 3.18.11 - RBD triggered deadlock?
On Sat, Apr 25, 2015 at 9:56 AM, Nikola Ciprich wrote:
>> It seems you just grepped for ceph-osd - that doesn't include sockets
>> opened by the kernel client, which is what I was after. Paste the
>> entire netstat?
> ouch, bummer! here are full netstats, sorry about the delay..
>
> http://nik.lbox.cz/download/ceph/
>
> BR

tcp 0 0 10.0.0.1:6809 10.0.0.1:59692 ESTABLISHED 20182/ceph-osd
tcp 0 4163543 10.0.0.1:59692 10.0.0.1:6809 ESTABLISHED -

You got bitten by a recently fixed regression. It has never been a good idea to co-locate the kernel client with OSDs, and we advise against it. However, it happens to work most of the time, so you can do it if you really want to. That "happens to work" part got accidentally broken in 3.18 and was fixed in 4.0, 3.19.5 and 3.18.12. You are running 3.18.11, so you are going to need to upgrade.

Thanks,

Ilya
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
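A quick way to spot this on your own nodes: connections opened by the in-kernel rbd/cephfs client have no owning userspace process, so they show up in netstat with a "-" in the PID/Program column (run it as root, otherwise other users' processes also appear without a PID). A rough sketch:

# established TCP sessions with no owning process = kernel client sockets
netstat -tnp | awk '$6 == "ESTABLISHED" && $7 == "-"'

# compare the running kernel against the fixed releases (4.0, 3.19.5, 3.18.12)
uname -r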
Re: [ceph-users] Radosgw and mds hardware configuration
Hi,

Gregory Farnum wrote:
> The MDS will run in 1GB, but the more RAM it has the more of the metadata
> you can cache in memory. The faster single-threaded performance your CPU
> has, the more metadata IOPS you'll get. We haven't done much work
> characterizing it, though.

Ok, thanks for the answer. So, if I understand correctly, the mds server will in any case use just one core of the CPU, so having several cores is of little use. Is that correct?

--
François Lafont
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
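To put a rough number on "more RAM means more cached metadata": in the hammer-era releases discussed here the MDS cache is bounded by an inode count rather than by bytes. A minimal ceph.conf sketch, with an example value only (budget very roughly a few KB of MDS memory per cached inode):

[mds]
# inodes kept in the MDS cache (default 100000); raising it trades RAM
# for fewer round-trips to the metadata pool on hot directories
mds cache size = 1000000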
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
Thanks Greg and Steffen for your answer. I will make some tests.

Gregory Farnum wrote:
> Yeah. The metadata pool will contain:
> 1) MDS logs, which I think by default will take up to 200MB per
> logical MDS. (You should have only one logical MDS.)
> 2) directory metadata objects, which contain the dentries and inodes
> of the system; ~4KB is probably generous for each?

So one file in the cephfs generates one inode of ~4KB in the "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB gives me an (approximative) estimation of the amount of data in the "metadata" pool?

> 3) Some smaller data structures about the allocated inode range and
> current client sessions.
>
> The data pool contains all of the file data. Presumably this is much
> larger, but it will depend on your average file size and we've not
> done any real study of it.

--
François Lafont
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
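The rule of thumb above is easy to turn into a back-of-the-envelope number; as the rest of this thread shows, real clusters can land well below it because most of the metadata ends up in leveldb/omap rather than in object payloads, so treat it as an upper bound. A shell sketch, using 36 million files as an example:

# ~4KB of metadata per file, expressed in GB
files=36000000
echo "$(( files * 4 / 1024 / 1024 )) GB"   # -> ~137 GB for 36 million files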
Re: [ceph-users] Shadow Files
Yeah, that's definitely something that we'd address soon. Yehuda - Original Message - > From: "Ben" > To: "Ben Hines" , "Yehuda Sadeh-Weinraub" > > Cc: "ceph-users" > Sent: Friday, April 24, 2015 5:14:11 PM > Subject: Re: [ceph-users] Shadow Files > > Definitely need something to help clear out these old shadow files. > > I'm sure our cluster has around 100TB of these shadow files. > > I've written a script to go through known objects to get prefixes of objects > that should exist to compare to ones that shouldn't, but the time it takes > to do this over millions and millions of objects is just too long. > > On 25/04/15 09:53, Ben Hines wrote: > > > > When these are fixed it would be great to get good steps for listing / > cleaning up any orphaned objects. I have suspicions this is affecting us. > > thanks- > > -Ben > > On Fri, Apr 24, 2015 at 3:10 PM, Yehuda Sadeh-Weinraub < yeh...@redhat.com > > wrote: > > > These ones: > > http://tracker.ceph.com/issues/10295 > http://tracker.ceph.com/issues/11447 > > - Original Message - > > From: "Ben Jackson" > > To: "Yehuda Sadeh-Weinraub" < yeh...@redhat.com > > > Cc: "ceph-users" < ceph-us...@ceph.com > > > Sent: Friday, April 24, 2015 3:06:02 PM > > Subject: Re: [ceph-users] Shadow Files > > > > We were firefly, then we upgraded to giant, now we are on hammer. > > > > What issues? > > > > On 25 Apr 2015 2:12 am, Yehuda Sadeh-Weinraub < yeh...@redhat.com > wrote: > > > > > > What version are you running? There are two different issues that we were > > > fixing this week, and we should have that upstream pretty soon. > > > > > > Yehuda > > > > > > - Original Message - > > > > From: "Ben" > > > > To: "ceph-users" < ceph-us...@ceph.com > > > > > Cc: "Yehuda Sadeh-Weinraub" < yeh...@redhat.com > > > > > Sent: Thursday, April 23, 2015 7:42:06 PM > > > > Subject: [ceph-users] Shadow Files > > > > > > > > We are still experiencing a problem with out gateway not properly > > > > clearing out shadow files. 
> > > > > > > > I have done numerous tests where I have: > > > > -Uploaded a file of 1.5GB in size using s3browser application > > > > -Done an object stat on the file to get its prefix > > > > -Done rados ls -p .rgw.buckets | grep to count the number of > > > > shadow files associated (in this case it is around 290 shadow files) > > > > -Deleted said file with s3browser > > > > -Performed a gc list, which shows the ~290 files listed > > > > -Waited 24 hours to redo the rados ls -p .rgw.buckets | grep > > > > to > > > > recount the shadow files only to be left with 290 files still there > > > > > > > > From log output /var/log/ceph/radosgw.log, I can see the following when > > > > clicking DELETE (this appears 290 times) > > > > 2015-04-24 10:43:29.996523 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=4718592 stripe_ofs=4718592 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996557 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=8912896 stripe_ofs=8912896 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996564 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=13107200 stripe_ofs=13107200 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996570 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=17301504 stripe_ofs=17301504 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996576 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=21495808 stripe_ofs=21495808 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996581 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=25690112 stripe_ofs=25690112 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996586 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=29884416 stripe_ofs=29884416 part_ofs=0 rule->part_size=0 > > > > 2015-04-24 10:43:29.996592 7f0b0afb5700 0 RGWObjManifest::operator++(): > > > > result: ofs=34078720 stripe_ofs=34078720 part_ofs=0 rule->part_size=0 > > > > > > > > In this same log, I also see the gc process saying it is removing said > > > > file (these records appear 290 times too) > > > > 2015-04-23 14:16:27.926952 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.928572 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.929636 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.930448 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.931226 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.932103 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > 2015-04-23 14:16:27.933470 7f15be0ee700 0 gc::process: removing > > > > .rgw.buckets: > > > > > > > > So even though it appears that the GC is processing its removal, the > > > > shadow files remain! > > > > > > > > Please help! > > > > ___
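For anyone who wants to reproduce the check Ben describes (count the stripe/shadow objects behind one S3 object before and after deletion and GC), a rough sketch is below. Bucket and object names are placeholders, and the rados ls over .rgw.buckets is exactly the expensive full-pool scan mentioned above, so only do this on a test object:

# 1. find the tag/prefix radosgw uses for the object's stripes
radosgw-admin object stat --bucket=mybucket --object=bigfile.bin | grep -Ei 'tag|prefix'

# 2. count shadow objects carrying that prefix (slow on large pools)
rados ls -p .rgw.buckets | grep '<prefix-from-step-1>' | wc -l

# 3. delete the object via S3, then confirm it is queued for garbage collection
radosgw-admin gc list --include-all | grep '<prefix-from-step-1>'

# 4. optionally force a GC pass, then re-run step 2; with the bugs referenced
#    above (issues 10295 and 11447) the count may not drop until the fixes land
radosgw-admin gc process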
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
We're currently putting data into our cephfs pool (cachepool in front of it as a caching tier), but the metadata pool contains ~50MB of data for 36 million files. If that were an accurate estimation, we'd have a metadata pool closer to ~140GB. Here is a ceph df detail: http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt I'm not saying it won't get larger, I have no idea of the code behind it. This is just what it happens to be for us. -- Adam On Sat, Apr 25, 2015 at 11:29 AM, François Lafont wrote: > Thanks Greg and Steffen for your answer. I will make some tests. > > Gregory Farnum wrote: > >> Yeah. The metadata pool will contain: >> 1) MDS logs, which I think by default will take up to 200MB per >> logical MDS. (You should have only one logical MDS.) >> 2) directory metadata objects, which contain the dentries and inodes >> of the system; ~4KB is probably generous for each? > > So one file in the cephfs generates one inode of ~4KB in the > "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB > gives me an (approximative) estimation of the amount of data in the > "metadata" pool? > >> 3) Some smaller data structures about the allocated inode range and >> current client sessions. >> >> The data pool contains all of the file data. Presumably this is much >> larger, but it will depend on your average file size and we've not >> done any real study of it. > > -- > François Lafont > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
I'm able to reach around 20000-25000 iops with 4k blocks with the s3500 (with o_dsync) (so yes, around 80-100 MB/s).

I'll bench the new s3610 soon to compare.


- Original Message -
From: "Anthony Levesque"
To: "Christian Balzer"
Cc: "ceph-users"
Sent: Friday, April 24, 2015 22:00:44
Subject: Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)

Hi Christian,

We tested some DC S3500 300GB using dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync

we got 96 MB/s which is far from the 315 MB/s from the website.

Can I ask you or anyone on the mailing list how you are testing the write speed for journals?

Thanks
---
Anthony Lévesque
GloboTech Communications
Phone: 1-514-907-0050 x 208
Toll Free: 1-(888)-GTCOMM1 x 208
Phone Urgency: 1-(514) 907-0047
1-(866)-500-1555
Fax: 1-(514)-907-0750
aleves...@gtcomm.net
http://www.gtcomm.net


On Apr 23, 2015, at 9:05 PM, Christian Balzer < ch...@gol.com > wrote:

Hello,

On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote:

BQ_BEGIN
To update you on the current tests in our lab:

1. We tested the Samsung OSDs in recovery mode and the speed was able to max out 2x 10GbE ports (transferring data at 2200+ MB/s during recovery). So for normal write operation without O_DSYNC writes the Samsung drives seem ok.

2. We then tested a couple of different models of SSD we had in stock with the following command:

dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync

This was from a blog written by Sebastien Han and I think should be able to show how the drives would perform in O_DSYNC writes. For people interested, here are some results of what we tested:

Intel DC S3500 120GB = 114 MB/s
Samsung Pro 128GB = 2.4 MB/s
WD Black 1TB (HDD) = 409 KB/s
Intel 330 120GB = 105 MB/s
Intel 520 120GB = 9.4 MB/s
Intel 335 80GB = 9.4 MB/s
Samsung EVO 1TB = 2.5 MB/s
Intel 320 120GB = 78 MB/s
OCZ Revo Drive 240GB = 60.8 MB/s
4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s
BQ_END

No real surprises here, but a nice summary nonetheless.

You _really_ want to avoid consumer SSDs for journals and have a good idea of how much data you'll write per day and how long you expect your SSDs to last (the TBW/$ ratio).

BQ_BEGIN
Please let us know if the command we ran was not optimal to test O_DSYNC writes.

We ordered larger drives from the Intel DC series to see if we could get more than 200 MB/s per SSD. We will keep you posted on tests if that interests you guys. We didn't test multiple parallel runs yet (to simulate multiple journals on one SSD).
BQ_END

You can totally trust the numbers on Intel's site:
http://ark.intel.com/products/family/83425/Data-Center-SSDs

The S3500s are by far the slowest and have the lowest endurance. Again, depending on your expected write level the S3610 or S3700 models are going to be a better fit regarding price/performance. Especially when you consider that losing a journal SSD will result in several dead OSDs.

BQ_BEGIN
3. We removed the journal from all Samsung OSDs and put 2x Intel 330 120GB on all 6 nodes to test. The overall speed we were getting from the rados bench went from 1000 MB/s (approx.) to 450 MB/s, which might only be because the Intels cannot do too much in terms of journaling (they were tested at around 100 MB/s). It will be interesting to test with bigger Intel DC S3500 drives (and more journals) per node to see if I can get back up to 1000 MB/s or even surpass it.

We also wanted to test if the CPU could be a huge bottleneck, so we swapped the dual E5-2620v2 from node #6 and replaced them with dual E5-2609v2 (which are much smaller in cores and speed), and the 450 MB/s we got from the rados bench went even lower, to 180 MB/s.
BQ_END

You really don't have to swap CPUs around, monitor things with atop or other tools to see where your bottlenecks are.

BQ_BEGIN
So I'm wondering if the 1000 MB/s we got when the journal was shared on the OSD SSD was not limited by the CPUs (even though the Samsungs are not good for journals in the long run) and not just by the fact that Samsung SSDs are bad at O_DSYNC writes (or maybe both). It is probable that 16 SSD OSDs per node in a full SSD cluster is too much and the major bottleneck will be the CPU.
BQ_END

That's what I kept saying. ^.^

BQ_BEGIN
4. I'm wondering, if we find good SSDs for the journal and keep the Samsungs for normal writes and reads (we can saturate 20GbE easily with a read benchmark; we will test 40GbE soon), whether the cluster will stay healthy, since the Samsungs seem to get burnt by O_DSYNC writes.
BQ_END

They will get burned, as in have their cells worn out by any and all writes.

BQ_BEGIN
5. In terms of HBA controllers, did you guys make any tests for a full SSD cluster or even just for SSD journals?
BQ_END

If you have separate journals and OSDs, it often makes good sense to have them on separate controllers as well. It all depends on density of your setup and capabilities of the control
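Picking up the journal-testing question from this thread: the same O_DSYNC-style workload can also be generated with fio, which reports IOPS and latency directly. This is only a sketch; /dev/sdX is a placeholder, the test writes to the raw device so run it only on a drive you can wipe, and note that --sync=1 uses O_SYNC, which is close to but not identical to O_DSYNC:

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

Raising --numjobs to 4 or more approximates several journals sharing one SSD, which is the "multiple parallel test" mentioned above.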
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
That doesn't make sense -- 50MB for 36 million files is <1.5 bytes each. How do you have things configured, exactly? On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote: > We're currently putting data into our cephfs pool (cachepool in front > of it as a caching tier), but the metadata pool contains ~50MB of data > for 36 million files. If that were an accurate estimation, we'd have a > metadata pool closer to ~140GB. Here is a ceph df detail: > > http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt > > I'm not saying it won't get larger, I have no idea of the code behind > it. This is just what it happens to be for us. > -- > Adam > > > On Sat, Apr 25, 2015 at 11:29 AM, François Lafont > wrote: > > Thanks Greg and Steffen for your answer. I will make some tests. > > > > Gregory Farnum wrote: > > > >> Yeah. The metadata pool will contain: > >> 1) MDS logs, which I think by default will take up to 200MB per > >> logical MDS. (You should have only one logical MDS.) > >> 2) directory metadata objects, which contain the dentries and inodes > >> of the system; ~4KB is probably generous for each? > > > > So one file in the cephfs generates one inode of ~4KB in the > > "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB > > gives me an (approximative) estimation of the amount of data in the > > "metadata" pool? > > > >> 3) Some smaller data structures about the allocated inode range and > >> current client sessions. > >> > >> The data pool contains all of the file data. Presumably this is much > >> larger, but it will depend on your average file size and we've not > >> done any real study of it. > > > > -- > > François Lafont > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
cephfs (really ec84pool) is an ec pool (k=8 m=4), cachepool is a writeback cachetier in front of ec84pool. As far as I know, we've not done any strange configuration. Potentially relevant configuration details: ceph osd crush dump > http://people.beocat.cis.ksu.edu/~mozes/ceph/crush_dump.txt ceph osd pool ls detail > http://people.beocat.cis.ksu.edu/~mozes/ceph/pool_ls_detail.txt ceph mds dump > http://people.beocat.cis.ksu.edu/~mozes/ceph/mds_dump.txt getfattr -d -m '.*' /tmp/cephfs > http://people.beocat.cis.ksu.edu/~mozes/ceph/getfattr_cephfs.txt rsync is ongoing, moving data into cephfs. It would seem the data is truly there, both with metadata and file data. md5sums match for files that I've tested. -- Adam On Sat, Apr 25, 2015 at 12:16 PM, Gregory Farnum wrote: > That doesn't make sense -- 50MB for 36 million files is <1.5 bytes each. How > do you have things configured, exactly? > > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote: >> >> We're currently putting data into our cephfs pool (cachepool in front >> of it as a caching tier), but the metadata pool contains ~50MB of data >> for 36 million files. If that were an accurate estimation, we'd have a >> metadata pool closer to ~140GB. Here is a ceph df detail: >> >> http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt >> >> I'm not saying it won't get larger, I have no idea of the code behind >> it. This is just what it happens to be for us. >> -- >> Adam >> >> >> On Sat, Apr 25, 2015 at 11:29 AM, François Lafont >> wrote: >> > Thanks Greg and Steffen for your answer. I will make some tests. >> > >> > Gregory Farnum wrote: >> > >> >> Yeah. The metadata pool will contain: >> >> 1) MDS logs, which I think by default will take up to 200MB per >> >> logical MDS. (You should have only one logical MDS.) >> >> 2) directory metadata objects, which contain the dentries and inodes >> >> of the system; ~4KB is probably generous for each? >> > >> > So one file in the cephfs generates one inode of ~4KB in the >> > "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB >> > gives me an (approximative) estimation of the amount of data in the >> > "metadata" pool? >> > >> >> 3) Some smaller data structures about the allocated inode range and >> >> current client sessions. >> >> >> >> The data pool contains all of the file data. Presumably this is much >> >> larger, but it will depend on your average file size and we've not >> >> done any real study of it. >> > >> > -- >> > François Lafont >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
That's odd -- I almost want to think the pg statistics reporting is going wrong somehow. ...I bet the leveldb/omap stuff isn't being included in the of statistics. That could be why and would make sense with what you've got here. :) -Greg On Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote: > cephfs (really ec84pool) is an ec pool (k=8 m=4), cachepool is a > writeback cachetier in front of ec84pool. As far as I know, we've not > done any strange configuration. > > Potentially relevant configuration details: > ceph osd crush dump > > http://people.beocat.cis.ksu.edu/~mozes/ceph/crush_dump.txt > ceph osd pool ls detail > > http://people.beocat.cis.ksu.edu/~mozes/ceph/pool_ls_detail.txt > ceph mds dump > http://people.beocat.cis.ksu.edu/~mozes/ceph/mds_dump.txt > getfattr -d -m '.*' /tmp/cephfs > > http://people.beocat.cis.ksu.edu/~mozes/ceph/getfattr_cephfs.txt > > rsync is ongoing, moving data into cephfs. It would seem the data is > truly there, both with metadata and file data. md5sums match for files > that I've tested. > -- > Adam > > On Sat, Apr 25, 2015 at 12:16 PM, Gregory Farnum wrote: > > That doesn't make sense -- 50MB for 36 million files is <1.5 bytes each. > How > > do you have things configured, exactly? > > > > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote: > >> > >> We're currently putting data into our cephfs pool (cachepool in front > >> of it as a caching tier), but the metadata pool contains ~50MB of data > >> for 36 million files. If that were an accurate estimation, we'd have a > >> metadata pool closer to ~140GB. Here is a ceph df detail: > >> > >> http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt > >> > >> I'm not saying it won't get larger, I have no idea of the code behind > >> it. This is just what it happens to be for us. > >> -- > >> Adam > >> > >> > >> On Sat, Apr 25, 2015 at 11:29 AM, François Lafont > >> wrote: > >> > Thanks Greg and Steffen for your answer. I will make some tests. > >> > > >> > Gregory Farnum wrote: > >> > > >> >> Yeah. The metadata pool will contain: > >> >> 1) MDS logs, which I think by default will take up to 200MB per > >> >> logical MDS. (You should have only one logical MDS.) > >> >> 2) directory metadata objects, which contain the dentries and inodes > >> >> of the system; ~4KB is probably generous for each? > >> > > >> > So one file in the cephfs generates one inode of ~4KB in the > >> > "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB > >> > gives me an (approximative) estimation of the amount of data in the > >> > "metadata" pool? > >> > > >> >> 3) Some smaller data structures about the allocated inode range and > >> >> current client sessions. > >> >> > >> >> The data pool contains all of the file data. Presumably this is much > >> >> larger, but it will depend on your average file size and we've not > >> >> done any real study of it. > >> > > >> > -- > >> > François Lafont > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
Probably the case. I've check a 10% of the objects in the metadata pool (rbd -p metadata stat $objname). They've all been 0 byte objects. Most of them have 1-10 omapvals usually 408 bytes each. Based on the usage of the other pools on the SSDs, that comes out to about ~46GB of omap/leveldb stuff. Assuming all of that usage is for the metadata, it comes out to ~1.4KB per file. Still *much* less than the 4K estimate, but probably more reasonable than a few bytes per file :). -- Adam On Sat, Apr 25, 2015 at 1:03 PM, Gregory Farnum wrote: > That's odd -- I almost want to think the pg statistics reporting is going > wrong somehow. > ...I bet the leveldb/omap stuff isn't being included in the of statistics. > That could be why and would make sense with what you've got here. :) > -Greg > On Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote: >> >> cephfs (really ec84pool) is an ec pool (k=8 m=4), cachepool is a >> writeback cachetier in front of ec84pool. As far as I know, we've not >> done any strange configuration. >> >> Potentially relevant configuration details: >> ceph osd crush dump > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/crush_dump.txt >> ceph osd pool ls detail > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/pool_ls_detail.txt >> ceph mds dump > http://people.beocat.cis.ksu.edu/~mozes/ceph/mds_dump.txt >> getfattr -d -m '.*' /tmp/cephfs > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/getfattr_cephfs.txt >> >> rsync is ongoing, moving data into cephfs. It would seem the data is >> truly there, both with metadata and file data. md5sums match for files >> that I've tested. >> -- >> Adam >> >> On Sat, Apr 25, 2015 at 12:16 PM, Gregory Farnum wrote: >> > That doesn't make sense -- 50MB for 36 million files is <1.5 bytes each. >> > How >> > do you have things configured, exactly? >> > >> > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart wrote: >> >> >> >> We're currently putting data into our cephfs pool (cachepool in front >> >> of it as a caching tier), but the metadata pool contains ~50MB of data >> >> for 36 million files. If that were an accurate estimation, we'd have a >> >> metadata pool closer to ~140GB. Here is a ceph df detail: >> >> >> >> http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt >> >> >> >> I'm not saying it won't get larger, I have no idea of the code behind >> >> it. This is just what it happens to be for us. >> >> -- >> >> Adam >> >> >> >> >> >> On Sat, Apr 25, 2015 at 11:29 AM, François Lafont >> >> wrote: >> >> > Thanks Greg and Steffen for your answer. I will make some tests. >> >> > >> >> > Gregory Farnum wrote: >> >> > >> >> >> Yeah. The metadata pool will contain: >> >> >> 1) MDS logs, which I think by default will take up to 200MB per >> >> >> logical MDS. (You should have only one logical MDS.) >> >> >> 2) directory metadata objects, which contain the dentries and inodes >> >> >> of the system; ~4KB is probably generous for each? >> >> > >> >> > So one file in the cephfs generates one inode of ~4KB in the >> >> > "metadata" pool, correct? So that (number-of-files-in-cephfs) x 4KB >> >> > gives me an (approximative) estimation of the amount of data in the >> >> > "metadata" pool? >> >> > >> >> >> 3) Some smaller data structures about the allocated inode range and >> >> >> current client sessions. >> >> >> >> >> >> The data pool contains all of the file data. Presumably this is much >> >> >> larger, but it will depend on your average file size and we've not >> >> >> done any real study of it. 
>> >> > >> >> > -- >> >> > François Lafont >> >> > ___ >> >> > ceph-users mailing list >> >> > ceph-users@lists.ceph.com >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> ___ >> >> ceph-users mailing list >> >> ceph-users@lists.ceph.com >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
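For anyone repeating Adam's measurement, the per-object check can be done with the rados tool against the metadata pool ("metadata" and <objname> are placeholders for your pool and an object from it); the omap listing is the part that the pool statistics apparently do not account for:

# pick an object name from the metadata pool
rados -p metadata ls | head -5

# payload size of the object itself (expected to be 0 bytes)
rados -p metadata stat <objname>

# the dentries/inodes live in omap key/value pairs
rados -p metadata listomapkeys <objname> | wc -l
rados -p metadata listomapvals <objname> | head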
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
On Sat, 25 Apr 2015, Gregory Farnum wrote: > That's odd -- I almost want to think the pg statistics reporting is going > wrong somehow. > ...I bet the leveldb/omap stuff isn't being included in the of statistics. > That could be why and would make sense with what you've got here. :) Yeah, the pool stats sum up bytes and objects, but not keys (or key sizes). We should probably expand the stats struct to include uint64_t kv; // key/value pairs uint64_t kv_bytes; // key/value bytes (key + value length) sage > -GregOn Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote: > cephfs (really ec84pool) is an ec pool (k=8 m=4), cachepool is a > writeback cachetier in front of ec84pool. As far as I know, > we've not > done any strange configuration. > > Potentially relevant configuration details: > ceph osd crush dump > > http://people.beocat.cis.ksu.edu/~mozes/ceph/crush_dump.txt > ceph osd pool ls detail > > http://people.beocat.cis.ksu.edu/~mozes/ceph/pool_ls_detail.txt > ceph mds dump > > http://people.beocat.cis.ksu.edu/~mozes/ceph/mds_dump.txt > getfattr -d -m '.*' /tmp/cephfs > > http://people.beocat.cis.ksu.edu/~mozes/ceph/getfattr_cephfs.txt > > rsync is ongoing, moving data into cephfs. It would seem the > data is > truly there, both with metadata and file data. md5sums match for > files > that I've tested. > -- > Adam > > On Sat, Apr 25, 2015 at 12:16 PM, Gregory Farnum >wrote: > > That doesn't make sense -- 50MB for 36 million files is <1.5 > bytes each. How > > do you have things configured, exactly? > > > > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart >wrote: > >> > >> We're currently putting data into our cephfs pool (cachepool > in front > >> of it as a caching tier), but the metadata pool contains > ~50MB of data > >> for 36 million files. If that were an accurate estimation, > we'd have a > >> metadata pool closer to ~140GB. Here is a ceph df detail: > >> > >> http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt > >> > >> I'm not saying it won't get larger, I have no idea of the > code behind > >> it. This is just what it happens to be for us. > >> -- > >> Adam > >> > >> > >> On Sat, Apr 25, 2015 at 11:29 AM, François Lafont > > >> wrote: > >> > Thanks Greg and Steffen for your answer. I will make some > tests. > >> > > >> > Gregory Farnum wrote: > >> > > >> >> Yeah. The metadata pool will contain: > >> >> 1) MDS logs, which I think by default will take up to > 200MB per > >> >> logical MDS. (You should have only one logical MDS.) > >> >> 2) directory metadata objects, which contain the dentries > and inodes > >> >> of the system; ~4KB is probably generous for each? > >> > > >> > So one file in the cephfs generates one inode of ~4KB in > the > >> > "metadata" pool, correct? So that > (number-of-files-in-cephfs) x 4KB > >> > gives me an (approximative) estimation of the amount of > data in the > >> > "metadata" pool? > >> > > >> >> 3) Some smaller data structures about the allocated inode > range and > >> >> current client sessions. > >> >> > >> >> The data pool contains all of the file data. Presumably > this is much > >> >> larger, but it will depend on your average file size and > we've not > >> >> done any real study of it. 
> >> > > >> > -- > >> > François Lafont > >> > ___ > >> > ceph-users mailing list > >> > ceph-users@lists.ceph.com > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> ___ > >> ceph-users mailing list > >> ceph-users@lists.ceph.com > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cephfs: proportion of data between data pool and metadata pool
Yeah -- as I said, 4KB was a generous number. It's going to vary some though, based on the actual length of the names you're using, whether you have symlinks or hard links, snapshots, etc. -Greg On Sat, Apr 25, 2015 at 11:34 AM Adam Tygart wrote: > Probably the case. I've check a 10% of the objects in the metadata > pool (rbd -p metadata stat $objname). They've all been 0 byte objects. > Most of them have 1-10 omapvals usually 408 bytes each. > > Based on the usage of the other pools on the SSDs, that comes out to > about ~46GB of omap/leveldb stuff. Assuming all of that usage is for > the metadata, it comes out to ~1.4KB per file. Still *much* less than > the 4K estimate, but probably more reasonable than a few bytes per > file :). > > -- > Adam > > On Sat, Apr 25, 2015 at 1:03 PM, Gregory Farnum wrote: > > That's odd -- I almost want to think the pg statistics reporting is going > > wrong somehow. > > ...I bet the leveldb/omap stuff isn't being included in the of > statistics. > > That could be why and would make sense with what you've got here. :) > > -Greg > > On Sat, Apr 25, 2015 at 10:32 AM Adam Tygart wrote: > >> > >> cephfs (really ec84pool) is an ec pool (k=8 m=4), cachepool is a > >> writeback cachetier in front of ec84pool. As far as I know, we've not > >> done any strange configuration. > >> > >> Potentially relevant configuration details: > >> ceph osd crush dump > > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/crush_dump.txt > >> ceph osd pool ls detail > > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/pool_ls_detail.txt > >> ceph mds dump > > http://people.beocat.cis.ksu.edu/~mozes/ceph/mds_dump.txt > >> getfattr -d -m '.*' /tmp/cephfs > > >> http://people.beocat.cis.ksu.edu/~mozes/ceph/getfattr_cephfs.txt > >> > >> rsync is ongoing, moving data into cephfs. It would seem the data is > >> truly there, both with metadata and file data. md5sums match for files > >> that I've tested. > >> -- > >> Adam > >> > >> On Sat, Apr 25, 2015 at 12:16 PM, Gregory Farnum > wrote: > >> > That doesn't make sense -- 50MB for 36 million files is <1.5 bytes > each. > >> > How > >> > do you have things configured, exactly? > >> > > >> > On Sat, Apr 25, 2015 at 9:32 AM Adam Tygart > wrote: > >> >> > >> >> We're currently putting data into our cephfs pool (cachepool in front > >> >> of it as a caching tier), but the metadata pool contains ~50MB of > data > >> >> for 36 million files. If that were an accurate estimation, we'd have > a > >> >> metadata pool closer to ~140GB. Here is a ceph df detail: > >> >> > >> >> http://people.beocat.cis.ksu.edu/~mozes/ceph_df_detail.txt > >> >> > >> >> I'm not saying it won't get larger, I have no idea of the code behind > >> >> it. This is just what it happens to be for us. > >> >> -- > >> >> Adam > >> >> > >> >> > >> >> On Sat, Apr 25, 2015 at 11:29 AM, François Lafont < > flafdiv...@free.fr> > >> >> wrote: > >> >> > Thanks Greg and Steffen for your answer. I will make some tests. > >> >> > > >> >> > Gregory Farnum wrote: > >> >> > > >> >> >> Yeah. The metadata pool will contain: > >> >> >> 1) MDS logs, which I think by default will take up to 200MB per > >> >> >> logical MDS. (You should have only one logical MDS.) > >> >> >> 2) directory metadata objects, which contain the dentries and > inodes > >> >> >> of the system; ~4KB is probably generous for each? > >> >> > > >> >> > So one file in the cephfs generates one inode of ~4KB in the > >> >> > "metadata" pool, correct? 
So that (number-of-files-in-cephfs) x 4KB > >> >> > gives me an (approximative) estimation of the amount of data in the > >> >> > "metadata" pool? > >> >> > > >> >> >> 3) Some smaller data structures about the allocated inode range > and > >> >> >> current client sessions. > >> >> >> > >> >> >> The data pool contains all of the file data. Presumably this is > much > >> >> >> larger, but it will depend on your average file size and we've not > >> >> >> done any real study of it. > >> >> > > >> >> > -- > >> >> > François Lafont > >> >> > ___ > >> >> > ceph-users mailing list > >> >> > ceph-users@lists.ceph.com > >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> ___ > >> >> ceph-users mailing list > >> >> ceph-users@lists.ceph.com > >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hello, I think that the dd test isn't a 100% replica of what Ceph actually does then. My suspicion would be the 4k blocks, since when people test the maximum bandwidth they do it with rados bench or other tools that write the optimum sized "blocks" for Ceph, 4MB ones. I currently have no unused DC S3700s to do a realistic comparison and the DC S3500 I have aren't used in any Ceph environment. When testing a 200GB DC S3700 that has specs of 35K write IOPS and 365MB/s sequential writes on mostly idle system (but on top of Ext4, not the raw device) with a 4k dd dsync test run, atop and iostat show a 70% SSD utilization, 30k IOPS and 70MB/s writes. Which matches the specs perfectly. If I do that test with 4MB blocks, the speed goes up to 330MB/s and 90% SSD utilization according to atop, again on par with the specs. Lastly on existing Ceph clusters with DC S3700 SSDs as journals and rados bench and its 4MB default size that pattern continues. Smaller sizes with rados naturally (at least on my hardware and Ceph version, Firefly) run into the limitations of Ceph long before they hit the SSDs (nearly 100% busy cores, journals at 4-8%, OSD HDDs anywhere from 50-100%). Of course using the same dd test over all brands will still give you a good comparison of the SSDs capabilities. But translating that into actual Ceph journal performance is another thing. Christian On Sat, 25 Apr 2015 18:32:30 +0200 (CEST) Alexandre DERUMIER wrote: > I'm able to reach around 2-25000iops with 4k block with s3500 (with > o_dsync) (so yes, around 80-100MB/S). > > I'l bench new s3610 soon to compare. > > > - Mail original - > De: "Anthony Levesque" > À: "Christian Balzer" > Cc: "ceph-users" > Envoyé: Vendredi 24 Avril 2015 22:00:44 > Objet: Re: [ceph-users] Possible improvements for a slow write > speed (excluding independent SSD journals) > > Hi Christian, > > We tested some DC S3500 300GB using dd if=randfile of=/dev/sda bs=4k > count=10 oflag=direct,dsync > > we got 96 MB/s which is far from the 315 MB/s from the website. > > Can I ask you or anyone on the mailing list how you are testing the > write speed for journals? > > Thanks > --- > Anthony Lévesque > GloboTech Communications > Phone: 1-514-907-0050 x 208 > Toll Free: 1-(888)-GTCOMM1 x 208 > Phone Urgency: 1-(514) 907-0047 > 1-(866)-500-1555 > Fax: 1-(514)-907-0750 > aleves...@gtcomm.net > http://www.gtcomm.net > > > > > On Apr 23, 2015, at 9:05 PM, Christian Balzer < ch...@gol.com > wrote: > > > Hello, > > On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote: > > > BQ_BEGIN > To update you on the current test in our lab: > > 1.We tested the Samsung OSD in Recovery mode and the speed was able to > maxout 2x 10GbE port(transferring data at 2200+ MB/s during recovery). > So for normal write operation without O_DSYNC writes Samsung drives seem > ok. > > 2.We then tested a couple of different model of SSD we had in stock with > the following command: > > dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync > > This was from a blog written by Sebastien Han and I think should be able > to show how the drives would perform in O_DSYNC writes. 
For people > interested in some result of what we tested here they are: > > Intel DC S3500 120GB = 114 MB/s > Samsung Pro 128GB = 2.4 MB/s > WD Black 1TB (HDD) = 409 KB/s > Intel 330 120GB = 105 MB/s > Intel 520 120GB = 9.4 MB/s > Intel 335 80GB = 9.4 MB/s > Samsung EVO 1TB = 2.5 MB/s > Intel 320 120GB = 78 MB/s > OCZ Revo Drive 240GB = 60.8 MB/s > 4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s > > > > No real surprises here, but a nice summary nonetheless. > > You _really_ want to avoid consumer SSDs for journals and have a good > idea on how much data you'll write per day and how long you expect your > SSDs to last (the TBW/$ ratio). > > > BQ_BEGIN > Please let us know if the command we ran was not optimal to test O_DSYNC > writes > > We order larger drive from Intel DC series to see if we could get more > than 200 MB/s per SSD. We will keep you posted on tests if that > interested you guys. We dint test multiple parallel test yet (to > simulate multiple journal on one SSD). > > > BQ_END > You can totally trust the numbers on Intel's site: > http://ark.intel.com/products/family/83425/Data-Center-SSDs > > The S3500s are by far the slowest and have the lowest endurance. > Again, depending on your expected write level the S3610 or S3700 models > are going to be a better fit regarding price/performance. > Especially when you consider that loosing a journal SSD will result in > several dead OSDs. > > > BQ_BEGIN > 3.We remove the Journal from all Samsung OSD and put 2x Intel 330 120GB > on all 6 Node to test. The overall speed we were getting from the rados > bench went from 1000 MB/s(approx.) to 450 MB/s which might only be > because the intel cannot do too much in term of journaling (was tested > at around 100
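To reproduce the block-size effect described here, rados bench can be run with both its default 4MB writes and 4KB ones against a throwaway pool (pool name, runtime and thread count are examples only):

# 4MB writes (the default), roughly the streaming pattern journals see
rados bench -p benchpool 60 write -t 16

# the same run with 4KB writes; per-op overhead in Ceph tends to become
# the limit long before a decent journal SSD does
rados bench -p benchpool 60 write -b 4096 -t 16

By default the write test cleans up its benchmark objects when it finishes; add --no-cleanup only if you want to follow up with seq/rand read tests.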
[ceph-users] CephFs - Ceph-fuse Client Read Performance During Cache Tier Flushing
Hi,

I was doing some testing on an erasure-coded CephFS cluster. The cluster is running the giant 0.87.1 release.

Cluster info:
15 * 36-drive nodes (journal on the same OSD)
3 * 4-drive SSD cache nodes (Intel DC S3500)
3 * MON/MDS
EC 10+3
10G Ethernet for private and cluster network

We got approx. 55MB/s read transfer speed using the ceph-fuse client when the data was available on the cache tier (cold storage was empty). When I tried to add more data, ceph started flushing data from the cache tier to cold storage. During flushing, cluster read speed dropped to approx. 100 KB/s. But I still got 50 - 55MB/s write transfer speed during flushing from multiple simultaneous ceph-fuse clients (1G Ethernet).

I think there is an issue with data migration from cold storage to the cache tier during ceph-fuse client reads. Am I hitting any known issue/bug, or is there an issue with my cluster? I used big video files (approx. 5 GB to 10 GB) for this testing.

Any help?

Cheers
K.Mohamed Pakkeer
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
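One thing that may be worth checking in this situation is how full the cache tier is allowed to get before it starts flushing and evicting; if demotion only kicks in when the tier is nearly full, client reads compete with a large burst of flush traffic. A sketch of the relevant pool settings (values are examples, not recommendations, and "cachepool" is a placeholder for the cache tier pool name):

# start flushing dirty objects earlier, before the tier is under pressure
ceph osd pool set cachepool cache_target_dirty_ratio 0.4

# start evicting clean objects before the tier is completely full
ceph osd pool set cachepool cache_target_full_ratio 0.8

# the ratios are relative to this ceiling, so it has to be set
ceph osd pool set cachepool target_max_bytes 1099511627776   # 1 TiB, example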
[ceph-users] IOWait on SATA-backed with SSD-journals
Hi, With inspiration from all the other performance threads going on here, I started to investigate on my own as well. I’m seeing a lot iowait on the OSD, and the journal utilised at 2-7%, with about 8-30MB/s (mostly around 8MB/s write). This is a dumpling cluster. The goal here is to increase the utilisation to maybe 50%. Journals: Intel DC S3700, OSD: HGST 4TB I did some initial testing to make the wbthrottle have more in the buffer, and I think I managed to do it, didn’t affect the journal utilisation though. There’s 12 cores for the 10 OSDs per machine to utilise, and they use about 20% of them, so I guess no bottle neck there. Well that’s the problem, I really can’t see any bottleneck with the current layout, maybe it’s out copper 10Gb that’s giving us too much latency? It would be fancy with some kind of bottle-neck troubleshoot in ceph docs :) I’m guessing I’m not the only one on these kinds of specs and would be interesting to see if there’s optimisation to be done. Hope you guys have a nice weekend :) Cheers, Josef Ping from a host to OSD: 6 packets transmitted, 6 received, 0% packet loss, time 4998ms rtt min/avg/max/mdev = 0.063/0.107/0.193/0.048 ms Setting on the OSD { "filestore_wbthrottle_xfs_ios_start_flusher": "5000"} { "filestore_wbthrottle_xfs_inodes_start_flusher": "5000"} { "filestore_wbthrottle_xfs_ios_hard_limit": "1"} { "filestore_wbthrottle_xfs_inodes_hard_limit": "1"} { "filestore_max_sync_interval": "30”} From the standard { "filestore_wbthrottle_xfs_ios_start_flusher": "500"} { "filestore_wbthrottle_xfs_inodes_start_flusher": "500"} { "filestore_wbthrottle_xfs_ios_hard_limit": “5000"} { "filestore_wbthrottle_xfs_inodes_hard_limit": “5000"} { "filestore_max_sync_interval": “5”} a single dump_historic_ops { "description": "osd_op(client.47765822.0:99270434 rbd_data.1da982c2eb141f2.5825 [stat,write 2093056~8192] 3.8130048c e19290)", "rmw_flags": 6, "received_at": "2015-04-26 08:24:03.226255", "age": "87.026653", "duration": "0.801927", "flag_point": "commit sent; apply or cleanup", "client_info": { "client": "client.47765822", "tid": 99270434}, "events": [ { "time": "2015-04-26 08:24:03.226329", "event": "waiting_for_osdmap"}, { "time": "2015-04-26 08:24:03.230921", "event": "reached_pg"}, { "time": "2015-04-26 08:24:03.230928", "event": "started"}, { "time": "2015-04-26 08:24:03.230931", "event": "started"}, { "time": "2015-04-26 08:24:03.231791", "event": "waiting for subops from [22,48]"}, { "time": "2015-04-26 08:24:03.231813", "event": "commit_queued_for_journal_write"}, { "time": "2015-04-26 08:24:03.231849", "event": "write_thread_in_journal_buffer"}, { "time": "2015-04-26 08:24:03.232075", "event": "journaled_completion_queued"}, { "time": "2015-04-26 08:24:03.232492", "event": "op_commit"}, { "time": "2015-04-26 08:24:03.233134", "event": "sub_op_commit_rec"}, { "time": "2015-04-26 08:24:03.233183", "event": "op_applied"}, { "time": "2015-04-26 08:24:04.028167", "event": "sub_op_commit_rec"}, { "time": "2015-04-26 08:24:04.028174", "event": "commit_sent"}, { "time": "2015-04-26 08:24:04.028182", "event": "done"}]}, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
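For readers following the same trail: the settings and the historic-ops dump quoted above come from the OSD admin socket, and the commands below are one way to collect them (socket path and OSD id are examples; injectargs changes are lost on restart unless also written to ceph.conf):

# current filestore/wbthrottle values on a running OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'wbthrottle|max_sync_interval'

# change one of them at runtime
ceph tell osd.0 injectargs '--filestore_max_sync_interval 30'

# the slowest recent ops, as in the dump_historic_ops example above
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

# per-device utilisation, to see whether journal or data disks are the ones waiting
iostat -x 5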