Re: [ceph-users] thanks for a double check on ceph's config
On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: Hello, > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > each), 64GB memory. No journal SSDs? What CPU(s) and network? > ceph version 10.2.0, Ubuntu 16.04 LTS > The whole cluster is new installed. > > Can you help check what the arguments we put in ceph.conf is reasonable > or not? > thanks. > > [osd] > osd_data = /var/lib/ceph/osd/ceph-$id > osd_journal_size = 2 Overkill most likely, but not an issue. > osd_mkfs_type = xfs > osd_mkfs_options_xfs = -f > filestore_xattr_use_omap = true > filestore_min_sync_interval = 10 Are you aware what this does and have you actually tested this (IOPS AND throughput) with various other setting on your hardware to arrive at this number? > filestore_max_sync_interval = 15 That's fine in and by itself, unlikely to ever be reached anyway. > filestore_queue_max_ops = 25000 > filestore_queue_max_bytes = 10485760 > filestore_queue_committing_max_ops = 5000 > filestore_queue_committing_max_bytes = 1048576 > journal_max_write_bytes = 1073714824 > journal_max_write_entries = 1 > journal_queue_max_ops = 5 > journal_queue_max_bytes = 1048576 Same as above, have you tested these setting (from filestore_queue_max_ops onward) compared to the defaults? With HDDs only I'd expect any benefits to be small and/or things to become very uneven once the HDDs are saturated. > osd_max_write_size = 512 > osd_client_message_size_cap = 2147483648 > osd_deep_scrub_stride = 131072 > osd_op_threads = 8 > osd_disk_threads = 4 > osd_map_cache_size = 1024 > osd_map_cache_bl_size = 128 > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" The nobarrier part is a a potential recipe for disaster unless you have all on-disk caches disabled and every other cache battery backed. The only devices I trust to mount nobarrier are SSDs with powercaps that have been proven to do the right thing (Intel DC S amongst them). > osd_recovery_op_priority = 4 > osd_recovery_max_active = 10 > osd_max_backfills = 4 > That's sane enough. > [client] > rbd_cache = true AFAIK that's the case with recent Ceph versions anyway. > rbd_cache_size = 268435456 Are you sure that you have 256MB per client to waste on RBD cache? If so, bully for you, but you might find that depending on your use case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be more beneficial. > rbd_cache_max_dirty = 134217728 > rbd_cache_max_dirty_age = 5 Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
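For anyone wanting to follow Christian's advice and actually compare those tuned values against the defaults before trusting them, a rough way to do it looks like the following (pool name and OSD id are only placeholders, run the daemon command on the host carrying that OSD):

    # running values on one OSD, via its admin socket
    ceph daemon osd.0 config show | grep -E 'filestore_queue|journal_queue|journal_max_write'
    # compiled-in defaults, ignoring the local ceph.conf
    ceph -c /dev/null --show-config | grep -E 'filestore_queue|journal_queue|journal_max_write'
    # then benchmark each variant against the same pool, e.g.
    rados bench -p rbd 60 write -t 32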
[ceph-users] journal or cache tier on SSDs ?
Hello, I'd like some advices about the setup of a new ceph cluster. Here the use case : RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most of the access will be in read only mode. Write access will only be done by the admin to update the datasets. We might use rbd some time to sync data as temp storage (when POSIX is needed) but performance will not be an issue here. We might use cephfs in the futur if that can replace a filesystem on rdb. We gonna start with 16 nodes (up to 24). The configuration of each node is : CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) Memory : 128GB OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) OSD Disk : 10 x HGST ultrastar-7k6000 6TB Public Network : 1 x 10Gb/s Private Network : 1 x 10Gb/s OS : Ubuntu 16.04 Ceph version : Jewel The question is : journal or cache tier (read only) on the SD 400GB Intel S3300 DC ? Each disk is able to write sequentially at 220MB/s. SSDs can write at ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the bottleneck (1GB/s vs 2GB/s). If we set the journal on OSDs, we can expect a good throughput in read on the disk (in case of data not in the cache) and write shouldn't be so bad too, even if we have random read on the OSD during the write ? SSDs as cache tier seem to be a better usage than only 5 journal on each ? Is that correct ? We gonna use an EC pool for big files (jerasure 8+2 I think) and a replicated pool for small files. If I check on http://ceph.com/pgcalc/, in this use case replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs Ec pool : pg_num = 4096 and pgp_num = pg_num Should I set the pg_num to 8192 or 16384 ? what is the impact on the cluster if we set the pg_num to 16384 at the beginning ? 16384 is high, isn't it ? Thanks for your help -- Yoann Moulin EPFL IC-IT ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
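For reference, the numbers above come from the usual rule of thumb (the pgcalc tool additionally weights each pool by its expected share of the data, so treat this as the generic formula only):

    pg_num ≈ (num_OSDs × target_PGs_per_OSD) / pool_size, rounded up to a power of two
    replicated (size 3), 240 OSDs, 100 PGs/OSD : 240 × 100 / 3  = 8000 -> 8192
    EC 8+2 (size 10),    240 OSDs, 100 PGs/OSD : 240 × 100 / 10 = 2400 -> 4096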
Re: [ceph-users] thanks for a double check on ceph's config
Hello Chris, We don't use SSD as journal. each host has one intel E5-2620 CPU which is 6 cores. the networking (both cluster and data networks) is 10Gbps. My further questions include, (1) osd_mkfs_type = xfs osd_mkfs_options_xfs = -f filestore_xattr_use_omap = true for XFS filesystem, we should not enable filestore_xattr_use_omap = true, is it? (2) filestore_queue_max_ops = 25000 filestore_queue_max_bytes = 10485760 filestore_queue_committing_max_ops = 5000 filestore_queue_committing_max_bytes = 1048576 journal_max_write_bytes = 1073714824 journal_max_write_entries = 1 journal_queue_max_ops = 5 journal_queue_max_bytes = 1048576 Since we don't have SSD as journals, all these setup are too large? what are the better values? (3) osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" What's your suggested options here? Thanks a lot. 2016-05-10 15:31 GMT+08:00 Christian Balzer : > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > Hello, > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > > each), 64GB memory. > No journal SSDs? > What CPU(s) and network? > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > The whole cluster is new installed. > > > > Can you help check what the arguments we put in ceph.conf is reasonable > > or not? > > thanks. > > > > [osd] > > osd_data = /var/lib/ceph/osd/ceph-$id > > osd_journal_size = 2 > Overkill most likely, but not an issue. > > > osd_mkfs_type = xfs > > osd_mkfs_options_xfs = -f > > filestore_xattr_use_omap = true > > filestore_min_sync_interval = 10 > Are you aware what this does and have you actually tested this (IOPS AND > throughput) with various other setting on your hardware to arrive at this > number? > > > filestore_max_sync_interval = 15 > That's fine in and by itself, unlikely to ever be reached anyway. > > > filestore_queue_max_ops = 25000 > > filestore_queue_max_bytes = 10485760 > > filestore_queue_committing_max_ops = 5000 > > filestore_queue_committing_max_bytes = 1048576 > > journal_max_write_bytes = 1073714824 > > journal_max_write_entries = 1 > > journal_queue_max_ops = 5 > > journal_queue_max_bytes = 1048576 > Same as above, have you tested these setting (from filestore_queue_max_ops > onward) compared to the defaults? > > With HDDs only I'd expect any benefits to be small and/or things to become > very uneven once the HDDs are saturated. > > > osd_max_write_size = 512 > > osd_client_message_size_cap = 2147483648 > > osd_deep_scrub_stride = 131072 > > osd_op_threads = 8 > > osd_disk_threads = 4 > > osd_map_cache_size = 1024 > > osd_map_cache_bl_size = 128 > > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" > The nobarrier part is a a potential recipe for disaster unless you have all > on-disk caches disabled and every other cache battery backed. > > The only devices I trust to mount nobarrier are SSDs with powercaps that > have been proven to do the right thing (Intel DC S amongst them). > > > osd_recovery_op_priority = 4 > > osd_recovery_max_active = 10 > > osd_max_backfills = 4 > > > That's sane enough. > > > [client] > > rbd_cache = true > AFAIK that's the case with recent Ceph versions anyway. > > > rbd_cache_size = 268435456 > > Are you sure that you have 256MB per client to waste on RBD cache? > If so, bully for you, but you might find that depending on your use case a > smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be > more beneficial. 
> > > rbd_cache_max_dirty = 134217728 > > rbd_cache_max_dirty_age = 5 > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
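If the answer to (2) ends up being "go back to the defaults", a sketch of how that could be done on Ubuntu 16.04 (the osd id and the default value are only examples, double-check the 10.2.0 default before relying on it):

    # comment the tuned values out of the [osd] section of ceph.conf, then restart, e.g.
    systemctl restart ceph-osd@0
    # or push a single value back at runtime without a restart:
    ceph tell osd.* injectargs '--filestore_queue_max_ops 50'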
Re: [ceph-users] thanks for a double check on ceph's config
> rbd_cache_size = 268435456 Are you sure that you have 256MB per client to waste on RBD cache? If so, bully for you, but you might find that depending on your use case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be more beneficial. We have changed this value to 64MB. thanks. 2016-05-10 15:31 GMT+08:00 Christian Balzer : > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > Hello, > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > > each), 64GB memory. > No journal SSDs? > What CPU(s) and network? > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > The whole cluster is new installed. > > > > Can you help check what the arguments we put in ceph.conf is reasonable > > or not? > > thanks. > > > > [osd] > > osd_data = /var/lib/ceph/osd/ceph-$id > > osd_journal_size = 2 > Overkill most likely, but not an issue. > > > osd_mkfs_type = xfs > > osd_mkfs_options_xfs = -f > > filestore_xattr_use_omap = true > > filestore_min_sync_interval = 10 > Are you aware what this does and have you actually tested this (IOPS AND > throughput) with various other setting on your hardware to arrive at this > number? > > > filestore_max_sync_interval = 15 > That's fine in and by itself, unlikely to ever be reached anyway. > > > filestore_queue_max_ops = 25000 > > filestore_queue_max_bytes = 10485760 > > filestore_queue_committing_max_ops = 5000 > > filestore_queue_committing_max_bytes = 1048576 > > journal_max_write_bytes = 1073714824 > > journal_max_write_entries = 1 > > journal_queue_max_ops = 5 > > journal_queue_max_bytes = 1048576 > Same as above, have you tested these setting (from filestore_queue_max_ops > onward) compared to the defaults? > > With HDDs only I'd expect any benefits to be small and/or things to become > very uneven once the HDDs are saturated. > > > osd_max_write_size = 512 > > osd_client_message_size_cap = 2147483648 > > osd_deep_scrub_stride = 131072 > > osd_op_threads = 8 > > osd_disk_threads = 4 > > osd_map_cache_size = 1024 > > osd_map_cache_bl_size = 128 > > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" > The nobarrier part is a a potential recipe for disaster unless you have all > on-disk caches disabled and every other cache battery backed. > > The only devices I trust to mount nobarrier are SSDs with powercaps that > have been proven to do the right thing (Intel DC S amongst them). > > > osd_recovery_op_priority = 4 > > osd_recovery_max_active = 10 > > osd_max_backfills = 4 > > > That's sane enough. > > > [client] > > rbd_cache = true > AFAIK that's the case with recent Ceph versions anyway. > > > rbd_cache_size = 268435456 > > Are you sure that you have 256MB per client to waste on RBD cache? > If so, bully for you, but you might find that depending on your use case a > smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be > more beneficial. > > > rbd_cache_max_dirty = 134217728 > > rbd_cache_max_dirty_age = 5 > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
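One thing worth double-checking when shrinking the cache: rbd_cache_max_dirty has to stay below rbd_cache_size for writeback caching to work, and it is still 128MB in the config above. A sketch of how the [client] section might end up (the dirty value is only an example):

    [client]
    rbd_cache = true
    rbd_cache_size = 67108864        # 64MB
    rbd_cache_max_dirty = 50331648   # 48MB, must be < rbd_cache_size (0 means write-through)
    rbd_cache_max_dirty_age = 5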
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, On Tue, 10 May 2016 10:40:08 +0200 Yoann Moulin wrote: > Hello, > > I'd like some advices about the setup of a new ceph cluster. Here the > use case : > > RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. > Most of the access will be in read only mode. Write access will only be > done by the admin to update the datasets. > > We might use rbd some time to sync data as temp storage (when POSIX is > needed) but performance will not be an issue here. We might use cephfs > in the futur if that can replace a filesystem on rdb. > > We gonna start with 16 nodes (up to 24). The configuration of each node > is : > > CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > Memory : 128GB > OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) Dedicated OS SSDs aren't really needed, I tend to share OS and cache/journal SSDs. That's of course with more durable (S3610) models. Since you didn't mention dedicated MON nodes, make sure that if you plan to put MONs on storage servers to have fast SSDs in them for the leveldb (again DC S36xx or 37xx). This will also free up 2 more slots in your (likely Supermicro) chassis for OSD HDDs. > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) These SSDs do not exist according to the Intel site and the only references I can find for them are on "no longer available" European sites. Since you're in the land of rich chocolate bankers, I assume that this model is something that just happened in Europe. Without knowing the specifications for these SSDs, I can't recommend them. I'd use DC S3610 or 3710 instead, this very much depends on how much endurance (TPW) you need. > OSD Disk : 10 x HGST ultrastar-7k6000 6TB > Public Network : 1 x 10Gb/s > Private Network : 1 x 10Gb/s > OS : Ubuntu 16.04 > Ceph version : Jewel > > The question is : journal or cache tier (read only) on the SD 400GB > Intel S3300 DC ? > You said read-only, or read-mostly up there. So why journals (only helpful for writes) or cache tiers (your 2 SSDs may not be faster than your 10 HDDs for reads) at all? Mind, if you have the money, go for it! > Each disk is able to write sequentially at 220MB/s. SSDs can write at > ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the > bottleneck (1GB/s vs 2GB/s). Your filestore based OSDs will never write Ceph data at 220MB/s, 100 would be pushing it. So no, your journal SSDs won't be the limiting factor, though 5 journals on one SSD is pushing my comfort zone when it comes to SPoFs. > If we set the journal on OSDs, we can > expect a good throughput in read on the disk (in case of data not in the > cache) and write shouldn't be so bad too, even if we have random read on > the OSD during the write ? > > SSDs as cache tier seem to be a better usage than only 5 journal on > each ? Is that correct ? > Potentially, depends on your actual usage. Again, since you said read-mostly, the question with a cache-tier becomes, how much of your truly hot data can fit into it? Remember that super-hot objects are likely to come from the pagecache of the storage node in question anyway. If you do care about fast writes after all, consider de-coupling writes and reads as much as possible. As in, set your cache to "readforward" (undocumented, google for it), so all un-cached reads will go to the HDDs (they CAN read at near full speed), while all writes will go the cache pool (and eventually to the HDDs, you can time that with lowering the dirty ratio during off-peak hours). 
> We gonna use an EC pool for big files (jerasure 8+2 I think) and a > replicated pool for small files. > > If I check on http://ceph.com/pgcalc/, in this use case > > replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs > Ec pool : pg_num = 4096 > and pgp_num = pg_num > > Should I set the pg_num to 8192 or 16384 ? what is the impact on the > cluster if we set the pg_num to 16384 at the beginning ? 16384 is high, > isn't it ? > If 24 nodes is the absolute limit of your cluster, you want to set the target pg num to 100 in the calculator, which gives you 8192 again. Keep in mind that splitting PGs is an expensive operation, so if 24 isn't a hard upper limit, you might be better off starting big. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
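For reference, a rough sketch of the commands involved if the readforward route gets taken (pool names and thresholds are placeholders, not a recommendation, and readforward is the undocumented mode Christian mentions):

    ceph osd tier add rgw-data rgw-cache
    ceph osd tier cache-mode rgw-cache readforward
    ceph osd tier set-overlay rgw-data rgw-cache
    ceph osd pool set rgw-cache hit_set_type bloom
    ceph osd pool set rgw-cache target_max_bytes 1500000000000   # leave headroom on the SSDs
    ceph osd pool set rgw-cache cache_target_dirty_ratio 0.4
    ceph osd pool set rgw-cache cache_target_full_ratio 0.8
    # off-peak: lower the dirty ratio to push dirty objects down to the HDDs
    ceph osd pool set rgw-cache cache_target_dirty_ratio 0.1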
Re: [ceph-users] thanks for a double check on ceph's config
Hello, On Tue, 10 May 2016 16:50:17 +0800 Geocast Networks wrote: > Hello Chris, > > We don't use SSD as journal. > each host has one intel E5-2620 CPU which is 6 cores. That should be enough. > the networking (both cluster and data networks) is 10Gbps. > 12 HDDs will barely saturate a 10Gb/s link during writes, if you care about fast reads you may be better off with a uniform, bonded 20Gb/s network. > My further questions include, > > (1) osd_mkfs_type = xfs > osd_mkfs_options_xfs = -f > filestore_xattr_use_omap = true > > for XFS filesystem, we should not enable filestore_xattr_use_omap = true, > is it? > You don't need to, AFAIK this switch doesn't cause any overhead if it isn't needed. Somebody actually using XFS or knowing the code may pipe up here. > (2) filestore_queue_max_ops = 25000 > filestore_queue_max_bytes = 10485760 > filestore_queue_committing_max_ops = 5000 > filestore_queue_committing_max_bytes = 1048576 > journal_max_write_bytes = 1073714824 > journal_max_write_entries = 1 > journal_queue_max_ops = 5 > journal_queue_max_bytes = 1048576 > > Since we don't have SSD as journals, all these setup are too large? what > are the better values? > You really want to test them against the defaults. And the defaults are designed for usage with HDD only OSDs, so they are probably your best bet unless you feel like empiric testing. > (3) osd_mount_options_xfs = > "rw,noexec,nodev,noatime,nodiratime,nobarrier" What's your suggested > options here? > As I said, loose the "nobarrier". Christian > Thanks a lot. > > > 2016-05-10 15:31 GMT+08:00 Christian Balzer : > > > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > > > Hello, > > > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks > > > (4TB each), 64GB memory. > > No journal SSDs? > > What CPU(s) and network? > > > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > > The whole cluster is new installed. > > > > > > Can you help check what the arguments we put in ceph.conf is > > > reasonable or not? > > > thanks. > > > > > > [osd] > > > osd_data = /var/lib/ceph/osd/ceph-$id > > > osd_journal_size = 2 > > Overkill most likely, but not an issue. > > > > > osd_mkfs_type = xfs > > > osd_mkfs_options_xfs = -f > > > filestore_xattr_use_omap = true > > > filestore_min_sync_interval = 10 > > Are you aware what this does and have you actually tested this (IOPS > > AND throughput) with various other setting on your hardware to arrive > > at this number? > > > > > filestore_max_sync_interval = 15 > > That's fine in and by itself, unlikely to ever be reached anyway. > > > > > filestore_queue_max_ops = 25000 > > > filestore_queue_max_bytes = 10485760 > > > filestore_queue_committing_max_ops = 5000 > > > filestore_queue_committing_max_bytes = 1048576 > > > journal_max_write_bytes = 1073714824 > > > journal_max_write_entries = 1 > > > journal_queue_max_ops = 5 > > > journal_queue_max_bytes = 1048576 > > Same as above, have you tested these setting (from > > filestore_queue_max_ops onward) compared to the defaults? > > > > With HDDs only I'd expect any benefits to be small and/or things to > > become very uneven once the HDDs are saturated. 
> > > > > osd_max_write_size = 512 > > > osd_client_message_size_cap = 2147483648 > > > osd_deep_scrub_stride = 131072 > > > osd_op_threads = 8 > > > osd_disk_threads = 4 > > > osd_map_cache_size = 1024 > > > osd_map_cache_bl_size = 128 > > > osd_mount_options_xfs = > > > "rw,noexec,nodev,noatime,nodiratime,nobarrier" > > The nobarrier part is a a potential recipe for disaster unless you > > have all on-disk caches disabled and every other cache battery backed. > > > > The only devices I trust to mount nobarrier are SSDs with powercaps > > that have been proven to do the right thing (Intel DC S amongst them). > > > > > osd_recovery_op_priority = 4 > > > osd_recovery_max_active = 10 > > > osd_max_backfills = 4 > > > > > That's sane enough. > > > > > [client] > > > rbd_cache = true > > AFAIK that's the case with recent Ceph versions anyway. > > > > > rbd_cache_size = 268435456 > > > > Are you sure that you have 256MB per client to waste on RBD cache? > > If so, bully for you, but you might find that depending on your use > > case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) > > could be more beneficial. > > > > > rbd_cache_max_dirty = 134217728 > > > rbd_cache_max_dirty_age = 5 > > > > Christian > > -- > > Christian BalzerNetwork/Systems Engineer > > ch...@gol.com Global OnLine Japan/Rakuten Communications > > http://www.gol.com/ > > -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
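If it helps, the same mount line with only the dangerous part dropped would look something like this (inode64 is optional, but commonly used on filesystems over 1TB, as Udo's mail further down also shows):

    osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,inode64"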
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, >> I'd like some advices about the setup of a new ceph cluster. Here the >> use case : >> >> RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. >> Most of the access will be in read only mode. Write access will only be >> done by the admin to update the datasets. >> >> We might use rbd some time to sync data as temp storage (when POSIX is >> needed) but performance will not be an issue here. We might use cephfs >> in the futur if that can replace a filesystem on rdb. >> >> We gonna start with 16 nodes (up to 24). The configuration of each node >> is : >> >> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) >> Memory : 128GB >> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > > Dedicated OS SSDs aren't really needed, I tend to share OS and > cache/journal SSDs. > That's of course with more durable (S3610) models. I already have those 24 servers running 2 ceph cluster for test right now, so I cannot change anything. we were thinking about share journal but as I mention it below, MON will be on storage server, so that might use too much I/O to share levedb and journal on the same SSD. > Since you didn't mention dedicated MON nodes, make sure that if you plan > to put MONs on storage servers to have fast SSDs in them for the leveldb > (again DC S36xx or 37xx). Yes MON nodes will be shared on storage server. MONs use the SSD 240GB for the leveldb right now. > This will also free up 2 more slots in your (likely Supermicro) chassis > for OSD HDDs. It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in front. >> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > > These SSDs do not exist according to the Intel site and the only > references I can find for them are on "no longer available" European sites. I made a mistake, it's not 400 but 480GB, smartctl give me Model SSDSC2BB480H4 > Since you're in the land of rich chocolate bankers, I assume that this > model is something that just happened in Europe. I'm just a poor sysadmin with expensive toy in a University ;) > Without knowing the specifications for these SSDs, I can't recommend them. > I'd use DC S3610 or 3710 instead, this very much depends on how much > endurance (TPW) you need. As I write above, I already have those SSDs so I look for the best setup with the hardware I have. >> OSD Disk : 10 x HGST ultrastar-7k6000 6TB >> Public Network : 1 x 10Gb/s >> Private Network : 1 x 10Gb/s >> OS : Ubuntu 16.04 >> Ceph version : Jewel >> >> The question is : journal or cache tier (read only) on the SD 400GB >> Intel S3300 DC ? >> > You said read-only, or read-mostly up there. I mean, I think about using cache tier for read operation. No write operation gonna use the cache tier. I don't know yet wich mode I gonna use, I have to do some tests. > So why journals (only helpful for writes) or cache tiers (your 2 SSDs may > not be faster than your 10 HDDs for reads) at all? We plan to have eavy read access some time so we think about to have cache tier on SSD to speed up the throughput and decrease the I/O pressure on disk. I might be wrong on that. > Mind, if you have the money, go for it! I don't have the money, I have the hardware :) >> Each disk is able to write sequentially at 220MB/s. SSDs can write at >> ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the >> bottleneck (1GB/s vs 2GB/s). > > Your filestore based OSDs will never write Ceph data at 220MB/s, 100 would > be pushing it. 
> So no, your journal SSDs won't be the limiting factor, though 5 journals > on one SSD is pushing my comfort zone when it comes to SPoFs. > >> If we set the journal on OSDs, we can >> expect a good throughput in read on the disk (in case of data not in the >> cache) and write shouldn't be so bad too, even if we have random read on >> the OSD during the write ? >> >> SSDs as cache tier seem to be a better usage than only 5 journal on >> each ? Is that correct ? >> > Potentially, depends on your actual usage. > > Again, since you said read-mostly, the question with a cache-tier becomes, > how much of your truly hot data can fit into it? That the biggest point, many datasets will fit into the cache, but some of them will definitely be too big (+100TB) but in that case, Our user know what going one. > Remember that super-hot objects are likely to come from the pagecache of > the storage node in question anyway. Yes I know that. > If you do care about fast writes after all, consider de-coupling writes > and reads as much as possible. Write operation will only be done by the admins for datasets update. those updates will be plan according the usage of the cluster and scheduled during low usage period. > As in, set your cache to "readforward" (undocumented, google for it), so > all un-cached reads will go to the HDDs (they CAN read at near full speed), > while all writes will go the cache pool (and ev
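A back-of-the-envelope check of the journal question, using only the numbers from this thread (rough figures, ignoring filesystem overhead and replication):

    journal on the same HDD : every write lands on the disk twice -> ~220 / 2 = 110 MB/s ceiling per OSD
    5 journals per SSD      : 5 OSDs × ~100 MB/s ≈ 500 MB/s, right at the SSD's sequential write limit
    10 HDDs reading         : 10 × 220 MB/s = 2.2 GB/s, already more than the 10Gb/s public link (~1.2 GB/s)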
[ceph-users] ceph 0.94.5 / Kernel 4.5.2 / assertion on OSD
Hello, we are running ceph 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) on a 4.5.2 kernel. Our cluster currently consists of 5 nodes, with 6 OSD's each. An issue has also been filed here (also containg logs, etc.): http://tracker.ceph.com/issues/15813 Last night we have observed a single OSD (osd.11) die with an assertion: 2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10 03:16:30.688044 common/Mutex.cc: 100: FAILED assert(r == 0) ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0xb34520] 2: (Mutex::Lock(bool)+0x105) [0xae2395] 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67] 4: (Pipe::fault(bool)+0x426) [0xc16256] 5: (Pipe::reader()+0x3f2) [0xc1d752] 6: (Pipe::Reader::entry()+0xd) [0xc2880d] 7: (()+0x7474) [0x7fa54c70b474] 8: (clone()+0x6d) [0x7fa54ac01acd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -74> 2016-05-09 11:56:56.015711 7fa54d1fc7c0 5 asok(0x5644000) register_command perfcounters_dump hook 0x5624030 -73> 2016-05-09 11:56:56.015739 7fa54d1fc7c0 5 asok(0x5644000) register_command 1 hook 0x5624030 -72> 2016-05-09 11:56:56.015745 7fa54d1fc7c0 5 asok(0x5644000) register_command perf dump hook 0x5624030 -71> 2016-05-09 11:56:56.015751 7fa54d1fc7c0 5 asok(0x5644000) register_command perfcounters_schema hook 0x5624030 -70> 2016-05-09 11:56:56.015756 7fa54d1fc7c0 5 asok(0x5644000) register_command 2 hook 0x5624030 -69> 2016-05-09 11:56:56.015758 7fa54d1fc7c0 5 asok(0x5644000) register_command perf schema hook 0x5624030 -68> 2016-05-09 11:56:56.015763 7fa54d1fc7c0 5 asok(0x5644000) register_command perf reset hook 0x5624030 -67> 2016-05-09 11:56:56.015766 7fa54d1fc7c0 5 asok(0x5644000) register_command config show hook 0x5624030 -66> 2016-05-09 11:56:56.015770 7fa54d1fc7c0 5 asok(0x5644000) register_command config set hook 0x5624030 -65> 2016-05-09 11:56:56.015773 7fa54d1fc7c0 5 asok(0x5644000) register_command config get hook 0x5624030 -64> 2016-05-09 11:56:56.015776 7fa54d1fc7c0 5 asok(0x5644000) register_command config diff hook 0x5624030 -63> 2016-05-09 11:56:56.015779 7fa54d1fc7c0 5 asok(0x5644000) register_command log flush hook 0x5624030 -62> 2016-05-09 11:56:56.015783 7fa54d1fc7c0 5 asok(0x5644000) register_command log dump hook 0x5624030 -61> 2016-05-09 11:56:56.015786 7fa54d1fc7c0 5 asok(0x5644000) register_command log reopen hook 0x5624030 -60> 2016-05-09 11:56:56.017553 7fa54d1fc7c0 0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 76017 -59> 2016-05-09 11:56:56.027154 7fa54d1fc7c0 0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342) -58> 2016-05-09 11:56:56.028635 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work -57> 2016-05-09 11:56:56.028644 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option -56> 2016-05-09 11:56:56.042822 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) -55> 2016-05-09 11:56:56.043047 7fa54d1fc7c0 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is supported and kernel 4.5.2-1-ARCH >= 3.5 -54> 2016-05-09 11:56:56.109483 7fa54d1fc7c0 0 
filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled -53> 2016-05-09 11:56:56.110110 7fa54d1fc7c0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway -52> 2016-05-09 11:56:56.110825 7fa54d1fc7c0 0 cls/hello/cls_hello.cc:271: loading cls_hello -51> 2016-05-09 11:56:56.114886 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for clients -50> 2016-05-09 11:56:56.114895 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608 was 8705, adjusting msgr requires for mons -49> 2016-05-09 11:56:56.114899 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for osds -48> 2016-05-09 11:56:56.114919 7fa54d1fc7c0 0 osd.11 9819 load_pgs -47> 2016-05-09 11:56:57.000584 7fa54d1fc7c0 0 osd.11 9819 load_pgs opened 55 pgs -46> 2016-05-09 11:56:57.000991 7fa54d1fc7c0 -1 osd.11 9819 log_to_monitors {default=true} -45> 2016-05-09 11:56:57.052319 7fa54d1fc7c0 0 osd.11 9819 done with init, starting boot process -44> 2016-05-09 11:57:01.103141 7fa50d84e700 0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598 pipe(0x11152000 sd=80 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec6580)
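Not a fix, but two things that usually help when attaching this kind of crash to a tracker issue (the binary path is the usual location, adjust if needed):

    # disassembly of the exact binary, as the assert message asks for
    objdump -rdS /usr/bin/ceph-osd > ceph-osd-0.94.5.objdump
    # raise messenger/osd logging on that OSD before trying to reproduce
    ceph tell osd.11 injectargs '--debug_ms 10 --debug_osd 20'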
[ceph-users] Cluster issue - pgs degraded, recovering, stale, etc.
Hello. I have a two node cluster with 4x replicas for all objects distributed between the two nodes (two copies on each node). I recently converted my OSDs from BTRFS to XFS (BTRFS was slow) by removing / preparing / activating the OSDs on each node (one at a time) as XFS, allowing the cluster to rebalance / recover itself. Now with this all complete, I have a better performing cluster and all data is intact, however I have the following status. How can I remedy this? Looking for guidance on steps / a troubleshooting starting point. There’s a bunch of seemingly different issues that likely stem from the same root cause.

health HEALTH_WARN
 11 pgs degraded
 7 pgs peering
 4 pgs recovering
 2 pgs recovery_wait
 885 pgs stale
 11 pgs stuck degraded
 60 pgs stuck inactive
 885 pgs stuck stale
 66 pgs stuck unclean
 recovery 3/24971148 objects degraded (0.000%)

Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
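Not an answer, but the usual starting points for narrowing down where the stale PGs think they live (the pgid below is just a placeholder for one of the PGs the cluster reports):

    ceph health detail
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg 2.3f query        # for one of the stale PGs reported above
    ceph osd tree             # confirm every OSD really is up and under the expected host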
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, On Tue, 10 May 2016 13:14:35 +0200 Yoann Moulin wrote: > Hello, > > >> I'd like some advices about the setup of a new ceph cluster. Here the > >> use case : > >> > >> RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. > >> Most of the access will be in read only mode. Write access will only > >> be done by the admin to update the datasets. > >> > >> We might use rbd some time to sync data as temp storage (when POSIX is > >> needed) but performance will not be an issue here. We might use cephfs > >> in the futur if that can replace a filesystem on rdb. > >> > >> We gonna start with 16 nodes (up to 24). The configuration of each > >> node is : > >> > >> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > >> Memory : 128GB > >> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > > > > Dedicated OS SSDs aren't really needed, I tend to share OS and > > cache/journal SSDs. > > That's of course with more durable (S3610) models. > > I already have those 24 servers running 2 ceph cluster for test right > now, so I cannot change anything. we were thinking about share journal > but as I mention it below, MON will be on storage server, so that might > use too much I/O to share levedb and journal on the same SSD. > Not really, the journal is sequential writes, the leveldb small, fast IOPS. Both of them on the same (decent) SSD should be fine. But as your HW is fixed, lets not speculate about that. > > Since you didn't mention dedicated MON nodes, make sure that if you > > plan to put MONs on storage servers to have fast SSDs in them for the > > leveldb (again DC S36xx or 37xx). > > Yes MON nodes will be shared on storage server. MONs use the SSD 240GB > for the leveldb right now. > Note that the lowest IP(s) become the MON leader, so if you put RADOSGW and other things on the storage nodes as well, spread things out accordingly. > > This will also free up 2 more slots in your (likely Supermicro) chassis > > for OSD HDDs. > > It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front > and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in > front. > That sounds like a SM chassis. ^o^ In fact, I can't find a chassis on Intel's page with 2 back 2.5 slots. > >> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > > > > These SSDs do not exist according to the Intel site and the only > > references I can find for them are on "no longer available" European > > sites. > > I made a mistake, it's not 400 but 480GB, smartctl give me Model > SSDSC2BB480H4 > OK, that's not good. Firstly, that model number still doesn't get us any hits from Intel, strangely enough. Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and matches the 3510 480GB model up to the last 2 characters. And that has an endurance of 275TBW, not something you want to use for either journals or cache pools. > > Since you're in the land of rich chocolate bankers, I assume that this > > model is something that just happened in Europe. > > I'm just a poor sysadmin with expensive toy in a University ;) > I know, I recognized the domain. ^.^ > > Without knowing the specifications for these SSDs, I can't recommend > > them. I'd use DC S3610 or 3710 instead, this very much depends on how > > much endurance (TPW) you need. > > As I write above, I already have those SSDs so I look for the best setup > with the hardware I have. 
> Unless they have at least an endurance of 3 DWPD like the 361x (and their model number, size and the 3300 naming suggests they do NOT), your 480GB SSDs aren't suited for intense Ceph usage. How much have you used them yet and what is their smartctl status, in particular these values (from a 800GB DC S3610 in my cache pool): --- 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000Old_age Always - 0 241 Host_Writes_32MiB 0x0032 100 100 000Old_age Always - 869293 242 Host_Reads_32MiB0x0032 100 100 000Old_age Always - 43435 243 NAND_Writes_32MiB 0x0032 100 100 000Old_age Always - 1300884 --- Not even 1% down after 40TBW, at which point your SSDs are likely to be 15% down... > >> OSD Disk : 10 x HGST ultrastar-7k6000 6TB > >> Public Network : 1 x 10Gb/s > >> Private Network : 1 x 10Gb/s > >> OS : Ubuntu 16.04 > >> Ceph version : Jewel > >> > >> The question is : journal or cache tier (read only) on the SD 400GB > >> Intel S3300 DC ? > >> > > You said read-only, or read-mostly up there. > > I mean, I think about using cache tier for read operation. No write > operation gonna use the cache tier. I don't know yet wich mode I gonna > use, I have to do some tests. > As I said, your HDDs are unlikely to be slower (for sufficient parallel access, not short, sequential reads) than those SSDs. > > So why journals (only helpf
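For anyone wanting to reproduce those numbers on their own drives, the counters come straight out of smartctl, and the *_32MiB raw values convert as follows (rough arithmetic):

    smartctl -a /dev/sdX | egrep 'Wearout|Reservd|Writes_32MiB|Reads_32MiB|Total_LBAs'

    host writes : 869293  × 32 MiB ≈ 26.5 TiB
    NAND writes : 1300884 × 32 MiB ≈ 39.7 TiB   (the ~40TBW mentioned above)
    write amplification ≈ 1300884 / 869293 ≈ 1.5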
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
> -Original Message- > From: Eric Eastman [mailto:eric.east...@keepertech.com] > Sent: 09 May 2016 23:09 > To: Nick Fisk > Cc: Ceph Users > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > lockfile > > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: > > Hi Eric, > > > >> > >> I am trying to do some similar testing with SAMBA and CTDB with the > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are you > >> kernel mounting the Ceph file system? > > > > I'm using the kernel client. I couldn't find any up to date information on > > if > the vfs plugin supported all the necessary bits and pieces. > > > > How is your testing coming along? I would be very interested in any > findings you may have come across. > > > > Nick > > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When from a > SAMBA client, I write a large file (about 2GB) to a gateway that is not the > holder of the CTDB lock file, and then kill that gateway server during the > write, the IP failover works as expected, and in most cases the file ends up > being the correct size after the new server finishes writing it, but the data > is > corrupt. The data in the file, from the point of the failover, is all zeros. > > I thought the issue may be with the kernel mount, so I looked into using the > SAMBA vfs_ceph module, but I need SAMBA with AD support and the > current vfs_ceph module, even in the SAMBA git master version, is lacking > ACL support for CephFS, as the vfs_ceph.c patches summited to the SAMBA > mail list are not yet available. See: > https://lists.samba.org/archive/samba-technical/2016-March/113063.html > > I tried using a FUSE mount of the CephFS, and it also fails setting ACLs. > See: > http://tracker.ceph.com/issues/15783. > > My current status is IP failover is working, but I am seeing data corruption > on > writes to the share when using kernel mounts. I am also seeing the issue you > reported when I kill the system holding the CTDB lock file. Are you verifying > your data after each failover? I must admit you are slightly ahead of me. I was initially trying to just get hard/soft failover working correctly. But your response has prompted me to test out the scenario you mentioned. I'm seeing slightly different results, my copy seems to error out when I do a node failover. I'm copying an ISO from a 2008 server to the CTDB/Samba share and when I reboot the active node, the copy pauses for a couple of seconds and then comes up with the error box. Clicking try again several times doesn't let it resume. I need to do a bit more digging to try and work out why this is happening. The share itself does seem to be in a working state when trying to click the try again button, so there is probably some sort of state/session problem. Do you have multiple vip's configured on your cluster or just a single IP? I have just the one at the moment. > > Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
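For what it's worth, the vfs_ceph route (for anyone who can live without the ACL support discussed above) only needs a small share definition. This is a sketch with a placeholder share name, path and cephx user, not a tested AD setup:

    [cephshare]
        path = /shares                      ; path relative to the CephFS root
        vfs objects = ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba                ; cephx user the gateway authenticates as
        kernel share modes = no             ; usually recommended with vfs_ceph
        read only = no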
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Nick Fisk > Sent: 10 May 2016 13:30 > To: 'Eric Eastman' > Cc: 'Ceph Users' > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > lockfile > > > -Original Message- > > From: Eric Eastman [mailto:eric.east...@keepertech.com] > > Sent: 09 May 2016 23:09 > > To: Nick Fisk > > Cc: Ceph Users > > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > > lockfile > > > > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: > > > Hi Eric, > > > > > >> > > >> I am trying to do some similar testing with SAMBA and CTDB with the > > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are > > >> you kernel mounting the Ceph file system? > > > > > > I'm using the kernel client. I couldn't find any up to date > > > information on if > > the vfs plugin supported all the necessary bits and pieces. > > > > > > How is your testing coming along? I would be very interested in any > > findings you may have come across. > > > > > > Nick > > > > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When > from > > a SAMBA client, I write a large file (about 2GB) to a gateway that is > > not the holder of the CTDB lock file, and then kill that gateway > > server during the write, the IP failover works as expected, and in > > most cases the file ends up being the correct size after the new > > server finishes writing it, but the data is corrupt. The data in the file, from > the point of the failover, is all zeros. > > > > I thought the issue may be with the kernel mount, so I looked into > > using the SAMBA vfs_ceph module, but I need SAMBA with AD support > and > > the current vfs_ceph module, even in the SAMBA git master version, is > > lacking ACL support for CephFS, as the vfs_ceph.c patches summited to > > the SAMBA mail list are not yet available. See: > > https://lists.samba.org/archive/samba-technical/2016-March/113063.html > > > > I tried using a FUSE mount of the CephFS, and it also fails setting ACLs. See: > > http://tracker.ceph.com/issues/15783. > > > > My current status is IP failover is working, but I am seeing data > > corruption on writes to the share when using kernel mounts. I am also > > seeing the issue you reported when I kill the system holding the CTDB > > lock file. Are you verifying your data after each failover? > > I must admit you are slightly ahead of me. I was initially trying to just get > hard/soft failover working correctly. But your response has prompted me to > test out the scenario you mentioned. I'm seeing slightly different results, my > copy seems to error out when I do a node failover. I'm copying an ISO from a > 2008 server to the CTDB/Samba share and when I reboot the active node, > the copy pauses for a couple of seconds and then comes up with the error > box. Clicking try again several times doesn't let it resume. I need to do a bit > more digging to try and work out why this is happening. The share itself does > seem to be in a working state when trying to click the try again button, so > there is probably some sort of state/session problem. > > Do you have multiple vip's configured on your cluster or just a single IP? I > have just the one at the moment. Just to add to this, I have just been reading this article https://nnc3.com/mags/LM10/Magazine/Archive/2009/105/030-035_SambaHA/article .html And the following paragraph seems to indicate that what I am seeing is the correct behaviour? 
I 'm wondering if this is not happening in your case and is why you are getting corruption? "It is important to understand that load balancing and client distribution over the client nodes are connection oriented. If an IP address is switched from one node to another, all the connections actively using this IP address are dropped and the clients have to reconnect. To avoid delays, CTDB uses a trick: When an IP is switched, the new CTDB node "tickles" the client with an illegal TCP ACK packet (tickle ACK) containing an invalid sequence number of 0 and an ACK number of 0. The client responds with a valid ACK packet, allowing the new IP address owner to close the connection with an RST packet, thus forcing the client to reestablish the connection to the new node." Nick > > > > > Eric > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
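For reference, a minimal sketch of the CTDB side of such a setup (paths, IPs and interface names are placeholders; whether to publish a single VIP or one per gateway is exactly the trade-off discussed above, since only clients on a failed-over address have to reconnect):

    # /etc/default/ctdb
    CTDB_RECOVERY_LOCK=/mnt/cephfs/ctdb/.ctdb_lock    # must sit on the shared CephFS mount
    CTDB_NODES=/etc/ctdb/nodes                        # one private IP per gateway
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
    CTDB_MANAGES_SAMBA=yes

    # /etc/ctdb/public_addresses - one VIP, or one per gateway to spread clients out
    192.168.10.200/24 eth0
    192.168.10.201/24 eth0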
Re: [ceph-users] thanks for a double check on ceph's config
Hi,

On 2016-05-10 05:48, Geocast wrote:

 Hi members,
 We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB each), 64GB memory.
 ceph version 10.2.0, Ubuntu 16.04 LTS
 The whole cluster is new installed.
 Can you help check what the arguments we put in ceph.conf is reasonable or not? thanks.

 [osd]
 osd_data = /var/lib/ceph/osd/ceph-$id
 osd_journal_size = 2
 osd_mkfs_type = xfs
 osd_mkfs_options_xfs = -f
 filestore_xattr_use_omap = true
 filestore_min_sync_interval = 10
 filestore_max_sync_interval = 15
 filestore_queue_max_ops = 25000
 filestore_queue_max_bytes = 10485760
 filestore_queue_committing_max_ops = 5000
 filestore_queue_committing_max_bytes = 1048576
 journal_max_write_bytes = 1073714824
 journal_max_write_entries = 1
 journal_queue_max_ops = 5
 journal_queue_max_bytes = 1048576
 osd_max_write_size = 512
 osd_client_message_size_cap = 2147483648
 osd_deep_scrub_stride = 131072
 osd_op_threads = 8
 osd_disk_threads = 4
 osd_map_cache_size = 1024
 osd_map_cache_bl_size = 128
 osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"

I have these settings (to avoid fragmentation):

osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
osd mkfs options xfs = "-f -i size=2048"

Udo
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
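If fragmentation is the concern, the current state of an OSD filesystem can be checked, and if needed fixed online, with the standard XFS tools (device and mount point below are only examples):

    xfs_db -r -c frag /dev/sdc1       # prints the fragmentation factor, read-only
    xfs_fsr /var/lib/ceph/osd/ceph-0  # optional online defragmentation of that mount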
Re: [ceph-users] Erasure pool performance expectations
To answer my own question it seems that you can change settings on the fly using ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) However the response seems to imply I can't change this setting. Is there an other way to change these settings? On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle wrote: > Hey guys, > > I noticed the merge request that fixes the switch around here > https://github.com/ceph/ceph/pull/8912 > > I had two questions: > > >- Does this effect my performance in any way? Could it explain the >slow requests I keep having? >- Can I modify these settings manually myself on my cluster? > > Thanks, > > Peter > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > wrote: > >> Hey Mark, >> >> Sorry I missed your message as I'm only subscribed to daily digests. >> >> >>> Date: Tue, 3 May 2016 09:05:02 -0500 >>> From: Mark Nelson >>> To: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] Erasure pool performance expectations >>> Message-ID: >>> Content-Type: text/plain; charset=windows-1252; format=flowed >>> In addition to what nick said, it's really valuable to watch your cache >>> tier write behavior during heavy IO. One thing I noticed is you said >>> you have 2 SSDs for journals and 7 SSDs for data. >> >> >> I thought the hardware recommendations were 1 journal disk per 3 or 4 >> data disks but I think I might have misunderstood it. Looking at my journal >> read/writes they seem to be ok though: >> https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016-05-06%2009.55.30.png?dl=0 >> >> However I started running into a lot of slow requests (made a separate >> thread for those: Diagnosing slow requests) and now I'm hoping these >> could be related to my journaling setup. >> >> >>> If they are all of >>> the same type, you're likely bottlenecked by the journal SSDs for >>> writes, which compounded with the heavy promotions is going to really >>> hold you back. >>> What you really want: >>> 1) (assuming filestore) equal large write throughput between the >>> journals and data disks. >> >> How would one achieve that? >> >>> >>> 2) promotions to be limited by some reasonable fraction of the cache >>> tier and/or network throughput (say 70%). This is why the >>> user-configurable promotion throttles were added in jewel. >> >> Are these already in the docs somewhere? >> >>> >>> 3) The cache tier to fill up quickly when empty but change slowly once >>> it's full (ie limiting promotions and evictions). No real way to do >>> this yet. >>> Mark >> >> >> Thanks for your thoughts. >> >> Peter >> >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure pool performance expectations
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Peter Kerdisle > Sent: 10 May 2016 14:37 > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Erasure pool performance expectations > > To answer my own question it seems that you can change settings on the fly > using > > ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' > osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) > > However the response seems to imply I can't change this setting. Is there an > other way to change these settings? Sorry Peter, I missed your last email. You can also specify that setting in the ceph.conf, ie I have in mine osd_tier_promote_max_bytes_sec = 400 > > > On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle > wrote: > Hey guys, > > I noticed the merge request that fixes the switch around here > https://github.com/ceph/ceph/pull/8912 > > I had two questions: > > • Does this effect my performance in any way? Could it explain the slow > requests I keep having? > • Can I modify these settings manually myself on my cluster? > Thanks, > > Peter > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > wrote: > Hey Mark, > > Sorry I missed your message as I'm only subscribed to daily digests. > > Date: Tue, 3 May 2016 09:05:02 -0500 > From: Mark Nelson > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Erasure pool performance expectations > Message-ID: > Content-Type: text/plain; charset=windows-1252; format=flowed > In addition to what nick said, it's really valuable to watch your cache > tier write behavior during heavy IO. One thing I noticed is you said > you have 2 SSDs for journals and 7 SSDs for data. > > I thought the hardware recommendations were 1 journal disk per 3 or 4 data > disks but I think I might have misunderstood it. Looking at my journal > read/writes they seem to be ok > though: https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016- > 05-06%2009.55.30.png?dl=0 > > However I started running into a lot of slow requests (made a separate > thread for those: Diagnosing slow requests) and now I'm hoping these could > be related to my journaling setup. > > If they are all of > the same type, you're likely bottlenecked by the journal SSDs for > writes, which compounded with the heavy promotions is going to really > hold you back. > What you really want: > 1) (assuming filestore) equal large write throughput between the > journals and data disks. > How would one achieve that? > > 2) promotions to be limited by some reasonable fraction of the cache > tier and/or network throughput (say 70%). This is why the > user-configurable promotion throttles were added in jewel. > Are these already in the docs somewhere? > > 3) The cache tier to fill up quickly when empty but change slowly once > it's full (ie limiting promotions and evictions). No real way to do > this yet. > Mark > > Thanks for your thoughts. > > Peter > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
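So, as a sketch, the persistent variant would be the following (the value is simply the one from the injectargs example above; as far as I can tell the "(unchangeable)" reply means the OSDs have to be restarted before a new value takes effect):

    [osd]
    osd_tier_promote_max_bytes_sec = 5242880
    # its companion knob, osd_tier_promote_max_objects_sec, may want raising too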
[ceph-users] RGW user quota may not adjust on bucket removal
Hey,

we currently have a problem with our radosgw. The quota value of a user does not get updated after an admin manually deletes a bucket (via radosgw-admin). You can only circumvent this if you synced the user stats before the removal. So there are now users who cannot upload new objects although they should be able to.

There is already a bug filed for this: http://tracker.ceph.com/issues/14507

It looks like the corresponding merge commit got into Ceph v10.1.0 first:

"""
nick@nick-nine-virtual:~/git_repos/ceph$ git tag --contains 709ab2dd6e84abf152527e6a9177aabcf1a4c887
v10.1.0
v10.1.1
v10.1.2
v10.2.0
"""

We are using Ceph version 9.2.1. I will upgrade the cluster to Jewel in the next few days, but I guess my problem will stay the same :-)

So does anyone know if there is a method to let ceph recalculate the quota usage of a user or change it manually somewhere?

I had the same problem a few weeks ago and I did the following:
- create a new temp user with new temp buckets
- lock the old account
- copy all the objects with S3fuse from the old account to the new one
- delete the old account and recreate it
- copy the objects back
(I did this because it was not possible to change the ownership of a bucket to a new user)

This time it would take a long time to do this again as the users have a lot more objects in their buckets.

Thanks for any help or advice...

Cheers
Nick
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
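In case it saves someone a round trip: the stats sync referred to above can also be run after the fact, which is worth trying before rebuilding accounts (the uid is a placeholder, and I am not certain it picks up buckets that were already deleted behind the user's back):

    radosgw-admin user stats --uid=johndoe --sync-stats
    radosgw-admin user info --uid=johndoe     # check the updated stats/quota afterwards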
Re: [ceph-users] journal or cache tier on SSDs ?
Re, I'd like some advices about the setup of a new ceph cluster. Here the use case : RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most of the access will be in read only mode. Write access will only be done by the admin to update the datasets. We might use rbd some time to sync data as temp storage (when POSIX is needed) but performance will not be an issue here. We might use cephfs in the futur if that can replace a filesystem on rdb. We gonna start with 16 nodes (up to 24). The configuration of each node is : CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) Memory : 128GB OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) >>> >>> Dedicated OS SSDs aren't really needed, I tend to share OS and >>> cache/journal SSDs. >>> That's of course with more durable (S3610) models. >> >> I already have those 24 servers running 2 ceph cluster for test right >> now, so I cannot change anything. we were thinking about share journal >> but as I mention it below, MON will be on storage server, so that might >> use too much I/O to share levedb and journal on the same SSD. >> > Not really, the journal is sequential writes, the leveldb small, fast > IOPS. Both of them on the same (decent) SSD should be fine. > > But as your HW is fixed, lets not speculate about that. Ok. >>> Since you didn't mention dedicated MON nodes, make sure that if you >>> plan to put MONs on storage servers to have fast SSDs in them for the >>> leveldb (again DC S36xx or 37xx). >> >> Yes MON nodes will be shared on storage server. MONs use the SSD 240GB >> for the leveldb right now. >> > Note that the lowest IP(s) become the MON leader, so if you put RADOSGW > and other things on the storage nodes as well, spread things out > accordingly. Yes for sur, we gonna spread services over nodes. The 3 RadosGW won't be on the MONs nodes. >>> This will also free up 2 more slots in your (likely Supermicro) chassis >>> for OSD HDDs. >> >> It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front >> and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in >> front. > > That sounds like a SM chassis. ^o^ > In fact, I can't find a chassis on Intel's page with 2 back 2.5 slots. http://www.colfax-intl.com/nd/images/systems/servers/R2208WT-rear.gif Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) >>> >>> These SSDs do not exist according to the Intel site and the only >>> references I can find for them are on "no longer available" European >>> sites. >> >> I made a mistake, it's not 400 but 480GB, smartctl give me Model >> SSDSC2BB480H4 >> > OK, that's not good. > Firstly, that model number still doesn't get us any hits from Intel, > strangely enough. > > Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and > matches the 3510 480GB model up to the last 2 characters. > And that has an endurance of 275TBW, not something you want to use for > either journals or cache pools. I see, here the information from the resseler : "The S3300 series is the OEM version of S3510 and 1:1 the same drive" >>> Since you're in the land of rich chocolate bankers, I assume that this >>> model is something that just happened in Europe. >> >> I'm just a poor sysadmin with expensive toy in a University ;) >> > I know, I recognized the domain. ^.^ :) >>> Without knowing the specifications for these SSDs, I can't recommend >>> them. I'd use DC S3610 or 3710 instead, this very much depends on how >>> much endurance (TPW) you need. 
>> >> As I write above, I already have those SSDs so I look for the best setup >> with the hardware I have. >> > > Unless they have at least an endurance of 3 DWPD like the 361x (and their > model number, size and the 3300 naming suggests they do NOT), your 480GB > SSDs aren't suited for intense Ceph usage. > > How much have you used them yet and what is their smartctl status, in > particular these values (from a 800GB DC S3610 in my cache pool): > --- > 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail Always > - 0 > 233 Media_Wearout_Indicator 0x0032 100 100 000Old_age Always > - 0 > 241 Host_Writes_32MiB 0x0032 100 100 000Old_age Always > - 869293 > 242 Host_Reads_32MiB0x0032 100 100 000Old_age Always > - 43435 > 243 NAND_Writes_32MiB 0x0032 100 100 000Old_age Always > - 1300884 > --- > > Not even 1% down after 40TBW, at which point your SSDs are likely to be > 15% down... More or less the same value on the 10 hosts I have on my beta cluster : 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 233252 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 13
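For reference, a quick way to turn those 32MiB-unit counters into TiB written (a rough sketch; it assumes the raw values of attributes 241/243 are reported in 32MiB units as on these Intel DC drives, and that /dev/sdX is one of the journal SSDs):

  # print host and NAND writes in TiB; the raw value is column 10 of "smartctl -A"
  smartctl -A /dev/sdX | awk '$1 == 241 || $1 == 243 { printf "%s: %.1f TiB written\n", $2, $10 * 32 / 1024 / 1024 }'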
[ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
Hello, I just upgraded my cluster to version 10.1.2 and it worked well for a while, until I saw that systemctl ceph-disk@dev-sdc1.service had failed and I re-ran it. From there the OSD stopped working. This is Ubuntu 16.04. I connected to IRC looking for help, where people pointed me to one place or another, but none of the investigations helped to resolve it. My configuration is rather simple:

root@red-compute:~# ceph osd tree
ID WEIGHT  TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.0     root default
-4 1.0         rack rack-1
-2 1.0             host blue-compute
 0 1.0                 osd.0            down        0          1.0
 2 1.0                 osd.2            down        0          1.0
-3 1.0             host red-compute
 1 1.0                 osd.1            down        0          1.0
 3 0.5                 osd.3              up      1.0          1.0
 4 1.0                 osd.4            down        0          1.0

This is what I have got so far:

1. Once upgraded, I discovered that the daemons run under the ceph user. I just ran chown on the ceph directories and it worked.
2. The firewall is fully disabled. Checked connectivity with nc and nmap.
3. The configuration seems to be right. I can post it if you want.
4. Enabling logging on the OSDs shows that, for example, osd.1 is reconnecting all the time:

   2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
   2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 c=0x556f993b3a80).fault (0) Success
   2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 ms_handle_reset con 0x556f993b3a80 session 0

5. osd.3 stays OK because it was never marked out (Ceph restriction).
6. I rebooted all services at once so that all OSDs would be available at the same time and not get marked down. Didn't work.
7. I forced them up from the command line: ceph osd in 1-5. They appear as in for a while, then out.
8. We tried ceph-disk activate-all to boot everything. Didn't work.

The strange thing is that the cluster worked just fine right after the upgrade, but the systemctl command broke both servers.

root@blue-compute:~# ceph -w
    cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
     health HEALTH_ERR
            694 pgs are stuck inactive for more than 300 seconds
            694 pgs stale
            694 pgs stuck stale
            too many PGs per OSD (1528 > max 300)
            mds cluster is degraded
            crush map has straw_calc_version=0
     monmap e10: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
            election epoch 3600, quorum 0,1 red-compute,blue-compute
      fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
     osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
      pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
            87641 MB used, 212 GB / 297 GB avail
                 694 stale+active+clean
                  70 active+clean

2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has straw_calc_version=

cat /etc/ceph/ceph.conf
[global]
fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
mon_initial_members = blue-compute, red-compute
mon_host = 172.16.0.119, 172.16.0.100
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 172.16.0.0/24
osd_pool_default_pg_num = 100
osd_pool_default_pgp_num = 100
osd_pool_default_size = 2 # Write an object 3 times.
osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.

## Required upgrade
osd max object name len = 256
osd max object namespace len = 64

[mon.]
debug mon = 9
caps mon = "allow *"

Any help on this? Any clue of what's going wrong?
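For reference, the commands behind some of those points were roughly these (a sketch; ids and paths are from my setup, adjust as needed):

  chown -R ceph:ceph /var/lib/ceph /var/log/ceph   # point 1, after the jewel upgrade
  ceph daemon osd.1 status                         # check what the daemon itself reports
  ceph osd in 0 1 2 4                              # point 7, marking the down OSDs in
  ceph-disk activate-all                           # point 8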
Best regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inconsistencies from read errors during scrub
On Thu, 21 Apr 2016, Dan van der Ster wrote:
> On Thu, Apr 21, 2016 at 1:23 PM, Dan van der Ster wrote:
> > Hi cephalapods,
> >
> > In our couple years of operating a large Ceph cluster, every single
> > inconsistency I can recall was caused by a failed read during
> > deep-scrub. In other words, deep scrub reads an object, the read fails
> > with dmesg reporting "Sense Key : Medium Error [current]", "Add.
> > Sense: Unrecovered read error", "blk_update_request: critical medium
> > error", but the ceph-osd keeps on running and serving up data.
>
> I forgot to mention that the OSD notices the read error. In jewel it prints:
> :head got -5 on read, read_error
> So why no assert?

I think this should be controlled by a config option, similar to how it is on read (filestore_fail_eio ... although we probably want a more generic option for that, too). The danger would be that if we fail the whole OSD due to a single failed read, we might fail too many osds too quickly, and availability drops. Ideally, if we saw an EIO we would do a graceful offload (mark the osd out or reweight to 0, drop primary_affinity; and then fail the osd when we are done).

sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
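(For reference, the graceful offload Sage sketches above can already be approximated by hand today; roughly, with osd.12 standing in for the OSD that threw the read error:

  ceph osd primary-affinity osd.12 0        # reads are no longer served from the failing disk
  ceph osd out 12                           # or: ceph osd reweight 12 0 -- start draining it
  # wait for backfill to finish ("ceph -s"), then retire the OSD:
  systemctl stop ceph-osd@12
  ceph osd crush remove osd.12 && ceph auth del osd.12 && ceph osd rm 12

This is only a manual sketch of the idea, not what the proposed config option would do automatically.)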
Re: [ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
Hello, I forgot to say that the nodes are in preboot status. Something seems strange to me. root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.1 status { "cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771", "osd_fsid": "adf9890a-e680-48e4-82c6-e96f4ed56889", "whoami": 1, "state": "preboot", "oldest_map": 1764, "newest_map": 2504, "num_pgs": 323 } root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.3 status { "cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771", "osd_fsid": "8dd085d4-0b50-4c80-a0ca-c5bc4ad972f7", "whoami": 3, "state": "preboot", "oldest_map": 1764, "newest_map": 2504, "num_pgs": 150 } 3 is up and in. On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado < gaguilar.delg...@gmail.com> wrote: > Hello, > > I just upgraded my cluster to the version 10.1.2 and it worked well for a > while until I saw that systemctl ceph-disk@dev-sdc1.service was failed > and I reruned it. > > From there the OSD stopped working. > > This is ubuntu 16.04. > > I connected to the IRC looking for help where people pointed me to one or > another place but none of the investigations helped to resolve. > > My configuration is rather simple: > > oot@red-compute:~# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 1.0 root default > -4 1.0 rack rack-1 > -2 1.0 host blue-compute > 0 1.0 osd.0down0 1.0 > 2 1.0 osd.2down0 1.0 > -3 1.0 host red-compute > 1 1.0 osd.1down0 1.0 > 3 0.5 osd.3 up 1.0 1.0 > 4 1.0 osd.4down0 1.0 > > > > This is what I got sofar: > > >1. Once upgraded I discovered that daemon runs under ceph. I just ran >chown on ceph directories. and it worked. >2. Firewall is fully disabled. Checked connectivity with nc and nmap. >3. Configuration seems to be right. I can post if you want. >4. Enabling logging on OSD shows that for example osd.1 is >reconnecting all the time. > 1. 2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 > c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0 >2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 > c=0x556f993b3a80).fault (0) Success >2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 > ms_handle_reset con 0x556f993b3a80 session 0 >5. OSD.3 goes ok because never left out because ceph restriction. >6. I rebooted all services at once for it to have available all OSD at >the same time and don't mark it down. Don't work. >7. I forced up from commandline. ceph osd in 1-5. They appear as in >for a while then out. >8. We tried ceph-disk activate-all to boot everything. Don't work. > > > The strange thing is that culster started worked just right after upgrade. > But the systemctrl command broke both servers. 
> > root@blue-compute:~# ceph -w > cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 > health HEALTH_ERR > 694 pgs are stuck inactive for more than 300 seconds > 694 pgs stale > 694 pgs stuck stale > too many PGs per OSD (1528 > max 300) > mds cluster is degraded > crush map has straw_calc_version=0 > monmap e10: 2 mons at {blue-compute= > 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0} > election epoch 3600, quorum 0,1 red-compute,blue-compute > fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay} > osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs > pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects > 87641 MB used, 212 GB / 297 GB avail > 694 stale+active+clean > 70 active+clean > > 2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck > inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too > many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has > straw_calc_version= > > cat /etc/ceph/ceph.conf > [global] > > fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771 > mon_initial_members = blue-compute, red-compute > mon_host = 172.16.0.119, 172.16.0.100 > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > filestore_xattr_use_omap = true > public_network = 172.16.0.0/24 > osd_pool_default_pg_num = 100 > osd_pool_default_pgp_num = 100 > osd_pool_default_size = 2 # Write an object 3 times. > osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state. > > ## Required upgrade > osd max object name len = 256 > osd max object namespace len = 64 > > [mon.] > > de
Re: [ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
I must also add that I just found in the log the following. I don't know if this has something to do with the problem. => ceph-osd.admin.log <== 2016-05-10 18:21:46.060278 7fa8f30cc8c0 0 ceph version 10.1.2 (4a2a6f72640d6b74a3bbd92798bb913ed380dcd4), process ceph-osd, pid 14135 2016-05-10 18:21:46.060460 7fa8f30cc8c0 -1 bluestore(/dev/sdc2) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding 2016-05-10 18:21:46.062949 7fa8f30cc8c0 1 journal _open /dev/sdc2 fd 4: 5367660544 bytes, block size 4096 bytes, directio = 0, aio = 0 2016-05-10 18:21:46.062991 7fa8f30cc8c0 1 journal close /dev/sdc2 2016-05-10 18:21:46.063026 7fa8f30cc8c0 0 probe_block_device_fsid /dev/sdc2 is filestore, 119a9f4e-73d8-4a1f-877c-d60b01840c96 2016-05-10 18:21:47.072082 7eff735598c0 0 ceph version 10.1.2 (4a2a6f72640d6b74a3bbd92798bb913ed380dcd4), process ceph-osd, pid 14177 2016-05-10 18:21:47.072285 7eff735598c0 -1 bluestore(/dev/sdf2) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding 2016-05-10 18:21:47.074799 7eff735598c0 1 journal _open /dev/sdf2 fd 4: 5367660544 bytes, block size 4096 bytes, directio = 0, aio = 0 2016-05-10 18:21:47.074844 7eff735598c0 1 journal close /dev/sdf2 2016-05-10 18:21:47.074881 7eff735598c0 0 probe_block_device_fsid /dev/sdf2 is filestore, fd069e6a-9a62-4286-99cb-d8a523bd946a r On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado < gaguilar.delg...@gmail.com> wrote: > Hello, > > I just upgraded my cluster to the version 10.1.2 and it worked well for a > while until I saw that systemctl ceph-disk@dev-sdc1.service was failed > and I reruned it. > > From there the OSD stopped working. > > This is ubuntu 16.04. > > I connected to the IRC looking for help where people pointed me to one or > another place but none of the investigations helped to resolve. > > My configuration is rather simple: > > oot@red-compute:~# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 1.0 root default > -4 1.0 rack rack-1 > -2 1.0 host blue-compute > 0 1.0 osd.0down0 1.0 > 2 1.0 osd.2down0 1.0 > -3 1.0 host red-compute > 1 1.0 osd.1down0 1.0 > 3 0.5 osd.3 up 1.0 1.0 > 4 1.0 osd.4down0 1.0 > > > > This is what I got sofar: > > >1. Once upgraded I discovered that daemon runs under ceph. I just ran >chown on ceph directories. and it worked. >2. Firewall is fully disabled. Checked connectivity with nc and nmap. >3. Configuration seems to be right. I can post if you want. >4. Enabling logging on OSD shows that for example osd.1 is >reconnecting all the time. > 1. 2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 > c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0 >2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 > c=0x556f993b3a80).fault (0) Success >2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 > ms_handle_reset con 0x556f993b3a80 session 0 >5. OSD.3 goes ok because never left out because ceph restriction. >6. I rebooted all services at once for it to have available all OSD at >the same time and don't mark it down. Don't work. >7. I forced up from commandline. ceph osd in 1-5. They appear as in >for a while then out. >8. We tried ceph-disk activate-all to boot everything. 
Don't work. > > > The strange thing is that culster started worked just right after upgrade. > But the systemctrl command broke both servers. > > root@blue-compute:~# ceph -w > cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 > health HEALTH_ERR > 694 pgs are stuck inactive for more than 300 seconds > 694 pgs stale > 694 pgs stuck stale > too many PGs per OSD (1528 > max 300) > mds cluster is degraded > crush map has straw_calc_version=0 > monmap e10: 2 mons at {blue-compute= > 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0} > election epoch 3600, quorum 0,1 red-compute,blue-compute > fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay} > osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs > pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects > 87641 MB used, 212 GB / 297 GB avail > 694 stale+active+clean > 70 active+clean > > 2016-05-10 17:03:55.822440 mon.
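(Regarding the _read_bdev_label messages above: that looks like the fsid probe trying bluestore first before concluding the partition is filestore, so it is probably noise rather than the cause. A quick way to double-check how those partitions are seen, as a sketch:

  ceph-disk list            # shows which partition is data / journal and for which OSD
  sgdisk -i 2 /dev/sdc      # partition type GUID for /dev/sdc2
)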
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
On Tue, May 10, 2016 at 6:48 AM, Nick Fisk wrote: > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Nick Fisk >> Sent: 10 May 2016 13:30 >> To: 'Eric Eastman' >> Cc: 'Ceph Users' >> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on >> lockfile >> > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: >> > > Hi Eric, >> > > >> > >> >> > >> I am trying to do some similar testing with SAMBA and CTDB with the >> > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are >> > >> you kernel mounting the Ceph file system? >> > > >> > > I'm using the kernel client. I couldn't find any up to date >> > > information on if >> > the vfs plugin supported all the necessary bits and pieces. >> > > >> > > How is your testing coming along? I would be very interested in any >> > findings you may have come across. >> > > >> > > Nick >> > >> > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When >> from >> > a SAMBA client, I write a large file (about 2GB) to a gateway that is >> > not the holder of the CTDB lock file, and then kill that gateway >> > server during the write, the IP failover works as expected, and in >> > most cases the file ends up being the correct size after the new >> > server finishes writing it, but the data is corrupt. The data in the > file, from >> the point of the failover, is all zeros. >> > >> > I thought the issue may be with the kernel mount, so I looked into >> > using the SAMBA vfs_ceph module, but I need SAMBA with AD support >> and >> > the current vfs_ceph module, even in the SAMBA git master version, is >> > lacking ACL support for CephFS, as the vfs_ceph.c patches summited to >> > the SAMBA mail list are not yet available. See: >> > https://lists.samba.org/archive/samba-technical/2016-March/113063.html >> > >> > I tried using a FUSE mount of the CephFS, and it also fails setting > ACLs. See: >> > http://tracker.ceph.com/issues/15783. >> > >> > My current status is IP failover is working, but I am seeing data >> > corruption on writes to the share when using kernel mounts. I am also >> > seeing the issue you reported when I kill the system holding the CTDB >> > lock file. Are you verifying your data after each failover? >> >> I must admit you are slightly ahead of me. I was initially trying to just > get >> hard/soft failover working correctly. But your response has prompted me to >> test out the scenario you mentioned. I'm seeing slightly different > results, my >> copy seems to error out when I do a node failover. I'm copying an ISO from > a >> 2008 server to the CTDB/Samba share and when I reboot the active node, >> the copy pauses for a couple of seconds and then comes up with the error >> box. Clicking try again several times doesn't let it resume. I need to do > a bit >> more digging to try and work out why this is happening. The share itself > does >> seem to be in a working state when trying to click the try again button, > so >> there is probably some sort of state/session problem. >> >> Do you have multiple vip's configured on your cluster or just a single IP? > I >> have just the one at the moment. I have 4 HA addresses setup, and I am using my AD to do the round-robin DNS. The moving of IP addresses on failure or when a CTDB controlled SAMBA system comes on line works great. 
> > Just to add to this, I have just been reading this article > > https://nnc3.com/mags/LM10/Magazine/Archive/2009/105/030-035_SambaHA/article > .html > > And the following paragraph seems to indicate that what I am seeing is the > correct behaviour? I 'm wondering if this is not happening in your case and > is why you are getting corruption? > > "It is important to understand that load balancing and client distribution > over the client nodes are connection oriented. If an IP address is switched > from one node to another, all the connections actively using this IP address > are dropped and the clients have to reconnect. > > To avoid delays, CTDB uses a trick: When an IP is switched, the new CTDB > node "tickles" the client with an illegal TCP ACK packet (tickle ACK) > containing an invalid sequence number of 0 and an ACK number of 0. The > client responds with a valid ACK packet, allowing the new IP address owner > to close the connection with an RST packet, thus forcing the client to > reestablish the connection to the new node." > Nice article. I have been trying to figure out if data integrity is supported with CTDB on failover on any shared file system. From looking at various email posts on CTDB+GPFS, it looks like it may work, so I am going to continue to test it with various CephFS configurations. There is a new "witness protocol" in SMB3 to support failover, that is not yet supported in any released versions of SAMBA. I may have to wait for it to be implemented in SAMBA to get fully working failover. See: https://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
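For anyone who ends up trying the vfs_ceph route once the ACL patches land, a minimal share stanza might look roughly like this (a sketch only; the share name, path and cephx user are assumptions):

  [dataset]
      # path is interpreted inside the CephFS root
      path = /export
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      # cephx user the gateway authenticates as (assumed to exist)
      ceph:user_id = samba
      # there is no kernel file descriptor behind vfs_ceph
      kernel share modes = no
      read only = no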
Re: [ceph-users] Erasure pool performance expectations
Thanks Nick. I added it to my ceph.conf. I'm guessing this is an OSD setting and therefor I should restart my OSDs is that correct? On Tue, May 10, 2016 at 3:48 PM, Nick Fisk wrote: > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Peter Kerdisle > > Sent: 10 May 2016 14:37 > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Erasure pool performance expectations > > > > To answer my own question it seems that you can change settings on the > fly > > using > > > > ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' > > osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) > > > > However the response seems to imply I can't change this setting. Is > there an > > other way to change these settings? > > Sorry Peter, I missed your last email. You can also specify that setting > in the ceph.conf, ie I have in mine > > osd_tier_promote_max_bytes_sec = 400 > > > > > > > > > On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle > > > wrote: > > Hey guys, > > > > I noticed the merge request that fixes the switch around here > > https://github.com/ceph/ceph/pull/8912 > > > > I had two questions: > > > > • Does this effect my performance in any way? Could it explain the slow > > requests I keep having? > > • Can I modify these settings manually myself on my cluster? > > Thanks, > > > > Peter > > > > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > > > wrote: > > Hey Mark, > > > > Sorry I missed your message as I'm only subscribed to daily digests. > > > > Date: Tue, 3 May 2016 09:05:02 -0500 > > From: Mark Nelson > > To: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Erasure pool performance expectations > > Message-ID: > > Content-Type: text/plain; charset=windows-1252; format=flowed > > In addition to what nick said, it's really valuable to watch your cache > > tier write behavior during heavy IO. One thing I noticed is you said > > you have 2 SSDs for journals and 7 SSDs for data. > > > > I thought the hardware recommendations were 1 journal disk per 3 or 4 > data > > disks but I think I might have misunderstood it. Looking at my journal > > read/writes they seem to be ok > > though: https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016- > > 05-06%2009.55.30.png?dl=0 > > > > However I started running into a lot of slow requests (made a separate > > thread for those: Diagnosing slow requests) and now I'm hoping these > could > > be related to my journaling setup. > > > > If they are all of > > the same type, you're likely bottlenecked by the journal SSDs for > > writes, which compounded with the heavy promotions is going to really > > hold you back. > > What you really want: > > 1) (assuming filestore) equal large write throughput between the > > journals and data disks. > > How would one achieve that? > > > > 2) promotions to be limited by some reasonable fraction of the cache > > tier and/or network throughput (say 70%). This is why the > > user-configurable promotion throttles were added in jewel. > > Are these already in the docs somewhere? > > > > 3) The cache tier to fill up quickly when empty but change slowly once > > it's full (ie limiting promotions and evictions). No real way to do > > this yet. > > Mark > > > > Thanks for your thoughts. > > > > Peter > > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
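For reference, the two ways of applying the promotion throttles discussed above (a sketch; the values are just the examples from this thread, not recommendations):

  # persistent: add to the [osd] section of ceph.conf, then restart OSDs one at a time
  [osd]
  osd_tier_promote_max_bytes_sec = 5242880
  osd_tier_promote_max_objects_sec = 50

  # runtime, as already shown above:
  ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880'
  # restarting a single OSD under systemd (jewel):
  systemctl restart ceph-osd@0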
[ceph-users] Adding an OSD to existing Ceph using ceph-deploy
All, I am trying to add another OSD to our cluster using ceph-deploy. This is running Jewel. I previously set up the other 12 OSDs on a fresh install using the command: ceph-deploy osd create :/dev/mapper/mpath:/dev/sda Those are all up and happy. On the systems /dev/sda is an SSD which I have created partitions on for journals. It seems to prepare everything fine (ceph-deploy osd prepare ceph-1-35a:/dev/mapper/mpathn:/dev/sda8), but when it comes time to activate, I am getting a Traceback: [2016-05-10 11:27:58,195][ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.2.1511 Core [2016-05-10 11:27:58,195][ceph_deploy.osd][DEBUG ] activating host ceph-1-35a disk /dev/mapper/mpathn [2016-05-10 11:27:58,195][ceph_deploy.osd][DEBUG ] will use init type: systemd [2016-05-10 11:27:58,196][ceph-1-35a][INFO ] Running command: ceph-disk -v activate --mark-init systemd --mount /dev/mapper/mpathn [2016-05-10 11:27:58,315][ceph-1-35a][WARNING] main_activate: path = /dev/mapper/mpathn [2016-05-10 11:27:58,315][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid path is /sys/dev/block/253:8/dm/uuid [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid is mpath-360001ff09070e00c8921000c [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid path is /sys/dev/block/253:8/dm/uuid [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid is mpath-360001ff09070e00c8921000c [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] Traceback (most recent call last): [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] File "/usr/sbin/ceph-disk", line 9, in [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')() [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4964, in run [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] main(sys.argv[1:]) [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4915, in main [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] args.func(args) [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3269, in main_activate [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] reactivate=args.reactivate, [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2979, in mount_activate [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] e, [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] ceph_disk.main.FilesystemTypeError: Cannot discover filesystem type: device /dev/mapper/mpathn: Line is truncated: [2016-05-10 11:27:58,318][ceph-1-35a][ERROR ] RuntimeError: command returned non-zero exit status: 1 [2016-05-10 11:27:58,318][ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init systemd --mount /dev/mapper/mpathn This seems to be due to the command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn is being run instead of: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn1 Anyone have ideas on how to get these happy? 
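For reference, this is the comparison I mean (a sketch; depending on the multipath configuration the partition node may be named mpathn1 or mpathn-part1):

  # what ceph-disk runs (the whole multipath device, which has no filesystem):
  /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn
  # what I would expect it to probe (the data partition):
  /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn1
  # if the partition node is missing, kpartx can (re)create the device-mapper entries:
  kpartx -a /dev/mapper/mpathn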
Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Performance during disk rebuild - MercadoLibre
Hello All,

I'm writing to you because I'm trying to find a way to rebuild an OSD disk without impacting the performance of the cluster, because my applications are very latency sensitive.

1. I found a way to reuse an OSD ID and not rebalance the cluster every time I lose a disk, so my cluster is running with the noout flag set permanently. The point here is to do the disk change as fast as I can.

2. After reusing the OSD ID, I'm leaving the OSD up and running, but with zero weight. For example:

root@DC4-ceph03-dn03:/var/lib/ceph/osd/ceph-352# ceph osd tree | grep 352
352 1.81999 osd.352 up 0 1.0

At this point everything is good.

3. Starting the reweight, using "osd reweight" I'm not touching the crushmap, and I'm doing the reweight very gradually. Example:

ceph osd reweight 352 0.001

But even doing the reweight this way I'm still hurting latency sometimes. Depending on the amount of PGs the cluster is recovering, the impact is worse.

Tunings that I have already done:

ceph tell osd.* injectargs "--osd_max_backfills 1"
ceph tell osd.* injectargs "--osd_recovery_max_active 1"
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'

The question is: are there more parameters to change in order to make the OSD rebuild more gradual?

I really appreciate your help, thanks in advance.

Agustin Trolli
Storage Team
Mercadolibre.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
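The gradual reweight in step 3 of the message above can be scripted roughly like this (a sketch; the OSD id, the list of weight steps and the polling interval are only examples):

  #!/bin/bash
  # raise the reweight of osd.352 step by step, waiting for recovery to settle in between
  osd=352
  for w in 0.001 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
      ceph osd reweight $osd $w
      # wait until no recovery/backfill/degraded activity is reported any more
      while ceph health | grep -Eq 'recover|backfill|degraded'; do sleep 60; done
  done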
Re: [ceph-users] Performance during disk rebuild - MercadoLibre
Hello, As far as I know and can tell, you're doing everything that is possible for having a least impact OSD rebuild/replacement. If your cluster is still strongly, adversely impacted by this gradual and throttled approach, how about the following things: 1. Does scrub or deep_scrub also impact your performance so that your applications notice it? 2. Are there times when other cluster activity (like reboots or installs of new VMs, other large data movements created by clients) impacts your applications? If both or either of these are true, your cluster is at the limit of its capacity. And in general, a rebuild with throttled parameters like yours (and many others, including me) should not hurt things. If it does, it's time to improve your cluster performance. 1. Adding journal SSDs if not present already. 2. Adding more OSDs in general. 3. Adding a cache tier, this is particular effective if your latency sensitive applications do small writes or reads that easily fit into the cache. I was in a similar situation with hundreds of VMs running an application that had latency sensitive small writes and adding a cache tier completely solved the problem. Regards, Christian On Tue, 10 May 2016 16:30:00 -0300 Agustín Trolli wrote: > Hello All, > I´m writing to you because i´m trying to find the way to rebuild a osd > disk in a way to don´t impact the performance of the cluster. > That´s because my applications are very latency sensitive. > > 1_ I found the way to reuse a OSD ID and don´t rebalance the cluster > every time that I lost a disk. > So, my cluster is running with the noout check forever. > The point here is do the disk change as fast I can. > > 2_ after reuse de OSD ID, I´m living the OSD up and running, but with > CERO weight. > For example: > > root@DC4-ceph03-dn03:/var/lib/ceph/osd/ceph-352# ceph osd tree | grep 352 > *352 1.81999 osd.352 up0 > 1.0* > > At this point everything is good. > > 3_ Starting the reweight, using "osd reweigh" i´m not touching the > crushmap, and I´m doing the reweight very gradually. > Example: > *ceph osd reweight 352 0.001* > > But, anyway doing the reweight in this way i´m heating the latency > sometimes. > Depending of the amount of PGs that the cluster is recovering the impact > is worst. > > Tunings that I already have done: > > ceph tell osd.* injectargs "--osd_max_backfills 1" > ceph tell osd.* injectargs "--osd_recovery_max_active 1" > ceph tell osd.* injectargs '--osd-max-recovery-threads 1' > ceph tell osd.* injectargs '--osd-recovery-op-priority 1' > ceph tell osd.* injectargs '--osd-client-op-priority 63' > > The question is, there are more parameters to change in order to do more > gradually the OSD rebuild? > > I really appreciate your help, thanks in advance. > > Agustin Trolli > Storage Team > Mercadolibre.com -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
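If point 1 above turns out to be true (scrubs alone already hurt the latency sensitive applications), these scrub knobs are usually worth testing before anything else (a sketch; the values are examples only, and the begin/end hour options exist only on reasonably recent versions):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'           # pause between scrub chunks
  ceph tell osd.* injectargs '--osd_scrub_load_threshold 2'    # skip scrubs above this loadavg
  ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 6'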
Re: [ceph-users] journal or cache tier on SSDs ?
On Tue, 10 May 2016 17:51:24 +0200 Yoann Moulin wrote: [snip] > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > >>> > >>> These SSDs do not exist according to the Intel site and the only > >>> references I can find for them are on "no longer available" European > >>> sites. > >> > >> I made a mistake, it's not 400 but 480GB, smartctl give me Model > >> SSDSC2BB480H4 > >> > > OK, that's not good. > > Firstly, that model number still doesn't get us any hits from Intel, > > strangely enough. > > > > Secondly, it is 480GB (instead of 400, which denotes overprovisioning) > > and matches the 3510 480GB model up to the last 2 characters. > > And that has an endurance of 275TBW, not something you want to use for > > either journals or cache pools. > > I see, here the information from the resseler : > > "The S3300 series is the OEM version of S3510 and 1:1 the same drive" > Given the SMART output below, it seems to be 3500 based, but that doesn't change things. > >>> Without knowing the specifications for these SSDs, I can't recommend > >>> them. I'd use DC S3610 or 3710 instead, this very much depends on how > >>> much endurance (TPW) you need. > >> > >> As I write above, I already have those SSDs so I look for the best > >> setup with the hardware I have. > >> > > > > Unless they have at least an endurance of 3 DWPD like the 361x (and > > their model number, size and the 3300 naming suggests they do NOT), > > your 480GB SSDs aren't suited for intense Ceph usage. > > > > How much have you used them yet and what is their smartctl status, in > > particular these values (from a 800GB DC S3610 in my cache pool): > > --- > > 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail > > Always - 0 233 Media_Wearout_Indicator 0x0032 100 > > 100 000Old_age Always - 0 241 > > Host_Writes_32MiB 0x0032 100 100 000Old_age > > Always - 869293 242 Host_Reads_32MiB0x0032 100 > > 100 000Old_age Always - 43435 243 > > NAND_Writes_32MiB 0x0032 100 100 000Old_age > > Always - 1300884 --- > > > > Not even 1% down after 40TBW, at which point your SSDs are likely to be > > 15% down... > > More or less the same value on the 10 hosts I have on my beta cluster : > > 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 > 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 > 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 233252 > 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 13 > >From the read count it's obvious that you used those as journals. ^.^ As I hinted above, if these were 3510 based they also should have the 243 attribute, as in my 3610 example. You may want to upgrade your smartctl and/or it's definition DB (on Debian that can be done with "update-smart-drivedb"). Intel's calculation of the media wearout always seems to be very fuzzy to me, given your 7TB written I'd expect it to be 98%, at least 99%. But then again a 200GB DC S3700 of mine has written 90TB out of 3650TB total and is at 99%, when I would expect it to be at 98%. Either way, those SSDs are designed for 275TBW (or 0.3 DWPD), and if they are used as journals they will expire quickly when those 100TB+ datasets get updated. They _might_ survive longer with a very carefully tuned cache tier (promote only really hot objects), but the risk of loosing SSDs there can be even higher than with journals. 
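To put a rough number on "expire quickly" (all figures assumed purely for illustration):

  # journal wear-out estimate for a 275 TBW drive:
  #   sustained ingest per node : 200 MB/s  (well below what 10 HDDs can sink)
  #   journal SSDs per node     : 2         -> ~100 MB/s written to each SSD
  #   100 MB/s * 86400 s/day    = ~8.6 TB/day through each journal
  #   275 TBW / 8.6 TB/day      = ~32 days of continuous writing at that rate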
[snap] Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] journal or cache tier on SSDs ?
Hi, If we have 12 SATA disks, each 4TB as storage pool. Then how many SSD disks we should have for cache tier usage? thanks. 2016-05-10 16:40 GMT+08:00 Yoann Moulin : > Hello, > > I'd like some advices about the setup of a new ceph cluster. Here the use > case : > > RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most > of > the access will be in read only mode. Write access will only be done by the > admin to update the datasets. > > We might use rbd some time to sync data as temp storage (when POSIX is > needed) > but performance will not be an issue here. We might use cephfs in the > futur if > that can replace a filesystem on rdb. > > We gonna start with 16 nodes (up to 24). The configuration of each node is > : > > CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > Memory : 128GB > OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > OSD Disk : 10 x HGST ultrastar-7k6000 6TB > Public Network : 1 x 10Gb/s > Private Network : 1 x 10Gb/s > OS : Ubuntu 16.04 > Ceph version : Jewel > > The question is : journal or cache tier (read only) on the SD 400GB Intel > S3300 DC ? > > Each disk is able to write sequentially at 220MB/s. SSDs can write at > ~500MB/s. > if we set 5 journals on each SSDs, SSD will still be the bottleneck (1GB/s > vs > 2GB/s). If we set the journal on OSDs, we can expect a good throughput in > read > on the disk (in case of data not in the cache) and write shouldn't be so > bad > too, even if we have random read on the OSD during the write ? > > SSDs as cache tier seem to be a better usage than only 5 journal on each ? > Is > that correct ? > > We gonna use an EC pool for big files (jerasure 8+2 I think) and a > replicated > pool for small files. > > If I check on http://ceph.com/pgcalc/, in this use case > > replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs > Ec pool : pg_num = 4096 > and pgp_num = pg_num > > Should I set the pg_num to 8192 or 16384 ? what is the impact on the > cluster if > we set the pg_num to 16384 at the beginning ? 16384 is high, isn't it ? > > Thanks for your help > > -- > Yoann Moulin > EPFL IC-IT > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] journal or cache tier on SSDs ?
Hello,

On Wed, 11 May 2016 11:24:29 +0800 Geocast Networks wrote:
> Hi,
>
> If we have 12 SATA disks, each 4TB as storage pool.
> Then how many SSD disks we should have for cache tier usage?
>
That question makes no sense.

Firstly, you mentioned earlier that you have 21 of those hosts, which would be a significant factor when trying to determine cache-tier sizes, as it gives an idea of your overall storage needs. But the size of the cache-tier would totally depend on your use case and how big your hot data is. Nobody can answer that for you. A cache tier may also make no (financial) sense for you; speeding up things with SSD journals currently is the best first step.

Secondly, what I think you mean is the number of SSDs for JOURNAL usage, which is something completely different. You will want to read up more on Ceph concepts and explore the ML archives.

That said, 12 HDDs will be able to write about 1GB/s in total, so your journal SSDs should be around that (sequential write) speed as well. And they should be DC level SSDs (Intel DC S or respective Samsung models), with medium (3 DWPD) to large (10 DWPD) endurance.

Normally you will also want to avoid putting too many journals on one SSD, as a failure of the SSD will kill all associated HDD OSDs. However as you have 21 hosts and hopefully decent redundancy and distribution (CRUSH map), going with 2 SSDs (6 journals per SSD) should be fine.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
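For completeness, with 2 SSDs and 6 journals each, the placement described above usually ends up being expressed like this with ceph-deploy (a sketch; the hostname and device names are placeholders, and one journal partition per OSD is carved out of the SSD):

  ceph-deploy osd create host1:/dev/sdc:/dev/sdm   # data on HDD sdc, journal partition on SSD sdm
  ceph-deploy osd create host1:/dev/sdd:/dev/sdm
  ...                                              # six HDDs per journal SSD
  ceph-deploy osd create host1:/dev/sdi:/dev/sdn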
[ceph-users] rbd resize option
Hello, I wanted to resize an image using the 'rbd resize' option, but without data loss. For example: I have an image of 100 GB (thin provisioned), and this image has only 10GB of data. I want to resize this image to 11GB, so that the 10GB of data is safe and the image is resized. Can I do the above resize safely? If I try to resize to 5GB, does rbd throw an error saying that data will be lost, something like that? Any inputs here are appreciated. Thanks Swami ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
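For reference: growing an image never touches existing data, while shrinking simply truncates the image at the new size. rbd tracks offsets, not "used data", so the filesystem inside the image has to be shrunk below the new size first, and recent rbd versions refuse to shrink unless told explicitly. A sketch (pool/image names are placeholders, sizes are in MB):

  rbd resize --size 204800 rbdpool/myimage                  # growing (here to 200 GB) is safe
  rbd resize --size 11264 rbdpool/myimage --allow-shrink    # shrinking to 11 GB must be forced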
[ceph-users] wrong exit status if bucket already exists
Hi, I am using infernalis 9.2.1. While creating a bucket, if the bucket already exists, it still returns 0 as the exit status. Is this intentional for some reason, or a bug?

root@node1:~# ceph osd crush add-bucket rack1 rack
bucket 'rack1' already exists
root@node1:~# echo $?
0
root@node1:~#

-- Swapnil ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
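Until/unless that exit status changes, one way to script around it is an explicit existence check (a sketch):

  # only attempt the add when rack1 is not already in the crush map
  ceph osd crush dump | grep -q '"name": "rack1"' || ceph osd crush add-bucket rack1 rack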
Re: [ceph-users] Weighted Priority Queue testing
+ceph users

Hi, here is the first cut result. I can only manage a 128TB box for now.

Ceph code base | Capacity      | Drive capacity | Compute-nodes | Copies | Data set | Failure domain | Fault injected    | Degraded PGs | Full recovery time | Last 1% recovery time
Hammer         | 2X128TB IF150 | 8TB            | 2             | 2      | ~80TB    | Chassis        | One OSD node down | ~20%         | ~24 hours          | ~3-4 hours
Hammer         | 2X128TB IF150 | 8TB            | 4             | 2      | ~80TB    | Chassis        | One OSD node down | ~10%         | 10 hours 3 min     | ~3 hours
Hammer         | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 7 hours 5 min      | ~2.5 hours
Jewel          | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 6 hours 10 min     | ~1 hour 30 min
Jewel + wpq    | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 8 hours 30 min     | ~4 hours 30 min

Summary:

1. The first scenario is the only 4-node scenario, and since it is chassis level replication the single node remaining on the chassis takes all the traffic. That seems to be a bottleneck, as with host level replication on a similar setup the recovery time is much less (data not in this table).
2. In the second scenario I kept everything else the same but doubled the nodes per chassis. Recovery time is also halved.
3. For the third scenario I increased the cluster data and also doubled the number of OSDs in the cluster (since each drive is 4TB now). Recovery time came down further.
4. Moved to Jewel keeping everything else the same, got further improvement. Mostly because of improved write performance in Jewel (?).
5. The last scenario is interesting. I got better recovery speed than in any other scenario with WPQ: degraded PG % came down to 2% within 3 hours and ~0.6% within 4 hours 15 min, but the last 0.6% took ~4 hours, hurting the overall recovery time.
6. In fact, this long tail is hurting the overall recovery time in all the other scenarios as well. Related tracker I found is http://tracker.ceph.com/issues/15763

Any feedback much appreciated. We can discuss this in tomorrow's performance call if needed.

Thanks & Regards
Somnath

-Original Message-
From: Somnath Roy Sent: Wednesday, May 04, 2016 11:47 AM To: 'Mark Nelson'; Nick Fisk; Ben England; Kyle Bader Cc: Sage Weil; Samuel Just Subject: RE: Weighted Priority Queue testing

Thanks Mark, I will come back to you with some data on that. This is what I am planning to run.

1. One 2X IF150 chassis with 256 TB flash each and total 8 node cluster (4 servers on each). Will generate ~100TB of data on the cluster.
2. Will go for host and chassis level replication with 2 copies.
3. Heavy IO will be on (different block sizes 60% RW and 40% RR)

Hammer took me ~4 hours to complete recovery for a host level replication and single host down. ~12 hours when single host down with chassis level replication. Bear with me till I find all the HW for this :-) Let me know if you guys want to add something here..

Regards
Somnath

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Wednesday, May 04, 2016 8:40 AM To: Somnath Roy; Nick Fisk; Ben England; Kyle Bader Cc: Sage Weil; Samuel Just Subject: Weighted Priority Queue testing

Hi Guys, I think all of you have expressed some interest in recovery testing either now or in the past, so I wanted to get folks together to talk.
We need to get the new weighted priority queue tested to: a) see when/how it's breaking b) hopefully see better behavior It's available in Jewel through a simple ceph.conf change: osd_op_queue = wpq For those of you who have run cbt recovery tests in the past, it might be worth running some new stress tests comparing: a) jewel + wpq b) jewel + prio queue c) hammer In the past I've done this under various concurrent client workloads (say large sequential or small random writes). I think Kyle has done quite a bit of this kind of testing in the recent past with Intel as well, so he might have some insights as to where we've been hurting recently. Thanks, Mark PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
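(As an aside, for anyone reproducing the long-tail measurements in the table above, a crude way to log the degraded percentage over time during a recovery run, as a sketch:

  # append a timestamped copy of the degraded line from "ceph -s" every minute
  while sleep 60; do
      echo "$(date +%s) $(ceph -s | grep degraded)"
  done >> degraded.log
)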