[ceph-users] Re: ceph octopus centos7, containers, cephadm
No clarity on this?

-----Original Message-----
To: ceph-users
Subject: [ceph-users] ceph octopus centos7, containers, cephadm

I am running Nautilus on CentOS 7. Does Octopus run the same as Nautilus, i.e.:

- runs on el7/centos7
- runs without containers by default
- runs without cephadm by default

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph octopus centos7, containers, cephadm
I'm not sure I understood the question. If you're asking if you can run
Octopus via RPMs on el7 without the cephadm and containers orchestration,
then the answer is yes.

-- dan

On Fri, Oct 23, 2020 at 9:47 AM Marc Roos wrote:
>
> No clarity on this?
>
> -----Original Message-----
> To: ceph-users
> Subject: [ceph-users] ceph octopus centos7, containers, cephadm
>
> I am running Nautilus on CentOS 7. Does Octopus run the same as Nautilus, i.e.:
>
> - runs on el7/centos7
> - runs without containers by default
> - runs without cephadm by default
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph octopus centos7, containers, cephadm
Hi!

Runs on el7: https://download.ceph.com/rpm-octopus/el7/x86_64/

Runs as usual without containers by default - if you use cephadm for
deployments then it will use containers. cephadm is one way to do
deployments; you can however deploy whichever way you want (manually etc).

--
David Majchrzak
CTO Oderland Webbhotell AB
Östra Hamngatan 50B, 411 09 Göteborg, SWEDEN

On 2020-10-23 at 09:47, Marc Roos wrote:

No clarity on this?

-----Original Message-----
To: ceph-users
Subject: [ceph-users] ceph octopus centos7, containers, cephadm

I am running Nautilus on CentOS 7. Does Octopus run the same as Nautilus, i.e.:

- runs on el7/centos7
- runs without containers by default
- runs without cephadm by default

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
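For reference, a plain RPM install on el7 needs nothing more than a repo file pointing at the URL above and a normal yum install. A minimal sketch only - package selection and GPG settings may need adjusting for your environment:

# /etc/yum.repos.d/ceph.repo -- sketch for a non-cephadm install
[ceph]
name=Ceph x86_64 packages
baseurl=https://download.ceph.com/rpm-octopus/el7/x86_64
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

[ceph-noarch]
name=Ceph noarch packages
baseurl=https://download.ceph.com/rpm-octopus/el7/noarch
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc

# then, on each node:
yum install ceph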
[ceph-users] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry
Hi,

non-round-robin entries with multiple mon host FQDNs are also broken.

Regards,
Burkhard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware needs for MDS for HPC/OpenStack workloads?
On 2020-10-22 14:34, Matthew Vernon wrote:
> Hi,
>
> We're considering the merits of enabling CephFS for our main Ceph
> cluster (which provides object storage for OpenStack), and one of the
> obvious questions is what sort of hardware we would need for the MDSs
> (and how many!).

Is it a many-parallel-large-writes workload without a lot of fs
manipulation (file creation / deletion, attribute updates)? You might
only need 2 for HA (active-standby). But when used as a regular fs with
many clients and a lot of small IO, then you might run out of the
performance of a single MDS. Add (many) more as you see fit. Keep in
mind it does make things a bit more complex (different ranks when more
than one active MDS) and that when you need to upgrade you have to
downscale that to 1. You can pin directories to a single MDS if you know
your workload well enough.

> These would be for our users' scientific workloads, so they would need
> to provide reasonably high performance. For reference, we have 3060 6TB
> OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.

It really depends on the workload. If there are a lot of file /
directory operations the MDS needs to keep track of all that and needs
to be able to cache as well (inodes / dnodes). The more files/dirs, the
more RAM you need. We don't have PBs of storage (but 39 TB for CephFS),
yet we have MDSes with 256 GB RAM of cache for all the little files and
many dirs we have. Prefer a few faster cores over many slower cores.

> The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> similar networking to the OSD nodes), but I'm not sure how much going
> beyond that is likely to be useful in production.

MDSes don't do a lot of traffic. Clients write directly to OSDs after
they have acquired capabilities (CAPS) from the MDS.

> I've also seen it suggested that an SSD-only pool is sensible for the
> CephFS metadata pool; how big is that likely to get?

Yes, but CephFS, like RGW (index), stores a lot of data in OMAP and the
RocksDB databases tend to get quite large, especially when storing many
small files and lots of dirs. So if that happens to be the workload,
make sure you have plenty of them. We once put all cephfs_metadata on 30
NVMe ... and that was not a good thing. Spread that data out over as
many SSDs / NVMes as you can. Do your HDDs have their WAL / DB on flash?
cephfs_metadata does not take up a lot of space, but Mimic does not
account for all occupied space as well as newer releases do. I guess
it's in the order of 5% of the CephFS size, but this might be wildly
different on other deployments.

> I'd be grateful for any pointers :)

I would buy a CPU with high clock speed and ~4-8 cores. RAM as needed,
but 32 GB will be the minimum I guess.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
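For reference, the directory pinning Stefan mentions is done with an extended attribute set on a directory as seen by a client; a minimal sketch (the mount path and rank are just examples):

# pin a directory tree to MDS rank 0
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/somedir
# setting -v -1 removes the pin again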
[ceph-users] Re: Strange USED size
Hi,

did you delete lots of objects recently? That operation is slow and ceph
takes some time to catch up. If the value is not decreasing, post again
with 'ceph osd df' output.

Regards,
Eugen

Zitat von Marcelo:

Hello. I've searched a lot but couldn't find why the size of the USED
column in the output of ceph df is many times bigger than the actual
size. I'm using Nautilus (14.2.8), and I've 1000 buckets with 100
objects in each bucket. Each object is around 10B.

ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       511 GiB     147 GiB     340 GiB      364 GiB         71.21
    TOTAL     511 GiB     147 GiB     340 GiB      364 GiB         71.21

POOLS:
    POOL                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    .rgw.root                     1     1.1 KiB           4     768 KiB         0        36 GiB
    default.rgw.control          11         0 B           8         0 B         0        36 GiB
    default.rgw.meta             12     449 KiB       2.00k     376 MiB      0.34        36 GiB
    default.rgw.log              13     3.4 KiB         207       6 MiB         0        36 GiB
    default.rgw.buckets.index    14         0 B       1.00k         0 B         0        36 GiB
    default.rgw.buckets.data     15     969 KiB        100k      18 GiB     14.52        36 GiB
    default.rgw.buckets.non-ec   16        27 B           1     192 KiB         0        36 GiB

Does anyone know what the maths behind this is, to show 18 GiB used when
I have something like 1 MiB?

Thanks in advance, Marcelo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
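One possible piece of the maths, offered only as a guess since it depends on the OSD settings: on Nautilus HDD OSDs the BlueStore min_alloc_size defaults to 64 KiB, so every tiny RADOS object still occupies a full allocation unit per replica. For the bucket data pool above that alone would come to roughly what ceph df reports:

100,000 objects x 64 KiB x 3 replicas = 19,200,000 KiB ~= 18.3 GiB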
[ceph-users] Re: Rados Crashing
Hi, I read that civetweb and radosgw have a locking issue in combination with ssl [1], just a thought based on failed to acquire lock on obj_delete_at_hint.79 Since Nautilus the default rgw frontend is beast, have you thought about switching? Regards, Eugen [1] https://tracker.ceph.com/issues/22951 Zitat von Brent Kennedy : We are performing file maintenance( deletes essentially ) and when the process gets to a certain point, all four rados gateways crash with the following: Log output: -5> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj verifying op params -4> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj pre-executing -3> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj executing -2> 2020-10-20 06:09:53.997 7f161758f700 10 monclient: get_auth_request con 0x55d2c02ff800 auth_method 0 -1> 2020-10-20 06:09:54.009 7f1609d74700 5 process_single_shard(): failed to acquire lock on obj_delete_at_hint.79 0> 2020-10-20 06:09:54.035 7f15f1543700 -1 *** Caught signal (Segmentation fault) ** in thread 7f15f1543700 thread_name:civetweb-worker ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable) 1: (()+0xf5d0) [0x7f161d3405d0] 2: (()+0x2bec80) [0x55d2bcd1fc80] 3: (std::string::assign(std::string const&)+0x2e) [0x55d2bcd2870e] 4: (rgw_bucket::operator=(rgw_bucket const&)+0x11) [0x55d2bce3e551] 5: (RGWObjManifest::obj_iterator::update_location()+0x184) [0x55d2bced7114] 6: (RGWObjManifest::obj_iterator::operator++()+0x263) [0x55d2bd092793] 7: (RGWRados::update_gc_chain(rgw_obj&, RGWObjManifest&, cls_rgw_obj_chain*)+0x51a) [0x55d2bd0939ea] 8: (RGWRados::Object::complete_atomic_modification()+0x83) [0x55d2bd093c63] 9: (RGWRados::Object::Delete::delete_obj()+0x74d) [0x55d2bd0a87ad] 10: (RGWDeleteObj::execute()+0x915) [0x55d2bd04b6d5] 11: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x915) [0x55d2bcdfbb35] 12: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)+0x1cd8) [0x55d2bcdfdea8] 13: (RGWCivetWebFrontend::process(mg_connection*)+0x38e) [0x55d2bcd41a1e] 14: (()+0x36bace) [0x55d2bcdccace] 15: (()+0x36d76f) [0x55d2bcdce76f] 16: (()+0x36dc18) [0x55d2bcdcec18] 17: (()+0x7dd5) [0x7f161d338dd5] 18: (clone()+0x6d) [0x7f161c84302d] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. My guess is that we need to add more resources to the gateways? They have 2 CPUs and 12GB of memory running as virtual machines on centOS 7.6 . Any thoughts? -Brent ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
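For what it's worth, switching the frontend is normally just a change of rgw_frontends in the radosgw section of ceph.conf followed by a restart of the gateways. A rough sketch only - the section name, port and certificate path are examples:

[client.rgw.gateway1]
rgw_frontends = beast port=7480
# with TLS, something like:
# rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.pem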
[ceph-users] Re: [EXTERNAL] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry
Jason/Wido, et al:

I was hitting this exact problem when attempting to update from 14.2.11
to 14.2.12. I reverted the two commits associated with that pull request
and was able to successfully upgrade to 14.2.12. Everything seems normal
now.

Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 804.240.2327
14291 Park Meadow Drive, Chantilly, VA 20151
perspecta

From: Jason Dillaman
Sent: Thursday, October 22, 2020 12:54 PM
To: Wido den Hollander
Cc: ceph-users@ceph.io
Subject: [EXTERNAL] [ceph-users] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

This backport [1] looks suspicious as it was introduced in v14.2.12 and
directly changes the initial MonMap code. If you revert it in a dev
build does it solve your problem?

[1] https://github.com/ceph/ceph/pull/36704

On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander wrote:
>
> Hi,
>
> I already submitted a ticket: https://tracker.ceph.com/issues/47951
>
> Maybe other people noticed this as well.
>
> Situation:
> - Cluster is running IPv6
> - mon_host is set to a DNS entry
> - DNS entry is a Round Robin with three AAAA-records
>
> root@wido-standard-benchmark:~# ceph -s
> unable to parse addrs in 'mon.objects.xx.xxx.net'
> [errno 22] error connecting to the cluster
> root@wido-standard-benchmark:~#
>
> The relevant part of the ceph.conf:
>
> [global]
> auth_client_required = cephx
> auth_cluster_required = cephx
> auth_service_required = cephx
> mon_host = mon.objects.xxx.xxx.xxx
> ms_bind_ipv6 = true
>
> This works fine with 14.2.11 and breaks under 14.2.12
>
> Anybody else seeing this as well?
>
> Wido
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD Failures after pg_num increase on one of the pools
Hi, do you see any peaks on the OSD nodes like OOM killer etc.? Instead of norecover flag I would try the nodown and noout flags to prevent flapping OSDs. What was the previous pg_num before you increased to 512? Regards, Eugen Zitat von Артём Григорьев : Hello everyone, I created a new ceph 14.2.7 Nautilus cluster recently. Cluster consists of 3 racks and 2 osd nodes on each rack, 12 new hdd in each node. HDD model is TOSHIBA MG07ACA14TE 14Tb. All data pools are ec pools. Yesterday I decided to increase pg number on one of the pools with command "ceph osd pool set photo.buckets.data pg_num 512", after that many osds started to crash with "out" and "down" status. I tried to increase recovery_sleep to 1s but osds still crashes. Osds started working properly only when i set "norecover" flag, but osd scrub errors appeared after that. In logs from osd during crashes i found this: --- Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f8af535d700 time 2020-10-21 15:12:11.460092 Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc: 648: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( aft er_progress.data_recovered_to - op.recovery_progress.data_recovered_to)) Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable) Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55fc694d6c0f] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x47) [0x55fc694d6dd7] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3: (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1740) [0x55fc698cafa0] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple, std::allocator > , boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional, std::allocator > >, RecoveryMessages*)+0x734) [0x55fc698cb804] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5: (OnRecoveryReadComplete::finish(std::pair&)+0x94) [0x55fc698ebbe4] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) [0x55fc698bfdcc] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8: (ECBackend::_handle_message(boost::intrusive_ptr)+0x17f) [0x55fc698d718f] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9: (PGBackend::handle_message(boost::intrusive_ptr)+0x4a) [0x55fc697c18ea] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10: (PrimaryLogPG::do_request(boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11: (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, 
ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62) [0x55fc698415c2] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55fc695cebbf] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55fc69b6f976] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65) [0x7f8b1ddede65] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d) [0x7f8b1ccb188d] Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught signal (Aborted) ** Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: in thread 7f8af535d700 thread_name:tp_osd_tp --- Current ec profile and pool info bellow: # ceph osd erasure-code-profile get EC42 crush-device-class=hdd crush-failure-domain=host crush-root=main jerasure-per-chunk-alignment=false k=4 m=2 plugin=jerasure technique=reed_sol_van w=8 pool 25 'photo.buckets.data' erasure size 6 min_size 4 crush_rule 6 object_hash rjenkins pg_num 512 pgp_num 280 pgp_num_target 512 autoscale_mode warn last_change 43418 lfor 0/0/42223 flags hashpspool stripe_width 1048576 application
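For completeness, the flags Eugen suggests are set and cleared like this:

ceph osd set nodown
ceph osd set noout
# ... and once the OSDs are stable again:
ceph osd unset nodown
ceph osd unset noout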
[ceph-users] Re: Ceph Octopus and Snapshot Schedules
Care to provide any more detail?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed
Hi,

your mail is formatted in a way that makes it impossible to get all the
information, so a number of questions first:

- Are the mons up, and are they in quorum? You cannot change mon IP
  addresses without also adjusting them in the mon map. Use the daemon
  socket on the systems to query the current state of the mons (see the
  example after this message).

- The osd systemd output is useless for debugging; it only states that
  the osd is not running and is not able to start. The real log files
  are located in /var/log/ceph/.

If the mons are in quorum, you should find more information there. Keep
in mind that you also need to change ceph.conf on the OSD hosts if you
change the mon IP addresses, otherwise the OSDs won't be able to find
the mons and the processes will die.

And I do not understand how corosync should affect your ceph cluster.
Ceph does not use corosync...

If you need fast help I can recommend the ceph irc channel ;-)

Regards,
Burkhard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
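A minimal example of the daemon-socket query mentioned above (run on each mon host, substituting the mon id):

ceph daemon mon.<id> mon_status
# or, equivalently, straight against the admin socket:
ceph --admin-daemon /var/run/ceph/ceph-mon.<id>.asok mon_status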
[ceph-users] Re: multiple OSD crash, unfound objects
Hi Michael. > I still don't see any traffic to the pool, though I'm also unsure how much > traffic is to be expected. Probably not much. If ceph df shows that the pool contains some objects, I guess that's sorted. That osdmaptool crashes indicates that your cluster runs with corrupted internal data. I tested your crush map and you should get complete PGs for the fs data pool. That you don't and that osdmaptool crashes points at a corruption of internal data. I'm afraid this is the point where you need support from ceph developers and should file a tracker report (https://tracker.ceph.com/projects/ceph/issues). A short description of the origin of the situation with the osdmaptool output and a reference to this thread linked in should be sufficient. Please post a link to the ticket here. In parallel, you should probably open a new thread focussed on the osd map corruption. Maybe there are low-level commands to repair it. You should wait with trying to clean up the unfound objects until this is resolved. Not sure about adding further storage either. To me, this sounds quite serious. Best regards and good luck! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
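If it helps with the tracker report, the crashing osdmaptool run can usually be reproduced by the developers from the extracted osdmap, e.g.:

ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pgs --pool <id-of-the-fs-data-pool>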
[ceph-users] Re: ceph octopus centos7, containers, cephadm
Yes, that was it. I see so many messages here about these, I was
wondering if it was a default.

-----Original Message-----
Cc: ceph-users
Subject: Re: [ceph-users] Re: ceph octopus centos7, containers, cephadm

I'm not sure I understood the question. If you're asking if you can run
Octopus via RPMs on el7 without the cephadm and containers orchestration,
then the answer is yes.

-- dan

On Fri, Oct 23, 2020 at 9:47 AM Marc Roos wrote:
>
> No clarity on this?
>
> -----Original Message-----
> To: ceph-users
> Subject: [ceph-users] ceph octopus centos7, containers, cephadm
>
> I am running Nautilus on CentOS 7. Does Octopus run the same as Nautilus, i.e.:
>
> - runs on el7/centos7
> - runs without containers by default
> - runs without cephadm by default
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed
Hi, On 10/23/20 2:22 PM, Gerhard W. Recher wrote: This is a proxmox cluster ... sorry for formating problems of my post :( short plot, we messed with ip addr. change of public network, so monitors went down. *snipsnap* so howto recover from this disaster ? # ceph -s cluster: id: 92d063d7-647c-44b8-95d7-86057ee0ab22 health: HEALTH_WARN 1 daemons have recently crashed OSD count 0 < osd_pool_default_size 3 services: mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h) mgr: pve01(active, since 19h) osd: 0 osds: 0 up, 0 in data: pools: 0 pools, 0 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: Are you sure that the existing mons have been restarted? If the mon database is still present, the status output should contain at least the pool and osd information. But those numbers are zero... Please check the local osd logs for the actual reason of the failed restart. Regards, Burkhard ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph Octopus
Hi Eugen, I did the same step specified but OSD is not updated cluster address. On Tue, Oct 20, 2020 at 2:52 PM Eugen Block wrote: > > I wonder if this would be impactful, even if `nodown` were set. > > When a given OSD latches onto > > the new replication network, I would expect it to want to use it for > > heartbeats — but when > > its heartbeat peers aren’t using the replication network yet, they > > won’t be reachable. > > I also expected at least some sort of impact, I just tested it in a > virtual lab environment. But besides the temporary "down" OSDs during > container restart the cluster was always responsive (although there's > no client traffic). I didn't even set "nodown". But all OSDs now have > a new backend address and the cluster seems to be happy. > > Regards, > Eugen > > > Zitat von Anthony D'Atri : > > > I wonder if this would be impactful, even if `nodown` were set. > > When a given OSD latches onto > > the new replication network, I would expect it to want to use it for > > heartbeats — but when > > its heartbeat peers aren’t using the replication network yet, they > > won’t be reachable. > > > > Unless something has changed since I tried this with Luminous. > > > >> On Oct 20, 2020, at 12:47 AM, Eugen Block wrote: > >> > >> Hi, > >> > >> a quick search [1] shows this: > >> > >> ---snip--- > >> # set new config > >> ceph config set global cluster_network 192.168.1.0/24 > >> > >> # let orchestrator reconfigure the daemons > >> ceph orch daemon reconfig mon.host1 > >> ceph orch daemon reconfig mon.host2 > >> ceph orch daemon reconfig mon.host3 > >> ceph orch daemon reconfig osd.1 > >> ceph orch daemon reconfig osd.2 > >> ceph orch daemon reconfig osd.3 > >> ---snip--- > >> > >> I haven't tried it myself though. > >> > >> Regards, > >> Eugen > >> > >> [1] > >> > https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm > >> > >> > >> Zitat von Amudhan P : > >> > >>> Hi, > >>> > >>> I have installed Ceph Octopus cluster using cephadm with a single > network > >>> now I want to add a second network and configure it as a cluster > address. > >>> > >>> How do I configure ceph to use second Network as cluster network?. > >>> > >>> Amudhan > >>> ___ > >>> ceph-users mailing list -- ceph-users@ceph.io > >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Large map object found
Perfect -- many thanks Dominic! I haven't found a doc which notes the --num-shards needs to be a power of two. It isn't I don't believe you -- just haven't seen that anywhere. peter Peter Eisch Senior Site Reliability Engineer T1.612.445.5135 virginpulse.com Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland | United Kingdom | USA Confidentiality Notice: The information contained in this e-mail, including any attachment(s), is intended solely for use by the designated recipient(s). Unauthorized use, dissemination, distribution, or reproduction of this message by anyone other than the intended recipient(s), or a person designated as responsible for delivering such messages to the intended recipient, is strictly prohibited and may be unlawful. This e-mail may contain proprietary, confidential or privileged information. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Virgin Pulse, Inc. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message. v2.66 On 10/22/20, 10:24 AM, "dhils...@performair.com" wrote: Peter; I believe shard counts should be powers of two. Also, resharding makes the buckets unavailable, but occurs very quickly. As such it is not done in the background, but in the foreground, for a manual reshard. Notice the statement: "reshard of bucket from to completed successfully." It's done. The warning notice won't go away until a scrub is completed to determine that a large OMAP object no longer exists. Thank you, Dominic L. Hilsbos, MBA Director – Information Technology Perform Air International Inc. dhils...@performair.com https://nam02.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.performair.com%2F&data=04%7C01%7Cpeter.eisch%40virginpulse.com%7C14386968705f4571e9a008d8769e9a16%7Cb123a16e892b4cf6a55a6f8c7606a035%7C0%7C0%7C637389770850660951%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=AUPmwxIgzRhmwpg2MM6b%2FzpPyR84%2F92OFsW9UrKw%2Fes%3D&reserved=0 From: Peter Eisch [mailto:peter.ei...@virginpulse.com] Sent: Thursday, October 22, 2020 8:04 AM To: Dominic Hilsbos; ceph-users@ceph.io Subject: Re: Large map object found Thank you! This was helpful. 
I opted for a manual reshard: [root@cephmon-s03 ~]# radosgw-admin bucket reshard --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3 tenant: d2ff913f5b6542cda307c9cd6a95a214 bucket name: backups_sql_dswhseloadrepl_segments old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51 new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1 total entries: 1000 2000 3000 3228 2020-10-22 08:40:26.353 7fb197fc66c0 1 execute INFO: reshard of bucket "backups_sql_dswhseloadrepl_segments" from "d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51" to "d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1" completed successfully [root@cephmon-s03 ~]# radosgw-admin buckets reshard list [] [root@cephmon-s03 ~]# radosgw-admin buckets reshard status --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments [ { "reshard_status": "not-resharding", "new_bucket_instance_id": "", "num_shards": -1 }, { "reshard_status": "not-resharding", "new_bucket_instance_id": "", "num_shards": -1 }, { "reshard_status": "not-resharding", "new_bucket_instance_id": "", "num_shards": -1 } ] [root@cephmon-s03 ~]# This kicked of an autoscale event. Would the reshard presumably start after the autoscaling is complete? peter Peter Eisch Senior Site Reliability Engineer T 1.612.445.5135 virginpulse.com Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland | United Kingdom | USA Confidentiality Notice: The information contained in this e-mail, including any attachment(s), is intended solely for use by the designated recipient(s). Unauthorized use, dissemination, distribution, or reproduction of this message by anyone other than the intended recipient(s), or a person designated as responsible for delivering such messages to the intended recipient, is strictly prohibited and may be unlawful. This e-mail may contain proprietary, confidential or privileged information. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Virgin Pulse, Inc. If you have received this message in error, or are not the named recipient(s), please immediately notify the sender and delete this e-mail message. v2.66 On 10/21/20, 3:19 P
[ceph-users] Re: Ceph Octopus
Did you restart the OSD containers? Does ceph config show your changes? ceph config get mon cluster_network ceph config get mon public_network Zitat von Amudhan P : Hi Eugen, I did the same step specified but OSD is not updated cluster address. On Tue, Oct 20, 2020 at 2:52 PM Eugen Block wrote: > I wonder if this would be impactful, even if `nodown` were set. > When a given OSD latches onto > the new replication network, I would expect it to want to use it for > heartbeats — but when > its heartbeat peers aren’t using the replication network yet, they > won’t be reachable. I also expected at least some sort of impact, I just tested it in a virtual lab environment. But besides the temporary "down" OSDs during container restart the cluster was always responsive (although there's no client traffic). I didn't even set "nodown". But all OSDs now have a new backend address and the cluster seems to be happy. Regards, Eugen Zitat von Anthony D'Atri : > I wonder if this would be impactful, even if `nodown` were set. > When a given OSD latches onto > the new replication network, I would expect it to want to use it for > heartbeats — but when > its heartbeat peers aren’t using the replication network yet, they > won’t be reachable. > > Unless something has changed since I tried this with Luminous. > >> On Oct 20, 2020, at 12:47 AM, Eugen Block wrote: >> >> Hi, >> >> a quick search [1] shows this: >> >> ---snip--- >> # set new config >> ceph config set global cluster_network 192.168.1.0/24 >> >> # let orchestrator reconfigure the daemons >> ceph orch daemon reconfig mon.host1 >> ceph orch daemon reconfig mon.host2 >> ceph orch daemon reconfig mon.host3 >> ceph orch daemon reconfig osd.1 >> ceph orch daemon reconfig osd.2 >> ceph orch daemon reconfig osd.3 >> ---snip--- >> >> I haven't tried it myself though. >> >> Regards, >> Eugen >> >> [1] >> https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm >> >> >> Zitat von Amudhan P : >> >>> Hi, >>> >>> I have installed Ceph Octopus cluster using cephadm with a single network >>> now I want to add a second network and configure it as a cluster address. >>> >>> How do I configure ceph to use second Network as cluster network?. >>> >>> Amudhan >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io >>> To unsubscribe send an email to ceph-users-le...@ceph.io >> >> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware needs for MDS for HPC/OpenStack workloads?
Regarding MDS pinning, we have our home directories split into u{0..9} for legacy reasons, and while adding more MDS' helped a little, pinning certain u? to certain MDS' helped greatly. The automatic migration between MDS' killed performance. This is an unusually perfect workload for pinning, as we have 10 practically identical directories, but still. On Fri, Oct 23, 2020 at 2:04 AM Stefan Kooman wrote: > > On 2020-10-22 14:34, Matthew Vernon wrote: > > Hi, > > > > We're considering the merits of enabling CephFS for our main Ceph > > cluster (which provides object storage for OpenStack), and one of the > > obvious questions is what sort of hardware we would need for the MDSs > > (and how many!). > > Is it a many parallel large writes workload without a lot fs > manipulation (file creation / deletion, attribute updates? You might > only need 2 for HA (active-standby). But when used as a regular fs with > many clients and a lot of small IO, than you might run out of the > performance of a single MDS. Add (many) more as you see fit. Keep in > mind it does make things a bit more complex (different ranks when more > than one active MDS) and that when you need to upgrade you have to > downscale that to 1. You can pin directories to a single MDS if you know > your workload well enough. > > > > > These would be for our users scientific workloads, so they would need to > > provide reasonably high performance. For reference, we have 3060 6TB > > OSDs across 51 OSD hosts, and 6 dedicated RGW nodes. > > It really depend on the workload. If there are a lot of file / directory > operations the MDS needs to keep track of all that and needs to be able > to cache as well (inodes / dnodes). The more files/dirs, the more RAM > you need. We don't have PB of storage (but 39 TB for CephFS) but have > MDSes with 256 GB RAM for cache for all the little files and many dirs > we have. Prefer a few faster cores above many slower cores. > > > > > > The minimum specs are very modest (2-3GB RAM, a tiny amount of disk, > > similar networking to the OSD nodes), but I'm not sure how much going > > beyond that is likely to be useful in production. > > MDSes don't do a lot of traffic. Clients write directly to OSDs after > they have acquired capabilities (CAPS) from MDS. > > > > > I've also seen it suggested that an SSD-only pool is sensible for the > > CephFS metadata pool; how big is that likely to get? > > Yes, but CephFS, like RGW (index), stores a lot of data in OMAP and the > RocksDB databases tend to get quite large. Especially when storing many > small files and lots of dirs. So if that happens to be the workload, > make sure you have plenty of them. We once put all cephfs_metadata on 30 > NVMe ... and that was not a good thing. Spread that data out over as > many SSD / NVMe as you can. Do your HDDs have their WAL / DB on flash? > Cephfs_metadaa does not take up a lot of space, but Mimic does not have > as good administration on all space occupied as newer releases. But I > guess it's in the order of 5% of CephFS size. But again, this might be > wildly different on other deployments. > > > > > I'd be grateful for any pointers :) > > I would buy a CPU with high clock speed and ~ 4 -8 cores. RAM as needed, > but 32 GB will be minimum I guess. > > Gr. Stefan > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] TOO_FEW_PGS warning and pg_autoscale
Hi,

# ceph health detail
HEALTH_WARN too few PGs per OSD (24 < min 30)
TOO_FEW_PGS too few PGs per OSD (24 < min 30)

ceph version 14.2.9

This warning popped up when autoscale shrunk a pool's pg_num and pgp_num
from 512 to 256 on its own. The hdd35 storage is only used by this pool.
I have three different storage classes and the pools use the different
classes as appropriate.

How can I convert the warning into something useful which then helps me
make the appropriate change to the right class of storage? I'm guessing
it's referring to hdd35.

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd25     129 TiB      83 TiB      46 TiB       46 TiB         35.87
    hdd35     269 TiB     220 TiB      49 TiB       49 TiB         18.12
    ssd       256 TiB     164 TiB      92 TiB       92 TiB         35.84
    TOTAL     655 TiB     468 TiB     186 TiB      187 TiB         28.56

If I follow:
https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-few-pgs
which then links to:
https://docs.ceph.com/en/latest/rados/operations/placement-groups/#choosing-number-of-placement-groups
the math there would want the pool to have a pg/pgp_num of 2048 -- where
autoscale just recently shrunk the count. Which is more right?

Thanks!

peter

Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
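As far as I understand it (this is a sketch of the check, not authoritative), the warning is simple arithmetic: PG replicas per OSD = sum over pools of (pg_num x replica or EC size) divided by the number of "in" OSDs, compared against mon_pg_warn_min_per_osd (default 30). A purely illustrative example with made-up numbers:

256 PGs x 3 copies / 32 OSDs = 24 PG replicas per OSD  ->  "24 < min 30"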
[ceph-users] Re: Ceph Octopus
Hi Eugen, ceph config output shows set network address. I have not restarted containers directly I was trying the command `ceph orch restart osd.46` I think that was a problem now after running `ceph orch daemon restart osd.46` it's showing changes in dashboard. Thanks. On Fri, Oct 23, 2020 at 6:14 PM Eugen Block wrote: > Did you restart the OSD containers? Does ceph config show your changes? > > ceph config get mon cluster_network > ceph config get mon public_network > > > > Zitat von Amudhan P : > > > Hi Eugen, > > > > I did the same step specified but OSD is not updated cluster address. > > > > > > On Tue, Oct 20, 2020 at 2:52 PM Eugen Block wrote: > > > >> > I wonder if this would be impactful, even if `nodown` were set. > >> > When a given OSD latches onto > >> > the new replication network, I would expect it to want to use it for > >> > heartbeats — but when > >> > its heartbeat peers aren’t using the replication network yet, they > >> > won’t be reachable. > >> > >> I also expected at least some sort of impact, I just tested it in a > >> virtual lab environment. But besides the temporary "down" OSDs during > >> container restart the cluster was always responsive (although there's > >> no client traffic). I didn't even set "nodown". But all OSDs now have > >> a new backend address and the cluster seems to be happy. > >> > >> Regards, > >> Eugen > >> > >> > >> Zitat von Anthony D'Atri : > >> > >> > I wonder if this would be impactful, even if `nodown` were set. > >> > When a given OSD latches onto > >> > the new replication network, I would expect it to want to use it for > >> > heartbeats — but when > >> > its heartbeat peers aren’t using the replication network yet, they > >> > won’t be reachable. > >> > > >> > Unless something has changed since I tried this with Luminous. > >> > > >> >> On Oct 20, 2020, at 12:47 AM, Eugen Block wrote: > >> >> > >> >> Hi, > >> >> > >> >> a quick search [1] shows this: > >> >> > >> >> ---snip--- > >> >> # set new config > >> >> ceph config set global cluster_network 192.168.1.0/24 > >> >> > >> >> # let orchestrator reconfigure the daemons > >> >> ceph orch daemon reconfig mon.host1 > >> >> ceph orch daemon reconfig mon.host2 > >> >> ceph orch daemon reconfig mon.host3 > >> >> ceph orch daemon reconfig osd.1 > >> >> ceph orch daemon reconfig osd.2 > >> >> ceph orch daemon reconfig osd.3 > >> >> ---snip--- > >> >> > >> >> I haven't tried it myself though. > >> >> > >> >> Regards, > >> >> Eugen > >> >> > >> >> [1] > >> >> > >> > https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm > >> >> > >> >> > >> >> Zitat von Amudhan P : > >> >> > >> >>> Hi, > >> >>> > >> >>> I have installed Ceph Octopus cluster using cephadm with a single > >> network > >> >>> now I want to add a second network and configure it as a cluster > >> address. > >> >>> > >> >>> How do I configure ceph to use second Network as cluster network?. > >> >>> > >> >>> Amudhan > >> >>> ___ > >> >>> ceph-users mailing list -- ceph-users@ceph.io > >> >>> To unsubscribe send an email to ceph-users-le...@ceph.io > >> >> > >> >> > >> >> ___ > >> >> ceph-users mailing list -- ceph-users@ceph.io > >> >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > >> > >> ___ > >> ceph-users mailing list -- ceph-users@ceph.io > >> To unsubscribe send an email to ceph-users-le...@ceph.io > >> > > > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
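One quick way to verify that a restarted OSD really picked up the new cluster network (the OSD id is just an example):

ceph osd metadata 46 | grep -E '"back_addr"|"front_addr"'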
[ceph-users] Re: Hardware for new OSD nodes.
Hi Anthony,

On 22/10/20 at 18:34, Anthony D'Atri wrote:

Yeah, didn't think about a RAID10 really, although there wouldn't be
enough space for 8x300GB = 2400GB WAL/DBs.

300 is overkill for many applications anyway.

Yes, but he has spillover with 1600GB/12 per WAL/DB. It seems he can
make use of those 300GB.

Also, using a RAID10 for WAL/DBs will:
- make OSDs less movable between hosts (they'd have to be moved all
  together)
- with 2 OSD per NVMe you can move them around in pairs

Why would you want to move them between hosts? I think the usual case
is a server failure, so that won't be a problem.

With small clusters (like ours) you may want to reorganize OSDs to a new
server (let's say, move one OSD of each server to the new server). But
this is an uncommon corner case, I agree :)

Cheers

--
Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware for new OSD nodes.
Hi Brian,

On 22/10/20 at 18:41, Brian Topping wrote:

On Oct 22, 2020, at 10:34 AM, Anthony D'Atri wrote:

- You must really be sure your raid card is dependable. (sorry but I
  have seen so much management problems with top-tier RAID cards I avoid
  them like the plague).

This. I'd definitely avoid a RAID card. If I can do advanced encryption
with an MMX instruction, I think I can certainly trust IOMMU to handle
device multiplexing from software in an efficient manner, no? mdadm RAID
is just fine for me and is reliably bootable from GRUB. I'm not an
expert in driver mechanics, but mirroring should be very low overhead at
the software level. Once it's software RAID, moving disks between
chassis is a simple process as well. Apologies I didn't make that clear
earlier...

Yes, I really like mdraid :) . The problem is that the BIOS/UEFI has to
find a working bootable disk. I think some BIOS/UEFIs have settings for
a secondary boot/UEFI boot file, but that would have to be prepared and
maintained manually, outside of the mdraid10, and would only work with a
total failure of the primary disk.

Cheers

--
Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Hardware for new OSD nodes.
Hi Dave,

On 22/10/20 at 19:43, Dave Hall wrote:

On 22/10/20 at 16:48, Dave Hall wrote:

(BTW, Nautilus 14.2.7 on Debian, non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a
couple of questions about hardware choices. Our original nodes were 8 x
12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc. We chose
the NVMe card for performance since it has an 8-lane PCIe interface.
However, we're currently seeing BlueFS spillovers.

The Tyan chassis we are considering has the option of 4 x U.2 NVMe bays
- each with 4 PCIe lanes (and 8 SAS bays). It has occurred to me that I
might stripe 4 1TB NVMe drives together to get much more space for
WAL/DB and a net performance of 16 PCIe lanes. Any thoughts on this
approach?

Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use 1
NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for each OSD
(see related threads on this mailing list about why that exact size).
This way if an NVMe fails, you'll only lose 2 OSDs.

I was under the impression that everything that BlueStore puts on the
SSD/NVMe could be reconstructed from information on the OSD. Am I
mistaken about this? If so, my single 1.6TB NVMe card is equally
vulnerable.

I don't think so, that info only exists on that partition, as was the
case with the filestore journal. Your single 1.6TB NVMe is vulnerable,
yes.

Also, what size of WAL/DB partitions do you have now, and what spillover
size?

I recently posted another question to the list on this topic, since I
now have spillover on 7 of 24 OSDs. Since the data layout on the NVMe
for BlueStore is not traditional I've never quite figured out how to get
this information. The current partition size is 1.6TB / 12 since we had
the possibility to add four more drives to each node. How that was
divided between WAL, DB, etc. is something I'd like to be able to
understand. However, we're not going to add the extra 4 drives, so
expanding the LVM partitions is now a possibility.

Can you paste the warning message? It shows the spillover size. What
size are the partitions on the NVMe disk (lsblk)?

Cheers

--
Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
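Regarding "how to get this information": the per-OSD BlueFS usage (and hence how much of the DB has spilled over to the slow device) is visible in the OSD's perf counters, e.g. run on the OSD host:

ceph daemon osd.0 perf dump bluefs | grep -E '(db|wal|slow)_(total|used)_bytes'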
[ceph-users] OSD down, how to reconstruct it from its main and block.db parts ?
Dear all, after breaking my experimental 1-host Ceph cluster and making one its pg 'incomplete' I left it in abandoned state for some time. Now I decided to bring it back into life and found that it can not start one of its OSDs (osd.1 to name it) "ceph osd df" shows : ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS 0hdd0 1.0 2.7 TiB 1.6 TiB 1.6 TiB 113 MiB 4.7 GiB 1.1 TiB 59.77 0.69 102 up 1hdd 2.84549 0 0 B 0 B 0 B 0 B 0 B 0 B 0 00down 2hdd 2.84549 1.0 2.8 TiB 2.6 TiB 2.5 TiB 57 MiB 3.8 GiB 275 GiB 90.58 1.05 176 up 3hdd 2.84549 1.0 2.8 TiB 2.6 TiB 2.5 TiB 57 MiB 3.9 GiB 271 GiB 90.69 1.05 185 up 4hdd 2.84549 1.0 2.8 TiB 2.6 TiB 2.5 TiB 63 MiB 4.2 GiB 263 GiB 90.98 1.05 184 up 5hdd 2.84549 1.0 2.8 TiB 2.6 TiB 2.5 TiB 52 MiB 3.8 GiB 263 GiB 90.96 1.05 178 up 6hdd 2.53400 1.0 2.5 TiB 2.3 TiB 2.3 TiB 173 MiB 5.2 GiB 228 GiB 91.21 1.05 178 up 7hdd 2.53400 1.0 2.5 TiB 2.3 TiB 2.3 TiB 147 MiB 5.2 GiB 230 GiB 91.12 1.05 168 up TOTAL 19 TiB 17 TiB 16 TiB 662 MiB 31 GiB 2.6 TiB 86.48 MIN/MAX VAR: 0.69/1.05 STDDEV: 10.90 "ceph device ls" shows : DEVICE HOST:DEV DAEMONS LIFE EXPECTANCY GIGABYTE_GP-ASACNE2100TTTDR_SN191108950380 p10s:nvme0n1 osd.1 osd.2 osd.3 osd.4 osd.5 WDC_WD30EFRX-68N32N0_WD-WCC7K1JJXVSTp10s:sdd osd.1 WDC_WD30EFRX-68N32N0_WD-WCC7K1VUYPRAp10s:sda osd.6 WDC_WD30EFRX-68N32N0_WD-WCC7K2CKX8NTp10s:sdb osd.7 WDC_WD30EFRX-68N32N0_WD-WCC7K2UD8H74p10s:sde osd.2 WDC_WD30EFRX-68N32N0_WD-WCC7K2VFTR1Fp10s:sdh osd.5 WDC_WD30EFRX-68N32N0_WD-WCC7K3CYKL87p10s:sdf osd.3 WDC_WD30EFRX-68N32N0_WD-WCC7K6FPZAJPp10s:sdc osd.0 WDC_WD30EFRX-68N32N0_WD-WCC7K7FXSCRNp10s:sdg osd.4 In my last migration, I created a bluestore volume with external block.db like this : "ceph-volume lvm prepare --bluestore --data /dev/sdd1 --block.db /dev/nvme0n1p4" And I can see this metadata by "ceph-bluestore-tool show-label --dev /dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202" : { "/dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202": { "osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202", "size": 3000588304384, "btime": "2020-07-12T11:34:16.579735+0300", "description": "main", "bfm_blocks": "45785344", "bfm_blocks_per_key": "128", "bfm_bytes_per_block": "65536", "bfm_size": "3000588304384", "bluefs": "1", "ceph_fsid": "49cdfe90-6f6e-4afe-8558-bf14a13aadfa", "kv_backend": "rocksdb", "magic": "ceph osd volume v026", "mkfs_done": "yes", "osd_key": "AQD9ygpf+7+MABAAqtj4y1YYgxwCaAN/jgDSwg==", "ready": "ready", "require_osd_release": "14", "whoami": "1" } } and by "ceph-bluestore-tool show-label --dev /dev/nvme0n1p4" : { "/dev/nvme0n1p4": { "osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202", "size": 128025886720, "btime": "2020-07-12T11:34:16.592054+0300", "description": "bluefs db" } } As you see, their osd_uuid is equal. But when I try to start it by hand : "systemctl restart ceph-osd@1" , I get this in the logs : ("journalctl -b -u ceph-osd@1") -- Logs begin at Tue 2020-10-13 19:09:49 EEST, end at Fri 2020-10-23 16:59:38 EEST. -- жов 23 16:59:36 p10s systemd[1]: Starting Ceph object storage daemon osd.1... жов 23 16:59:36 p10s systemd[1]: Started Ceph object storage daemon osd.1. 
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at /var/lib/ceph/osd/ceph-1/keyring, disabling cephx жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file or directory жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No such file o
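Judging only from the "no keyring found at /var/lib/ceph/osd/ceph-1/keyring" lines, the tmpfs for osd.1 looks like it was never repopulated after boot; with ceph-volume LVM OSDs that is normally done by activation, so something along these lines may be worth trying (id and fsid taken from the show-label output above):

ceph-volume lvm activate 1 8c6324a3-0364-4fad-9dcb-81a1661ee202
# or simply:
ceph-volume lvm activate --all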
[ceph-users] Re: Strange USED size
10B as in ten bytes? By chance have you run `rados bench` ? Sometimes a run is interrupted or one forgets to clean up and there are a bunch of orphaned RADOS objects taking up space, though I’d think `ceph df` would reflect that. Is your buckets.data pool replicated or EC? > On Oct 22, 2020, at 7:35 AM, Marcelo wrote: > > Hello. I've searched a lot but couldn't find why the size of USED column in > the output of ceph df is a lot times bigger than the actual size. I'm using > Nautilus (14.2.8), and I've 1000 buckets with 100 objectsineach bucket. > Each object is around 10B. > > ceph df > RAW STORAGE: >CLASS SIZEAVAIL USEDRAW USED %RAW USED >hdd 511 GiB 147 GiB 340 GiB 364 GiB 71.21 >TOTAL 511 GiB 147 GiB 340 GiB 364 GiB 71.21 > > POOLS: >POOL ID STORED OBJECTS > USED%USED MAX AVAIL >.rgw.root 1 1.1 KiB 4 768 > KiB 036 GiB >default.rgw.control11 0 B 8 0 > B 036 GiB >default.rgw.meta 12 449 KiB 2.00k 376 > MiB 0.3436 GiB >default.rgw.log13 3.4 KiB 207 6 > MiB 036 GiB >default.rgw.buckets.index 14 0 B 1.00k 0 > B 036 GiB >default.rgw.buckets.data 15 969 KiB100k 18 > GiB 14.5236 GiB >default.rgw.buckets.non-ec 1627 B 1 192 > KiB 036 GiB > > Does anyone know what are the maths behind this, to show 18GiB used when I > have something like 1 MiB? > > Thanks in advance, Marcelo. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
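If leftover rados bench objects do turn out to be the culprit, they can usually be removed with the cleanup subcommand, e.g.:

rados -p default.rgw.buckets.data cleanup --prefix benchmark_data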
[ceph-users] desaster recovery Ceph Storage , urgent help needed
Hi I have a worst case, osd's in a 3 node cluster each 4 nvme's won't start we had a ip config change in public network, and mon's died so we managed mon's to come back with new ip's. corosync on 2 rings is fine, all 3 mon's are up osd's won't start how to get back to the pool, already 3vm's are configured and valuable data would be lost... this is like a scenario when all systemdisks on each 3 nodes failed, but osd disks are healthy ... any help to reconstruct this storage is highly appreciated! Gerhard |root@pve01:/var/log# systemctl status ceph-osd@0.service.service ● ceph-osd@0.service.service - Ceph object storage daemon osd.0.service Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled) Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d └─ceph-after-pve-cluster.conf Active: failed (Result: exit-code) since Thu 2020-10-22 00:30:09 CEST; 37min ago Process: 31402 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0.service (code=exited, status=1/FAILURE) Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Service RestartSec=100ms expired, scheduling restart. Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Scheduled restart job, restart counter is at 3. Oct 22 00:30:09 pve01 systemd[1]: Stopped Ceph object storage daemon osd.0.service. Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Start request repeated too quickly. Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Failed with result 'exit-code'. Oct 22 00:30:09 pve01 systemd[1]: Failed to start Ceph object storage daemon osd.0.service. | ||ceph mon dump dumped monmap epoch 3 epoch 3 fsid 92d063d7-647c-44b8-95d7-86057ee0ab22 last_changed 2020-10-21 23:31:50.584796 created 2020-10-21 21:00:54.077449 min_mon_release 14 (nautilus) 0: [v2:10.100.200.141:3300/0,v1:10.100.200.141:6789/0] mon.pve01 1: [v2:10.100.200.142:3300/0,v1:10.100.200.142:6789/0] mon.pve02 2: [v2:10.100.200.143:3300/0,v1:10.100.200.143:6789/0] mon.pve03 || |||Networks: auto lo iface lo inet loopback auto eno1np0 iface eno1np0 inet static address 10.110.200.131/24 mtu 9000 #corosync1 10GB auto eno2np1 iface eno2np1 inet static address 10.111.200.131/24 mtu 9000 #Corosync2 10GB iface enp69s0f0 inet manual mtu 9000 auto enp69s0f1 iface enp69s0f1 inet static address 10.112.200.131/24 mtu 9000 #Cluster private 100GB auto vmbr0 iface vmbr0 inet static address 10.100.200.141/24 gateway 10.100.200.1 bridge-ports enp69s0f0 bridge-stp off bridge-fd 0 mtu 9000 #Cluster public 100GB === ||| ceph.conf [global] auth_client_required = cephx auth_cluster_required = cephx auth_service_required = cephx cluster_network = 10.112.200.0/24 fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22 mon_allow_pool_delete = true mon_host = 10.100.200.141 10.100.200.142 10.100.200.143 osd_pool_default_min_size = 2 osd_pool_default_size = 3 public_network = 10.100.200.0/24 [client] keyring = /etc/pve/priv/$cluster.$name.keyring [mon.pve01] public_addr = 10.100.200.141 [mon.pve02] public_addr = 10.100.200.142 [mon.pve03] public_addr = 10.100.200.143 |ceph -s cluster: id: 92d063d7-647c-44b8-95d7-86057ee0ab22 health: HEALTH_WARN 1 daemons have recently crashed OSD count 0 < osd_pool_default_size 3 services: mon: 3 daemons, quorum pve01,pve02,pve03 (age 63m) mgr: pve01(active, since 64m) osd: 0 osds: 0 up, 0 in data: pools: 0 pools, 0 pgs objects: 0 objects, 0 B usage: 0 B used, 0 B / 0 B avail pgs: df -h Filesystem Size Used Avail Use% Mounted on udev 252G 0 252G 0% /dev tmpfs 51G 11M 51G 1% /run 
rpool/ROOT/pve-1 229G 16G 214G 7% / tmpfs 252G 63M 252G 1% /dev/shm tmpfs 5.0M 0 5.0M 0% /run/lock tmpfs 252G 0 252G 0% /sys/fs/cgroup rpool 214G 128K 214G 1% /rpool rpool/data 214G 128K 214G 1% /rpool/data rpool/ROOT 214G 128K 214G 1% /rpool/ROOT tmpfs 252G 24K 252G 1% /var/lib/ceph/osd/ceph-3 tmpfs 252G 24K 252G 1% /var/lib/ceph/osd/ceph-2 tmpfs 252G 24K 252G 1% /var/lib/ceph/osd/ceph-0 tmpfs 252G 24K 252G 1% /var/lib/ceph/osd/ceph-1 /dev/fuse 30M 32K 30M 1% /etc/pve tmpfs 51G 0 51G 0% /run/user/0 lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT nvme4n1 259:0 0 238.5G 0 disk ├─nvme4n1p1 259:5 0 1007K 0 part ├─nvme4n1p2 259:6 0 512M 0 part └─nvme4n1p3 259:7 0 238G 0 part nvme5n1 259:1 0 238.5G 0 disk ├─nvme5n1p1 259:2 0 1007K 0 part ├─nvme5n1p2 259:3 0 512M 0 part └─nvme5n1p3 259:4 0 238G 0 part nvme0n1 259:12 0 2.9T 0 disk └─ceph--cc77fe1b--c8d4--48be--a7c4--36109439c85c-osd--block--80e0127e--836e--44b8--882d--ac49bfc85866 253:3 0 2.9T 0 lvm nvme1n1 259:13 0 2.9T 0 disk └─ceph--eb8b2fc7--775e--4b94--8070--784e7bbf861e-osd--block--4d433222--e1e8--43ac--8dc7--2e6e998ff122 253:2 0 2.9T 0 lvm nvme3n1 259:14 0 2.9T 0 disk └─ceph--5724bdf7--5124--4244--91d6--e254210c2174-osd--block--2d6fe149--f330--415a--a762--44d037c900b1 253:1 0 2.9T 0 lvm nvme2n1 259:15 0 2.9T 0 disk └─ceph--cb5762e9--40fa--4148--98f4--5b5ddef4c1de-osd-
[ceph-users] Re: Large map object found
Peter;

As with many things in Ceph, I don't believe it's a hard and fast rule (i.e. a non-power-of-2 will work). I believe the issues are performance and balance. I can't confirm that. Perhaps someone else on the list will add their thoughts.

Has your warning gone away?

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com

From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Friday, October 23, 2020 5:41 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Perfect -- many thanks Dominic!

I haven't found a doc which notes that --num-shards needs to be a power of two. It isn't that I don't believe you -- I just haven't seen that anywhere.

peter

Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com

On 10/22/20, 10:24 AM, "dhils...@performair.com" wrote:

    Peter;

    I believe shard counts should be powers of two. Also, resharding makes the buckets unavailable, but occurs very quickly. As such it is not done in the background, but in the foreground, for a manual reshard.

    Notice the statement: "reshard of bucket from to completed successfully." It's done. The warning notice won't go away until a scrub is completed to determine that a large OMAP object no longer exists.

    Thank you,

    Dominic L. Hilsbos, MBA
    Director – Information Technology
    Perform Air International Inc.
    dhils...@performair.com
    www.PerformAir.com

    From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
    Sent: Thursday, October 22, 2020 8:04 AM
    To: Dominic Hilsbos; ceph-users@ceph.io
    Subject: Re: Large map object found

    Thank you! This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51
new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1
total entries: 1000 2000 3000 3228
2020-10-22 08:40:26.353 7fb197fc66c0 1 execute INFO: reshard of bucket "backups_sql_dswhseloadrepl_segments" from "d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51" to "d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1" completed successfully

[root@cephmon-s03 ~]# radosgw-admin buckets reshard list
[]

[root@cephmon-s03 ~]# radosgw-admin buckets reshard status --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments
[
    {
        "reshard_status": "not-resharding",
        "new_bucket_instance_id": "",
        "num_shards": -1
    },
    {
        "reshard_status": "not-resharding",
        "new_bucket_instance_id": "",
        "num_shards": -1
    },
    {
        "reshard_status": "not-resharding",
        "new_bucket_instance_id": "",
        "num_shards": -1
    }
]
[root@cephmon-s03 ~]#

This kicked off an autoscale event. Would the reshard presumably start after the autoscaling is complete?

peter

Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com
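For anyone sizing shards by hand, a rough sketch of how to sanity-check the shard count before resharding. The "well under ~100k objects per shard" figure is the usual rule of thumb (it tracks the rgw_max_objs_per_shard default), and the tenant/bucket names below are simply the ones from this thread:

  # current object count for the bucket
  radosgw-admin bucket stats --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments | grep num_objects

  # per-shard fill status against the configured limits
  radosgw-admin bucket limit check

  # anything queued or currently resharding
  radosgw-admin reshard list
  radosgw-admin reshard status --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments

Whether the count is a power of two matters far less than keeping each shard's OMAP object comfortably below the large-omap warning threshold.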
[ceph-users] Re: Hardware for new OSD nodes.
Yes, the UEFI problem with mirrored mdraid boot is well-documented. I've generally been working with BIOS partition maps, which do not have the single point of failure UEFI has (/boot can be mounted as mirrored, and any of the members can be used as non-RAID by GRUB). But BIOS maps have problems as well with volume size.

That said, the disks are portable at that point and really don't have deep performance bottlenecks, because mirroring and striping is cheap.

Sent from my iPhone

> On Oct 23, 2020, at 03:54, Eneko Lacunza wrote:
>
> Hi Brian,
>
>> On 22/10/20 at 18:41, Brian Topping wrote:
>> On Oct 22, 2020, at 10:34 AM, Anthony D'Atri wrote:
>>> - You must really be sure your raid card is dependable. (sorry but I have seen so much management problems with top-tier RAID cards I avoid them like the plague).
>>> This.
>> I'd definitely avoid a RAID card. If I can do advanced encryption with an MMX instruction, I think I can certainly trust IOMMU to handle device multiplexing from software in an efficient manner, no? mdadm RAID is just fine for me and is reliably bootable from GRUB.
>>
>> I'm not an expert in driver mechanics, but mirroring should be very low overhead at the software level.
>>
>> Once it's software RAID, moving disks between chassis is a simple process as well.
>>
>> Apologies I didn't make that clear earlier...
> Yes, I really like mdraid :) . Problem is BIOS/UEFI has to find a working bootable disk. I think some BIOS/UEFIs have settings for a secondary boot/UEFI bootfile, but that would have to be prepared and maintained manually, out of the mdraid10; and would only work with a total failure of the primary disk.
>
> Cheers
>
> --
> Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
> Zuzendari teknikoa | https://www.binovo.es
> Director técnico | Astigarragako Bidea, 2 - 2º izda.
> BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
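For reference, a minimal sketch of the mirrored-boot layout being discussed; the device names and partition numbers are hypothetical, and --metadata=1.0 is chosen only so the RAID superblock sits at the end of the partition, where firmware and GRUB can still read each member as a plain filesystem:

  # mirror the boot partition across both system disks
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda2 /dev/sdb2
  mkfs.ext4 /dev/md0    # mount as /boot

  # install the bootloader on both disks so either one can boot the node alone
  grub-install /dev/sda
  grub-install /dev/sdb

As noted above, this covers a legacy-BIOS setup; with UEFI the ESP itself still has to be kept in sync on each disk by other means.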
[ceph-users] Re: Urgent help needed please - MDS offline
Success!

I remembered I had a server I'd taken out of the cluster to investigate some issues that had some good-quality 800GB Intel DC SSDs. I dedicated an entire drive to swap, tuned up min_free_kbytes, added an MDS to that server and let it run. It took 3 - 4 hours but the MDS eventually came back online. It used the 128GB of RAM and about 250GB of the swap.

Dan, thanks so much for steering me down this path, I would more than likely have started hacking away at the journal otherwise!

Frank, thanks for pointing me towards that other thread, I used your min_free_kbytes tip.

I now need to consider updating - I wonder if the risk-averse CephFS operator would go for the latest Nautilus or the latest Octopus. It used to be that the newer CephFS code was the most stable, but I don't know if that's still the case.

Thanks again,
David

On Thu, Oct 22, 2020 at 7:06 PM Frank Schilder wrote:
>
> The post was titled "mds behind on trimming - replay until memory exhausted".
>
> > Load up with swap and try the up:replay route.
> > Set the beacon to 10 until it finishes.
>
> Good point! The MDS will not send beacons for a long time. Same was necessary in the other case.
>
> Good luck!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
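For the archives, a rough sketch of the host/MDS settings this recovery relied on. The device name and the numeric values are assumptions, and "set the beacon" is taken here to mean raising mds_beacon_grace so the mons don't fail the MDS over while it is still chewing through up:replay:

  # dedicate a fast SSD to swap on the host running the recovering MDS (device hypothetical)
  mkswap /dev/sdX
  swapon /dev/sdX

  # keep a reserve so the kernel doesn't stall under memory pressure (example value, ~4 GB)
  sysctl -w vm.min_free_kbytes=4194304

  # in ceph.conf [global] (or 'ceph config set global mds_beacon_grace 600' on releases
  # with the config database) - example value, revert once the MDS is active again
  mds_beacon_grace = 600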
[ceph-users] Re: Large map object found
Yes, the OMAP warning has cleared after running the deep-scrub, with all the swiftness. Thanks again!

Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com

On 10/23/20, 10:48 AM, "dhils...@performair.com" wrote:

    Peter;

    As with many things in Ceph, I don't believe it's a hard and fast rule (i.e. a non-power-of-2 will work). I believe the issues are performance and balance. I can't confirm that. Perhaps someone else on the list will add their thoughts.

    Has your warning gone away?

    Thank you,

    Dominic L. Hilsbos, MBA
    Director – Information Technology
    Perform Air International Inc.
    dhils...@performair.com
    www.PerformAir.com

    From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
    Sent: Friday, October 23, 2020 5:41 AM
    To: Dominic Hilsbos; ceph-users@ceph.io
    Subject: Re: Large map object found

    Perfect -- many thanks Dominic!

    I haven't found a doc which notes that --num-shards needs to be a power of two. It isn't that I don't believe you -- I just haven't seen that anywhere.

    peter

    Peter Eisch
    Senior Site Reliability Engineer
    T 1.612.445.5135
    virginpulse.com

    On 10/22/20, 10:24 AM, "dhils...@performair.com" wrote:

    Peter;

    I believe shard counts should be powers of two. Also, resharding makes the buckets unavailable, but occurs very quickly. As such it is not done in the background, but in the foreground, for a manual reshard.

    Notice the statement: "reshard of bucket from to completed successfully." It's done. The warning notice won't go away until a scrub is completed to determine that a large OMAP object no longer exists.

    Thank you,

    Dominic L. Hilsbos, MBA
    Director – Information Technology
    Perform Air International Inc.
    dhils...@performair.com
    www.PerformAir.com

    From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
    Sent: Thursday, October 22, 2020 8:04 AM
    To: Dominic Hilsbos; ceph-users@ceph.io
    Subject: Re: Large map object found

    Thank you! This was helpful.

    I opted for a manual reshard:

    [root@cephmon-s03 ~]# radosgw-admin bucket reshard --bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
    tenant: d2ff913f5b6542cda307c9cd6a95a214
    bucket name: backups_sql_dswhseloadrepl_segments
    old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.3408
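A short sketch of how that warning actually gets cleared, since the thread relies on it: the cluster log names the PG and object that tripped the large-omap check, and a deep scrub of just that PG re-counts the keys. The PG id below is only a placeholder:

  # find which PG/object triggered the warning
  ceph health detail
  grep -i 'large omap object' /var/log/ceph/ceph.log

  # deep-scrub just that PG instead of waiting for the next scheduled scrub
  ceph pg deep-scrub 7.2a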
[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed
This is a Proxmox cluster ... sorry for the formatting problems of my post :(

Short plot: we messed up an IP address change of the public network, so the monitors went down. We changed the monitor information in ceph.conf and with:

ceph-mon -i pve01 --extract-monmap /tmp/monmap
monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add pve03 10.100.200.143 /tmp/monmap
monmaptool --print /tmp/monmap
ceph-mon -i pve01 --inject-monmap /tmp/monmap

Then we restarted all three nodes, but the OSDs don't come up.

So how to recover from this disaster?

# ceph -s
  cluster:
    id:     92d063d7-647c-44b8-95d7-86057ee0ab22
    health: HEALTH_WARN
            1 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
    mgr: pve01(active, since 19h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

cat /etc/pve/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.112.200.0/24
        fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
        mon_allow_pool_delete = true
        mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.100.200.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.100.200.141

[mon.pve02]
        public_addr = 10.100.200.142

[mon.pve03]
        public_addr = 10.100.200.143

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing
+49 8191 4283888
+49 171 4802507

On 23.10.2020 at 13:50, Burkhard Linke wrote:
> Hi,
>
> your mail is formatted in a way that makes it impossible to get all information, so a number of questions first:
>
> - are the mons up, or are the mons up and in a quorum? You cannot change mon IP addresses without also adjusting them in the mon map. Use the daemon socket on the systems to query the current state of the mons.
>
> - the osd systemd output is useless for debugging. It only states that the osd is not running and not able to start.
>
> The real log files are located in /var/log/ceph/. If the mons are in quorum, you should find more information here. Keep in mind that you also need to change ceph.conf on the OSD hosts if you change the mon IP addresses, otherwise the OSDs won't be able to find the mons and the processes will die.
>
> And I do not understand how corosync should affect your ceph cluster. Ceph does not use corosync...
>
> If you need fast help I can recommend the ceph irc channel ;-)
>
> Regards,
>
> Burkhard
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
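One way to answer Burkhard's first question: a minimal sketch using the mon admin sockets, with the daemon names and addresses taken from the ceph.conf above:

  # on each mon host: is this mon in quorum, and which monmap does it believe in?
  ceph daemon mon.pve01 mon_status | grep -E 'state|quorum'
  ceph daemon mon.pve01 mon_status | grep -A20 monmap

  # from each OSD host: the new mon addresses must be reachable on the messenger ports
  nc -zv 10.100.200.141 3300
  nc -zv 10.100.200.141 6789

If the mons agree on the new monmap and the ports answer, the OSD-side logs (not the systemd status) are the next place to look.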
[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed
Have you tried to recover the old IPs?

On 23/10/20 at 14:22, Gerhard W. Recher wrote:

This is a Proxmox cluster ... sorry for the formatting problems of my post :(

Short plot: we messed up an IP address change of the public network, so the monitors went down. We changed the monitor information in ceph.conf and with:

ceph-mon -i pve01 --extract-monmap /tmp/monmap
monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add pve03 10.100.200.143 /tmp/monmap
monmaptool --print /tmp/monmap
ceph-mon -i pve01 --inject-monmap /tmp/monmap

Then we restarted all three nodes, but the OSDs don't come up.

So how to recover from this disaster?

# ceph -s
  cluster:
    id:     92d063d7-647c-44b8-95d7-86057ee0ab22
    health: HEALTH_WARN
            1 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
    mgr: pve01(active, since 19h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

cat /etc/pve/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.112.200.0/24
        fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
        mon_allow_pool_delete = true
        mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.100.200.0/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.100.200.141

[mon.pve02]
        public_addr = 10.100.200.142

[mon.pve03]
        public_addr = 10.100.200.143

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing
+49 8191 4283888
+49 171 4802507

On 23.10.2020 at 13:50, Burkhard Linke wrote:

Hi,

your mail is formatted in a way that makes it impossible to get all information, so a number of questions first:

- are the mons up, or are the mons up and in a quorum? You cannot change mon IP addresses without also adjusting them in the mon map. Use the daemon socket on the systems to query the current state of the mons.

- the osd systemd output is useless for debugging. It only states that the osd is not running and not able to start.

The real log files are located in /var/log/ceph/. If the mons are in quorum, you should find more information here. Keep in mind that you also need to change ceph.conf on the OSD hosts if you change the mon IP addresses, otherwise the OSDs won't be able to find the mons and the processes will die.

And I do not understand how corosync should affect your ceph cluster. Ceph does not use corosync...

If you need fast help I can recommend the ceph irc channel ;-)

Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--
Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Ceph and ram limits
For some days now I have been recovering my Ceph cluster. It all started with OSDs being killed by the OOM killer, so I created a script to delete the corrupted PGs from the OSDs (I say "corrupted" because those PGs are the cause of the 100% RAM usage by the OSDs). Great, I am almost done with all OSDs of my cluster, but now the monitors are consuming all the servers' RAM, and the managers too. Why? Why do they use 60GB of RAM, and is there something to cap that? I have tried configuring every kind of RAM limit to the minimum.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
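There is no single hard cap for mon/mgr memory, but these are the usual knobs; a sketch only, with example values to adapt rather than recommendations, and they bound caches, so a mon churning through huge osdmap ranges during a long recovery can still exceed them. On releases without the config database, put the same options in ceph.conf instead:

  # per-OSD memory target in bytes; the BlueStore cache autotuner works toward this
  ceph config set osd osd_memory_target 2147483648

  # shrink the monitor's osdmap cache (default 500 maps)
  ceph config set mon mon_osd_cache_size 100

  # compact a ballooned mon store, which is often what is actually hurting
  ceph tell mon.$(hostname -s) compact

The real fix is usually letting the cluster reach HEALTH_OK so the mons can trim old osdmaps, not squeezing the caches further.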
[ceph-users] Re: OSD Failures after pg_num increase on one of the pools
It was ok in monitoring and logs, OSD nodes have plenty of available cpu and ram. Previous pg_num was 256. From: Eugen Block Sent: Friday, October 23, 2020 2:06:27 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: OSD Failures after pg_num increase on one of the pools Hi, do you see any peaks on the OSD nodes like OOM killer etc.? Instead of norecover flag I would try the nodown and noout flags to prevent flapping OSDs. What was the previous pg_num before you increased to 512? Regards, Eugen Zitat von Артём Григорьев : > Hello everyone, > > I created a new ceph 14.2.7 Nautilus cluster recently. Cluster consists of > 3 racks and 2 osd nodes on each rack, 12 new hdd in each node. HDD > model is TOSHIBA > MG07ACA14TE 14Tb. All data pools are ec pools. > Yesterday I decided to increase pg number on one of the pools with > command "ceph > osd pool set photo.buckets.data pg_num 512", after that many osds started > to crash with "out" and "down" status. I tried to increase recovery_sleep > to 1s but osds still crashes. Osds started working properly only when i set > "norecover" flag, but osd scrub errors appeared after that. > > In logs from osd during crashes i found this: > --- > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN > > E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc: > In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, > RecoveryMessages*)' > > thread 7f8af535d700 time 2020-10-21 15:12:11.460092 > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN > > E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc: > 648: FAILED ceph_assert(pop.data.length() == > sinfo.aligned_logical_offset_to_chunk_offset( aft > > er_progress.data_recovered_to - op.recovery_progress.data_recovered_to)) > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7 > (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable) > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1: > (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x14a) [0x55fc694d6c0f] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x47) > [0x55fc694d6dd7] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3: > (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, > RecoveryMessages*)+0x1740) [0x55fc698cafa0] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4: > (ECBackend::handle_recovery_read_complete(hobject_t const&, > boost::tuples::tuple ceph::buffer::v14_2_0::list, std::less, > std::allocator > >> , boost::tuples::null_type, boost::tuples::null_type, > boost::tuples::null_type, boost::tuples::null_type, > boost::tuples::null_type, boost::tuples::null_type, > boost::tuples::null_type>&, boost::optional ceph::buffer::v14_2_0::list, std::less, > std::allocator > >> >, RecoveryMessages*)+0x734) [0x55fc698cb804] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5: > (OnRecoveryReadComplete::finish(std::pair ECBackend::read_result_t&>&)+0x94) [0x55fc698ebbe4] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6: > (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c) > [0x55fc698bfdcc] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7: > (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, > RecoveryMessages*, ZTracer::Trace const&)+0x109c) 
[0x55fc698d6b8c] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8: > (ECBackend::_handle_message(boost::intrusive_ptr)+0x17f) > [0x55fc698d718f] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9: > (PGBackend::handle_message(boost::intrusive_ptr)+0x4a) > [0x55fc697c18ea] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10: > (PrimaryLogPG::do_request(boost::intrusive_ptr&, > ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11: > (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr, > ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*, > OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62) > [0x55fc698415c2] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13: > (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) > [0x55fc695cebbf] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14: > (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) > [0x55fc69b6f976] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15: > (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65) > [0x7f8b1ddede65] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d) > [0x7f8b1ccb188d] > > Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught si
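Not a fix for the assert itself (that backtrace is worth a search against the upstream tracker), but a hedged sketch of how to take pressure off while recovering, along the lines Eugen suggests; the values are examples only:

  # stop the mark-down/mark-out churn while OSDs restart
  ceph osd set nodown
  ceph osd set noout

  # throttle recovery/backfill on the HDD OSDs
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_recovery_sleep_hdd 0.5

  # grow pg_num in small steps rather than 256 -> 512 in one jump
  ceph osd pool set photo.buckets.data pg_num 288

  # remember to clear the flags once things are stable
  ceph osd unset nodown
  ceph osd unset noout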
[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed
Yep, I have now reverted the IP changes. The OSDs still do not come up, I see no error in ceph.log, and the OSD logs are empty ...

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing
+49 8191 4283888
+49 171 4802507

On 23.10.2020 at 14:28, Eneko Lacunza wrote:
> Have you tried to recover the old IPs?
>
> On 23/10/20 at 14:22, Gerhard W. Recher wrote:
>> This is a Proxmox cluster ...
>> sorry for the formatting problems of my post :(
>>
>> Short plot: we messed up an IP address change of the public network, so the monitors went down.
>>
>> We changed the monitor information in ceph.conf and with:
>> ceph-mon -i pve01 --extract-monmap /tmp/monmap
>> monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
>> monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add pve03 10.100.200.143 /tmp/monmap
>> monmaptool --print /tmp/monmap
>> ceph-mon -i pve01 --inject-monmap /tmp/monmap
>>
>> Then we restarted all three nodes, but the OSDs don't come up.
>>
>> So how to recover from this disaster?
>>
>> # ceph -s
>>   cluster:
>>     id:     92d063d7-647c-44b8-95d7-86057ee0ab22
>>     health: HEALTH_WARN
>>             1 daemons have recently crashed
>>             OSD count 0 < osd_pool_default_size 3
>>
>>   services:
>>     mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
>>     mgr: pve01(active, since 19h)
>>     osd: 0 osds: 0 up, 0 in
>>
>>   data:
>>     pools:   0 pools, 0 pgs
>>     objects: 0 objects, 0 B
>>     usage:   0 B used, 0 B / 0 B avail
>>     pgs:
>>
>> cat /etc/pve/ceph.conf
>> [global]
>>         auth_client_required = cephx
>>         auth_cluster_required = cephx
>>         auth_service_required = cephx
>>         cluster_network = 10.112.200.0/24
>>         fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
>>         mon_allow_pool_delete = true
>>         mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
>>         osd_pool_default_min_size = 2
>>         osd_pool_default_size = 3
>>         public_network = 10.100.200.0/24
>>
>> [client]
>>         keyring = /etc/pve/priv/$cluster.$name.keyring
>>
>> [mon.pve01]
>>         public_addr = 10.100.200.141
>>
>> [mon.pve02]
>>         public_addr = 10.100.200.142
>>
>> [mon.pve03]
>>         public_addr = 10.100.200.143
>>
>> Gerhard W. Recher
>>
>> net4sec UG (haftungsbeschränkt)
>> Leitenweg 6
>> 86929 Penzing
>>
>> +49 8191 4283888
>> +49 171 4802507
>> On 23.10.2020 at 13:50, Burkhard Linke wrote:
>>> Hi,
>>>
>>> your mail is formatted in a way that makes it impossible to get all
>>> information, so a number of questions first:
>>>
>>> - are the mons up, or are the mons up and in a quorum? You cannot
>>> change mon IP addresses without also adjusting them in the mon map.
>>> Use the daemon socket on the systems to query the current state of
>>> the mons.
>>>
>>> - the osd systemd output is useless for debugging. It only states that
>>> the osd is not running and not able to start.
>>>
>>> The real log files are located in /var/log/ceph/. If the mons are in
>>> quorum, you should find more information here. Keep in mind that you
>>> also need to change ceph.conf on the OSD hosts if you change the mon
>>> IP addresses, otherwise the OSDs won't be able to find the mons and
>>> the processes will die.
>>>
>>> And I do not understand how corosync should affect your ceph cluster.
>>> Ceph does not use corosync...
>>>
>>> If you need fast help I can recommend the ceph irc channel ;-)
>>>
>>> Regards,
>>>
>>> Burkhard
>>>
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [External Email] Re: Hardware for new OSD nodes.
Brian, Eneko,

BTW, the Tyan LFF chassis we've been using has 12 x 3.5" bays in front and 2 x 2.5" SATA bays in back. We've been using 240GB SSDs in the rear bays for mirrored boot drives, so any NVMe we add is exclusively for OSD support.

-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

On 10/23/2020 11:55 AM, Brian Topping wrote:

Yes, the UEFI problem with mirrored mdraid boot is well-documented. I've generally been working with BIOS partition maps, which do not have the single point of failure UEFI has (/boot can be mounted as mirrored, and any of the members can be used as non-RAID by GRUB). But BIOS maps have problems as well with volume size.

That said, the disks are portable at that point and really don't have deep performance bottlenecks, because mirroring and striping is cheap.

Sent from my iPhone

On Oct 23, 2020, at 03:54, Eneko Lacunza wrote:

Hi Brian,

On 22/10/20 at 18:41, Brian Topping wrote:

On Oct 22, 2020, at 10:34 AM, Anthony D'Atri wrote:

- You must really be sure your raid card is dependable. (sorry but I have seen so much management problems with top-tier RAID cards I avoid them like the plague).

This.

I'd definitely avoid a RAID card. If I can do advanced encryption with an MMX instruction, I think I can certainly trust IOMMU to handle device multiplexing from software in an efficient manner, no? mdadm RAID is just fine for me and is reliably bootable from GRUB.

I'm not an expert in driver mechanics, but mirroring should be very low overhead at the software level.

Once it's software RAID, moving disks between chassis is a simple process as well.

Apologies I didn't make that clear earlier...

Yes, I really like mdraid :) . Problem is BIOS/UEFI has to find a working bootable disk. I think some BIOS/UEFIs have settings for a secondary boot/UEFI bootfile, but that would have to be prepared and maintained manually, out of the mdraid10; and would only work with a total failure of the primary disk.

Cheers

--
Eneko Lacunza | +34 943 569 206 | elacu...@binovo.es
Zuzendari teknikoa | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [External Email] Re: Hardware for new OSD nodes.
Eneko,

# ceph health detail
HEALTH_WARN BlueFS spillover detected on 7 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD(s)
     osd.1 spilled over 648 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.3 spilled over 613 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.4 spilled over 485 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.10 spilled over 1008 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.17 spilled over 808 MiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.18 spilled over 2.5 GiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device
     osd.20 spilled over 1.5 GiB metadata from 'db' device (28 GiB used of 124 GiB) to slow device

nvme0n1 259:1 0 1.5T 0 disk
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--6dcbb748--13f5--45cb--9d49--6c78d6589a71
│       253:1 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--736a22a8--e4aa--4da9--b63b--295d8f5f2a3d
│       253:3 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--751c6623--9870--4123--b551--1fd7fc837341
│       253:5 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--2a376e8d--abb1--42af--a4bd--4ae8734d703e
│       253:7 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--54fbe282--9b29--422b--bdb2--d7ed730bc589
│       253:9 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--c1153cd2--2ec0--4e7f--a3d7--91dac92560ad
│       253:11 0 124G 0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--d613f4eb--6ddc--4dd5--a2b5--cb520b6ba922
│       253:13 0 124G 0 lvm
└─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--41f75c25--67db--46e8--a3fb--ddee9e7f7fc4
        253:15 0 124G 0 lvm

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

On 10/23/2020 6:00 AM, Eneko Lacunza wrote:

Hi Dave,

On 22/10/20 at 19:43, Dave Hall wrote:

On 22/10/20 at 16:48, Dave Hall wrote:

(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a couple questions about hardware choices. Our original nodes were 8 x 12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc. We chose the NVMe card for performance since it has an 8-lane PCIe interface. However, we're currently seeing BlueFS spillovers.

The Tyan chassis we are considering has the option of 4 x U.2 NVMe bays - each with 4 PCIe lanes (and 8 SAS bays). It has occurred to me that I might stripe 4 1TB NVMe drives together to get much more space for WAL/DB and a net performance of 16 PCIe lanes. Any thoughts on this approach?

Don't stripe them, if one NVMe fails you'll lose all OSDs. Just use 1 NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for each OSD (see related threads on this mailing list about why that exact size). This way if a NVMe fails, you'll only lose 2 OSDs.

I was under the impression that everything that BlueStore puts on the SSD/NVMe could be reconstructed from information on the OSD. Am I mistaken about this? If so, my single 1.6TB NVMe card is equally vulnerable.

I don't think so, that info only exists on that partition, as was the case with the filestore journal. Your single 1.6TB NVMe is vulnerable, yes.

Also, what size of WAL/DB partitions do you have now, and what spillover size?

I recently posted another question to the list on this topic, since I now have spillover on 7 of 24 OSDs. Since the data layout on the NVMe for BlueStore is not traditional, I've never quite figured out how to get this information. The current partition size is 1.6TB / 12, since we had the possibility to add four more drives to each node. How that was divided between WAL, DB, etc. is something I'd like to be able to understand. However, we're not going to add the extra 4 drives, so expanding the LVM partitions is now a possibility.

Can you paste the warning message? It shows the spillover size. What size are the partitions on the NVMe disk (lsblk)?

Cheers
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
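Since the DB LVs above are 124 GiB with only ~28 GiB in use, a sketch of how to measure and then relieve the spillover. The OSD id, VG/LV names and target size are placeholders, the lvextend step assumes free space remains in the VG, and the ~300 GB figure simply echoes the RocksDB-level-size advice referred to earlier in this thread:

  # how much of the DB currently sits on the slow (HDD) device
  ceph daemon osd.1 perf dump bluefs | grep -E '"db_|"slow_'

  # grow the DB LV, then tell BlueFS about the new size
  systemctl stop ceph-osd@1
  lvextend -L 300G /dev/ceph-block-dbs-a2b7a161/osd-block-db-6dcbb748   # names are placeholders
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
  systemctl start ceph-osd@1

Repeat per affected OSD; the spillover warning clears once the DB data migrates back off the slow device.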
[ceph-users] Re: Urgent help needed please - MDS offline
On Fri, Oct 23, 2020 at 9:02 AM David C wrote: > > Success! > > I remembered I had a server I'd taken out of the cluster to > investigate some issues, that had some good quality 800GB Intel DC > SSDs, dedicated an entire drive to swap, tuned up min_free_kbytes, > added an MDS to that server and let it run. Took 3 - 4 hours but > eventually came back online. It used the 128GB of RAM and about 250GB > of the swap. > > Dan, thanks so much for steering me down this path, I would have more > than likely started hacking away at the journal otherwise! > > Frank, thanks for pointing me towards that other thread, I used your > min_free_kbytes tip > > I now need to consider updating - I wonder if the risk averse CephFS > operator would go for the latest Nautilus or latest Octopus, it used > to be that the newer CephFS code meant the most stable but don't know > if that's still the case. You need to first upgrade to Nautilus in any case. n+2 releases is the max delta between upgrades. -- Patrick Donnelly, Ph.D. He / Him / His Principal Software Engineer Red Hat Sunnyvale, CA GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
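For the upgrade question, a small sketch of the usual pre-flight checks; the filesystem name is a placeholder, and the max_mds step reflects the documented pre-Octopus requirement to upgrade with a single active MDS, as mentioned earlier in this thread:

  # confirm what every daemon currently runs before planning the Luminous/Mimic -> Nautilus -> Octopus path
  ceph versions
  ceph features

  # before upgrading the MDS daemons, drop to a single active MDS and wait for the others to become standby
  ceph status
  ceph fs set cephfs max_mds 1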