[ceph-users] Re: ceph octopus centos7, containers, cephadm

2020-10-23 Thread Marc Roos


No clarity on this? 

-Original Message-
To: ceph-users
Subject: [ceph-users] ceph octopus centos7, containers, cephadm


I am running Nautilus on centos7. Does Octopus run similarly to Nautilus,
i.e.:

- runs on el7/centos7
- runs without containers by default
- runs without cephadm by default

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus centos7, containers, cephadm

2020-10-23 Thread Dan van der Ster
I'm not sure I understood the question.

If you're asking if you can run octopus via RPMs on el7 without the
cephadm and containers orchestration, then the answer is yes.

-- dan

On Fri, Oct 23, 2020 at 9:47 AM Marc Roos  wrote:
>
>
> No clarity on this?
>
> -Original Message-
> To: ceph-users
> Subject: [ceph-users] ceph octopus centos7, containers, cephadm
>
>
> I am running Nautilus on centos7. Does octopus run similar as nautilus
> thus:
>
> - runs on el7/centos7
> - runs without containers by default
> - runs without cephadm by default
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus centos7, containers, cephadm

2020-10-23 Thread David Majchrzak, ODERLAND Webbhotell AB

Hi!

Runs on el7: https://download.ceph.com/rpm-octopus/el7/x86_64/

Runs as usual without containers by default - if you use cephadm for 
deployments then it will use containers.


cephadm is one way to do deployments, you can however deploy whichever 
way you want (manually etc).
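
For a manual (non-cephadm) install the plain RPM route works; roughly something
like this (a minimal sketch of the repo file):

cat > /etc/yum.repos.d/ceph.repo <<'EOF'
[ceph]
name=Ceph packages for x86_64
baseurl=https://download.ceph.com/rpm-octopus/el7/x86_64/
enabled=1
gpgcheck=1
gpgkey=https://download.ceph.com/keys/release.asc
EOF

yum install ceph ceph-radosgw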


--

David Majchrzak
CTO
Oderland Webbhotell AB
Östra Hamngatan 50B, 411 09 Göteborg, SWEDEN

Den 2020-10-23 kl. 09:47, skrev Marc Roos:

No clarity on this?

-Original Message-
To: ceph-users
Subject: [ceph-users] ceph octopus centos7, containers, cephadm


I am running Nautilus on centos7. Does octopus run similar as nautilus
thus:

- runs on el7/centos7
- runs without containers by default
- runs without cephadm by default

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-23 Thread Burkhard Linke

Hi,


Non-round-robin entries with multiple mon host FQDNs are also broken.


Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware needs for MDS for HPC/OpenStack workloads?

2020-10-23 Thread Stefan Kooman
On 2020-10-22 14:34, Matthew Vernon wrote:
> Hi,
> 
> We're considering the merits of enabling CephFS for our main Ceph
> cluster (which provides object storage for OpenStack), and one of the
> obvious questions is what sort of hardware we would need for the MDSs
> (and how many!).

Is it a workload of many parallel large writes without a lot of fs
manipulation (file creation / deletion, attribute updates)? Then you might
only need 2 for HA (active-standby). But when used as a regular fs with
many clients and a lot of small IO, you might run out of the
performance of a single MDS. Add (many) more as you see fit. Keep in
mind it does make things a bit more complex (different ranks when more
than one active MDS) and that when you need to upgrade you have to
scale down to a single active MDS. You can pin directories to a single MDS if you know
your workload well enough.
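
For example, pinning is just an extended attribute on the directory (paths and
ranks here are only illustrative):

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects/alpha
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects/beta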

> 
> These would be for our users scientific workloads, so they would need to
> provide reasonably high performance. For reference, we have 3060 6TB
> OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.

It really depends on the workload. If there are a lot of file / directory
operations the MDS needs to keep track of all that and needs to be able
to cache as well (inodes / dentries). The more files/dirs, the more RAM
you need. We don't have PBs of storage (but 39 TB for CephFS) but have
MDSes with 256 GB RAM for cache for all the little files and many dirs
we have. Prefer a few faster cores over many slower cores.


> 
> The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> similar networking to the OSD nodes), but I'm not sure how much going
> beyond that is likely to be useful in production.

MDSes don't do a lot of traffic. Clients write directly to OSDs after
they have acquired capabilities (CAPS) from MDS.

> 
> I've also seen it suggested that an SSD-only pool is sensible for the
> CephFS metadata pool; how big is that likely to get?

Yes, but CephFS, like RGW (index), stores a lot of data in OMAP and the
RocksDB databases tend to get quite large, especially when storing many
small files and lots of dirs. So if that happens to be the workload,
make sure you have plenty of them. We once put all cephfs_metadata on 30
NVMe ... and that was not a good thing. Spread that data out over as
many SSDs / NVMes as you can. Do your HDDs have their WAL / DB on flash?
cephfs_metadata does not take up a lot of space, but Mimic does not
account for all occupied space as well as newer releases do. I
guess it's in the order of 5% of the CephFS size. But again, this might be
wildly different on other deployments.

> 
> I'd be grateful for any pointers :)

I would buy a CPU with a high clock speed and ~4-8 cores. RAM as needed,
but 32 GB will be the minimum, I guess.

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Strange USED size

2020-10-23 Thread Eugen Block

Hi,

did you delete lots of objects recently? That operation is slow and  
ceph takes some time to catch up. If the value is not decreasing post  
again with 'ceph osd df' output.


Regards,
Eugen


Zitat von Marcelo :


Hello. I've searched a lot but couldn't find why the size of USED column in
the output of ceph df is a lot times bigger than the actual size. I'm using
Nautilus (14.2.8), and I've 1000 buckets with 100 objectsineach bucket.
Each object is around 10B.

ceph df
RAW STORAGE:
    CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
    hdd    511 GiB  147 GiB  340 GiB  364 GiB       71.21
    TOTAL  511 GiB  147 GiB  340 GiB  364 GiB       71.21

POOLS:
    POOL                        ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
    .rgw.root                    1  1.1 KiB        4  768 KiB      0     36 GiB
    default.rgw.control         11      0 B        8      0 B      0     36 GiB
    default.rgw.meta            12  449 KiB    2.00k  376 MiB   0.34     36 GiB
    default.rgw.log             13  3.4 KiB      207    6 MiB      0     36 GiB
    default.rgw.buckets.index   14      0 B    1.00k      0 B      0     36 GiB
    default.rgw.buckets.data    15  969 KiB     100k   18 GiB  14.52     36 GiB
    default.rgw.buckets.non-ec  16     27 B        1  192 KiB      0     36 GiB

Does anyone know the math behind this, showing 18 GiB used when I
have something like 1 MiB?

Thanks in advance, Marcelo.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rados Crashing

2020-10-23 Thread Eugen Block

Hi,

I read that civetweb and radosgw have a locking issue in combination  
with ssl [1], just a thought based on



failed to acquire lock on obj_delete_at_hint.79


Since Nautilus the default rgw frontend is beast, have you thought  
about switching?
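
Switching is basically a one-line change in the RGW section of ceph.conf plus a
restart of the gateways; a sketch (section name and cert path are just examples):

[client.rgw.gateway1]
rgw_frontends = beast port=80
# or, if the RGW terminates TLS itself:
# rgw_frontends = beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.pem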


Regards,
Eugen


[1] https://tracker.ceph.com/issues/22951


Zitat von Brent Kennedy :


We are performing file maintenance (deletes, essentially) and when the
process gets to a certain point, all four rados gateways crash with the
following:





Log output:

-5> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
verifying op params

-4> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
pre-executing

-3> 2020-10-20 06:09:53.996 7f15f1543700  2 req 7 0.000s s3:delete_obj
executing

-2> 2020-10-20 06:09:53.997 7f161758f700 10 monclient: get_auth_request
con 0x55d2c02ff800 auth_method 0

-1> 2020-10-20 06:09:54.009 7f1609d74700  5 process_single_shard():
failed to acquire lock on obj_delete_at_hint.79

 0> 2020-10-20 06:09:54.035 7f15f1543700 -1 *** Caught signal
(Segmentation fault) **

in thread 7f15f1543700 thread_name:civetweb-worker



ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus
(stable)

1: (()+0xf5d0) [0x7f161d3405d0]

2: (()+0x2bec80) [0x55d2bcd1fc80]

3: (std::string::assign(std::string const&)+0x2e) [0x55d2bcd2870e]

4: (rgw_bucket::operator=(rgw_bucket const&)+0x11) [0x55d2bce3e551]

5: (RGWObjManifest::obj_iterator::update_location()+0x184) [0x55d2bced7114]

6: (RGWObjManifest::obj_iterator::operator++()+0x263) [0x55d2bd092793]

7: (RGWRados::update_gc_chain(rgw_obj&, RGWObjManifest&,
cls_rgw_obj_chain*)+0x51a) [0x55d2bd0939ea]

8: (RGWRados::Object::complete_atomic_modification()+0x83) [0x55d2bd093c63]

9: (RGWRados::Object::Delete::delete_obj()+0x74d) [0x55d2bd0a87ad]

10: (RGWDeleteObj::execute()+0x915) [0x55d2bd04b6d5]

11: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*,
req_state*, bool)+0x915) [0x55d2bcdfbb35]

12: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&,
rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*,
optional_yield, rgw::dmclock::Scheduler*, int*)+0x1cd8) [0x55d2bcdfdea8]

13: (RGWCivetWebFrontend::process(mg_connection*)+0x38e) [0x55d2bcd41a1e]

14: (()+0x36bace) [0x55d2bcdccace]

15: (()+0x36d76f) [0x55d2bcdce76f]

16: (()+0x36dc18) [0x55d2bcdcec18]

17: (()+0x7dd5) [0x7f161d338dd5]

18: (clone()+0x6d) [0x7f161c84302d]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.



My guess is that we need to add more resources to the gateways?  They have 2
CPUs and 12 GB of memory, running as virtual machines on CentOS 7.6.  Any
thoughts?



-Brent

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-23 Thread Van Alstyne, Kenneth
Jason/Wido, et al:
 I was hitting this exact problem when attempting to update from 14.2.11 to 
14.2.12.  I reverted the two commits associated with that pull request and was 
able to successfully upgrade to 14.2.12.  Everything seems normal, now.
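
For reference, the revert itself is roughly (the two commit hashes are
placeholders and have to be looked up from that PR):

git clone https://github.com/ceph/ceph.git && cd ceph
git checkout v14.2.12
git revert COMMIT1 COMMIT2
# then rebuild the packages for your distro as usual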


Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 804.240.2327
14291 Park Meadow Drive, Chantilly, VA 20151
perspecta


From: Jason Dillaman 
Sent: Thursday, October 22, 2020 12:54 PM
To: Wido den Hollander 
Cc: ceph-users@ceph.io 
Subject: [EXTERNAL] [ceph-users] Re: 14.2.12 breaks mon_host pointing to Round 
Robin DNS entry

This backport [1] looks suspicious as it was introduced in v14.2.12
and directly changes the initial MonMap code. If you revert it in a
dev build does it solve your problem?

[1] https://github.com/ceph/ceph/pull/36704

On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander  wrote:
>
> Hi,
>
> I already submitted a ticket: https://tracker.ceph.com/issues/47951
>
> Maybe other people noticed this as well.
>
> Situation:
> - Cluster is running IPv6
> - mon_host is set to a DNS entry
> - DNS entry is a Round Robin with three AAAA-records
>
> root@wido-standard-benchmark:~# ceph -s
> unable to parse addrs in 'mon.objects.xx.xxx.net'
> [errno 22] error connecting to the cluster
> root@wido-standard-benchmark:~#
>
> The relevant part of the ceph.conf:
>
> [global]
> auth_client_required = cephx
> auth_cluster_required = cephx
> auth_service_required = cephx
> mon_host = mon.objects.xxx.xxx.xxx
> ms_bind_ipv6 = true
>
> This works fine with 14.2.11 and breaks under 14.2.12
>
> Anybody else seeing this as well?
>
> Wido
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


--
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD Failures after pg_num increase on one of the pools

2020-10-23 Thread Eugen Block

Hi,

do you see any peaks on the OSD nodes like OOM killer etc.?
Instead of norecover flag I would try the nodown and noout flags to  
prevent flapping OSDs. What was the previous pg_num before you  
increased to 512?
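
For example:

ceph osd set nodown
ceph osd set noout
# ... and once recovery has settled:
ceph osd unset nodown
ceph osd unset noout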


Regards,
Eugen


Zitat von Артём Григорьев :


Hello everyone,

I created a new Ceph 14.2.7 Nautilus cluster recently. The cluster consists of
3 racks with 2 OSD nodes in each rack and 12 new HDDs in each node. The HDD
model is TOSHIBA MG07ACA14TE 14 TB. All data pools are EC pools.
Yesterday I decided to increase the pg number on one of the pools with the
command "ceph osd pool set photo.buckets.data pg_num 512"; after that many OSDs
started to crash with "out" and "down" status. I tried to increase recovery_sleep
to 1s but the OSDs still crashed. The OSDs started working properly only when I
set the "norecover" flag, but OSD scrub errors appeared after that.

In the logs from the OSDs during the crashes I found this:

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN

E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)'

thread 7f8af535d700 time 2020-10-21 15:12:11.460092

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN

E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
648: FAILED ceph_assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset( aft

er_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x55fc694d6c0f]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x47)
[0x55fc694d6dd7]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3:
(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)+0x1740) [0x55fc698cafa0]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4:
(ECBackend::handle_recovery_read_complete(hobject_t const&,
boost::tuples::tuple,
std::allocator >

, boost::tuples::null_type, boost::tuples::null_type,

boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type>&, boost::optional,
std::allocator >

>, RecoveryMessages*)+0x734) [0x55fc698cb804]


Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5:
(OnRecoveryReadComplete::finish(std::pair&)+0x94) [0x55fc698ebbe4]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6:
(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
[0x55fc698bfdcc]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7:
(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8:
(ECBackend::_handle_message(boost::intrusive_ptr)+0x17f)
[0x55fc698d718f]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9:
(PGBackend::handle_message(boost::intrusive_ptr)+0x4a)
[0x55fc697c18ea]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10:
(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11:
(OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr,
ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*,
OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62)
[0x55fc698415c2]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13:
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f)
[0x55fc695cebbf]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
[0x55fc69b6f976]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65)
[0x7f8b1ddede65]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d)
[0x7f8b1ccb188d]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught signal (Aborted) **

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: in thread 7f8af535d700
thread_name:tp_osd_tp
---

Current EC profile and pool info below:

# ceph osd erasure-code-profile get EC42

crush-device-class=hdd

crush-failure-domain=host

crush-root=main

jerasure-per-chunk-alignment=false

k=4

m=2

plugin=jerasure

technique=reed_sol_van

w=8


pool 25 'photo.buckets.data' erasure size 6 min_size 4 crush_rule 6
object_hash rjenkins pg_num 512 pgp_num 280 pgp_num_target 512
autoscale_mode warn last_change 43418 lfor 0/0/42223 flags hashpspool
stripe_width 1048576 application 

[ceph-users] Re: Ceph Octopus and Snapshot Schedules

2020-10-23 Thread Adam Boyhan
Care to provide any more detail? 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Burkhard Linke

Hi,


your mail is formatted in a way that makes it impossible to get all 
information, so a number of questions first:



- Are the mons just up, or are they up and in a quorum? You cannot change 
mon IP addresses without also adjusting them in the mon map. Use the 
daemon socket on the systems to query the current state of the mons.


- The OSD systemd output is useless for debugging. It only states that 
the OSD is not running and was not able to start.



The real log files are located in /var/log/ceph/. If the mons are in 
quorum, you should find more information there. Keep in mind that you 
also need to change ceph.conf on the OSD hosts if you change the mon IP 
addresses, otherwise the OSDs won't be able to find the mons and the 
processes will die.


And I do not understand how corosync should affect your ceph cluster. 
Ceph does not use corosync...



If you need fast help I can recommend the ceph irc channel ;-)


Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-23 Thread Frank Schilder
Hi Michael.

> I still don't see any traffic to the pool, though I'm also unsure how much 
> traffic is to be expected.

Probably not much. If ceph df shows that the pool contains some objects, I 
guess that's sorted.

That osdmaptool crashes indicates that your cluster runs with corrupted 
internal data. I tested your crush map and you should get complete PGs for the 
fs data pool. That you don't and that osdmaptool crashes points at a corruption 
of internal data. I'm afraid this is the point where you need support from ceph 
developers and should file a tracker report 
(https://tracker.ceph.com/projects/ceph/issues). A short description of the 
origin of the situation with the osdmaptool output and a reference to this 
thread linked in should be sufficient. Please post a link to the ticket here.
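
For the report, the map and the crashing command can be captured with something 
like this (the pool id is a placeholder for your fs data pool):

ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pgs --pool 20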

In parallel, you should probably open a new thread focussed on the osd map 
corruption. Maybe there are low-level commands to repair it.

You should wait with trying to clean up the unfound objects until this is 
resolved. Not sure about adding further storage either. To me, this sounds 
quite serious.

Best regards and good luck!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph octopus centos7, containers, cephadm

2020-10-23 Thread Marc Roos


yes that was it. I see so many messages here about these, I was 
wondering if it was a default.
 

-Original Message-
Cc: ceph-users
Subject: Re: [ceph-users] Re: ceph octopus centos7, containers, cephadm

I'm not sure I understood the question.

If you're asking if you can run octopus via RPMs on el7 without the 
cephadm and containers orchestration, then the answer is yes.

-- dan

On Fri, Oct 23, 2020 at 9:47 AM Marc Roos  
wrote:
>
>
> No clarity on this?
>
> -Original Message-
> To: ceph-users
> Subject: [ceph-users] ceph octopus centos7, containers, cephadm
>
>
> I am running Nautilus on centos7. Does octopus run similar as nautilus
> thus:
>
> - runs on el7/centos7
> - runs without containers by default
> - runs without cephadm by default
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Burkhard Linke

Hi,

On 10/23/20 2:22 PM, Gerhard W. Recher wrote:

This is a Proxmox cluster ...
sorry for the formatting problems of my post :(

Short plot: we messed up an IP address change of the public network, so the
monitors went down.



*snipsnap*


So how to recover from this disaster?

# ceph -s
  cluster:
    id:     92d063d7-647c-44b8-95d7-86057ee0ab22
    health: HEALTH_WARN
            1 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
    mgr: pve01(active, since 19h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:


Are you sure that the existing mons have been restarted? If the mon 
database is still present, the status output should contain at least the 
pool and osd information. But those numbers are zero...



Please check the local osd logs for the actual reason of the failed restart.
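
For example:

tail -n 200 /var/log/ceph/ceph-osd.0.log
journalctl -u ceph-osd@0 -n 200 --no-pager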


Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Octopus

2020-10-23 Thread Amudhan P
Hi Eugen,

I did the same steps as specified but the OSDs' cluster address is not updated.


On Tue, Oct 20, 2020 at 2:52 PM Eugen Block  wrote:

> > I wonder if this would be impactful, even if  `nodown` were set.
> > When a given OSD latches onto
> > the new replication network, I would expect it to want to use it for
> > heartbeats — but when
> > its heartbeat peers aren’t using the replication network yet, they
> > won’t be reachable.
>
> I also expected at least some sort of impact, I just tested it in a
> virtual lab environment. But besides the temporary "down" OSDs during
> container restart the cluster was always responsive (although there's
> no client traffic). I didn't even set "nodown". But all OSDs now have
> a new backend address and the cluster seems to be happy.
>
> Regards,
> Eugen
>
>
> Zitat von Anthony D'Atri :
>
> > I wonder if this would be impactful, even if  `nodown` were set.
> > When a given OSD latches onto
> > the new replication network, I would expect it to want to use it for
> > heartbeats — but when
> > its heartbeat peers aren’t using the replication network yet, they
> > won’t be reachable.
> >
> > Unless something has changed since I tried this with Luminous.
> >
> >> On Oct 20, 2020, at 12:47 AM, Eugen Block  wrote:
> >>
> >> Hi,
> >>
> >> a quick search [1] shows this:
> >>
> >> ---snip---
> >> # set new config
> >> ceph config set global cluster_network 192.168.1.0/24
> >>
> >> # let orchestrator reconfigure the daemons
> >> ceph orch daemon reconfig mon.host1
> >> ceph orch daemon reconfig mon.host2
> >> ceph orch daemon reconfig mon.host3
> >> ceph orch daemon reconfig osd.1
> >> ceph orch daemon reconfig osd.2
> >> ceph orch daemon reconfig osd.3
> >> ---snip---
> >>
> >> I haven't tried it myself though.
> >>
> >> Regards,
> >> Eugen
> >>
> >> [1]
> >>
> https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm
> >>
> >>
> >> Zitat von Amudhan P :
> >>
> >>> Hi,
> >>>
> >>> I have installed Ceph Octopus cluster using cephadm with a single
> network
> >>> now I want to add a second network and configure it as a cluster
> address.
> >>>
> >>> How do I configure ceph to use second Network as cluster network?.
> >>>
> >>> Amudhan
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large map object found

2020-10-23 Thread Peter Eisch
Perfect -- many thanks Dominic!

I haven't found a doc which notes the --num-shards needs to be a power of two.  
It isn't that I don't believe you -- I just haven't seen that anywhere.

peter


Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
On 10/22/20, 10:24 AM, "dhils...@performair.com"  
wrote:

Peter;

I believe shard counts should be powers of two.

Also, resharding makes the buckets unavailable, but occurs very quickly.  
As such it is not done in the background, but in the foreground, for a manual 
reshard.

Notice the statement: "reshard of bucket   from 
 to  completed successfully."  It's done.

The warning notice won't go away until a scrub is completed to determine 
that a large OMAP object no longer exists.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com

www.performair.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Thursday, October 22, 2020 8:04 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Thank you! This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51
new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1
total entries: 1000 2000 3000 3228
2020-10-22 08:40:26.353 7fb197fc66c0 1 execute INFO: reshard of bucket 
"backups_sql_dswhseloadrepl_segments" from 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51"
 to 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1"
 completed successfully

[root@cephmon-s03 ~]# radosgw-admin buckets reshard list
[]
[root@cephmon-s03 ~]# radosgw-admin buckets reshard status 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments
[
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
}
]
[root@cephmon-s03 ~]#

This kicked off an autoscale event. Would the reshard presumably start after 
the autoscaling is complete?

peter




On 10/21/20, 3:19 P

[ceph-users] Re: Ceph Octopus

2020-10-23 Thread Eugen Block

Did you restart the OSD containers? Does ceph config show your changes?

ceph config get mon cluster_network
ceph config get mon public_network



Zitat von Amudhan P :


Hi Eugen,

I did the same step specified but OSD is not updated cluster address.


On Tue, Oct 20, 2020 at 2:52 PM Eugen Block  wrote:


> I wonder if this would be impactful, even if  `nodown` were set.
> When a given OSD latches onto
> the new replication network, I would expect it to want to use it for
> heartbeats — but when
> its heartbeat peers aren’t using the replication network yet, they
> won’t be reachable.

I also expected at least some sort of impact, I just tested it in a
virtual lab environment. But besides the temporary "down" OSDs during
container restart the cluster was always responsive (although there's
no client traffic). I didn't even set "nodown". But all OSDs now have
a new backend address and the cluster seems to be happy.

Regards,
Eugen


Zitat von Anthony D'Atri :

> I wonder if this would be impactful, even if  `nodown` were set.
> When a given OSD latches onto
> the new replication network, I would expect it to want to use it for
> heartbeats — but when
> its heartbeat peers aren’t using the replication network yet, they
> won’t be reachable.
>
> Unless something has changed since I tried this with Luminous.
>
>> On Oct 20, 2020, at 12:47 AM, Eugen Block  wrote:
>>
>> Hi,
>>
>> a quick search [1] shows this:
>>
>> ---snip---
>> # set new config
>> ceph config set global cluster_network 192.168.1.0/24
>>
>> # let orchestrator reconfigure the daemons
>> ceph orch daemon reconfig mon.host1
>> ceph orch daemon reconfig mon.host2
>> ceph orch daemon reconfig mon.host3
>> ceph orch daemon reconfig osd.1
>> ceph orch daemon reconfig osd.2
>> ceph orch daemon reconfig osd.3
>> ---snip---
>>
>> I haven't tried it myself though.
>>
>> Regards,
>> Eugen
>>
>> [1]
>>
https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm
>>
>>
>> Zitat von Amudhan P :
>>
>>> Hi,
>>>
>>> I have installed Ceph Octopus cluster using cephadm with a single
network
>>> now I want to add a second network and configure it as a cluster
address.
>>>
>>> How do I configure ceph to use second Network as cluster network?.
>>>
>>> Amudhan
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware needs for MDS for HPC/OpenStack workloads?

2020-10-23 Thread Nathan Fish
Regarding MDS pinning, we have our home directories split into u{0..9}
for legacy reasons, and while adding more MDS' helped a little,
pinning certain u? to certain MDS' helped greatly. The automatic
migration between MDS' killed performance. This is an unusually
perfect workload for pinning, as we have 10 practically identical
directories, but still.
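
As a sketch, that kind of pinning is just an extended attribute per directory
(mount point and rank assignment here are illustrative, not our exact mapping):

for i in 0 1 2 3 4; do setfattr -n ceph.dir.pin -v 0 /cephfs/home/u$i; done
for i in 5 6 7 8 9; do setfattr -n ceph.dir.pin -v 1 /cephfs/home/u$i; done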

On Fri, Oct 23, 2020 at 2:04 AM Stefan Kooman  wrote:
>
> On 2020-10-22 14:34, Matthew Vernon wrote:
> > Hi,
> >
> > We're considering the merits of enabling CephFS for our main Ceph
> > cluster (which provides object storage for OpenStack), and one of the
> > obvious questions is what sort of hardware we would need for the MDSs
> > (and how many!).
>
> Is it a many parallel large writes workload without a lot fs
> manipulation (file creation / deletion, attribute updates? You might
> only need 2 for HA (active-standby). But when used as a regular fs with
> many clients and a lot of small IO, than you might run out of the
> performance of a single MDS. Add (many) more as you see fit. Keep in
> mind it does make things a bit more complex (different ranks when more
> than one active MDS) and that when you need to upgrade you have to
> downscale that to 1. You can pin directories to a single MDS if you know
> your workload well enough.
>
> >
> > These would be for our users scientific workloads, so they would need to
> > provide reasonably high performance. For reference, we have 3060 6TB
> > OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.
>
> It really depend on the workload. If there are a lot of file / directory
> operations the MDS needs to keep track of all that and needs to be able
> to cache as well (inodes / dnodes). The more files/dirs, the more RAM
> you need. We don't have PB of storage (but 39 TB for CephFS) but have
> MDSes with 256 GB RAM for cache for all the little files and many dirs
> we have. Prefer a few faster cores above many slower cores.
>
>
> >
> > The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> > similar networking to the OSD nodes), but I'm not sure how much going
> > beyond that is likely to be useful in production.
>
> MDSes don't do a lot of traffic. Clients write directly to OSDs after
> they have acquired capabilities (CAPS) from MDS.
>
> >
> > I've also seen it suggested that an SSD-only pool is sensible for the
> > CephFS metadata pool; how big is that likely to get?
>
> Yes, but CephFS, like RGW (index), stores a lot of data in OMAP and the
> RocksDB databases tend to get quite large. Especially when storing many
> small files and lots of dirs. So if that happens to be the workload,
> make sure you have plenty of them. We once put all cephfs_metadata on 30
> NVMe ... and that was not a good thing. Spread that data out over as
> many SSD / NVMe as you can. Do your HDDs have their WAL / DB on flash?
> Cephfs_metadaa does not take up a lot of space, but Mimic does not have
> as good administration on all space occupied as newer releases. But I
> guess it's in the order of 5% of CephFS size. But again, this might be
> wildly different on other deployments.
>
> >
> > I'd be grateful for any pointers :)
>
> I would buy a CPU with high clock speed and ~ 4 -8 cores. RAM as needed,
> but 32 GB will be minimum I guess.
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] TOO_FEW_PGS warning and pg_autoscale

2020-10-23 Thread Peter Eisch
Hi,

# ceph health detail
HEALTH_WARN too few PGs per OSD (24 < min 30)
TOO_FEW_PGS too few PGs per OSD (24 < min 30)

ceph version 14.2.9

This warning popped up when the autoscaler shrank a pool's pg_num and pgp_num 
from 512 to 256 on its own.  The hdd35 storage is only used by this pool.

I have three different storage classes and the pools use the different classes 
as appropriate.  How can I convert the warning into something useful which then 
helps me make the appropriate change to the right class of storage?  I'm 
guessing it's referring to hdd35.

RAW STORAGE:
CLASS SIZEAVAIL   USEDRAW USED %RAW USED
hdd25 129 TiB  83 TiB  46 TiB   46 TiB 35.87
hdd35 269 TiB 220 TiB  49 TiB   49 TiB 18.12
ssd   256 TiB 164 TiB  92 TiB   92 TiB 35.84
TOTAL 655 TiB 468 TiB 186 TiB  187 TiB 28.56

If I follow: 
https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-few-pgs
Which then links to: 
https://docs.ceph.com/en/latest/rados/operations/placement-groups/#choosing-number-of-placement-groups

The math for this would want the pool to have pg/p_num of 2048 -- where 
autoscale just recently shrunk the count.  Which is more right?
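
For reference, the calculation from that second doc is roughly

    target PGs for the pool ≈ (OSDs in the device class x 100) / replica size,
    rounded to a power of two

so, hypothetically, 60 hdd35 OSDs with 3x replication would give
60 x 100 / 3 = 2000 -> 2048.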

Thanks!

peter


Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
Australia | Bosnia and Herzegovina | Brazil | Canada | Singapore | Switzerland 
| United Kingdom | USA
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Octopus

2020-10-23 Thread Amudhan P
Hi Eugen,

ceph config output shows the network address as set.

I had not restarted the containers directly; I was trying the command `ceph
orch restart osd.46`, and I think that was the problem. Now, after running `ceph
orch daemon restart osd.46`, it's showing the changes in the dashboard.

Thanks.


On Fri, Oct 23, 2020 at 6:14 PM Eugen Block  wrote:

> Did you restart the OSD containers? Does ceph config show your changes?
>
> ceph config get mon cluster_network
> ceph config get mon public_network
>
>
>
> Zitat von Amudhan P :
>
> > Hi Eugen,
> >
> > I did the same step specified but OSD is not updated cluster address.
> >
> >
> > On Tue, Oct 20, 2020 at 2:52 PM Eugen Block  wrote:
> >
> >> > I wonder if this would be impactful, even if  `nodown` were set.
> >> > When a given OSD latches onto
> >> > the new replication network, I would expect it to want to use it for
> >> > heartbeats — but when
> >> > its heartbeat peers aren’t using the replication network yet, they
> >> > won’t be reachable.
> >>
> >> I also expected at least some sort of impact, I just tested it in a
> >> virtual lab environment. But besides the temporary "down" OSDs during
> >> container restart the cluster was always responsive (although there's
> >> no client traffic). I didn't even set "nodown". But all OSDs now have
> >> a new backend address and the cluster seems to be happy.
> >>
> >> Regards,
> >> Eugen
> >>
> >>
> >> Zitat von Anthony D'Atri :
> >>
> >> > I wonder if this would be impactful, even if  `nodown` were set.
> >> > When a given OSD latches onto
> >> > the new replication network, I would expect it to want to use it for
> >> > heartbeats — but when
> >> > its heartbeat peers aren’t using the replication network yet, they
> >> > won’t be reachable.
> >> >
> >> > Unless something has changed since I tried this with Luminous.
> >> >
> >> >> On Oct 20, 2020, at 12:47 AM, Eugen Block  wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> a quick search [1] shows this:
> >> >>
> >> >> ---snip---
> >> >> # set new config
> >> >> ceph config set global cluster_network 192.168.1.0/24
> >> >>
> >> >> # let orchestrator reconfigure the daemons
> >> >> ceph orch daemon reconfig mon.host1
> >> >> ceph orch daemon reconfig mon.host2
> >> >> ceph orch daemon reconfig mon.host3
> >> >> ceph orch daemon reconfig osd.1
> >> >> ceph orch daemon reconfig osd.2
> >> >> ceph orch daemon reconfig osd.3
> >> >> ---snip---
> >> >>
> >> >> I haven't tried it myself though.
> >> >>
> >> >> Regards,
> >> >> Eugen
> >> >>
> >> >> [1]
> >> >>
> >>
> https://stackoverflow.com/questions/61763230/configure-a-cluster-network-with-cephadm
> >> >>
> >> >>
> >> >> Zitat von Amudhan P :
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> I have installed Ceph Octopus cluster using cephadm with a single
> >> network
> >> >>> now I want to add a second network and configure it as a cluster
> >> address.
> >> >>>
> >> >>> How do I configure ceph to use second Network as cluster network?.
> >> >>>
> >> >>> Amudhan
> >> >>> ___
> >> >>> ceph-users mailing list -- ceph-users@ceph.io
> >> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >> >>
> >> >>
> >> >> ___
> >> >> ceph-users mailing list -- ceph-users@ceph.io
> >> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware for new OSD nodes.

2020-10-23 Thread Eneko Lacunza

Hi Anthony,

El 22/10/20 a las 18:34, Anthony D'Atri escribió:



Yeah, didn't think about a RAID10 really, although there wouldn't be enough 
space for 8x300GB = 2400GB WAL/DBs.

300 is overkill for many applications anyway.
Yes, but he has spillover with 1600GB/12 WAL/DB. Seems he can make use 
of those 300GB.



Also, using a RAID10 for WAL/DBs will:
 - make OSDs less movable between hosts (they'd have to be moved all 
together - with 2 OSDs per NVMe you can move them around in pairs)

Why would you want to move them between hosts?


I think the usual case is a server failure, so that won't be a problem. 
With small clusters (like ours) you may want to reorganize OSDs to a new 
server (let's say, move one OSD of each server to the new server). But 
this is an uncommon corner-case, I agree :)


Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware for new OSD nodes.

2020-10-23 Thread Eneko Lacunza

Hi Brian,

El 22/10/20 a las 18:41, Brian Topping escribió:



On Oct 22, 2020, at 10:34 AM, Anthony D'Atri  wrote:


- You must really be sure your raid card is dependable. (sorry but I have 
seen so much management problems with top-tier RAID cards I avoid them like the 
plague).

This.

I’d definitely avoid a RAID card. If I can do advanced encryption with an MMX 
instruction, I think I can certainly trust IOMMU to handle device multiplexing 
from software in an efficient manner, no? mdadm RAID is just fine for me and is 
reliably bootable from GRUB.

I’m not an expert in driver mechanics, but mirroring should be very low 
overhead at the software level.

Once it’s software RAID, moving disks between chassis is a simple process as 
well.

Apologies I didn’t make that clear earlier...
Yes, I really like mdraid :) . Problem is BIOS/UEFI has to find a 
working bootable disk. I think some BIOS/UEFIs have settings for a 
secondary boot/UEFI bootfile, but that would have to be prepared and 
maintained manually, out of the mdraid10; and would only work with a 
total failure of the primary disk.


Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware for new OSD nodes.

2020-10-23 Thread Eneko Lacunza

Hi Dave,

El 22/10/20 a las 19:43, Dave Hall escribió:



El 22/10/20 a las 16:48, Dave Hall escribió:


(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a 
couple questions about hardware choices.  Our original nodes were 8 
x 12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.


We chose the NVMe card for performance since it has an 8 lane PCIe 
interface.  However, we're currently BlueFS spillovers.


The Tyan chassis we are considering has the option of 4 x U.2 NVMe 
bays - each with 4 PCIe lanes, (and 8 SAS bays).   It has occurred 
to me that I might stripe 4 1TB NVMe drives together to get much 
more space for WAL/DB and a net performance of 16 PCIe lanes.


Any thoughts on this approach?
Don't stripe them, if one NVMe fails you'll lose all OSDs. Just use 1 
NVMe drive for 2  SAS drives  and provision 300GB for WAL/DB for each 
OSD (see related threads on this mailing list about why that exact 
size).


This way if a NVMe fails, you'll only lose 2 OSD.
I was under the impression that everything that BlueStore puts on the 
SSD/NVMe could be reconstructed from information on the OSD. Am I 
mistaken about this?  If so, my single 1.6TB NVMe card is equally 
vulnerable.


I don't think so, that info only exists on that partition as was the 
case with filestore journal. Your single 1.6TB NVMe is vulnerable, yes.




Also, what size of WAL/DB partitions do you have now, and what 
spillover size?


I recently posted another question to the list on this topic, since I 
now have spillover on 7 of 24 OSDs.  Since the data layout on the NVMe 
for BlueStore is not traditional, I've never quite figured out how to 
get this information.  The current partition size is 1.6TB / 12 since 
we had the possibility to add four more drives to each node.  How that 
was divided between WAL, DB, etc. is something I'd like to be able to 
understand.  However, we're not going to add the extra 4 drives, so 
expanding the LVM partitions is now a possibility.
Can you paste the warning message? It shows the spillover size. What 
size are the partitions on the NVMe disk (lsblk)?
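
If it helps, the per-OSD numbers also show up in the bluefs perf counters, e.g.
(osd.0 is just an example, run on the OSD host):

ceph daemon osd.0 perf dump bluefs | grep -E '(db|slow)_(total|used)_bytes'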



Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD down, how to reconstruct it from its main and block.db parts ?

2020-10-23 Thread Wladimir Mutel

Dear all,

After breaking my experimental 1-host Ceph cluster and making one of its PGs
'incomplete', I left it in an abandoned state for some time.
Now I decided to bring it back to life and found that it cannot start one of
its OSDs (osd.1, to name it).

"ceph osd df" shows :

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0  hdd          0       1.0  2.7 TiB  1.6 TiB  1.6 TiB  113 MiB  4.7 GiB  1.1 TiB  59.77  0.69  102  up
 1  hdd    2.84549         0      0 B      0 B      0 B      0 B      0 B      0 B      0     0      0  down
 2  hdd    2.84549       1.0  2.8 TiB  2.6 TiB  2.5 TiB   57 MiB  3.8 GiB  275 GiB  90.58  1.05  176  up
 3  hdd    2.84549       1.0  2.8 TiB  2.6 TiB  2.5 TiB   57 MiB  3.9 GiB  271 GiB  90.69  1.05  185  up
 4  hdd    2.84549       1.0  2.8 TiB  2.6 TiB  2.5 TiB   63 MiB  4.2 GiB  263 GiB  90.98  1.05  184  up
 5  hdd    2.84549       1.0  2.8 TiB  2.6 TiB  2.5 TiB   52 MiB  3.8 GiB  263 GiB  90.96  1.05  178  up
 6  hdd    2.53400       1.0  2.5 TiB  2.3 TiB  2.3 TiB  173 MiB  5.2 GiB  228 GiB  91.21  1.05  178  up
 7  hdd    2.53400       1.0  2.5 TiB  2.3 TiB  2.3 TiB  147 MiB  5.2 GiB  230 GiB  91.12  1.05  168  up
                  TOTAL            19 TiB   17 TiB   16 TiB  662 MiB   31 GiB  2.6 TiB  86.48
MIN/MAX VAR: 0.69/1.05  STDDEV: 10.90

"ceph device ls" shows :

DEVICE                                      HOST:DEV      DAEMONS                        LIFE EXPECTANCY
GIGABYTE_GP-ASACNE2100TTTDR_SN191108950380  p10s:nvme0n1  osd.1 osd.2 osd.3 osd.4 osd.5
WDC_WD30EFRX-68N32N0_WD-WCC7K1JJXVSTp10s:sdd  osd.1
WDC_WD30EFRX-68N32N0_WD-WCC7K1VUYPRAp10s:sda  osd.6
WDC_WD30EFRX-68N32N0_WD-WCC7K2CKX8NTp10s:sdb  osd.7
WDC_WD30EFRX-68N32N0_WD-WCC7K2UD8H74p10s:sde  osd.2
WDC_WD30EFRX-68N32N0_WD-WCC7K2VFTR1Fp10s:sdh  osd.5
WDC_WD30EFRX-68N32N0_WD-WCC7K3CYKL87p10s:sdf  osd.3
WDC_WD30EFRX-68N32N0_WD-WCC7K6FPZAJPp10s:sdc  osd.0
WDC_WD30EFRX-68N32N0_WD-WCC7K7FXSCRNp10s:sdg  osd.4

In my last migration, I created a bluestore volume with external block.db like 
this :

"ceph-volume lvm prepare --bluestore --data /dev/sdd1 --block.db /dev/nvme0n1p4"

And I can see this metadata by

"ceph-bluestore-tool show-label --dev 
/dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202"
 :

{

"/dev/ceph-e53b65ba-5eb0-44f5-9160-a2328f787a0f/osd-block-8c6324a3-0364-4fad-9dcb-81a1661ee202":
 {
"osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202",
"size": 3000588304384,
"btime": "2020-07-12T11:34:16.579735+0300",
"description": "main",
"bfm_blocks": "45785344",
"bfm_blocks_per_key": "128",
"bfm_bytes_per_block": "65536",
"bfm_size": "3000588304384",
"bluefs": "1",
"ceph_fsid": "49cdfe90-6f6e-4afe-8558-bf14a13aadfa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "AQD9ygpf+7+MABAAqtj4y1YYgxwCaAN/jgDSwg==",
"ready": "ready",
"require_osd_release": "14",
"whoami": "1"
}
}

and by

"ceph-bluestore-tool show-label --dev /dev/nvme0n1p4" :

{
"/dev/nvme0n1p4": {
"osd_uuid": "8c6324a3-0364-4fad-9dcb-81a1661ee202",
"size": 128025886720,
"btime": "2020-07-12T11:34:16.592054+0300",
"description": "bluefs db"
}
}

As you see, their osd_uuid is equal.
But when I try to start it by hand : "systemctl restart ceph-osd@1" ,
I get this in the logs : ("journalctl -b -u ceph-osd@1")

-- Logs begin at Tue 2020-10-13 19:09:49 EEST, end at Fri 2020-10-23 16:59:38 
EEST. --
жов 23 16:59:36 p10s systemd[1]: Starting Ceph object storage daemon osd.1...
жов 23 16:59:36 p10s systemd[1]: Started Ceph object storage daemon osd.1.
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No 
such file or directory
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No 
such file or directory
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at 
/var/lib/ceph/osd/ceph-1/keyring, disabling cephx
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.943+0300 7f513cebedc0 -1 AuthRegistry(0x560776222940) no keyring found at 
/var/lib/ceph/osd/ceph-1/keyring, disabling cephx
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No 
such file or directory
жов 23 16:59:36 p10s ceph-osd[3987]: 2020-10-23T16:59:36.947+0300 7f513cebedc0 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-1/keyring: (2) No 
such file o
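
A possible next step (a sketch, not verified here) is to let ceph-volume rebuild
the OSD's tmpfs directory, keyring included, from the LV tags:

ceph-volume lvm activate 1 8c6324a3-0364-4fad-9dcb-81a1661ee202
# or simply: ceph-volume lvm activate --all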

[ceph-users] Re: Strange USED size

2020-10-23 Thread Anthony D'Atri
10B as in ten bytes? 

By chance have you run `rados bench` ?  Sometimes a run is interrupted or one 
forgets to clean up and there are a bunch of orphaned RADOS objects taking up 
space, though I’d think `ceph df` would reflect that.  Is your buckets.data 
pool replicated or EC?
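
One more data point on the math, assuming the data pool is 3x replicated on HDD
OSDs with Nautilus' default bluestore_min_alloc_size_hdd of 64 KiB: each tiny
object still consumes a full allocation unit per replica, so roughly

    100,000 objects x 64 KiB x 3 replicas ≈ 18.3 GiB

which is about what the USED column shows, while STORED (969 KiB) is the logical
amount of data.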

> On Oct 22, 2020, at 7:35 AM, Marcelo  wrote:
> 
> Hello. I've searched a lot but couldn't find why the size of USED column in
> the output of ceph df is a lot times bigger than the actual size. I'm using
> Nautilus (14.2.8), and I've 1000 buckets with 100 objectsineach bucket.
> Each object is around 10B.
> 
> ceph df
> RAW STORAGE:
>CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>hdd   511 GiB 147 GiB 340 GiB  364 GiB 71.21
>TOTAL 511 GiB 147 GiB 340 GiB  364 GiB 71.21
> 
> POOLS:
>POOL   ID STORED  OBJECTS
> USED%USED MAX AVAIL
>.rgw.root   1 1.1 KiB   4 768
> KiB 036 GiB
>default.rgw.control11 0 B   8 0
> B 036 GiB
>default.rgw.meta   12 449 KiB   2.00k 376
> MiB  0.3436 GiB
>default.rgw.log13 3.4 KiB 207   6
> MiB 036 GiB
>default.rgw.buckets.index  14 0 B   1.00k 0
> B 036 GiB
>default.rgw.buckets.data   15 969 KiB100k  18
> GiB 14.5236 GiB
>default.rgw.buckets.non-ec 1627 B   1 192
> KiB 036 GiB
> 
> Does anyone know what are the maths behind this, to show 18GiB used when I
> have something like 1 MiB?
> 
> Thanks in advance, Marcelo.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Gerhard W. Recher
Hi, I have a worst case:

OSDs in a 3-node cluster (4 NVMes each) won't start.

We had an IP config change in the public network and the mons died, so we
managed to bring the mons back with new IPs.
Corosync on 2 rings is fine,
all 3 mons are up,
but the OSDs won't start.


How do we get back to the pool? 3 VMs are already configured and valuable
data would be lost...

This is like a scenario where all system disks on all 3 nodes failed, but
the OSD disks are healthy...

Any help to reconstruct this storage is highly appreciated!

Gerhard

root@pve01:/var/log# systemctl status ceph-osd@0.service.service
● ceph-osd@0.service.service - Ceph object storage daemon osd.0.service
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
           └─ceph-after-pve-cluster.conf
   Active: failed (Result: exit-code) since Thu 2020-10-22 00:30:09 CEST; 37min ago
  Process: 31402 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0.service (code=exited, status=1/FAILURE)

Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Service RestartSec=100ms expired, scheduling restart.
Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Scheduled restart job, restart counter is at 3.
Oct 22 00:30:09 pve01 systemd[1]: Stopped Ceph object storage daemon osd.0.service.
Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Start request repeated too quickly.
Oct 22 00:30:09 pve01 systemd[1]: ceph-osd@0.service.service: Failed with result 'exit-code'.
Oct 22 00:30:09 pve01 systemd[1]: Failed to start Ceph object storage daemon osd.0.service.

# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 92d063d7-647c-44b8-95d7-86057ee0ab22
last_changed 2020-10-21 23:31:50.584796
created 2020-10-21 21:00:54.077449
min_mon_release 14 (nautilus)
0: [v2:10.100.200.141:3300/0,v1:10.100.200.141:6789/0] mon.pve01
1: [v2:10.100.200.142:3300/0,v1:10.100.200.142:6789/0] mon.pve02
2: [v2:10.100.200.143:3300/0,v1:10.100.200.143:6789/0] mon.pve03

Networks:

auto lo
iface lo inet loopback

auto eno1np0
iface eno1np0 inet static
    address 10.110.200.131/24
    mtu 9000
#corosync1 10GB

auto eno2np1
iface eno2np1 inet static
    address 10.111.200.131/24
    mtu 9000
#Corosync2 10GB

iface enp69s0f0 inet manual
    mtu 9000

auto enp69s0f1
iface enp69s0f1 inet static
    address 10.112.200.131/24
    mtu 9000
#Cluster private 100GB

auto vmbr0
iface vmbr0 inet static
    address 10.100.200.141/24
    gateway 10.100.200.1
    bridge-ports enp69s0f0
    bridge-stp off
    bridge-fd 0
    mtu 9000
#Cluster public 100GB

ceph.conf:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.112.200.0/24
fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
mon_allow_pool_delete = true
mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.100.200.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
public_addr = 10.100.200.141
[mon.pve02]
public_addr = 10.100.200.142
[mon.pve03]
public_addr = 10.100.200.143

# ceph -s
  cluster:
    id:     92d063d7-647c-44b8-95d7-86057ee0ab22
    health: HEALTH_WARN
            1 daemons have recently crashed
            OSD count 0 < osd_pool_default_size 3
  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 63m)
    mgr: pve01(active, since 64m)
    osd: 0 osds: 0 up, 0 in
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

# df -h
Filesystem        Size  Used  Avail  Use%  Mounted on
udev              252G     0   252G    0%  /dev
tmpfs              51G   11M    51G    1%  /run
rpool/ROOT/pve-1  229G   16G   214G    7%  /
tmpfs             252G   63M   252G    1%  /dev/shm
tmpfs             5.0M     0   5.0M    0%  /run/lock
tmpfs             252G     0   252G    0%  /sys/fs/cgroup
rpool             214G  128K   214G    1%  /rpool
rpool/data        214G  128K   214G    1%  /rpool/data
rpool/ROOT        214G  128K   214G    1%  /rpool/ROOT
tmpfs             252G   24K   252G    1%  /var/lib/ceph/osd/ceph-3
tmpfs             252G   24K   252G    1%  /var/lib/ceph/osd/ceph-2
tmpfs             252G   24K   252G    1%  /var/lib/ceph/osd/ceph-0
tmpfs             252G   24K   252G    1%  /var/lib/ceph/osd/ceph-1
/dev/fuse          30M   32K    30M    1%  /etc/pve
tmpfs              51G     0    51G    0%  /run/user/0

# lsblk
NAME         MAJ:MIN  RM    SIZE  RO  TYPE  MOUNTPOINT
nvme4n1      259:0     0  238.5G   0  disk
├─nvme4n1p1  259:5     0   1007K   0  part
├─nvme4n1p2  259:6     0    512M   0  part
└─nvme4n1p3  259:7     0    238G   0  part
nvme5n1      259:1     0  238.5G   0  disk
├─nvme5n1p1  259:2     0   1007K   0  part
├─nvme5n1p2  259:3     0    512M   0  part
└─nvme5n1p3  259:4     0    238G   0  part
nvme0n1      259:12    0    2.9T   0  disk
└─ceph--cc77fe1b--c8d4--48be--a7c4--36109439c85c-osd--block--80e0127e--836e--44b8--882d--ac49bfc85866  253:3  0  2.9T  0  lvm
nvme1n1      259:13    0    2.9T   0  disk
└─ceph--eb8b2fc7--775e--4b94--8070--784e7bbf861e-osd--block--4d433222--e1e8--43ac--8dc7--2e6e998ff122  253:2  0  2.9T  0  lvm
nvme3n1      259:14    0    2.9T   0  disk
└─ceph--5724bdf7--5124--4244--91d6--e254210c2174-osd--block--2d6fe149--f330--415a--a762--44d037c900b1  253:1  0  2.9T  0  lvm
nvme2n1      259:15    0    2.9T   0  disk
└─ceph--cb5762e9--40fa--4148--98f4--5b5ddef4c1de-osd-

[ceph-users] Re: Large map object found

2020-10-23 Thread DHilsbos
Peter;

As with many things in Ceph, I don’t believe it’s a hard and fast rule (i.e. a 
non-power of 2 will work).  I believe the issues are performance and balance. 
 I can't confirm that.  Perhaps someone else on the list will add their 
thoughts.

Has your large OMAP warning gone away?

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com] 
Sent: Friday, October 23, 2020 5:41 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Perfect -- many thanks Dominic!

I haven't found a doc which notes the --num-shards needs to be a power of two. 
It isn't that I don't believe you -- I just haven't seen that anywhere.

peter


Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com

On 10/22/20, 10:24 AM, "dhils...@performair.com"  
wrote:

Peter;

I believe shard counts should be powers of two.

Also, resharding makes the buckets unavailable, but occurs very quickly. As 
such it is not done in the background, but in the foreground, for a manual 
reshard.

Notice the statement: "reshard of bucket <bucket> from <old instance id> 
to <new instance id> completed successfully." It's done.

The warning notice won't go away until a scrub is completed to determine that a 
large OMAP object no longer exists.
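A rough sketch of how to move that along, assuming the offending object lives in
the RGW index pool; the PG id below is a placeholder, the real one is named in
the cluster log alongside the large omap warning:

# find which object/PG triggered the warning
grep -i 'large omap object' /var/log/ceph/ceph.log

# deep-scrub just that PG so the health check is re-evaluated
ceph pg deep-scrub 7.2a    # placeholder PG id - use the one from the log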

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com] 
Sent: Thursday, October 22, 2020 8:04 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Thank you! This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51
new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1
total entries: 1000 2000 3000 3228
2020-10-22 08:40:26.353 7fb197fc66c0 1 execute INFO: reshard of bucket 
"backups_sql_dswhseloadrepl_segments" from 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51"
 to 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1"
 completed successfully

[root@cephmon-s03 ~]# radosgw-admin buckets reshard list
[] 
[root@cephmon-s03 ~]# radosgw-admin buckets reshard status 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments
[
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
}
]
[root@cephmon-s03 ~]#

This kicked off an autoscale event. Would the reshard presumably start after the 
autoscaling is complete?
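One way to double-check that the reshard really took effect, independent of the
status JSON above (tenant/bucket names as in the commands above):

# shows num_shards, object counts and fill_status per bucket
radosgw-admin bucket limit check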

peter



Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-23 Thread Brian Topping
Yes the UEFI problem with mirrored mdraid boot is well-documented. I’ve 
generally been working with BIOS partition maps which do not have the single 
point of failure UEFI has (/boot can be mounted as mirrored, any of them can be 
used as non-RAID by GRUB). But BIOS maps have problems as well with volume 
size. 
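As an illustration only, a minimal sketch of that kind of mdadm mirror, assuming
two disks with matching layouts where partition 2 is /boot and partition 3 is the
root volume (device names are placeholders):

# metadata 1.0 keeps the superblock at the end, so GRUB can read a member like a plain partition
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda2 /dev/sdb2
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3

mkfs.ext4 /dev/md0    # /boot
mkfs.ext4 /dev/md1    # /
mdadm --detail --scan >> /etc/mdadm.conf    # path varies by distro (/etc/mdadm/mdadm.conf on Debian)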

That said, the disks are portable at that point and really don’t have deep 
performance bottlenecks because mirroring and striping is cheap. 

Sent from my iPhone

> On Oct 23, 2020, at 03:54, Eneko Lacunza  wrote:
> 
> Hi Brian,
> 
>> El 22/10/20 a las 18:41, Brian Topping escribió:
>> 
 On Oct 22, 2020, at 10:34 AM, Anthony D'Atri  
 wrote:
>>> 
- You must really be sure your raid card is dependable. (sorry but I 
 have seen so much management problems with top-tier RAID cards I avoid 
 them like the plague).
>>> This.
>> I’d definitely avoid a RAID card. If I can do advanced encryption with an 
>> MMX instruction, I think I can certainly trust IOMMU to handle device 
>> multiplexing from software in an efficient manner, no? mdadm RAID is just 
>> fine for me and is reliably bootable from GRUB.
>> 
>> I’m not an expert in driver mechanics, but mirroring should be very low 
>> overhead at the software level.
>> 
>> Once it’s software RAID, moving disks between chassis is a simple process as 
>> well.
>> 
>> Apologies I didn’t make that clear earlier...
> Yes, I really like mdraid :) . Problem is BIOS/UEFI has to find a working 
> bootable disk. I think some BIOS/UEFIs have settings for a secondary 
> boot/UEFI bootfile, but that would have to be prepared and maintained 
> manually, out of the mdraid10; and would only work with a total failure of 
> the primary disk.
> 
> Cheers
> 
> -- 
> Eneko Lacunza| +34 943 569 206
> | elacu...@binovo.es
> Zuzendari teknikoa   | https://www.binovo.es
> Director técnico | Astigarragako Bidea, 2 - 2º izda.
> BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-23 Thread David C
Success!

I remembered I had a server I'd taken out of the cluster to
investigate some issues, that had some good quality 800GB Intel DC
SSDs, dedicated an entire drive to swap, tuned up min_free_kbytes,
added an MDS to that server and let it run. Took 3 - 4 hours but
eventually came back online. It used the 128GB of RAM and about 250GB
of the swap.
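For anyone in the same spot, a rough sketch of that preparation, assuming a spare
SSD at /dev/sdX (a placeholder) and values tuned to the host:

# dedicate the whole SSD to swap
mkswap /dev/sdX
swapon /dev/sdX

# keep a generous free-memory reserve while the MDS balloons during replay
sysctl -w vm.min_free_kbytes=4194304    # 4 GiB, pick what suits the host

# optionally give the replaying MDS more slack before the mons mark it laggy
ceph config set global mds_beacon_grace 600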

Dan, thanks so much for steering me down this path, I would have more
than likely started hacking away at the journal otherwise!

Frank, thanks for pointing me towards that other thread, I used your
min_free_kbytes tip

I now need to consider updating - I wonder whether the risk-averse CephFS
operator would go for the latest Nautilus or the latest Octopus. It used
to be that the newer CephFS code was the most stable, but I don't know
if that's still the case.

Thanks, again
David

On Thu, Oct 22, 2020 at 7:06 PM Frank Schilder  wrote:
>
> The post was titled "mds behind on trimming - replay until memory exhausted".
>
> > Load up with swap and try the up:replay route.
> > Set the beacon to 10 until it finishes.
>
> Good point! The MDS will not send beacons for a long time. Same was necessary 
> in the other case.
>
> Good luck!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large map object found

2020-10-23 Thread Peter Eisch
Yes, the OMAP warning has cleared after running the deep-scrub, with all the 
swiftness.

Thanks again!



Peter Eisch
Senior Site Reliability Engineer
T1.612.445.5135
virginpulse.com
On 10/23/20, 10:48 AM, "dhils...@performair.com"  
wrote:

Peter;

As with many things in Ceph, I don’t believe it’s a hard and fast rule 
(i.e. a non-power of 2 will work).  I believe the issues are performance and 
balance.  I can't confirm that.  Perhaps someone else on the list will add 
their thoughts.

Has your large OMAP warning gone away?

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com

www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Friday, October 23, 2020 5:41 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Perfect -- many thanks Dominic!

I haven't found a doc which notes the --num-shards needs to be a power of 
two. It isn't that I don't believe you -- I just haven't seen that anywhere.

peter


Peter Eisch
Senior Site Reliability Engineer
T 1.612.445.5135
virginpulse.com

On 10/22/20, 10:24 AM, "dhils...@performair.com"  
wrote:

Peter;

I believe shard counts should be powers of two.

Also, resharding makes the buckets unavailable, but occurs very quickly. As 
such it is not done in the background, but in the foreground, for a manual 
reshard.

Notice the statement: "reshard of bucket <bucket> from 
<old instance id> to <new instance id> completed successfully." It's done.

The warning notice won't go away until a scrub is completed to determine 
that a large OMAP object no longer exists.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com

www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Thursday, October 22, 2020 8:04 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Thank you! This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.3408

[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Gerhard W. Recher
This is a Proxmox cluster ...
Sorry for the formatting problems of my post :(

Short recap: we messed up an IP address change on the public network, so the
monitors went down.



we changed monitor information in ceph.conf and with
ceph-mon -i pve01 --extract-monmap /tmp/monmap
monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add
pve03 10.100.200.143 /tmp/monmap
monmaptool --print /tmp/monmap
ceph-mon -i pve01 --inject-monmap /tmp/monmap


We restarted all three nodes, but the OSDs don't come up.

So how do we recover from this disaster?

# ceph -s
  cluster:
    id: 92d063d7-647c-44b8-95d7-86057ee0ab22
    health: HEALTH_WARN
    1 daemons have recently crashed
    OSD count 0 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
    mgr: pve01(active, since 19h)
    osd: 0 osds: 0 up, 0 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:




 cat /etc/pve/ceph.conf
[global]
 auth_client_required = cephx
 auth_cluster_required = cephx
 auth_service_required = cephx
 cluster_network = 10.112.200.0/24
 fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
 mon_allow_pool_delete = true
 mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
 osd_pool_default_min_size = 2
 osd_pool_default_size = 3
 public_network = 10.100.200.0/24

[client]
 keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
 public_addr = 10.100.200.141

[mon.pve02]
 public_addr = 10.100.200.142

[mon.pve03]
 public_addr = 10.100.200.143



Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 8191 4283888
+49 171 4802507
Am 23.10.2020 um 13:50 schrieb Burkhard Linke:
> Hi,
>
>
> your mail is formatted in a way that makes it impossible to get all
> information, so a number of questions first:
>
>
> - are the mons up, or are the mon up and in a quorum? you cannot
> change mon IP addresses without also adjusting them in the mon map.
> use the daemon socket on the systems to qeury the current state of the
> mons
>
> - the osd systemd output is useless for debugging. it only states that
> the osd is not running and not able to start
>
>
> The real log files are located in /var/log/ceph/. If the mon are in
> quorum, you should find more information here. Keep in mind that you
> also need to change ceph.conf on the OSD hosts if you change the mon
> IP addresses, otherwise the OSDs won't be able to find the mon and the
> processes will die.
>
> And I do not understand how corosync should affect your ceph cluster.
> Ceph does not use corosync...
>
>
> If you need fast help I can recommend the ceph irc channel ;-)
>
>
> Regards,
>
> Burkhard
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Eneko Lacunza

Have you tried to recover the old IPs?

El 23/10/20 a las 14:22, Gerhard W. Recher escribió:

This is a proxmox cluster ...
sorry for formating problems of my post :(

short plot, we messed with ip addr. change of public network, so
monitors went down.



we changed monitor information in ceph.conf and with
ceph-mon -i pve01 --extract-monmap /tmp/monmap
monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add
pve03 10.100.200.143 /tmp/monmap
monmaptool --print /tmp/monmap
ceph-mon -i pve01 --inject-monmap /tmp/monmap


restart of all three nodes, but osd's dont't come up

so howto recover from this disaster ?

# ceph -s
   cluster:
     id: 92d063d7-647c-44b8-95d7-86057ee0ab22
     health: HEALTH_WARN
     1 daemons have recently crashed
     OSD count 0 < osd_pool_default_size 3

   services:
     mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
     mgr: pve01(active, since 19h)
     osd: 0 osds: 0 up, 0 in

   data:
     pools:   0 pools, 0 pgs
     objects: 0 objects, 0 B
     usage:   0 B used, 0 B / 0 B avail
     pgs:




  cat /etc/pve/ceph.conf
[global]
  auth_client_required = cephx
  auth_cluster_required = cephx
  auth_service_required = cephx
  cluster_network = 10.112.200.0/24
  fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
  mon_allow_pool_delete = true
  mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
  osd_pool_default_min_size = 2
  osd_pool_default_size = 3
  public_network = 10.100.200.0/24

[client]
  keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.pve01]
  public_addr = 10.100.200.141

[mon.pve02]
  public_addr = 10.100.200.142

[mon.pve03]
  public_addr = 10.100.200.143



Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 8191 4283888
+49 171 4802507
Am 23.10.2020 um 13:50 schrieb Burkhard Linke:

Hi,


your mail is formatted in a way that makes it impossible to get all
information, so a number of questions first:


- are the mons up, or are the mon up and in a quorum? you cannot
change mon IP addresses without also adjusting them in the mon map.
use the daemon socket on the systems to qeury the current state of the
mons

- the osd systemd output is useless for debugging. it only states that
the osd is not running and not able to start


The real log files are located in /var/log/ceph/. If the mon are in
quorum, you should find more information here. Keep in mind that you
also need to change ceph.conf on the OSD hosts if you change the mon
IP addresses, otherwise the OSDs won't be able to find the mon and the
processes will die.

And I do not understand how corosync should affect your ceph cluster.
Ceph does not use corosync...


If you need fast help I can recommend the ceph irc channel ;-)


Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph and ram limits

2020-10-23 Thread Ing . Luis Felipe Domínguez Vega
For some days now I have been recovering my Ceph cluster. It all started with 
OSDs being killed by the OOM killer, so I created a script to remove the 
corrupted PGs from the OSDs (I say corrupted because those PGs are the cause 
of the 100% RAM usage by the OSDs).
Now that I'm almost done with all the OSDs of my cluster, the monitors are 
consuming all the servers' RAM, and the managers too. Why do they use 60 GB 
of RAM? Is there something to cap that? I have tried configuring every kind 
of RAM limit to the minimum.
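For what it's worth, a rough sketch of the knobs that do exist in Nautilus and
later; note they steer cache/auto-tuning targets rather than hard RSS limits, so
a daemon working through a huge backlog can still exceed them:

# per-OSD memory target for bluestore cache autotuning (default 4 GiB)
ceph config set osd osd_memory_target 2147483648    # 2 GiB

# analogous target for the monitors (default 2 GiB)
ceph config set mon mon_memory_target 2147483648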

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD Failures after pg_num increase on one of the pools

2020-10-23 Thread Григорьев Артём Дмитриевич
Monitoring and the logs looked OK; the OSD nodes have plenty of available CPU and 
RAM.

Previous pg_num was 256.


From: Eugen Block 
Sent: Friday, October 23, 2020 2:06:27 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSD Failures after pg_num increase on one of the pools

Hi,

do you see any peaks on the OSD nodes like OOM killer etc.?
Instead of norecover flag I would try the nodown and noout flags to
prevent flapping OSDs. What was the previous pg_num before you
increased to 512?
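A minimal sketch of that flag juggling, assuming the pool in question is
photo.buckets.data:

# before retrying the pg_num change
ceph osd set nodown
ceph osd set noout

ceph osd pool set photo.buckets.data pg_num 512

# once backfill/recovery has settled
ceph osd unset nodown
ceph osd unset noout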

Regards,
Eugen


Zitat von Артём Григорьев :

> Hello everyone,
>
> I created a new ceph 14.2.7 Nautilus cluster  recently. Cluster consists of
> 3 racks and 2 osd nodes on each rack, 12 new hdd in each node. HDD
> model is TOSHIBA
> MG07ACA14TE 14Tb. All data pools are ec pools.
> Yesterday I decided to increase pg number on one of the pools with
> command "ceph
> osd pool set photo.buckets.data pg_num 512", after that many osds started
> to crash with "out" and "down" status. I tried to increase recovery_sleep
> to 1s but osds still crashes. Osds started working properly only when i set
> "norecover" flag, but osd scrub errors appeared after that.
>
> In logs from osd during crashes i found this:
> ---
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
>
> E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
> In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
> RecoveryMessages*)'
>
> thread 7f8af535d700 time 2020-10-21 15:12:11.460092
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN
>
> E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
> 648: FAILED ceph_assert(pop.data.length() ==
> sinfo.aligned_logical_offset_to_chunk_offset( aft
>
> er_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7
> (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1:
> (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x14a) [0x55fc694d6c0f]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x47)
> [0x55fc694d6dd7]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3:
> (ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
> RecoveryMessages*)+0x1740) [0x55fc698cafa0]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4:
> (ECBackend::handle_recovery_read_complete(hobject_t const&,
> boost::tuples::tuple ceph::buffer::v14_2_0::list, std::less,
> std::allocator >
>> , boost::tuples::null_type, boost::tuples::null_type,
> boost::tuples::null_type, boost::tuples::null_type,
> boost::tuples::null_type, boost::tuples::null_type,
> boost::tuples::null_type>&, boost::optional ceph::buffer::v14_2_0::list, std::less,
> std::allocator >
>> >, RecoveryMessages*)+0x734) [0x55fc698cb804]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5:
> (OnRecoveryReadComplete::finish(std::pair ECBackend::read_result_t&>&)+0x94) [0x55fc698ebbe4]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6:
> (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
> [0x55fc698bfdcc]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7:
> (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8:
> (ECBackend::_handle_message(boost::intrusive_ptr)+0x17f)
> [0x55fc698d718f]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9:
> (PGBackend::handle_message(boost::intrusive_ptr)+0x4a)
> [0x55fc697c18ea]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10:
> (PrimaryLogPG::do_request(boost::intrusive_ptr&,
> ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11:
> (OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr,
> ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*,
> OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62)
> [0x55fc698415c2]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13:
> (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f)
> [0x55fc695cebbf]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14:
> (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
> [0x55fc69b6f976]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15:
> (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65)
> [0x7f8b1ddede65]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d)
> [0x7f8b1ccb188d]
>
> Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught si

[ceph-users] Re: desaster recovery Ceph Storage , urgent help needed

2020-10-23 Thread Gerhard W. Recher
Yep, I have now reverted the IP changes.

The OSDs still do not come up,

and I see no errors in ceph.log; the OSD logs are empty ...
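For reference, a rough first-pass check on one OSD node, assuming the OSDs were
deployed with ceph-volume (as Proxmox does) and their LVs are still intact:

# do the OSD data volumes still exist and carry their metadata?
ceph-volume lvm list

# try re-activating everything ceph-volume knows about
ceph-volume lvm activate --all

# then see what the daemon itself says
journalctl -u ceph-osd@0 --no-pager | tail -n 50
tail -n 50 /var/log/ceph/ceph-osd.0.log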



Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 8191 4283888
+49 171 4802507
Am 23.10.2020 um 14:28 schrieb Eneko Lacunza:
> Hace you tried to recover old IPs ?
>
> El 23/10/20 a las 14:22, Gerhard W. Recher escribió:
>> This is a proxmox cluster ...
>> sorry for formating problems of my post :(
>>
>> short plot, we messed with ip addr. change of public network, so
>> monitors went down.
>>
>>
>>
>> we changed monitor information in ceph.conf and with
>> ceph-mon -i pve01 --extract-monmap /tmp/monmap
>> monmaptool --rm pve01 --rm pve02 --rm pve03 /tmp/monmap
>> monmaptool --add pve01 10.100.200.141 --add pve02 10.100.200.142 --add
>> pve03 10.100.200.143 /tmp/monmap
>> monmaptool --print /tmp/monmap
>> ceph-mon -i pve01 --inject-monmap /tmp/monmap
>>
>>
>> restart of all three nodes, but osd's dont't come up
>>
>> so howto recover from this disaster ?
>>
>> # ceph -s
>>    cluster:
>>  id: 92d063d7-647c-44b8-95d7-86057ee0ab22
>>  health: HEALTH_WARN
>>  1 daemons have recently crashed
>>  OSD count 0 < osd_pool_default_size 3
>>
>>    services:
>>  mon: 3 daemons, quorum pve01,pve02,pve03 (age 19h)
>>  mgr: pve01(active, since 19h)
>>  osd: 0 osds: 0 up, 0 in
>>
>>    data:
>>  pools:   0 pools, 0 pgs
>>  objects: 0 objects, 0 B
>>  usage:   0 B used, 0 B / 0 B avail
>>  pgs:
>>
>>
>>
>>
>>   cat /etc/pve/ceph.conf
>> [global]
>>   auth_client_required = cephx
>>   auth_cluster_required = cephx
>>   auth_service_required = cephx
>>   cluster_network = 10.112.200.0/24
>>   fsid = 92d063d7-647c-44b8-95d7-86057ee0ab22
>>   mon_allow_pool_delete = true
>>   mon_host = 10.100.200.141 10.100.200.142 10.100.200.143
>>   osd_pool_default_min_size = 2
>>   osd_pool_default_size = 3
>>   public_network = 10.100.200.0/24
>>
>> [client]
>>   keyring = /etc/pve/priv/$cluster.$name.keyring
>>
>> [mon.pve01]
>>   public_addr = 10.100.200.141
>>
>> [mon.pve02]
>>   public_addr = 10.100.200.142
>>
>> [mon.pve03]
>>   public_addr = 10.100.200.143
>>
>>
>>
>> Gerhard W. Recher
>>
>> net4sec UG (haftungsbeschränkt)
>> Leitenweg 6
>> 86929 Penzing
>>
>> +49 8191 4283888
>> +49 171 4802507
>> Am 23.10.2020 um 13:50 schrieb Burkhard Linke:
>>> Hi,
>>>
>>>
>>> your mail is formatted in a way that makes it impossible to get all
>>> information, so a number of questions first:
>>>
>>>
>>> - are the mons up, or are the mon up and in a quorum? you cannot
>>> change mon IP addresses without also adjusting them in the mon map.
>>> use the daemon socket on the systems to qeury the current state of the
>>> mons
>>>
>>> - the osd systemd output is useless for debugging. it only states that
>>> the osd is not running and not able to start
>>>
>>>
>>> The real log files are located in /var/log/ceph/. If the mon are in
>>> quorum, you should find more information here. Keep in mind that you
>>> also need to change ceph.conf on the OSD hosts if you change the mon
>>> IP addresses, otherwise the OSDs won't be able to find the mon and the
>>> processes will die.
>>>
>>> And I do not understand how corosync should affect your ceph cluster.
>>> Ceph does not use corosync...
>>>
>>>
>>> If you need fast help I can recommend the ceph irc channel ;-)
>>>
>>>
>>> Regards,
>>>
>>> Burkhard
>>>
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
>




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [External Email] Re: Hardware for new OSD nodes.

2020-10-23 Thread Dave Hall

Brian, Eneko,

BTW, the Tyan LFF chassis we've been using has 12 x 3.5" bays in front 
and 2 x 2.5" SATA bays in back.  We've been using 240GB SSDs in the rear 
bays for mirrored boot drives, so any NVMe we add is exclusively for OSD 
support.


 -Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

On 10/23/2020 11:55 AM, Brian Topping wrote:

Yes the UEFI problem with mirrored mdraid boot is well-documented. I’ve 
generally been working with BIOS partition maps which do not have the single 
point of failure UEFI has (/boot can be mounted as mirrored, any of them can be 
used as non-RAID by GRUB). But BIOS maps have problems as well with volume size.

That said, the disks are portable at that point and really don’t have deep 
performance bottlenecks because mirroring and striping is cheap.

Sent from my iPhone


On Oct 23, 2020, at 03:54, Eneko Lacunza  wrote:

Hi Brian,


El 22/10/20 a las 18:41, Brian Topping escribió:


On Oct 22, 2020, at 10:34 AM, Anthony D'Atri  wrote:
- You must really be sure your raid card is dependable. (sorry but I have 
seen so much management problems with top-tier RAID cards I avoid them like the 
plague).

This.

I’d definitely avoid a RAID card. If I can do advanced encryption with an MMX 
instruction, I think I can certainly trust IOMMU to handle device multiplexing 
from software in an efficient manner, no? mdadm RAID is just fine for me and is 
reliably bootable from GRUB.

I’m not an expert in driver mechanics, but mirroring should be very low 
overhead at the software level.

Once it’s software RAID, moving disks between chassis is a simple process as 
well.

Apologies I didn’t make that clear earlier...

Yes, I really like mdraid :) . Problem is BIOS/UEFI has to find a working 
bootable disk. I think some BIOS/UEFIs have settings for a secondary boot/UEFI 
bootfile, but that would have to be prepared and maintained manually, out of 
the mdraid10; and would only work with a total failure of the primary disk.

Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [External Email] Re: Hardware for new OSD nodes.

2020-10-23 Thread Dave Hall
Eneko,

# ceph health detail
HEALTH_WARN BlueFS spillover detected on 7 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 7 OSD(s)
 osd.1 spilled over 648 MiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.3 spilled over 613 MiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.4 spilled over 485 MiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.10 spilled over 1008 MiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.17 spilled over 808 MiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.18 spilled over 2.5 GiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device
 osd.20 spilled over 1.5 GiB metadata from 'db' device (28 GiB used of
124 GiB) to slow device

nvme0n1                                       259:1    0   1.5T  0 disk
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--6dcbb748--13f5--45cb--9d49--6c78d6589a71  253:1   0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--736a22a8--e4aa--4da9--b63b--295d8f5f2a3d  253:3   0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--751c6623--9870--4123--b551--1fd7fc837341  253:5   0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--2a376e8d--abb1--42af--a4bd--4ae8734d703e  253:7   0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--54fbe282--9b29--422b--bdb2--d7ed730bc589  253:9   0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--c1153cd2--2ec0--4e7f--a3d7--91dac92560ad  253:11  0  124G  0 lvm
├─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--d613f4eb--6ddc--4dd5--a2b5--cb520b6ba922  253:13  0  124G  0 lvm
└─ceph--block--dbs--a2b7a161--d4da--4b86--a191--37564008adca-osd--block--db--41f75c25--67db--46e8--a3fb--ddee9e7f7fc4  253:15  0  124G  0 lvm
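A rough sketch of the two follow-ups discussed further down - checking how full
each 124 GiB db LV really is, and growing it in place - assuming the VG has free
space and using osd.1 as the example (the VG/LV names are placeholders taken
from ceph-volume lvm list):

# current RocksDB usage and spillover for one OSD
ceph daemon osd.1 perf dump bluefs | egrep 'db_total_bytes|db_used_bytes|slow_used_bytes'

# grow the db LV, then let BlueFS claim the new space
systemctl stop ceph-osd@1
lvextend -L +176G ceph-block-dbs-<vg-uuid>/osd-block-db-<lv-uuid>    # placeholder names
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1
systemctl start ceph-osd@1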

Dave Hall
Binghamton universitykdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

On 10/23/2020 6:00 AM, Eneko Lacunza wrote:

Hi Dave,

El 22/10/20 a las 19:43, Dave Hall escribió:


El 22/10/20 a las 16:48, Dave Hall escribió:


(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a couple
questions about hardware choices.  Our original nodes were 8 x 12TB SAS
drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.

We chose the NVMe card for performance since it has an 8 lane PCIe
interface.  However, we're currently BlueFS spillovers.

The Tyan chassis we are considering has the option of 4 x U.2 NVMe bays -
each with 4 PCIe lanes, (and 8 SAS bays).   It has occurred to me that I
might stripe 4 1TB NVMe drives together to get much more space for WAL/DB
and a net performance of 16 PCIe lanes.

Any thoughts on this approach?

Don't stripe them, if one NVMe fails you'll lose all OSDs. Just use 1 NVMe
drive for 2  SAS drives  and provision 300GB for WAL/DB for each OSD (see
related threads on this mailing list about why that exact size).

This way if a NVMe fails, you'll only lose 2 OSD.

I was under the impression that everything that BlueStore puts on the
SSD/NVMe could be reconstructed from information on the OSD. Am I mistaken
about this?  If so, my single 1.6TB NVMe card is equally vulnerable.


I don't think so, that info only exists on that partition as was the case
with filestore journal. Your single 1.6TB NVMe is vulnerable, yes.


Also, what size of WAL/DB partitions do you have now, and what spillover
size?


I recently posted another question to the list on this topic, since I now
have spillover on 7 of 24 OSDs.  Since the data layout on the NVMe for
BlueStore is not traditional, I've never quite figured out how to get this
information.  The current partition size is 1.6TB / 12 since we had the
possibility to add four more drives to each node.  How that was divided
between WAL, DB, etc. is something I'd like to be able to understand.
However, we're not going to add the extra 4 drives, so expanding the LVM
partitions is now a possibility.

Can you paste the warning message? It shows the spillover size. What size
are the partitions on the NVMe disk (lsblk)?


Cheers
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-23 Thread Patrick Donnelly
On Fri, Oct 23, 2020 at 9:02 AM David C  wrote:
>
> Success!
>
> I remembered I had a server I'd taken out of the cluster to
> investigate some issues, that had some good quality 800GB Intel DC
> SSDs, dedicated an entire drive to swap, tuned up min_free_kbytes,
> added an MDS to that server and let it run. Took 3 - 4 hours but
> eventually came back online. It used the 128GB of RAM and about 250GB
> of the swap.
>
> Dan, thanks so much for steering me down this path, I would have more
> than likely started hacking away at the journal otherwise!
>
> Frank, thanks for pointing me towards that other thread, I used your
> min_free_kbytes tip
>
> I now need to consider updating - I wonder if the risk averse CephFS
> operator would go for the latest Nautilus or latest Octopus, it used
> to be that the newer CephFS code meant the most stable but don't know
> if that's still the case.

You need to first upgrade to Nautilus in any case. n+2 releases is the
max delta between upgrades.

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io