[ceph-users] certificate docs.ceph.com

2020-04-22 Thread Nic De Muyer
Hi,

It appears the certificate expired today for docs.ceph.com.
Just thought I'd mention it here.

kr,
Nic De Muyer

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: : nautilus : progress section in ceph status is stuck

2020-04-22 Thread ceph
What is the output of

ceph -s
ceph health detail


- Mehmet

On 21 April 2020 at 07:02:15 CEST, Khodayar Doustar wrote:
>Hi Vasishta,
>
>Have you checked that OSD's systemd log and perf counters? You can check
>its metadata and BlueFS logs to see what's going on.
>
>Thanks,
>
>Khodayar
>
>On Mon, Apr 20, 2020 at 9:48 PM Vasishta Shastry 
>wrote:
>
>> Hi,
>>
>> I upgraded a Luminous cluster to Nautilus and migrated Filestore OSDs to
>> BlueStore using the ceph-ansible playbook.
>> I have migrated 6 OSDs so far, and for the last 3 days the progress
>> section of ceph status has been stuck as below.
>> Can anyone please help me check what is going wrong?
>>
>> progress:
>> > Rebalancing after osd.6 marked out
>> >   [..]
>> >
>>
>> Regards,
>> Vasishta Shastry
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS : replace a standby-replay daemon by an active one

2020-04-22 Thread Herve Ballans

Hi Eugen,

Thanks for your confirmation, it worked following your steps. In 
addition, I had to restart the third MDS service for the change from 
standby-replay to standby to take effect.
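
For the record, the complete sequence on our side was roughly the following 
(only a sketch; the fs name and the MDS instance name are placeholders):

$ sudo ceph fs set my_fs allow_standby_replay false
$ sudo ceph fs set my_fs max_mds 2
$ sudo systemctl restart ceph-mds@<third-mds-id>   # drops it from standby-replay to plain standby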


Regards,
Hervé

On 15/04/2020 11:01, Eugen Block wrote:

Hi,

I didn't find any clear procedure regarding this operation, and my 
question is whether I can add an active rank directly or whether I have 
to unset the standby-replay status first.


I was thinking of the second option, that is:

$ sudo ceph fs set /my_fs/ allow_standby_replay false
$ sudo ceph fs set /my_fs/ max_mds 2

Is it the correct way ?


both ways should work. You can first enable the second active MDS with

$ sudo ceph fs set /my_fs/ max_mds 2

and afterwards disable standby-replay or the other way around. I don't 
think there's "the one correct" way.


Regards,
Eugen



Quoting Herve Ballans :


Hello to all confined people (and the others too) !

On one of my Ceph clusters (Nautilus 14.2.3), I previously set up 3 
MDS daemons in an active/standby-replay/standby configuration.


For design reasons, I would like to replace this configuration with an 
active/active/standby one.


That means replacing the standby-replay daemon with an active one.

I didn't find any clear procedure regarding this operation, and my 
question is whether I can add an active rank directly or whether I have 
to unset the standby-replay status first.


I was thinking of the second option, that is:

$ sudo ceph fs set /my_fs/ allow_standby_replay false
$ sudo ceph fs set /my_fs/ max_mds 2

Is it the correct way ?

Thanks in advance,
Hervé


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: block.db symlink missing after each reboot

2020-04-22 Thread Jan Fajerski

On Tue, Apr 21, 2020 at 04:38:18PM +0200, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Hmm, I updated the missing LV tags:

# lvs -o lv_tags /dev/ceph-3a295647-d5a1-423c-81dd-1d2b32d7c4c5/osd-block-c2676c5f-111c-4603-b411-473f7a7638c2 | tr ',' '\n' | sort

 LV Tags

ceph.block_device=/dev/ceph-3a295647-d5a1-423c-81dd-1d2b32d7c4c5/osd-block-c2676c5f-111c-4603-b411-473f7a7638c2
ceph.block_uuid=0wBREi-I5t1-UeUa-EvbA-sET0-S9O0-VaxOgg
ceph.cephx_lockbox_secret=
ceph.cluster_fsid=7e242332-55c3-4926-9646-149b2f5c8081
ceph.cluster_name=ceph
ceph.crush_device_class=None
ceph.db_device=/dev/bluefs_db1/db-osd0
ceph.db_uuid=UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj
ceph.encrypted=0
ceph.osd_fsid=c2676c5f-111c-4603-b411-473f7a7638c2
ceph.osd_id=0
ceph.type=block

If this is the db LV then the type is wrong here. Should be ceph.type=db.

ceph.vdo=0

# lvdisplay /dev/bluefs_db1/db-osd0
 --- Logical volume ---
 LV Path                /dev/bluefs_db1/db-osd0
 LV Name                db-osd0
 VG Name                bluefs_db1
 LV UUID                UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj
 LV Write Access        read/write
 LV Creation host, time cloud10-1517, 2020-02-28 21:32:48 +0100
 LV Status              available
 # open                 0
 LV Size                185,00 GiB
 Current LE             47360
 Segments               1
 Allocation             inherit
 Read ahead sectors     auto
 - currently set to     256
 Block device           253:1

but lvm trigger still says:

# /usr/sbin/ceph-volume lvm trigger 0-c2676c5f-111c-4603-b411-473f7a7638c2
 -->  RuntimeError: could not find db with uuid UUw35K-YnNT-HZZE-IfWd-Rtxn-0eVW-kTuQmj

Kind regards
 Stefan Priebe
Bachelor of Science in Computer Science (BSCS)
Vorstand (CTO)

---
Profihost AG
Expo Plaza 1
30539 Hannover
Deutschland

Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: i...@profihost.com

Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
Vorstand: Cristoph Bluhm, Stefan Priebe
Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)

On 21.04.20 at 16:07, Igor Fedotov wrote:

On 4/21/2020 4:59 PM, Stefan Priebe - Profihost AG wrote:

Hi Igor,

Am 21.04.20 um 15:52 schrieb Igor Fedotov:

Hi Stefan,

I think that's the cause:

https://tracker.ceph.com/issues/42928

thanks yes that matches. Is there any way to fix this manually?


I think so - AFAIK the missing tags are pure LVM stuff and hence can be set
by regular LVM tools.

 ceph-volume does that during OSD provisioning as well. But
unfortunately I haven't dived into this topic deeper yet, so I can't
provide you with the details of how to fix this step by step.
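
In general though, LVM tags are just added or removed with lvchange, so a fix
would look roughly like this (only a sketch, using the db LV from this thread;
the full set of ceph.* tags on the db LV should mirror what ceph-volume writes
on a freshly provisioned OSD, so please compare against one before applying
anything):

# lvchange --addtag "ceph.type=db" /dev/bluefs_db1/db-osd0
# lvchange --addtag "ceph.osd_id=0" /dev/bluefs_db1/db-osd0
# lvchange --addtag "ceph.osd_fsid=c2676c5f-111c-4603-b411-473f7a7638c2" /dev/bluefs_db1/db-osd0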



And is this also related to:
https://tracker.ceph.com/issues/44509


Probably unrelated. That's either a different bug or rather some
artifact from RocksDB/BlueFS interaction.

Leaving a request for more info in the ticket...



Greets,
Stefan


On 4/21/2020 4:02 PM, Stefan Priebe - Profihost AG wrote:

Hi there,

I've a bunch of hosts where I migrated HDD-only OSDs to hybrid ones
using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'

This worked fine and each OSD was running fine.

However, each OSD loses its block.db symlink after a reboot.

If I manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*

all OSDs start fine. Can anybody tell me what creates those symlinks and
why they're not created automatically in the case of a migrated DB?

Greets,
Stefan


--
Jan Fajerski
Senior Software Engineer Enterprise Storage
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrading to Octopus

2020-04-22 Thread Simon Sutter
Hello everybody


In Octopus there are some interesting-looking features, so I tried to upgrade 
my CentOS 7 test nodes according to:
https://docs.ceph.com/docs/master/releases/octopus/

Everything went fine and the cluster is healthy.


To test out the new dashboard functions, I tried to install it, but there are 
missing dependencies:

yum install ceph-mgr-dashboard.noarch

.

--> Finished Dependency Resolution
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
   Requires: python3-routes
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
   Requires: python3-jwt
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
   Requires: python3-cherrypy


Installing them with pip3 does of course make no difference, because those are 
yum dependencies.

Does anyone know a workaround?

Do I have to upgrade to Centos 8 for this to work?


Thanks in advance,

Simon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: docs.ceph.com certificate expired?

2020-04-22 Thread Jos Collin

Fixed

On 22/04/20 6:57 pm, Bobby wrote:


Thanks ! When will it be back?

On Wed, Apr 22, 2020 at 3:03 PM > wrote:


Hello,

trying to access the documentation on docs.ceph.com now results in an
error: The certificate expired on April 22, 2020, 8:46 AM.

Bye,
Ulrich


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Apply/Commit vs Read/Write Op Latency

2020-04-22 Thread John Petrini
Hello,

I was hoping someone could clear up the difference between these metrics.
In filestore the difference between Apply and Commit Latency was pretty
clear and these metrics gave a good representation of how the cluster was
performing. High commit usually meant our journals were performing poorly
while high apply pointed to an OSD issue.

With bluestore Apply & Commit are now tied to the same metric and it's not
as clear to me what that metric is.

In addition new metrics such as Read and Write Op Latency have been added.
I'm led to believe that these are similar to what Apply Latency used to
represent but is that actually the case?

If anyone who has a better understanding of this than I do can enlighten me
I'd appreciate it!
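
In case it helps frame the question: the numbers I mean are the per-OSD
commit/apply latency from "ceph osd perf" and the op latency counters in the
OSD perf dump, pulled roughly like this (osd.0 only as an example, run on the
node hosting that OSD):

ceph osd perf
ceph daemon osd.0 perf dump | grep -A 3 -E 'op_r_latency|op_w_latency'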

Thanks,

John
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading to Octopus

2020-04-22 Thread Khodayar Doustar
Hi Simon,

Have you tried installing them with yum?
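
Something like this should show whether any configured repo can provide them
(just a sketch; you may need to enable EPEL first if it isn't already):

yum install epel-release
yum provides 'python3-routes' 'python3-jwt' 'python3-cherrypy'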




On Wed, Apr 22, 2020 at 6:16 PM Simon Sutter  wrote:

> Hello everybody
>
>
> In octopus there are some interesting looking features, so I tried to
> upgrading my Centos 7 test nodes, according to:
> https://docs.ceph.com/docs/master/releases/octopus/
>
> Everything went fine and the cluster is healthy.
>
>
> To test out the new dashboard functions, I tried to install it, but there
> are missing dependencies:
>
> yum install ceph-mgr-dashboard.noarch
>
> .
>
> --> Finished Dependency Resolution
> Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
>Requires: python3-routes
> Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
>Requires: python3-jwt
> Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
>Requires: python3-cherrypy
>
>
> Installing them with pip3 does of course make no difference, because those
> are yum dependencies.
>
> Does anyone know a workaround?
>
> Do I have to upgrade to Centos 8 for this to work?
>
>
> Thanks in advance,
>
> Simon
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to remove a deamon from orch

2020-04-22 Thread Ml Ml
Hello list,

I somehow have this "mgr.cph02   ceph02  stopped" line here.

root@ceph01:~# ceph orch ps
NAME        HOST    STATUS        REFRESHED  AGE  VERSION  IMAGE NAME               IMAGE ID      CONTAINER ID
mgr.ceph02  ceph02  running (2w)  2w ago     -    15.2.0   docker.io/ceph/ceph:v15  204a01f9b0b6  4e349a382c6b
mgr.ceph03  ceph03  running (2w)  2w ago     -    15.2.0   docker.io/ceph/ceph:v15  204a01f9b0b6  2a9a037e5e2d
mgr.cph02   ceph02  stopped       2w ago     -
mon.ceph02  ceph02  running (2w)  2w ago     -    15.2.0   docker.io/ceph/ceph:v15  204a01f9b0b6  cf66ca51c0dd
mon.ceph03  ceph03  running (2w)  2w ago     -    15.2.0   docker.io/ceph/ceph:v15  204a01f9b0b6  fceaaa03b41f

I actually can't remember how I did that. How can I remove that wrong
"mgr.cph02" entry?

Thanks,
Michael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to debug ssh: ceph orch host add ceph01 10.10.1.1

2020-04-22 Thread Ml Ml
Hello List,

I did:
root@ceph01:~# ceph cephadm set-ssh-config -i /tmp/ssh_conf

root@ceph01:~# cat /tmp/ssh_conf
Host *
User root
StrictHostKeyChecking no
UserKnownHostsFile /dev/null

root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_key -i
/root/.ssh/id_rsa
set mgr/cephadm/ssh_identity_key
root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_pub -i
/root/.ssh/id_rsa.pub
set mgr/cephadm/ssh_identity_pub

But I get:
root@ceph01:~# ceph orch host add ceph01 10.10.1.1
Error ENOENT: Failed to connect to ceph01 (10.10.1.1).  Check that the
host is reachable and accepts connections using the cephadm SSH key

root@ceph01:~# ceph config-key get mgr/cephadm/ssh_identity_key =>
this shows my private key

How can I debug this?

root@ceph01:~# ssh 10.10.1.1
  or
root@ceph01:~# ssh ceph01

work without a prompt or key error.
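
One more thing worth trying to rule out a key mismatch: connect with exactly
the key and ssh_config the mgr module has stored, instead of root's default
key/agent (a sketch, reusing only the config-key material shown above):

root@ceph01:~# ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_id
root@ceph01:~# chmod 600 /tmp/cephadm_id
root@ceph01:~# ssh -F /tmp/ssh_conf -i /tmp/cephadm_id root@10.10.1.1 true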

I am using 15.2.0.

Thanks,
Michael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread cody . schmidt
Hey Folks,

This is my first ever post here in the CEPH user group and I will preface with 
the fact that I know this is a lot of what many people ask frequently. Unlike 
what I assume to be a large majority of CEPH “users” in this forum, I am more 
of a CEPH “distributor.” My interests lie in how to build a CEPH environment to 
best fill an organization’s needs.I am here for the real-world experience and 
expertise so that I can learn to build CEPH “right.” I have spent the last 
couple years collecting data on general “best practices” through forum posts, 
CEPH documentation, Cephalocon, etc. I wanted to post my findings to the forum 
to see where I can harden my stance.

Below are two example designs that I might use when architecting a solution 
currently. I have specific questions around design elements in each that I 
would like you to approve for holding water or not. I want to focus on the 
hardware, so I am asking for generalizations where possible. Let’s assume in 
all scenarios that we are using Luminous and that the data type is mixed use. 
I am not expecting anyone to run through every question, so please feel free to 
comment on any piece you can. Tell me what is overkill and what is lacking!

Example 1:
8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
Storage Node Spec: 
2x 32C 2.9GHz AMD EPYC
   - Documentation mentions .5 cores per OSD for throughput optimized. Are they 
talking about .5 Physical cores or .5 Logical cores?
   - Is it better to pick my processors based on a total GHz measurement like 
2GHz per OSD?
   - Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 
1GHz? Would Threads be included in this calculation?
512GB Memory
   - I know this is the hot topic because of its role in recoveries. Basically, 
I am looking for the most generalized practice I can use as a safe number and a 
metric I can use as a nice to have. 
   - Is it 1GB of RAM per TB of RAW OSD?
2x 3.2TB NVMe WAL/DB / Log Drive
   - Another hot topic that I am sure will bring many “it depends.” All I am 
looking for is experience on this. I know people have mentioned having at least 
70GB of Flash for WAL/DB / Logs. 
   - Can I use 70GB as a flat calculation per OSD or is it depend on the Size 
of the OSD?
   - I know more is better, but what is a number I can use to get started with 
minimal issues?
2x 56Gbit Links
- I think this should be enough given the rule of thumb of 10Gbit for every 12 
OSDs.
3x MON Node
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
- I can’t really find good practices around when to increase your core count. 
Any suggestions?
128GB Memory
   - What do I need memory for in a MON?
   - When do I need to expand?
2x 480GB Boot SSDs
   - Any reason to look more closely into the sizing of these drives?
2x 25Gbit Uplinks
   - Should these match the output of the storage nodes for any reason?


Example 2:
8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
Storage Node Spec: 
2x 32C 2.9GHz AMD EPYC
   - I have read that each NVMe OSD should have 10 cores. I am not splitting 
Physical drives into multiple OSDs so let’s assume I have 12 OSD per Node.
   - Would threads count toward my 10 core quota or just physical cores?
   - Can I do a similar calculation as I mentioned before and just use 20GHz 
per OSD instead of focusing on cores specifically?
512GB Memory
   - I assume there is some reason I can’t use the same methodology of 1GB  per 
TB of OSD since this is NVMe storage
2x 100Gbit Links
   - This is assuming about 1Gigabyte per second of real-world speed per disk

3x MON Node – What differences should MONs serving NVMe have compared to large 
NLSAS pools?
MON Node Spec:
1x 8C 3.2GHz AMD Epyc
128GB Memory
2x 480GB Boot SSDs
2x 25Gbit Uplinks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Jack
Hi,


On 4/22/20 11:47 PM, cody.schm...@iss-integration.com wrote:

> Example 1:
> 8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
> Storage Node Spec: 
> 2x 32C 2.9GHz AMD EPYC
>- Documentation mentions .5 cores per OSD for throughput optimized. Are 
> they talking about .5 Physical cores or .5 Logical cores?

Does not matter much
CPU is used for recovery as well as for rbd snaptrimming
The real rule: do not have 12 OSDs on a 4 CPU host

>- Is it better to pick my processors based on a total GHz measurement like 
> 2GHz per OSD?
>- Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 
> 1GHz? Would Threads be included in this calculation?
Higher frequency leads to lower latency, hence higher performance
To suit your example, I would get 8 cores at 2GHz

> 512GB Memory
>- I know this is the hot topic because of its role in recoveries. 
> Basically, I am looking for the most generalized practice I can use as a safe 
> number and a metric I can use as a nice to have. 
>- Is it 1GB of RAM per TB of RAW OSD?
Well
More ram -> more performance, as always
I do have 1GB per TB on my rusty cluster
On my flash-based cluster, I have between 2.5GB and 7GB per TB
Again, my real rule: 64GB of memory per node, and not more than 12
device slots per node
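
FWIW, the knob that actually enforces the per-OSD memory budget on BlueStore
is osd_memory_target; a rough sketch for such a 64GB / 12-OSD node, leaving
some headroom for the OS (the 4 GiB value is only an example):

ceph config set osd osd_memory_target 4294967296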

>- I know more is better, but what is a number I can use to get started 
> with minimal issues?
> 2x 56Gbit Links
10G is the cheapest, do not go below
25G is cheap too, consider it

> - I think this should be enough given the rule of thumb of 10Gbit for every 
> 12 OSDs.
> 3x MON Node
> MON Node Spec:
> 1x 8C 3.2GHz AMD EPYC
> - I can’t really find good practices around when to increase your core count. 
> Any suggestions?

I have never seen any CPU usage on monitors
I wonder if a dual core would suit perfectly ..
(the higher frequency stuff applies, tho)

> 128GB Memory
>- What do I need memory for in a MON?
>- When do I need to expand?
Same as the CPU: a couple of GB were always enough for me

> 2x 480GB Boot SSDs
>- Any reason to look more closely into the sizing of these drives?
I use 32GB flash-based satadom devices for root device
They are basically SSD, and do not take front slots
As they are never burning up, we never replace them
Ergo, the need to "open" the server is not an issue

> 2x 25Gbit Uplinks
>- Should these match the output of the storage nodes for any reason?
10G!


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Brian Topping
Great set of suggestions, thanks! One to consider:

> On Apr 22, 2020, at 4:14 PM, Jack  wrote:
> 
> I use 32GB flash-based satadom devices for root device
> They are basically SSD, and do not take front slots
> As they are never burning up, we never replace them
> Ergo, the need to "open" the server is not an issue


This is probably the wrong forum to understand how you are not burning them 
out. Any kind of logs or monitor databases on a small SATADOM will cook them 
quick, especially an MLC. There is no extra space for wear leveling and the 
like. I tried making it work with fancy systemd logging to memory and having 
those logs pulled by a log scraper storing to the actual data drives, but there 
was no place for the monitor DB. No monitor DB means Ceph doesn’t load, and if 
a monitor DB gets corrupted, it’s perilous for the cluster and instant death if 
the monitors aren’t replicated.

My node chassis have two motherboards and each is hard limited to four SSDs. On 
each node, `/boot` is mirrored (RAID1) on partition 1, `/` is stripe/mirrored 
(RAID10) on p2, then whatever was left was used for ceph data on partition 3 of 
each disk. This way any disk could fail and I could still boot. Merging the 
volumes (ie no SATADOM), wear leveling was statistically more effective. And I 
don’t have to get into crazy system configurations that nobody would want to 
maintain or document.

$0.02…

Brian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Sporadic mgr segmentation fault

2020-04-22 Thread Brad Hubbard
On Tue, Apr 21, 2020 at 11:39 PM XuYun  wrote:
>
> Dear ceph users,
>
> We are experiencing sporadic mgr crash in all three ceph clusters (version 
> 14.2.6 and version 14.2.8), the crash log is:
>
> 2020-04-17 23:10:08.986 7fed7fe07700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>  In function 'const char* ceph::buffer::v14_2_0::ptr::c_str() const' thread 
> 7fed7fe07700 time 2020-04-17 23:10:08.984887
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>  578: FAILED ceph_assert(_raw)
>
>  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x14a) [0x7fed8605c325]
>  2: (()+0x2534ed) [0x7fed8605c4ed]
>  3: (()+0x5a21ed) [0x7fed863ab1ed]
>  4: (PosixConnectedSocketImpl::send(ceph::buffer::v14_2_0::list&, bool)+0xbd) 
> [0x7fed863840ed]
>  5: (AsyncConnection::_try_send(bool)+0xb6) [0x7fed8632fc76]
>  6: (ProtocolV2::write_message(Message*, bool)+0x832) [0x7fed8635bf52]
>  7: (ProtocolV2::write_event()+0x175) [0x7fed863718c5]
>  8: (AsyncConnection::handle_write()+0x40) [0x7fed86332600]
>  9: (EventCenter::process_events(unsigned int, std::chrono::duration long, std::ratio<1l, 10l> >*)+0x1397) [0x7fed8637f997]
>  10: (()+0x57c977) [0x7fed86385977]
>  11: (()+0x80bdaf) [0x7fed86614daf]
>  12: (()+0x7e65) [0x7fed8394ce65]
>  13: (clone()+0x6d) [0x7fed825fa88d]
>
> 2020-04-17 23:10:08.990 7fed7ee05700 -1 *** Caught signal (Segmentation 
> fault) **
>  in thread 7fed7ee05700 thread_name:msgr-worker-2
>
>  ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
> (stable)
>  1: (()+0xf5f0) [0x7fed839545f0]
>  2: (ceph::buffer::v14_2_0::ptr::release()+0x8) [0x7fed863aafd8]
>  3: 
> (ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler()+0x59)
>  [0x7fed86388669]
>  4: (ProtocolV2::reset_recv_state()+0x11f) [0x7fed8635f5af]
>  5: (ProtocolV2::stop()+0x77) [0x7fed8635f857]
>  6: 
> (ProtocolV2::handle_existing_connection(boost::intrusive_ptr)+0x5ef)
>  [0x7fed86374f8f]
>  7: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xd9c) 
> [0x7fed8637673c]
>  8: (ProtocolV2::handle_frame_payload()+0x1fb) [0x7fed86376c1b]
>  9: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fed86376e70]
>  10: 
> (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr  ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x44d) [0x7fed863773cd]
>  11: (ProtocolV2::run_continuation(Ct&)+0x34) [0x7fed86360534]
>  12: (AsyncConnection::process()+0x186) [0x7fed86330656]
>  13: (EventCenter::process_events(unsigned int, 
> std::chrono::duration >*)+0xa15) 
> [0x7fed8637f015]
>  14: (()+0x57c977) [0x7fed86385977]
>  15: (()+0x80bdaf) [0x7fed86614daf]
>  16: (()+0x7e65) [0x7fed8394ce65]
>  17: (clone()+0x6d) [0x7fed825fa88d]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
> interpret this.
>
> Any thoughts about this issue?

Looks like https://tracker.ceph.com/issues/42026 which was recently
backported to the Nautilus branch via
https://github.com/ceph/ceph/pull/33820

You could try a build with those patches or wait for 14.2.9

-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Sporadic mgr segmentation fault

2020-04-22 Thread XuYun
Thank you, Brad. We’ll try to upgrade to 14.2.9 today.

> On 23 April 2020, at 07:21, Brad Hubbard wrote:
> 
> On Tue, Apr 21, 2020 at 11:39 PM XuYun <yu...@me.com> wrote:
>> 
>> Dear ceph users,
>> 
>> We are experiencing sporadic mgr crash in all three ceph clusters (version 
>> 14.2.6 and version 14.2.8), the crash log is:
>> 
>> 2020-04-17 23:10:08.986 7fed7fe07700 -1 
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>>  In function 'const char* ceph::buffer::v14_2_0::ptr::c_str() const' thread 
>> 7fed7fe07700 time 2020-04-17 23:10:08.984887
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.8/rpm/el7/BUILD/ceph-14.2.8/src/common/buffer.cc:
>>  578: FAILED ceph_assert(_raw)
>> 
>> ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
>> (stable)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x14a) [0x7fed8605c325]
>> 2: (()+0x2534ed) [0x7fed8605c4ed]
>> 3: (()+0x5a21ed) [0x7fed863ab1ed]
>> 4: (PosixConnectedSocketImpl::send(ceph::buffer::v14_2_0::list&, bool)+0xbd) 
>> [0x7fed863840ed]
>> 5: (AsyncConnection::_try_send(bool)+0xb6) [0x7fed8632fc76]
>> 6: (ProtocolV2::write_message(Message*, bool)+0x832) [0x7fed8635bf52]
>> 7: (ProtocolV2::write_event()+0x175) [0x7fed863718c5]
>> 8: (AsyncConnection::handle_write()+0x40) [0x7fed86332600]
>> 9: (EventCenter::process_events(unsigned int, std::chrono::duration> long, std::ratio<1l, 10l> >*)+0x1397) [0x7fed8637f997]
>> 10: (()+0x57c977) [0x7fed86385977]
>> 11: (()+0x80bdaf) [0x7fed86614daf]
>> 12: (()+0x7e65) [0x7fed8394ce65]
>> 13: (clone()+0x6d) [0x7fed825fa88d]
>> 
>> 2020-04-17 23:10:08.990 7fed7ee05700 -1 *** Caught signal (Segmentation 
>> fault) **
>> in thread 7fed7ee05700 thread_name:msgr-worker-2
>> 
>> ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus 
>> (stable)
>> 1: (()+0xf5f0) [0x7fed839545f0]
>> 2: (ceph::buffer::v14_2_0::ptr::release()+0x8) [0x7fed863aafd8]
>> 3: 
>> (ceph::crypto::onwire::AES128GCM_OnWireTxHandler::~AES128GCM_OnWireTxHandler()+0x59)
>>  [0x7fed86388669]
>> 4: (ProtocolV2::reset_recv_state()+0x11f) [0x7fed8635f5af]
>> 5: (ProtocolV2::stop()+0x77) [0x7fed8635f857]
>> 6: 
>> (ProtocolV2::handle_existing_connection(boost::intrusive_ptr)+0x5ef)
>>  [0x7fed86374f8f]
>> 7: (ProtocolV2::handle_client_ident(ceph::buffer::v14_2_0::list&)+0xd9c) 
>> [0x7fed8637673c]
>> 8: (ProtocolV2::handle_frame_payload()+0x1fb) [0x7fed86376c1b]
>> 9: (ProtocolV2::handle_read_frame_dispatch()+0x150) [0x7fed86376e70]
>> 10: 
>> (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr>  ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x44d) [0x7fed863773cd]
>> 11: (ProtocolV2::run_continuation(Ct&)+0x34) [0x7fed86360534]
>> 12: (AsyncConnection::process()+0x186) [0x7fed86330656]
>> 13: (EventCenter::process_events(unsigned int, 
>> std::chrono::duration >*)+0xa15) 
>> [0x7fed8637f015]
>> 14: (()+0x57c977) [0x7fed86385977]
>> 15: (()+0x80bdaf) [0x7fed86614daf]
>> 16: (()+0x7e65) [0x7fed8394ce65]
>> 17: (clone()+0x6d) [0x7fed825fa88d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>> 
>> Any thoughts about this issue?
> 
> Looks like https://tracker.ceph.com/issues/42026 which was recently
> backported to the Nautilus branch via
> https://github.com/ceph/ceph/pull/33820
> 
> You could try a build with those patches or wait for 14.2.9
> 
> -- 
> Cheers,
> Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread lin . yunfan







I have seen a lot of people saying not to go with big nodes.
What is the exact reason for that? I can understand that if the cluster is not
big enough then the total node count could be too small to withstand a node
failure, but if the cluster is big enough, wouldn't a big node be more cost
effective?


lin.yunfan
lin.yun...@gmail.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Jarett DeAngelis
Well, for starters, "more network" = "faster cluster."

On Wed, Apr 22, 2020 at 11:18 PM lin.yunfan  wrote:

> I have seen a lot of people saying not to go with big nodes.
> What is the exact reason for that?
> I can understand that if the cluster is not big enough then the total
> nodes count could be too small to withstand a node failure, but if the
> cluster is big enough wouldn't the big node be more cost effective?
>
>
> lin.yunfan
> lin.yun...@gmail.com
>
> 
> On 4/23/2020 06:33, Brian Topping wrote:
>
> Great set of suggestions, thanks! One to consider:
>
> On Apr 22, 2020, at 4:14 PM, Jack  wrote:
>
> I use 32GB flash-based satadom devices for root device
> They are basically SSD, and do not take front slots
> As they are never burning up, we never replace them
> Ergo, the need to "open" the server is not an issue
>
>
>
> This is probably the wrong forum to understand how you are not burning
> them out. Any kind of logs or monitor databases on a small SATADOM will
> cook them quick, especially an MLC. There is no extra space for wear
> leveling and the like. I tried making it work with fancy systemd logging to
> memory and having those logs pulled by a log scraper storing to the actual
> data drives, but there was no place for the monitor DB. No monitor DB means
> Ceph doesn’t load, and if a monitor DB gets corrupted, it’s perilous for
> the cluster and instant death if the monitors aren’t replicated.
>
> My node chassis have two motherboards and each is hard limited to four
> SSDs. On each node, `/boot` is mirrored (RAID1) on partition 1, `/` is
> stripe/mirrored (RAID10) on p2, then used whatever was left for ceph data
> on partition 3 of each disk. This way any disk could fail and I could still
> boot. Merging the volumes (ie no SATADOM), wear leveling was statistically
> more effective. And I don’t have to get into crazy system configurations
> that nobody would want to maintain or document.
>
> $0.02…
>
> Brian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread lin . yunfan






Big nodes are mostly for HDD clusters, and with a 40G or 100G NIC I don't
think the network would be the bottleneck.


lin.yunfan
lin.yun...@gmail.com
 




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: missing amqp-exchange on bucket-notification with AMQP endpoint

2020-04-22 Thread Andreas Unterkircher

Dear Yuval!


The message format you tried to use is the standard one (the one being
emitted from boto3, or any other AWS SDK [1]).
It passes the arguments using 'x-www-form-urlencoded'. For example:


Thank you for your clarification! I've previously tried it as an
x-www-form-urlencoded body as well, but I failed. That it then worked
using the non-standard parameters has led me down the wrong road...

But I have to admit that I'm still failing to create a topic the S3 way.

I've tried it with curl, but as well with Postman.
Even if I use your example-body, Ceph keeps telling me (at least) 
method-not-allowed.


Is this maybe because I'm using an AWS Sig v4 to authenticate?

This is the request I'm sending out:

POST / HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Accept-Encoding: identity
Date: Tue, 23 Apr 2020 05:00:35 GMT
X-Amz-Content-Sha256: e8d828552b412fde2cd686b0a984509bc485693a02e8c53ab84cf36d1dbb961a

Host: s3.example.com
X-Amz-Date: 20200423T050035Z
Authorization: AWS4-HMAC-SHA256 
Credential=DNQXT3I8Z5MWDJ1A8YMP/20200423/de/s3/aws4_request, 
SignedHeaders=accept-encoding;content-type;date;host;x-amz-content-sha256;x-amz-date, 
Signature=fa65844ba997fe11e65be87a18f160afe1ea459892316d6060bbc663daf6eace

User-Agent: PostmanRuntime/7.24.1
Accept: */*
Connection: keep-alive

Content-Length: 303

Name=ajmmvc-1_topic_1&
Attributes.entry.2.key=amqp-exchange&
Attributes.entry.1.key=amqp-ack-level&
Attributes.entry.2.value=amqp.direct&
Version=2010-03-31&
Attributes.entry.3.value=amqp%3A%2F%2F127.0.0.1%3A7001&
Attributes.entry.1.value=none&
Action=CreateTopic&
Attributes.entry.3.key=push-endpoint


This is the response that comes back:

HTTP/1.1 405 Method Not Allowed
Content-Length: 200
x-amz-request-id: tx1-005ea12159-6e47a-s3-datacenter
Accept-Ranges: bytes
Content-Type: application/xml
Date: Thu, 23 Apr 2020 05:02:17 GMT
encoding="UTF-8"?>MethodNotAllowedtx1-005ea12159-6e47a-s3-datacenter6e47a-s3-datacenter-de



This is what radosgw was logging at the same time:

2020-04-23T07:02:17.745+0200 7f5aab2af700 20 final domain/bucket 
subdomain= domain=s3.example.com in_hosted_domain=1 
in_hosted_domain_s3website=0 s->info.domain=s3.example.com 
s->info.request_uri=/
2020-04-23T07:02:17.745+0200 7f5aab2af700 10 meta>> 
HTTP_X_AMZ_CONTENT_SHA256

2020-04-23T07:02:17.745+0200 7f5aab2af700 10 meta>> HTTP_X_AMZ_DATE
2020-04-23T07:02:17.745+0200 7f5aab2af700 10 x>> 
x-amz-content-sha256:e8d828552b412fde2cd686b0a984509bc485693a02e8c53ab84cf36d1dbb961a
2020-04-23T07:02:17.745+0200 7f5aab2af700 10 x>> 
x-amz-date:20200423T050035Z
2020-04-23T07:02:17.745+0200 7f5aab2af700 20 req 1 0s get_handler 
handler=26RGWHandler_REST_Service_S3
2020-04-23T07:02:17.745+0200 7f5aab2af700 10 
handler=26RGWHandler_REST_Service_S3

2020-04-23T07:02:17.745+0200 7f5aab2af700  2 req 1 0s getting op 4
2020-04-23T07:02:17.745+0200 7f5aab2af700 10 Content of POST:
Name=ajmmvc-1_topic_1&
Attributes.entry.2.key=amqp-exchange&
Attributes.entry.1.key=amqp-ack-level&
Attributes.entry.2.value=amqp.direct&
Version=2010-03-31&
Attributes.entry.3.value=amqp%3A%2F%2F127.0.0.1%3A7001&
Attributes.entry.1.value=none&
Action=CreateTopic&
Attributes.entry.3.key=push-endpoint

2020-04-23T07:02:17.745+0200 7f5aab2af700 10 Content of POST:
Name=ajmmvc-1_topic_1&
Attributes.entry.2.key=amqp-exchange&
Attributes.entry.1.key=amqp-ack-level&
Attributes.entry.2.value=amqp.direct&
Version=2010-03-31&
Attributes.entry.3.value=amqp%3A%2F%2F127.0.0.1%3A7001&
Attributes.entry.1.value=none&
Action=CreateTopic&
Attributes.entry.3.key=push-endpoint

2020-04-23T07:02:17.745+0200 7f5aab2af700 10 Content of POST:
Name=ajmmvc-1_topic_1&
Attributes.entry.2.key=amqp-exchange&
Attributes.entry.1.key=amqp-ack-level&
Attributes.entry.2.value=amqp.direct&
Version=2010-03-31&
Attributes.entry.3.value=amqp%3A%2F%2F127.0.0.1%3A7001&
Attributes.entry.1.value=none&
Action=CreateTopic&
Attributes.entry.3.key=push-endpoint

2020-04-23T07:02:17.745+0200 7f5aab2af700  1 handler->ERRORHANDLER: 
err_no=-2003 new_err_no=-2003

2020-04-23T07:02:17.745+0200 7f5aab2af700  2 req 1 0s http status=405
2020-04-23T07:02:17.745+0200 7f5aab2af700  1 == req done 
req=0x7f5aab2a6d50 op status=0 http_status=405 latency=0s ==







Best Regards,
Andreas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Martin Verges
Hello Cody,

There are a few simple rules to design a good, stable and performant Ceph
cluster.

1) Don't choose big systems. Not only are they often more
expensive, but you also have more impact when a system is down.

2) Throw away all the not required stuff like RAID controllers, make the
system as simple as possible.

3) Plan CPU with a rule of thumb:
 - for HDD 1 cpu thread of any cpu is ok
 - for SSD/NVMe midrange 1 cpu core is most likely ok
 - for high end NVMe up to 4 cpu cores (8 threads) can be consumed, most
setups would be ok with 2 cores per disk
And generally, the faster the cores, the better it will be. This is
especially important on high end NVMe.

4) Plan Memory by Number of OSD drives * 6-8 GB and then choose the next
optimal dimm config (for example 128 GB).

5) Network:
 - HDD don't provide a good performance, 2*10G is totally fine
 - SSD/NVMe midrange can exceed 10G so it would be the bare minimum, but
100G are way too much ;)
 - High-end NVMe can exceed even a dual 40G link, but honestly I never
saw client traffic in that performance range, only Ceph recovery
 And overall, choose a modern all path active network design, like leaf
spine with vxlan to scale

6) DB/WAL:
 - definitely will decrease latency
 - can increase performance
 - do require long lasting write intensive flash if you don't want to get
in trouble with them
 - Sizing this is a hot topic ;). I currently just plan 300G (not 299) per
OSD for best performance. Choose a PCIe interface, don't choose a SATA
interface for DB/WAL, it will be a bottleneck. (A quick worked example
follows right after this list.)
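
To make these rules of thumb concrete, a purely hypothetical 12x HDD node
under rules 3, 4 and 6 would come out roughly as:
 - CPU: 12 HDD OSDs -> about 12 CPU threads (rule 3)
 - Memory: 12 * 6-8 GB = 72-96 GB -> next optimal DIMM config, e.g. 96 or 128 GB (rule 4)
 - DB/WAL: 12 * 300 GB = 3.6 TB of PCIe-attached NVMe (rule 6)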

You can colocate any service of a unified Ceph on all the hosts. If you add
services like MON, RGW, MDS you need to add some extra resources to your
calculation
MON) Just throw it in, the rule of thumb above will work without a problem
RGW) Metadata requires an SSD/NVMe pool, as HDD is too slow; depending on
the required performance, some more CPU is required. As we plan more but
smaller servers, the load can be distributed across more nodes and it scales
much better.
MDS) Can easily consume high Memory rates. Again depending on the use-case
how much it will need. Most likely adding it to the rule of thumb is ok but
if there are many open files, choose the next bigger dimm config.

In the end, especially inexperienced customers do have a great need for
good Ceph management as well. If you are interested, please feel free to
contact me and I will show you how we do it. We also have reseller options,
maybe that's something for you.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Wed, 22 Apr 2020 at 23:47,  wrote:

> Hey Folks,
>
> This is my first ever post here in the CEPH user group and I will preface
> with the fact that I know this is a lot of what many people ask frequently.
> Unlike what I assume to be a large majority of CEPH “users” in this forum,
> I am more of a CEPH “distributor.” My interests lie in how to build a CEPH
> environment to best fill an organization’s needs.I am here for the
> real-world experience and expertise so that I can learn to build CEPH
> “right.” I have spent the last couple years collecting data on general
> “best practices” through forum posts, CEPH documentation, CEPHLACON, etc. I
> wanted to post my findings to the forum to see where I can harden my stance.
>
> Below are two example designs that I might use when architecting a
> solution currently. I have specific questions around design elements in
> each that I would like you to approve for holding water or not. I want to
> focus on the hardware, so I am asking for generalizations where possible.
> Let’s assume in all scenarios that we are using Luminous and that the data
> type is mixed use.
> I am not expecting anyone to run through every question, so please feel
> free to comment on any piece you can. Tell me what is overkill and what is
> lacking!
>
> Example 1:
> 8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
> Storage Node Spec:
> 2x 32C 2.9GHz AMD EPYC
>- Documentation mentions .5 cores per OSD for throughput optimized. Are
> they talking about .5 Physical cores or .5 Logical cores?
>- Is it better to pick my processors based on a total GHz measurement
> like 2GHz per OSD?
>- Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C
> at 1GHz? Would Threads be included in this calculation?
> 512GB Memory
>- I know this is the hot topic because of its role in recoveries.
> Basically, I am looking for the most generalized practice I can use as a
> safe number and a metric I can use as a nice to have.
>- Is it 1GB of RAM per TB of RAW OSD?
> 2x 3.2TB NVMe WAHLDB / Log Drive
>- Another hot topic that I am sure will bring many “it depends.” All I
> am looking for is experienc

[ceph-users] adding block.db to OSD

2020-04-22 Thread Stefan Priebe - Profihost AG

Hello,

is there anything else needed besides running:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} 
bluefs-bdev-new-db --dev-target /dev/vgroup/lvdb-1


I did so some weeks ago, and currently I'm seeing that all OSDs 
originally deployed with --block-db show 10-20% I/O waits, while all 
those that got converted using ceph-bluestore-tool show 80-100% I/O waits.


Also, is there some tuning available to use more of the SSD? The SSD 
(block-db) is only saturated at 0-2%.
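
Or do I additionally need to migrate the existing RocksDB data off the slow
device? As far as I understand, bluefs-bdev-new-db only attaches the new
device, which would match the low SSD utilisation. A rough sketch of what I
mean (please double-check before running anything like this):

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-migrate \
  --devs-source /var/lib/ceph/osd/ceph-${OSD}/block \
  --dev-target /var/lib/ceph/osd/ceph-${OSD}/block.db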


Greets,
Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Martin Verges
From all our calculations of clusters, going with smaller systems reduced
the TCO because of much cheaper hardware.
Having 100 Ceph nodes is not an issue, therefore you can scale small and
large clusters with the exact same hardware.

But please, prove me wrong. I would love to see a way to reduce the TCO
even more and if you have a way, I would love to hear about it.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, 23 Apr 2020 at 05:18, lin.yunfan  wrote:

> I have seen a lot of people saying not to go with big nodes.
> What is the exact reason for that?
> I can understand that if the cluster is not big enough then the total
> nodes count could be too small to withstand a node failure, but if the
> cluster is big enough wouldn't the big node be more cost effective?
>
>
> lin.yunfan
> lin.yun...@gmail.com
>
> 
> On 4/23/2020 06:33, Brian Topping wrote:
>
> Great set of suggestions, thanks! One to consider:
>
> On Apr 22, 2020, at 4:14 PM, Jack  wrote:
>
> I use 32GB flash-based satadom devices for root device
> They are basically SSD, and do not take front slots
> As they are never burning up, we never replace them
> Ergo, the need to "open" the server is not an issue
>
>
>
> This is probably the wrong forum to understand how you are not burning
> them out. Any kind of logs or monitor databases on a small SATADOM will
> cook them quick, especially an MLC. There is no extra space for wear
> leveling and the like. I tried making it work with fancy systemd logging to
> memory and having those logs pulled by a log scraper storing to the actual
> data drives, but there was no place for the monitor DB. No monitor DB means
> Ceph doesn’t load, and if a monitor DB gets corrupted, it’s perilous for
> the cluster and instant death if the monitors aren’t replicated.
>
> My node chassis have two motherboards and each is hard limited to four
> SSDs. On each node, `/boot` is mirrored (RAID1) on partition 1, `/` is
> stripe/mirrored (RAID10) on p2, then used whatever was left for ceph data
> on partition 3 of each disk. This way any disk could fail and I could still
> boot. Merging the volumes (ie no SATADOM), wear leveling was statistically
> more effective. And I don’t have to get into crazy system configurations
> that nobody would want to maintain or document.
>
> $0.02…
>
> Brian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?

2020-04-22 Thread Darren Soothill
If you want the lowest cost per TB then you will be going with larger nodes in 
your cluster, but it does mean your minimum cluster size is going to be many PBs.

There are a number of fixed costs associated with a node.

So motherboard, network cards, disk controllers: the more disks you spread 
these fixed costs across, the lower the overhead cost and therefore the lower 
the cost per TB.

So let's say our hypothetical server has a cost of 1000 for the motherboard, 
1000 for the network card and 1000 for the disk controller. Just trying to keep 
the maths simple here.

So 3000 is the cost of the server. If you spread that cost across 60 disks then 
you end up with an additional cost per disk of 50. If you spread it across 24 
disks then you have an additional cost of 125, and across 12 disks you have an 
additional cost of 250.

I have left memory out of this as memory is a fixed amount per OSD device. I 
have also left the chassis out of this but a 60 drive chassis is not 3x the 
price of a 24 drive chassis and is not 5X the cost of a 12 drive chassis. If it 
is then you need to be looking for a new chassis vendor.

CPU is the one variable here which is not linear and the CPU vendor tax for 
higher core counts can be significant. So a 60 drive chassis would need 60 
threads available which puts you into dual socket on the intel side of things. 
AMD would allow you to get to a single socket motherboard with a 32 Core CPU 
for that 60 drive chassis. Single socket motherboard is lower cost than dual 
socket and feeds back into the calculation above.

Now the question is what is the tax that a particular chassis vendor is 
charging you. I know from the configs we do on a regular basis that a 60 drive 
chassis will give you the lowest cost per TB. BUT it has implications. Your 
cluster size needs to be up in the order of 10PB minimum. 60 x 18TB gives you 
around 1PB per node. Oh, did you notice here we are going for the bigger disk 
drives? Why? Because the more data you can spread your fixed costs across, the 
lower the overall cost per GB.

If you have a node failure you will have to recreate 1PB of lost data. This 
pushes you to 25G networking or faster. In many cases I would be looking at 
100G; 100G top-of-rack switches are so cheap, why wouldn't you go down this 
route?
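
As a back-of-envelope illustration (ignoring protocol overhead and pretending a 
single link is the bottleneck, which recovery spread across many nodes normally 
avoids): re-creating 1PB means moving about 8,000,000 Gbit, which is roughly 
320,000 seconds (around 3.7 days) at a sustained 25Gbit/s, versus about 80,000 
seconds (around 22 hours) at 100Gbit/s.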

You will get into power, weight and cooling issues with many DCs though, and 
this is something to consider.

The amount of NVMe space for RocksDB and WAL is also a fixed amount based on 
the number of OSD devices, so this has no effect on the cost per TB when 
deciding between different chassis densities.

So if your requirement is lowest cost per TB then 60 drive chassis is the way 
to go and will give you the lowest price point.








From: Martin Verges 
Date: Thursday, 23 April 2020 at 06:39
To: lin.yunfan 
Cc: brian.topp...@gmail.com , ceph-users@ceph.io 

Subject: [ceph-users] Re: Dear Abby: Why Is Architecting CEPH So Hard?
From all our calculations of clusters, going with smaller systems reduced
the TCO because of much cheaper hardware.
Having 100 Ceph nodes is not an issue, therefore you can scale small and
large clusters with the exact same hardware.

But please, prove me wrong. I would love to see a way to reduce the TCO
even more and if you have a way, I would love to hear about it.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, 23 Apr 2020 at 05:18, lin.yunfan  wrote:

> I have seen a lot of people saying not to go with big nodes.
> What is the exact reason for that?
> I can understand that if the cluster is not big enough then the total
> nodes count could be too small to withstand a node failure, but if the
> cluster is big enough wouldn't the big node be more cost effective?
>
>
> lin.yunfan
> lin.yun...@gmail.com
>
> 
> On 4/23/2020 06:33, Brian Topping wrote:
>
> Great set of suggestions, thanks! One to consider:
>
> On Apr 22, 2020, at 4:14 PM, Jack  wrote:
>
> I use 32GB flash-based satadom devices for root device
> They are basically SSD, and do not take front slots
> As they are never burning up, we never replace them
> Ergo, the need to "open" the server is not an issue
>
>
>
> This is probably the wrong forum to understand how you are not burning
> them out. Any kind of logs or monitor databases on a small SATADOM will
> cook them quick, especially an MLC. There is no extra space for wear
> level