Re: [ceph-users] Update / upgrade cluster with MDS from 12.2.7 to 12.2.11

2019-02-12 Thread Götz Reinicke


> Am 12.02.2019 um 00:03 schrieb Patrick Donnelly :
> 
> On Mon, Feb 11, 2019 at 12:10 PM Götz Reinicke
>  wrote:
>> as 12.2.11 has been out for some days and no panic mails have shown up on the
>> list, I was planning to update too.
>> 
>> I know there are recommended orders in which to update/upgrade the cluster,
>> but I don’t know how the RPM packages handle restarting services after a
>> yum update, e.g. when MDS and MONs are on the same server.
> 
> This should be fine. The MDS only uses a new executable file if you
> explicitly restart it via systemd (or, the MDS fails and systemd
> restarts it).
> 
> More info: when the MDS respawns in normal circumstances, it passes
> the /proc/self/exe file to execve. An intended side-effect is that the
> MDS will continue using the same executable file across execs.
> 
>> And regarding an MDS cluster, I'd like to ask whether the upgrade instructions
>> about running only one MDS during an upgrade also apply to an update?
>> 
>> http://docs.ceph.com/docs/mimic/cephfs/upgrading/
> 
> If you upgrade an MDS, it may update the compatibility bits in the
> Monitor's MDSMap. Other MDSs will abort when they see this change. The
> upgrade process is intended to help you avoid seeing those errors so you
> don't inadvertently think something went wrong.
> 
> If you don't mind seeing those errors and you're using 1 active MDS,
> then don't worry about it.

Thanks for your feedback and clarification!

I have one active MDS and one standby, both on the same version. So I might see
some errors during the upgrade, but I don’t have to stop the standby MDS?!

Or, to be safe, should I stop the standby?
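For reference, the sequence I have in mind would be roughly the following (only a
sketch, assuming the usual systemd unit names on an RPM-based host):

    yum update 'ceph*'                          # only replaces files on disk, running daemons keep the old binary
    systemctl restart ceph-mon@$(hostname -s)   # MONs first, one host at a time
    systemctl restart ceph-mds@$(hostname -s)   # then the standby MDS, the active one last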

Thanks if you can comment on that. Regards, Götz

smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block

Hi,

I came to the same conclusion after doing various tests with rooms and
failure domains. I agree with Maged and suggest using size=4,
min_size=2 for replicated pools. It's more overhead, but you can
survive the loss of one room and even one more OSD (of the affected
PG) without losing data. You'll also have the certainty that there are
always two replicas per room, with no guessing or hoping about which
room is more likely to fail.
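For reference, switching an existing replicated pool to that scheme is just (pool
name made up):

    ceph osd pool set mypool size 4
    ceph osd pool set mypool min_size 2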


If the overhead is too high could EC be an option for your setup?

Regards,
Eugen


Quoting "ST Wong (ITSC)":


Hi all,

Tested 4 cases.  Cases 1-3 are as expected, while for case 4 the
rebuild didn’t take place in the surviving room, as Gregory suspected.
Repeating case 4 several times against both rooms gave the same result.
We’re running mimic 13.2.2.


E.g.

Room1
Host 1 osd: 2,5
Host 2 osd: 1,3

Room 2  <-- failed room
Host 3 osd: 0,4
Host 4 osd: 6,7


Before (pg 5.62, from ceph pg dump):
  state active+clean, up [0,7,5] (primary 0), acting [0,7,5] (primary 0),
  version 0'0, reported 3643:2299, state stamp 2019-02-12 04:47:28.183375,
  last scrub 0'0 at 2019-02-12 04:47:28.183218,
  last deep scrub 0'0 at 2019-02-11 01:20:51.276922
  (all object/byte/log counters are 0)

After (pg 5.62):
  state undersized+peered, up [5] (primary 5), acting [5] (primary 5),
  version 0'0, reported 3647:2284, state stamp 2019-02-12 09:10:59.101096,
  last scrub 0'0 at 2019-02-12 04:47:28.183218,
  last deep scrub 0'0 at 2019-02-11 01:20:51.276922
  (all object/byte/log counters are 0)


Fyi.   Sorry for the belated report.

Thanks a lot.
/st


From: Gregory Farnum 
Sent: Monday, November 26, 2018 9:27 PM
To: ST Wong (ITSC) 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object  
relocation in OSD failure ?


On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)
<s...@itsc.cuhk.edu.hk> wrote:


Hi all,



We've 8 OSD hosts, 4 in room 1 and 4 in room 2.

A pool with size = 3 is created using the following crush rule, to cater
for room failure.



rule multiroom {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}




We're expecting:

1. for each object, there are always 2 replicas in one room and 1
replica in the other room, making size=3.  But we can't control which
room has 1 or 2 replicas.


Right.


2. in case an OSD host fails, ceph will assign the remaining
OSDs to the same PG to hold the replicas that were on the failed OSD host.
Selection is based on the crush rule of the pool, thus maintaining the
same failure domain - it won't put all replicas in the same room.


Yes, if a host fails the copies it held will be replaced by new  
copies in the same room.



3. in case the entire room holding 1 replica fails, the pool
will remain degraded but won't do any replica relocation.


Right.


4. in case the entire room holding 2 replicas fails, ceph will make use
of OSDs in the surviving room to restore 2 replicas.  The pool will not
be writeable before all objects have 2 copies (unless we make the pool
size=4?).  Then, when recovery is complete, the pool will remain
degraded until the failed room recovers.


Hmm, I'm actually not sure if this will work out — because CRUSH is  
hierarchical, it will keep trying to select hosts from the dead room  
and will fill out the location vector's first two spots with -1. It  
could be that Ceph will skip all those "nonexistent" entries and  
just pick the two copies from slots 3 and 4, but it might not. You  
should test this carefully and report back!
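One way to check this offline is to run your crush map through crushtool and
zero-weight the OSDs of one room (a sketch; adjust the rule id, replica count and
OSD ids to your setup):

    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-mappings \
        --weight 0 0 --weight 4 0 --weight 6 0 --weight 7 0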

-Greg

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.

Regards,
/stwong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster stability

2019-02-12 Thread M Ranga Swami Reddy
Hello - I have a couple of questions on ceph cluster stability, even
though we follow all the recommendations below:
- Having a separate replication n/w and data n/w (a config sketch follows below)
- RACK is the failure domain
- Using SSDs for journals (1:4 ratio)
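For context, the network split is configured roughly like this (placeholder
subnets, not our real ones):

    [global]
    public network  = 10.0.0.0/24     # client / data traffic
    cluster network = 10.0.1.0/24     # replication / backfill traffic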

Q1 - If one OSD goes down, cluster IO drops drastically and customer apps are impacted.
Q2 - What is the stability ratio, i.e. with the above, is the ceph cluster
still in a workable condition if one OSD or one node goes down, etc.?

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread ST Wong (ITSC)
Hi,

Thanks.  As the power supply to one of our server rooms is not so stable, we will
probably use size=4, min_size=2 to prevent data loss.

> If the overhead is too high could EC be an option for your setup?

Will there be much difference in performance between EC and replicated pools?  Thanks.
We hope to do more testing on EC before the deadline for our first production Ceph cluster...

Regards,
/st

-Original Message-
From: ceph-users  On Behalf Of Eugen Block
Sent: Tuesday, February 12, 2019 5:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object relocation in 
OSD failure ?

Hi,

I came to the same conclusion after doing various tests with rooms and failure 
domains. I agree with Maged and suggest to use size=4,
min_size=2 for replicated pools. It's more overhead but you can survive the 
loss of one room and even one more OSD (of the affected
PG) without losing data. You'll also have the certainty that there are always 
two replicas per room, no guessing or hoping which room is more likely to fail.

If the overhead is too high could EC be an option for your setup?

Regards,
Eugen


Quoting "ST Wong (ITSC)":

> Hi all,
>
> Tested 4 cases.  Case 1-3 are as expected, while for case 4,
> rebuild didn’t take place on surviving room as Gregory mentioned.   
> Repeated case 4 several times on both rooms got same result.  We’re 
> running mimic 13.2.2.
>
> E.g.
>
> Room1
> Host 1 osd: 2,5
> Host 2 osd: 1,3
>
> Room 2  <-- failed room
> Host 3 osd: 0,4
> Host 4 osd: 6,7
>
>
> Before (pg 5.62, from ceph pg dump):
>   state active+clean, up [0,7,5] (primary 0), acting [0,7,5] (primary 0),
>   version 0'0, reported 3643:2299, state stamp 2019-02-12 04:47:28.183375,
>   last scrub 0'0 at 2019-02-12 04:47:28.183218,
>   last deep scrub 0'0 at 2019-02-11 01:20:51.276922
>   (all object/byte/log counters are 0)
>
> After (pg 5.62):
>   state undersized+peered, up [5] (primary 5), acting [5] (primary 5),
>   version 0'0, reported 3647:2284, state stamp 2019-02-12 09:10:59.101096,
>   last scrub 0'0 at 2019-02-12 04:47:28.183218,
>   last deep scrub 0'0 at 2019-02-11 01:20:51.276922
>   (all object/byte/log counters are 0)
>
> Fyi.   Sorry for the belated report.
>
> Thanks a lot.
> /st
>
>
> From: Gregory Farnum 
> Sent: Monday, November 26, 2018 9:27 PM
> To: ST Wong (ITSC) 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] will crush rule be used during object 
> relocation in OSD failure ?
>
> On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)
> <s...@itsc.cuhk.edu.hk> wrote:
>
> Hi all,
>
>
>
> We've 8 osd hosts, 4 in room 1 and 4 in room2.
>
> A pool with size = 3 using following crush map is created, to cater 
> for room failure.
>
>
> rule multiroom {
> id 0
> type replicated
> min_size 2
> max_size 4
> step take default
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
>
>
>
>
> We're expecting:
>
> 1.for each object, there are always 2 replicas in one room and 1 
> replica in other room making size=3.  But we can't control which room 
> has 1 or 2 replicas.
>
> Right.
>
>
> 2.in case an osd host fails, ceph will assign remaining  
> osds to the same PG to hold replicas on the failed osd host.   
> Selection is based on crush rule of the pool, thus maintaining the 
> same failure domain - won't make all replicas in the same room.
>
> Yes, if a host fails the copies it held will be replaced by new copies 
> in the same room.
>
>
> 3.in case of entire room with 1 replica fails, the pool 
> will remain degraded but won't do any replica relocation.
>
> Right.
>
>
> 4. in case of entire room with 2 replicas fails, ceph will make use of 
> osds in the surviving room and making 2 replicas.  Pool will not be 
> writeable before all objects are made 2 copies (unless we make pool 
> size=4?).  Then when recovery is complete, pool will remain in 
> degraded state until the failed room recover.
>
> Hmm, I'm actually not sure if this will work out — because CRUSH is 
> hierarchical, it will keep trying to select hosts from the dead room 
> and will fill out the location vector's first two spots with -1. It 
> could be that Ceph will skip all those "nonexistent" entries and just 
> pick the two copies from slots 3 and 4, but it might not. You should 
> test this carefully and report back!
> -Greg
>
> Is our understanding correct?  Thanks a lot.
> Will do some simulation later to verify.
>
> Regards,
> /stwong
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2019-02-12 Thread Eugen Block
Will there be much difference in performance between EC and  
replicated?  Thanks.
Hope can do more testing on EC  before deadline of our first  
production CEPH...


In general, yes, there will be a difference in performance. Of course  
it depends on the actual configuration, but if you rely on performance  
I would stick with replication. Running your own tests with EC on your  
existing setup will reveal performance differences and help you decide  
which way to go.
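A quick way to get a first impression on your existing hardware (a sketch; the
profile, pool name and parameters are only examples):

    ceph osd erasure-code-profile set ec42test k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecbench 64 64 erasure ec42test
    rados bench -p ecbench 60 write --no-cleanup
    rados bench -p ecbench 60 rand
    rados -p ecbench cleanup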


Regards,
Eugen


Quoting "ST Wong (ITSC)":


Hi,

Thanks.As power supply to one of our server rooms is not so  
stable, will probably use size=4,min_size=2 to prevent data lose.



If the overhead is too high could EC be an option for your setup?


Will there be much difference in performance between EC and  
replicated?  Thanks.
Hope can do more testing on EC  before deadline of our first  
production CEPH...


Regards,
/st

-Original Message-
From: ceph-users  On Behalf Of Eugen Block
Sent: Tuesday, February 12, 2019 5:32 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object  
relocation in OSD failure ?


Hi,

I came to the same conclusion after doing various tests with rooms  
and failure domains. I agree with Maged and suggest to use size=4,
min_size=2 for replicated pools. It's more overhead but you can  
survive the loss of one room and even one more OSD (of the affected
PG) without losing data. You'll also have the certainty that there  
are always two replicas per room, no guessing or hoping which room  
is more likely to fail.


If the overhead is too high could EC be an option for your setup?

Regards,
Eugen


Quoting "ST Wong (ITSC)":


Hi all,

Tested 4 cases.  Case 1-3 are as expected, while for case 4,
rebuild didn’t take place on surviving room as Gregory mentioned.
Repeated case 4 several times on both rooms got same result.  We’re
running mimic 13.2.2.

E.g.

Room1
Host 1 osd: 2,5
Host 2 osd: 1,3

Room 2  <-- failed room
Host 3 osd: 0,4
Host 4 osd: 6,7


Before (pg 5.62, from ceph pg dump):
  state active+clean, up [0,7,5] (primary 0), acting [0,7,5] (primary 0),
  version 0'0, reported 3643:2299, state stamp 2019-02-12 04:47:28.183375,
  last scrub 0'0 at 2019-02-12 04:47:28.183218,
  last deep scrub 0'0 at 2019-02-11 01:20:51.276922
  (all object/byte/log counters are 0)

After (pg 5.62):
  state undersized+peered, up [5] (primary 5), acting [5] (primary 5),
  version 0'0, reported 3647:2284, state stamp 2019-02-12 09:10:59.101096,
  last scrub 0'0 at 2019-02-12 04:47:28.183218,
  last deep scrub 0'0 at 2019-02-11 01:20:51.276922
  (all object/byte/log counters are 0)

Fyi.   Sorry for the belated report.

Thanks a lot.
/st


From: Gregory Farnum 
Sent: Monday, November 26, 2018 9:27 PM
To: ST Wong (ITSC) 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] will crush rule be used during object
relocation in OSD failure ?

On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)
<s...@itsc.cuhk.edu.hk> wrote:

Hi all,



We've 8 osd hosts, 4 in room 1 and 4 in room2.

A pool with size = 3 using following crush map is created, to cater
for room failure.


rule multiroom {
id 0
type replicated
min_size 2
max_size 4
step take default
step choose firstn 2 type room
step chooseleaf firstn 2 type host
step emit
}




We're expecting:

1.for each object, there are always 2 replicas in one room and 1
replica in other room making size=3.  But we can't control which room
has 1 or 2 replicas.

Right.


2.in case an osd host fails, ceph will assign remaining
osds to the same PG to hold replicas on the failed osd host.
Selection is based on crush rule of the pool, thus maintaining the
same failure domain - won't make all replicas in the same room.

Yes, if a host fails the copies it held will be replaced by new copies
in the same room.


3.in case of entire room with 1 replica fails, the pool
will remain degraded but won't do any replica relocation.

Right.


4. in case of entire room with 2 replicas fails, ceph will make use of
osds in the surviving room and making 2 replicas.  Pool will not be
writeable before all objects are made 2 copies (unless we make pool
size=4?).  Then when recovery is complete, pool will remain in
degraded state until the failed room recover.

Hmm, I'm actually not sure if this will work out — because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room
and will fill out the location vector's first two spots with -1. It
could be that Ceph will skip all those "nonexistent" entries and just
pick the two copies from slots 3 and 4, but it might not. You should
test this carefully and report back!
-Greg

Is our understanding correct?  Thanks a lot.
Will do some simulation later to verify.

Regards,
/stwong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-12 Thread Hector Martin

On 12/02/2019 06:01, Gregory Farnum wrote:
Right. Truncates and renames require sending messages to the MDS, and 
the MDS committing to RADOS (aka its disk) the change in status, before 
they can be completed. Creating new files will generally use a 
preallocated inode so it's just a network round-trip to the MDS.


I see. Is there a fundamental reason why these kinds of metadata 
operations cannot be buffered in the client, or is this just the current 
way they're implemented?


e.g. on a local FS these kinds of writes can just stick around in the 
block cache unflushed. And of course for CephFS I assume file extension 
also requires updating the file size in the MDS, yet that doesn't block 
while truncation does.


Going back to your first email, if you do an overwrite that is confined 
to a single stripe unit in RADOS (by default, a stripe unit is the size 
of your objects which is 4MB and it's aligned from 0), it is guaranteed 
to be atomic. CephFS can only tear writes across objects, and only if 
your client fails before the data has been flushed.


Great! I've implemented this in a backwards-compatible way, so that gets 
rid of this bottleneck. It's just a 128-byte flag file (formerly 
variable length, now I just pad it to the full 128 bytes and rewrite it 
in-place). This is good information to know for optimizing things :-)
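For anyone curious, the rewrite boils down to something like this (a sketch with a
made-up file name and payload), padding to a fixed 128 bytes and never truncating:

    printf '%-128s' "state=ok" | dd of=/mnt/cephfs/app/flagfile bs=128 count=1 conv=notrunc,fsync status=none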


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Controlling CephFS hard link "primary name" for recursive stat

2019-02-12 Thread Hector Martin

On 11/02/2019 18:52, Yan, Zheng wrote:

how about directly reading backtrace, something equivalent to:

rados -p cephfs1_data getxattr xxx. parent >/tmp/parent
ceph-dencoder import /tmp/parent type inode_backtrace_t decode dump_json


Where xxx is just the hex inode from stat(), right. (I only just 
realized this :-))
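So a quick manual test would look something like this (assuming the usual
<hex-inode>.00000000 naming of a file's first data object):

    ino=$(printf '%x' "$(stat -c %i /mnt/cephfs/path/to/file)")
    rados -p cephfs1_data getxattr "${ino}.00000000" parent > /tmp/parent
    ceph-dencoder import /tmp/parent type inode_backtrace_t decode dump_json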


Are there Python bindings for what ceph-dencoder does, or at least a C 
API? I could shell out to ceph-dencoder but I imagine that won't be too 
great for performance.


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD fails to start (fsck error, unable to read osd superblock)

2019-02-12 Thread Ruben Rodriguez


On 2/9/19 5:40 PM, Brad Hubbard wrote:
> On Sun, Feb 10, 2019 at 1:56 AM Ruben Rodriguez  wrote:
>>
>> Hi there,
>>
>> Running 12.2.11-1xenial on a machine with 6 SSD OSD with bluestore.
>>
>> Today we had two disks fail out of the controller, and after a reboot
>> they both seemed to come back fine but ceph-osd was only able to start
>> in one of them. The other one gets this:
>>
>> 2019-02-08 18:53:00.703376 7f64f948ce00 -1
>> bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device
>> location [0x4000~1000], logical extent 0x0~1000, object
>> #-1:7b3f43c4:::osd_superblock:0#
>> 2019-02-08 18:53:00.703406 7f64f948ce00 -1 osd.3 0 OSD::init() : unable
>> to read osd superblock
>>
>> Note that there are no actual IO errors being shown by the controller in
>> dmesg, and that the disk is readable. The metadata FS is mounted and
>> looks normal.
>>
>> I tried running "ceph-bluestore-tool repair --path
>> /var/lib/ceph/osd/ceph-3 --deep 1" and that gets many instances of:
> 
> Running this with debug_bluestore=30 might give more information on
> the nature of the IO error.

I had collected the logs with debug info already, and nothing
significant was listed there. I applied this patch
https://github.com/ceph/ceph/pull/26247 and it allowed me to move
forward. There was an OSD map corruption issue that I had to handle by
hand, but after that the osd started fine. After it started and the
backfills finished, the bluestore_ignore_data_csum flag was no longer
needed, so I reverted to the standard packages.
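For anyone hitting the same thing, the debug output can be captured along these
lines (a sketch; I'm assuming ceph-bluestore-tool's --log-file/--log-level options
here):

    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3 --deep 1 \
        --log-file /tmp/osd3-repair.log --log-level 30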

-- 
Ruben Rodriguez | Chief Technology Officer, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proxmox 4.4, Ceph hammer, OSD cache link...

2019-02-12 Thread Marco Gaiarin
Mandi! Michel Raabe
  In that message you wrote...

> Have you changed/add the journal_uuid from the old partition?
> https://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/

 root@blackpanther:~# ls -la /var/lib/ceph/osd/ceph-15
 totale 56
 drwxr-xr-x   3 root root  199 nov 21 23:08 .
 drwxr-xr-x   6 root root 4096 nov 21 23:08 ..
 -rw-r--r--   1 root root  903 nov 21 23:08 activate.monmap
 -rw-r--r--   1 root root3 nov 21 23:08 active
 -rw-r--r--   1 root root   37 nov 21 23:08 ceph_fsid
 drwxr-xr-x 292 root root 8192 dic  9 15:02 current
 -rw-r--r--   1 root root   37 nov 21 23:08 fsid
 lrwxrwxrwx   1 root root9 nov 21 23:08 journal -> /dev/sda8
 -rw---   1 root root   57 nov 21 23:08 keyring
 -rw-r--r--   1 root root   21 nov 21 23:08 magic
 -rw-r--r--   1 root root6 nov 21 23:08 ready
 -rw-r--r--   1 root root4 nov 21 23:08 store_version
 -rw-r--r--   1 root root   53 nov 21 23:08 superblock
 -rw-r--r--   1 root root0 nov 21 23:08 sysvinit
 -rw-r--r--   1 root root3 nov 21 23:08 whoami

Ahem, I have no 'journal_uuid' file on the OSD...
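I can still read the journal partition's PARTUUID directly though, e.g.:

    # PARTUUID of the current journal partition (what journal_uuid would contain)
    blkid -o value -s PARTUUID /dev/sda8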

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Donate your 5 PER MILLE to LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(tax code 00307430132, category ONLUS or health research)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to load ceph-mgr modules: telemetry

2019-02-12 Thread Lenz Grimmer
Hi Ashley,

On 2/9/19 4:43 PM, Ashley Merrick wrote:

> Any further suggestions? Should I just ignore the error "Failed to load
> ceph-mgr modules: telemetry", or is this the root cause of the missing realtime
> I/O readings in the Dashboard?

I don't think this is related. If you don't plan to enable the telemetry
module, this error can probably be ignored. However, I wonder why you
don't see those readings. Would you mind submitting an issue on the
tracker about this, ideally with the exact Ceph version you're running
and a screenshot of where the metrics are missing?

Thanks,

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change fsid of Ceph cluster after splitting it into two clusters

2019-02-12 Thread Wido den Hollander
Hi,

I've got a situation where I need to split a Ceph cluster into two.

This cluster is currently running a mix of RBD and RGW and in this case
I am splitting it into two different clusters.

A difficult thing to do, but it's possible.

One problem that remains, though, is that after the split both Ceph clusters
will have the same fsid, and that might be confusing.

Is there a way to change the fsid of an existing cluster?

Injecting an updated MONMAP and OSDMAP into the cluster?

It's no problem if this has to be done offline, but I'm just wondering
if this is possible.
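For the MON side I was thinking of the usual offline extract/inject procedure, per
monitor (a sketch; the actual fsid rewrite in the map is the open question):

    systemctl stop ceph-mon@$(hostname -s)
    ceph-mon -i $(hostname -s) --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap              # shows the current fsid
    # ... rewrite the fsid in /tmp/monmap here ...
    ceph-mon -i $(hostname -s) --inject-monmap /tmp/monmap
    systemctl start ceph-mon@$(hostname -s)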

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS overwrite/truncate performance hit

2019-02-12 Thread Gregory Farnum
On Tue, Feb 12, 2019 at 5:10 AM Hector Martin  wrote:

> On 12/02/2019 06:01, Gregory Farnum wrote:
> > Right. Truncates and renames require sending messages to the MDS, and
> > the MDS committing to RADOS (aka its disk) the change in status, before
> > they can be completed. Creating new files will generally use a
> > preallocated inode so it's just a network round-trip to the MDS.
>
> I see. Is there a fundamental reason why these kinds of metadata
> operations cannot be buffered in the client, or is this just the current
> way they're implemented?
>

It's pretty fundamental, at least to the consistency guarantees we hold
ourselves to. What happens if the client has buffered an update like that,
performs writes to the data with those updates in mind, and then fails
before they're flushed to the MDS? A local FS doesn't need to worry about a
different node having a different lifetime, and can control the write order
of its metadata and data updates on belated flush a lot more precisely than
we can. :(
-Greg


>
> e.g. on a local FS these kinds of writes can just stick around in the
> block cache unflushed. And of course for CephFS I assume file extension
> also requires updating the file size in the MDS, yet that doesn't block
> while truncation does.
>
> > Going back to your first email, if you do an overwrite that is confined
> > to a single stripe unit in RADOS (by default, a stripe unit is the size
> > of your objects which is 4MB and it's aligned from 0), it is guaranteed
> > to be atomic. CephFS can only tear writes across objects, and only if
> > your client fails before the data has been flushed.
>
> Great! I've implemented this in a backwards-compatible way, so that gets
> rid of this bottleneck. It's just a 128-byte flag file (formerly
> variable length, now I just pad it to the full 128 bytes and rewrite it
> in-place). This is good information to know for optimizing things :-)
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://mrcn.st/pub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jewel10.2.11 EC pool out a osd,its PGs remap to the osds in the same host

2019-02-12 Thread hnuzhoulin2






Hi cephers,

I am building a Ceph EC cluster. When a disk has an error, I mark it out. But all
of its PGs remap to OSDs on the same host, while I think they should remap to
other hosts in the same rack.

The test process is:

ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2 site1_sata_erasure_ruleset 4
ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}' > /tmp/1
/etc/init.d/ceph stop osd.2
ceph osd out 2
ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}' > /tmp/2

diff /tmp/1 /tmp/2 -y --suppress-common-lines
0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
8 1.0 1.0 95 osd.8    | 8 1.0 1.0 113 osd.8
9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
TOTAL 3073T 197G      | TOTAL 3065T 197G
MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52

Some config info (detailed configs: https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125):

jewel 10.2.11, filestore + rocksdb

ceph osd erasure-code-profile get ISA-4-2
k=4
m=2
plugin=isa
ruleset-failure-domain=ctnr
ruleset-root=site1-sata
technique=reed_sol_van

Part of ceph.conf:

[global]
fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
pid file = /home/ceph/var/run/$name.pid
log file = /home/ceph/log/$cluster-$name.log
mon osd nearfull ratio = 0.85
mon osd full ratio = 0.95
admin socket = /home/ceph/var/run/$cluster-$name.asok
osd pool default size = 3
osd pool default min size = 1
osd objectstore = filestore
filestore merge threshold = -10

[mon]
keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
mon data = ""
mon cluster log file = /home/ceph/log/$cluster.log

[osd]
keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
osd data = ""
osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
osd journal size = 1
osd mkfs type = xfs
osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
osd backfill full ratio = 0.92
osd failsafe full ratio = 0.95
osd failsafe nearfull ratio = 0.85
osd max backfills = 1
osd crush update on start = false
osd op thread timeout = 60
filestore split multiple = 8
filestore max sync interval = 15
filestore min sync interval = 5

[osd.0]
host = cld-osd1-56
addr = X
user = ceph
devs = /disk/link/osd-0/data
osd journal = /disk/link/osd-0/journal
...
[osd.503]
host = cld-osd42-56
addr = 10.108.87.52
user = ceph
devs = /disk/link/osd-503/data
osd journal = /disk/link/osd-503/journal

The crushmap is below:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
...
device 502 osd.502
device 503 osd.503

# types
type 0 osd          # osd
type 1 ctnr         # sata/ssd group by node, -101~1xx/-201~2xx
type 2 media        # sata/ssd group by rack, -11~1x/-21~2x
type 3 mediagroup   # sata/ssd group by site, -5/-6
type 4 unit         # site, -2
type 5 root         # root, -1

# buckets
ctnr cld-osd1-56-sata {
    id -101          # do not change unnecessarily
    # weight 10.000
    alg straw2
    hash 0           # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
    item osd.2 weight 1.000
    item osd.3 weight 1.000
    item osd.4 weight 1.000
    item osd.5 weight 1.000
    item osd.6 weight 1.000
    item osd.7 weight 1.000
    item osd.8 weight 1.000
    item osd.9 weight 1.000
}
ctnr cld-osd1-56-ssd {
    id -201          # do not change unnecessarily
    # weight 2.000
    alg straw2
    hash 0           # rjenkins1
    item osd.10 weight 1.000
    item osd.11 weight 1.000
}
...
ctnr cld-osd41-56-sata {
    id -141          # do not change unnecessarily
    # weight 10.000
    alg straw2
    hash 0           # rjenkins1
    item osd.480 weight 1.000
    item osd.481 weight 1.000
    item osd.482 weight 1.000
    item osd.483 weight 1.000
    item osd.484 weight 1.000
    item osd.485 weight 1.000
    item osd.486 weight 1.000
    item osd.487 weight 1.000
    item osd.488 weight 1.000
    item osd.489 weight 1.000
}
ctnr cld-osd41-56-ssd {
    id -241          # do not change un

Re: [ceph-users] Failed to load ceph-mgr modules: telemetry

2019-02-12 Thread Ashley Merrick
Sure, I have created one at https://tracker.ceph.com/issues/38284

Thanks

On Wed, Feb 13, 2019 at 12:02 AM Lenz Grimmer  wrote:

> Hi Ashley,
>
> On 2/9/19 4:43 PM, Ashley Merrick wrote:
>
> > Any further suggestions? Should I just ignore the error "Failed to load
> > ceph-mgr modules: telemetry", or is this the root cause of the missing realtime
> > I/O readings in the Dashboard?
>
> I don't think this is related. If you don't plan to enable the telemetry
> module, this error can probably be ignored. However, I wonder why you
> don't see those readings. Would you mind submitting an issue on the
> tracker about this, ideally with the exact Ceph version you're running
> and a screenshot of where the metrics are missing?
>
> Thanks,
>
> Lenz
>
> --
> SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
> GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com