[ceph-users] Re: Ceph Filesystem recovery with intact pools

2020-09-01 Thread Eugen Block

Alright, I didn't realize that the MDS was affected by this as well.
In that case there's probably no other way than running the 'ceph fs
new ...' command, as Yan, Zheng suggested.
Do you have backups of your cephfs contents in case that goes wrong?
I'm not sure if a pool copy would help in any way here. I also haven't
recreated a cephfs from existing pools yet, so maybe someone else can
provide some more details about the risks of doing that. I understand
your hesitation, though.
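For reference, a minimal sketch of that path, assuming the existing pools are
named cephfs_metadata and cephfs_data (please check the CephFS
disaster-recovery documentation before running anything against the real
pools):

ceph fs new cephfs cephfs_metadata cephfs_data --force
ceph fs status cephfs
# only if the MDS refuses to go active on the existing metadata afterwards:
# ceph fs reset cephfs --yes-i-really-mean-it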


Regards,
Eugen


Quoting Cyclic 3:


Both the MDS maps and the keyrings are lost as a side effect of the monitor
recovery process I mentioned in my initial email, detailed here
https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures
.

On Mon, 31 Aug 2020 at 21:10, Eugen Block  wrote:


I don’t understand, what happened to the previous MDS? If there are
cephfs pools there also was an old MDS, right? Can you explain that
please?


Zitat von cyclic3@gmail.com:

> I added an MDS, but there was no change in either output (apart from
> recognising the existence of an MDS)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io







___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recover pgs from failed osds

2020-09-01 Thread Vahideh Alinouri
One of the failed OSDs started with a 3G memory target, and dump_mempools
shows total RAM usage of 18G, with buffer_anon using 17G!
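For reference, how these numbers were read and how the target was changed
(a sketch; the OSD id and the value are examples):

ceph daemon osd.34 dump_mempools                      # per-category usage, e.g. buffer_anon
ceph config set osd.34 osd_memory_target 3221225472   # 3 GiB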

On Mon, Aug 31, 2020 at 6:24 PM Vahideh Alinouri 
wrote:

> osd_memory_target of failed osd in one ceph-osd node changed to 6G but
> other osd_memory_target is 3G, starting failed osd with 6G memory_target
> causes other osd "down" in ceph-osd node! and failed osd is still down.
>
> On Mon, Aug 31, 2020 at 2:19 PM Eugen Block  wrote:
>
>> Can you try the opposite and turn up the memory_target and only try to
>> start a single OSD?
>>
>>
>> Zitat von Vahideh Alinouri :
>>
>> > osd_memory_target is changed to 3G, starting failed osd causes ceph-osd
>> > nodes crash! and failed osd is still "down"
>> >
>> > On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri <
>> vahideh.alino...@gmail.com>
>> > wrote:
>> >
>> >> Yes, each osd node has 7 osds with 4 GB memory_target.
>> >>
>> >>
>> >> On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:
>> >>
>> >>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
>> >>> That leaves only 4 GB RAM for the rest, and in case of heavy load the
>> >>> OSDs use even more. I would suggest to reduce the memory_target to 3
>> >>> GB and see if they start successfully.
>> >>>
>> >>>
>> >>> Zitat von Vahideh Alinouri :
>> >>>
>> >>> > osd_memory_target is 4294967296.
>> >>> > Cluster setup:
>> >>> > 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in lvm scenario.  ceph-osd
>> >>> nodes
>> >>> > resources are 32G RAM - 4 core CPU - osd disk 4TB - 9 osds have
>> >>> > block.wal on SSDs.  Public network is 1G and cluster network is 10G.
>> >>> > Cluster installed and upgraded using ceph-ansible.
>> >>> >
>> >>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  wrote:
>> >>> >
>> >>> >> What is the memory_target for your OSDs? Can you share more details
>> >>> >> about your setup? You write about high memory, are the OSD nodes
>> >>> >> affected by OOM killer? You could try to reduce the
>> osd_memory_target
>> >>> >> and see if that helps bring the OSDs back up. Splitting the PGs is
>> a
>> >>> >> very heavy operation.
>> >>> >>
>> >>> >>
>> >>> >> Zitat von Vahideh Alinouri :
>> >>> >>
>> >>> >> > Ceph cluster is updated from nautilus to octopus. On ceph-osd
>> nodes
>> >>> we
>> >>> >> have
>> >>> >> > high I/O wait.
>> >>> >> >
>> >>> >> > After increasing one of pool’s pg_num from 64 to 128 according to
>> >>> warning
>> >>> >> > message (more objects per pg), this lead to high cpu load and ram
>> >>> usage
>> >>> >> on
>> >>> >> > ceph-osd nodes and finally crashed the whole cluster. Three osds,
>> >>> one on
>> >>> >> > each host, stuck at down state (osd.34 osd.35 osd.40).
>> >>> >> >
>> >>> >> > Starting the down osd service causes high ram usage and cpu load
>> and
>> >>> >> > ceph-osd node to crash until the osd service fails.
>> >>> >> >
>> >>> >> > The active mgr service on each mon host will crash after
>> consuming
>> >>> almost
>> >>> >> > all available ram on the physical hosts.
>> >>> >> >
>> >>> >> > I need to recover pgs and solving corruption. How can i recover
>> >>> unknown
>> >>> >> and
>> >>> >> > down pgs? Is there any way to starting up failed osd?
>> >>> >> >
>> >>> >> >
>> >>> >> > Below steps are done:
>> >>> >> >
>> >>> >> > 1- osd nodes’ kernel was upgraded to 5.4.2 before ceph cluster
>> >>> upgrading.
>> >>> >> > Reverting to previous kernel 4.2.1 is tested for iowate
>> decreasing,
>> >>> but
>> >>> >> it
>> >>> >> > had no effect.
>> >>> >> >
>> >>> >> > 2- Recovering 11 pgs on failed osds by export them using
>> >>> >> > ceph-objectstore-tools utility and import them on other osds. The
>> >>> result
>> >>> >> > followed: 9 pgs are “down” and 2 pgs are “unknown”.
>> >>> >> >
>> >>> >> > 2-1) 9 pgs export and import successfully but status is “down”
>> >>> because of
>> >>> >> > "peering_blocked_by" 3 failed osds. I cannot lost osds because of
>> >>> >> > preventing unknown pgs from getting lost. pgs size in K and M.
>> >>> >> >
>> >>> >> > "peering_blocked_by": [
>> >>> >> >
>> >>> >> > {
>> >>> >> >
>> >>> >> > "osd": 34,
>> >>> >> >
>> >>> >> > "current_lost_at": 0,
>> >>> >> >
>> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >>> >> >
>> >>> >> > },
>> >>> >> >
>> >>> >> > {
>> >>> >> >
>> >>> >> > "osd": 35,
>> >>> >> >
>> >>> >> > "current_lost_at": 0,
>> >>> >> >
>> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >>> >> >
>> >>> >> > },
>> >>> >> >
>> >>> >> > {
>> >>> >> >
>> >>> >> > "osd": 40,
>> >>> >> >
>> >>> >> > "current_lost_at": 0,
>> >>> >> >
>> >>> >> > "comment": "starting or marking this osd lost may let us proceed"
>> >>> >> >
>> >>> >> > }
>> >>> >> >
>> >>> >> > ]
>> >>> >> >
>> >>> >> >
>> >>> >> > 2-2) 1 pg (2.39) export and import successfully, but after
>> starting
>> >>> osd
>> >>> >> > service (pg import to it), ceph-osd node RAM and CPU consumption
>> >>> increase
>> >>> >> > and cause ceph-osd node to crash until the osd service fails.
>> Other
>> >

[ceph-users] Re: cephfs needs access from two networks]

2020-09-01 Thread Marcel Kuiper



The mons get their bind address from the monmap, I believe. So this means
changing the IP addresses of the monitors in the monmap with the
monmaptool.
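For reference, a rough sketch of that procedure (monitor name, paths and the
address are examples; the monitor has to be stopped and its store backed up
first):

ceph-mon -i mon01 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap
monmaptool --rm mon01 /tmp/monmap
monmaptool --add mon01 203.0.113.11:6789 /tmp/monmap
ceph-mon -i mon01 --inject-monmap /tmp/monmap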

Regards

Marcel

> Hello again
>
> So I have changed the network configuration.
> Now my Ceph is reachable from outside, this also means all osd’s of all
> nodes are reachable.
> I still have the same behaviour which is a timeout.
>
> The client can resolve all nodes with their hostnames.
> The mon’s are still listening on the internal network so the nat rule is
> still there.
> I have set “public bind addr” to the external ip and restarted the mon
> but it’s still not working.
>
> [root@testnode1 ~]# ceph config get mon.public_bind_addr
> WHO MASK  LEVEL OPTIONVALUERO
> mon   advanced  public_bind_addr  v2:[ext-addr]:0/0 *
>
> Do I have to change them somewhere else too?
>
> Thanks in advance,
> Simon
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Default data pool in CEPH

2020-09-01 Thread Eugen Block

Hi,

did you apply that setting in the client's (e.g. controller, compute  
nodes) ceph.conf? You can find a description in [1].


Regards,
Eugen

[1] https://docs.ceph.com/docs/master/rbd/rbd-openstack/
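For reference, a quick way to verify on the client side that the option is
being picked up (a sketch; pool, image and client names are placeholders):

rbd create --id myclient --size 1G rbd/testimg
rbd info rbd/testimg | grep data_pool    # should point at the EC data pool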


Quoting Gabriel Medve:


Hi,

I have a CEPH 15.2.4 running in a docker. How to configure for use a  
specific data pool? i try put the follow line in the ceph.conf but  
the changes not working.  .


[client.myclient]
rbd default data pool = Mydatapool

I need it to configure for erasure pool with cloudstack

Can help me? , where is the ceph conf we i need configure?

Thanks.

--

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs needs access from two networks

2020-09-01 Thread Wido den Hollander



On 01/09/2020 08:15, Simon Sutter wrote:

Hello again

So I have changed the network configuration.
Now my Ceph is reachable from outside, this also means all osd’s of all nodes 
are reachable.
I still have the same behaviour which is a timeout.

The client can resolve all nodes with their hostnames.
The mon’s are still listening on the internal network so the nat rule is still 
there.
I have set “public bind addr” to the external ip and restarted the mon but it’s 
still not working.


It could be that the NAT is the problem here.

Just use routing and firewalling. That way clients and OSDs have direct
IP access to each other. That will make your life much easier.
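For example, a quick way to check which addresses the cluster hands out to
clients (a sketch):

ceph mon dump                   # monitor addresses the clients will use
ceph osd dump | grep 'osd\.'    # per-OSD public addresses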


Wido



[root@testnode1 ~]# ceph config get mon.public_bind_addr
WHO MASK  LEVEL OPTIONVALUERO
mon   advanced  public_bind_addr  v2:[ext-addr]:0/0 *

Do I have to change them somewhere else too?

Thanks in advance,
Simon


From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 27 August 2020 20:01
To: Simon Sutter 
Subject: Re: [ceph-users] cephfs needs access from two networks

Den tors 27 aug. 2020 kl 12:05 skrev Simon Sutter 
mailto:ssut...@hosttech.ch>>:
Hello Janne

Oh I missed that point. No, the client cannot talk directly to the osds.
In this case it’s extremely difficult to set this up.

This is an absolute requirement to be a ceph client.

How does the mon tell the client which host and port of the OSD it should 
connect to?

The same port and IP that the OSD called into the mon with when it started up 
and joined the cluster.

Can I have an influence on it?


Well, you set the ip on the OSD hosts, and the port ranges in use for OSDs are 
changeable/settable, but it would not really help the above-mentioned client.

From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 26 August 2020 15:09
To: Simon Sutter mailto:ssut...@hosttech.ch>>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] cephfs needs access from two networks

Den ons 26 aug. 2020 kl 14:16 skrev Simon Sutter 
mailto:ssut...@hosttech.ch>>:
Hello,
So I know, the mon services can only bind to just one ip.
But I have to make it accessible to two networks because internal and external 
servers have to mount the cephfs.
The internal ip is 10.99.10.1 and the external is some public-ip.
I tried nat'ing it with this: "firewall-cmd --zone=public 
--add-forward-port=port=6789:proto=tcp:toport=6789:toaddr=10.99.10.1 --permanent"

So the nat is working, because I get a "ceph v027" (alongside with some gibberish) when I 
do a telnet "telnet *public-ip* 6789"
But when I try to mount it, I get just a timeout:
mount -t ceph *public-ip*:6789:/testing /mnt -o 
name=test,secretfile=/root/ceph.client.test.key
mount error 110 = Connection timed out

The tcpdump also recognizes a "Ceph Connect" packet, coming from the mon.

How can I get around this problem?
Is there something I have missed?

Any ceph client will need direct access to all OSDs involved also. Your mail 
doesn't really say if the cephfs-mounting client can talk to OSDs?

In ceph, traffic is not shuffled via mons, mons only tell the client which OSDs 
it needs to talk to, then all IO goes directly from client to any involved OSD 
servers.

--
May the most significant bit of your life be positive.


--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw.none vs quota

2020-09-01 Thread Konstantin Shalygin

On 8/24/20 11:20 PM, Jean-Sebastien Landry wrote:

Hi everyone, a bucket was overquota, (default quota of 300k objects per 
bucket), I enabled the object quota for this bucket and set a quota of 600k 
objects.

We are on Luminous (12.2.12) and dynamic resharding is disabled, I manually do 
the resharding from 3 to 6 shards.

Since then, radosgw-admin bucket stats report a `rgw.none` in the usage section 
for this bucket.

I search the mailing-lists, bugzilla, github, it's look like I can ignore the 
rgw.none stats. (0 byte object, entry left in the index marked as cancelled...)
but, the num_object in rgw.none is part of the quota usage.

I bump the quota to 800k object to workaround the problem. (without resharding)

Is there a way I can garbage collect the rgw.none?
Is this problem fixed in Mimic/Nautilus/Octopus?

 "usage": {
 "rgw.none": {
 "size": 0,
 "size_actual": 0,
 "size_utilized": 0,
 "size_kb": 0,
 "size_kb_actual": 0,
 "size_kb_utilized": 0,
 "num_objects": 417827
 },
 "rgw.main": {
 "size": 1390778138502,
 "size_actual": 1391581007872,
 "size_utilized": 1390778138502,
 "size_kb": 1358181776,
 "size_kb_actual": 1358965828,
 "size_kb_utilized": 1358181776,
 "num_objects": 305637
 }
 },

Try to upgrade to 12.2.13 first. Many RGW bugs are fixed in this 
release, including `--fix`, `stale instances`, `lc after reshard`, etc...
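Once on a fixed release, the index entries behind rgw.none can usually be
rechecked with something along these lines (a sketch; the bucket name is a
placeholder):

radosgw-admin bucket check --bucket=mybucket
radosgw-admin bucket check --bucket=mybucket --fix --check-objects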




k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow "rados ls"

2020-09-01 Thread Marcel Kuiper
As a matter of fact we did. We doubled the number of storage nodes from 25 to
50; total OSDs are now 460.

Do you want to share your thoughts on that?

Regards

Marcel

> On 2020-08-31 14:16, Marcel Kuiper wrote:
>> The compaction of the bluestore-kv's helped indeed. The repons is back
>> to
>> acceptable levels
>
> Just curious. Did you do any cluster expansion and or PG expansion
> before the slowness occurred?
>
> Gr. Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RandomCrashes on OSDs Attached to Mon Hosts with Octopus

2020-09-01 Thread Denis Krienbühl
Hi Igor

To bring this thread to a conclusion: We managed to stop the random crashes by 
restarting each of the OSDs manually.

After upgrading the cluster we reshuffled a lot of our data by changing PG 
counts. It seems like the memory reserved during that time was never released 
back to the OS.

Though we did not see any change in swap usage (swap page in/out was actually 
lower than before the upgrade), the OSDs did not climb back to the memory levels 
they had used before the restart in the days that followed. We also stopped 
seeing random crashes.

I can't say definitively what the error was, but for us these random crashes were 
solved by restarting all OSDs. Maybe this helps somebody else searching for 
this error in the future.
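For anyone tracking the same thing, this is roughly how we polled the
bluestore_reads_with_retries counter discussed further down in the thread
(a sketch; admin socket paths depend on the deployment):

for sock in /var/run/ceph/ceph-osd.*.asok; do
  echo -n "$sock: "
  ceph daemon "$sock" perf dump | grep -o '"bluestore_reads_with_retries": [0-9]*'
done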

Thanks again for your help!

Denis

> On 27 Aug 2020, at 13:46, Denis Krienbühl  wrote:
> 
> Hi Igor
> 
> Just to clarify:
> 
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find where the ones that preceed the crashes.
>> 
>> Are you able to find multiple _verify_csum precisely?
> 
> There are no “_verify_csum” entries whatsoever. I wrote that wrongly.
> I could only find “checksum mismatch” right when the crash happens.
> 
> Sorry for the confusion.
> 
> I will keep tracking those counters and have a look at monitor/osd memory 
> tracking.
> 
> Cheers,
> 
> Denis
> 
>> On 27 Aug 2020, at 13:39, Igor Fedotov > > wrote:
>> 
>> Hi Denis
>> 
>> please see my comments inline.
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 8/27/2020 10:06 AM, Denis Krienbühl wrote:
>>> Hi Igor,
>>> 
>>> Thanks for your input. I tried to gather as much information as I could to
>>> answer your questions. Hopefully we can get to the bottom of this.
>>> 
 0) What is backing disks layout for OSDs in question (main device type?, 
 additional DB/WAL devices?).
>>> Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per 
>>> NVMe
>>> device. There is no additional DB/WAL device and there are no HDDs involved.
>>> 
>>> Also note that we use 40 OSDs per host with a memory target of 
>>> 6'174'015'488.
>>> 
 1) Please check all the existing logs for OSDs at "failing" nodes for 
 other checksum errors (as per my comment #38)
>>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>>> occurrences I could find where the ones that preceed the crashes.
>> 
>> Are you able to find multiple _verify_csum precisely?
>> 
>> If so this means data read failures were observed at user data not RocksDB 
>> one. Which backs the hypothesis about interim  disk read
>> 
>> errors as a root cause. User data reading has quite a different access stack 
>> and is able to retry after such errors hence they aren't that visible.
>> 
>> But having checksum failures for both DB and user data points to the same 
>> root cause at lower layers (kernel, I/O stack etc).
>> 
>> It might be interesting whether _verify_csum and RocksDB csum were happening 
>> nearly at the same period of time. Not even for a single OSD but for 
>> different OSDs of the same node.
>> 
>> This might indicate that node was suffering from some decease at that time. 
>> Anything suspicious from system-wide logs for this time period?
>> 
>>> 
 2) Check if BlueFS spillover is observed for any failing OSDs.
>>> As everything is on the same device, there can be no spillover, right?
>> Right
>>> 
 3) Check "bluestore_reads_with_retries" performance counters for all OSDs 
 at nodes in question. See comments 38-42 on the details. Any non-zero 
 values?
>>> I monitored this over night by repeatedly polling this performance counter 
>>> over
>>> all OSDs on the mons. Only one OSD, which has crashed in the past, has had a
>>> value of 1 since I started measuring. All the other OSDs, including the ones
>>> that crashed over night, have a value of 0. Before and after the crash.
>> 
>> Even a single occurrence isn't expected - this counter should always be 
>> equal to 0. And presumably these are peak hours when the cluster is exposed 
>> to the issue at most. Night is likely to be not the the peak period though. 
>> So please keep tracking...
>> 
>> 
>>> 
 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>>> The memory use of those nodes is pretty constant with ~6GB free, ~25GB 
>>> availble of 256GB.
>>> There are also only a handful of pages being swapped, if at all.
>>> 
 a hypothesis why mon hosts are affected only  - higher memory utilization 
 at these nodes is what causes disk reading failures to appear. RAM leakage 
 (or excessive utilization) in MON processes or something?
>>> Since the memory usage is rather constant I'm not sure this is the case, I 
>>> think
>>> we would see more of an up/down pattern. However we are not yet monitoring 
>>> all
>>> processes, and that would be somthing I'd like to get some data on, but I'm 
>>> not
>>> sure this is the right course of action at the moment

[ceph-users] Re: slow "rados ls"

2020-09-01 Thread Stefan Kooman
On 2020-08-31 14:16, Marcel Kuiper wrote:
> The compaction of the bluestore-kv's helped indeed. The repons is back to
> acceptable levels

Just curious. Did you do any cluster expansion and or PG expansion
before the slowness occurred?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw.none vs quota

2020-09-01 Thread EDH - Manuel Rios
Just ignore rgw.none; as far as I investigated it is an old bug, just a 
representation issue.

Newer versions and newly created buckets don't have rgw.none anymore, and right 
now there's no way to remove the rgw.none section.

I'm on Nautilus 14.2.11 and rgw.none has not been present for several versions 
now...

-Original Message-
From: Konstantin Shalygin  
Sent: Tuesday, 1 September 2020 10:30
To: Jean-Sebastien Landry ; 
ceph-users@ceph.io
Subject: [ceph-users] Re: rgw.none vs quota

On 8/24/20 11:20 PM, Jean-Sebastien Landry wrote:
> Hi everyone, a bucket was overquota, (default quota of 300k objects per 
> bucket), I enabled the object quota for this bucket and set a quota of 600k 
> objects.
>
> We are on Luminous (12.2.12) and dynamic resharding is disabled, I manually 
> do the resharding from 3 to 6 shards.
>
> Since then, radosgw-admin bucket stats report a `rgw.none` in the usage 
> section for this bucket.
>
> I search the mailing-lists, bugzilla, github, it's look like I can 
> ignore the rgw.none stats. (0 byte object, entry left in the index marked as 
> cancelled...) but, the num_object in rgw.none is part of the quota usage.
>
> I bump the quota to 800k object to workaround the problem. (without 
> resharding)
>
> Is there a way I can garbage collect the rgw.none?
> Is this problem fixed in Mimic/Nautilus/Octopus?
>
>  "usage": {
>  "rgw.none": {
>  "size": 0,
>  "size_actual": 0,
>  "size_utilized": 0,
>  "size_kb": 0,
>  "size_kb_actual": 0,
>  "size_kb_utilized": 0,
>  "num_objects": 417827
>  },
>  "rgw.main": {
>  "size": 1390778138502,
>  "size_actual": 1391581007872,
>  "size_utilized": 1390778138502,
>  "size_kb": 1358181776,
>  "size_kb_actual": 1358965828,
>  "size_kb_utilized": 1358181776,
>  "num_objects": 305637
>  }
>  },
>
Try to upgrade to 12.2.13 first. Many RGW bugs are fixed in this release, 
including `--fix`, `stale instances`, `lc after reshard`, etc...



k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Avail Online Home Help Services From Qualified Experts

2020-09-01 Thread amarasmith488
Are you unable to complete your homework assignment? Are you looking for a 
reliable online homework help service provider? LiveWebTutors is one of the 
best and most reliable companies when it comes to providing an assignment 
writing service. You can connect with the experts whether you are in need of a 
dissertation writing service or coursework help service. The professionals will 
ensure that your writing needs are covered within the given deadline and that 
too as per the instructions stated by the college professor! So, get your 
grades better by connecting with the professionals of LiveWebTutors now!
Visit us: https://www.livewebtutors.com/usa/coursework-help
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd regularly wrongly marked down

2020-09-01 Thread Francois Legrand

Hello,
During the night osd.16 crashed after hitting a suicide timeout. So this 
morning I did a ceph-kvstore-tool compact and restarted the OSD.
I then compared the results of 'ceph daemon osd.16 perf dump' from before 
(i.e. yesterday) and after the compaction. I noticed an interesting 
difference in msgr_active_connections. Before the compaction it was at a 
crazy value (18446744073709550998) for all of AsyncMessenger::Worker-0, 1 
and 2, and it went back to something comparable to what I have for the 
other OSDs (72).

Does this help you to identify the problem?
F.
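For reference, the compaction and the before/after check were done roughly
like this (a sketch; non-containerized service names assumed, osd.16 as in
this thread):

systemctl stop ceph-osd@16
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
systemctl start ceph-osd@16
ceph daemon osd.16 perf dump | grep msgr_active_connections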



On 31/08/2020 at 15:59, Wido den Hollander wrote:



On 31/08/2020 15:44, Francois Legrand wrote:

Thanks Igor for your answer,

We could try do a compaction of RocksDB manually, but it's not clear 
to me if we have to compact on the mon with something like

ceph-kvstore-tool rocksdb  /var/lib/ceph/mon/mon01/store.db/ compact
or on the concerned osd with
ceph-kvstore-tool rocksdb  /var/lib/ceph/osd/ceph-16/ compact
(or for all osd with a script like in 
https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )


You would compact the OSDs, not the MONs. So the last command or my 
script which you linked there.


For my culture, how does compaction works ? Is it done automatically 
in background, regularly, at startup ?


Usually it's done by the OSD in the background, but sometimes an 
offline compact works best.


Because in the logs of the osd we have every 10mn some reports about 
compaction (which suggests that compaction occurs regularly), like :




Yes, that is normal. But the offline compaction is sometimes more 
effective than the online ones are.



2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:777] 
--- DUMPING STATS ---

2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 449404.8 total, 600.0 interval
Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 
writes per commit group, ingest: 0.28 GB, 0.00 MB/s
Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, 
written: 0.28 GB, 0.00 MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes 
per commit group, ingest: 0.22 MB, 0.00 MB/s
Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 
0.00 MB, 0.00 MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) 
Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) 
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
 

   L0  1/0   60.48 MB   0.2  0.0 0.0 0.0   0.1 
0.1   0.0   1.0  0.0    163.7 0.52  0.40 2    
0.258   0  0
   L1  0/0    0.00 KB   0.0  0.1 0.1 0.0   0.1 
0.1   0.0   0.5 48.2 26.1 2.32  0.64 1    
2.319    920K   197K
   L2 17/0    1.00 GB   0.8  1.1 0.1 1.1   1.1 
0.0   0.0  18.3 69.8 67.5 16.38  4.97 1   
16.380   4747K    82K
   L3 81/0    4.50 GB   0.9  0.6 0.1 0.5   0.3 
-0.2   0.0   4.3 66.9 36.6 9.23  4.95 2    
4.617   9544K   802K
   L4    285/0   16.64 GB   0.1  2.4 0.3 2.0   0.2 
-1.8   0.0   0.8    110.3 11.7 21.92  4.37 5    
4.384 12M    12M
  Sum    384/0   22.20 GB   0.0  4.2 0.6 3.6   1.8 
-1.8   0.0  21.8 85.2 36.6 50.37 15.32 11    
4.579 28M    13M
  Int  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.0 
0.0   0.0   0.0  0.0  0.0 0.00  0.00 0    
0.000   0  0


** Compaction Stats [default] **
Priority    Files   Size Score Read(GB)  Rn(GB) Rnp1(GB) 
Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) 
CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
--- 

  Low  0/0    0.00 KB   0.0  4.2 0.6 3.6   1.7 
-1.9   0.0   0.0 86.0 35.3 49.86 14.92 9    
5.540 28M    13M
High  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.1 
0.1   0.0   0.0  0.0    150.2 0.40  0.40 1    
0.403   0  0
User  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.0 
0.0   0.0   0.0  0.0    211.7 0.11  0.00 1    
0.114   0  0

Uptime(secs): 449404.8 total, 600.0 interval
Flush(GB): cumulative 0.083, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read, 
0.01 

[ceph-users] Re: Recover pgs from failed osds

2020-09-01 Thread Vahideh Alinouri
Is there no solution or advice?
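For reference, the export/import mentioned in step 2 of the quoted message
below was done roughly like this (a sketch; the OSD ids, PG id and file path
are placeholders, and both OSDs must be stopped while the tool runs):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
  --op export --pgid 2.39 --file /root/pg2.39.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
  --op import --file /root/pg2.39.export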

On Tue, Sep 1, 2020, 11:53 AM Vahideh Alinouri 
wrote:

> One of failed osd with 3G RAM started and dump_mempools shows total RAM
> usage is 18G and buff_anon uses 17G RAM!
>
> On Mon, Aug 31, 2020 at 6:24 PM Vahideh Alinouri <
> vahideh.alino...@gmail.com> wrote:
>
>> osd_memory_target of failed osd in one ceph-osd node changed to 6G but
>> other osd_memory_target is 3G, starting failed osd with 6G memory_target
>> causes other osd "down" in ceph-osd node! and failed osd is still down.
>>
>> On Mon, Aug 31, 2020 at 2:19 PM Eugen Block  wrote:
>>
>>> Can you try the opposite and turn up the memory_target and only try to
>>> start a single OSD?
>>>
>>>
>>> Zitat von Vahideh Alinouri :
>>>
>>> > osd_memory_target is changed to 3G, starting failed osd causes ceph-osd
>>> > nodes crash! and failed osd is still "down"
>>> >
>>> > On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri <
>>> vahideh.alino...@gmail.com>
>>> > wrote:
>>> >
>>> >> Yes, each osd node has 7 osds with 4 GB memory_target.
>>> >>
>>> >>
>>> >> On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:
>>> >>
>>> >>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
>>> >>> That leaves only 4 GB RAM for the rest, and in case of heavy load the
>>> >>> OSDs use even more. I would suggest to reduce the memory_target to 3
>>> >>> GB and see if they start successfully.
>>> >>>
>>> >>>
>>> >>> Zitat von Vahideh Alinouri :
>>> >>>
>>> >>> > osd_memory_target is 4294967296.
>>> >>> > Cluster setup:
>>> >>> > 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in lvm scenario.
>>> ceph-osd
>>> >>> nodes
>>> >>> > resources are 32G RAM - 4 core CPU - osd disk 4TB - 9 osds have
>>> >>> > block.wal on SSDs.  Public network is 1G and cluster network is
>>> 10G.
>>> >>> > Cluster installed and upgraded using ceph-ansible.
>>> >>> >
>>> >>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  wrote:
>>> >>> >
>>> >>> >> What is the memory_target for your OSDs? Can you share more
>>> details
>>> >>> >> about your setup? You write about high memory, are the OSD nodes
>>> >>> >> affected by OOM killer? You could try to reduce the
>>> osd_memory_target
>>> >>> >> and see if that helps bring the OSDs back up. Splitting the PGs
>>> is a
>>> >>> >> very heavy operation.
>>> >>> >>
>>> >>> >>
>>> >>> >> Zitat von Vahideh Alinouri :
>>> >>> >>
>>> >>> >> > Ceph cluster is updated from nautilus to octopus. On ceph-osd
>>> nodes
>>> >>> we
>>> >>> >> have
>>> >>> >> > high I/O wait.
>>> >>> >> >
>>> >>> >> > After increasing one of pool’s pg_num from 64 to 128 according
>>> to
>>> >>> warning
>>> >>> >> > message (more objects per pg), this lead to high cpu load and
>>> ram
>>> >>> usage
>>> >>> >> on
>>> >>> >> > ceph-osd nodes and finally crashed the whole cluster. Three
>>> osds,
>>> >>> one on
>>> >>> >> > each host, stuck at down state (osd.34 osd.35 osd.40).
>>> >>> >> >
>>> >>> >> > Starting the down osd service causes high ram usage and cpu
>>> load and
>>> >>> >> > ceph-osd node to crash until the osd service fails.
>>> >>> >> >
>>> >>> >> > The active mgr service on each mon host will crash after
>>> consuming
>>> >>> almost
>>> >>> >> > all available ram on the physical hosts.
>>> >>> >> >
>>> >>> >> > I need to recover pgs and solving corruption. How can i recover
>>> >>> unknown
>>> >>> >> and
>>> >>> >> > down pgs? Is there any way to starting up failed osd?
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > Below steps are done:
>>> >>> >> >
>>> >>> >> > 1- osd nodes’ kernel was upgraded to 5.4.2 before ceph cluster
>>> >>> upgrading.
>>> >>> >> > Reverting to previous kernel 4.2.1 is tested for iowate
>>> decreasing,
>>> >>> but
>>> >>> >> it
>>> >>> >> > had no effect.
>>> >>> >> >
>>> >>> >> > 2- Recovering 11 pgs on failed osds by export them using
>>> >>> >> > ceph-objectstore-tools utility and import them on other osds.
>>> The
>>> >>> result
>>> >>> >> > followed: 9 pgs are “down” and 2 pgs are “unknown”.
>>> >>> >> >
>>> >>> >> > 2-1) 9 pgs export and import successfully but status is “down”
>>> >>> because of
>>> >>> >> > "peering_blocked_by" 3 failed osds. I cannot lost osds because
>>> of
>>> >>> >> > preventing unknown pgs from getting lost. pgs size in K and M.
>>> >>> >> >
>>> >>> >> > "peering_blocked_by": [
>>> >>> >> >
>>> >>> >> > {
>>> >>> >> >
>>> >>> >> > "osd": 34,
>>> >>> >> >
>>> >>> >> > "current_lost_at": 0,
>>> >>> >> >
>>> >>> >> > "comment": "starting or marking this osd lost may let us
>>> proceed"
>>> >>> >> >
>>> >>> >> > },
>>> >>> >> >
>>> >>> >> > {
>>> >>> >> >
>>> >>> >> > "osd": 35,
>>> >>> >> >
>>> >>> >> > "current_lost_at": 0,
>>> >>> >> >
>>> >>> >> > "comment": "starting or marking this osd lost may let us
>>> proceed"
>>> >>> >> >
>>> >>> >> > },
>>> >>> >> >
>>> >>> >> > {
>>> >>> >> >
>>> >>> >> > "osd": 40,
>>> >>> >> >
>>> >>> >> > "current_lost_at": 0,
>>> >>> >> >
>>> >>> >> > "comment": "starting or marking this osd lost may let us
>>> proceed"
>>> >>> >> >
>>> >>> >> > }
>>> >>> >> >
>>> >>>

[ceph-users] Actual block size of osd

2020-09-01 Thread Ragan, Tj (Dr.)
Hi All,

Does anyone know how to get the actual block size used by an osd?  I’m trying 
to evaluate 4k vs 64k min_alloc_size_hdd and want to verify that the newly 
created osds are actually using the expected block size.
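For what it's worth, a sketch of what can be checked: the configured defaults
only apply when an OSD is created, and the OSD log prints the value actually
in use when debug_bluestore is raised (e.g. to 10) across a restart.

ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
ceph daemon osd.0 config get bluestore_min_alloc_size_ssd
grep -i min_alloc_size /var/log/ceph/ceph-osd.0.log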

Thanks,

-TJ Ragan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm daemons vs cephadm services -- what's the difference?

2020-09-01 Thread Tony Liu
A service is a logical entity, which may have multiple instances/daemons
for scaling/LB purposes. For example, one monitor service may
have 3 daemons running on 3 nodes to provide HA.
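A quick way to see the difference on a cephadm cluster (a sketch):

ceph orch ls    # one row per service, e.g. "mon" with a placement count like 3/3
ceph orch ps    # one row per daemon, e.g. mon.host1, mon.host2, mon.host3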

Tony
> -Original Message-
> From: John Zachary Dover 
> Sent: Monday, August 31, 2020 9:47 PM
> To: ceph-users 
> Subject: [ceph-users] cephadm daemons vs cephadm services -- what's the
> difference?
> 
> What is the difference between services and daemons?
> 
> Specifically, what does it mean that "orch ps" lists cephadm daemons and
> "orch ls" lists cephadm services?
> 
> This question will help me close this bug:
> https://tracker.ceph.com/issues/47142
> 
> Zac Dover
> Upstream Docs
> Ceph
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Default data pool in CEPH

2020-09-01 Thread Gabriel Medve

Hi, thanks for the reply.

I don't use OpenStack, I use CloudStack.

Which ceph.conf file do I have to edit? I edited /etc/ceph/ceph.conf and 
/var/lib/ceph/container/mon.Storage01/config, but the config is not working.
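In case it helps, one possible alternative on a containerized cluster (a
sketch; client.myclient is a placeholder for whatever client name CloudStack
connects with) is to put the option into the cluster's central config
database instead of a local ceph.conf:

ceph config set client.myclient rbd_default_data_pool Mydatapool
ceph config get client.myclient rbd_default_data_pool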


--

On 1/9/20 at 04:47, Eugen Block wrote:

Hi,

did you apply that setting in the client's (e.g. controller, compute 
nodes) ceph.conf? You can find a description in [1].


Regards,
Eugen

[1] https://docs.ceph.com/docs/master/rbd/rbd-openstack/


Zitat von Gabriel Medve :


Hi,

I have a CEPH 15.2.4 running in a docker. How to configure for use a 
specific data pool? i try put the follow line in the ceph.conf but 
the changes not working.  .


[client.myclient]
rbd default data pool = Mydatapool

I need it to configure for erasure pool with cloudstack

Can help me? , where is the ceph conf we i need configure?

Thanks.

--

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

--

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-01 Thread Reed Dier
If using storcli/perccli for manipulating the LSI controller, you can disable 
the on-disk write cache with:
storcli /cx/vx set pdcache=off

You can also ensure that you turn off write caching at the controller level 
with 
storcli /cx/vx set iopolicy=direct
storcli /cx/vx set wrcache=wt

You can also tweak the readahead value for the vd if you want, though with an 
ssd, I don't think it will be much of an issue.
storcli /cx/vx set rdcache=nora

I'm sure the megacli alternatives are available with some quick searches.

You may also want to check your c-states and p-states to make sure there aren't 
any aggressive power-saving features getting in the way.
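For example (a sketch; tool availability depends on the distro):

cpupower frequency-info                                    # governor, driver, limits
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # per-core governor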

Reed

> On Aug 31, 2020, at 7:44 AM, VELARTIS Philipp Dürhammer 
>  wrote:
> 
> We have older LSi Raid controller with no HBA/JBOD option. So we expose the 
> single disks as raid0 devices. Ceph should not be aware of cache status?
> But digging deeper in to it it seems that 1 out of 4 serves is performing a 
> lot better and has super low commit/applay rates while the other have a lot 
> mor (20+) on heavy writes. This just applys fore the ssd. For the hdds I cant 
> see a difference...
> 
> -Ursprüngliche Nachricht-
> Von: Frank Schilder  
> Gesendet: Montag, 31. August 2020 13:19
> An: VELARTIS Philipp Dürhammer ; 
> 'ceph-users@ceph.io' 
> Betreff: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
> journals)
> 
> Yes, they can - if volatile write cache is not disabled. There are many 
> threads on this, also recent. Search for "disable write cache" and/or 
> "disable volatile write cache".
> 
> You will also find different methods of doing this automatically.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: VELARTIS Philipp Dürhammer 
> Sent: 31 August 2020 13:02:45
> To: 'ceph-users@ceph.io'
> Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
> extra journals)
> 
> I have a productive 60 osd's cluster. No extra Journals. Its performing okay. 
> Now I added an extra ssd Pool with 16 Micron 5100 MAX. And the performance is 
> little slower or equal to the 60 hdd pool. 4K random as also sequential 
> reads. All on dedicated 2 times 10G Network. HDDS are still on filestore. SSD 
> on bluestore. Ceph Luminous.
> What should be possible 16 ssd's vs. 60 hhd's no extra journals?
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-01 Thread VELARTIS Philipp Dürhammer
Thank you. I was already working in this direction. The situation is a lot 
better, but I think I can still get far better.

I could set the controller to writethrough, direct and no read-ahead for the 
SSDs.
But I cannot disable the pdcache ☹ There is an option set in the controller, 
"Block SSD Write Disk Cache Change = Yes", which does not permit deactivating 
the SSD cache. I could not find any solution on Google for this controller (LSI 
MegaRAID SAS 9271-8i) to change this setting.

I don't know how much performance gain disabling the SSD cache would bring. 
At least the Micron 5200 MAX has capacitors, so I hope it is safe against data 
loss in case of power failure. I wrote a request to LSI / Broadcom asking if 
they know how I can change this setting. This is really annoying.

I will check the CPU power settings. I also read somewhere that it can improve 
IOPS a lot (if it's badly set).

At the moment I get 600 IOPS at 4k random write with 1 thread and iodepth 1. I 
get 40K 4k random IOPS for some instances with iodepth 32. It's not the world, 
but a lot better than before. Reads are around 100k IOPS. That is with 16 SSDs 
and 2 x dual 10G NICs.

I was reading that with good tuning and hardware config you can get more than 
2000 IOPS single-threaded out of the SSDs. I know that Ceph does not shine with 
a single thread, but 600 IOPS is not very much...

philipp

-Original Message-
From: Reed Dier  
Sent: Tuesday, 1 September 2020 22:37
To: VELARTIS Philipp Dürhammer 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
extra journals)

If using storcli/perccli for manipulating the LSI controller, you can disable 
the on-disk write cache with:
storcli /cx/vx set pdcache=off

You can also ensure that you turn off write caching at the controller level 
with 
storcli /cx/vx set iopolicy=direct
storcli /cx/vx set wrcache=wt

You can also tweak the readahead value for the vd if you want, though with an 
ssd, I don't think it will be much of an issue.
storcli /cx/vx set rdcache=nora

I'm sure the megacli alternatives are available with some quick searches.

May also want to check your c-states and p-states to make sure there isn't any 
aggressive power saving features getting in the way.

Reed

> On Aug 31, 2020, at 7:44 AM, VELARTIS Philipp Dürhammer 
>  wrote:
> 
> We have older LSi Raid controller with no HBA/JBOD option. So we expose the 
> single disks as raid0 devices. Ceph should not be aware of cache status?
> But digging deeper in to it it seems that 1 out of 4 serves is performing a 
> lot better and has super low commit/applay rates while the other have a lot 
> mor (20+) on heavy writes. This just applys fore the ssd. For the hdds I cant 
> see a difference...
> 
> -Ursprüngliche Nachricht-
> Von: Frank Schilder  
> Gesendet: Montag, 31. August 2020 13:19
> An: VELARTIS Philipp Dürhammer ; 
> 'ceph-users@ceph.io' 
> Betreff: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
> journals)
> 
> Yes, they can - if volatile write cache is not disabled. There are many 
> threads on this, also recent. Search for "disable write cache" and/or 
> "disable volatile write cache".
> 
> You will also find different methods of doing this automatically.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: VELARTIS Philipp Dürhammer 
> Sent: 31 August 2020 13:02:45
> To: 'ceph-users@ceph.io'
> Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
> extra journals)
> 
> I have a productive 60 osd's cluster. No extra Journals. Its performing okay. 
> Now I added an extra ssd Pool with 16 Micron 5100 MAX. And the performance is 
> little slower or equal to the 60 hdd pool. 4K random as also sequential 
> reads. All on dedicated 2 times 10G Network. HDDS are still on filestore. SSD 
> on bluestore. Ceph Luminous.
> What should be possible 16 ssd's vs. 60 hhd's no extra journals?
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-01 Thread Marc Roos

Sorry, I am not fully aware of what has already been discussed in this 
thread, but can't you flash these LSI Logic cards to JBOD? I have done 
this with my 9207 using sas2flash.

I have attached my fio test of the Micron 5100 Pro/5200 SSDs 
MTFDDAK1T9TCC. They perform similarly to my Samsung SM863a 1.92TB. The only 
weird thing is that the rw-4k result is 3x slower on the Micron.
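For reference, numbers like the attached ones typically come from something 
along these lines (a sketch; the device path is a placeholder and the test 
overwrites whatever is on that device):

fio --name=randwrite-4k-qd1 --filename=/dev/sdX --direct=1 --ioengine=libaio \
    --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=180 --time_based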





-Original Message-
To: 'Reed Dier'
Cc: 'ceph-users@ceph.io'
Subject: [ceph-users] Re: Can 16 server grade ssd's be slower then 60 
hdds? (no extra journals)

Thank you. I was working in this direction. The situation is a lot 
better. But I think I can get still far better.

I could set the controller to writethrough, direct and no read ahead for 
the ssds.
But I cannot disable the pdcache ☹ there is an option set in the 
controller "Block SSD Write Disk Cache Change = Yes" which does not 
permit to deactivate the ssd cache. I could not find any solution in 
google for this controller (LSI MegaRAID SAS 9271-8i) to change this 
setting.

I don’t know how much performance gain it will be to deactivate the ssd 
cache. At least the micron 5200max has capacitor so I hope it is safe 
for data loss in case if power failure. I wrote a request to lsi / 
Broadcom if they know how I can change this setting. This is really 
annyoing.

I will check the cpu power settings. I rode also somewhere it can 
improve iops a lot. (if its bad set)

At the moment I get 600iops 4k random write 1 thread and 1 iodepth. I 
get 40K - 4k random iops for some instances with 32iodepth. Its not the 
world but a lot better then before. Read around 100k iops. For 16 ssd's 
and 2 x dual 10G nic.

I was reading that good tunings and hardware config can get more then 
2000 iops on single thread out of the ssds. I know thet ceph does not 
shine with single thread. But 600 iops is not very much...

philipp

-Original Message-
From: Reed Dier 
Sent: Tuesday, 1 September 2020 22:37
To: VELARTIS Philipp Dürhammer 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 
hdds? (no extra journals)

If using storcli/perccli for manipulating the LSI controller, you can 
disable the on-disk write cache with:
storcli /cx/vx set pdcache=off

You can also ensure that you turn off write caching at the controller 
level with storcli /cx/vx set iopolicy=direct storcli /cx/vx set 
wrcache=wt

You can also tweak the readahead value for the vd if you want, though 
with an ssd, I don't think it will be much of an issue.
storcli /cx/vx set rdcache=nora

I'm sure the megacli alternatives are available with some quick 
searches.

May also want to check your c-states and p-states to make sure there 
isn't any aggressive power saving features getting in the way.

Reed

> On Aug 31, 2020, at 7:44 AM, VELARTIS Philipp Dürhammer 
 wrote:
> 
> We have older LSi Raid controller with no HBA/JBOD option. So we 
expose the single disks as raid0 devices. Ceph should not be aware of 
cache status?
> But digging deeper in to it it seems that 1 out of 4 serves is 
performing a lot better and has super low commit/applay rates while the 
other have a lot mor (20+) on heavy writes. This just applys fore the 
ssd. For the hdds I cant see a difference...
> 
> -Ursprüngliche Nachricht-
> Von: Frank Schilder 
> Gesendet: Montag, 31. August 2020 13:19
> An: VELARTIS Philipp Dürhammer ; 
> 'ceph-users@ceph.io' 
> Betreff: Re: Can 16 server grade ssd's be slower then 60 hdds? (no 
> extra journals)
> 
> Yes, they can - if volatile write cache is not disabled. There are 
many threads on this, also recent. Search for "disable write cache" 
and/or "disable volatile write cache".
> 
> You will also find different methods of doing this automatically.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: VELARTIS Philipp Dürhammer 
> Sent: 31 August 2020 13:02:45
> To: 'ceph-users@ceph.io'
> Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 
> hdds? (no extra journals)
> 
> I have a productive 60 osd's cluster. No extra Journals. Its 
performing okay. Now I added an extra ssd Pool with 16 Micron 5100 MAX. 
And the performance is little slower or equal to the 60 hdd pool. 4K 
random as also sequential reads. All on dedicated 2 times 10G Network. 
HDDS are still on filestore. SSD on bluestore. Ceph Luminous.
> What should be possible 16 ssd's vs. 60 hhd's no extra journals?
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io



[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-01 Thread Marc Roos
 


write-4k-seq: (groupid=0, jobs=1): err= 0: pid=11017: Tue Sep  1 
20:58:43 2020
  write: IOPS=34.4k, BW=134MiB/s (141MB/s)(23.6GiB/180001msec)
slat (nsec): min=3964, max=124499, avg=4432.71, stdev=911.13
clat (nsec): min=470, max=435529, avg=23528.70, stdev=2553.67
 lat (usec): min=24, max=445, avg=28.08, stdev= 3.04
clat percentiles (nsec):
 |  1.00th=[22144],  5.00th=[22400], 10.00th=[22400], 
20.00th=[22656],
 | 30.00th=[22656], 40.00th=[22656], 50.00th=[22912], 
60.00th=[22912],
 | 70.00th=[22912], 80.00th=[23680], 90.00th=[25984], 
95.00th=[27008],
 | 99.00th=[32384], 99.50th=[35072], 99.90th=[40192], 
99.95th=[41728],
 | 99.99th=[54016]
   bw (  KiB/s): min=51928, max=140088, per=70.28%, avg=96694.64, 
stdev=15493.31, samples=359
   iops: min=12982, max=35022, avg=24173.37, stdev=3873.39, 
samples=359
  lat (nsec)   : 500=0.01%, 750=0.01%
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=99.98%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%
  cpu  : usr=5.61%, sys=18.02%, ctx=6191054, majf=0, minf=7
  IO depths: 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 issued rwts: total=0,6190964,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1
randwrite-4k-seq: (groupid=1, jobs=1): err= 0: pid=11050: Tue Sep  1 
20:58:43 2020
  write: IOPS=34.6k, BW=135MiB/s (142MB/s)(23.7GiB/180001msec)
slat (nsec): min=3956, max=140599, avg=4368.89, stdev=537.59
clat (nsec): min=492, max=488951, avg=23289.09, stdev=2404.05
 lat (usec): min=24, max=494, avg=27.77, stdev= 2.59
clat percentiles (nsec):
 |  1.00th=[21888],  5.00th=[22144], 10.00th=[22144], 
20.00th=[22400],
 | 30.00th=[22400], 40.00th=[22400], 50.00th=[22656], 
60.00th=[22656],
 | 70.00th=[22656], 80.00th=[23424], 90.00th=[25728], 
95.00th=[26752],
 | 99.00th=[30336], 99.50th=[32128], 99.90th=[37120], 
99.95th=[39168],
 | 99.99th=[46848]
   bw (  KiB/s): min=48176, max=139232, per=70.44%, avg=97396.63, 
stdev=14897.43, samples=359
   iops: min=12044, max=34808, avg=24348.81, stdev=3724.44, 
samples=359
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.98%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%
  cpu  : usr=6.06%, sys=17.91%, ctx=6222106, majf=0, minf=5
  IO depths: 1=116.8%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 issued rwts: total=0,6222038,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1
read-4k-seq: (groupid=2, jobs=1): err= 0: pid=11068: Tue Sep  1 20:58:43 
2020
   read: IOPS=22.7k, BW=88.6MiB/s (92.9MB/s)(15.6GiB/180001msec)
slat (usec): min=5, max=603, avg= 6.77, stdev= 1.64
clat (nsec): min=694, max=426738, avg=35641.98, stdev=12418.44
 lat (usec): min=31, max=636, avg=42.59, stdev=12.68
clat percentiles (usec):
 |  1.00th=[   31],  5.00th=[   32], 10.00th=[   32], 20.00th=[   
32],
 | 30.00th=[   32], 40.00th=[   33], 50.00th=[   33], 60.00th=[   
33],
 | 70.00th=[   34], 80.00th=[   36], 90.00th=[   40], 95.00th=[   
44],
 | 99.00th=[  104], 99.50th=[  105], 99.90th=[  109], 99.95th=[  
110],
 | 99.99th=[  112]
   bw (  KiB/s): min=45664, max=92376, per=70.46%, avg=63903.11, 
stdev=10098.57, samples=359
   iops: min=11416, max=23094, avg=15975.47, stdev=2524.71, 
samples=359
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=96.83%, 100=0.47%, 250=2.69%
  lat (usec)   : 500=0.01%
  cpu  : usr=5.61%, sys=18.15%, ctx=4081143, majf=0, minf=6
  IO depths: 1=116.7%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
>=64=0.0%
 issued rwts: total=4081084,0,0,0 short=0,0,0,0 dropped=0,0,0,0
 latency   : target=0, window=0, percentile=100.00%, depth=1
randread-4k-seq: (groupid=3, jobs=1): err= 0: pid=11094: Tue Sep  1 
20:58:43 2020
   read: IOPS=17.5k, BW=68.4MiB/s (71.7MB/s)(12.0GiB/180001msec)
slat (usec): min=6, max=111, avg=12.99, stdev= 4.03
clat (nsec): min=894, max=690146, avg=41739.21, stdev=16973.22
 lat (usec): min=36, max=698, avg=54.95, stdev=17.78
clat percentiles (usec):
 |  1.00th=[   34],  5.00th=[   37], 10.00th=[   37], 20.00th=[   
38],
 | 30.00th=[   38], 40.00th=[   40], 50.00th=[   40], 60.00th=[   
40],
 | 70.00th=[   40], 80.00th=[   41], 90.00th=[   42], 95.00th=[   
46],
 | 99.00th=[  143], 99.50th=[  161], 99.90th=[  174], 99.95th=[  
176]

[ceph-users] Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra journals)

2020-09-01 Thread Reed Dier
> there is an option set in the controller "Block SSD Write Disk Cache Change = 
> Yes" which does not permit to deactivate the ssd cache. I could not find any 
> solution in google for this controller (LSI MegaRAID SAS 9271-8i) to change 
> this setting.


I assume you are referencing this parameter?

storcli /c0/v0 set ssdcaching=

If so, this is for CacheCade, LSI's cache tiering solution, which should be 
both off and not in use for Ceph.

Single thread and single iodepth benchmarks will tend to be underwhelming.
Ceph shines with aggregate performance from lots of clients.
And in an odd twist of fate, I typically see better performance on RBD for 
random benchmarks rather than sequential benchmarks, as it distributes the load 
across more OSD's.

It might also help others offer some pointers for tuning if you describe the 
pool/application a bit more, i.e. RBD vs CephFS vs RGW, 3x replicated vs EC, 
etc.

At least things are trending in a positive direction.

Reed

> On Sep 1, 2020, at 4:21 PM, VELARTIS Philipp Dürhammer 
>  wrote:
> 
> Thank you. I was working in this direction. The situation is a lot better. 
> But I think I can get still far better.
> 
> I could set the controller to writethrough, direct and no read ahead for the 
> ssds.
> But I cannot disable the pdcache ☹ there is an option set in the controller 
> "Block SSD Write Disk Cache Change = Yes" which does not permit to deactivate 
> the ssd cache. I could not find any solution in google for this controller 
> (LSI MegaRAID SAS 9271-8i) to change this setting.
> 
> I don’t know how much performance gain it will be to deactivate the ssd 
> cache. At least the micron 5200max has capacitor so I hope it is safe for 
> data loss in case if power failure. I wrote a request to lsi / Broadcom if 
> they know how I can change this setting. This is really annyoing.
> 
> I will check the cpu power settings. I rode also somewhere it can improve 
> iops a lot. (if its bad set)
> 
> At the moment I get 600iops 4k random write 1 thread and 1 iodepth. I get 40K 
> - 4k random iops for some instances with 32iodepth. Its not the world but a 
> lot better then before. Read around 100k iops. For 16 ssd's and 2 x dual 10G 
> nic.
> 
> I was reading that good tunings and hardware config can get more then 2000 
> iops on single thread out of the ssds. I know thet ceph does not shine with 
> single thread. But 600 iops is not very much...
> 
> philipp
> 
> -Ursprüngliche Nachricht-
> Von: Reed Dier  
> Gesendet: Dienstag, 01. September 2020 22:37
> An: VELARTIS Philipp Dürhammer 
> Cc: ceph-users@ceph.io
> Betreff: Re: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? 
> (no extra journals)
> 
> If using storcli/perccli for manipulating the LSI controller, you can disable 
> the on-disk write cache with:
> storcli /cx/vx set pdcache=off
> 
> You can also ensure that you turn off write caching at the controller level 
> with 
> storcli /cx/vx set iopolicy=direct
> storcli /cx/vx set wrcache=wt
> 
> You can also tweak the readahead value for the vd if you want, though with an 
> ssd, I don't think it will be much of an issue.
> storcli /cx/vx set rdcache=nora
> 
> I'm sure the megacli alternatives are available with some quick searches.
> 
> May also want to check your c-states and p-states to make sure there isn't 
> any aggressive power saving features getting in the way.
> 
> Reed
> 
>> On Aug 31, 2020, at 7:44 AM, VELARTIS Philipp Dürhammer 
>>  wrote:
>> 
>> We have older LSi Raid controller with no HBA/JBOD option. So we expose the 
>> single disks as raid0 devices. Ceph should not be aware of cache status?
>> But digging deeper in to it it seems that 1 out of 4 serves is performing a 
>> lot better and has super low commit/applay rates while the other have a lot 
>> mor (20+) on heavy writes. This just applys fore the ssd. For the hdds I 
>> cant see a difference...
>> 
>> -Ursprüngliche Nachricht-
>> Von: Frank Schilder  
>> Gesendet: Montag, 31. August 2020 13:19
>> An: VELARTIS Philipp Dürhammer ; 
>> 'ceph-users@ceph.io' 
>> Betreff: Re: Can 16 server grade ssd's be slower then 60 hdds? (no extra 
>> journals)
>> 
>> Yes, they can - if volatile write cache is not disabled. There are many 
>> threads on this, also recent. Search for "disable write cache" and/or 
>> "disable volatile write cache".
>> 
>> You will also find different methods of doing this automatically.
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: VELARTIS Philipp Dürhammer 
>> Sent: 31 August 2020 13:02:45
>> To: 'ceph-users@ceph.io'
>> Subject: [ceph-users] Can 16 server grade ssd's be slower then 60 hdds? (no 
>> extra journals)
>> 
>> I have a productive 60 osd's cluster. No extra Journals. Its performing 
>> okay. Now I added an extra ssd Pool with 16 Micron 5100 MAX. And the 
>> performance is little slower or equal to the 60 hdd pool.

[ceph-users] Rbd image corrupt or locked somehow

2020-09-01 Thread Salsa
Hi,

I have set up a 3-host cluster with 30 OSDs total. The cluster has health OK and 
no warning whatsoever. I set up an RBD pool and 14 images which were all 
rbd-mirrored to a second cluster (which has been disconnected since the problems 
began) and also an iSCSI interface. Then I connected a Windows 2019 Server 
through iSCSI, mounted all 14 drives and created a spanned volume with all the 
drives. Everything was working fine, but I had to disconnect the server, so I 
disconnected the iSCSI interface, and when I tried to reconnect, my volume was 
unusable and the drives seemed stuck. I ended up rebooting each cluster node and 
then later, since I still couldn't use my images, removed and recreated all 
images.

In this second run all was good: I had robocopy syncing files to my ceph 
cluster for almost a week and had already copied more than 5TB of data when my 
Windows Server got stuck. I'm still not sure why it got stuck; some services 
like FTP were responding but others, including login, were not. So I reset the 
Windows server, and when it was back up my spanned volume was bad again. I've 
been trying to recover it for the last 2 days but without success.

Right now all images are disconnected. I have no locks (I found some at some 
point and removed them, but I am not sure who was holding them) and no watchers 
on any of the images, but the 3 images that had data in them are corrupt or 
locked somehow. Nothing I try works on them and the operation gets stuck. I can 
edit the images' config, but not for these 3. I can create snapshots, but not on 
these 3. I managed to mount images using iSCSI on a Linux box, but these 3 leave 
Linux commands (fdisk, parted) hanging. The Ceph dashboard shows stats like read 
and write rate for all images except these 3.

It seems something inside these images is broken or stuck, but as I said there 
are no locks on them.

I tried a lot of options, and somehow my cluster now has some RGW pools that I 
have no idea where they came from.

Any idea what I should do?
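For reference, the checks I ran on each image (a sketch; pool/image names are 
placeholders):

rbd status rbd/image01       # lists watchers
rbd lock ls rbd/image01      # lists locks
rbd lock rm rbd/image01 <lock-id> <locker>   # how I removed the stale locks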

--
Salsa



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus: rbd image stuck unaccessible after VM restart

2020-09-01 Thread salsa
Hi,

Any news on this error? I'm facing the same issue, I guess. I had a Windows 
Server copying data to some RBD images through iSCSI; the server got stuck and 
had to be reset, and now the images that had data block all I/O operations, 
including editing their config, creating snapshots, etc.

Thanks;
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io