[ceph-users] Re: Need help integrating radosgw with keystone for openstack swift

2020-10-22 Thread Burkhard Linke

Hi,


in our setup (ceph 15.2.4, openstack train) the swift endpoint URLs are 
different, e.g.


# openstack endpoint list --service swift
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------+
| ID                               | Region    | Service Name | Service Type | Enabled | Interface | URL                  |
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------+
| 521a556e391c40cc8d242f0f61a22812 | RegionOne | swift        | object-store | True    | public    | https://s3./swift/v1 |
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------+
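For completeness, the gateway side is driven by the Keystone options in ceph.conf; a minimal sketch of the rgw section (host names, credentials and roles below are placeholders, not our real values):

   [client.rgw.gateway-host]
   rgw_keystone_url = https://keystone.example.org:5000
   rgw_keystone_api_version = 3
   rgw_keystone_admin_user = rgw-service-user
   rgw_keystone_admin_password = secret
   rgw_keystone_admin_domain = Default
   rgw_keystone_admin_project = service
   rgw_keystone_accepted_roles = member,admin
   rgw_swift_account_in_url = false

The endpoint URL registered in Keystone has to match the prefix the gateway actually serves (/swift/v1 above).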




And a somewhat related personal opinion: do not use swift.

The API requires using openstack credentials, and in many cases these 
credentials are the main user credentials used for accessing openstack 
(there are other methods, but most users are not aware of this). If 
instances want to access data in the object storage, you have to store 
the credentials in the instance. If an instance is exposed to the 
internet, it may be attacked and broken into; as a result the openstack 
credentials might end up in the wrong hands. I'm not sure whether using 
other methods like application credentials can reduce the problem e.g. 
by restricting them to certain services. But you can encourage users to 
use the S3 interface instead. S3 credentials can be created in the 
openstack web interface and by command line; they are scoped to a 
certain project only, and if you do not use some AWS compatibility layer 
they can _only_ be used for authentication in the S3 API. It's probably 
still a problem if they are stolen, but it is not as bad as losing the 
openstack credentials...
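A minimal sketch of that workflow (the endpoint is a placeholder, and awscli is just one possible client):

   # create S3-style credentials scoped to the caller's current project
   openstack ec2 credentials create

   # use the returned access/secret key against the RGW S3 endpoint
   aws configure
   aws --endpoint-url http://rgw.example.org:7480 s3 ls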



Just my 0.02 euro


Regards,

Burkhard



[ceph-users] Re: 6 PG's stuck not-active, remapped

2020-10-22 Thread Burkhard Linke

Hi,


On 10/21/20 10:01 PM, Mac Wynkoop wrote:

*snipsnap*

up: {0: 113, 1: 138, 2: 30, 3: 132, 4: 105, 5: 57, 6: 106, 7: 140, 8: 161}
acting: {0: 72, 1: 150, 2: 2147483647, 3: 2147483647, 4: 24, 5: 48, 6: 32, 7: 157, 8: 103}


2147483647 is the placeholder CRUSH uses when it cannot fill a slot (effectively -1). 
This value means that the CRUSH algorithm did not produce enough OSDs to satisfy 
the PG requirements (e.g. fewer than three different OSDs for a replicated pool with size=3).


You mentioned that some disks are currently offline; if they are marked 
out, your current cluster setup might not be sufficient for your CRUSH 
rules. Bring the disks back online or (given sufficient hosts/OSDs) 
increase the number of attempts CRUSH makes before giving up. The pseudo-random 
selection may not always be able to pick three hosts out 
of three available ones ;-)
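If you go for more CRUSH retries, the usual workflow is to edit the decompiled crushmap; a rough sketch (the values are only examples):

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   # in the relevant rule, add or raise the retry steps, e.g.
   #   step set_choose_tries 100
   #   step set_chooseleaf_tries 10
   crushtool -c crush.txt -o crush.new
   ceph osd setcrushmap -i crush.new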


Regards,

Burkhard


[ceph-users] Re: Need help integrating radosgw with keystone for openstack swift

2020-10-22 Thread Bujack, Stefan
Hi,

I tried your endpoint configuration but with the same outcome. Maybe I am 
missing something.

I also don't know if I am testing the right way.

But thank you for your answer and your help.

Greets Stefan Bujack

root@keystone:~# openstack endpoint list | grep swift
| 0ee9c91af2424e33a91a4c118b693301 | RegionOne | swift | object-store | True | internal | http://ciosmon06.desy.de:7480/swift/v1/ |
| 4719a266432f45bda380c52486421e62 | RegionOne | swift | object-store | True | public   | http://ciosmon06.desy.de:7480/swift/v1/ |
| e68b3990e74447bfa35a5d6aa66ca2aa | RegionOne | swift | object-store | True | admin    | http://ciosmon06.desy.de:7480/swift/v1/ |

root@it-build:~# openstack container list
Unrecognized schema in response body. (HTTP 401) (Request-ID: 
tx7-005f914731-26173f-default)

[root@ciosmon06 ~]# tail -f  /var/log/ceph/ceph-client.rgw.ciosmon06.log
2020-10-22 10:47:45.535 7efea6f5f700  1 == req done req=0x562f3de148f0 op 
status=0 http_status=401 latency=0.00099s ==
2020-10-22 10:47:45.798 7efea675e700  1 == starting new request 
req=0x562f3de148f0 =
2020-10-22 10:47:45.798 7efea675e700  1 == req done req=0x562f3de148f0 op 
status=0 http_status=401 latency=0s ==

root@it-build:~# openstack ec2 credentials create
+------------+------------------------------------------------------------------+
| Field      | Value                                                            |
+------------+------------------------------------------------------------------+
| access     | 91fe4a54ac4547b2a127fc4599bd7580                                 |
| links      | {'self': 'https://keystone-intern.desy.de:5000/v3/users/926c750033e668f0af2330b1c7c723a05b86fa393655369fdb1a5622ae65dac8/credentials/OS-EC2/91fe4a54ac4547b2a127fc4599bd7580'} |
| project_id | 286f5d2b38ae4595ba9ff8129e754f54                                 |
| secret     | e8e0035d228743cfb40083d84d6f3580                                 |
| trust_id   | None                                                             |
| user_id    | 926c750033e668f0af2330b1c7c723a05b86fa393655369fdb1a5622ae65dac8 |
+------------+------------------------------------------------------------------+
root@it-build:~# /usr/local/bin/aws configure
AWS Access Key ID [780b]: 91fe4a54ac4547b2a127fc4599bd7580
AWS Secret Access Key [c4dc]: e8e0035d228743cfb40083d84d6f3580
Default region name [default]:
Default output format [None]:
root@it-build:~# /usr/local/bin/aws 
--endpoint='http://ciosmon06.desy.de:7480/swift/v1/' s3 ls s3://

An error occurred (404) when calling the ListBuckets operation: Not Found
root@it-build:~# /usr/local/bin/aws --endpoint='http://ciosmon06.desy.de:7480' 
s3 ls s3://

An error occurred (InvalidAccessKeyId) when calling the ListBuckets operation: 
Unknown


[root@ciosmon06 ~]# tail -f  /var/log/ceph/ceph-client.rgw.ciosmon06.log
2020-10-22 10:49:57.886 7efea2f57700  1 == starting new request 
req=0x562f3de248f0 =
2020-10-22 10:49:57.888 7efea2f57700  1 == req done req=0x562f3de248f0 op 
status=-2 http_status=404 latency=0.002s ==
2020-10-22 10:50:22.344 7efea0752700  1 == starting new request 
req=0x562f3de488f0 =
2020-10-22 10:50:22.346 7efea0752700  1 == req done req=0x562f3de488f0 op 
status=0 http_status=403 latency=0.002s ==





- Original Message -
From: "Burkhard Linke" 
To: "ceph-users" 
Sent: Thursday, 22 October, 2020 10:11:22
Subject: [ceph-users] Re: Need help integrating radosgw with keystone for 
openstack swift

Hi,


in our setup (ceph 15.2.4, openstack train) the swift endpoint URLs are 
different, e.g.

# openstack endpoint list --service swift
+--+---+--+--+-+---+--+
| ID  

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Frank Schilder
Sounds good. Did you re-create the pool again? If not, please do, so that the 
devicehealth manager module gets its storage back. If you can't see any IO on it, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.
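A rough sketch of that, using the replicated_host_nvme rule mentioned earlier in this thread (device_health_metrics is the pool name Octopus normally uses; adjust names and PG count to your setup):

   ceph osd pool create device_health_metrics 10 10 replicated replicated_host_nvme
   ceph mgr fail <active-mgr>   # fail over the active mgr to drop a stale rados connection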

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
> Hi Michael,
>
> some quick thoughts.
>
>
> That you can create a pool with 1 PG is a good sign, the crush rule is OK. 
> That pg query says it doesn't have PG 1.0 points in the right direction. 
> There is an inconsistency in the cluster. This is also indicated by the fact 
> that no upmaps seem to exist (the clean-up script was empty). With the osd 
> map you extracted, you could check what the osd map believes the mapping of 
> the PGs of pool 1 are:
>
># osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.
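For reference, the map was pulled and queried along these lines (the file name is arbitrary):

   ceph osd getmap -o osd.map
   osdmaptool osd.map --test-map-pgs-dump --pool 1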

> or if it also claims the PG does not exist. It looks like something went 
> wrong during pool creation and you are not the only one having problems with 
> this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html 
> . Sounds a lot like a bug in cephadm.
>
> In principle, it looks like the idea to delete and recreate the health 
> metrics pool is a way forward. Please look at the procedure mentioned in the 
> thread quoted above. Deletion of the pool there lead to some crashes and some 
> surgery on some OSDs was necessary. However, in your case it might just work, 
> because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.

> In order to do so cleanly, however, you will probably want to shut down all 
> clients accessing this pool. Note that clients accessing the health metrics 
> pool are not FS clients, so the mds cannot tell you anything about them. The 
> only command that seems to list all clients is
>
># ceph daemon mon.MON-ID sessions
>
> that needs to be executed on all mon hosts. On the other hand, you could also 
> just go ahead and see if something crashes (an MGR module probably) or 
> disable all MGR modules during this recovery attempt. I found some info that 
> cephadm creates this pool and starts an MGR module.
>
> If you google "device_health_metric pool" you should find descriptions of 
> similar cases. It looks solvable.

Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.

> I will look at the incomplete PG issue. I hope this is just some PG tuning. 
> At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

> The stuck MDS request could be an attempt to access an unfound object. It 
> should be possible to locate the fs client and find out what it was trying to 
> do. I see this sometimes when people are too impatient. They manage to 
> trigger a race condition and an MDS operation gets stuck (there are MDS bugs 
> and in my case it was an ls command that got stuck). Usually, evicting the 
> client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike

> From: Michael Thomas 
> Sent: 20 October 2020 23:48:36
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
>
> On 10/20/20 1:18 PM, Frank Schilder wrote:
>> Dear Michael,
>>
 Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an 
 OSD mapping?
>>
>> I meant here with crush rule replicated_host_nvme. Sorry, forgot.
>
> Seems to have worked fine:
>
> https://pastebin.com/PFgDE4J1
>
>>> Yes, the OSD was still out when the previous health report was created.
>>
>> Hmm, this is odd. If this is correct, then it did report a slow op even 
>> though it was out of the cluster:
>>
>>> from https://pastebin.com/3G3ij9ui:
>>> [WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
>>> [osd.0,osd.41] have slow ops.
>>
>> Not sure what to make of that. I

[ceph-users] Urgent help needed please - MDS offline

2020-10-22 Thread David C
Hi All

My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
overnight, metadata is on a separate pool which didn't hit capacity but the
filesystem stopped working which I'd expect. I increased the osd full-ratio
to give me some breathing room to get some data deleted once the filesystem
is back online. When I attempt to restart the MDS service, I see the usual
stuff I'd expect in the log but then:

heartbeat_map is_healthy 'MDSRank' had timed out after 15


Followed by:

mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> acked 4.00013s ago); MDS internal heartbeat is not healthy!


Eventually I get:

>
> mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> mds.0.90884 skipping upkeep work because connection to Monitors appears
> laggy
> mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> mds.beacon.hostnamecephssd01  MDS is no longer laggy


The "MDS is no longer laggy" appears to be where the service fails

Meanwhile a ceph -s is showing:

>
> cluster:
> id: 5c5998fd-dc9b-47ec-825e-beaba66aad11
> health: HEALTH_ERR
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 67 backfillfull osd(s)
> 11 nearfull osd(s)
> full ratio(s) out of order
> 2 pool(s) backfillfull
> 2 pool(s) nearfull
> 6 scrub errors
> Possible data damage: 5 pgs inconsistent
>   services:
> mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> mds: cephfs-1/1/1 up  {0=hostnamecephssd01=up:replay}
> osd: 172 osds: 161 up, 161 in
>   data:
> pools:   5 pools, 8384 pgs
> objects: 76.25M objects, 124TiB
> usage:   373TiB used, 125TiB / 498TiB avail
> pgs: 8379 active+clean
>  5active+clean+inconsistent
>   io:
> client:   676KiB/s rd, 0op/s rd, 0op/s w


The 5 pgs inconsistent is not a new issue, that is from past scrubs, just
haven't gotten around to manually clearing them although I suppose they
could be related to my issue

The cluster has no clients connected

I did notice in the ceph.log, some OSDs that are in the same host as the
MDS service briefly went down when trying to restart the MDS but examining
the logs of those particular OSDs isn't showing any glaring issues.

Full MDS log at debug 5 (can go higher if needed):

2020-10-22 11:27:10.987652 7f6f696f5240  0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:27:10.987669 7f6f696f5240  0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022582
2020-10-22 11:27:10.990567 7f6f696f5240  0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:27:11.027981 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90882 from mon.0
2020-10-22 11:27:15.097957 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90883 from mon.0
2020-10-22 11:27:15.097989 7f6f62616700  1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:27:15.101071 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90884 from mon.0
2020-10-22 11:27:15.105310 7f6f62616700  1 mds.0.90884 handle_mds_map i am
now mds.0.90884
2020-10-22 11:27:15.105316 7f6f62616700  1 mds.0.90884 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:27:15.105325 7f6f62616700  1 mds.0.90884 replay_start
2020-10-22 11:27:15.105333 7f6f62616700  1 mds.0.90884  recovery set is
2020-10-22 11:27:15.105344 7f6f62616700  1 mds.0.90884  waiting for osdmap
73745 (which blacklists prior instance)
2020-10-22 11:27:15.149092 7f6f5be09700  0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:27:15.149693 7f6f5be09700  0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:27:41.021708 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029290 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029297 7f6f5f610700  0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:27:45.866711 7f6f5fe11700  1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:01.021965 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029862 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029885 7f6f5f610700  0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00113s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:06.022033 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029955 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029961 7f6f5f610700  0 mds.beacon.host

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Frank Schilder
Could you also execute (and post the output of)

  # osdmaptool osd.map --test-map-pgs-dump --pool 7

with the osd map you pulled out (pool 7 should be the fs data pool)? Please 
check what mapping is reported for PG 7.39d? Just checking if osd map and pg 
dump agree here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 22 October 2020 09:32:07
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Sounds good. Did you re-create the pool again? If not, please do, so that the 
devicehealth manager module gets its storage back. If you can't see any IO on it, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:
> Hi Michael,
>
> some quick thoughts.
>
>
> That you can create a pool with 1 PG is a good sign, the crush rule is OK. 
> That pg query says it doesn't have PG 1.0 points in the right direction. 
> There is an inconsistency in the cluster. This is also indicated by the fact 
> that no upmaps seem to exist (the clean-up script was empty). With the osd 
> map you extracted, you could check what the osd map believes the mapping of 
> the PGs of pool 1 are:
>
># osdmaptool osd.map --test-map-pgs-dump --pool 1

https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.

> or if it also claims the PG does not exist. It looks like something went 
> wrong during pool creation and you are not the only one having problems with 
> this particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html 
> . Sounds a lot like a bug in cephadm.
>
> In principle, it looks like the idea to delete and recreate the health 
> metrics pool is a way forward. Please look at the procedure mentioned in the 
> thread quoted above. Deletion of the pool there lead to some crashes and some 
> surgery on some OSDs was necessary. However, in your case it might just work, 
> because you redeployed the OSDs in question already - if I remember correctly.

That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.

> In order to do so cleanly, however, you will probably want to shut down all 
> clients accessing this pool. Note that clients accessing the health metrics 
> pool are not FS clients, so the mds cannot tell you anything about them. The 
> only command that seems to list all clients is
>
># ceph daemon mon.MON-ID sessions
>
> that needs to be executed on all mon hosts. On the other hand, you could also 
> just go ahead and see if something crashes (an MGR module probably) or 
> disable all MGR modules during this recovery attempt. I found some info that 
> cephadm creates this pool and starts an MGR module.
>
> If you google "device_health_metric pool" you should find descriptions of 
> similar cases. It looks solvable.

Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.

> I will look at the incomplete PG issue. I hope this is just some PG tuning. 
> At least pg query didn't complain :)

I have OSDs ready to add to the pool, in case you think we should try.

> The stuck MDS request could be an attempt to access an unfound object. It 
> should be possible to locate the fs client and find out what it was trying to 
> do. I see this sometimes when people are too impatient. They manage to 
> trigger a race condition and an MDS operation gets stuck (there are MDS bugs 
> and in my case it was an ls command that got stuck). Usually, evicting the 
> client temporarily solves the issue (but tell the user :).

I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike

> From: Michael Thomas 
> Sent: 20 October 2020 23:48:36
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
>
> On 10/20/20 1:18 PM, Frank Schilder wrote:
>> Dear Michael,
>>
 Can you create a test pool with pg_num=pgp_num=1 and see i

[ceph-users] Hardware needs for MDS for HPC/OpenStack workloads?

2020-10-22 Thread Matthew Vernon

Hi,

We're considering the merits of enabling CephFS for our main Ceph 
cluster (which provides object storage for OpenStack), and one of the 
obvious questions is what sort of hardware we would need for the MDSs 
(and how many!).


These would be for our users' scientific workloads, so they would need to 
provide reasonably high performance. For reference, we have 3060 6TB 
OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.


The minimum specs are very modest (2-3GB RAM, a tiny amount of disk, 
similar networking to the OSD nodes), but I'm not sure how much going 
beyond that is likely to be useful in production.


I've also seen it suggested that an SSD-only pool is sensible for the 
CephFS metadata pool; how big is that likely to get?


I'd be grateful for any pointers :)

Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
You can disable that beacon by increasing mds_beacon_grace to 300 or
600. This will stop the mon from failing that mds over to a standby.
I don't know if that is set on the mon or mgr, so I usually set it on both.
(You might as well disable the standby too -- no sense in something
failing back and forth between two mdses).
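On Luminous that would be something along these lines (a sketch; as said, it is not clear whether the mon or the mgr consumes it, so set it in both places and persist it in ceph.conf):

   ceph tell mon.\* injectargs '--mds_beacon_grace=600'
   # plus mds_beacon_grace = 600 in the [global] section of ceph.conf on the mon/mgr hosts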

Next -- looks like your mds is in active:replay. Is it doing anything?
Is it using lots of CPU/RAM? If you increase debug_mds do you see some
progress?

-- dan


On Thu, Oct 22, 2020 at 2:01 PM David C  wrote:
>
> Hi All
>
> My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
> overnight, metadata is on a separate pool which didn't hit capacity but the
> filesystem stopped working which I'd expect. I increased the osd full-ratio
> to give me some breathing room to get some data deleted once the filesystem
> is back online. When I attempt to restart the MDS service, I see the usual
> stuff I'd expect in the log but then:
>
> heartbeat_map is_healthy 'MDSRank' had timed out after 15
>
>
> Followed by:
>
> mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> > acked 4.00013s ago); MDS internal heartbeat is not healthy!
>
>
> Eventually I get:
>
> >
> > mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> > mds.0.90884 skipping upkeep work because connection to Monitors appears
> > laggy
> > mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> > mds.beacon.hostnamecephssd01  MDS is no longer laggy
>
>
> The "MDS is no longer laggy" appears to be where the service fails
>
> Meanwhile a ceph -s is showing:
>
> >
> > cluster:
> > id: 5c5998fd-dc9b-47ec-825e-beaba66aad11
> > health: HEALTH_ERR
> > 1 filesystem is degraded
> > insufficient standby MDS daemons available
> > 67 backfillfull osd(s)
> > 11 nearfull osd(s)
> > full ratio(s) out of order
> > 2 pool(s) backfillfull
> > 2 pool(s) nearfull
> > 6 scrub errors
> > Possible data damage: 5 pgs inconsistent
> >   services:
> > mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> > mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> > mds: cephfs-1/1/1 up  {0=hostnamecephssd01=up:replay}
> > osd: 172 osds: 161 up, 161 in
> >   data:
> > pools:   5 pools, 8384 pgs
> > objects: 76.25M objects, 124TiB
> > usage:   373TiB used, 125TiB / 498TiB avail
> > pgs: 8379 active+clean
> >  5active+clean+inconsistent
> >   io:
> > client:   676KiB/s rd, 0op/s rd, 0op/s w
>
>
> The 5 pgs inconsistent is not a new issue, that is from past scrubs, just
> haven't gotten around to manually clearing them although I suppose they
> could be related to my issue
>
> The cluster has no clients connected
>
> I did notice in the ceph.log, some OSDs that are in the same host as the
> MDS service briefly went down when trying to restart the MDS but examining
> the logs of those particular OSDs isn't showing any glaring issues.
>
> Full MDS log at debug 5 (can go higher if needed):
>
> 2020-10-22 11:27:10.987652 7f6f696f5240  0 set uid:gid to 167:167
> (ceph:ceph)
> 2020-10-22 11:27:10.987669 7f6f696f5240  0 ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
> ceph-mds, pid 2022582
> 2020-10-22 11:27:10.990567 7f6f696f5240  0 pidfile_write: ignore empty
> --pid-file
> 2020-10-22 11:27:11.027981 7f6f62616700  1 mds.hostnamecephssd01 Updating
> MDS map to version 90882 from mon.0
> 2020-10-22 11:27:15.097957 7f6f62616700  1 mds.hostnamecephssd01 Updating
> MDS map to version 90883 from mon.0
> 2020-10-22 11:27:15.097989 7f6f62616700  1 mds.hostnamecephssd01 Map has
> assigned me to become a standby
> 2020-10-22 11:27:15.101071 7f6f62616700  1 mds.hostnamecephssd01 Updating
> MDS map to version 90884 from mon.0
> 2020-10-22 11:27:15.105310 7f6f62616700  1 mds.0.90884 handle_mds_map i am
> now mds.0.90884
> 2020-10-22 11:27:15.105316 7f6f62616700  1 mds.0.90884 handle_mds_map state
> change up:boot --> up:replay
> 2020-10-22 11:27:15.105325 7f6f62616700  1 mds.0.90884 replay_start
> 2020-10-22 11:27:15.105333 7f6f62616700  1 mds.0.90884  recovery set is
> 2020-10-22 11:27:15.105344 7f6f62616700  1 mds.0.90884  waiting for osdmap
> 73745 (which blacklists prior instance)
> 2020-10-22 11:27:15.149092 7f6f5be09700  0 mds.0.cache creating system
> inode with ino:0x100
> 2020-10-22 11:27:15.149693 7f6f5be09700  0 mds.0.cache creating system
> inode with ino:0x1
> 2020-10-22 11:27:41.021708 7f6f63618700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2020-10-22 11:27:43.029290 7f6f5f610700  1 heartbeat_map is_healthy
> 'MDSRank' had timed out after 15
> 2020-10-22 11:27:43.029297 7f6f5f610700  0 mds.beacon.hostnamecephssd01
> Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
> internal heartbeat is not healthy!
> 2020-10-

[ceph-users] OSD Failures after pg_num increase on one of the pools

2020-10-22 Thread Артём Григорьев
Hello everyone,

I recently created a new Ceph 14.2.7 Nautilus cluster. The cluster consists of
3 racks with 2 OSD nodes in each rack and 12 new HDDs in each node. The HDD
model is TOSHIBA MG07ACA14TE (14 TB). All data pools are EC pools.
Yesterday I decided to increase the PG number on one of the pools with the
command "ceph osd pool set photo.buckets.data pg_num 512". After that, many OSDs
started to crash and were marked "out" and "down". I tried to increase recovery_sleep
to 1s but the OSDs still crashed. The OSDs started working properly only once I set
the "norecover" flag, but OSD scrub errors appeared after that.

In the OSD logs during the crashes I found this:
---

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN

E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)'

thread 7f8af535d700 time 2020-10-21 15:12:11.460092

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHIN

E_SIZE/huge/release/14.2.7/rpm/el7/BUILD/ceph-14.2.7/src/osd/ECBackend.cc:
648: FAILED ceph_assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset( aft

er_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable)

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 1:
(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x14a) [0x55fc694d6c0f]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 2: (()+0x47)
[0x55fc694d6dd7]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 3:
(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)+0x1740) [0x55fc698cafa0]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 4:
(ECBackend::handle_recovery_read_complete(hobject_t const&,
boost::tuples::tuple,
std::allocator >
>, boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type, boost::tuples::null_type,
boost::tuples::null_type>&, boost::optional,
std::allocator >
> >, RecoveryMessages*)+0x734) [0x55fc698cb804]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 5:
(OnRecoveryReadComplete::finish(std::pair&)+0x94) [0x55fc698ebbe4]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 6:
(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8c)
[0x55fc698bfdcc]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 7:
(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x109c) [0x55fc698d6b8c]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 8:
(ECBackend::_handle_message(boost::intrusive_ptr)+0x17f)
[0x55fc698d718f]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 9:
(PGBackend::handle_message(boost::intrusive_ptr)+0x4a)
[0x55fc697c18ea]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 10:
(PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x5b3) [0x55fc697676b3]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 11:
(OSD::dequeue_op(boost::intrusive_ptr, boost::intrusive_ptr,
ThreadPool::TPHandle&)+0x362) [0x55fc695b3d72]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 12: (PGOpItem::run(OSD*,
OSDShard*, boost::intrusive_ptr&, ThreadPool::TPHandle&)+0x62)
[0x55fc698415c2]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 13:
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f)
[0x55fc695cebbf]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 14:
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6)
[0x55fc69b6f976]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 15:
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55fc69b72490]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 16: (()+0x7e65)
[0x7f8b1ddede65]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: 17: (clone()+0x6d)
[0x7f8b1ccb188d]

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: *** Caught signal (Aborted) **

Oct 21 15:12:11 ceph-osd-201 ceph-osd[58159]: in thread 7f8af535d700
thread_name:tp_osd_tp
---

Current EC profile and pool info below:

# ceph osd erasure-code-profile get EC42

crush-device-class=hdd

crush-failure-domain=host

crush-root=main

jerasure-per-chunk-alignment=false

k=4

m=2

plugin=jerasure

technique=reed_sol_van

w=8


pool 25 'photo.buckets.data' erasure size 6 min_size 4 crush_rule 6
object_hash rjenkins pg_num 512 pgp_num 280 pgp_num_target 512
autoscale_mode warn last_change 43418 lfor 0/0/42223 flags hashpspool
stripe_width 1048576 application rgw


Current ceph status:

ceph -s

  cluster:

id: 9ec8d309-a620-4ad8-93fa-c2d111e5256e

health: HEALTH_ERR

norecover flag(s) set

1 pools have many more objects per pg than average

4542629 scrub errors

P

[ceph-users] Re: Huge RAM Ussage on OSD recovery

2020-10-22 Thread Mark Nelson


On 10/21/20 10:54 PM, Ing. Luis Felipe Domínguez Vega wrote:

On 2020-10-20 17:57, Ing. Luis Felipe Domínguez Vega wrote:

Hi, today my infra provider had a blackout, and then Ceph tried to recover
but is in an inconsistent state because many OSDs cannot recover by
themselves; the kernel kills them via OOM. Even an OSD that was OK has now
gone down, OOM-killed.

Even on a server with 32 GB RAM the OSD uses ALL of that and never recovers;
I think this could be a memory leak. Ceph version is Octopus 15.2.3.

In: https://pastebin.pl/view/59089adc
You can see that buffer_anon grows to 32 GB, but why? The whole cluster is
down because of that.
Used the --op export-remove and then --op import of 
ceph-objectstore-tool for the failing PG and now the OSD is running 
great.
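For readers following the thread, that procedure looks roughly like this (OSD id, PG id and file path are placeholders; the OSD must be stopped while ceph-objectstore-tool runs, and whether you re-import into the same OSD or a different one depends on the situation):

   systemctl stop ceph-osd@46
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
         --pgid 5.d --op export-remove --file /root/pg5.d.export
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
         --op import --file /root/pg5.d.export
   systemctl start ceph-osd@46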




That's great news! ...but hopefully we'll figure out what's going on so 
we can avoid the problem in the first place. :)



Mark


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Dan, many thanks for the response.

I was going down the route of looking at mds_beacon_grace but I now
realise when I start my MDS, it's swallowing up memory rapidly and
looks like the oom-killer is eventually killing the mds. With debug
upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
the filesystem and I can't see any obvious issues.

This server has 128GB ram with 111GB free with the MDS stopped

The mds_cache_memory_limit is currently set to 32GB

Could this be a case of simply reducing the mds cache until I can get
this started again or is there another setting I should be looking at?
Is it safe to reduce the cache memory limit at this point?

The standby is currently down and has been deliberately down for a while now.

Log excerpt from debug 10 just before MDS is killed (path/to/dir
refers to a real path in my FS)

2020-10-22 13:29:49.527372 7fc72d39f700 10
mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
/path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
dirtyparent s
=0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay inotable tablev 481253 <= table 481328
2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay sessionmap v 240341131 <= table 240378576
2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
EOpen [metab
lob 0x10009e1ec8e, 1881 dirs], 16748 open files
2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay 1881 dirlumps by unknown.0
2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay dir 0x10009e1ec8e
2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
0x56043c4bd100]
2020-10-22 13:29:50.275864 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
2020-10-22 13:29:51.026368 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
rtt 0.750024
2020-10-22 13:29:51.026377 7fc73732e700  0
mds.beacon.hostnamecephssd01  MDS is no longer laggy
2020-10-22 13:29:54.275993 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
2020-10-22 13:29:54.277360 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
rtt 0.0013
2020-10-22 13:29:58.276117 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
2020-10-22 13:29:58.277322 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
rtt 0.0013
2020-10-22 13:30:02.276313 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
2020-10-22 13:30:02.477973 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
rtt 0.202007

Thanks,
David

On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster  wrote:
>
> You can disable that beacon by increasing mds_beacon_grace to 300 or
> 600. This will stop the mon from failing that mds over to a standby.
> I don't know if that is set on the mon or mgr, so I usually set it on both.
> (You might as well disable the standby too -- no sense in something
> failing back and forth between two mdses).
>
> Next -- looks like your mds is in active:replay. Is it doing anything?
> Is it using lots of CPU/RAM? If you increase debug_mds do you see some
> progress?
>
> -- dan
>
>
> On Thu, Oct 22, 2020 at 2:01 PM David C  wrote:
> >
> > Hi All
> >
> > My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
> > overnight, metadata is on a separate pool which didn't hit capacity but the
> > filesystem stopped working which I'd expect. I increased the osd full-ratio
> > to give me some breathing room to get some data deleted once the filesystem
> > is back online. When I attempt to restart the MDS service, I see the usual
> > stuff I'd expect in the log but then:
> >
> > heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >
> >
> > Followed by:
> >
> > mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> > > acked 4.00013s ago); MDS internal heartbeat is not healthy!
> >
> >
> > Eventually I get:
> >
> > >
> > > mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> > > mds.0.90884 skipping upkeep work because connection to Monitors appears
> > > laggy
> > > mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> > > mds

[ceph-users] Ceph Octopus and Snapshot Schedules

2020-10-22 Thread Adam Boyhan
Hey all. 

I was wondering if Ceph Octopus is capable of automating/managing snapshot 
creation/retention and then replication? I've seen some notes about it, but 
can't seem to find anything solid. 

Open to suggestions as well. Appreciate any input! 


[ceph-users] Re: Ceph Octopus and Snapshot Schedules

2020-10-22 Thread Martin Verges
Hello Adam,

in our croit Ceph Management Software, we have a snapshot manager feature
that is capable of doing that.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Thu, 22 Oct 2020 at 15:38, Adam Boyhan wrote:

> Hey all.
>
> I was wondering if Ceph Octopus is capable of automating/managing snapshot
> creation/retention and then replication? Ive seen some notes about it, but
> can't seem to find anything solid.
>
> Open to suggestions as well. Appreciate any input!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


[ceph-users] Re: Huge RAM Ussage on OSD recovery

2020-10-22 Thread Ing. Luis Felipe Domínguez Vega

On 2020-10-22 09:07, Mark Nelson wrote:

On 10/21/20 10:54 PM, Ing. Luis Felipe Domínguez Vega wrote:

On 2020-10-20 17:57, Ing. Luis Felipe Domínguez Vega wrote:

Hi, today my infra provider had a blackout, and then Ceph tried to recover
but is in an inconsistent state because many OSDs cannot recover by
themselves; the kernel kills them via OOM. Even an OSD that was OK has now
gone down, OOM-killed.

Even on a server with 32 GB RAM the OSD uses ALL of that and never recovers;
I think this could be a memory leak. Ceph version is Octopus 15.2.3.

In: https://pastebin.pl/view/59089adc
You can see that buffer_anon grows to 32 GB, but why? The whole cluster is
down because of that.
Used the --op export-remove and then --op import of 
ceph-objectstore-tool for the failing PG and now the OSD is running 
great.




That's great news! ...but hopefully we'll figure out what's going on
so we can avoid the problem in the first place. :)


Mark


Umm, not at all: the OSD is not killed, but it is using a huge amount of RAM,
and there are many log messages like this:


osd.46 osd.46 41072109 : slow request osd_op(client.72068484.0:1851999 
5.d 5.1aef4f8d (undecoded) ondisk+write+known_if_redirected e155365) 
initiated 2020-10-22T11:21:56.949886+ currently queued for pg



[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
You could decrease the mds_cache_memory_limit but I don't think this
will help here during replay.

You can see a related tracker here: https://tracker.ceph.com/issues/47582
This is possibly caused by replaying a very large journal. Did you
increase the journal segments?
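If you want to check, the journal size and layout can be inspected offline with cephfs-journal-tool while the MDS is down, e.g. (defaults to rank 0):

   cephfs-journal-tool journal inspect
   cephfs-journal-tool header get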

-- dan







-- dan

On Thu, Oct 22, 2020 at 3:35 PM David C  wrote:
>
> Dan, many thanks for the response.
>
> I was going down the route of looking at mds_beacon_grace but I now
> realise when I start my MDS, it's swallowing up memory rapidly and
> looks like the oom-killer is eventually killing the mds. With debug
> upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
> the filesystem and I can't see any obvious issues.
>
> This server has 128GB ram with 111GB free with the MDS stopped
>
> The mds_cache_memory_limit is currently set to 32GB
>
> Could this be a case of simply reducing the mds cache until I can get
> this started again or is there another setting I should be looking at?
> Is it safe to reduce the cache memory limit at this point?
>
> The standby is currently down and has been deliberately down for a while now.
>
> Log excerpt from debug 10 just before MDS is killed (path/to/dir
> refers to a real path in my FS)
>
> 2020-10-22 13:29:49.527372 7fc72d39f700 10
> mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> dirtyparent s
> =0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
> 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay inotable tablev 481253 <= table 481328
> 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
> 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> EOpen [metab
> lob 0x10009e1ec8e, 1881 dirs], 16748 open files
> 2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
> 2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay 1881 dirlumps by unknown.0
> 2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
> auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
> 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
> 89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
> 0x56043c4bd100]
> 2020-10-22 13:29:50.275864 7fc731ba8700  5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
> 2020-10-22 13:29:51.026368 7fc73732e700  5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
> rtt 0.750024
> 2020-10-22 13:29:51.026377 7fc73732e700  0
> mds.beacon.hostnamecephssd01  MDS is no longer laggy
> 2020-10-22 13:29:54.275993 7fc731ba8700  5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> 2020-10-22 13:29:54.277360 7fc73732e700  5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> rtt 0.0013
> 2020-10-22 13:29:58.276117 7fc731ba8700  5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> 2020-10-22 13:29:58.277322 7fc73732e700  5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> rtt 0.0013
> 2020-10-22 13:30:02.276313 7fc731ba8700  5
> mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
> 2020-10-22 13:30:02.477973 7fc73732e700  5
> mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
> rtt 0.202007
>
> Thanks,
> David
>
> On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster  wrote:
> >
> > You can disable that beacon by increasing mds_beacon_grace to 300 or
> > 600. This will stop the mon from failing that mds over to a standby.
> > I don't know if that is set on the mon or mgr, so I usually set it on both.
> > (You might as well disable the standby too -- no sense in something
> > failing back and forth between two mdses).
> >
> > Next -- looks like your mds is in active:replay. Is it doing anything?
> > Is it using lots of CPU/RAM? If you increase debug_mds do you see some
> > progress?
> >
> > -- dan
> >
> >
> > On Thu, Oct 22, 2020 at 2:01 PM David C  wrote:
> > >
> > > Hi All
> > >
> > > My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
> > > overnight, metadata is on a separate pool which didn't hit capacity but 
> > > the
> > > filesystem stopped working which I'd expect. I increased the osd 
> > > full-ratio
> > > to give me some breathing room to get some data deleted once the 
> > > filesystem
> > > is back online. When I attempt to restart the MDS service, I see the usual
> > > stuff I'd e

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
I've not touched the journal segments, current value of
mds_log_max_segments is 128. Would you recommend I increase (or
decrease) that value? And do you think I should change
mds_log_max_expiring to match that value?

On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster  wrote:
>
> You could decrease the mds_cache_memory_limit but I don't think this
> will help here during replay.
>
> You can see a related tracker here: https://tracker.ceph.com/issues/47582
> This is possibly caused by replaying a very large journal. Did you
> increase the journal segments?
>
> -- dan
>
>
>
>
>
>
>
> -- dan
>
> On Thu, Oct 22, 2020 at 3:35 PM David C  wrote:
> >
> > Dan, many thanks for the response.
> >
> > I was going down the route of looking at mds_beacon_grace but I now
> > realise when I start my MDS, it's swallowing up memory rapidly and
> > looks like the oom-killer is eventually killing the mds. With debug
> > upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
> > the filesystem and I can't see any obvious issues.
> >
> > This server has 128GB ram with 111GB free with the MDS stopped
> >
> > The mds_cache_memory_limit is currently set to 32GB
> >
> > Could this be a case of simply reducing the mds cache until I can get
> > this started again or is there another setting I should be looking at?
> > Is it safe to reduce the cache memory limit at this point?
> >
> > The standby is currently down and has been deliberately down for a while 
> > now.
> >
> > Log excerpt from debug 10 just before MDS is killed (path/to/dir
> > refers to a real path in my FS)
> >
> > 2020-10-22 13:29:49.527372 7fc72d39f700 10
> > mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> > 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> > /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> > dirtyparent s
> > =0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
> > 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay inotable tablev 481253 <= table 481328
> > 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> > 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> > 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
> > 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> > EOpen [metab
> > lob 0x10009e1ec8e, 1881 dirs], 16748 open files
> > 2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
> > 2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay 1881 dirlumps by unknown.0
> > 2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay dir 0x10009e1ec8e
> > 2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
> > auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
> > 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
> > 89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
> > 0x56043c4bd100]
> > 2020-10-22 13:29:50.275864 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
> > 2020-10-22 13:29:51.026368 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
> > rtt 0.750024
> > 2020-10-22 13:29:51.026377 7fc73732e700  0
> > mds.beacon.hostnamecephssd01  MDS is no longer laggy
> > 2020-10-22 13:29:54.275993 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> > 2020-10-22 13:29:54.277360 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> > rtt 0.0013
> > 2020-10-22 13:29:58.276117 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> > 2020-10-22 13:29:58.277322 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> > rtt 0.0013
> > 2020-10-22 13:30:02.276313 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
> > 2020-10-22 13:30:02.477973 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
> > rtt 0.202007
> >
> > Thanks,
> > David
> >
> > On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster  
> > wrote:
> > >
> > > You can disable that beacon by increasing mds_beacon_grace to 300 or
> > > 600. This will stop the mon from failing that mds over to a standby.
> > > I don't know if that is set on the mon or mgr, so I usually set it on 
> > > both.
> > > (You might as well disable the standby too -- no sense in something
> > > failing back and forth between two mdses).
> > >
> > > Next -- looks like your mds is in active:replay. Is it doing anything?
> > > Is it using lots of CPU/RAM? If you increase debug_mds do you see some
> > > progress?
> > >
> > > -- dan
> > >
> > >
> > > On Thu, 

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
I wouldn't adjust it.
Do you have the impression that the mds is replaying the exact same ops every
time it restarts? Or is it progressing and trimming the journal over time?

The only other advice I have is that 12.2.10 is quite old, and might
miss some important replay/mem fixes.
I'm thinking of one particular memory bloat issue we suffered (it
manifested on a multi-mds cluster, so I am not sure if it is the root
cause here https://tracker.ceph.com/issues/45090 )
I don't know enough about the changelog diffs to suggest upgrading
right now in the middle of this outage.


-- dan

On Thu, Oct 22, 2020 at 4:14 PM David C  wrote:
>
> I've not touched the journal segments, current value of
> mds_log_max_segments is 128. Would you recommend I increase (or
> decrease) that value? And do you think I should change
> mds_log_max_expiring to match that value?
>
> On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster  wrote:
> >
> > You could decrease the mds_cache_memory_limit but I don't think this
> > will help here during replay.
> >
> > You can see a related tracker here: https://tracker.ceph.com/issues/47582
> > This is possibly caused by replaying a very large journal. Did you
> > increase the journal segments?
> >
> > -- dan
> >
> >
> >
> >
> >
> >
> >
> > -- dan
> >
> > On Thu, Oct 22, 2020 at 3:35 PM David C  wrote:
> > >
> > > Dan, many thanks for the response.
> > >
> > > I was going down the route of looking at mds_beacon_grace but I now
> > > realise when I start my MDS, it's swallowing up memory rapidly and
> > > looks like the oom-killer is eventually killing the mds. With debug
> > > upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
> > > the filesystem and I can't see any obvious issues.
> > >
> > > This server has 128GB ram with 111GB free with the MDS stopped
> > >
> > > The mds_cache_memory_limit is currently set to 32GB
> > >
> > > Could this be a case of simply reducing the mds cache until I can get
> > > this started again or is there another setting I should be looking at?
> > > Is it safe to reduce the cache memory limit at this point?
> > >
> > > The standby is currently down and has been deliberately down for a while 
> > > now.
> > >
> > > Log excerpt from debug 10 just before MDS is killed (path/to/dir
> > > refers to a real path in my FS)
> > >
> > > 2020-10-22 13:29:49.527372 7fc72d39f700 10
> > > mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> > > 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> > > /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> > > dirtyparent s
> > > =0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
> > > 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay inotable tablev 481253 <= table 481328
> > > 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> > > 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> > > 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
> > > 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> > > EOpen [metab
> > > lob 0x10009e1ec8e, 1881 dirs], 16748 open files
> > > 2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
> > > 2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay 1881 dirlumps by unknown.0
> > > 2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay dir 0x10009e1ec8e
> > > 2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
> > > EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
> > > auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
> > > 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
> > > 89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
> > > 0x56043c4bd100]
> > > 2020-10-22 13:29:50.275864 7fc731ba8700  5
> > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
> > > 2020-10-22 13:29:51.026368 7fc73732e700  5
> > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
> > > rtt 0.750024
> > > 2020-10-22 13:29:51.026377 7fc73732e700  0
> > > mds.beacon.hostnamecephssd01  MDS is no longer laggy
> > > 2020-10-22 13:29:54.275993 7fc731ba8700  5
> > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> > > 2020-10-22 13:29:54.277360 7fc73732e700  5
> > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> > > rtt 0.0013
> > > 2020-10-22 13:29:58.276117 7fc731ba8700  5
> > > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> > > 2020-10-22 13:29:58.277322 7fc73732e700  5
> > > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> > > rtt 0.0013
> > > 2020-10-22 13:30:02.276313 7fc731ba8700  5
> > > mds.beacon.hostnamecephssd01 Sending beacon up

[ceph-users] Strange USED size

2020-10-22 Thread Marcelo
Hello. I've searched a lot but couldn't find out why the USED column in the
output of ceph df is many times bigger than the actual size. I'm using
Nautilus (14.2.8), and I have 1000 buckets with 100 objects in each bucket.
Each object is around 10 B.

ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       511 GiB     147 GiB     340 GiB     364 GiB      71.21
    TOTAL     511 GiB     147 GiB     340 GiB     364 GiB      71.21

POOLS:
    POOL                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    .rgw.root                     1     1.1 KiB           4     768 KiB         0     36 GiB
    default.rgw.control          11         0 B           8         0 B         0     36 GiB
    default.rgw.meta             12     449 KiB       2.00k     376 MiB      0.34     36 GiB
    default.rgw.log              13     3.4 KiB         207       6 MiB         0     36 GiB
    default.rgw.buckets.index    14         0 B       1.00k         0 B         0     36 GiB
    default.rgw.buckets.data     15     969 KiB        100k      18 GiB     14.52     36 GiB
    default.rgw.buckets.non-ec   16        27 B           1     192 KiB         0     36 GiB

Does anyone know the maths behind this, i.e. why it shows 18 GiB used when I
have only stored something like 1 MiB?

Thanks in advance, Marcelo.
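
A back-of-the-envelope check, assuming the Nautilus default
bluestore_min_alloc_size_hdd of 64 KiB and a replicated data pool with size=3
(both assumptions worth verifying on the cluster): every tiny RGW object still
occupies at least one allocation unit per replica, so:

# verify the assumptions first
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
ceph osd pool get default.rgw.buckets.data size

# 100k objects x 64 KiB min_alloc_size x 3 replicas, expressed in GiB
echo $(( 100000 * 65536 * 3 / 1024 / 1024 / 1024 ))    # -> 18

which lines up with the 18 GiB shown as USED for default.rgw.buckets.data.
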
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Hardware for new OSD nodes.

2020-10-22 Thread Dave Hall

Hello,

(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a 
couple questions about hardware choices.  Our original nodes were 8 x 
12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.


We chose the NVMe card for performance since it has an 8 lane PCIe 
interface.  However, we're currently seeing BlueFS spillovers.


The Tyan chassis we are considering has the option of 4 x U.2 NVMe bays 
- each with 4 PCIe lanes, (and 8 SAS bays).   It has occurred to me that 
I might stripe 4 1TB NVMe drives together to get much more space for 
WAL/DB and a net performance of 16 PCIe lanes.


Any thoughts on this approach?

Also, any thoughts/recommendations on 12TB OSD drives?  For 
price/capacity this is a good size for us, but I'm wondering if my 
BlueFS spillovers are resulting from using drives that are too big.  I 
also thought I might have seen some comments about cutting large drives 
into multiple OSDs - could that be?


Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Huge RAM Ussage on OSD recovery

2020-10-22 Thread Mark Nelson


On 10/22/20 9:02 AM, Ing. Luis Felipe Domínguez Vega wrote:

El 2020-10-22 09:07, Mark Nelson escribió:

On 10/21/20 10:54 PM, Ing. Luis Felipe Domínguez Vega wrote:

El 2020-10-20 17:57, Ing. Luis Felipe Domínguez Vega escribió:

Hi, today my infra provider had a blackout; Ceph then tried to recover but is
in an inconsistent state, because many OSDs cannot recover by themselves: the
kernel kills them by OOM. Even now an OSD that was OK has gone down, OOM-killed.

Even on a server with 32 GB RAM the OSD uses ALL of that and never recovers;
I think it could be a memory leak. Ceph version: Octopus 15.2.3.

In: https://pastebin.pl/view/59089adc
You can see that buffer_anon gets to 32 GB, but why? My whole cluster is down
because of that.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
Used the --op export-remove and then --op import of 
ceph-objectstore-tool for the failing PG and now the OSD is running 
great.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



That's great news! ...but hopefully we'll figure out what's going on
so we can avoid the problem in the first place. :)


Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Umm, not at all; the OSD is not killed, but it is using a huge amount of RAM
and there are many log messages like this:


osd.46 osd.46 41072109 : slow request osd_op(client.72068484.0:1851999 
5.d 5.1aef4f8d (undecoded) ondisk+write+known_if_redirected e155365) 
initiated 2020-10-22T11:21:56.949886+ currently queued for pg




Do you mean that the --op export-remove and --op import step didn't end 
up fixing it in the end?  I had interpreted "running great" to mean the 
OSD was no longer using tons of memory (but it's not a real fix, just a 
workaround).



Mark
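
For anyone following along, the workaround Luis describes is roughly the
following (a sketch only: the OSD id, PG id and paths are examples, and the
OSD has to be stopped while ceph-objectstore-tool runs):

systemctl stop ceph-osd@46

# export the problematic PG to a file and remove it from this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --pgid 5.d --op export-remove --file /root/pg-5.d.export

# import it again (here, or on another OSD with enough free space)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-46 \
    --op import --file /root/pg-5.d.export

systemctl start ceph-osd@46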


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware needs for MDS for HPC/OpenStack workloads?

2020-10-22 Thread Dan van der Ster
Hi Matthew,

On Thu, Oct 22, 2020 at 2:35 PM Matthew Vernon  wrote:
>
> Hi,
>
> We're considering the merits of enabling CephFS for our main Ceph
> cluster (which provides object storage for OpenStack), and one of the
> obvious questions is what sort of hardware we would need for the MDSs
> (and how many!).

We've never mixed cephfs and rbd, for the simple reason that we
enforce QoS throttles on the openstack clients but cannot do that on
the cephfs clients.
This was decided years ago, and might be overly cautious these days.

> These would be for our users scientific workloads, so they would need to
> provide reasonably high performance. For reference, we have 3060 6TB
> OSDs across 51 OSD hosts, and 6 dedicated RGW nodes.
>
> The minimum specs are very modest (2-3GB RAM, a tiny amount of disk,
> similar networking to the OSD nodes), but I'm not sure how much going
> beyond that is likely to be useful in production.
>
> I've also seen it suggested that an SSD-only pool is sensible for the
> CephFS metadata pool; how big is that likely to get?

From a smaller but active cephfs with size=3:

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       1.1 PiB     389 TiB     729 TiB     729 TiB      65.22
    TOTAL     1.1 PiB     389 TiB     729 TiB     729 TiB      65.22

POOLS:
    POOL               ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs_data         1     235 TiB     267.32M     235 TiB     42.69     105 TiB
    cephfs_metadata     2      66 GiB      19.06M      66 GiB      0.02     105 TiB

Cheers, Dan
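
If you do go the SSD-only route for the metadata pool, a minimal sketch using
device classes (rule and pool names are examples; changing the rule on an
existing pool will move data):

# replicated rule restricted to OSDs of class "ssd", failure domain host
ceph osd crush rule create-replicated replicated-ssd default host ssd

# point the metadata pool at it
ceph osd pool set cephfs_metadata crush_rule replicated-ssd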


>
> I'd be grateful for any pointers :)
>
> Regards,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large map object found

2020-10-22 Thread Peter Eisch
Thank you!  This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51
new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1
total entries: 1000 2000 3000 3228
2020-10-22 08:40:26.353 7fb197fc66c0  1 execute INFO: reshard of bucket 
"backups_sql_dswhseloadrepl_segments" from 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51"
 to 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1"
 completed successfully

[root@cephmon-s03 ~]# radosgw-admin buckets reshard list
[]
[root@cephmon-s03 ~]# radosgw-admin buckets reshard status 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments
[
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
}
]
[root@cephmon-s03 ~]#

This kicked off an autoscale event.  Would the reshard presumably start after 
the autoscaling is complete?

peter



On 10/21/20, 3:19 PM, "dhils...@performair.com"  wrote:



Peter;

Look into bucket sharding.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com

www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Wednesday, October 21, 2020 12:39 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Large map object found

Hi,

My rgw.buckets.index has the cluster in WARN. I'm either not understanding 
the real issue or I'm making it worse, or both.

OMAP_BYTES: 70461524
OMAP_KEYS: 250874

I thought I'd head this off by deleting rgw objects which would normally 
get deleted in the near future but this only seemed to make the values grow. 
Before I deleted lots of objects the values were:

OMAP_BYTES: 65450132
OMAP_KEYS: 209843

I read the default is 200k but I haven't read the proper way to manage this 
situation. What reading should I dive into? I could probably craft up a command 
to increase the value to clear the warning but I'm guessing this might not be 
great long-term.

Other errata which might matter:
Size: 3
Pool: nvme
CLASS SIZE AVAIL USED RAW USED %RAW USED
nvme 256 TiB 165 TiB 91 TiB 91 TiB 35.53

Errata: the complete statements:

PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG 
STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
43.d 2 0 0 0 0 70461524 250874 3070 active+clean 36m 185904'456870 
185904:1357091 [99,90,48]p99 [99,90,48]p99 2020-10-21 13:53:42.102363 
2020-10-21 13:53:42.102363

Thanks!

peter

[ceph-users] Re: Large map object found

2020-10-22 Thread DHilsbos
Peter;

I believe shard counts should be powers of two.

Also, resharding makes the buckets unavailable, but occurs very quickly.  As 
such it is not done in the background, but in the foreground, for a manual 
reshard.

Notice the statement: "reshard of bucket <bucket> from <old instance id> to
<new instance id> completed successfully."  It's done.

The warning notice won't go away until a scrub is completed to determine that a 
large OMAP object no longer exists.
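
If you don't want to wait for the next scheduled deep scrub, you can kick one
off on the index PG that was flagged (43.d in the earlier listing) and then
re-check the health detail:

ceph pg deep-scrub 43.d
# the LARGE_OMAP_OBJECTS warning should clear once the scrub has completed
ceph health detail | grep -i omap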

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com] 
Sent: Thursday, October 22, 2020 8:04 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: Large map object found

Thank you! This was helpful.

I opted for a manual reshard:

[root@cephmon-s03 ~]# radosgw-admin bucket reshard 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments --num-shards=3
tenant: d2ff913f5b6542cda307c9cd6a95a214
bucket name: backups_sql_dswhseloadrepl_segments
old bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51
new bucket instance id: 80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1
total entries: 1000 2000 3000 3228
2020-10-22 08:40:26.353 7fb197fc66c0 1 execute INFO: reshard of bucket 
"backups_sql_dswhseloadrepl_segments" from 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.340850308.51"
 to 
"d2ff913f5b6542cda307c9cd6a95a214/backups_sql_dswhseloadrepl_segments:80bdfc66-d1fd-418d-b87d-5c8518a0b707.948621036.1"
 completed successfully

[root@cephmon-s03 ~]# radosgw-admin buckets reshard list
[] 
[root@cephmon-s03 ~]# radosgw-admin buckets reshard status 
--bucket=d2ff913f5b6542cda307c9cd6a95a214/NAME_segments
[
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
},
{
"reshard_status": "not-resharding",
"new_bucket_instance_id": "",
"num_shards": -1
}
]
[root@cephmon-s03 ~]#

This kicked off an autoscale event. Would the reshard presumably start after the 
autoscaling is complete?

peter




On 10/21/20, 3:19 PM, "dhils...@performair.com"  wrote:



Peter;

Look into bucket sharding.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com


From: Peter Eisch [mailto:peter.ei...@virginpulse.com]
Sent: Wednesday, October 21, 2020 12:39 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Large map object found

Hi,

My rgw.buckets.index has the cluster in WARN. I'm either not understanding the 
real issue or I'm making it worse, or both.

OMAP_BYTES: 70461524
OMAP_KEYS: 250874

I thought I'd head this off by deleting rgw objects which would normally get 
deleted in the near future but this only seemed to make the values grow. Before 
I deleted lots of objects the values were:

OMAP_BYTES: 65450132
OMAP_KEYS: 209843

I read the default is 200k but I haven't read the proper way to manage this 
situation. What reading should I dive into? I could probably craft up a command 
to increase the value to clear the warning but I'm guessing this might not be 
great long-term.

Other errata which might matter:
Size: 3
Pool: nvme
CLASS SIZE AVAIL USED RAW USED %RAW USED
nvme 256 TiB 165 TiB 91 TiB 91 TiB 35.53

Errata: the complete statements:

PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE 
SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
43.d 2 0 0 0 0 70461524 250

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Brian Topping


> On Oct 22, 2020, at 9:14 AM, Eneko Lacunza  wrote:
> 
> Don't stripe them, if one NVMe fails you'll lose all OSDs. Just use 1 NVMe 
> drive for 2  SAS drives  and provision 300GB for WAL/DB for each OSD (see 
> related threads on this mailing list about why that exact size).
> 
> This way if a NVMe fails, you'll only lose 2 OSD.
> 
> Also, what size of WAL/DB partitions do you have now, and what spillover size?

Generally agreed against making a single giant striped bucket.

Note this may be a good use for RAID10 on WAL/DB if you are committed to 
multiple disks.

I generally put WAL/DB on RAID10 boot disks. It’s important to have reliable 
WAL/DB, but also important that the machine actually boots in the first place. 
With enough RAM and non-interactive use, most of the boot bits will be cached 
so there is no contention for the channel.

Happy for any critique on this as well!

Brian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Huge RAM Ussage on OSD recovery

2020-10-22 Thread Ing . Luis Felipe Domínguez Vega

El 2020-10-22 10:48, Mark Nelson escribió:

On 10/22/20 9:02 AM, Ing. Luis Felipe Domínguez Vega wrote:

El 2020-10-22 09:07, Mark Nelson escribió:

On 10/21/20 10:54 PM, Ing. Luis Felipe Domínguez Vega wrote:

El 2020-10-20 17:57, Ing. Luis Felipe Domínguez Vega escribió:
Hi, today my infra provider had a blackout; Ceph then tried to recover but is
in an inconsistent state, because many OSDs cannot recover by themselves: the
kernel kills them by OOM. Even now an OSD that was OK has gone down,
OOM-killed.

Even on a server with 32 GB RAM the OSD uses ALL of that and never recovers;
I think it could be a memory leak. Ceph version: Octopus 15.2.3.

In: https://pastebin.pl/view/59089adc
You can see that buffer_anon gets to 32 GB, but why? My whole cluster is down
because of that.
down because that.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
Used the --op export-remove and then --op import of 
ceph-objectstore-tool for the failing PG and now the OSD is running 
great.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



That's great news! ...but hopefully we'll figure out what's going on
so we can avoid the problem in the first place. :)


Mark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Umm, not at all; the OSD is not killed, but it is using a huge amount of RAM 
and there are many log messages like this:


osd.46 osd.46 41072109 : slow request osd_op(client.72068484.0:1851999 
5.d 5.1aef4f8d (undecoded) ondisk+write+known_if_redirected e155365) 
initiated 2020-10-22T11:21:56.949886+ currently queued for pg




Do you mean that the --op export-remove and --op import step didn't
end up fixing it in the end?  I had interpreted "running great" to
mean the OSD was no longer using tons of memory (but it's not a real
fix, just a workaround).


Mark
Yes, yesterday it was running great, but today it is consuming huge amounts 
of RAM. It is not OOM-killed; it is working now, but using a high amount of 
RAM, almost 96% of my server.
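
One knob that may be worth checking while this gets debugged (an assumption on
my side, since buffer_anon growth is not fully governed by it): the per-OSD
memory target, which the OSD only tries, best effort, to stay under.

ceph config get osd.46 osd_memory_target
# temporarily lower it, e.g. to ~2 GiB, while the cluster recovers
ceph config set osd osd_memory_target 2147483648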

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
I'm pretty sure it's replaying the same ops every time, the last
"EMetaBlob.replay updated dir" before it dies is always referring to
the same directory. Although interestingly that particular dir shows
up in the log thousands of times - the dir appears to be where a
desktop app is doing some analytics collecting - I don't know if
that's likely to be a red herring or the reason why the journal
appears to be so long. It's a dir I'd be quite happy to lose changes
to or remove from the file system altogether.

I'm loath to update during an outage although I have seen people
update the MDS code independently to get out of a scrape - I suspect
you wouldn't recommend that.

I feel like this leaves me with having to manipulate the journal in
some way, is there a nuclear option where I can choose to disregard
the uncommitted events? I assume that would be a journal reset with
the cephfs-journal-tool but I'm unclear on the impact of that, I'd
expect to lose any metadata changes that were made since my cluster
filled up but are there further implications? I also wonder what's the
riskier option, resetting the journal or attempting an update.

I'm very grateful for your help so far

Below is more of the debug 10 log with ops relating to the
aforementioned dir (name changed but inode is accurate):

2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
[2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
EMetaBlob.replay for [2,head] had [dentry
#0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
0x5654f82794a0]
2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
/path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
0x5654f8288a00]
2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal
EMetaBlob.replay dir 0x10009e1ec8e
2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal
EMetaBlob.replay updated dir [dir 0x10009e1ec8e
/path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
rc2020-10-22 08:46:44.932805 b17592 89215=89215+0)
hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100]
2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal
EMetaBlob.replay added (full) [dentry
#0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
[2,head] auth NULL (dversion lock) v=904149 inode=0
state=1610612800|bottomlru | dirty=1 0x56586df52f00]
2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal
EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head]
/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00]
2020-10-22 16:44:00.488918 7f424659e700 10
mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
2020-10-22 16:44:00.488920 7f424659e700 10 mds.0.journal
EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
auth v904149 dirtyparent s=0 n(v0 1=1+0) (iversion lock) |
dirtyparent=1 dirty=1 0x566ce168ce00]
2020-10-22 16:44:00.488924 7f424659e700 10 mds.0.journal
EMetaBlob.replay inotable tablev 481253 <= table 481328
2020-10-22 16:44:00.488926 7f424659e700 10 mds.0.journal
EMetaBlob.replay sessionmap v 240341131 <= table 240378576
2020-10-22 16:44:00.488927 7f424659e700 10 mds.0.journal
EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
2020-10-22 16:44:00.491462 7f424659e700 10 mds.0.log _replay
57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
EOpen [metablob 0x10009e1ec8e, 1881 dirs], 16748 open files
2020-10-22 16:44:00.491471 7f424659e700 10 mds.0.journal EOpen.replay
2020-10-22 16:44:00.491472 7f424659e700 10 mds.0.journal
EMetaBlob.replay 1881 dirlumps by unknown.0
2020-10-22 16:44:00.491475 7f424659e700 10 mds.0.journal
EMetaBlob.replay dir 0x10009e1ec8e
2020-10-22 16:44:00.491478 7f424659e700 10 mds.0.journal
EMetaBlob.replay updated dir [dir 0x10009e1ec8e
/path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
rc2020-10-22 08:46:44.932805 b17592 89215=89215+0)
hs=42927+1178,ss=0+0 dirty=2376 | child=1 0x5654f8289100]
2020-10-22 16:44:03.783487 7f424ada7700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
2020-10-22 16:44:03.784082 7f424fd2c700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
rtt 0.001000

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Anthony D'Atri

> Also, any thoughts/recommendations on 12TB OSD drives?  For price/capacity 
> this is a good size for us

Last I checked HDD prices seemed linear from 10-16TB.  Remember to include the 
cost of the drive bay, ie. the cost of the chassis, the RU(s) it takes up, 
power, switch ports etc. 

I’ll guess you’re talking LFF HDDs here and a 2U server?   You also don’t tell 
us how many nodes total, which affects blast radius decisions.

>  has the option of 4 x U.2 NVMe bays - each with 4 PCIe lanes, (and 8 SAS 
> bays)

Think about what that would do to your total $/TB, including the chassis, CPU, 
switch ports, rack space, etc.  Check if those bays are NVMe-only, or if they 
are tri-mode.

If you do go with NVMe for WAL+DB, 4 drives is overkill.  Performance-wise, 
assuming you use a quality NVMe drive and not some consumer-grade crap, you’re 
going to see sharply diminishing returns after just 1.  Or you could mirror 2 
as someone else describes.

But really, consider the hassles of maintaining partitions and mapping as 
drives fail.  When an HDD fails and you need to re-use its metadata partition 
for the replacement OSD, you have to be very careful when using a shared device 
that you re-use the original.  Honestly, depending on your use-case, consider 
whether using 24xSFF SATA SSDs might not be cost-competitive, factoring in 
hassle, the time you’ll spend waiting for HDDs to do backfill, etc.  With 
careful choices, and again depending on your undisclosed use-case, all-NVMe can 
with careful choices also be surprisingly cost-effective.  If your data is 
cold, QLC is an option.  With system vendors still pushing expensive RAID HBAs 
(they must have high margins), you could easily save $600 per chassis just by 
not having one.  Not having to monitor and deal with the BBU/supercap, etc.

>  but I'm wondering if my BlueFS spillovers are resulting from using drives 
> that are too big.  I also thought I might have seen some comments about 
> cutting large drives into multiple OSDs - could that be?

Don’t cut anything less than an NVMe drive into more than one OSD.  HDD seeks 
and IOPs are its bottlenecks and slicing/dicing isn’t going to work any magic.

imho,ymmv

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
I assume you aren't able to quickly double the RAM on this MDS ? or
failover to a new MDS with more ram?

Failing that, you shouldn't reset the journal without recovering
dentries, otherwise the cephfs_data objects won't be consistent with
the metadata.
The full procedure to be used is here:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts

 backup the journal, recover dentries, then reset the journal.
(the steps after might not be needed)

That said -- maybe there is a more elegant procedure than using
cephfs-journal-tool.  A cephfs dev might have better advice.

-- dan
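
For reference, that sequence from the linked disaster-recovery page is roughly
the following (a sketch; substitute your filesystem name for <fs>, and note
that older releases take these commands without the --rank argument):

# 1. back up the journal
cephfs-journal-tool --rank=<fs>:0 journal export backup.bin

# 2. recover dentries from the journal into the metadata store
cephfs-journal-tool --rank=<fs>:0 event recover_dentries summary

# 3. only then reset the journal
cephfs-journal-tool --rank=<fs>:0 journal reset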


On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
>
> I'm pretty sure it's replaying the same ops every time, the last
> "EMetaBlob.replay updated dir" before it dies is always referring to
> the same directory. Although interestingly that particular dir shows
> up in the log thousands of times - the dir appears to be where a
> desktop app is doing some analytics collecting - I don't know if
> that's likely to be a red herring or the reason why the journal
> appears to be so long. It's a dir I'd be quite happy to lose changes
> to or remove from the file system altogether.
>
> I'm loath to update during an outage although I have seen people
> update the MDS code independently to get out of a scrape - I suspect
> you wouldn't recommend that.
>
> I feel like this leaves me with having to manipulate the journal in
> some way, is there a nuclear option where I can choose to disregard
> the uncommitted events? I assume that would be a journal reset with
> the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> expect to lose any metadata changes that were made since my cluster
> filled up but are there further implications? I also wonder what's the
> riskier option, resetting the journal or attempting an update.
>
> I'm very grateful for your help so far
>
> Below is more of the debug 10 log with ops relating to the
> aforementioned dir (name changed but inode is accurate):
>
> 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [dentry
> #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> 0x5654f82794a0]
> 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
> b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
> 0x5654f8288a00]
> 2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
> state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
> rc2020-10-22 08:46:44.932805 b17592 89215=89215+0)
> hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100]
> 2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added (full) [dentry
> #0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> [2,head] auth NULL (dversion lock) v=904149 inode=0
> state=1610612800|bottomlru | dirty=1 0x56586df52f00]
> 2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00]
> 2020-10-22 16:44:00.488918 7f424659e700 10
> mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> 2020-10-22 16:44:00.488920 7f424659e700 10 mds.0.journal
> EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 dirtyparent s=0 n(v0 1=1+0) (iversion lock) |
> dirtyparent=1 dirty=1 0x566ce168ce00]
> 2020-10-22 16:44:00.488924 7f424659e700 10 mds.0.journal
> EMetaBlob.replay inotable tablev 481253 <= table 481328
> 2020-10-22 16:44:00.488926 7f424659e700 10 mds.0.journal
> EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> 2020-10-22 16:44:00.488927 7f424659e700 10 mds.0.journal
> EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> 2020-10-22 16:44:00.491462 7f424659e700 10 mds.0.log _replay
> 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> EOpen [metablob 0x10009e1ec8e, 1881 dirs], 16748 open files
> 2020-10-22 

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Anthony D'Atri



> Yeah, didn't think about a RAID10 really, although there wouldn't be enough 
> space for 8x300GB = 2400GB WAL/DBs.

300 is overkill for many applications anyway.

> 
> Also, using a RAID10 for WAL/DBs will:
> - make OSDs less movable between hosts (they'd have to be moved all 
> together - with 2 OSD per NVMe you can move them around in pairs

Why would you want to move them between hosts?  

> - You must really be sure your raid card is dependable. (sorry but I have 
> seen so much management problems with top-tier RAID cards I avoid them like 
> the plague).

This.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-22 Thread Wido den Hollander
Hi,

I already submitted a ticket: https://tracker.ceph.com/issues/47951

Maybe other people noticed this as well.

Situation:
- Cluster is running IPv6
- mon_host is set to a DNS entry
- DNS entry is a Round Robin with three AAAA-records

root@wido-standard-benchmark:~# ceph -s
unable to parse addrs in 'mon.objects.xx.xxx.net'
[errno 22] error connecting to the cluster
root@wido-standard-benchmark:~#

The relevant part of the ceph.conf:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
mon_host = mon.objects.xxx.xxx.xxx
ms_bind_ipv6 = true

This works fine with 14.2.11 and breaks under 14.2.12

Anybody else seeing this as well?

Wido
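
Until that's tracked down, a possible workaround (the addresses below are
placeholders) is to bypass the round-robin name and list the monitor
addresses explicitly:

[global]
# workaround sketch: explicit mon addresses instead of the RR DNS name
mon_host = 2001:db8::a, 2001:db8::b, 2001:db8::c
ms_bind_ipv6 = true
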
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Brian Topping


> On Oct 22, 2020, at 10:34 AM, Anthony D'Atri  wrote:
> 
>>- You must really be sure your raid card is dependable. (sorry but I have 
>> seen so much management problems with top-tier RAID cards I avoid them like 
>> the plague).
> 
> This.

I’d definitely avoid a RAID card. If I can do advanced encryption with an MMX 
instruction, I think I can certainly trust IOMMU to handle device multiplexing 
from software in an efficient manner, no? mdadm RAID is just fine for me and is 
reliably bootable from GRUB.

I’m not an expert in driver mechanics, but mirroring should be very low 
overhead at the software level.

Once it’s software RAID, moving disks between chassis is a simple process as 
well. 

Apologies I didn’t make that clear earlier...
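
As a concrete sketch of that (device names are examples): a simple md mirror
from which WAL/DB partitions or LVs can then be carved:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
# then partition /dev/md0 (or put LVM on it) and hand the pieces to
# ceph-volume as --block.db devices
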
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-22 Thread Jason Dillaman
This backport [1] looks suspicious as it was introduced in v14.2.12
and directly changes the initial MonMap code. If you revert it in a
dev build does it solve your problem?

[1] https://github.com/ceph/ceph/pull/36704

On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander  wrote:
>
> Hi,
>
> I already submitted a ticket: https://tracker.ceph.com/issues/47951
>
> Maybe other people noticed this as well.
>
> Situation:
> - Cluster is running IPv6
> - mon_host is set to a DNS entry
> - DNS entry is a Round Robin with three AAAA-records
>
> root@wido-standard-benchmark:~# ceph -s
> unable to parse addrs in 'mon.objects.xx.xxx.net'
> [errno 22] error connecting to the cluster
> root@wido-standard-benchmark:~#
>
> The relevant part of the ceph.conf:
>
> [global]
> auth_client_required = cephx
> auth_cluster_required = cephx
> auth_service_required = cephx
> mon_host = mon.objects.xxx.xxx.xxx
> ms_bind_ipv6 = true
>
> This works fine with 14.2.11 and breaks under 14.2.12
>
> Anybody else seeing this as well?
>
> Wido
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Thanks, guys

I can't add more RAM right now or have access to a server that does,
I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
and try and track down the thread you mentioned, Frank.

'cephfs-journal-tool journal inspect' tells me the journal is fine. I
was able to back it up cleanly, however the apparent size of the file
reported by du is 53TB, does that sound right to you? The actual size
is 3.7GB.

'cephfs-journal-tool event get list' starts listing events but
eventually gets killed as expected.

'cephfs-journal-tool event get summary'
Events by type:
  OPEN: 314260
  SUBTREEMAP: 1134
  UPDATE: 547973
Errors: 0

Those numbers seem really high to me - for reference this is an approx
128TB (usable space) cluster, 505 objects in metadata pool.

On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
>
> If you can't add RAM, you could try provisioning SWAP on a reasonably fast 
> drive. There is a thread from this year where someone had a similar problem, 
> the MDS running out of memory during replay. He could quickly add sufficient 
> swap and the MDS managed to come up. Took a long time though, but might be 
> faster than getting more RAM and will not lose data.
>
> Your clients will not be able to do much, if anything during recovery though.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 22 October 2020 18:11:57
> To: David C
> Cc: ceph-devel; ceph-users
> Subject: [ceph-users] Re: Urgent help needed please - MDS offline
>
> I assume you aren't able to quickly double the RAM on this MDS ? or
> failover to a new MDS with more ram?
>
> Failing that, you shouldn't reset the journal without recovering
> dentries, otherwise the cephfs_data objects won't be consistent with
> the metadata.
> The full procedure to be used is here:
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
>  backup the journal, recover dentries, then reset the journal.
> (the steps after might not be needed)
>
> That said -- maybe there is a more elegant procedure than using
> cephfs-journal-tool.  A cephfs dev might have better advice.
>
> -- dan
>
>
> On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
> >
> > I'm pretty sure it's replaying the same ops every time, the last
> > "EMetaBlob.replay updated dir" before it dies is always referring to
> > the same directory. Although interestingly that particular dir shows
> > up in the log thousands of times - the dir appears to be where a
> > desktop app is doing some analytics collecting - I don't know if
> > that's likely to be a red herring or the reason why the journal
> > appears to be so long. It's a dir I'd be quite happy to lose changes
> > to or remove from the file system altogether.
> >
> > I'm loath to update during an outage although I have seen people
> > update the MDS code independently to get out of a scrape - I suspect
> > you wouldn't recommend that.
> >
> > I feel like this leaves me with having to manipulate the journal in
> > some way, is there a nuclear option where I can choose to disregard
> > the uncommitted events? I assume that would be a journal reset with
> > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> > expect to lose any metadata changes that were made since my cluster
> > filled up but are there further implications? I also wonder what's the
> > riskier option, resetting the journal or attempting an update.
> >
> > I'm very grateful for your help so far
> >
> > Below is more of the debug 10 log with ops relating to the
> > aforementioned dir (name changed but inode is accurate):
> >
> > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> > 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> > 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> > 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay for [2,head] had [dentry
> > #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> > inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> > 0x5654f82794a0]
> > 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> > /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> > 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
> > b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
> > 0x5654f8288a00]
> > 2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay dir 0x10009e1ec8e
> > 2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> > /path/to/desktop/app/U

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
On Thu, 22 Oct 2020, 19:03 David C,  wrote:

> Thanks, guys
>
> I can't add more RAM right now or have access to a server that does,
> I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
> and try and track down the thread you mentioned, Frank.
>
> 'cephfs-journal-tool journal inspect' tells me the journal is fine. I
> was able to back it up cleanly, however the apparent size of the file
> reported by du is 53TB, does that sound right to you? The actual size
> is 3.7GB.
>

IIRC it's a sparse file. So yes that sounds normal.



> 'cephfs-journal-tool event get list' starts listing events but
> eventually gets killed as expected.
>


Does it go oom too?

.. dan




> 'cephfs-journal-tool event get summary'
> Events by type:
>   OPEN: 314260
>   SUBTREEMAP: 1134
>   UPDATE: 547973
> Errors: 0
>
> Those numbers seem really high to me - for reference this is an approx
> 128TB (usable space) cluster, 505 objects in metadata pool.
>
> On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
> >
> > If you can't add RAM, you could try provisioning SWAP on a reasonably
> fast drive. There is a thread from this year where someone had a similar
> problem, the MDS running out of memory during replay. He could quickly add
> sufficient swap and the MDS managed to come up. Took a long time though,
> but might be faster than getting more RAM and will not lose data.
> >
> > Your clients will not be able to do much, if anything during recovery
> though.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Dan van der Ster 
> > Sent: 22 October 2020 18:11:57
> > To: David C
> > Cc: ceph-devel; ceph-users
> > Subject: [ceph-users] Re: Urgent help needed please - MDS offline
> >
> > I assume you aren't able to quickly double the RAM on this MDS ? or
> > failover to a new MDS with more ram?
> >
> > Failing that, you shouldn't reset the journal without recovering
> > dentries, otherwise the cephfs_data objects won't be consistent with
> > the metadata.
> > The full procedure to be used is here:
> >
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> >
> >  backup the journal, recover dentries, then reset the journal.
> > (the steps after might not be needed)
> >
> > That said -- maybe there is a more elegant procedure than using
> > cephfs-journal-tool.  A cephfs dev might have better advice.
> >
> > -- dan
> >
> >
> > On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
> > >
> > > I'm pretty sure it's replaying the same ops every time, the last
> > > "EMetaBlob.replay updated dir" before it dies is always referring to
> > > the same directory. Although interestingly that particular dir shows
> > > up in the log thousands of times - the dir appears to be where a
> > > desktop app is doing some analytics collecting - I don't know if
> > > that's likely to be a red herring or the reason why the journal
> > > appears to be so long. It's a dir I'd be quite happy to lose changes
> > > to or remove from the file system altogether.
> > >
> > > I'm loath to update during an outage although I have seen people
> > > update the MDS code independently to get out of a scrape - I suspect
> > > you wouldn't recommend that.
> > >
> > > I feel like this leaves me with having to manipulate the journal in
> > > some way, is there a nuclear option where I can choose to disregard
> > > the uncommitted events? I assume that would be a journal reset with
> > > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> > > expect to lose any metadata changes that were made since my cluster
> > > filled up but are there further implications? I also wonder what's the
> > > riskier option, resetting the journal or attempting an update.
> > >
> > > I'm very grateful for your help so far
> > >
> > > Below is more of the debug 10 log with ops relating to the
> > > aforementioned dir (name changed but inode is accurate):
> > >
> > > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> > > EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> > > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> > > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> > > 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> > > 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> > > 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> > > EMetaBlob.replay for [2,head] had [dentry
> > > #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> > > inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> > > 0x5654f82794a0]
> > > 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> > > EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> > > /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> > > 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:4

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
On Thu, Oct 22, 2020 at 6:09 PM Dan van der Ster  wrote:
>
>
>
> On Thu, 22 Oct 2020, 19:03 David C,  wrote:
>>
>> Thanks, guys
>>
>> I can't add more RAM right now or have access to a server that does,
>> I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
>> and try and track down the thread you mentioned, Frank.
>>
>> 'cephfs-journal-tool journal inspect' tells me the journal is fine. I
>> was able to back it up cleanly, however the apparent size of the file
>> reported by du is 53TB, does that sound right to you? The actual size
>> is 3.7GB.
>
>
> IIRC it's a sparse file. So yes that sounds normal.
>
>
>>
>> 'cephfs-journal-tool event get list' starts listing events but
>> eventually gets killed as expected.
>
>
>
> Does it go oom too?

Yep the cephfs_journal_tool process gets killed
>
> .. dan
>
>
>
>>
>> 'cephfs-journal-tool event get summary'
>> Events by type:
>>   OPEN: 314260
>>   SUBTREEMAP: 1134
>>   UPDATE: 547973
>> Errors: 0
>>
>> Those numbers seem really high to me - for reference this is an approx
>> 128TB (usable space) cluster, 505 objects in metadata pool.
>>
>> On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
>> >
>> > If you can't add RAM, you could try provisioning SWAP on a reasonably fast 
>> > drive. There is a thread from this year where someone had a similar 
>> > problem, the MDS running out of memory during replay. He could quickly add 
>> > sufficient swap and the MDS managed to come up. Took a long time though, 
>> > but might be faster than getting more RAM and will not lose data.
>> >
>> > Your clients will not be able to do much, if anything during recovery 
>> > though.
>> >
>> > Best regards,
>> > =
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > 
>> > From: Dan van der Ster 
>> > Sent: 22 October 2020 18:11:57
>> > To: David C
>> > Cc: ceph-devel; ceph-users
>> > Subject: [ceph-users] Re: Urgent help needed please - MDS offline
>> >
>> > I assume you aren't able to quickly double the RAM on this MDS ? or
>> > failover to a new MDS with more ram?
>> >
>> > Failing that, you shouldn't reset the journal without recovering
>> > dentries, otherwise the cephfs_data objects won't be consistent with
>> > the metadata.
>> > The full procedure to be used is here:
>> > https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>> >
>> >  backup the journal, recover dentries, then reset the journal.
>> > (the steps after might not be needed)
>> >
>> > That said -- maybe there is a more elegant procedure than using
>> > cephfs-journal-tool.  A cephfs dev might have better advice.
>> >
>> > -- dan
>> >
>> >
>> > On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
>> > >
>> > > I'm pretty sure it's replaying the same ops every time, the last
>> > > "EMetaBlob.replay updated dir" before it dies is always referring to
>> > > the same directory. Although interestingly that particular dir shows
>> > > up in the log thousands of times - the dir appears to be where a
>> > > desktop app is doing some analytics collecting - I don't know if
>> > > that's likely to be a red herring or the reason why the journal
>> > > appears to be so long. It's a dir I'd be quite happy to lose changes
>> > > to or remove from the file system altogether.
>> > >
>> > > I'm loath to update during an outage although I have seen people
>> > > update the MDS code independently to get out of a scrape - I suspect
>> > > you wouldn't recommend that.
>> > >
>> > > I feel like this leaves me with having to manipulate the journal in
>> > > some way, is there a nuclear option where I can choose to disregard
>> > > the uncommitted events? I assume that would be a journal reset with
>> > > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
>> > > expect to lose any metadata changes that were made since my cluster
>> > > filled up but are there further implications? I also wonder what's the
>> > > riskier option, resetting the journal or attempting an update.
>> > >
>> > > I'm very grateful for your help so far
>> > >
>> > > Below is more of the debug 10 log with ops relating to the
>> > > aforementioned dir (name changed but inode is accurate):
>> > >
>> > > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
>> > > EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
>> > > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
>> > > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
>> > > 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
>> > > 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
>> > > 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
>> > > EMetaBlob.replay for [2,head] had [dentry
>> > > #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
>> > > inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
>> > > 0x5654f82794a0]
>> > > 20

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Dan van der Ster
On Thu, 22 Oct 2020, 19:14 David C,  wrote:

> On Thu, Oct 22, 2020 at 6:09 PM Dan van der Ster 
> wrote:
> >
> >
> >
> > On Thu, 22 Oct 2020, 19:03 David C,  wrote:
> >>
> >> Thanks, guys
> >>
> >> I can't add more RAM right now or have access to a server that does,
> >> I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
> >> and try and track down the thread you mentioned, Frank.
> >>
> >> 'cephfs-journal-tool journal inspect' tells me the journal is fine. I
> >> was able to back it up cleanly, however the apparent size of the file
> >> reported by du is 53TB, does that sound right to you? The actual size
> >> is 3.7GB.
> >
> >
> > IIRC it's a sparse file. So yes that sounds normal.
> >
> >
> >>
> >> 'cephfs-journal-tool event get list' starts listing events but
> >> eventually gets killed as expected.
> >
> >
> >
> > Does it go oom too?
>
> Yep the cephfs_journal_tool process gets killed
> >
>

So yeah you can infer that the dentries cmd will oom similarly.

Load up with swap and try the up:replay route.
Set the beacon to 10 until it finishes.

Good luck,

Dan
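
A quick way to provision that swap on a spare fast device or file (sizes and
paths are examples; fallocate-backed swap files are fine on ext4/xfs):

fallocate -l 128G /var/swapfile
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile
swapon --show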





> .. dan
> >
> >
> >
> >>
> >> 'cephfs-journal-tool event get summary'
> >> Events by type:
> >>   OPEN: 314260
> >>   SUBTREEMAP: 1134
> >>   UPDATE: 547973
> >> Errors: 0
> >>
> >> Those numbers seem really high to me - for reference this is an approx
> >> 128TB (usable space) cluster, 505 objects in metadata pool.
> >>
> >> On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
> >> >
> >> > If you can't add RAM, you could try provisioning SWAP on a reasonably
> fast drive. There is a thread from this year where someone had a similar
> problem, the MDS running out of memory during replay. He could quickly add
> sufficient swap and the MDS managed to come up. Took a long time though,
> but might be faster than getting more RAM and will not lose data.
> >> >
> >> > Your clients will not be able to do much, if anything during recovery
> though.
> >> >
> >> > Best regards,
> >> > =
> >> > Frank Schilder
> >> > AIT Risø Campus
> >> > Bygning 109, rum S14
> >> >
> >> > 
> >> > From: Dan van der Ster 
> >> > Sent: 22 October 2020 18:11:57
> >> > To: David C
> >> > Cc: ceph-devel; ceph-users
> >> > Subject: [ceph-users] Re: Urgent help needed please - MDS offline
> >> >
> >> > I assume you aren't able to quickly double the RAM on this MDS ? or
> >> > failover to a new MDS with more ram?
> >> >
> >> > Failing that, you shouldn't reset the journal without recovering
> >> > dentries, otherwise the cephfs_data objects won't be consistent with
> >> > the metadata.
> >> > The full procedure to be used is here:
> >> >
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> >> >
> >> >  backup the journal, recover dentries, then reset the journal.
> >> > (the steps after might not be needed)
> >> >
> >> > That said -- maybe there is a more elegant procedure than using
> >> > cephfs-journal-tool.  A cephfs dev might have better advice.
> >> >
> >> > -- dan
> >> >
> >> >
> >> > On Thu, Oct 22, 2020 at 6:03 PM David C 
> wrote:
> >> > >
> >> > > I'm pretty sure it's replaying the same ops every time, the last
> >> > > "EMetaBlob.replay updated dir" before it dies is always referring to
> >> > > the same directory. Although interestingly that particular dir shows
> >> > > up in the log thousands of times - the dir appears to be where a
> >> > > desktop app is doing some analytics collecting - I don't know if
> >> > > that's likely to be a red herring or the reason why the journal
> >> > > appears to be so long. It's a dir I'd be quite happy to lose changes
> >> > > to or remove from the file system altogether.
> >> > >
> >> > > I'm loath to update during an outage although I have seen people
> >> > > update the MDS code independently to get out of a scrape - I suspect
> >> > > you wouldn't recommend that.
> >> > >
> >> > > I feel like this leaves me with having to manipulate the journal in
> >> > > some way, is there a nuclear option where I can choose to disregard
> >> > > the uncommitted events? I assume that would be a journal reset with
> >> > > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> >> > > expect to lose any metadata changes that were made since my cluster
> >> > > filled up but are there further implications? I also wonder what's
> the
> >> > > riskier option, resetting the journal or attempting an update.
> >> > >
> >> > > I'm very grateful for your help so far
> >> > >
> >> > > Below is more of the debug 10 log with ops relating to the
> >> > > aforementioned dir (name changed but inode is accurate):
> >> > >
> >> > > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> >> > > EMetaBlob.replay updated dir [dir 0x10009e1ec8d
> /path/to/desktop/app/
> >> > > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> >> > > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805
> b13

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Dave Hall

Eneko,

On 10/22/2020 11:14 AM, Eneko Lacunza wrote:

Hi Dave,

El 22/10/20 a las 16:48, Dave Hall escribió:

Hello,

(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a 
couple questions about hardware choices.  Our original nodes were 8 x 
12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.


We chose the NVMe card for performance since it has an 8 lane PCIe 
interface.  However, we're currently seeing BlueFS spillovers.


The Tyan chassis we are considering has the option of 4 x U.2 NVMe 
bays - each with 4 PCIe lanes (and 8 SAS bays).  It has occurred to 
me that I might stripe four 1TB NVMe drives together to get much more 
space for WAL/DB and a net 16 PCIe lanes of performance.


Any thoughts on this approach?
Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use 1 
NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for each 
OSD (see related threads on this mailing list about why that exact size).


This way, if an NVMe fails, you'll only lose 2 OSDs.
I was under the impression that everything that BlueStore puts on the 
SSD/NVMe could be reconstructed from information on the OSD. Am I 
mistaken about this?  If so, my single 1.6TB NVMe card is equally 
vulnerable.


Also, what size of WAL/DB partitions do you have now, and what 
spillover size?


I recently posted another question to the list on this topic, since I 
now have spillover on 7 of 24 OSDs.  Since the data layout on the NVMe 
for BlueStore is not traditional, I've never quite figured out how to 
get this information.  The current partition size is 1.6TB/12, since we 
had the possibility to add four more drives to each node.  How that was 
divided between WAL, DB, etc. is something I'd like to be able to 
understand.  However, we're not going to add the extra 4 drives, so 
expanding the LVM partitions is now a possibility.
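
For what it's worth, a minimal sketch of how the split can be inspected, 
assuming the admin socket is reachable on the OSD host and osd.7 stands in 
for one of the affected OSDs (the bluefs counters below exist at least on 
Nautilus; names may differ slightly between releases):

# ceph health detail | grep -i spillover
# ceph daemon osd.7 perf dump bluefs | egrep '(db|wal|slow)_(total|used)_bytes'

db_used_bytes vs. db_total_bytes shows how full the DB volume is, and a 
non-zero slow_used_bytes should indicate spillover onto the data disk.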






Also, any thoughts/recommendations on 12TB OSD drives?  For 
price/capacity this is a good size for us, but I'm wondering if my 
BlueFS spillovers are resulting from using drives that are too big.  
I also thought I might have seen some comments about cutting large 
drives into multiple OSDs - could that be?


Not using such big disks here, sorry :) (no need for that much space)

Cheers




[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Eneko Lacunza

Hi Dave,

El 22/10/20 a las 16:48, Dave Hall escribió:

Hello,

(BTW, Nautilus 14.2.7 on Debian non-container.)

We're about to purchase more OSD nodes for our cluster, but I have a 
couple questions about hardware choices.  Our original nodes were 8 x 
12TB SAS drives and a 1.6TB Samsung NVMe card for WAL, DB, etc.


We chose the NVMe card for performance since it has an 8 lane PCIe 
interface.  However, we're currently seeing BlueFS spillovers.


The Tyan chassis we are considering has the option of 4 x U.2 NVMe 
bays - each with 4 PCIe lanes (and 8 SAS bays).  It has occurred to 
me that I might stripe four 1TB NVMe drives together to get much more 
space for WAL/DB and a net 16 PCIe lanes of performance.


Any thoughts on this approach?
Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use 1 
NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for each 
OSD (see related threads on this mailing list about why that exact size).


This way, if an NVMe fails, you'll only lose 2 OSDs.
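
A minimal sketch of that layout with ceph-volume, assuming /dev/sdb and 
/dev/sdc are two of the SAS drives and /dev/nvme0n1 is the shared NVMe 
(device names are placeholders):

# vgcreate ceph-db-0 /dev/nvme0n1
# lvcreate -L 300G -n db-sdb ceph-db-0     # 300GB DB LV for the OSD on /dev/sdb
# lvcreate -L 300G -n db-sdc ceph-db-0     # 300GB DB LV for the OSD on /dev/sdc
# ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db-0/db-sdb
# ceph-volume lvm create --bluestore --data /dev/sdc --block.db ceph-db-0/db-sdc

The same thing can probably be done in one go with 'ceph-volume lvm batch 
... --db-devices /dev/nvme0n1', but the explicit LVs make the 300GB sizing 
obvious.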

Also, what size of WAL/DB partitions do you have now, and what spillover 
size?




Also, any thoughts/recommendations on 12TB OSD drives?  For 
price/capacity this is a good size for us, but I'm wondering if my 
BlueFS spillovers are resulting from using drives that are too big.  I 
also thought I might have seen some comments about cutting large 
drives into multiple OSDs - could that be?


Not using such big disks here, sorry :) (no need for that much space)

Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas
Done.  I gave it 4 PGs (I read somewhere that PG counts should be 
powers of 2), and restarted the mgr.  I still don't see any traffic 
to the pool, though I'm also unsure how much traffic is to be expected.
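
For reference, a sketch of the recreation, assuming the pool name the 
Octopus devicehealth module expects is device_health_metrics and reusing 
the replicated_host_nvme rule mentioned elsewhere in this thread:

# ceph osd pool create device_health_metrics 4 4 replicated replicated_host_nvme
# ceph osd pool stats device_health_metrics    # watch for client I/O after restarting the mgr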


--Mike

On 10/22/20 2:32 AM, Frank Schilder wrote:

Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:

Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign; the crush rule is OK. The fact that 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

# osdmaptool osd.map --test-map-pgs-dump --pool 1


https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.


or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there led to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.


That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.


In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

# ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.


I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


I have OSDs ready to add to the pool, in case you think we should try.


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).


I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.
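
For reference, a minimal sketch of how such a session can be identified and 
evicted (the mds daemon name and client id are placeholders; eviction 
blacklists the client by default):

# ceph daemon mds.<name> dump_ops_in_flight       # shows which client the stuck ops belong to
# ceph tell mds.<name> client ls                  # map the client id to a host/mount
# ceph tell mds.<name> client evict id=<client-id>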

Regards,

--Mike


From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:

Dear Michael,


Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD 
mapping?


I meant here with crush rule replicated_host_nvme. Sorry, forgot.


Seems to have worked fine:

https://pastebin.com/PFgDE4J1


Yes, the OSD was still out when the previous health report was created.


Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:


from https://pastebin.com/3G3ij9ui:

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas

On 10/22/20 3:22 AM, Frank Schilder wrote:

Could you also execute (and post the output of)

   # osdmaptool osd.map --test-map-pgs-dump --pool 7


osdmaptool dumped core.  Here is stdout:

https://pastebin.com/HPtSqcS1

The PG map for 7.39d matches the pg dump, with the expected difference 
of 2147483647 -> NONE.


...and here is stderr:

https://pastebin.com/CrtwE54r
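
For completeness, the live cluster's view of the same PG can be pulled with 
the commands below; the missing OSD shows up there as 2147483647 (or -1, 
depending on the output format), matching the pg dump:

# ceph pg map 7.39d
# ceph pg dump pgs_brief | grep '^7.39d'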

Regards,

--Mike


with the osd map you pulled out (pool 7 should be the fs data pool)? Please 
check what mapping is reported for PG 7.39d? Just checking if osd map and pg 
dump agree here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 22 October 2020 09:32:07
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:

Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign; the crush rule is OK. The fact that 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

# osdmaptool osd.map --test-map-pgs-dump --pool 1


https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.


or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there led to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.


That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.


In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

# ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.


I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


I have OSDs ready to add to the pool, in case you think we should try.


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).


I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike


From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io

[ceph-users] Re: Hardware for new OSD nodes.

2020-10-22 Thread Eneko Lacunza

Hi Brian,

El 22/10/20 a las 17:50, Brian Topping escribió:



On Oct 22, 2020, at 9:14 AM, Eneko Lacunza > wrote:


Don't stripe them; if one NVMe fails you'll lose all OSDs. Just use 1 
NVMe drive for 2 SAS drives and provision 300GB for WAL/DB for each 
OSD (see related threads on this mailing list about why that exact size).


This way, if an NVMe fails, you'll only lose 2 OSDs.

Also, what size of WAL/DB partitions do you have now, and what 
spillover size?


Generally agreed against making a single giant striped bucket.

Note this may be a good use for RAID10 on WAL/DB if you are committed 
to multiple disks.


I generally put WAL/DB on RAID10 boot disks. It’s important to have 
reliable WAL/DB, but also important that the machine actually boots in 
the first place. With enough RAM and non-interactive use, most of the 
boot bits will be cached so there is no contention for the channel.


Happy for any critique on this as well!


Yeah, didn't think about a RAID10 really, although there wouldn't be 
enough space for 8x300GB = 2400GB WAL/DBs.


I usually also use the boot disk for WAL/DBs; it happens that our clusters 
are small and the nodes not very dense.


Also, using a RAID10 for WAL/DBs will:
    - make OSDs less movable between hosts (they'd have to be moved all 
together - with 2 OSDs per NVMe you can move them around in pairs, 
although there would be data movement for sure)
    - provide half the IOPS/bandwidth for WAL/DB (I think there would be 
plenty for SAS magnetic drives though)

    + make WAL/DBs safer (one disk failure won't lose any OSD)
    - require a RAID card you can really depend on (sorry, but I have seen 
so many management problems with top-tier RAID cards that I avoid them 
like the plague).


But it is an interesting idea nonetheless.

Cheers

--
Eneko Lacunza| +34 943 569 206
 | elacu...@binovo.es
Zuzendari teknikoa   | https://www.binovo.es
Director técnico | Astigarragako Bidea, 2 - 2º izda.
BINOVO IT HUMAN PROJECT S.L  | oficina 10-11, 20180 Oiartzun



[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
If you can't add RAM, you could try provisioning swap on a reasonably fast 
drive. There is a thread from this year where someone had a similar problem, 
the MDS running out of memory during replay. He was able to quickly add enough 
swap and the MDS managed to come up. It took a long time, but it might be 
faster than getting more RAM and will not lose data.

Your clients will not be able to do much, if anything during recovery though.
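
A minimal sketch of that approach, assuming a 128G swap file on a fast local 
disk and a beacon grace of 600s (adjust both to taste; on pre-Mimic releases 
'ceph config set' is not available, hence the injectargs form):

# dd if=/dev/zero of=/swapfile bs=1M count=131072
# chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
# ceph tell mon.\* injectargs '--mds_beacon_grace=600'   # keep the mons from marking the MDS laggy during the long replay

Remember to revert the beacon grace once the MDS is active again.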

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dan van der Ster 
Sent: 22 October 2020 18:11:57
To: David C
Cc: ceph-devel; ceph-users
Subject: [ceph-users] Re: Urgent help needed please - MDS offline

I assume you aren't able to quickly double the RAM on this MDS ? or
failover to a new MDS with more ram?

Failing that, you shouldn't reset the journal without recovering
dentries, otherwise the cephfs_data objects won't be consistent with
the metadata.
The full procedure to be used is here:
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts

 backup the journal, recover dentries, then reset the journal.
(the steps after might not be needed)
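
In concrete terms that procedure is roughly the following sketch (newer 
releases also want an explicit --rank=<fsname>:0 on these commands):

# cephfs-journal-tool journal export backup.bin
# cephfs-journal-tool event recover_dentries summary
# cephfs-journal-tool journal reset

Only run the reset after the export and recover_dentries steps have completed.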

That said -- maybe there is a more elegant procedure than using
cephfs-journal-tool.  A cephfs dev might have better advice.

-- dan


On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
>
> I'm pretty sure it's replaying the same ops every time, the last
> "EMetaBlob.replay updated dir" before it dies is always referring to
> the same directory. Although interestingly that particular dir shows
> up in the log thousands of times - the dir appears to be where a
> desktop app is doing some analytics collecting - I don't know if
> that's likely to be a red herring or the reason why the journal
> appears to be so long. It's a dir I'd be quite happy to lose changes
> to or remove from the file system altogether.
>
> I'm loath to update during an outage although I have seen people
> update the MDS code independently to get out of a scrape - I suspect
> you wouldn't recommend that.
>
> I feel like this leaves me with having to manipulate the journal in
> some way, is there a nuclear option where I can choose to disregard
> the uncommitted events? I assume that would be a journal reset with
> the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> expect to lose any metadata changes that were made since my cluster
> filled up but are there further implications? I also wonder what's the
> riskier option, resetting the journal or attempting an update.
>
> I'm very grateful for your help so far
>
> Below is more of the debug 10 log with ops relating to the
> aforementioned dir (name changed but inode is accurate):
>
> 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [dentry
> #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> 0x5654f82794a0]
> 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> EMetaBlob.replay for [2,head] had [inode 0x10009e1ec8e [...2,head]
> /path/to/desktop/app/Upload/ auth v911967 f(v0 m2020-10-22
> 08:46:44.932805 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805
> b17592 89216=89215+1) (iversion lock) | dirfrag=1 dirty=1
> 0x5654f8288a00]
> 2020-10-22 16:44:00.44 7f424659e700 10 mds.0.journal
> EMetaBlob.replay dir 0x10009e1ec8e
> 2020-10-22 16:44:00.45 7f424659e700 10 mds.0.journal
> EMetaBlob.replay updated dir [dir 0x10009e1ec8e
> /path/to/desktop/app/Upload/ [2,head] auth v=904150 cv=0/0
> state=1073741824 f(v0 m2020-10-22 08:46:44.932805 89215=89215+0) n(v2
> rc2020-10-22 08:46:44.932805 b17592 89215=89215+0)
> hs=42926+1178,ss=0+0 dirty=2375 | child=1 0x5654f8289100]
> 2020-10-22 16:44:00.488898 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added (full) [dentry
> #0x1/path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> [2,head] auth NULL (dversion lock) v=904149 inode=0
> state=1610612800|bottomlru | dirty=1 0x56586df52f00]
> 2020-10-22 16:44:00.488911 7f424659e700 10 mds.0.journal
> EMetaBlob.replay added [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/Upload/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp
> auth v904149 s=0 n(v0 1=1+0) (iversion lock) 0x566ce168ce00]
> 2020-10-22 16:44:00.488918 7f424659e700 10
> mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> 2020-10-22 16:44:00.488920 7f424659e700 10 mds.0.journal
> EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> /path/to/desktop/app/

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread Frank Schilder
The post was titled "mds behind on trimming - replay until memory exhausted".

> Load up with swap and try the up:replay route.
> Set the beacon to 10 until it finishes.

Good point! The MDS will not send beacons for a long time. Same was necessary 
in the other case.

Good luck!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] Switch docker image?

2020-10-22 Thread Harry G. Coin
This has got to be ceph/docker "101" but I can't find the answer in the
docs and need help.

The latest docker octopus images support using the ntpsec time daemon. 
The default stable octopus image doesn't as yet.

I want to add a mon to a cluster that needs to use ntpsec  (just go with
it..), so I need the  ceph/daemon-base:octopus-latest docker image.

Could someone offer the [cephadm ?  ceph orch ? ]  command sequence
necessary to add the mon to an existing cluster using a specific docker
image that's not the one used elsewhere?

Thanks!

