Hi,
I'm having a strange problem with the orchestrator. My cluster has the
following OSD services configured based on certain attributes of the disks:
NAME PORTS RUNNING REFRESHED AGE PLACEMENT
...
osd.osd-default-hdd 1351 2m ago 22m label:osd;HOSTPREFIX*
osd
2, Gregory Orange wrote:
On 18/12/24 02:30, Janek Bevendorff wrote:
I did increase the pgp_num of a pool a while back, totally forgot
about that. Due to the ongoing rebalancing it was stuck half way, but
now suddenly started up again. The current PG number of that pool is
not quite final yet, but definitely higher than previously.
I'll keep this running over night and see where it settles.
Thanks so far!
Janek
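A rough way to check whether such a pgp_num change is still being applied is
to compare the current values against their targets in the pool listing. A
minimal sketch, assuming the ceph CLI is reachable and that the JSON field
names (pool_name, pg_placement_num, pg_placement_num_target) match your
release:

    # Sketch: report pools whose pgp_num has not yet reached its target.
    import json
    import subprocess

    pools = json.loads(subprocess.run(
        ["ceph", "osd", "pool", "ls", "detail", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout)

    for p in pools:
        cur = p["pg_placement_num"]
        target = p.get("pg_placement_num_target", cur)
        if cur != target:
            print(f'{p["pool_name"]}: pgp_num {cur} -> target {target}')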
Josh
On Tue, Dec 17,
Something's not quite right yet. I got the remapped PGs down from > 4000
to around 1300, but there it stops. When I restart the process, I can
get it down to around 280, but there it stops and creeps back up afterwards.
I have a bunch of these messages in the output:
WARNING: pg 100.3d53: conf
t the MGRs recreate it?? (is it?)
On 17/12/2024 17:01, Janek Bevendorff wrote:
I checked the ceph osd dump json-pretty output and validated it with a
little Python script. Turns out, there's this somewhere around line 1200:
"read_balance": {
ary_affinity_weighted": 1
}
The inf values seem to be the problem. These are the only two invalid
JSON values in the whole file. Do you happen to know how I can debug/fix
this?
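For reference, the kind of validation described above can be reproduced with a
few lines of Python; this is a minimal sketch (not the script from the
thread), assuming the ceph CLI is available. Strict parsing fails on a bare
inf just as Go's encoding/json does, and a second pass flags any non-standard
tokens that Python itself would tolerate:

    # Sketch: fetch "ceph osd dump -f json" and locate values that are not
    # strict JSON (e.g. bare inf/nan), which break Go-based tools.
    import json
    import re
    import subprocess

    dump = subprocess.run(
        ["ceph", "osd", "dump", "-f", "json"],
        check=True, capture_output=True, text=True,
    ).stdout

    try:
        json.loads(dump)
        print("osd dump parses as strict JSON")
    except json.JSONDecodeError as e:
        print(f"parse error at line {e.lineno}, column {e.colno}: {e.msg}")
        print(dump[max(e.pos - 80, 0):e.pos + 80])

    # Python accepts Infinity/NaN by default, so flag those explicitly too.
    for m in re.finditer(r'-?\b(inf|nan|Infinity|NaN)\b', dump):
        line_no = dump.count("\n", 0, m.start()) + 1
        print(f"non-JSON token {m.group(0)!r} on line {line_no}")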
On 17/12/2024 16:17, Janek Bevendorff wrote:
Thanks. I tried running the command (dry run for now), but something's
not working as expected. Have you ever seen this?
$ /root/go/bin/pgremapper cancel-backfill --verbose
** executing: ceph osd dump -f json
panic: invalid character 'i' looking for beginning of value
goroutine 1 [running]:
mai
Thanks for your replies!
You can use pg-remapper (https://github.com/digitalocean/pgremapper)
or similar tools to cancel the remapping; up-map entries will be
created that reflect the current state of the cluster. After all
currently running backfills are finished your mons should not be
bloc
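To make the mechanism concrete: cancelling backfill this way means adding
pg-upmap-items entries that map each remapped PG's up set back onto its
current acting set, so no data has to move. A simplified sketch of that
calculation (not pgremapper's actual code; PG and OSD ids are made up):

    # Build the "ceph osd pg-upmap-items" call that pins a remapped PG to its
    # current acting set (a simplified version of what pgremapper automates).
    CRUSH_ITEM_NONE = 2147483647  # placeholder for a missing OSD in acting

    def upmap_pairs(up, acting):
        """(from_osd, to_osd) pairs that remap the up set back onto acting."""
        return [(u, a) for u, a in zip(up, acting)
                if u != a and a != CRUSH_ITEM_NONE]

    def upmap_command(pgid, up, acting):
        pairs = upmap_pairs(up, acting)
        if not pairs:
            return None
        return f"ceph osd pg-upmap-items {pgid} " + " ".join(
            f"{u} {a}" for u, a in pairs)

    # Hypothetical PG being backfilled from OSD 12 to OSD 57:
    print(upmap_command("100.3d", up=[57, 4, 9], acting=[12, 4, 9]))
    # -> ceph osd pg-upmap-items 100.3d 57 12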
Hi all,
We moved our Ceph cluster to a new data centre about three months ago,
which completely changed its physical topology. I changed the CRUSH map
accordingly so that the CRUSH location matches the physical location
again and the cluster has been rebalancing ever since. Due to capacity
li
sts with a command?
Zitat von Janek Bevendorff :
Thanks. I increased the number even further and got a (literal)
handful of non-debug messages. Unfortunately, none were relevant for
the problem I'm trying to debug.
On 13/08/2024 14:03, Eugen Block wrote:
Interesting, apparently the number
take a closer look if and
how that can be handled properly.
Zitat von Janek Bevendorff :
That's where the 'ceph log last' commands should help you out, but
I don't know why you don't see it, maybe increase the number of
lines to display or something?
BTW, which
That's where the 'ceph log last' commands should help you out, but I
don't know why you don't see it, maybe increase the number of lines to
display or something?
BTW, which ceph version are we talking about here?
reef.
I tried ceph log last 100 debug cluster and that gives me the usual DBG
Thanks all.
ceph log last 10 warn cluster
That outputs nothing for me. Any docs about this?
I don't have much to comment about logging, I feel you though. I just
wanted to point out that the details about the large omap object
should be in the (primary) OSD log, not in the MON log:
The me
Hi,
I have a bunch of long-standing struggles with the way Ceph handles
logging and I cannot figure out how to solve them. These issues are
basically the following:
- The log config options are utterly confusing and very badly documented
- Mon file logs are spammed with DBG-level cluster log
ame. I checked whether cephadm deploy perhaps
has an undocumented flag for setting the service name, but couldn't find
any. I could run deploy, change the service name and then restart the
service, but that's quite ugly. Any better ideas?
Janek
Zitat von Janek Bevendorff :
Hi Eugen
host": ["XXX"], "target":
["mon-mgr", ""]} to mon-mgr
Created no osd(s) on host XXX; already created?
I suspect that it doesn't work for OSDs that are not explicitly marked
as managed by ceph orch. But how do I do that?
I also commented the tr
wouldn't be an option either, even if I were willing to deploy the admin
key on the OSD hosts.
On 07/11/2023 11:41, Janek Bevendorff wrote:
Hi,
We have our cluster RAM-booted, so we start from a clean slate after
every reboot. That means I need to redeploy all OSD daemons as well. At
the moment, I run cephadm deploy via Salt on the rebooted node, which
brings the deployed OSDs back up, but the problem with this is that the
deploy
Hey all,
My Ceph cluster is managed mostly by cephadm / ceph orch to avoid
circular dependencies in our infrastructure deployment. Our
RadosGW endpoints, however, are managed by Kubernetes, since it provides
proper load balancing and service health checks.
This leaves me in the unsat
Hi,
I recently moved from a manual Ceph deployment using Saltstack to a
hybrid of Saltstack and cephadm / ceph orch. We are provisioning our
Ceph hosts using a stateless PXE RAM root, so I definitely need
Saltstack to bootstrap at least the Ceph APT repository and the MON/MGR
deployment. Afte
Yes. If you've seen this reoccur multiple times, you can expect it will
only get worse with time. You should replace the disk soon. Very often
these disks are also starting to slow down other operations in the
cluster as the read times increase.
On 26/09/2023 13:17, Jorge JP wrote:
Hello,
F
seems to do more harm than good. I can see why it's there, but the
implementation appears to be rather buggy.
I also set mds_session_blocklist_on_timeout to false, because I had the
impression that clients were being blocklisted too quickly.
On 21/09/2023 09:24, Janek Bevendorff wrote
ous
email?
On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar
wrote:
> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> janek.bevendo...@uni-weimar.de> wrote:
>
>> Hi Venky,
>>
>> As I said: There a
that the
affected hosts are usually the same, but I have absolutely no clue why.
Janek
On 19/09/2023 12:36, Venky Shankar wrote:
Hi Janek,
On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff
wrote:
Thanks! However, I still don't really understand why I am seeing this.
This is due to
able.
On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff
wrote:
Hey all,
Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
10 client(s) laggy due to laggy OSDs
ceph health detail shows it as:
[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
mds.***(mds.3): Client *** is laggy; not evicted because some
OSD(s) is/are laggy
Hi Patrick,
The event log size of 3/5 MDS is also very high, still. mds.1, mds.3,
and mds.4 report between 4 and 5 million events, mds.0 around 1.4
million and mds.2 between 0 and 200,000. The numbers have been constant
since my last MDS restart four days ago.
I ran your ceph-gather.sh script a couple of times, but dumps only
mds.0. Should I modify it to dump mds.3 instead so you can have a look?
Janek
On 10/06/2023 15:23, Patrick Donnelly wrote:
On Fri, Jun 9, 2023 at 3:27 AM Janek Bevendorff
wrote:
Hi Patrick,
I'm afraid your ceph-post-file logs were lost to the nether. AFAICT,
our ceph-post-file storage has been non-functional since the beginning
of the lab outage last year. We're looking into it.
I have it here still. Any other way I can send it to you?
Extremely unlikely.
Okay, ta
06/06/2023 09:16, Janek Bevendorff wrote:
I checked our Prometheus logs and the number of log events of
individual MONs is indeed randomly starting to increase dramatically
all of a sudden. I attached a picture of the curves.
The first incident you see there was when our metadata store
store is full.
On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also been growing slowly from 900MB to
5.4GB. But we also have a few remapped PGs right now. Not sure if that
would have an influence.
On 05/06/2023 17:48, Janek Bevendorff wrote:
Hi Patrick, hi Dan!
I got the MDS back and I think the issue is connected to
nek
[1] "Newly corrupt dentry" ML link:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3
[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6
On 05/06/2023 16:08, Janek Bevendorff wrote:
I ju
en at debug_mds = 20). Any idea what the
reason could be? I checked whether they have enough RAM, which seems to
be the case (unless they try to allocate tens of GB in one allocation).
Janek
On 31/05/2023 21:57, Janek Bevendorff wrote:
Hi Dan,
Sorry, I meant Pacific. The version number was co
rs, Dan
>
> ______
> Clyso GmbH | https://www.clyso.com
>
>
> On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
> wrote:
>>
>> I checked our logs from yesterday, the PG scaling only started today,
>> perhaps triggered by the snapshot tr
. None of these operations are safe, though.
On 31/05/2023 16:41, Janek Bevendorff wrote:
Hi Jake,
Very interesting. This sounds very much like what we have been
experiencing the last two days. We also had a sudden fill-up of the
metadata pool, which repeated last night. See my question here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXC
r MDS,
probably because it wasn't that extreme yet and I had reduced the
maximum cache size. Still looks like a bug to me.
On 31/05/2023 11:18, Janek Bevendorff wrote:
Another thing I just noticed is that the auto-scaler is trying to
scale the pool down to 128 PGs. That could also resul
618 flags hashpspool,nodelete stripe_width 0
expected_num_objects 300 recovery_op_priority 5 recovery_priority 2
application cephfs
Janek
On 31/05/2023 10:10, Janek Bevendorff wrote:
Forgot to add: We are still on Nautilus (16.2.12).
On 31/05/2023 09:53, Janek Bevendorff wrote:
Hi,
Perhaps this is a known issue and I was simply too dumb to find it, but
we are having problems with our CephFS metadata pool filling up over night.
Our cluster has a small SSD pool of around 15TB which hosts our CephFS
metadata pool. Usually, that's more than enough. The normal size of th
have some files with binary keys that
cannot be decoded as utf-8. Unfortunately, the rados python library
assumes that omap keys can be decoded this way. I have a ticket here:
https://tracker.ceph.com/issues/59716
I hope to have a fix soon.
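The failure mode is easy to reproduce in isolation; a tiny illustration (with
a made-up key) of why a raw-bytes omap key cannot simply be decoded as UTF-8,
and a lossless fallback for display:

    # A hypothetical omap key that is raw bytes, not valid UTF-8:
    key = b"\x80\x03binary-key"
    try:
        key.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)
        # Lossless representation for logging/inspection:
        print(key.decode("utf-8", errors="backslashreplace"))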
On Thu, May 4, 2023 at 3:15 AM Janek Bevendorff
he output file is corrupt?
Any way I can fix it?
The output file has 14 million lines. We have about 24.5 million objects
in the metadata pool.
Janek
On 03/05/2023 14:20, Patrick Donnelly wrote:
On Wed, May 3, 2023 at 4:33 AM Janek Bevendorff
wrote:
Hi Patrick,
I'll try that tomorrow and let you know, thanks!
I was unable to reproduce the crash today. Even with
mds_abort_on_newly_corrupt_dentry set to true, all MDS booted up
correctly (though they took forever to rejoin with logs set to 20).
To me it looks like the issue has resolved
Hi Patrick,
Please be careful resetting the journal. It was not necessary. You can
try to recover the missing inode using cephfs-data-scan [2].
Yes. I did that very reluctantly after trying everything else as a last
resort. But since it only gave me another error, I restored the previous
sta
working on this issue.
Cheers, Dan
__
Clyso GmbH | https://www.clyso.com
On Tue, May 2, 2023 at 7:32 AM Janek Bevendorff
wrote:
Hi,
After a patch version upgrade from 16.2.10 to 16.2.12, our rank 0 MDS
fails to start. After replaying the journal, it just crashes with
[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry
#0x1/storage [2,head] auth (dversion lock)
Immediately after the upgrade, I
ctory on all hosts and restarting the deployment, all pods came back
up again. So I guess something was wrong with the keys stored on some of
the host machines.
Janek
On 31/05/2022 11:08, Janek Bevendorff wrote:
Hi,
This is an issue I've been having since at least Ceph 15.x and I haven't
found a way around it yet. I have a bunch of radosgw nodes in a
Kubernetes cluster (using the ceph/ceph-daemon Docker image) and once
every few container restarts, the daemon decides to crash at startup for
unknown r
The quay.io/ceph/daemon:latest-pacific image is also stuck on 16.2.5.
Only the quay.io/ceph/ceph:v16 image seems to be up to date, but I can't
get it to start the daemons.
On 30/05/2022 14:54, Janek Bevendorff wrote:
Was this announced somewhere? Could this not wait till Pacific is EO
: https://docs.ceph.com/en/latest/install/containers/
> On 30. May 2022, at 14:47, Robert Sander wrote:
>
> Am 30.05.22 um 13:16 schrieb Janek Bevendorff:
>
>> The image tags on Docker Hub are even more outdated and stop at v16.2.5.
>> quay.io seems to be up to date.
Hi,
The release index on docs.ceph.com is outdated by two Pacific patch releases:
https://docs.ceph.com/en/quincy/releases/index.html
The image tags on Docker Hub are even more outdated and stop at v16.2.5.
quay.io seems to be up to date.
Can these be updated please? Thanks
Janek
26/02/2021 15:24, Janek Bevendorff wrote:
Since the full cluster restart and disabling logging to syslog, it's
not a problem any more (for now).
Unfortunately, just disabling clog_to_monitors didn't have the wanted
effect when I tried it yesterday. But I also believe that it is
somehow r
ize 2 which would explain what you're describing.
Zitat von Janek Bevendorff :
Hi,
I am having a weird phenomenon, which I am having trouble debugging. We
have 16 OSDs per host, so when I reboot one node, 16 OSDs will be
missing for a short time. Since our minimum CRUSH failure domain is
host, this should not cause any problems. Unfortunately, I always have
a handful (1-5)
fic reason for the incident yesterday
in the logs besides a few more RocksDB status and compact messages than
usual, but that's more symptomatic.
On 26/02/2021 13:05, Mykola Golub wrote:
On Thu, Feb 25, 2021 at 08:58:01PM +0100, Janek Bevendorff wrote:
On the first MON, the command does
> On 25. Feb 2021, at 22:17, Dan van der Ster wrote:
>
> Also did you solve your log spam issue here?
> https://tracker.ceph.com/issues/49161
> Surely these things are related?
No. But I noticed that DBG log spam only happens when log_to_syslog is enabled.
systemd is smart enough to avoid fi
Thanks, I’ll try that tomorrow.
> On 25. Feb 2021, at 21:59, Dan van der Ster wrote:
>
> Maybe the debugging steps in that insights tracker can be helpful
> anyway: https://tracker.ceph.com/issues/39955
>
> -- dan
>
> On Thu, Feb 25, 2021 at 9:27 PM Janek Bevendorff
ksDB stuff, but nothing that
> > actually looks serious or even log-worthy. I noticed that before that
> > despite logging being set to warning level, the cluster log keeps being
> > written to the MON log. But it shouldn’t cause such massive stability
> > issues, should it? The date
or are they flapping with this load?
>
> .. dan
>
>
>
> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff
> mailto:janek.bevendo...@uni-weimar.de>>
> wrote:
> Thanks, Dan.
>
> On the first MON, the command doesn’t even retu
global   basic     log_to_stderr                false
global   basic     log_to_syslog                true
global   advanced  mon_cluster_log_file_level   error
Hi,
All of a sudden, we are experiencing very concerning MON behaviour. We have
five MONs and all of them have thousands up to tens of thousands of slow ops,
the oldest one blocking basically indefinitely (at least the timer keeps
creeping up). Additionally, the MON stores keep inflating heavil
My current settings are:
mds advanced mds_beacon_grace 15.00
True. I might as well remove it completely, it's an artefact of earlier
experiments.
This should be a global setting. It is used by the mons and mdss.
mds basic mds_cache_memory_limit 4294967296
My current settings are:
mds advanced mds_beacon_grace 15.00
mds basic mds_cache_memory_limit 4294967296
mds advanced mds_cache_trim_threshold 393216
global advanced mds_export_ephemeral_distributed true
mds advanced mds_recall_global_max
FYI, this is the ceph-exporter we're using at the moment:
https://github.com/digitalocean/ceph_exporter
It's not as good, but it does the job mostly. Some more specific metrics
are missing, but the majority is there.
On 10/12/2020 19:01, Janek Bevendorff wrote:
Do you have the
Do you have the prometheus module enabled? Turn that off, it's causing
issues. I replaced it with another ceph exporter from Github and almost
forgot about it.
Here's the relevant issue report:
https://tracker.ceph.com/issues/39264#change-179946
On 10/12/2020 16:43, Welby McRoberts wrote:
H
Wow! Distributed epins :) Thanks for trying it. How many
sub-directories under the distributed epin'd directory? (There's a lot
of stability problems that are to be fixed in Pacific associated with
lots of subtrees so if you have too large of a directory, things could
get ugly!)
Yay, beta tes
Hi Patrick,
I haven't gone through this thread yet but I want to note for those
reading that we do now have documentation (thanks for the frequent
pokes Janek!) for the recall configurations:
https://docs.ceph.com/en/latest/cephfs/cache-configuration/#mds-recall
Please let us know if it's missi
This sounds like there is one or a few clients acquiring too many
caps. Have you checked this? Are there any messages about the OOM
killer? What config changes for the MDS have you made?
Yes, it's individual clients acquiring too many caps. I first ran the
adjusted recall settings you suggeste
Never mind, when I enable it on a more busy directory, I do see new
ephemeral pins popping up. Just not on the directories I set it on
originally. Let's see how that holds up.
On 07/12/2020 13:04, Janek Bevendorff wrote:
Thanks. I tried playing around a bit
for a few sub trees. Despite
mds_export_ephemeral_distributed being set to true, all work is done by
mds.0 now and I also don't see any additional pins in ceph tell mds.\*
get subtrees.
Any ideas why that might be?
On 07/12/2020 10:49, Dan van der Ster wrote:
On Mon, Dec 7, 2020 at 10:3
What exactly do you set to 64k?
We used to set mds_max_caps_per_client to 5, but once we started
using the tuned caps recall config, we reverted that back to the
default 1M without issue.
mds_max_caps_per_client. As I mentioned, some clients hit this limit
regularly and they aren't entir
Thanks, Dan!
I have played with many thresholds, including the decay rates. It is
indeed very difficult to assess their effects, since our workloads
differ widely depending on what people are working on at the moment. I
would need to develop a proper benchmarking suite to simulate the
differe
few hungry GPUs.
-- Dan
I guess we're pretty lucky with our CephFS's because we have more than
1k clients and it is pretty solid (though the last upgrade had a
hiccup decreasing down to single active MDS).
-- Dan
On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
wrote:
This
laced by a standby before it's finished).
On Fri, Dec 4, 2020 at 8:20 PM Janek Bevendorff
This is a very common issue. Deleting mdsX_openfiles.Y has become part of
my standard maintenance repertoire. As soon as you have a few more
clients and one of them starts opening and closing files in rapid
succession (or does other metadata-heavy things), it becomes very likely
that the MDS cras
from botocore.exceptions import ClientError

# Hypothetical wrapper; the start of the original snippet was cut off.
def object_is_lost(s3, bucket, obj_name):
    # Read the first byte; a ClientError (e.g. 404) means it's inaccessible.
    try:
        s3.Object(bucket, obj_name).get()['Body'].read(1)
        return False
    except ClientError:
        return True
With 600 executors on 130 hosts, it takes about 30 seconds for a 300k
object bucket.
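For a single host, the same check can be fanned out with a thread pool; a
hedged sketch using the helper above (bucket name is a placeholder, and the
600-executor setup from the thread is presumably a cluster framework that is
not reproduced here):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.resource("s3")  # endpoint and credentials assumed configured
    bucket = "my-bucket"       # placeholder
    keys = [o.key for o in s3.Bucket(bucket).objects.all()]

    def check(key):
        return key, object_is_lost(s3, bucket, key)

    with ThreadPoolExecutor(max_workers=32) as pool:
        for key, lost in pool.map(check, keys):
            if lost:
                print("inaccessible:", key)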
On 19/11/2020 09:21, Janek Bevendorff wrote:
I would recommend you get
m not knowledgable about Rados, so I’m not sure this is helpful.
On 18 Nov 2020, at 10:01, Janek Bevendorff
wrote:
Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME (forgot the
"object" there)
On 18/11/2020 09:58, Janek Bevendorff wrote:
The object, a Docker layer, that went missing has not been touched in
2 months. It worked for a while, but then suddenly went missing.
Was the object a multipart object? You can check by running
radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. It should
say something like "ns": "multipart".
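A small helper for that check might look like the following; a sketch only,
which just greps the stat output for the multipart namespace rather than
relying on the exact JSON layout (bucket/object names are placeholders, and
the command spelling follows the corrected form used elsewhere in the thread,
radosgw-admin object stat):

    # Sketch: report whether an S3 object's RGW manifest references RADOS
    # objects in the "multipart" namespace.
    import subprocess

    def is_multipart(bucket, obj):
        out = subprocess.run(
            ["radosgw-admin", "object", "stat",
             f"--bucket={bucket}", f"--object={obj}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return '"ns": "multipart"' in out

    print(is_multipart("my-bucket", "docker/registry/blob"))  # placeholders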
FYI: I have radosgw-admin gc list --include-all running every three
minutes for a day, but the list has stayed empty. Though I haven't seen
any further data loss, either. I will keep it running until the next
time I see an object vanish.
On 17/11/2020 09:22, Janek Bevendorff wrote:
I
om/issues/47866
We may want to move most of the conversation to the comments there,
so everything’s together.
I do want to follow up on your answer to Question 4, Janek:
On Nov 13, 2020, at 12:22 PM, Janek Bevendorff
<mailto:janek.bevendo...@uni-weimar.de>> wrote:
4. Is anyone exper
comments there, so
everything’s together.
I do want to follow up on your answer to Question 4, Janek:
On Nov 13, 2020, at 12:22 PM, Janek Bevendorff
<mailto:janek.bevendo...@uni-weimar.de>> wrote:
4. Is anyone experiencing this issue willing to run their RGWs with
'debug_ms=
ery dangerous bug for data safety. Hope the bug
would be quickly identified and fixed.
best regards,
Samuel
huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>
From: Janek Bevendorff
Date: 2020-11-12 18:17
To:huxia...@horebdata.cn <mailto:huxia...@horebdata.cn>; EDH - Manuel
Ri
huxia...@horebdata.cn
*From:* EDH - Manuel Rios <mailto:mrios...@easydatahost.com>
*Date:* 2020-11-12 14:27
*To:* Janek Bevendorff <mailto:janek.bevendo...@uni-weimar.de>;
Rafael Lopez <m
Here is a bug report concerning (probably) this exact issue:
https://tracker.ceph.com/issues/47866
I left a comment describing the situation and my (limited) experiences
with it.
On 11/11/2020 10:04, Janek Bevendorff wrote:
Yeah, that seems to be it. There are 239 objects prefixed
s (eg. renaming/deleting). I had thought perhaps it was a bug
with the rgw garbage collector..but that is pure speculation.
Once you can articulate the problem, I'd recommend logging a bug
tracker upstream.
On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
<mailto:janek.bevendo...@uni-
We are having the exact same problem (also Octopus). The object is listed by
s3cmd, but trying to download it results in a 404 error. radosgw-admin object
stat shows that the object still exists. Any further ideas how I can restore
access to this object?
Here's something else I noticed: when I stat objects that work via
radosgw-admin, the stat info contains a "begin_iter" JSON object with RADOS key
info like this
"key": {
"name":
"29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.g
me.19:
(2) No such file or directory
On 10/11/2020 10:14, Janek Bevendorff wrote:
Thanks for the reply. This issue seems to be VERY serious. New objects
are disappearing every day. This is a silent, creeping data loss.
I couldn't find the object with rados stat, but I am now listin
netflix_mp4' {tmpfile}
The first grep is actually the S3 multipart ID string added to the
prefix by rgw.
Rafael
On Tue, 10 Nov 2020 at 01:04, Janek Bevendorff
<mailto:janek.bevendo...@uni-weimar.de>> wrote:
We are having the exact same problem (also Octopus). The object is
listed by s3cmd, but trying to download it results in a 404 error.
radosgw-admin object stat shows that the object still exists. Any
further ideas how I can restore access to this object?
(Sorry if this is a duplicate, but it s
gt; prometheus exporter. I am using Zabbix instead. Our cluster is 1404
>> OSD's in size with about 9PB raw with around 35% utilization.
>>
>> On Fri, Mar 27, 2020 at 4:26 AM Janek Bevendorff
>> wrote:
>>> Sorry, I meant MGR of course. MDS are fine for me
complain...
>
>
> On Thu, Mar 26, 2020 at 9:43 AM Janek Bevendorff
> <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>
> If there is actually a connection, then it's no wonder our MDS
> kept crashing. Our Ceph has 9.2PiB of available space at the moment.
&g
for the largest pool went
> substantially lower than 1PB.
> I wonder if there's some metric that exceeds the maximum size for some
> int, double, etc?
>
> -Paul
>
> On Mon, Mar 23, 2020 at 9:50 AM Janek Bevendorff
> <mailto:janek.bevendo...@uni-weimar.de>> wrot
For anybody finding this thread via Google or something, here's a link
to a (so far unresolved) bug report: https://tracker.ceph.com/issues/39264
On 19/03/2020 17:37, Janek Bevendorff wrote:
> Sorry for nagging, but is there a solution to this? Routinely restarting
> my MGRs ever
I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264
Unfortunately, the issue hasn't got much (or any) attention yet. So
let's get this fixed, the prometheus module is unusable in its current
state.
On 23/03/2020 17:50, Janek
/2020 09:00, Janek Bevendorff wrote:
> I am running the very latest version of Nautilus. I will try setting up
> an external exporter today and see if that fixes anything. Our cluster
> is somewhat large-ish with 1248 OSDs, so I expect stat collection to
> take "some" time,
t; the same issue as Mimic?
>
> On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff
> <mailto:janek.bevendo...@uni-weimar.de>> wrote:
>
> I think this is related to my previous post to this list about MGRs
> failing regularly and being overall quite slow to respond.