Hi,
I keep reading recommendations about disabling debug logging in Ceph
in order to improve performance. There are two things that are unclear
to me though:
a. what do we lose if we decrease the default debug logging, and where
is the sweet spot so that we do not lose critical messages?
I would say fo
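In practice "decreasing default debug logging" usually means lowering per-subsystem
levels along the lines of the sketch below; the subsystems and 0/0 values are
illustrative, not a canonical recommended set. With 0/0 you give up both the on-disk
messages for that subsystem and the in-memory buffer that gets dumped when a daemon
crashes, which is exactly the kind of critical message question (a) is about.

  # ceph.conf, e.g. under [global] -- illustrative subsystems only
  debug ms = 0/0
  debug auth = 0/0
  debug filestore = 0/0
  debug journal = 0/0
  debug osd = 0/0

  # or injected at runtime, without restarting the daemons
  ceph tell osd.* injectargs '--debug-ms 0/0 --debug-osd 0/0'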
Hello,
we are on Debian Jessie and Hammer 0.94.9 and recently we decided to
upgrade our kernel from 3.16 to 4.9 (jessie-backports). We experience
the same regression, but with some bright spots:
-- ceph tell osd average across the cluster --
3.16.39-1: 204MB/s
4.9.0-0: 158MB/s
-- 1 rados bench c
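For reference, numbers like the above are typically produced with something along
these lines; the pool name, duration and block size are placeholders, not the exact
invocation used here.

  # per-OSD backend write throughput (what "ceph tell osd ..." refers to)
  ceph tell osd.0 bench

  # client-side write throughput against a test pool
  rados bench -p testpool 60 write -b 4194304 -t 16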
ript, perhaps
> you could post it as an example?
>
> Thanks!
>
> -- Dan
>
>
>> On Oct 20, 2016, at 01:42, Kostis Fardelas wrote:
>>
>> We pulled leveldb from upstream and fired leveldb.RepairDB against the
>> OSD omap directory using a simple python script.
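Not the original script, but a minimal sketch of a repair along those lines, assuming
the python leveldb bindings (the module exposing leveldb.RepairDB) are installed; the
osd id/path are placeholders and the OSD must be stopped first.

  python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/osd/ceph-XX/current/omap')"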
leveldb?
> Cheers
> Goncalo
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Kostis
> Fardelas [dante1...@gmail.com]
> Sent: 20 October 2016 09:09
> To: ceph-users
> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>
Hello cephers,
this is the blog post about the outage our Ceph cluster experienced some
weeks ago and about how we managed to revive the cluster and our
clients' data.
I hope it will prove useful for anyone who finds themselves
in a similar position. Thanks to everyone on the ceph-users a
hit a bug.
On 19 September 2016 at 00:00, Ronny Aasen wrote:
> added debug journal = 20 and got some new lines in the log, which I added to
> the end of this email.
>
> Can any of you make something out of them?
>
> kind regards
> Ronny Aasen
>
>
>
>
> O
If you are aware of the problematic PGs and they are exportable, then
ceph-objectstore-tool is a viable solution. If not, then running gdb
and/or collecting OSD logs at a higher debug level may prove useful (to
understand more about the problem, or to gather info for a question on ceph-devel).
On 13 September 2016
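As a sketch of the export path mentioned above (the pgid, osd id and paths are
placeholders; stop the OSD before running it, and raise the debug level via ceph.conf
if gdb/logs are needed):

  ceph-objectstore-tool --op export --pgid 3.5a9 \
      --data-path /var/lib/ceph/osd/ceph-xx \
      --journal-path /var/lib/ceph/osd/ceph-xx/journal \
      --file 3.5a9.export

  # higher debug logging for a single OSD, e.g. in ceph.conf under [osd.xx]
  debug osd = 20/20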
Hello Goncalo,
afaik the authoritative shard is determined based on deep-scrub object
checksums, which were introduced in Hammer. Is this in line with your
experience? If yes, is there any other method of determining the
auth shard besides object timestamps for ceph < jewel?
Kostis
On 13 September
on to bump this. It looks like a leak (and
of course I could extend the leak by bumping pid_max), but that is not
the case, is it?
Kostis
On 15 September 2016 at 14:40, Wido den Hollander wrote:
>
>> Op 15 september 2016 om 13:27 schreef Kostis Fardelas :
>>
>>
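For what it is worth, a sketch of how one might watch this; the commands are plain
Linux, nothing ceph-specific, and the pid_max value is only an example.

  # total threads on the node vs. the kernel limit
  ps -eLf | wc -l
  sysctl kernel.pid_max

  # threads belonging to ceph-osd processes on this node
  ps -eLf | grep -c '[c]eph-osd'

  # raise the limit only if it is really being approached
  sysctl -w kernel.pid_max=196608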
Hello cephers,
being in a degraded cluster state with 6/162 OSDs down (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients), as the
ceph cluster log below indicates:
2016-09-12 06:26:08.443152 mon.0 62.217.119.14:6789/0 217309 : cluster
[INF] pgmap v106027148: 28672 pgs: 2 down+remapped+
Hello cephers,
last week we survived a 3-day outage on our ceph cluster (Hammer
0.94.7, 162 OSDs, 27 "fat" nodes, 1000s of clients) due to 6 out of
162 OSDs crashing in the SAME node. The outage unfolded along the
following timeline:
time 0: OSDs living in the same node (rd0-19) start heavily flapping
a (host-wise) pool is going to be limited to <
> 0.8TB usable space. (The two 0.3 hosts will fill up well before
> the two larger hosts are full).
>
>
> On Tue, Jul 26, 2016 at 1:55 PM, Kostis Fardelas wrote:
>> Hello Dan,
>> I increased choose_local_tries to 75 an
is too low for
> your cluster configuration.
> Try increasing choose_total_tries from 50 to 75.
>
> -- Dan
>
>
>
> On Fri, Jul 22, 2016 at 4:17 PM, Kostis Fardelas wrote:
>> Hello,
>> being in latest Hammer, I think I hit a bug with more recent than
>> lega
quot;: "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "rbd",
"ruleset": 2,
&
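A sketch of the usual way to bump choose_total_tries as suggested above; the file
names are placeholders and the value 75 is just the one proposed in this thread.

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt:   tunable choose_total_tries 75
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new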
os of profile tunables
changes and their impact on a production cluster.
Kostis
On 24 July 2016 at 14:29, Kostis Fardelas wrote:
> nice to hear from you Goncalo,
> what you propose sounds like an interesting theory; I will test it
> tomorrow and let you know. In the meantime, I did the
ph-users-boun...@lists.ceph.com] on behalf of Kostis
> Fardelas [dante1...@gmail.com]
> Sent: 23 July 2016 16:32
> To: Brad Hubbard
> Cc: ceph-users
> Subject: Re: [ceph-users] Recovery stuck after adjusting to recent tunables
>
> Hi Brad,
>
> pool 0 'data'
lica for some objects whose third OSD (OSD.14) is
down. That was not the case with argonaut tunables as I remember.
Regards
On 23 July 2016 at 06:16, Brad Hubbard wrote:
> On Sat, Jul 23, 2016 at 12:17 AM, Kostis Fardelas wrote:
>> Hello,
>> being in latest Hammer, I think
Hello,
being on latest Hammer, I think I hit a bug with tunables more recent
than the legacy ones.
Having been on legacy tunables for a while, I decided to experiment with
"better" tunables. So first I went from the argonaut profile to bobtail
and then to firefly. However, I decided to make the changes on
chooseleaf
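For reference, a sketch of the profile switches described here; each step can trigger
substantial data movement, so this is not a recommendation to run them blindly.

  ceph osd crush show-tunables
  ceph osd crush tunables bobtail
  ceph osd crush tunables firefly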
Hello,
I upgraded a staging ceph cluster from latest Firefly to latest Hammer
last week. Everything went fine overall and I would like to share my
observations so far:
a. every OSD upgrade lasts appr. 3 minutes. I doubt there is any way
to speed this up though
b. rados bench with different block si
update OSD fsids from redeployed
OSDs, even after removing the old ones from crushmap. You need to rm
them
Regards,
Kostis
On 15 June 2016 at 17:14, Kostis Fardelas wrote:
> Hello,
> in the process of redeploying some OSDs in our cluster, after
> destroying one of them (down, out, remove
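A sketch of the removal sequence implied above, so that the old id and fsid are really
gone before redeploying; osd.12 is a placeholder.

  ceph osd out 12
  # stop the ceph-osd daemon, then:
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12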
Hi Hauke,
you could increase the mon/osd full/nearfull ratios, but at this level
of disk space scarcity things may need your constant attention,
especially in case of failure, given the risk of the cluster blocking
IO. Modifying crush weights may be of use too.
Regards,
Kostis
On 15 June 2016
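For a Hammer-era cluster like this one, the knobs meant here are roughly the ones
below; the values are illustrative only and raising them is a stopgap, not a fix.

  ceph pg set_nearfull_ratio 0.90
  ceph pg set_full_ratio 0.97
  # and/or shift data away from the fullest OSDs
  ceph osd crush reweight osd.XX 0.9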
Hello Jacob, Gregory,
did you manage to start up those OSDs in the end? I came across a very
similar incident [1] (no flags preventing the OSDs from getting UP
in the cluster though, no hardware problems reported) and I wonder if
you found out what was the culprit in your case.
[1] http://permali
Hello,
in the process of redeploying some OSDs in our cluster, after
destroying one of them (down, out, remove from crushmap) and trying to
redeploy it (crush add, start), we reach a state where the OSD gets
stuck in the booting state:
root@staging-rd0-02:~# ceph daemon osd.12 status
{ "cluster_fsid":
There is the "ceph pg {pgid} mark_unfound_lost revert|delete" command but
you may also find interesting to utilize ceph-objectstore-tool to do the job
On 15 May 2016 at 20:22, Michael Kuriger wrote:
> I would try:
>
> ceph pg repair 15.3b3
>
>
> Michael Kuriger
> Sr. Un
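For completeness, a sketch of the unfound-object commands being discussed; the pgid
15.3b3 is just the one from the quoted reply.

  ceph pg 15.3b3 query
  ceph pg 15.3b3 list_missing
  # last resort -- the unfound objects are rolled back or lost:
  ceph pg 15.3b3 mark_unfound_lost revert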
then delete it when
> done).
>
> I've never done this or worked on the tooling though, so that's about the
> extent of my knowledge.
> -Greg
>
>
> On Wednesday, February 17, 2016, Kostis Fardelas
> wrote:
>>
>> Right now the PG is served by two oth
--op import --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 3.5a9.export
d. start the osd
Regards,
Kostis
On 18 February 2016 at 02:54, Gregory Farnum wrote:
> On Wed, Feb 17, 2016 at 4:44 PM, Kostis Fardelas wrote:
>> Thanks Greg,
&g
d I achieve this with ceph_objectstore_tool?
Regards,
Kostis
On 18 February 2016 at 01:22, Gregory Farnum wrote:
> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas wrote:
>> Hello cephers,
>> due to an unfortunate sequence of events (disk crashes, network
>> problems
Hello cephers,
due to an unfortunate sequence of events (disk crashes, network
problems), we are currently in a situation with one PG that reports
unfound objects. There is also an OSD which cannot start up and
crashes with the following:
2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc:
Hello cephers,
after being on 0.80.10 for a while, we upgraded to 0.80.11 and we
noticed the following things:
a. ~13% paxos refresh latency increase (from about 0.015 to 0.017 on average)
b. ~15% paxos commit latency increase (from 0.019 to 0.022 on average)
c. osd commitcycle latencies were decreased and
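A sketch of where such numbers can be read; the mon name is an assumption (it is often
the short hostname) and the exact counter names should be checked in the output itself.

  ceph daemon mon.$(hostname -s) perf dump | python -m json.tool | less
  # look under the "paxos" section, e.g. refresh_latency and commit_latency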
Hi Vickey,
under "Upgrade procedures", you will see that it is recommended to
upgrade clients after having upgraded your cluster [1]
[1] http://docs.ceph.com/docs/master/install/upgrading-ceph/#upgrading-a-client
Regards
On 13 January 2016 at 12:44, Vickey Singh wrote:
> Hello Guys
>
> Need help
The cluster recovered at about 16:20. I have not restarted any OSD
till now. Nothing else happened in the cluster in the meantime. There
was no ERR/WRN in the cluster's log.
Regards,
Kostis
On 7 December 2015 at 17:08, Gregory Farnum wrote:
> On Mon, Dec 7, 2015 at 6:59 AM, Kostis Fardela
Hi cephers,
after one OSD node crash (6 OSDs in total), we experienced an increase
of approximately 230-260 threads for every other OSD node. We have 26
OSD nodes with 6 OSDs per node, so this is approximately 40 threads
per osd. The OSD node rejoined the cluster after 15-20 minutes.
The only wo
right? Again, the statistics fetched from the sockets seem
more reasonable: the commitcycle latency is substantially larger than
the apply latency, which seems normal to me.
Is this a bug or a misunderstanding on my part?
Regards,
Kostis
On 13 July 2015 at 13:27, Kostis Fardelas wrote:
> Hello,
>
ing our hosts this weekend and
>>> most of them came up fine with simple “service ceph start”, some just sat
>>> there spinning the CPU and not doing any real work (and the cluster was
>>> not very happy about that).
>>>
>>> Jan
>>>
>>>
more, the slow requests will vanish. The possibility
of not having tuned our setup down to the finest detail cannot be ruled
out, but I wonder whether we are missing some ceph tuning in terms of
ceph configuration.
We run the latest stable Firefly version.
Regards,
Kostis
On 13 July 2015 at 13:28, Kostis
Hello,
after rebooting a ceph node, while its OSDs are booting and joining
the cluster, we experience slow requests that get resolved immediately
after the cluster recovers. It is important to note that before the node
reboot we set the noout flag in order to prevent recovery - so there are
only degrade
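For reference, the flag handling described above amounts to something like the
following sketch:

  ceph osd set noout
  # ... reboot / maintenance window ...
  ceph osd unset noout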
Hello,
I noticed that commit/apply latency reported using:
ceph pg dump -f json-pretty
is very different from the values reported when querying the OSD sockets.
What is your opinion? What are the targets that I should fetch metrics
from in order to be as precise as possible?
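A sketch of the two sources being compared; osd.0 is a placeholder.

  # cluster-wide view (the same per-OSD stats that feed pg dump)
  ceph osd perf

  # per-daemon view straight from the admin socket
  ceph daemon osd.0 perf dump | python -m json.tool | less
  # see the "filestore" section: apply_latency, commitcycle_latency, journal_latency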
Hello,
it seems that new packages for firefly have been uploaded to the repo.
However, I can't find any details in Ceph Release notes. There is only
one thread in ceph-devel [1], but it is not clear what this new
version is about. Is it safe to upgrade from 0.80.9 to 0.80.10?
Regards,
Kostis
[1] http
Hi,
we are running Ceph v.0.72.2 (emperor) from the ceph emperor repo. This
past week we had 2 random OSD crashes (one during cluster recovery
and one while in a healthy state) with the same symptom: the osd process
crashes, logs the following trace in its log, and goes down and out. We
are in the proces
Hi Robert,
an improvement to your checks could be the addition of check
parameters (instead of using hard-coded values for warn and crit) so
that one can change their values in main.mk. Hope to find some
time soon and send you a PR about it. Nice job btw!
On 19 November 2014 18:23, Robert Sand
" and related parameters;
> check the docs.
> -Greg
>
>
> On Tuesday, July 8, 2014, Kostis Fardelas wrote:
>>
>> Hi,
>> we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
>> used space. We store data objects basically on two pools, the o
Hi,
we maintain a cluster with 126 OSDs, replication 3 and appr. 148T raw
used space. We store data objects basically in two pools, one of them
being appr. 300x larger than the other in terms of data stored and
number of objects. Based on the formula provided here
http://ceph.com/docs/master/rados/operations/p
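For concreteness, the rule of thumb from that page applied to this cluster, as a
worked example rather than a recommendation; the split between the two pools still has
to be chosen separately.

  # (OSDs * 100) / replica size, then pick a nearby power of two
  echo $(( 126 * 100 / 3 ))   # 4200 -> e.g. 4096 PGs in total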
Hi,
during PG remapping, the cluster recovery process sometimes gets
stuck on PGs in the backfill_toofull state. The obvious solution is to
reweight the impacted OSD until we add new OSDs to the cluster. In
order to force the remapping process to complete asap, we try to inject
a higher value on "osd
Hi,
from my experience both "ceph osd crush reweight" and "ceph osd
reweight" will lead to CRUSH map changes and PGs remapping. So both
commands eventually redistribute data beween OSDs. Is there any good
reason in terms of ceph performance best practices to choose the one
over the other?
On 26 Ju
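For readers comparing the two, a sketch of each; the osd id and values are
placeholders. "crush reweight" changes the persistent CRUSH weight stored in the map
(usually sized to the disk capacity), while "osd reweight" applies a temporary
override in the 0-1 range on top of it.

  ceph osd crush reweight osd.12 1.6
  ceph osd reweight 12 0.85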