Dear List,
We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed
of 12 nodes; each node has 10 OSDs with journal on disk.
We have one RBD partition and a RadosGW with 2 data pools, one replicated, one
EC (8+2).
In attachment, a few details on our cluster.
Currently, our clu
Hi Yoann,
On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin wrote:
> Dear List,
>
> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed
> of 12 nodes; each node has 10 OSDs with journal on disk.
>
> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
Hello,
No specific ideas, but this sounds somewhat familiar.
One thing first: you already stopped client traffic, but to make sure your
cluster really becomes quiescent, stop all scrubs as well.
That's always a good idea in any recovery or overload situation.
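For example (a minimal sketch, plain Ceph CLI; re-enable once the cluster has settled):

  # stop new scrubs cluster-wide (in-flight scrubs will finish)
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # later, once things are healthy again
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub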
Have you verified CPU load (are those
Hi,
Just an additional comment:
You can disable backfilling and recovery temporarily by setting the
'nobackfill' and 'norecover' flags (see the commands below). This will reduce the
backfill traffic and may help the cluster and its OSDs to recover. Afterwards you
should set the backfill traffic settings to the minimu
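A minimal sketch of the flags mentioned above (plain Ceph CLI):

  # pause backfill and recovery while the OSDs settle
  ceph osd set nobackfill
  ceph osd set norecover
  # clear the flags again once the OSDs are stable
  ceph osd unset nobackfill
  ceph osd unset norecover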
On 06 Oct 2016 13:41, Ronny Aasen wrote:
Hello,
I have a few OSDs in my cluster that are regularly crashing.
[snip]
Of course, having 3 OSDs dying regularly is not good for my health, so I
have set noout to avoid heavy recoveries.
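For reference, that flag is set and cleared like this (plain Ceph CLI):

  # keep down OSDs from being marked out and triggering heavy recovery
  ceph osd set noout
  # clear it once the crashing OSDs are dealt with
  ceph osd unset noout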
Googling this error message gives exactly 1 hit:
https:
I have set up a new Linux cluster to allow migration from our old SAN-based
cluster to a new cluster with Ceph.
All systems are running CentOS 7.2 with the 3.10.0-327.36.1 kernel.
I am basically running stock Ceph settings, with just turning the write cache
off via hdparm on the drives, and temporaril
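For reference, a hedged sketch of the hdparm step (the device name is just an example; repeat per data drive):

  # disable the drive's volatile write cache
  hdparm -W 0 /dev/sdb
  # verify the current setting
  hdparm -W /dev/sdb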
Hello,
>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed
>> of 12 nodes; each node has 10 OSDs with journal on disk.
>>
>> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
>> one EC (8+2).
>>
>> In attachment, a few details on our cluster.
>>
>
On Wed, Oct 19, 2016 at 3:22 PM, Yoann Moulin wrote:
> Hello,
>
>>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>>> composed of 12 nodes; each node has 10 OSDs with journal on disk.
>>>
>>> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
>>> one
Hello,
>>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>>> composed of 12 nodes; each node has 10 OSDs with journal on disk.
>>>
>>> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
>>> one EC (8+2).
>>>
>>> In attachment, a few details on our cluste
On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn wrote:
> I have set up a new Linux cluster to allow migration from our old SAN-based
> cluster to a new cluster with Ceph.
> All systems are running CentOS 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock Ceph settings, with just turning
This is a cool project, keep up the good work!
_
Tyler Bishop
Founder
O: 513-299-7108 x10
M: 513-646-5809
http://BeyondHosting.net
I would take the analogy of a RAID scenario. Basically, a standby is
considered like a spare drive. If that spare drive goes down, it is good to
know about the event, but it in no way indicates a degraded system;
everything keeps running at top speed.
If you had multiple active MDSes and one goes do
Hi,
I would be interested in the case where an MDS in standby-replay fails.
Thanks
On Wed, Oct 19, 2016 at 4:06 PM, Scottix wrote:
> I would take the analogy of a RAID scenario. Basically, a standby is
> considered like a spare drive. If that spare drive goes down, it is good to
> know about the
John,
Thanks for the tips….
Unfortunately, I was looking at this page
http://docs.ceph.com/docs/jewel/start/os-recommendations/
I’ll consider either upgrading the kernels or using the fuse client, but will
likely go the kernel 4.4 route
As for moving to just a replicated pool, I take it t
Hello
From the documentation I understand that clients that use librados must
perform striping for themselves, but I do not understand how this can be
if we have striping options in Ceph. I mean, I can create RBD images that
have configuration for striping, count and unit size.
So my question
On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page
> http://docs.ceph.com/docs/jewel/start/os-recommendations/
OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).
> I’l
Not sure if related, but I see the same issue on very different
hardware/configuration. In particular, on large data transfers OSDs become slow
and block. iostat await on the spinners can go up to 6(!) s (the journal is on the
SSD). Looking closer at those spinners with blktrace suggests that most
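For anyone who wants to compare, that await figure comes from sysstat's extended device stats (the 1-second interval is arbitrary):

  # per-device extended stats every second; watch the await column on the spinners
  iostat -x 1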
John,
Updating to the latest mainline kernel from ELRepo (4.8.2-1) on all 4 Ceph
servers, and on the Ceph client that I am testing with, still didn’t fix the
issues.
Still getting “failing to respond to cache pressure”, and blocked ops currently
hovering between 100-300 requests > 32 sec.
This
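If it helps to narrow this down, the blocked requests and the client sessions known to the MDS can be inspected with the stock Ceph tooling (the MDS name below is a placeholder):

  # summary of slow/blocked requests and MDS health warnings
  ceph health detail
  # on the MDS host: list the client sessions known to the MDS
  ceph daemon mds.<name> session ls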
Hello cephers,
This is the blog post on the Ceph cluster outage we experienced some
weeks ago and on how we managed to revive the cluster and our
clients' data.
I hope it will prove useful for anyone who finds himself/herself
in a similar position. Thanks to everyone on the ceph-users a
Hi all,
We set up RBD mirroring between 2 clusters, but have issues when we want
to delete one image. The detailed info follows.
It reports that some other instance is still using it, which kind of makes
sense because we set up the mirror between 2 clusters.
What's the best practice to rem
librbd (used by QEMU to provide RBD-backed disks) uses librados and
provides the necessary handling for striping across multiple backing
objects. When you don't specify "fancy" striping options via
"--stripe-count" and "--stripe-unit", it essentially defaults to
stripe count of 1 and stripe unit of
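For context, this is how the two cases look with the rbd CLI (a hedged sketch; pool/image names are placeholders, --size is in MB by default and --stripe-unit is in bytes):

  # default striping: stripe count 1, stripe unit equal to the object size
  rbd create --size 10240 rbd/plain-image
  # "fancy" striping: 64 KiB stripe unit, striped across 8 objects at a time
  rbd create --size 10240 --stripe-unit 65536 --stripe-count 8 rbd/striped-image
  # check what the image ended up with
  rbd info rbd/striped-image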
On Wed, Oct 19, 2016 at 6:52 PM, yan cui wrote:
> 2016-10-19 15:46:44.843053 7f35c9925d80 -1 librbd: cannot obtain exclusive
> lock - not removing
Are you attempting to delete the primary or non-primary image? I would
expect any attempts to delete the non-primary image to fail since the
non-prima
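If that is what is happening, my understanding is that the removal should be driven from the side holding the primary image, and rbd-mirror will propagate it to the peer (a sketch; pool/image names are placeholders):

  # on the cluster where the image is primary
  rbd info rbd/image1      # should report the image as primary
  rbd rm rbd/image1        # the deletion should be replayed on the peer cluster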
Hello,
On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> I have set up a new Linux cluster to allow migration from our old SAN-based
> cluster to a new cluster with Ceph.
> All systems are running CentOS 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not
Hi all,
When I try to mount an RBD through KRBD, it fails because of mismatched
features.
The client's OS is Ubuntu 16.04 and the kernel is 4.4.0-38.
My original CRUSH tunables are below.
root@Fx2x1ctrlserv01:~# ceph osd crush show-tunables
{
"choose_local_tries": 0,
"choose_local_fallback_tries"
It works fine with kernel 4.6 for me.
From the doc:
http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-tunables
it should work with kernel 4.5 too.
I don't know if there is any plan to backport the latest krbd module version to
kernel 4.4?
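For what it's worth, if upgrading the client kernel is not an option, another route is to pin the cluster to an older tunables profile that the 4.4 krbd already understands (a hedged suggestion; note that changing tunables can trigger data movement):

  # show the current profile (as in the original post)
  ceph osd crush show-tunables
  # switch to an older, widely supported profile
  ceph osd crush tunables hammer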
- Original Message -
From: "한승진"
To: "cep
Hi Kostis...
That is a tale from the dark side. Glad you recovered it and that you were
willing to doc it all up and share it. Thank you for that.
Can I also ask which tool you used to recover the leveldb?
Cheers
Goncalo
From: ceph-users [ceph-users-boun.
Does this also mean that stripe count can be thought of as the number of
parallel writes to different objects on different OSDs?
Thank you
On Thursday, 20 October 2016, Jason Dillaman wrote:
> librbd (used by QEMU to provide RBD-backed disks) uses librados and
> provides the necessary handling
ceph_10.2.3.orig.tar.gz Source package
Compile completed:
/root/neunn_gitlab/ceph-Jewel10.2.3/src/radosgw
The following issue occurs when the script executes:
2016-10-20 11:36:30.102266 7f8b4b93f900 -1 auth: unable to find a keyring on
/var/lib/ceph/radosgw/-admin/keyring: (2) No such file or dir
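That error means radosgw cannot find a keyring at the path it was configured with. A hedged sketch of creating one (the client name and output path are examples; match them to the keyring/rgw settings in your ceph.conf):

  # create a key for the gateway user and store it where radosgw will look
  ceph auth get-or-create client.radosgw.gateway \
      mon 'allow rwx' osd 'allow rwx' \
      -o /etc/ceph/ceph.client.radosgw.keyring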
We pulled leveldb from upstream and fired leveldb.RepairDB against the
OSD omap directory using a simple python script. Ultimately, that
didn't move things forward. We resorted to checking every object's
timestamp/md5sum/attributes on the crashed OSD against the replicas in
the cluster and at last too
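For reference, the RepairDB step amounts to roughly a one-liner (a sketch assuming the py-leveldb bindings; the OSD id and paths are examples, and the OSD should be stopped and the omap directory backed up before trying it):

  # OSD id and paths are examples
  systemctl stop ceph-osd@0
  cp -a /var/lib/ceph/osd/ceph-0/current/omap /var/lib/ceph/osd/ceph-0/current/omap.bak
  python -c "import leveldb; leveldb.RepairDB('/var/lib/ceph/osd/ceph-0/current/omap')"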