Hi!
Glad to hear your Ceph is working again! ;)
BTW, this is new knowledge: how Ceph behaves with bad RAM.
Do you have any ECC memory errors in your logs?
Linux has the EDAC module (I think it is enabled by default in Debian), which
reports any machine errors that occur - machine check exceptions, memory errors
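If EDAC is loaded, a quick way to check is to look at the per-memory-controller
error counters it exposes in sysfs (the paths below are the usual ones; edac-util
needs the edac-utils package installed):

  # corrected / uncorrected error counts per memory controller (ce_count, ue_count)
  grep . /sys/devices/system/edac/mc/mc*/*_count

  # or, with edac-utils installed:
  edac-util -v

  # machine check / "EDAC MC0" messages also end up in the kernel log:
  dmesg | grep -iE 'edac|machine check'

Non-zero ce_count/ue_count values would point at the failing DIMM.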
On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko wrote:
> I had a dual MDS server configuration and have been copying data via the
> cephfs kernel module to my cluster for the past 3 weeks, and just had an MDS
> crash halting all IO. Leading up to the crash, I ran a test dd that increased
> the throughput
Ouch - been there too.
Now the question becomes: Which copy is the right one?
And a slightly related question - how many of you look at the BER when
selecting drives? Do the math; it's pretty horrible when you know you have one
bad sector for every ~11.5TB of data (if you use desktop-class dri
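For reference, the arithmetic behind that ~11.5TB figure, assuming the usual
desktop-drive spec of one unrecoverable read error per 10^14 bits read:

  10^14 bits / 8 = 1.25 * 10^13 bytes ~= 12.5 TB ~= 11.4 TiB

So statistically you hit roughly one unreadable sector per ~11.5 TiB read;
enterprise drives are usually rated an order of magnitude better (1 in 10^15).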
Hello Community
Need help with my production Ceph cluster where multiple OSDs are crashing
after throwing this error:
2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to
bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already
in use
2015-08-
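That error usually means something is already listening on the address the OSD
is trying to bind to, or that all ports in the messenger's bind range are taken.
A quick way to check (assuming iproute2's ss is available; netstat works too):

  # see which processes currently hold ports on that IP
  ss -tlnp | grep '10.100.50.1:'

  # the bind range itself is controlled by these ceph.conf options
  # ([global] or [osd] section); widening it can help on dense OSD nodes:
  ms bind port min = 6800
  ms bind port max = 7300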
Hi!
We got some strange performance results when running a random read fio test on
our test Hammer cluster.
When we run fio-rbd (4k, randread, 8 jobs, QD=32, 500GB rbd image) for the
first time (the page cache is cold/empty),
we get ~12k IOPS sustained performance. That is quite a reasonable value, as
12kiops/3
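(For reference, the test was run with something along these lines; the pool,
image and client names below are just placeholders for our setup:)

  fio --name=rbd-randread --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=test500g --rw=randread --bs=4k \
      --numjobs=8 --iodepth=32 --direct=1 --time_based --runtime=300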
On Tue, Aug 4, 2015 at 9:48 PM, Stefan Priebe wrote:
> Hi,
>
> Am 04.08.2015 um 21:16 schrieb Ketor D:
>>
>> Hi Stefan,
>> Could you describe more about the linger ops bug?
>> I'm running Firefly, which as you say still has this bug.
>
>
> It will be fixed in the next ff release.
>
> This on:
>
Hi,
if you look in the archive you'll see I posted something similar about 2
months ago.
You can try experimenting with:
1) stock binaries - tcmalloc
2) LD_PRELOADed jemalloc (rough sketch below)
3) ceph recompiled with neither (glibc malloc)
4) ceph recompiled with jemalloc (?)
We simply recompiled ceph b
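For option 2, a rough sketch (the jemalloc library path is distro-dependent;
the one below is the typical Debian/Ubuntu location):

  # start an OSD in the foreground with jemalloc preloaded
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd -i 0 -f

  # confirm which allocator actually got mapped into the process
  grep -E 'jemalloc|tcmalloc' /proc/$(pidof -s ceph-osd)/maps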
Could someone clarify what the impact of this bug is?
We did increase pg_num/pgp_num and we are on dumpling (0.67.12 unofficial
snapshot).
Most of our clients have likely been restarted already, but not all. Should we
be worried?
worried?
Thanks
Jan
> On 11 Aug 2015, at 17:31, Dan van der Ster wrote:
>
> On
Here is the backtrace from the core dump.
(gdb) bt
#0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
#1 0x0087065d in reraise_fatal (signum=6) at
global/signal_handler.cc:59
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3
#4 0x7f71f40235d7 in rais
Dear all,
I run a Ceph system with 4 nodes and ~80 OSDs using xfs, currently at 75%
usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and
since then I have had several OSDs crashing and never recovering: trying to
start them ends up crashing as follows.
Is this problem know
Yes, this was a package install and ceph-debuginfo was used, so hopefully the
output of the backtrace is useful.
I thought it was interesting that you mentioned reproducing this with an ls,
because aside from me doing a large dd before this issue surfaced, your post
made me recall that I also ran ls a few
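(For anyone following along, the backtrace above was captured roughly like
this; the package name and paths assume a RHEL/CentOS-style install, which the
/lib64 paths suggest, and the core file location is site-specific:)

  yum install ceph-debuginfo           # matching the installed ceph version
  gdb /usr/bin/ceph-mds /path/to/core  # open the core dump with symbols
  (gdb) bt
  (gdb) thread apply all bt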
On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko wrote:
> Here is the backtrace from the core dump.
>
> (gdb) bt
> #0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
> #1 0x0087065d in reraise_fatal (signum=6) at
> global/signal_handler.cc:59
> #2 handle_fatal_signal (signum=6)
[sequential read]
readwrite=read
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1 # causes the kernel buffer and page cache to be invalidated
#nrfiles=1
[sequential write]
readwrite=write # randread randwrite
size=2g
directory=/mnt/
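(${BLOCKSIZE} in the job files above is expanded from the environment by fio,
so the same job can be swept across block sizes; the job file name below is
just an example:)

  for bs in 4k 64k 1m 4m; do
      BLOCKSIZE=$bs fio cephfs-seq.fio
  done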
For the record: I've created issue #12671 to improve our memory
management in this type of situation.
John
http://tracker.ceph.com/issues/12671
On Tue, Aug 11, 2015 at 10:25 PM, John Spray wrote:
> On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko wrote:
>> Here is the backtrace from the core dump
On Wed, Aug 12, 2015 at 5:53 AM, John Spray wrote:
> For the record: I've created issue #12671 to improve our memory
> management in this type of situation.
>
> John
>
> http://tracker.ceph.com/issues/12671
This situation has been improved in recent clients: recent clients trim their
cache first,
On Wed, Aug 12, 2015 at 1:23 AM, Bob Ababurko wrote:
> Here is the backtrace from the core dump.
>
> (gdb) bt
> #0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0
> #1 0x0087065d in reraise_fatal (signum=6) at
> global/signal_handler.cc:59
> #2 handle_fatal_signal (signum=6)
It seems like a leveldb problem. Could you just kick that OSD out and add a
new one to make the cluster healthy first?
On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch wrote:
>
>
> Dear all,
>
> I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75%
> usage, running firefly. On fri
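For reference, the usual kick-out-and-replace sequence looks roughly like this
(X is the failed OSD's id; double-check against the docs for your release):

  ceph osd out X
  # stop the ceph-osd daemon on that node and wait for recovery to finish, then:
  ceph osd crush remove osd.X
  ceph auth del osd.X
  ceph osd rm X
  # finally prepare/activate a fresh OSD on a good disk (ceph-disk or ceph-deploy)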
On Wed, Aug 12, 2015 at 5:33 AM, Hadi Montakhabi wrote:
>
> [sequential read]
> readwrite=read
> size=2g
> directory=/mnt/mycephfs
> ioengine=libaio
> direct=1
> blocksize=${BLOCKSIZE}
> numjobs=1
> iodepth=1
> invalidate=1 # causes the kernel buffer and page cache to be invalidated
> #nrfiles
John,
This seems to have worked. I rebooted my client and restarted Ceph on the MDS
hosts after giving them more RAM. I restarted the rsyncs that were running on
the client after remounting the cephfs filesystem, and things seem to be
working. I can access the files, so that is a relief.
What is risky