[ceph-users] RE: inconsistent pgs

2015-08-11 Thread Межов Игорь Александрович
Hi! Glad to hear your Ceph is working again! ;) BTW, it is new knowledge: how Ceph behaves with bad RAM. Do you have memory ECC errors in the logs? Linux has the EDAC module (I think it is enabled by default in Debian), which reports any machine errors that happen - machine check exceptions, memory error
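
A quick way to read those counters, assuming the EDAC drivers are actually loaded for your memory controller (module names vary by chipset):

    # corrected/uncorrected error counters exposed by EDAC in sysfs
    $ grep . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
    # kernel log entries from EDAC and the machine-check handler
    $ dmesg | grep -iE 'edac|machine check'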

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread John Spray
On Tue, Aug 11, 2015 at 2:21 AM, Bob Ababurko wrote: > I had a dual mds server configuration and have been copying data via cephfs > kernel module to my cluster for the past 3 weeks and just had an MDS crash > halting all IO. Leading up to the crash, I ran a test dd that increased the > throughput

Re: [ceph-users] inconsistent pgs

2015-08-11 Thread Jan Schermer
Ouch - been there too. Now the question becomes: Which copy is the right one? And a slightly related question - how many of you look at the BER when selecting drives? Do the math: it's pretty horrible when you know you have one bad sector for every ~11.5TB of data (if you use desktop-class dri
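
For reference, the arithmetic behind that "~11.5TB" figure, assuming the common desktop-class spec of one unrecoverable read error per 10^14 bits:

    $ echo '10^14 / 8 / 1024^4' | bc -l    # bytes per expected URE, expressed in TiB
    # ~= 11.37 TiB of reads per expected unrecoverable error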

[ceph-users] Several OSD's Crashed : unable to bind to any port in range 6800-7300: (98) Address already in use

2015-08-11 Thread Karan Singh
Hello Community, I need help with my production Ceph cluster, where multiple OSDs are crashing after throwing this error: 2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use 2015-08-
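
A quick sanity check on the affected node is to see what is already listening in the OSD port range (a sketch using the iproute2 ss tool; the range comes from the log line above):

    $ ss -tlnp | grep -E ':(6[89][0-9]{2}|7[0-2][0-9]{2}|7300)\b'
    # compare the number of listeners against the number of ceph-osd daemons expected on this host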

[ceph-users] Ceph allocator and performance

2015-08-11 Thread Межов Игорь Александрович
Hi! We got some strange performance results when running a random read fio test on our test Hammer cluster. When we run fio-rbd (4k, randread, 8 jobs, QD=32, 500Gb rbd image) for the first time (the page cache is cold/empty) we get ~12kiops sustained performance. It is quite a reasonable value, as 12kiops/3
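
For context, a test along these lines can be reproduced with fio's rbd engine, assuming fio was built with rbd support (pool and image names below are placeholders):

    # pool "rbd" and image "testimg" are placeholders
    $ fio --name=randread --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
          --rw=randread --bs=4k --numjobs=8 --iodepth=32 --direct=1 --time_based --runtime=300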

Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-11 Thread Dan van der Ster
On Tue, Aug 4, 2015 at 9:48 PM, Stefan Priebe wrote: > Hi, > > On 04.08.2015 at 21:16, Ketor D wrote: >> >> Hi Stefan, >> Could you describe more about the linger ops bug? >> I'm running Firefly, which as you say still has this bug. > > > It will be fixed in the next ff release. > > This on: >
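
For reference, the usual two-step sequence when growing placement groups (pool name and target count below are placeholders); pgp_num has to follow pg_num before data actually rebalances:

    $ ceph osd pool set rbd pg_num 2048     # "rbd" and 2048 are placeholders
    $ ceph osd pool set rbd pgp_num 2048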

Re: [ceph-users] Ceph allocator and performance

2015-08-11 Thread Jan Schermer
Hi, if you look in the archive you'll see I posted something similar about 2 months ago. You can try experimenting with 1) stock binaries - tcmalloc 2) LD_PRELOADed jemalloc 3) ceph recompiled with neither (glibc malloc) 4) ceph recompiled with jemalloc (?) We simply recompiled ceph b
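
Option 2 can be tried without rebuilding anything, assuming libjemalloc is installed (the library path varies by distro), by starting an OSD with the allocator preloaded, e.g.:

    # library path and osd id 12 are placeholders; -f keeps the daemon in the foreground for testing
    $ LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i 12 -f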

Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-11 Thread Jan Schermer
Could someone clarify what the impact of this bug is? We did increase pg_num/pgp_num and we are on dumpling (0.67.12 unofficial snapshot). Most of our clients have likely been restarted already, but not all. Should we be worried? Thanks Jan > On 11 Aug 2015, at 17:31, Dan van der Ster wrote: > > On

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
Here is the backtrace from the core dump. (gdb) bt #0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0 #1 0x0087065d in reraise_fatal (signum=6) at global/signal_handler.cc:59 #2 handle_fatal_signal (signum=6) at global/signal_handler.cc:109 #3 <signal handler called> #4 0x7f71f40235d7 in rais
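
For anyone following along, a trace like this is typically captured from the core file roughly as follows, assuming matching debuginfo packages are installed (the core path is a placeholder):

    $ gdb /usr/bin/ceph-mds /path/to/core    # core file path is a placeholder
    (gdb) bt
    (gdb) thread apply all bt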

[ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-11 Thread Gerd Jakobovitsch
Dear all, I run a ceph system with 4 nodes and ~80 OSDs using xfs, currently at 75% usage, running firefly. On Friday I upgraded it from 0.80.8 to 0.80.10, and since then I have had several OSDs crashing and never recovering: trying to run them ends up crashing as follows. Is this problem know

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
Yes, this was a package install and ceph-debuginfo was used, and hopefully the output of the backtrace is useful. I thought it was interesting that you mentioned reproducing with an ls, because aside from me doing a large dd before this issue surfaced, your post made me recall that I also ran ls a few
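
On RHEL/CentOS-style systems (which the /lib64 paths in the trace suggest), the symbols generally come from a separate debuginfo package; something along these lines:

    $ yum install ceph-debuginfo      # or: debuginfo-install ceph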

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread John Spray
On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko wrote: > Here is the backtrace from the core dump. > > (gdb) bt > #0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0 > #1 0x0087065d in reraise_fatal (signum=6) at > global/signal_handler.cc:59 > #2 handle_fatal_signal (signum=6)

Re: [ceph-users] Is there a limit for object size in CephFS?

2015-08-11 Thread Hadi Montakhabi
[sequential read] readwrite=read size=2g directory=/mnt/mycephfs ioengine=libaio direct=1 blocksize=${BLOCKSIZE} numjobs=1 iodepth=1 invalidate=1 # causes the kernel buffer and page cache to be invalidated #nrfiles=1 [sequential write] readwrite=write # randread randwrite size=2g directory=/mnt/
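
fio expands ${BLOCKSIZE} from the environment, so a job file like the one above can be swept across block sizes; a small sketch (the job file name is a placeholder):

    $ for bs in 4k 64k 1m 4m; do BLOCKSIZE=$bs fio cephfs-seq.fio; done   # cephfs-seq.fio is a placeholder name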

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread John Spray
For the record: I've created issue #12671 to improve our memory management in this type of situation. John http://tracker.ceph.com/issues/12671 On Tue, Aug 11, 2015 at 10:25 PM, John Spray wrote: > On Tue, Aug 11, 2015 at 6:23 PM, Bob Ababurko wrote: >> Here is the backtrace from the core dump

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Yan, Zheng
On Wed, Aug 12, 2015 at 5:53 AM, John Spray wrote: > For the record: I've created issue #12671 to improve our memory > management in this type of situation. > > John > > http://tracker.ceph.com/issues/12671 This situation has been improved in recent clients. Recent clients trim their cache first,

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Yan, Zheng
On Wed, Aug 12, 2015 at 1:23 AM, Bob Ababurko wrote: > Here is the backtrace from the core dump. > > (gdb) bt > #0 0x7f71f5404ffb in raise () from /lib64/libpthread.so.0 > #1 0x0087065d in reraise_fatal (signum=6) at > global/signal_handler.cc:59 > #2 handle_fatal_signal (signum=6)

Re: [ceph-users] Fwd: OSD crashes after upgrade to 0.80.10

2015-08-11 Thread Haomai Wang
It seems like a leveldb problem. Could you just kick it out and add a new OSD to make the cluster healthy first? On Wed, Aug 12, 2015 at 1:31 AM, Gerd Jakobovitsch wrote: > > > Dear all, > > I run a ceph system with 4 nodes and ~80 OSDs using xfs, with currently 75% > usage, running firefly. On fri
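
The usual sequence for pulling a broken OSD out of the cluster before provisioning a replacement looks roughly like this (osd.12 is a placeholder id); the remaining OSDs will backfill its data:

    $ ceph osd out 12                  # osd id 12 is a placeholder
    $ service ceph stop osd.12         # run on the host that carries the OSD
    $ ceph osd crush remove osd.12
    $ ceph auth del osd.12
    $ ceph osd rm 12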

Re: [ceph-users] Is there a limit for object size in CephFS?

2015-08-11 Thread Yan, Zheng
On Wed, Aug 12, 2015 at 5:33 AM, Hadi Montakhabi wrote: > [sequential read] > readwrite=read > size=2g > directory=/mnt/mycephfs > ioengine=libaio > direct=1 > blocksize=${BLOCKSIZE} > numjobs=1 > iodepth=1 > invalidate=1 # causes the kernel buffer and page cache to be invalidated > #nrfiles

Re: [ceph-users] mds server(s) crashed

2015-08-11 Thread Bob Ababurko
John, This seems to have worked. I rebooted my client and restarted ceph on the MDS hosts after giving them more RAM. I restarted the rsyncs that were running on the client after remounting the cephfs filesystem, and things seem to be working. I can access the files, so that is a relief. What is risky