Re: [ceph-users] CephFS - Problems with the reported used space

2015-08-07 Thread Goncalo Borges

Hi All...

I am still fighting with this issue. It may be something that is not 
properly implemented, and if that is the case, that is fine.


I am still trying to understand the real space occupied by files 
in a CephFS filesystem, as reported for example by df.



Maybe I did not explain myself clearly. I am not saying that block size has
something to do with rbytes; I was just making a comparison with what I would
expect in a regular POSIX filesystem. Let me put the question in a
different way:

1) I know that, if I only have one single-character file in an ext4 filesystem
formatted with a 4 KB block size, df would show 4 KB as used space.

2) Now imagine that I only have one single-character file in my CephFS
filesystem, the layout of my file is object_size=512K, stripe_count=2, and
stripe_unit=256K, and my cluster is set to keep 3 replicas. What would be
the used space reported by a df command in this case?

My naive assumption would be that df should show 512 KB x 3 as used space.
Is this correct?

No. The used space reported by df is the sum of the used space of the OSDs'
local stores. A 512 KB file requires 3 x 512 KB of space for its data; the
OSDs and their local filesystems also need extra space to track that data.


Please bear with me through my simple-minded tests:

   0) # umount /cephfs;mount -t ceph X.X.X.X:6789:/ /cephfs -o
   name=admin,secretfile=/etc/ceph/admin.secret

   1) # getfattr -d -m ceph.*
   /cephfs/objectsize4M_stripeunit512K_stripecount8/
   (...)
   ceph.dir.layout="stripe_unit=524288 stripe_count=8
   object_size=4194304 pool=cephfs_dt"
   ceph.dir.rbytes="549755813888"
   (...)


   2) # df -B 1 /cephfs/
   Filesystem           1B-blocks           Used      Available Use% Mounted on
   X.X.X.X:6789:/  95618814967808 11738728628224 83880086339584  13% /cephfs


   3) # dd if=/dev/zero
   of=/cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt
   bs=1 count=4096
   4096+0 records in
   4096+0 records out
   4096 bytes (4.1 kB) copied, 0.0139456 s, 294 kB/s


   4) # ls -lb
   /cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt
   -rw-r--r-- 1 root root 4096 Aug  7 07:16
   /cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt


   5) # umount /cephfs;mount -t ceph X.X.X.X:6789:/  /cephfs -o
   name=admin,secretfile=/etc/ceph/admin.secret


   6) # getfattr -d -m ceph.*
   /cephfs/objectsize4M_stripeunit512K_stripecount8/
   (...)
   ceph.dir.layout="stripe_unit=524288 stripe_count=8
   object_size=4194304 pool=cephfs_dt"
   ceph.dir.rbytes="549755817984"


   7) # df -B 1 /cephfs/
   Filesystem                 1B-blocks           Used      Available Use% Mounted on
   192.231.127.8:6789:/  95618814967808 11738728628224 83880086339584  13% /cephfs


Please note that in these simple-minded tests:

a./ rbytes properly reports the change in size (after an unmount/mount):

    549755817984 - 549755813888 = 4096

b./ df does not show any change.

I could use 'ceph df detail', but it does not give me the granularity I 
want. Moreover, I also do not understand its output well:


   # ceph df
   GLOBAL:
       SIZE   AVAIL  RAW USED %RAW USED
       89051G 78119G 10932G   12.28
   POOLS:
       NAME      ID USED  %USED MAX AVAIL OBJECTS
       cephfs_dt 5  3633G 4.08  25128G    1554050
       cephfs_mt 6  3455k 0     25128G    39

- What determines MAX AVAIL? I am assuming it is ~ GLOBAL AVAIL / 
number of replicas...
- %USED is computed with reference to what? I am asking because it seems 
to be computed with reference to GLOBAL SIZE... but that is misleading, 
since the pool's MAX AVAIL is much less.


Thanks for the clarifications
Goncalo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Problems with the reported used space

2015-08-07 Thread Yan, Zheng
On Fri, Aug 7, 2015 at 3:41 PM, Goncalo Borges
 wrote:
> Hi All...
>
> I am still fighting with this issue. It may be something which is not
> properly implemented, and if that is the case, that is fine.
>
> I am still trying to understand what is the real space occupied by files in
> a /cephfs filesystem, reported for example by a df.
>
> Maybe I did not explain myself clearly. I am not saying that block size has
> something to do with rbytes, I was just making a comparison with what I
> expect in a regular POSIX filesystem. Let me put the question in the
> following / different way:
>
> 1) I know that, if I only have one char file in a ext4 filesystem, where my
> filesystem was set with a 4KB blocksize, a df would show 4KB as used space.
>
> 2) now imagine that I only have one char file in my Cephfs filesystem, and
> the layout of my file is object_size=512K, stripe_count=2, and
> stripe_unit=256K. Also assume that I have set my cluster to have 3
> replicates. What would be the used space reported by a df command in this
> case?
>
> My naive assumption would be that a df should show as used space 512KB x 3.
> Is this correct?
>
> No. Used space reported by df is the sum of used space of OSDs' local
> store. A 512k file require 3x512k space for data, OSD and local
> filesystem also need extra space for tracking these data.
>
>
> Please bare with me on my simple minded tests:
>
> 0) # umount /cephfs;mount -t ceph X.X.X.X:6789:/  /cephfs -o
> name=admin,secretfile=/etc/ceph/admin.secret
>
> 1) # getfattr -d -m ceph.* /cephfs/objectsize4M_stripeunit512K_stripecount8/
> (...)
> ceph.dir.layout="stripe_unit=524288 stripe_count=8 object_size=4194304
> pool=cephfs_dt"
> ceph.dir.rbytes="549755813888"
> (...)
>
>
> 2) # df -B 1 /cephfs/
> Filesystem1B-blocks   Used  Available Use%
> Mounted on
> X.X.X.X:6789:/ 95618814967808 11738728628224 83880086339584  13% /cephfs
>
>
> 3) # dd if=/dev/zero
> of=/cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt bs=1
> count=4096
> 4096+0 records in
> 4096+0 records out
> 4096 bytes (4.1 kB) copied, 0.0139456 s, 294 kB/s
>
>
> 4) # ls -lb /cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt
> -rw-r--r-- 1 root root 4096 Aug  7 07:16
> /cephfs/objectsize4M_stripeunit512K_stripecount8/4096bytes.txt
>
>
> 5) # umount /cephfs;mount -t ceph X.X.X.X:6789:/  /cephfs -o
> name=admin,secretfile=/etc/ceph/admin.secret
>
>
> 6) # getfattr -d -m ceph.* /cephfs/objectsize4M_stripeunit512K_stripecount8/
> (...)
> ceph.dir.layout="stripe_unit=524288 stripe_count=8 object_size=4194304
> pool=cephfs_dt"
> ceph.dir.rbytes="549755817984"
>
>
> 7) # df -B 1 /cephfs/
> Filesystem1B-blocks   Used  Available Use%
> Mounted on
> 192.231.127.8:6789:/ 95618814967808 11738728628224 83880086339584  13%
> /cephfs
>
>
> Please note that in this simple minded tests:
>
> a./  rbytes  properly reports the change in size (after a unmount/mount)
> 549755817984 - 549755813888 = 4096
>
> b./ A df does not show any change.

I think df on cephfs and 'ceph df' use the same mechanism to get the used
and available space. The used space it reports is not updated in real
time. (OSDs report their used space to the monitor periodically. The
monitor gathers this information to compute the used space of the whole
cluster.)

>
> I could use 'ceph df details' but it does not give me the granularity  want.
> Moreover, I also do not understand well its input:
>
> # ceph df
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 89051G 78119G   10932G 12.28
> POOLS:
> NAME  ID USED  %USED MAX AVAIL OBJECTS
> cephfs_dt 5  3633G  4.0825128G 1554050
> cephfs_mt6  3455k 025128G  39
>
> - What imposes the MAX AVAILABLE? I am assuming it is ~ GLOBAL AVAIL /
> Number of replicas...
> - The %USED is computed in reference to what? I am asking because it seems
> it is computed in references to GLOBAL SIZE...  But this is misleading since
> the POOL MAX AVAIL is much less.

Both used and available space are computed with reference to the global
size. Yes, it's misleading; we will improve it later.
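
A quick sanity check of that relationship against the numbers above (a
sketch; the pool name and the awk parsing are assumptions for illustration):

    # Replica count of the data pool
    ceph osd pool get cephfs_dt size | awk '{print $2}'    # e.g. 3
    # GLOBAL AVAIL divided by the replica count should land in the same
    # ballpark as the pool's MAX AVAIL: 78119G / 3 ~= 26039G, close to the
    # 25128G reported above (CRUSH weighting and the full ratio explain
    # the gap).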


>
> Thanks for the clarifications
> Goncalo
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests during ceph osd boot

2015-08-07 Thread Jan Schermer
You'd need to disable the udev rule (probably somewhere in /lib/udev/) as well 
as the initscript.

What I do when I'm restarting the server is:
chmod -x /usr/bin/ceph-osd
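
A sketch of that approach (the udev rule filename is an assumption and
differs between releases, so list the directory first):

    # Prevent OSDs from being auto-activated during the reboot
    chmod -x /usr/bin/ceph-osd
    ls /lib/udev/rules.d/ | grep -i ceph
    # e.g. (hypothetical filename):
    # mv /lib/udev/rules.d/95-ceph-osd.rules /root/95-ceph-osd.rules.disabled
    # Afterwards restore the binary and start the OSDs manually:
    chmod +x /usr/bin/ceph-osd
    service ceph start osd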

Jan

> On 07 Aug 2015, at 05:11, Nathan O'Sullivan  wrote:
> 
> I'm seeing the same sort of issue.
> 
> Any suggestions on how to get Ceph to not start the ceph-osd processes on 
> host boot?  It does not seem to be as simple as just disabling the service
> 
> Regards
> Nathan
> 
> 
> On 15/07/2015 7:15 PM, Jan Schermer wrote:
>> We have the same problems, we need to start the OSDs slowly.
>> The problem seems to be CPU congestion. A booting OSD will use all available 
>> CPU power you give it, and if it doesn’t have enough, nasty stuff happens 
>> (this might actually be the manifestation of some kind of problem in our 
>> setup as well).
>> It doesn’t always do that - I was restarting our hosts this weekend and most 
>> of them came up fine with a simple “service ceph start”, some just sat there 
>> spinning the CPU and not doing any real work (and the cluster was not very 
>> happy about that).
>> 
>> Jan
>> 
>> 
>>> On 15 Jul 2015, at 10:53, Kostis Fardelas  wrote:
>>> 
>>> Hello,
>>> after some trial and error we concluded that if we start the 6 stopped
>>> OSD daemons with a delay of 1 minute, we do not experience slow
>>> requests (the threshold is set at 30 sec), although there are some ops
>>> that last up to 10 s, which is already quite high. I assume that if we
>>> spread the delay out more, the slow requests will vanish. The possibility
>>> that we have not tuned our setup down to the finest detail is not ruled
>>> out, but I wonder whether we are missing some Ceph tuning in terms of
>>> ceph configuration.
>>> 
>>> We run firefly latest stable version.
>>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Direct IO tests on RBD device vary significantly

2015-08-07 Thread Jan Schermer
You're not really testing only an RBD device there - you're testing:
1) the O_DIRECT implementation in the kernel version you have (they differ)
   - try different kernels in the guest
2) the cache implementation in qemu (and possibly the virtio block driver) - if 
it's enabled
   - disable it for this test completely (cache=none)
3) the O_DIRECT implementation on the filesystem where your "test" file is - most 
important is preallocation!
   - I'm not sure if you can "dd" into an existing file without truncating it, but 
you should first create the file with:
   dd if=/dev/zero of=test bs=1M
It's better to create a new virtual drive and attach it to this machine, then 
test it directly (and while dd is good for "baseline" testing, I recommend you 
use fio).
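
A minimal fio run of that kind inside the guest might look like this (the
device name and sizes are assumptions; point it at the dedicated test disk,
never at a disk holding data):

    fio --name=seqwrite --filename=/dev/vdb --rw=write --bs=4M \
        --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based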

Btw, I find the 4MB test pretty consistent; it seems to oscillate around ~50MB/s. 
In the beginning the cluster was probably busy doing something else (scrubbing? 
cleanup of something? cron scripts?).

Jan

> On 07 Aug 2015, at 03:31, Steve Dainard  wrote:
> 
> Trying to get an understanding why direct IO would be so slow on my cluster.
> 
> Ceph 0.94.1
> 1 Gig public network
> 10 Gig public network
> 10 Gig cluster network
> 
> 100 OSD's, 4T disk sizes, 5G SSD journal.
> 
> As of this morning I had no SSD journal and was finding direct IO was
> sub 10MB/s so I decided to add journals today.
> 
> Afterwards I started running tests again and wasn't very impressed.
> Then for no apparent reason the write speeds increased significantly.
> But I'm finding they vary wildly.
> 
> Currently there is a bit of background ceph activity, but only my
> testing client has an rbd mapped/mounted:
>   election epoch 144, quorum 0,1,2 mon1,mon3,mon2
> osdmap e181963: 100 osds: 100 up, 100 in
>flags noout
>  pgmap v2852566: 4144 pgs, 7 pools, 113 TB data, 29179 kobjects
>227 TB used, 135 TB / 363 TB avail
>4103 active+clean
>  40 active+clean+scrubbing
>   1 active+clean+scrubbing+deep
> 
> Tests:
> 1M block size: http://pastebin.com/LKtsaHrd throughput has no consistency
> 4k block size: http://pastebin.com/ib6VW9eB throughput is amazingly consistent
> 
> Thoughts?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Warning regarding LTTng while checking status or restarting service

2015-08-07 Thread Jan Schermer
Well, you could explicitly export HOME=/root then, that should make it go away.
I think it's normally only present in a login shell.

Jan

> On 06 Aug 2015, at 17:51, Josh Durgin  wrote:
> 
> On 08/06/2015 03:10 AM, Daleep Bais wrote:
>> Hi,
>> 
>> Whenever I restart or check the logs for OSD, MON, I get below warning
>> message..
>> 
>> I am running a test cluster of 09 OSD's and 03 MON nodes.
>> 
>> [ceph-node1][WARNIN] libust[3549/3549]: Warning: HOME environment
>> variable not set. Disabling LTTng-UST per-user tracing. (in
>> setup_local_apps() at lttng-ust-comm.c:375)
> 
> In short: this is harmless, you can ignore it.
> 
> liblttng-ust tries to listen for control commands from lttng-sessiond
> in a few places by default, including under $HOME. It does this via a
> shared mmaped file. If you were interested in tracing as a non-root
> user, you could set LTTNG_HOME to a place that was usable, like
> /var/lib/ceph/. Since ceph daemons run as root today, this is irrelevant, and
> you can still use lttng as root just fine. Unfortunately there's no
> simple way to silence liblttng-ust about this.
> 
> Josh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inconsistent pgs

2015-08-07 Thread Константин Сахинов
Hi!

I have a large number of inconsistent PGs (229 of 656), and it is increasing
every hour.
I'm using ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3).

For example, pg 3.d8:
# ceph health detail | grep 3.d8
pg 3.d8 is active+clean+scrubbing+deep+inconsistent, acting [1,7]

# grep 3.d8 /var/log/ceph/ceph-osd.1.log | less -S
2015-08-07 13:10:48.311810 7f5903f7a700  0 log_channel(cluster) log [INF] : 3.d8 repair starts
2015-08-07 13:12:05.703084 7f5903f7a700 -1 log_channel(cluster) log [ERR] : repair 3.d8 cbd2d0d8/rbd_data.6a5cf474b0dc51.0b1f/head//3 on disk data digest 0x6e4d80bf != 0x6fb5b103
2015-08-07 13:13:26.837524 7f5903f7a700 -1 log_channel(cluster) log [ERR] : repair 3.d8 b5892d8/rbd_data.dbe674b0dc51.01b9/head//3 on disk data digest 0x79082779 != 0x9f102f3d
2015-08-07 13:13:44.874725 7f5903f7a700 -1 log_channel(cluster) log [ERR] : repair 3.d8 ee6dc2d8/rbd_data.e7592ae8944a.0833/head//3 on disk data digest 0x63ab49d0 != 0x68778496
2015-08-07 13:14:19.378582 7f5903f7a700 -1 log_channel(cluster) log [ERR] : repair 3.d8 d93e14d8/rbd_data.3ef8442ae8944a.0729/head//3 on disk data digest 0x3cdb1f5c != 0x4e0400c2
2015-08-07 13:23:38.668080 7f5903f7a700 -1 log_channel(cluster) log [ERR] : 3.d8 repair 4 errors, 0 fixed
2015-08-07 13:23:38.714668 7f5903f7a700  0 log_channel(cluster) log [INF] : 3.d8 deep-scrub starts
2015-08-07 13:25:00.656306 7f5903f7a700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.d8 cbd2d0d8/rbd_data.6a5cf474b0dc51.0b1f/head//3 on disk data digest 0x6e4d80bf != 0x6fb5b103
2015-08-07 13:26:18.775362 7f5903f7a700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.d8 b5892d8/rbd_data.dbe674b0dc51.01b9/head//3 on disk data digest 0x79082779 != 0x9f102f3d
2015-08-07 13:26:42.084218 7f5903f7a700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.d8 ee6dc2d8/rbd_data.e7592ae8944a.0833/head//3 on disk data digest 0x59a6e7e0 != 0x68778496
2015-08-07 13:26:56.495207 7f5903f7a700 -1 log_channel(cluster) log [ERR] : be_compare_scrubmaps: 3.d8 shard 1: soid cc49f2d8/rbd_data.3ef8442ae8944a.0aff/head//3 data_digest 0x4e20a792 != known data_digest 0xc0e9b2d2 from auth shard 7
2015-08-07 13:27:12.134765 7f5903f7a700 -1 log_channel(cluster) log [ERR] : deep-scrub 3.d8 d93e14d8/rbd_data.3ef8442ae8944a.0729/head//3 on disk data digest 0x3cdb1f5c != 0x4e0400c2

osd.7.log is clean for that period of time.

Please help to heal my cluster.
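
For reference, the commands usually involved here (a sketch; on Hammer the
bad replica has to be tracked down by hand, and the filestore path and object
name pattern below are illustrative assumptions):

    ceph health detail | grep inconsistent
    ceph pg deep-scrub 3.d8
    ceph pg repair 3.d8          # repair prefers the primary's copy
    # To compare replicas manually, locate the object file on each OSD and
    # checksum it, e.g. on osd.1:
    find /var/lib/ceph/osd/ceph-1/current/3.d8_head/ \
        -name '*6a5cf474b0dc51*0b1f*' -exec md5sum {} \;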
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE: inconsistent pgs

2015-08-07 Thread Межов Игорь Александрович
Hi!

Do you have any disk errors in the dmesg output? In our practice, every time a 
deep scrub found an inconsistent PG, we also found a disk error that was the 
reason. Sometimes it was media errors (bad sectors); once it was a bad SATA 
cable; and we also had some RAID/HBA firmware issues. But in all cases where we 
saw an inconsistent PG, we also saw disk errors. So please check them on your 
cluster first.

Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2015-08-07 Thread Константин Сахинов
No, dmesg is clean on both hosts of osd.1 (block0) and osd.7 (block2).
There are only boot-time messages (listings below).
If there are cable or SATA controller issues, will they be shown in
/var/log/dmesg?


block0 dmesg:
[5.296302] XFS (sdb1): Mounting V4 Filesystem
[5.316487] XFS (sda1): Mounting V4 Filesystem
[5.464439] XFS (sdb1): Ending clean mount
[5.464466] SELinux: initialized (dev sdb1, type xfs), uses xattr
[5.506185] XFS (sda1): Ending clean mount
[5.506214] SELinux: initialized (dev sda1, type xfs), uses xattr
[5.649101] XFS (sdc1): Ending clean mount
[5.649129] SELinux: initialized (dev sdc1, type xfs), uses xattr
[5.668062] systemd-journald[500]: Received request to flush runtime
journal from PID 1
[5.786559] type=1305 audit(1438754525.356:4): audit_pid=629 old=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[5.870809] snd_hda_intel :00:1b.0: Codec #2 probe error; disabling
it...
[5.976594] sound hdaudioC0D0: autoconfig: line_outs=1
(0x14/0x0/0x0/0x0/0x0) type:line
[5.976604] sound hdaudioC0D0:speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[5.976609] sound hdaudioC0D0:hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
[5.976613] sound hdaudioC0D0:mono: mono_out=0x0
[5.976617] sound hdaudioC0D0:dig-out=0x11/0x0
[5.976621] sound hdaudioC0D0:inputs:
[5.976627] sound hdaudioC0D0:  Rear Mic=0x18
[5.976632] sound hdaudioC0D0:  Front Mic=0x19
[5.976637] sound hdaudioC0D0:  Line=0x1a
[5.976641] sound hdaudioC0D0:  CD=0x1c

block2 dmesg:
[4.987126] XFS (sda1): Mounting V4 Filesystem
[4.990737] raid6: sse2x1 601 MB/s
[5.007765] raid6: sse2x2 632 MB/s
[5.024711] raid6: sse2x41316 MB/s
[5.024717] raid6: using algorithm sse2x4 (1316 MB/s)
[5.024720] raid6: using ssse3x2 recovery algorithm
[5.046099] XFS (sda1): Ending clean mount
[5.046126] SELinux: initialized (dev sda1, type xfs), uses xattr
[5.074257] Btrfs loaded
[5.075283] BTRFS: device fsid a3176b6c-2e2b-4121-9dc3-f191d469eccc
devid 1 transid 287115 /dev/sdc
[5.098721] Adding 4882428k swap on /dev/mapper/root-swap00.
Priority:-1 extents:1 across:4882428k SSFS
[5.157349] XFS (sdc1): Mounting V4 Filesystem
[5.222992] XFS (sdb1): Mounting V4 Filesystem
[5.362687] XFS (sdc1): Ending clean mount
[5.362738] SELinux: initialized (dev sdc1, type xfs), uses xattr
[5.417178] XFS (sdb1): Ending clean mount
[5.417208] SELinux: initialized (dev sdb1, type xfs), uses xattr
[5.429280] systemd-journald[497]: Received request to flush runtime
journal from PID 1
[5.533922] type=1305 audit(1438754526.111:4): audit_pid=626 old=0
auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1

Fri, 7 Aug 2015 at 14:08, Межов Игорь Александрович :

> Hi!
>
> Do you have any disk errors in dmesg output? In our practice, every time
> the deep
> scrub found inconsistent PG, we also found a disk error, that was the
> reason.
> Sometimes it was media errors (bad sectors), one time - bad sata cable and
> we also had some raid/hba firmware issues. But in all cases, when we see
> inconsistent
> PG - we also see disk errors. So, please, check them at your cluster
> firstly.
>
> Megov Igor
> CIO, Yuterra
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Межов Игорь Александрович
Hi!

We do some performance tests on our small Hammer install:
 - Debian Jessie;
 - Ceph Hammer 0.94.2 self-built from sources (tcmalloc)
 - 1xE5-2670 + 128Gb RAM
 - 2 nodes shared with mons, system and mon DB are on separate SAS mirror;
 - 16 OSD on each node, SAS 10k;
 - 2 Intel DC S3700 200Gb SSD for journalling 
 - 10Gbit interconnect, shared public and cluster network, MTU9100
 - 10Gbit client host, fio 2.2.7 compiled with RBD engine

We benchmarked 4k random read performance on a 500G RBD volume with fio-rbd 
and got different results. When we use XFS 
(noatime,attr2,inode64,allocsize=4096k,noquota) on the OSD disks, we get ~7k 
sustained iops. After recreating the same OSDs with ext4 
(noatime,data=ordered) we can achieve ~9.5k iops in the same 
benchmark.
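
For reference, a fio job of the kind used for this test might look like the
following (pool, image name and queue depth are assumptions):

    # fio-rbd-randread.job - 4k random reads through librbd
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=bench500g
    direct=1
    [randread-4k]
    rw=randread
    bs=4k
    iodepth=32
    runtime=120
    time_based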

So there are some questions for the community:
 1. Does ext4 really perform better under a typical RBD load (we use Ceph to 
host VM images)?
 2. Is it safe to mix OSDs with different backing filesystems in one cluster 
(we use ceph-deploy to create and manage OSDs)?
 3. Is it safe to move our production cluster (Firefly 0.80.7) from XFS to ext4 
by removing the XFS OSDs one by one and later adding the same disk drives back 
as ext4 OSDs (of course, I know about the huge data movement that will take 
place during this process)?

Thanks!

Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE: inconsistent pgs

2015-08-07 Thread Межов Игорь Александрович
Hi!

>If there is cable or SATA-controller issues, will it be shown in 
>/var/log/dmesg?

If it leads to read errors, it will be logged in dmesg. If it only causes
SATA command retransmission, it may not be logged in dmesg,
but it should show up in the SMART attributes.
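
A quick way to check those attributes (the device name is an example):

    # Look for reallocated/pending sectors and CRC (cabling) errors
    smartctl -a /dev/sdc | egrep -i 'reallocated|pending|crc|uncorrect'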

And anyway, we have only seen a few (<10) inconsistencies during a year
of production usage, not as large a number as you have. So maybe
it is not a disk error issue.

Also, there are two strange things I can see:
- Do you need SELinux enabled? AFAIK, SELinux security attributes are
stored in extended attributes (xattrs), and Ceph also uses xattrs
intensively. Maybe there are some problems between them? Try disabling
SELinux at boot - I have seen this suggestion quite often when troubleshooting.
- As I can see, you use XFS on your OSDs? But in dmesg we see
records about the same /dev/sdc:

>BTRFS: device fsid a3176b6c-2e2b-4121-9dc3-f191d469eccc devid 1
>transid 287115 /dev/sdc

and

>XFS (sdc1): Mounting V4 Filesystem
>XFS (sdc1): Ending clean mount

Is it btrfs or XFS? Personally, I haven't had much success with btrfs;
some of my test setups ended with completely hung machines.


Megov Igor
CIO, Yuterra



block0 dmesg:
[5.296302] XFS (sdb1): Mounting V4 Filesystem
[5.316487] XFS (sda1): Mounting V4 Filesystem
[5.464439] XFS (sdb1): Ending clean mount
[5.464466] SELinux: initialized (dev sdb1, type xfs), uses xattr
[5.506185] XFS (sda1): Ending clean mount
[5.506214] SELinux: initialized (dev sda1, type xfs), uses xattr
[5.649101] XFS (sdc1): Ending clean mount
[5.649129] SELinux: initialized (dev sdc1, type xfs), uses xattr
[5.668062] systemd-journald[500]: Received request to flush runtime journal 
from PID 1
[5.786559] type=1305 audit(1438754525.356:4): audit_pid=629 old=0 
auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
[5.870809] snd_hda_intel :00:1b.0: Codec #2 probe error; disabling it...
[5.976594] sound hdaudioC0D0: autoconfig: line_outs=1 
(0x14/0x0/0x0/0x0/0x0) type:line
[5.976604] sound hdaudioC0D0:speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[5.976609] sound hdaudioC0D0:hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
[5.976613] sound hdaudioC0D0:mono: mono_out=0x0
[5.976617] sound hdaudioC0D0:dig-out=0x11/0x0
[5.976621] sound hdaudioC0D0:inputs:
[5.976627] sound hdaudioC0D0:  Rear Mic=0x18
[5.976632] sound hdaudioC0D0:  Front Mic=0x19
[5.976637] sound hdaudioC0D0:  Line=0x1a
[5.976641] sound hdaudioC0D0:  CD=0x1c

block2 dmesg:
[4.987126] XFS (sda1): Mounting V4 Filesystem
[4.990737] raid6: sse2x1 601 MB/s
[5.007765] raid6: sse2x2 632 MB/s
[5.024711] raid6: sse2x41316 MB/s
[5.024717] raid6: using algorithm sse2x4 (1316 MB/s)
[5.024720] raid6: using ssse3x2 recovery algorithm
[5.046099] XFS (sda1): Ending clean mount
[5.046126] SELinux: initialized (dev sda1, type xfs), uses xattr
[5.074257] Btrfs loaded
[5.075283] BTRFS: device fsid a3176b6c-2e2b-4121-9dc3-f191d469eccc devid 1 
transid 287115 /dev/sdc
[5.098721] Adding 4882428k swap on /dev/mapper/root-swap00.  Priority:-1 
extents:1 across:4882428k SSFS
[5.157349] XFS (sdc1): Mounting V4 Filesystem
[5.222992] XFS (sdb1): Mounting V4 Filesystem
[5.362687] XFS (sdc1): Ending clean mount
[5.362738] SELinux: initialized (dev sdc1, type xfs), uses xattr
[5.417178] XFS (sdb1): Ending clean mount
[5.417208] SELinux: initialized (dev sdb1, type xfs), uses xattr
[5.429280] systemd-journald[497]: Received request to flush runtime journal 
from PID 1
[5.533922] type=1305 audit(1438754526.111:4): audit_pid=626 old=0 
auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1

Fri, 7 Aug 2015 at 14:08, Межов Игорь Александрович <me...@yuterra.ru>:
Hi!

Do you have any disk errors in dmesg output? In our practice, every time the 
deep
scrub found inconsistent PG, we also found a disk error, that was the reason.
Sometimes it was media errors (bad sectors), one time - bad sata cable and
we also had some raid/hba firmware issues. But in all cases, when we see 
inconsistent
PG - we also see disk errors. So, please, check them at your cluster firstly.

Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2015-08-07 Thread Константин Сахинов
"you use XFS on your OSDs?"

This OSD was formatted with btrfs as a whole block device, /dev/sdc (with no
partition table). Then I moved from btrfs to XFS on /dev/sdc1 (with a partition
table), because btrfs was v-v-very slow. Maybe partprobe sees some old
signatures in the first sectors of that disk...
By "move" I mean removing the OSD from the cluster as described in the official
docs, http://ceph.com/docs/master/rados/operations/add-or-rm-osds/, and then
adding it back as a new OSD.

Fri, 7 Aug 2015 at 14:46, Межов Игорь Александрович :

> Hi!
>
>
> >If there is cable or SATA-controller issues, will it be shown in
> /var/log/dmesg?
>
> If it will lead to read errors, it will be logged in dmesg. If it cause
> only
> SATA command retransmission, it maybe won't logged in dmesg,
> but have to be shown in SMART attributes.
>
> And anyway, we face only some (<10) inconsistences during a year
> in production usage, not so large amount, as you have. So maybe
> it is not a disk error issue.
>
> Also, there are two strange things I can see:
> - do you need SELinux enabled? AFAIK, SELinux security attributes are
> stored in extended attrs (xatts), and Ceph also use xattrs in intensive
> manner. Maybe some problems between them? Try to disable SELinux
> at boot - I saw this suggestion quite often, when do trobleshooting.
> - As I cen see, you use XFS on your OSDs? But in dmesg we see
> a records about the same /dev/sdc:
>
>
> >BTRFS: device fsid a3176b6c-2e2b-4121-9dc3-f191d469eccc devid 1
> >transid 287115 /dev/sdc
>
> and
>
>
> >XFS (sdc1): Mounting V4 Filesystem
> >XFS (sdc1): Ending clean mount
>
> Is it btrfs or XFS? I, personally, dont have much success with btrfs.
> Some of my test usages ends with completely hang machines.
>
>
> Megov Igor
> CIO, Yuterr
>
>
>
> block0 dmesg:
> [5.296302] XFS (sdb1): Mounting V4 Filesystem
> [5.316487] XFS (sda1): Mounting V4 Filesystem
> [5.464439] XFS (sdb1): Ending clean mount
> [5.464466] SELinux: initialized (dev sdb1, type xfs), uses xattr
> [5.506185] XFS (sda1): Ending clean mount
> [5.506214] SELinux: initialized (dev sda1, type xfs), uses xattr
> [5.649101] XFS (sdc1): Ending clean mount
> [5.649129] SELinux: initialized (dev sdc1, type xfs), uses xattr
> [5.668062] systemd-journald[500]: Received request to flush runtime
> journal from PID 1
> [5.786559] type=1305 audit(1438754525.356:4): audit_pid=629 old=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
> [5.870809] snd_hda_intel :00:1b.0: Codec #2 probe error; disabling
> it...
> [5.976594] sound hdaudioC0D0: autoconfig: line_outs=1
> (0x14/0x0/0x0/0x0/0x0) type:line
> [5.976604] sound hdaudioC0D0:speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
> [5.976609] sound hdaudioC0D0:hp_outs=1 (0x1b/0x0/0x0/0x0/0x0)
> [5.976613] sound hdaudioC0D0:mono: mono_out=0x0
> [5.976617] sound hdaudioC0D0:dig-out=0x11/0x0
> [5.976621] sound hdaudioC0D0:inputs:
> [5.976627] sound hdaudioC0D0:  Rear Mic=0x18
> [5.976632] sound hdaudioC0D0:  Front Mic=0x19
> [5.976637] sound hdaudioC0D0:  Line=0x1a
> [5.976641] sound hdaudioC0D0:  CD=0x1c
>
> block2 dmesg:
> [4.987126] XFS (sda1): Mounting V4 Filesystem
> [4.990737] raid6: sse2x1 601 MB/s
> [5.007765] raid6: sse2x2 632 MB/s
> [5.024711] raid6: sse2x41316 MB/s
> [5.024717] raid6: using algorithm sse2x4 (1316 MB/s)
> [5.024720] raid6: using ssse3x2 recovery algorithm
> [5.046099] XFS (sda1): Ending clean mount
> [5.046126] SELinux: initialized (dev sda1, type xfs), uses xattr
> [5.074257] Btrfs loaded
> [5.075283] BTRFS: device fsid a3176b6c-2e2b-4121-9dc3-f191d469eccc
> devid 1 transid 287115 /dev/sdc
> [5.098721] Adding 4882428k swap on /dev/mapper/root-swap00.
> Priority:-1 extents:1 across:4882428k SSFS
> [5.157349] XFS (sdc1): Mounting V4 Filesystem
> [5.222992] XFS (sdb1): Mounting V4 Filesystem
> [5.362687] XFS (sdc1): Ending clean mount
> [5.362738] SELinux: initialized (dev sdc1, type xfs), uses xattr
> [5.417178] XFS (sdb1): Ending clean mount
> [5.417208] SELinux: initialized (dev sdb1, type xfs), uses xattr
> [5.429280] systemd-journald[497]: Received request to flush runtime
> journal from PID 1
> [5.533922] type=1305 audit(1438754526.111:4): audit_pid=626 old=0
> auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
>
> пт, 7 авг. 2015 г. в 14:08, Межов Игорь Александрович :
>
>> Hi!
>>
>> Do you have any disk errors in dmesg output? In our practice, every time
>> the deep
>> scrub found inconsistent PG, we also found a disk error, that was the
>> reason.
>> Sometimes it was media errors (bad sectors), one time - bad sata cable
>> and
>> we also had some raid/hba firmware issues. But in all cases, when we see
>> inconsistent
>> PG - we also see disk errors. So, please, check them at your cluster
>> firstly.
>>
>> Megov Igor
>> CIO, Yuterra
>>
>>
>>
___

[ceph-users] RE: inconsistent pgs

2015-08-07 Thread Межов Игорь Александрович
Hi!

When did the inconsistent PGs start to appear? Maybe after some event -
a hang, a node reboot, a reconfiguration or a change of parameters?
Can you say what triggers this behaviour? And, BTW, what system/kernel
do you use?

Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2015-08-07 Thread Константин Сахинов
It's hard to say now. I changed my 6 OSDs one by one from btrfs to xfs.
During the repair process I added 2 more OSDs. I changed the crush map from
a root-host-osd to a root-chassis-host-osd structure... SSD cache
tiering was set up when the first inconsistency showed up. Then I removed the
tiering to confirm that it was not the reason for the inconsistencies.
Once there was a hardware problem with one node - a PCI slot issue. I shut down
that node and exchanged the motherboard for the same model.
I'm running CentOS Linux release 7.1.1503 (Core) with the
3.10.0-229.7.2.el7.x86_64 kernel.

Fri, 7 Aug 2015 at 15:18, Межов Игорь Александрович :

> Hi!
>
> When inconsistent PGs starting to appear? Maybe after some event?
> Hang, node reboot or after reconfiguration or changing parameters?
> Can you say, what triggers such behaviour? And, BTW, what system/kernel
> you use?
>
> Megov Igor
> CIO, Yuterra
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent pgs

2015-08-07 Thread Константин Сахинов
When I changed the crush map from root-host-osd to root-chassis-host-osd, did
I have to change the default ruleset? I didn't change it. It looks like this:

rule replicated_ruleset {
  ruleset 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
}
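
For comparison, a rule that actually separates replicas across chassis would
use the new bucket type in the chooseleaf step (a sketch; only needed if you
want replicas on different chassis rather than just on different hosts - the
existing rule with "type host" stays valid after inserting a chassis level):

rule replicated_chassis {
  ruleset 1
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type chassis
  step emit
}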

Fri, 7 Aug 2015 at 15:39, Константин Сахинов :

> It's hard to say now. I changed one-by-one my 6 OSDs from btrfs to xfs.
> During the repair process I added 2 more OSDs. Changed crush map from
> root-host-osd to root-*chasis*-host-osd structure... There was SSD cache
> tiering set, when first inconsistency showed up. Then I removed tiering to
> confirm than it was not the reason of inconsistencies.
> Once there was hardware problem with one node - PCI slot issue. I shut
> down that node and exchanged motherboard to the same model.
> I'm running CentOS Linux release 7.1.1503 (Core) with
> 3.10.0-229.7.2.el7.x86_64 kernel.
>
> пт, 7 авг. 2015 г. в 15:18, Межов Игорь Александрович :
>
>> Hi!
>>
>> When inconsistent PGs starting to appear? Maybe after some event?
>> Hang, node reboot or after reconfiguration or changing parameters?
>> Can you say, what triggers such behaviour? And, BTW, what system/kernel
>> you use?
>>
>> Megov Igor
>> CIO, Yuterra
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD are not seen as down when i stop node

2015-08-07 Thread Thomas Bernard
Hi,

I recently added 5 nodes to my Ceph cluster; each node stores 16 OSDs.
My old nodes were on the Firefly release and I upgraded my cluster to Hammer.

My problem is that when I stop a node or an OSD (with or without setting noout
first), the OSDs are not seen as down. All OSDs stay up and I get
many blocked requests.
If I restart an old node everything is fine; I can't explain that.

All monitors and OSDs have been restarted.
All daemons are on the same version (verified with ceph --admin-daemon
/var/run/ceph/ceph-osd.xxx version).
All nodes have the same kernel version: 3.13.0-61-generic.

My setup: 9 nodes, 116 OSDs
Ubuntu 12.04

Thanks for your help
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Udo Lembke
Hi,
some time ago I switched all OSDs from XFS to ext4 (step by step).
I had no issues while the OSD formats were mixed (the process took some weeks).

And yes, for me ext4 also performs better (esp. the latencies).

Udo

On 07.08.2015 13:31, Межов Игорь Александрович wrote:
> Hi!
> 
> We do some performance tests on our small Hammer install:
>  - Debian Jessie;
>  - Ceph Hammer 0.94.2 self-built from sources (tcmalloc)
>  - 1xE5-2670 + 128Gb RAM
>  - 2 nodes shared with mons, system and mon DB are on separate SAS mirror;
>  - 16 OSD on each node, SAS 10k;
>  - 2 Intel DC S3700 200Gb SSD for journalling 
>  - 10Gbit interconnect, shared public and cluster metwork, MTU9100
>  - 10Gbit client host, fio 2.2.7 compiled with RBD engine
> 
> We benchmark 4k random read performance on 500G RBD volume with fio-rbd 
> and got different results. When we use XFS 
> (noatime,attr2,inode64,allocsize=4096k,
> noquota) on OSD disks, we can get ~7k sustained iops. After recreating the 
> same OSDs
> with EXT4 fs (noatime,data=ordered) we can achieve ~9.5k iops in the same 
> benchmark.
> 
> So there are some questions to community:
>  1. Is really EXT4 perform better under typical RBD load (we Ceph to host VM 
> images)?
>  2. Is it safe to intermix OSDs with different backingstore filesystems at 
> one cluster 
> (we use ceph-deploy to create and manage OSDs)?
>  3. Is it safe to move our production cluster (Firefly 0.80.7) from XFS to 
> ext4 by
> removing XFS osds one-by-one and later add the same disk drives as Ext4 OSDs
> (of course, I know about huge data-movement that will take place during this 
> process)?
> 
> Thanks!
> 
> Megov Igor
> CIO, Yuterra
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Burkhard Linke

Hi,


On 08/07/2015 04:04 PM, Udo Lembke wrote:

Hi,
some time ago I switched all OSDs from XFS to ext4 (step by step).
I had no issues during mixed osd-format (the process takes some weeks).

And yes, for me ext4 performs also better (esp. the latencies).

Just out of curiosity:

Do you use an ext4 setup as described in the documentation? Did you try 
using external ext4 journals on SSD?


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Udo Lembke
Hi,
I use the ext4 parameters that Christian Balzer posted once:
osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0

The OSD journals are on SSD partitions (without a filesystem). IMHO ext4 doesn't 
support a different journal device like xfs does, but I assume you mean the 
OSD journal and not the filesystem journal?!

Udo

On 07.08.2015 16:13, Burkhard Linke wrote:
> Hi,
> 
> 
> On 08/07/2015 04:04 PM, Udo Lembke wrote:
>> Hi,
>> some time ago I switched all OSDs from XFS to ext4 (step by step).
>> I had no issues during mixed osd-format (the process takes some weeks).
>>
>> And yes, for me ext4 performs also better (esp. the latencies).
> Just out of curiosity:
> 
> Do you use a ext4 setup as described in the documentation? Did you try to use 
> external ext4 journals on SSD?
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Burkhard Linke

Hi,

On 08/07/2015 04:30 PM, Udo Lembke wrote:

Hi,
I use the ext4-parameters like Christian Balzer wrote in one posting:
osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0

Thx for the details.


The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
support an different journal-device, like
xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!

No, I was indeed talking about the ext4 journals, e.g. described here:

http://raid6.com.au/posts/fs_ext4_external_journal_caveats/

The setup is tempting (both ext4 + OSD journal on SSD), but the problem 
with the persistent device names is keeping me from trying it.
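
For reference, the basic setup described in that post looks roughly like this
(device names and the partition label are placeholders):

    # Create a dedicated external journal on an SSD partition...
    mke2fs -O journal_dev /dev/disk/by-partlabel/ext4-journal-15
    # ...then build the data filesystem on the spinner pointing at it:
    mkfs.ext4 -J device=/dev/disk/by-partlabel/ext4-journal-15 /dev/sdX1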


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Different filesystems on OSD hosts at the samecluster

2015-08-07 Thread Jan Schermer
ext4 does support an external journal, and it is _FAST_.

Btw, I'm not sure noatime is the right option nowadays, for two reasons:
1) the default is "relatime", which has minimal impact on performance
2) AFAIK some Ceph features actually use atime (cache tiering, was it?), or at 
least so I gathered from some bugs I saw

Jan

> On 07 Aug 2015, at 16:30, Udo Lembke  wrote:
> 
> Hi,
> I use the ext4-parameters like Christian Balzer wrote in one posting:
> osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
> osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0
> 
> The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
> support an different journal-device, like
> xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!
> 
> Udo
> 
> Am 07.08.2015 16:13, schrieb Burkhard Linke:
>> Hi,
>> 
>> 
>> On 08/07/2015 04:04 PM, Udo Lembke wrote:
>>> Hi,
>>> some time ago I switched all OSDs from XFS to ext4 (step by step).
>>> I had no issues during mixed osd-format (the process takes some weeks).
>>> 
>>> And yes, for me ext4 performs also better (esp. the latencies).
>> Just out of curiosity:
>> 
>> Do you use a ext4 setup as described in the documentation? Did you try to 
>> use external ext4 journals on SSD?
>> 
>> Regards,
>> Burkhard
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE: Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Межов Игорь Александрович
Hi!

>No, I was indeed talking about the ext4 journals, e.g. described here:
...
>but the problem with the persistent device names is keeping me from trying it.

So you assume a 3-way setup in Ceph: a first drive for filesystem data, a
second drive for the filesystem journal and a third drive for the Ceph journal?
And what are the benefits?
Ceph journaling already supports transactional writes, and ext4 journaling
doesn't improve on it anyway. Maybe it is useful to split IOPS onto a pair of
devices instead of one?
It is too complicated a setup, I think.


Megov Igor
CIO, Yuterra



From: ceph-users on behalf of Burkhard Linke
Sent: 7 August 2015 17:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Different filesystems on OSD hosts at the
same cluster

Hi,

On 08/07/2015 04:30 PM, Udo Lembke wrote:
> Hi,
> I use the ext4-parameters like Christian Balzer wrote in one posting:
> osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
> osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0
Thx for the details.
>
> The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
> support an different journal-device, like
> xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!
No, I was indeed talking about the ext4 journals, e.g. described here:

http://raid6.com.au/posts/fs_ext4_external_journal_caveats/

The setup is tempting (both ext4 + OSD journal on SSD), but the problem
with the persistent device names is keeping me from trying it.

Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RE: Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Jan Schermer
An interesting benchmark would be to compare "Ceph SSD journal" + "ext4 on 
spinner" versus "Ceph without journal" + "ext4 on spinner with an external SSD 
journal".
I wouldn't be surprised if the second outperformed the first - you are actually 
making the whole setup much simpler, and Ceph is mostly CPU bound. My bet is on 
ext4.

Jan
 
> On 07 Aug 2015, at 16:57, Межов Игорь Александрович  wrote:
> 
> Hi!
> 
>> No, I was indeed talking about the ext4 journals, e.g. described here:
> ...
>> but the problem with the persistent device names is keeping me from trying 
>> it.
> 
> So you assume 3-way setup in Ceph: first drive for filesystem data, second
> drive for filesystem journal and third drive for ceph journal?  And what is 
> the benefits? 
> Ceph journalling already support transactional writes and ext4 journaling 
> doesn't
> improve it anyway. Maybe it is useful to split iops onto a pair devices 
> instead of one?
> It is a too complicated setup, I think.
> 
> 
> Megov Igor
> CIO, Yuterra
> 
> 
> 
> От: ceph-users  от имени Burkhard Linke 
> 
> Отправлено: 7 августа 2015 г. 17:37
> Кому: ceph-users@lists.ceph.com
> Тема: Re: [ceph-users] Different filesystems on OSD hosts at the
> samecluster
> 
> Hi,
> 
> On 08/07/2015 04:30 PM, Udo Lembke wrote:
>> Hi,
>> I use the ext4-parameters like Christian Balzer wrote in one posting:
>> osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
>> osd_mkfs_options_ext4 = -J size=1024 -E 
>> lazy_itable_init=0,lazy_journal_init=0
> Thx for the details.
>> 
>> The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
>> support an different journal-device, like
>> xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!
> No, I was indeed talking about the ext4 journals, e.g. described here:
> 
> http://raid6.com.au/posts/fs_ext4_external_journal_caveats/
> 
> The setup is tempting (both ext4 + OSD journal on SSD), but the problem
> with the persistent device names is keeping me from trying it.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RE: Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Udo Lembke
Hi,
I also think it's much too complicated and the effort is out of proportion; as 
Megov already wrote, the OSD journal on SSD handles the speed.

But for persistent device names you can easily use a partition label and select 
the disk with something like
/dev/disk/by-partlabel/ext4-journal-15

I do it this way with the OSD journal (osd_journal = 
/dev/disk/by-partlabel/journal-$id) - with this method I can see very
quickly which journal is on which SSD.
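
Setting such a label on a GPT partition can be done with sgdisk, for example
(partition number, label and device are examples):

    # Name partition 1 on the SSD so it shows up under /dev/disk/by-partlabel/
    sgdisk --change-name=1:journal-15 /dev/sdb
    ls -l /dev/disk/by-partlabel/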

Udo

On 07.08.2015 16:57, Межов Игорь Александрович wrote:
> Hi!
> 
>> No, I was indeed talking about the ext4 journals, e.g. described here:
> ...
>> but the problem with the persistent device names is keeping me from trying 
>> it.
> 
> So you assume 3-way setup in Ceph: first drive for filesystem data, second
> drive for filesystem journal and third drive for ceph journal?  And what is 
> the benefits? 
> Ceph journalling already support transactional writes and ext4 journaling 
> doesn't
> improve it anyway. Maybe it is useful to split iops onto a pair devices 
> instead of one?
> It is a too complicated setup, I think.
> 
> 
> Megov Igor
> CIO, Yuterra
> 
> 
> 
> От: ceph-users  от имени Burkhard Linke 
> 
> Отправлено: 7 августа 2015 г. 17:37
> Кому: ceph-users@lists.ceph.com
> Тема: Re: [ceph-users] Different filesystems on OSD hosts at the
> samecluster
> 
> Hi,
> 
> On 08/07/2015 04:30 PM, Udo Lembke wrote:
>> Hi,
>> I use the ext4-parameters like Christian Balzer wrote in one posting:
>> osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
>> osd_mkfs_options_ext4 = -J size=1024 -E 
>> lazy_itable_init=0,lazy_journal_init=0
> Thx for the details.
>>
>> The osd-journals are on SSD-Partitions (without filesystem). IMHO ext4 don't 
>> support an different journal-device, like
>> xfs do, but I assume you mean the osd-jounal and not the filesystem journal?!
> No, I was indeed talking about the ext4 journals, e.g. described here:
> 
> http://raid6.com.au/posts/fs_ext4_external_journal_caveats/
> 
> The setup is tempting (both ext4 + OSD journal on SSD), but the problem
> with the persistent device names is keeping me from trying it.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE: inconsistent pgs

2015-08-07 Thread Межов Игорь Александрович
Hi!

I'm sorry, but I don't know how to help you. We moved OSDs from XFS to ext4 on 
our test cluster (Hammer 0.94.2), removing OSDs one by one and re-adding them 
after reformatting to ext4. This process is standard for Ceph (Add/Remove OSDs 
in the documentation) and took place without any data loss. We also changed the 
ruleset, as described in Sebastien Han's blog:

http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

And that also did no harm to the data. But we don't use tiering; maybe something 
happened to the data while removing the cache tier, like not all objects being 
written back to the lower-tier pool?


Megov Igor
CIO, Yuterra



From: Константин Сахинов
Sent: 7 August 2015 15:39
To: Межов Игорь Александрович; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] inconsistent pgs

It's hard to say now. I changed one-by-one my 6 OSDs from btrfs to xfs. During 
the repair process I added 2 more OSDs. Changed crush map from root-host-osd to 
root-chasis-host-osd structure... There was SSD cache tiering set, when first 
inconsistency showed up. Then I removed tiering to confirm than it was not the 
reason of inconsistencies.
Once there was hardware problem with one node - PCI slot issue. I shut down 
that node and exchanged motherboard to the same model.
I'm running CentOS Linux release 7.1.1503 (Core) with 3.10.0-229.7.2.el7.x86_64 
kernel.

Fri, 7 Aug 2015 at 15:18, Межов Игорь Александрович <me...@yuterra.ru>:
Hi!

When inconsistent PGs starting to appear? Maybe after some event?
Hang, node reboot or after reconfiguration or changing parameters?
Can you say, what triggers such behaviour? And, BTW, what system/kernel
you use?

Megov Igor
CIO, Yuterra

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE: RE: Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Межов Игорь Александрович
Hi!

>An interesting benchmark would be to compare "Ceph SSD journal" + "ext4 on 
>spinner" >versus "Ceph without journal" + "ext4 on spinner with external SSD 
>journal".
>I won't be surprised if the second outperformed the first - you are actually 
>making 
>the whole setup much simpler and Ceph is mostly CPU bound. My bet is on ext4.


Well, it's worth a try! Maybe I'll do that kind of setup in my spare time -
and write about the results, of course. What IO patterns should I test?

Megov Igor 
CIO, Yuterra




From: Jan Schermer
Sent: 7 August 2015 18:00
To: Межов Игорь Александрович
Cc: Burkhard Linke; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RE: Different filesystems on OSD hosts at the
same cluster


Jan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

2015-08-07 Thread SCHAER Frederic

From: Jake Young [mailto:jak3...@gmail.com]
Sent: Wednesday, 29 July 2015 17:13
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on >1 hosts ??

On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:
>
> Hi again,
>
> So I have tried
> - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
> - changing the memory configuration, from "advanced ecc mode" to "performance 
> mode", boosting the memory bandwidth from 35GB/s to 40GB/s
> - plugged a second 10GB/s link and setup a ceph internal network
> - tried various "tuned-adm profile" such as "throughput-performance"
>
> This changed about nothing.
>
> If
> - the CPUs are not maxed out, and lowering the frequency doesn't change a 
> thing
> - the network is not maxed out
> - the memory doesn't seem to have an impact
> - network interrupts are spread across all 8 cpu cores and receive queues are 
> OK
> - disks are not used at their maximum potential (iostat shows my dd commands 
> produce much more tps than the 4MB ceph transfers...)
>
> Where can I possibly find a bottleneck ?
>
> I'm /(almost) out of ideas/ ... :'(
>
> Regards
>
>
Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of the 
same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph osds and 
any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.

Jake
[>- FS : -<]
Hi,

Thanks for suggesting these :]

I finally got some time to try your kernel parameters… but that doesn’t seem to 
help at least for the EC pools.
I’ll need to re-add all the disk OSDs to be really sure, especially with the 
replicated pools – I’d like to see if at least the replicated pools are better, 
so that I can use them as frontend pools…

Regards



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is there a limit for object size in CephFS?

2015-08-07 Thread Hadi Montakhabi
Hello Cephers,

I am benchmarking CephFS. In one of my experiments, I change the object
size, starting from 64 KB. Each time I do reads and writes with different block
sizes. When I increase the object size to 64 MB and the block size to
64 MB, CephFS crashes (shown in the chart below). What I mean by "crash" is that
when I run "ceph -s" or "ceph -w" it keeps reporting reads
but never finishes the operation (even after a few days!).
I have repeated this experiment for different underlying file systems (xfs
and btrfs), and the same thing happens in both cases.
What could be the reason for CephFS crashing? Is there a limit on object
size in CephFS?
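
For reference, the per-directory layout used in such experiments can be
inspected and changed through the ceph virtual xattrs, e.g. (the directory
name is an example; the object size must remain a multiple of the stripe unit):

    # Use a 64 MB object size for new files created under this directory
    setfattr -n ceph.dir.layout.object_size -v 67108864 /cephfs/bench64M
    getfattr -n ceph.dir.layout /cephfs/bench64M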

Thank you,
Hadi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RE: inconsistent pgs

2015-08-07 Thread Jan Schermer
Did you copy the OSD objects between btrfs->xfs or did you remove the btrfs OSD 
and add a new XFS OSD?

Jan

> On 07 Aug 2015, at 17:06, Межов Игорь Александрович  wrote:
> 
> Hi!
> 
> I'm sorry, but I dont know, how to help you. We move OSDs from XFS to EXT4 on 
> our test 
> cluster (Hammer 0.94.2), removing ODSs one-by-one and re-adding them after 
> reformatting 
> to EXT4. This process is usual to a ceph (Add/Remove OSDs in documentaion) 
> and took
> place without any data loss. We also change ruleset, like written in 
> Sebastian Han's blog:
> 
> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>  
> 
> 
> And it was also with no harm to data. But we dont use tiering, maybe some 
> things
> happens with data while removing cache tier, like not all objects was written 
> back to
> lower lier pool?
> 
> 
> Megov Igor
> CIO, Yuterra
> 
> 
> From: Константин Сахинов mailto:sakhi...@gmail.com>>
> Sent: 7 August 2015, 15:39
> To: Межов Игорь Александрович; ceph-users@lists.ceph.com 
> 
> Subject: Re: [ceph-users] inconsistent pgs
>  
> It's hard to say now. I changed my 6 OSDs one-by-one from btrfs to xfs. 
> During the repair process I added 2 more OSDs. Changed the crush map from a 
> root-host-osd to a root-chassis-host-osd structure... There was SSD cache 
> tiering set up when the first inconsistency showed up. Then I removed tiering to 
> confirm that it was not the reason for the inconsistencies.
> Once there was hardware problem with one node - PCI slot issue. I shut down 
> that node and exchanged motherboard to the same model.
> I'm running CentOS Linux release 7.1.1503 (Core) with 
> 3.10.0-229.7.2.el7.x86_64 kernel.
> 
> Fri, 7 Aug 2015 at 15:18, Межов Игорь Александрович  >:
> Hi!
> 
> When did the inconsistent PGs start to appear? Maybe after some event?
> A hang, a node reboot, or after reconfiguration or changing parameters?
> Can you say what triggers such behaviour? And, BTW, what system/kernel
> do you use?
> 
> Megov Igor
> CIO, Yuterra
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crashes when starting

2015-08-07 Thread Gerd Jakobovitsch

Dear all,

I got an unrecoverable crash at one specific OSD every time I try to 
restart it. It happened first on firefly 0.80.8; I updated to 0.80.10, 
but it continued to happen.


Due to this failure, I have several PGs down+peering that won't recover 
even after marking the OSD out.


Could someone help me? Is it possible to edit/rebuild the leveldb-based 
log that seems to be causing the problem?
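
(What I have in mind is something along the lines of pointing leveldb's
RepairDB at the OSD's omap directory -- a rough sketch of the idea, assuming
the py-leveldb module is available and that the store lives under
current/omap; I have not tried it and don't know whether it is safe:

  service ceph stop osd.31
  python -c 'import leveldb; leveldb.RepairDB("/var/lib/ceph/osd/ceph-31/current/omap")'
)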


Here is what the logfile informs me:

[(12:54:45) root@spcsnp2 ~]# service ceph start osd.31
=== osd.31 ===
create-or-move updated item name 'osd.31' weight 2.73 at location 
{host=spcsnp2,root=default} to crush map

Starting Ceph osd.31 on spcsnp2...
starting osd.31 at :/0 osd_data /var/lib/ceph/osd/ceph-31 
/var/lib/ceph/osd/ceph-31/journal
2015-08-07 12:55:12.916880 7fd614c8f780  0 ceph version 0.80.10 
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 23260
[(12:55:12) root@spcsnp2 ~]# 2015-08-07 12:55:12.928614 7fd614c8f780  0 
filestore(/var/lib/ceph/osd/ceph-31) mount detected xfs (libxfs)
2015-08-07 12:55:12.928622 7fd614c8f780  1 
filestore(/var/lib/ceph/osd/ceph-31)  disabling 'filestore replica 
fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-08-07 12:55:12.931410 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
FIEMAP ioctl is supported and appears to work
2015-08-07 12:55:12.931419 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-08-07 12:55:12.939290 7fd614c8f780  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_features: 
syscall(SYS_syncfs, fd) fully supported
2015-08-07 12:55:12.939326 7fd614c8f780  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-31) detect_feature: extsize 
is disabled by conf

2015-08-07 12:55:45.587019 7fd614c8f780 -1 *** Caught signal (Aborted) **
 in thread 7fd614c8f780

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf030) [0x7fd6141ce030]
 3: (gsignal()+0x35) [0x7fd612d41475]
 4: (abort()+0x180) [0x7fd612d446f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fd61359689d]
 6: (()+0x63996) [0x7fd613594996]
 7: (()+0x639c3) [0x7fd6135949c3]
 8: (()+0x63bee) [0x7fd613594bee]
 9: (tc_new()+0x48e) [0x7fd614414aee]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, 
std::allocator const&)+0x59) [0x7fd6135f0999]
 11: (std::string::_Rep::_M_clone(std::allocator const&, unsigned 
long)+0x28) [0x7fd6135f1708]

 12: (std::string::reserve(unsigned long)+0x30) [0x7fd6135f17f0]
 13: (std::string::append(char const*, unsigned long)+0xb5) 
[0x7fd6135f1ab5]
 14: (leveldb::log::Reader::ReadRecord(leveldb::Slice*, 
std::string*)+0x2a2) [0x7fd614670fa2]
 15: (leveldb::DBImpl::RecoverLogFile(unsigned long, 
leveldb::VersionEdit*, unsigned long*)+0x180) [0x7fd614669360]
 16: (leveldb::DBImpl::Recover(leveldb::VersionEdit*)+0x5c2) 
[0x7fd61466bdf2]
 17: (leveldb::DB::Open(leveldb::Options const&, std::string const&, 
leveldb::DB**)+0xff) [0x7fd61466c11f]

 18: (LevelDBStore::do_open(std::ostream&, bool)+0xd8) [0xa123a8]
 19: (FileStore::mount()+0x18e0) [0x9b7080]
 20: (OSD::do_convertfs(ObjectStore*)+0x1a) [0x78f52a]
 21: (main()+0x2234) [0x7331c4]
 22: (__libc_start_main()+0xfd) [0x7fd612d2dead]
 23: /usr/bin/ceph-osd() [0x736e99]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- begin dump of recent events ---
   -56> 2015-08-07 12:55:12.915675 7fd614c8f780  5 asok(0x1a20230) 
register_command perfcounters_dump hook 0x1a10010
   -55> 2015-08-07 12:55:12.915697 7fd614c8f780  5 asok(0x1a20230) 
register_command 1 hook 0x1a10010
   -54> 2015-08-07 12:55:12.915700 7fd614c8f780  5 asok(0x1a20230) 
register_command perf dump hook 0x1a10010
   -53> 2015-08-07 12:55:12.915704 7fd614c8f780  5 asok(0x1a20230) 
register_command perfcounters_schema hook 0x1a10010
   -52> 2015-08-07 12:55:12.915706 7fd614c8f780  5 asok(0x1a20230) 
register_command 2 hook 0x1a10010
   -51> 2015-08-07 12:55:12.915709 7fd614c8f780  5 asok(0x1a20230) 
register_command perf schema hook 0x1a10010
   -50> 2015-08-07 12:55:12.915711 7fd614c8f780  5 asok(0x1a20230) 
register_command config show hook 0x1a10010
   -49> 2015-08-07 12:55:12.915714 7fd614c8f780  5 asok(0x1a20230) 
register_command config set hook 0x1a10010
   -48> 2015-08-07 12:55:12.915716 7fd614c8f780  5 asok(0x1a20230) 
register_command config get hook 0x1a10010
   -47> 2015-08-07 12:55:12.915718 7fd614c8f780  5 asok(0x1a20230) 
register_command log flush hook 0x1a10010
   -46> 2015-08-07 12:55:12.915721 7fd614c8f780  5 asok(0x1a20230) 
register_command log dump hook 0x1a10010
   -45> 2015-08-07 12:55:12.915723 7fd614c8f780  5 asok(0x1a20230) 
register_command log reopen hook 0x1a10010
   -44> 2015-08-07 12:55:12.916880 7fd614c8f780  0 ceph version 0.80.10 
(ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ceph-osd, pid 23260
   -43> 2015-08-07 12:55:12.91815

Re: [ceph-users] НА: inconsistent pgs

2015-08-07 Thread Константин Сахинов
I removed the btrfs OSD as written in the docs, reformatted it to xfs, and then
added it as a new OSD.
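
For the record, the sequence was roughly the standard one from the
documentation (the OSD id and device/host names below are just examples):

  # drain and remove the old btrfs OSD (say osd.3)
  ceph osd out 3
  service ceph stop osd.3
  ceph osd crush remove osd.3
  ceph auth del osd.3
  ceph osd rm 3
  # then reformat the disk as xfs and bring it back as a brand new OSD
  ceph-deploy osd create node1:/dev/sdb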

Fri, 7 Aug 2015 at 19:00, Jan Schermer :

> Did you copy the OSD objects between btrfs->xfs or did you remove the
> btrfs OSD and add a new XFS OSD?
>
> Jan
>
> On 07 Aug 2015, at 17:06, Межов Игорь Александрович 
> wrote:
>
> Hi!
>
> I'm sorry, but I don't know how to help you. We moved OSDs from XFS to EXT4
> on our test
> cluster (Hammer 0.94.2), removing OSDs one-by-one and re-adding them after
> reformatting
> to EXT4. This process is standard for Ceph (Add/Remove OSDs in the documentation)
> and took
> place without any data loss. We also changed the ruleset, as written in
> Sébastien Han's blog:
>
>
> http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>
> And it was also with no harm to data. But we don't use tiering; maybe something
> happened
> to the data while removing the cache tier, like not all objects being
> written back to
> the lower tier pool?
>
>
> Megov Igor
> CIO, Yuterra
>
>
> --
> *From:* Константин Сахинов 
> *Sent:* 7 August 2015, 15:39
> *To:* Межов Игорь Александрович; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] inconsistent pgs
>
> It's hard to say now. I changed my 6 OSDs one-by-one from btrfs to xfs.
> During the repair process I added 2 more OSDs. Changed the crush map from a
> root-host-osd to a root-*chassis*-host-osd structure... There was SSD cache
> tiering set up when the first inconsistency showed up. Then I removed tiering to
> confirm that it was not the reason for the inconsistencies.
> Once there was hardware problem with one node - PCI slot issue. I shut
> down that node and exchanged motherboard to the same model.
> I'm running CentOS Linux release 7.1.1503 (Core) with
> 3.10.0-229.7.2.el7.x86_64 kernel.
>
> Fri, 7 Aug 2015 at 15:18, Межов Игорь Александрович :
>
>> Hi!
>>
>> When did the inconsistent PGs start to appear? Maybe after some event?
>> A hang, a node reboot, or after reconfiguration or changing parameters?
>> Can you say what triggers such behaviour? And, BTW, what system/kernel
>> do you use?
>>
>> Megov Igor
>> CIO, Yuterra
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Tuomas Juntunen
Hi

 

We are experiencing an annoying problem where scrubs make OSDs flap down
and cause the Ceph cluster to be unusable for a couple of minutes.

 

Our cluster consists of three nodes connected with 40gbit InfiniBand using
IPoIB, with 2x 6-core X5670 CPUs and 64GB of memory.

Each node has 6 SSDs for journals for 12 OSDs on 2TB disks (fast pools) and
another 12 OSDs on 4TB disks (archive pools), which have the journal on the same
disk. 

 

It seems that our cluster is constantly scrubbing; we rarely see only
active+clean. Below is the status at the moment.

 

cluster a2974742-3805-4cd3-bc79-765f2bddaefe

 health HEALTH_OK

 monmap e16: 4 mons at
{lb1=10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0}

election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2

 mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby

 osdmap e104824: 72 osds: 72 up, 72 in

  pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects

59067 GB used, 138 TB / 196 TB avail

5241 active+clean

   7 active+clean+scrubbing

 

When OSDs go down, first the load on a node goes high during scrubbing, and
after that some OSDs go down and, 30 secs later, they are back up. They are not
really going down, but are marked as down. Then it takes around a couple of
minutes for everything to be OK again.

 

Any suggestions on how to fix this? We can't go to production while this
behavior exists.

 

Our config is below:

 

[global]

fsid = a2974742-3805-4cd3-bc79-765f2bddaefe

mon_initial_members = lb1,lb2,nc1,nc2

mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

 

osd pool default pg num = 128

osd pool default pgp num = 128

 

public network = 10.20.0.0/16

 

osd_op_threads = 12

osd_op_num_threads_per_shard = 2

osd_op_num_shards = 6

#osd_op_num_sharded_pool_threads = 25

filestore_op_threads = 12

ms_nocrc = true

filestore_fd_cache_size = 64

filestore_fd_cache_shards = 32

ms_dispatch_throttle_bytes = 0

throttler_perf_counter = false

 

mon osd min down reporters = 25

 

[osd]

osd scrub max interval = 1209600

osd scrub min interval = 604800

osd scrub load threshold = 3.0

osd max backfills = 1

osd recovery max active = 1

# IO Scheduler settings

osd scrub sleep = 1.0

osd disk thread ioprio class = idle

osd disk thread ioprio priority = 7

osd scrub chunk max = 1

osd scrub chunk min = 1

osd deep scrub stride = 1048576

filestore queue max ops = 1

filestore max sync interval = 30

filestore min sync interval = 29

 

osd deep scrub interval = 2592000

osd heartbeat grace = 240

osd heartbeat interval = 12

osd mon report interval max = 120

osd mon report interval min = 5

 

   osd_client_message_size_cap = 0

osd_client_message_cap = 0

osd_enable_op_tracker = false

 

osd crush update on start = false

 

[client]

rbd cache = true

rbd cache size = 67108864 # 64mb

rbd cache max dirty = 50331648 # 48mb

rbd cache target dirty = 33554432 # 32mb

rbd cache writethrough until flush = true # It's by default

rbd cache max dirty age = 2

admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

 

 

Br,

Tuomas

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Quentin Hartman
That kind of behavior is usually caused by the OSDs getting busy enough
that they aren't answering heartbeats in a timely fashion. It can also
happen if you have any network flakiness and heartbeats are getting lost
because of that.

I think (I'm not positive though) that increasing your heartbeat interval
may help. Also, looking at the number of threads you have for your OSDs,
that seems potentially problematic. If you've got 24 OSDs per machine and
each one is running 12 threads, that's 288 threads on 12 cores for just the
requests. Plus the disk threads, plus the filestore op threads... That
level of thread contention seems like it might be contributing to missing
the heartbeats. But again, that's conjecture. I've not worked with a setup
as dense as yours.
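
If you want to experiment without restarting daemons, something like this
should let you change the grace period at runtime (a sketch; pick values
that suit your cluster, and persist them in ceph.conf if they help):

  # not persisted across OSD restarts
  ceph tell osd.* injectargs '--osd_heartbeat_grace 300'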

QH

On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen <
tuomas.juntu...@databasement.fi> wrote:

> Hi
>
>
>
> We are experiencing an annoying problem where scrubs make OSD’s flap down
> and cause Ceph cluster to be unusable for couple of minutes.
>
>
>
> Our cluster consists of three nodes connected with 40gbit infiniband using
> IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory
>
> Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and
> another 12 OSD’s 4TB disks (Archive pools) which have journal on the same
> disk.
>
>
>
> It seems that our cluster is constantly doing scrubbing, we rarely see
> only active+clean, below is the status at the moment.
>
>
>
> cluster a2974742-3805-4cd3-bc79-765f2bddaefe
>
>  health HEALTH_OK
>
>  monmap e16: 4 mons at {lb1=
> 10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0
> }
>
> election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2
>
>  mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby
>
>  osdmap e104824: 72 osds: 72 up, 72 in
>
>   pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects
>
> 59067 GB used, 138 TB / 196 TB avail
>
> 5241 active+clean
>
>7 active+clean+scrubbing
>
>
>
> When OSD’s go down, first the load on a node goes high during scrubbing
> and after that some OSD’s go down and 30 secs, they are back up. They are
> not really going down, but are marked as down. Then it takes around couple
> of minutes for everything be OK again.
>
>
>
> Any suggestion how to fix this? We can’t go to production while this
> behavior exists.
>
>
>
> Our config is below:
>
>
>
> [global]
>
> fsid = a2974742-3805-4cd3-bc79-765f2bddaefe
>
> mon_initial_members = lb1,lb2,nc1,nc2
>
> mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
> filestore_xattr_use_omap = true
>
>
>
> osd pool default pg num = 128
>
> osd pool default pgp num = 128
>
>
>
> public network = 10.20.0.0/16
>
>
>
> osd_op_threads = 12
>
> osd_op_num_threads_per_shard = 2
>
> osd_op_num_shards = 6
>
> #osd_op_num_sharded_pool_threads = 25
>
> filestore_op_threads = 12
>
> ms_nocrc = true
>
> filestore_fd_cache_size = 64
>
> filestore_fd_cache_shards = 32
>
> ms_dispatch_throttle_bytes = 0
>
> throttler_perf_counter = false
>
>
>
> mon osd min down reporters = 25
>
>
>
> [osd]
>
> osd scrub max interval = 1209600
>
> osd scrub min interval = 604800
>
> osd scrub load threshold = 3.0
>
> osd max backfills = 1
>
> osd recovery max active = 1
>
> # IO Scheduler settings
>
> osd scrub sleep = 1.0
>
> osd disk thread ioprio class = idle
>
> osd disk thread ioprio priority = 7
>
> osd scrub chunk max = 1
>
> osd scrub chunk min = 1
>
> osd deep scrub stride = 1048576
>
> filestore queue max ops = 1
>
> filestore max sync interval = 30
>
> filestore min sync interval = 29
>
>
>
> osd deep scrub interval = 2592000
>
> osd heartbeat grace = 240
>
> osd heartbeat interval = 12
>
> osd mon report interval max = 120
>
> osd mon report interval min = 5
>
>
>
>osd_client_message_size_cap = 0
>
> osd_client_message_cap = 0
>
> osd_enable_op_tracker = false
>
>
>
> osd crush update on start = false
>
>
>
> [client]
>
> rbd cache = true
>
> rbd cache size = 67108864 # 64mb
>
> rbd cache max dirty = 50331648 # 48mb
>
> rbd cache target dirty = 33554432 # 32mb
>
> rbd cache writethrough until flush = true # It's by default
>
> rbd cache max dirty age = 2
>
> admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
>
>
>
>
>
> Br,
>
> Tuomas
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Tuomas Juntunen
Thanks

 

We play with the values a bit and see what happens.

 

Br,

Tuomas

 

 

From: Quentin Hartman [mailto:qhart...@direwolfdigital.com] 
Sent: 7 August 2015, 20:32
To: Tuomas Juntunen
Cc: ceph-users
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

 

That kind of behavior is usually caused by the OSDs getting busy enough that 
they aren't answering heartbeats in a timely fashion. It can also happen if you 
have any network flakiness and heartbeats are getting lost because of that.

 

I think (I'm not positive though) that increasing your heartbeat interval may 
help. Also, looking at the number of threads you have for your OSDs, that seems 
potentially problematic. If you've got 24 OSDs per machine and each one is 
running 12 threads, that's 288 threads on 12 cores for just the requests. Plus 
the disk threads, plus the filestore op threads... That level of thread 
contention seems like it might be contributing to missing the heartbeats. But 
again, that's conjecture. I've not worked with a setup as dense as yours.

 

QH

 

On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen 
 wrote:

Hi

 

We are experiencing an annoying problem where scrubs make OSD’s flap down and 
cause Ceph cluster to be unusable for couple of minutes.

 

Our cluster consists of three nodes connected with 40gbit infiniband using 
IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory

Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and 
another 12 OSD’s 4TB disks (Archive pools) which have journal on the same disk. 

 

It seems that our cluster is constantly doing scrubbing, we rarely see only 
active+clean, below is the status at the moment.

 

cluster a2974742-3805-4cd3-bc79-765f2bddaefe

 health HEALTH_OK

 monmap e16: 4 mons at 
{lb1=10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0}

election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2

 mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby

 osdmap e104824: 72 osds: 72 up, 72 in

  pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects

59067 GB used, 138 TB / 196 TB avail

5241 active+clean

   7 active+clean+scrubbing

 

When OSD’s go down, first the load on a node goes high during scrubbing and 
after that some OSD’s go down and 30 secs, they are back up. They are not 
really going down, but are marked as down. Then it takes around couple of 
minutes for everything be OK again.

 

Any suggestion how to fix this? We can’t go to production while this behavior 
exists.

 

Our config is below:

 

[global]

fsid = a2974742-3805-4cd3-bc79-765f2bddaefe

mon_initial_members = lb1,lb2,nc1,nc2

mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

 

osd pool default pg num = 128

osd pool default pgp num = 128

 

public network = 10.20.0.0/16

 

osd_op_threads = 12

osd_op_num_threads_per_shard = 2

osd_op_num_shards = 6

#osd_op_num_sharded_pool_threads = 25

filestore_op_threads = 12

ms_nocrc = true

filestore_fd_cache_size = 64

filestore_fd_cache_shards = 32

ms_dispatch_throttle_bytes = 0

throttler_perf_counter = false

 

mon osd min down reporters = 25

 

[osd]

osd scrub max interval = 1209600

osd scrub min interval = 604800

osd scrub load threshold = 3.0

osd max backfills = 1

osd recovery max active = 1

# IO Scheduler settings

osd scrub sleep = 1.0

osd disk thread ioprio class = idle

osd disk thread ioprio priority = 7

osd scrub chunk max = 1

osd scrub chunk min = 1

osd deep scrub stride = 1048576

filestore queue max ops = 1

filestore max sync interval = 30

filestore min sync interval = 29

 

osd deep scrub interval = 2592000

osd heartbeat grace = 240

osd heartbeat interval = 12

osd mon report interval max = 120

osd mon report interval min = 5

 

   osd_client_message_size_cap = 0

osd_client_message_cap = 0

osd_enable_op_tracker = false

 

osd crush update on start = false

 

[client]

rbd cache = true

rbd cache size = 67108864 # 64mb

rbd cache max dirty = 50331648 # 48mb

rbd cache target dirty = 33554432 # 32mb

rbd cache writethrough until flush = true # It's by default

rbd cache max dirty age = 2

admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok

 

 

Br,

Tuomas


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Константин Сахинов
Hi!

One time I faced such behavior on my home cluster. At the time my OSDs went
down I noticed that the node was using swap despite having sufficient memory. Tuning
/proc/sys/vm/swappiness to 0 helped solve the problem.
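
Concretely that was just the following (plus making it persistent):

  sysctl -w vm.swappiness=0
  echo 'vm.swappiness = 0' >> /etc/sysctl.conf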

Fri, 7 Aug 2015 at 20:41, Tuomas Juntunen :

> Thanks
>
>
>
> We play with the values a bit and see what happens.
>
>
>
> Br,
>
> Tuomas
>
>
>
>
>
> *From:* Quentin Hartman [mailto:qhart...@direwolfdigital.com]
> *Sent:* 7 August 2015, 20:32
> *To:* Tuomas Juntunen
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Flapping OSD's when scrubbing
>
>
>
> That kind of behavior is usually caused by the OSDs getting busy enough
> that they aren't answering heartbeats in a timely fashion. It can also
> happen if you have any network flakiness and heartbeats are getting lost
> because of that.
>
>
>
> I think (I'm not positive though) that increasing your heartbeat interval
> may help. Also, looking at the number of threads you have for your OSDs,
> that seems potentially problematic. If you've got 24 OSDs per machine and
> each one is running 12 threads, that's 288 threads on 12 cores for just the
> requests. Plus the disk threads, plus the filestore op threads... That
> level of thread contention seems like it might be contributing to missing
> the heartbeats. But again, that's conjecture. I've not worked with a setup
> as dense as yours.
>
>
>
> QH
>
>
>
> On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen <
> tuomas.juntu...@databasement.fi> wrote:
>
> Hi
>
>
>
> We are experiencing an annoying problem where scrubs make OSD’s flap down
> and cause Ceph cluster to be unusable for couple of minutes.
>
>
>
> Our cluster consists of three nodes connected with 40gbit infiniband using
> IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory
>
> Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and
> another 12 OSD’s 4TB disks (Archive pools) which have journal on the same
> disk.
>
>
>
> It seems that our cluster is constantly doing scrubbing, we rarely see
> only active+clean, below is the status at the moment.
>
>
>
> cluster a2974742-3805-4cd3-bc79-765f2bddaefe
>
>  health HEALTH_OK
>
>  monmap e16: 4 mons at {lb1=
> 10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0
> }
>
> election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2
>
>  mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby
>
>  osdmap e104824: 72 osds: 72 up, 72 in
>
>   pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects
>
> 59067 GB used, 138 TB / 196 TB avail
>
> 5241 active+clean
>
>7 active+clean+scrubbing
>
>
>
> When OSD’s go down, first the load on a node goes high during scrubbing
> and after that some OSD’s go down and 30 secs, they are back up. They are
> not really going down, but are marked as down. Then it takes around couple
> of minutes for everything be OK again.
>
>
>
> Any suggestion how to fix this? We can’t go to production while this
> behavior exists.
>
>
>
> Our config is below:
>
>
>
> [global]
>
> fsid = a2974742-3805-4cd3-bc79-765f2bddaefe
>
> mon_initial_members = lb1,lb2,nc1,nc2
>
> mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3
>
> auth_cluster_required = cephx
>
> auth_service_required = cephx
>
> auth_client_required = cephx
>
> filestore_xattr_use_omap = true
>
>
>
> osd pool default pg num = 128
>
> osd pool default pgp num = 128
>
>
>
> public network = 10.20.0.0/16
>
>
>
> osd_op_threads = 12
>
> osd_op_num_threads_per_shard = 2
>
> osd_op_num_shards = 6
>
> #osd_op_num_sharded_pool_threads = 25
>
> filestore_op_threads = 12
>
> ms_nocrc = true
>
> filestore_fd_cache_size = 64
>
> filestore_fd_cache_shards = 32
>
> ms_dispatch_throttle_bytes = 0
>
> throttler_perf_counter = false
>
>
>
> mon osd min down reporters = 25
>
>
>
> [osd]
>
> osd scrub max interval = 1209600
>
> osd scrub min interval = 604800
>
> osd scrub load threshold = 3.0
>
> osd max backfills = 1
>
> osd recovery max active = 1
>
> # IO Scheduler settings
>
> osd scrub sleep = 1.0
>
> osd disk thread ioprio class = idle
>
> osd disk thread ioprio priority = 7
>
> osd scrub chunk max = 1
>
> osd scrub chunk min = 1
>
> osd deep scrub stride = 1048576
>
> filestore queue max ops = 1
>
> filestore max sync interval = 30
>
> filestore min sync interval = 29
>
>
>
> osd deep scrub interval = 2592000
>
> osd heartbeat grace = 240
>
> osd heartbeat interval = 12
>
> osd mon report interval max = 120
>
> osd mon report interval min = 5
>
>
>
>osd_client_message_size_cap = 0
>
> osd_client_message_cap = 0
>
> osd_enable_op_tracker = false
>
>
>
> osd crush update on start = false
>
>
>
> [client]
>
> rbd cache = true
>
> rbd cache size = 67108864 # 64mb
>
> rbd cache max dirty = 50331648 # 48mb
>
> rbd cache target

Re: [ceph-users] Different filesystems on OSD hosts at the same cluster

2015-08-07 Thread Udo Lembke
Hi Jan,
thanks for the hint.

I changed the mount option from noatime to relatime and will remount all
OSDs during the weekend.
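
Per OSD that should just be something along these lines (a sketch, assuming
the default mountpoint layout, with ceph.conf updated so future mounts pick
it up too):

  mount -o remount,relatime /var/lib/ceph/osd/ceph-12
  # in ceph.conf:
  #   osd mount options ext4 = "user_xattr,rw,relatime"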

Udo

On 07.08.2015 16:37, Jan Schermer wrote:
> ext4 does support external journal, and it is _FAST_
>
> btw I'm not sure noatime is the right option nowadays for two reasons
> 1) the default is "relatime" which has minimal impact on performance
> 2) AFAIK some ceph features actually use atime (cache tiering was it?) or at 
> least so I gathered from some bugs I saw
>
> Jan
>
>> On 07 Aug 2015, at 16:30, Udo Lembke  wrote:
>>
>> Hi,
>> I use the ext4-parameters like Christian Balzer wrote in one posting:
>> osd mount options ext4 = "user_xattr,rw,noatime,nodiratime"
>> osd_mkfs_options_ext4 = -J size=1024 -E 
>> lazy_itable_init=0,lazy_journal_init=0
>>
>> The OSD journals are on SSD partitions (without a filesystem). IMHO ext4 doesn't 
>> support a different journal device, like
>> xfs does, but I assume you mean the OSD journal and not the filesystem journal?!
>>
>> Udo
>>
>> Am 07.08.2015 16:13, schrieb Burkhard Linke:
>>> Hi,
>>>
>>>
>>> On 08/07/2015 04:04 PM, Udo Lembke wrote:
 Hi,
 some time ago I switched all OSDs from XFS to ext4 (step by step).
 I had no issues during mixed osd-format (the process takes some weeks).

 And yes, for me ext4 performs also better (esp. the latencies).
>>> Just out of curiosity:
>>>
>>> Do you use an ext4 setup as described in the documentation? Did you try to 
>>> use external ext4 journals on SSD?
>>>
>>> Regards,
>>> Burkhard
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Ben Hines
Howdy,

The Ceph docs still say btrfs is 'experimental' in one section, but
say it's the long term ideal for ceph in the later section. Is this
still accurate with Hammer? Is it mature enough on centos 7.1 for
production use?

(kernel is  3.10.0-229.7.2.el7.x86_64 )

thanks-

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Константин Сахинов
I've tested it on my home cluster: 8 OSDs (4 nodes with 2x4TB OSDs each, with
Celeron J1900 and 8GB RAM) + 4 cache tier OSDs (2 nodes with 2x250GB SSD OSDs
each, with Atom D2500 and 4GB RAM).
The HDD OSDs worked v-e-r-y slowly. And the SSD OSDs sometimes stopped working
because btrfs couldn't rebalance quickly enough and overfilled the SSDs
(100% used space). To bring them back to life, I had to perform a
complicated procedure of freeing some space and rebalancing the btrfs tree.
Maybe real production hardware doesn't have such problems with btrfs, but I
don't think production cluster stability should depend on hardware
performance.
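
(The "rebalancing" step was essentially freeing a little space and then
running something like the following, so btrfs could reclaim nearly-empty
data chunks -- the usage value and path here are just an example from memory:

  btrfs balance start -dusage=5 /var/lib/ceph/osd/ceph-4
)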

Fri, 7 Aug 2015 at 23:05, Ben Hines :

> Howdy,
>
> The Ceph docs still say btrfs is 'experimental' in one section, but
> say it's the long term ideal for ceph in the later section. Is this
> still accurate with Hammer? Is it mature enough on centos 7.1 for
> production use?
>
> (kernel is  3.10.0-229.7.2.el7.x86_64 )
>
> thanks-
>
> -Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Quentin Hartman
I would say probably not. btrfs (or, "worse FS" as we call it around my
office) still does weird stuff from time to time, especially in low-memory
conditions. This is based on testing we did on Ubuntu 14.04, running kernel
3.16.something.

I long for the day that btrfs realizes its promise, but I do not think
that day is here.

QH

On Fri, Aug 7, 2015 at 2:05 PM, Ben Hines  wrote:

> Howdy,
>
> The Ceph docs still say btrfs is 'experimental' in one section, but
> say it's the long term ideal for ceph in the later section. Is this
> still accurate with Hammer? Is it mature enough on centos 7.1 for
> production use?
>
> (kernel is  3.10.0-229.7.2.el7.x86_64 )
>
> thanks-
>
> -Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Jan Schermer
The answer to this, as well as life, universe and everything, is simple:
ZFS.

:)

> On 07 Aug 2015, at 22:24, Quentin Hartman  
> wrote:
> 
> I would say probably not. btrfs (or, "worse FS" as we call it around my 
> office) still does weird stuff from time to time, especially in low-memory 
> conditions. This is based on testing we did on Ubuntu 14.04, running kernel 
> 3.16.something.
> 
> I long for the day that btrfs realizes its promise, but I do not think that 
> day is here.
> 
> QH
> 
> On Fri, Aug 7, 2015 at 2:05 PM, Ben Hines  > wrote:
> Howdy,
> 
> The Ceph docs still say btrfs is 'experimental' in one section, but
> say it's the long term ideal for ceph in the later section. Is this
> still accurate with Hammer? Is it mature enough on centos 7.1 for
> production use?
> 
> (kernel is  3.10.0-229.7.2.el7.x86_64 )
> 
> thanks-
> 
> -Ben
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Tuomas Juntunen
Hi

 

Thanks, we were able to resolve the problem by disabling swap completely, no 
need for it anyway.

 

Also memory was fragmenting since all memory was used for caching

 

Running “perf top”, we saw that freeing blocks of memory took all the CPU power:

 

Samples: 4M of event 'cycles', Event count (approx.): 965471653281
 71.95%  [kernel] [k] isolate_freepages_block
 10.37%  [kernel] [k] __reset_isolation_suitable

 

Now by forcing systems to have 10GB of free memory all the time, the problem 
was solved.

 

We added 

 

vm.min_free_kbytes = 1000

 

to /etc/sysctl.conf
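
The swap side of the change was simply (assuming swap is also listed in
/etc/fstab):

  swapoff -a   # turn swap off right away
  # comment out the swap line(s) in /etc/fstab so it stays off after reboot
  sysctl -p    # reload /etc/sysctl.conf to pick up vm.min_free_kbytes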

 

Don’t know why this happens; is this a “problem” of the kernel version we are 
running (Ubuntu 14.04, 3.13.0-32-generic) or something else?

 

Br,

Tuomas

 

From: Константин Сахинов [mailto:sakhi...@gmail.com] 
Sent: 7 August 2015, 21:15
To: Tuomas Juntunen; Quentin Hartman
Cc: ceph-users
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

 

Hi!

 

One time I faced such behavior on my home cluster. At the time my OSDs went 
down I noticed that the node was using swap despite having sufficient memory. Tuning 
/proc/sys/vm/swappiness to 0 helped solve the problem.

 

Fri, 7 Aug 2015 at 20:41, Tuomas Juntunen :

Thanks

 

We play with the values a bit and see what happens.

 

Br,

Tuomas

 

 

From: Quentin Hartman [mailto:qhart...@direwolfdigital.com] 
Sent: 7 August 2015, 20:32
To: Tuomas Juntunen
Cc: ceph-users
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

 

That kind of behavior is usually caused by the OSDs getting busy enough that 
they aren't answering heartbeats in a timely fashion. It can also happen if you 
have any network flakiness and heartbeats are getting lost because of that.

 

I think (I'm not positive though) that increasing your heartbeat interval may 
help. Also, looking at the number of threads you have for your OSDs, that seems 
potentially problematic. If you've got 24 OSDs per machine and each one is 
running 12 threads, that's 288 threads on 12 cores for just the requests. Plus 
the disk threads, plus the filestore op threads... That level of thread 
contention seems like it might be contributing to missing the heartbeats. But 
again, that's conjecture. I've not worked with a setup as dense as yours.

 

QH

 

On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen 
 wrote:

Hi

 

We are experiencing an annoying problem where scrubs make OSD’s flap down and 
cause Ceph cluster to be unusable for couple of minutes.

 

Our cluster consists of three nodes connected with 40gbit infiniband using 
IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory

Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and 
another 12 OSD’s 4TB disks (Archive pools) which have journal on the same disk. 

 

It seems that our cluster is constantly doing scrubbing, we rarely see only 
active+clean, below is the status at the moment.

 

cluster a2974742-3805-4cd3-bc79-765f2bddaefe

 health HEALTH_OK

 monmap e16: 4 mons at 
{lb1=10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0}

election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2

 mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby

 osdmap e104824: 72 osds: 72 up, 72 in

  pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects

59067 GB used, 138 TB / 196 TB avail

5241 active+clean

   7 active+clean+scrubbing

 

When OSD’s go down, first the load on a node goes high during scrubbing and 
after that some OSD’s go down and 30 secs, they are back up. They are not 
really going down, but are marked as down. Then it takes around couple of 
minutes for everything be OK again.

 

Any suggestion how to fix this? We can’t go to production while this behavior 
exists.

 

Our config is below:

 

[global]

fsid = a2974742-3805-4cd3-bc79-765f2bddaefe

mon_initial_members = lb1,lb2,nc1,nc2

mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

 

osd pool default pg num = 128

osd pool default pgp num = 128

 

public network = 10.20.0.0/16

 

osd_op_threads = 12

osd_op_num_threads_per_shard = 2

osd_op_num_shards = 6

#osd_op_num_sharded_pool_threads = 25

filestore_op_threads = 12

ms_nocrc = true

filestore_fd_cache_size = 64

filestore_fd_cache_shards = 32

ms_dispatch_throttle_bytes = 0

throttler_perf_counter = false

 

mon osd min down reporters = 25

 

[osd]

osd scrub max interval = 1209600

osd scrub min interval = 604800

osd scrub load threshold = 3.0

osd max backfills = 1

osd recovery max active = 1

[ceph-users] optimizing non-ssd journals

2015-08-07 Thread Ben Hines
Our cluster is primarily used for RGW, but we would like to use it for RBD
eventually...

We don't have SSDs for our journals (and won't for a while yet) and we're still
updating our cluster to 10GbE.

I do see some pretty high commit and apply latencies in 'osd perf',
often 100-500 ms, which I figure is a result of the spinning journals.

The cluster consists of ~110 OSDs, 4 per node, on 2TB drives each, JBOD,
xfs, with the associated 5GB journal as a second partition on each of
them:

/dev/sdb :
 /dev/sdb1 ceph data, active, cluster ceph, osd.35, journal /dev/sdb2
 /dev/sdb2 ceph journal, for /dev/sdb1
/dev/sdc :
 /dev/sdc1 ceph data, active, cluster ceph, osd.36, journal /dev/sdc2
 /dev/sdc2 ceph journal, for /dev/sdc1
...

Also they are mounted with:
osd mount options xfs = rw,noatime,inode64

+ 8 experimental btrfs osds, mounted with
osd_mount_options_btrfs = rw,noatime,space_cache,user_subvol_rm_allowed


Considering that SSDs are unlikely in the near term, what can we do to
help commit/apply latency?

- Would increasing the size of the journal partition help? (A rough per-OSD
sketch of what I assume that would involve is below.)

- JBOD vs single-disk RAID0 - the drives are just JBODded now.
Research indicates I may see improvements with single-disk RAID0. Is
this information still current?
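
If a bigger journal does make sense, I assume the per-OSD procedure is
roughly this (a sketch, using osd.35 from above as the example):

  service ceph stop osd.35
  ceph-osd -i 35 --flush-journal      # drain whatever is still in the journal
  # repartition /dev/sdb2 to the new size, bump 'osd journal size' in ceph.conf
  ceph-osd -i 35 --mkjournal          # recreate the journal on the new partition
  service ceph start osd.35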

thanks-

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Lionel Bouton
Le 07/08/2015 22:05, Ben Hines a écrit :
> Howdy,
>
> The Ceph docs still say btrfs is 'experimental' in one section, but
> say it's the long term ideal for ceph in the later section. Is this
> still accurate with Hammer? Is it mature enough on centos 7.1 for
> production use?
>
> (kernel is  3.10.0-229.7.2.el7.x86_64 )

Difficult to say with distribution kernels; they may or may not have patched
their kernels to fix some Btrfs issues (3.10.0 is more than 2 years
old). I wouldn't trust them myself.

We are converting our OSDs to Btrfs, but we use recent kernel versions
(4.0.5 currently). We disabled Btrfs snapshots in ceph.conf (they are
too costly), created the journals NOCOW (we will move them to SSDs
eventually) and developed our own defragmentation scheduler (Btrfs' own
autodefrag didn't perform well with Ceph when we started, and we use the
btrfs defragmentation process to recompress data with zlib instead of
lzo, as we mount the OSD fs with compress=lzo for lower-latency OSD writes).
In the above conditions, it is faster than XFS (~30% lower apply
latencies according to ceph osd perf), detects otherwise silent data
corruption (it has caught some already) and provides ~10% additional storage
space thanks to lzo/zlib compression (most of our data is in the form of
already compressed files stored on RBD; actual gains obviously depend on
the data).
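
For anyone wanting to reproduce the basic setup, the relevant pieces are
roughly the following (a sketch; paths are examples, and the defragmentation
scheduler itself is our own tooling, not shown here):

  # ceph.conf
  filestore btrfs snap = false
  osd mount options btrfs = rw,noatime,compress=lzo

  # create the journal NOCOW while the file is still empty
  touch /var/lib/ceph/osd/ceph-12/journal
  chattr +C /var/lib/ceph/osd/ceph-12/journal

  # recompress a file with zlib as part of defragmentation
  btrfs filesystem defragment -czlib /var/lib/ceph/osd/ceph-12/current/somefile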

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSD's when scrubbing

2015-08-07 Thread Somnath Roy
Yes, if you dig through older mails, you will see I reported that as an Ubuntu 
kernel bug (not sure about other Linux flavors). vm.min_free_kbytes is the 
way to work around that.


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas 
Juntunen
Sent: Friday, August 07, 2015 1:57 PM
To: 'Константин Сахинов'; 'Quentin Hartman'
Cc: 'ceph-users'
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

Hi

Thanks, we were able to resolve the problem by disabling swap completely, no 
need for it anyway.

Also memory was fragmenting since all memory was used for caching

Running “perf top”  we saw that freeing blocks of memory took all the cpu power

Samples: 4M of event 'cycles', Event count (approx.): 965471653281
 71.95%  [kernel] [k] isolate_freepages_block
 10.37%  [kernel] [k] __reset_isolation_suitable

Now by forcing systems to have 10GB of free memory all the time, the problem 
was solved.

We added

vm.min_free_kbytes = 1000

to /etc/sysctl.conf

Don’t know why this happens, is this a “problem” of the kernel version we are 
running or something else. (Ubuntu 14.04 3.13.0-32-generic)

Br,
Tuomas

From: Константин Сахинов [mailto:sakhi...@gmail.com]
Sent: 7 August 2015, 21:15
To: Tuomas Juntunen; Quentin Hartman
Cc: ceph-users
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

Hi!

One time I faced such behavior on my home cluster. At the time my OSDs went 
down I noticed that the node was using swap despite having sufficient memory. Tuning 
/proc/sys/vm/swappiness to 0 helped solve the problem.

Fri, 7 Aug 2015 at 20:41, Tuomas Juntunen 
mailto:tuomas.juntu...@databasement.fi>>:
Thanks

We play with the values a bit and see what happens.

Br,
Tuomas


From: Quentin Hartman 
[mailto:qhart...@direwolfdigital.com]
Sent: 7 August 2015, 20:32
To: Tuomas Juntunen
Cc: ceph-users
Subject: Re: [ceph-users] Flapping OSD's when scrubbing

That kind of behavior is usually caused by the OSDs getting busy enough that 
they aren't answering heartbeats in a timely fashion. It can also happen if you 
have any network flakiness and heartbeats are getting lost because of that.

I think (I'm not positive though) that increasing your heartbeat interval may 
help. Also, looking at the number of threads you have for your OSDs, that seems 
potentially problematic. If you've got 24 OSDs per machine and each one is 
running 12 threads, that's 288 threads on 12 cores for just the requests. Plus 
the disk threads, plus the filestore op threads... That level of thread 
contention seems like it might be contributing to missing the heartbeats. But 
again, that's conjecture. I've not worked with a setup as dense as yours.

QH

On Fri, Aug 7, 2015 at 11:21 AM, Tuomas Juntunen 
mailto:tuomas.juntu...@databasement.fi>> wrote:
Hi

We are experiencing an annoying problem where scrubs make OSD’s flap down and 
cause Ceph cluster to be unusable for couple of minutes.

Our cluster consists of three nodes connected with 40gbit infiniband using 
IPoIB, with 2x 6 core X5670 CPU’s and 64GB of memory
Each node has 6 SSD’s for journals to 12 OSD’s 2TB disks (Fast pools) and 
another 12 OSD’s 4TB disks (Archive pools) which have journal on the same disk.

It seems that our cluster is constantly doing scrubbing, we rarely see only 
active+clean, below is the status at the moment.

cluster a2974742-3805-4cd3-bc79-765f2bddaefe
 health HEALTH_OK
 monmap e16: 4 mons at 
{lb1=10.20.60.1:6789/0,lb2=10.20.60.2:6789/0,nc1=10.20.50.2:6789/0,nc2=10.20.50.3:6789/0}
election epoch 1838, quorum 0,1,2,3 nc1,nc2,lb1,lb2
 mdsmap e7901: 1/1/1 up {0=lb1=up:active}, 4 up:standby
 osdmap e104824: 72 osds: 72 up, 72 in
  pgmap v12941402: 5248 pgs, 9 pools, 19644 GB data, 4810 kobjects
59067 GB used, 138 TB / 196 TB avail
5241 active+clean
   7 active+clean+scrubbing

When OSD’s go down, first the load on a node goes high during scrubbing and 
after that some OSD’s go down and 30 secs, they are back up. They are not 
really going down, but are marked as down. Then it takes around couple of 
minutes for everything be OK again.

Any suggestion how to fix this? We can’t go to production while this behavior 
exists.

Our config is below:

[global]
fsid = a2974742-3805-4cd3-bc79-765f2bddaefe
mon_initial_members = lb1,lb2,nc1,nc2
mon_host = 10.20.60.1,10.20.60.2,10.20.50.2,10.20.50.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

osd pool default pg num = 128
osd pool default pgp num = 128

public network = 10.20.0.0/16

osd_op_threads = 12
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 6
#osd_op_num_sharded_pool_threads = 25
filestore_op_threads = 12
ms_nocr

Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Shinobu Kinjo
Hello,

Ceph is not the problem. The problem is that btrfs is still not production-ready;
there are many testing lines in the source code.

But it's really up to you which filesystem you use.

Each filesystem has unique features, so you have to consider
them to get the best performance from whichever one you choose.

Meaning that there is no perfect filesystem.

 Shinobu


On Sat, Aug 8, 2015 at 6:29 AM, Lionel Bouton 
wrote:

> Le 07/08/2015 22:05, Ben Hines a écrit :
> > Howdy,
> >
> > The Ceph docs still say btrfs is 'experimental' in one section, but
> > say it's the long term ideal for ceph in the later section. Is this
> > still accurate with Hammer? Is it mature enough on centos 7.1 for
> > production use?
> >
> > (kernel is  3.10.0-229.7.2.el7.x86_64 )
>
> Difficult to say with distribution kernels, they may have patched their
> kernels to fix some Btrfs issues or not (3.10.0 is more than 2 years
> old) I wouldn't trust them myself.
>
> We are converting our OSDs to Btrfs but we use recent kernel versions
> (4.0.5 currently), we disabled Btrfs snapshots in ceph.conf (they are
> too costly), created journals NOCOW (we will move them to SSDs
> eventually) and developed our own defragmentation scheduler (Btrfs' own
> autodefrag didn't perform well with Ceph when we started and we use the
> btrfs defragmentation process to recompress data with zlib instead of
> lzo as we mount OSD fs with compress=lzo for lower latency OSD writes).
> In the above conditions, it is faster than XFS (~30% lower apply
> latencies according to ceph osd perf), detects otherwise silent data
> corruption (it caught some already) and provides ~10% additional storage
> space thanks to lzo/zlib compression (most of our data is in the form of
> already compressed files stored on RBD, actual gains obviously depend on
> the data).
>
> Best regards,
>
> Lionel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Email:
 shin...@linux.com
 ski...@redhat.com

 Life w/ Linux 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Ross Annetts

Hi Ben,

 Red Hat (which CentOS is based on) has included btrfs in RHEL7 as a 
Technology Preview. This basically means that they are happy for you to 
use it at your own risk. I have spoken with an engineer there and they 
said they would basically try to support any related issues as best 
they could, but there were no guarantees with their support.


 So seeing as Red Hat is targeted at the enterprise and they are not 
guaranteeing support for it, I would not use it in production unless:


 * You have a reliable backup strategy.
 * You know the Recovery point objective.
 * You know the Mean time to recovery.
 * Do the advantages of using btrfs outweigh the possibility of
   causing you downtime and a lot of work down the track?

The current ceph production recommendation is xfs.

Regards,
Ross

On 8/08/2015 6:05 am, Ben Hines wrote:

Howdy,

The Ceph docs still say btrfs is 'experimental' in one section, but
say it's the long term ideal for ceph in the later section. Is this
still accurate with Hammer? Is it mature enough on centos 7.1 for
production use?

(kernel is  3.10.0-229.7.2.el7.x86_64 )

thanks-

-Ben
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Regards,
Ross Annetts

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com