Re: [ceph-users] thanks for a double check on ceph's config
On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: Hello, > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > each), 64GB memory. No journal SSDs? What CPU(s) and network? > ceph version 10.2.0, Ubuntu 16.04 LTS > The whole cluster is new installed. > > Can you help check what the arguments we put in ceph.conf is reasonable > or not? > thanks. > > [osd] > osd_data = /var/lib/ceph/osd/ceph-$id > osd_journal_size = 2 Overkill most likely, but not an issue. > osd_mkfs_type = xfs > osd_mkfs_options_xfs = -f > filestore_xattr_use_omap = true > filestore_min_sync_interval = 10 Are you aware what this does and have you actually tested this (IOPS AND throughput) with various other setting on your hardware to arrive at this number? > filestore_max_sync_interval = 15 That's fine in and by itself, unlikely to ever be reached anyway. > filestore_queue_max_ops = 25000 > filestore_queue_max_bytes = 10485760 > filestore_queue_committing_max_ops = 5000 > filestore_queue_committing_max_bytes = 1048576 > journal_max_write_bytes = 1073714824 > journal_max_write_entries = 1 > journal_queue_max_ops = 5 > journal_queue_max_bytes = 1048576 Same as above, have you tested these setting (from filestore_queue_max_ops onward) compared to the defaults? With HDDs only I'd expect any benefits to be small and/or things to become very uneven once the HDDs are saturated. > osd_max_write_size = 512 > osd_client_message_size_cap = 2147483648 > osd_deep_scrub_stride = 131072 > osd_op_threads = 8 > osd_disk_threads = 4 > osd_map_cache_size = 1024 > osd_map_cache_bl_size = 128 > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" The nobarrier part is a a potential recipe for disaster unless you have all on-disk caches disabled and every other cache battery backed. The only devices I trust to mount nobarrier are SSDs with powercaps that have been proven to do the right thing (Intel DC S amongst them). > osd_recovery_op_priority = 4 > osd_recovery_max_active = 10 > osd_max_backfills = 4 > That's sane enough. > [client] > rbd_cache = true AFAIK that's the case with recent Ceph versions anyway. > rbd_cache_size = 268435456 Are you sure that you have 256MB per client to waste on RBD cache? If so, bully for you, but you might find that depending on your use case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be more beneficial. > rbd_cache_max_dirty = 134217728 > rbd_cache_max_dirty_age = 5 Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
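For anyone wanting to follow Christian's advice and actually compare those tuned values against the defaults before trusting them, a rough way to do it looks like the following (pool name and OSD id are only placeholders, run the daemon command on the host carrying that OSD):

    # running values on one OSD, via its admin socket
    ceph daemon osd.0 config show | grep -E 'filestore_queue|journal_queue|journal_max_write'
    # compiled-in defaults, ignoring the local ceph.conf
    ceph -c /dev/null --show-config | grep -E 'filestore_queue|journal_queue|journal_max_write'
    # then benchmark each variant against the same pool, e.g.
    rados bench -p rbd 60 write -t 32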
[ceph-users] journal or cache tier on SSDs ?
Hello, I'd like some advices about the setup of a new ceph cluster. Here the use case : RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most of the access will be in read only mode. Write access will only be done by the admin to update the datasets. We might use rbd some time to sync data as temp storage (when POSIX is needed) but performance will not be an issue here. We might use cephfs in the futur if that can replace a filesystem on rdb. We gonna start with 16 nodes (up to 24). The configuration of each node is : CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) Memory : 128GB OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) OSD Disk : 10 x HGST ultrastar-7k6000 6TB Public Network : 1 x 10Gb/s Private Network : 1 x 10Gb/s OS : Ubuntu 16.04 Ceph version : Jewel The question is : journal or cache tier (read only) on the SD 400GB Intel S3300 DC ? Each disk is able to write sequentially at 220MB/s. SSDs can write at ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the bottleneck (1GB/s vs 2GB/s). If we set the journal on OSDs, we can expect a good throughput in read on the disk (in case of data not in the cache) and write shouldn't be so bad too, even if we have random read on the OSD during the write ? SSDs as cache tier seem to be a better usage than only 5 journal on each ? Is that correct ? We gonna use an EC pool for big files (jerasure 8+2 I think) and a replicated pool for small files. If I check on http://ceph.com/pgcalc/, in this use case replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs Ec pool : pg_num = 4096 and pgp_num = pg_num Should I set the pg_num to 8192 or 16384 ? what is the impact on the cluster if we set the pg_num to 16384 at the beginning ? 16384 is high, isn't it ? Thanks for your help -- Yoann Moulin EPFL IC-IT ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
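For reference, the numbers above come from the usual rule of thumb (the pgcalc tool additionally weights each pool by its expected share of the data, so treat this as the generic formula only):

    pg_num ≈ (num_OSDs × target_PGs_per_OSD) / pool_size, rounded up to a power of two
    replicated (size 3), 240 OSDs, 100 PGs/OSD : 240 × 100 / 3  = 8000 -> 8192
    EC 8+2 (size 10),    240 OSDs, 100 PGs/OSD : 240 × 100 / 10 = 2400 -> 4096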
Re: [ceph-users] thanks for a double check on ceph's config
Hello Chris, We don't use SSD as journal. each host has one intel E5-2620 CPU which is 6 cores. the networking (both cluster and data networks) is 10Gbps. My further questions include, (1) osd_mkfs_type = xfs osd_mkfs_options_xfs = -f filestore_xattr_use_omap = true for XFS filesystem, we should not enable filestore_xattr_use_omap = true, is it? (2) filestore_queue_max_ops = 25000 filestore_queue_max_bytes = 10485760 filestore_queue_committing_max_ops = 5000 filestore_queue_committing_max_bytes = 1048576 journal_max_write_bytes = 1073714824 journal_max_write_entries = 1 journal_queue_max_ops = 5 journal_queue_max_bytes = 1048576 Since we don't have SSD as journals, all these setup are too large? what are the better values? (3) osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" What's your suggested options here? Thanks a lot. 2016-05-10 15:31 GMT+08:00 Christian Balzer : > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > Hello, > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > > each), 64GB memory. > No journal SSDs? > What CPU(s) and network? > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > The whole cluster is new installed. > > > > Can you help check what the arguments we put in ceph.conf is reasonable > > or not? > > thanks. > > > > [osd] > > osd_data = /var/lib/ceph/osd/ceph-$id > > osd_journal_size = 2 > Overkill most likely, but not an issue. > > > osd_mkfs_type = xfs > > osd_mkfs_options_xfs = -f > > filestore_xattr_use_omap = true > > filestore_min_sync_interval = 10 > Are you aware what this does and have you actually tested this (IOPS AND > throughput) with various other setting on your hardware to arrive at this > number? > > > filestore_max_sync_interval = 15 > That's fine in and by itself, unlikely to ever be reached anyway. > > > filestore_queue_max_ops = 25000 > > filestore_queue_max_bytes = 10485760 > > filestore_queue_committing_max_ops = 5000 > > filestore_queue_committing_max_bytes = 1048576 > > journal_max_write_bytes = 1073714824 > > journal_max_write_entries = 1 > > journal_queue_max_ops = 5 > > journal_queue_max_bytes = 1048576 > Same as above, have you tested these setting (from filestore_queue_max_ops > onward) compared to the defaults? > > With HDDs only I'd expect any benefits to be small and/or things to become > very uneven once the HDDs are saturated. > > > osd_max_write_size = 512 > > osd_client_message_size_cap = 2147483648 > > osd_deep_scrub_stride = 131072 > > osd_op_threads = 8 > > osd_disk_threads = 4 > > osd_map_cache_size = 1024 > > osd_map_cache_bl_size = 128 > > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" > The nobarrier part is a a potential recipe for disaster unless you have all > on-disk caches disabled and every other cache battery backed. > > The only devices I trust to mount nobarrier are SSDs with powercaps that > have been proven to do the right thing (Intel DC S amongst them). > > > osd_recovery_op_priority = 4 > > osd_recovery_max_active = 10 > > osd_max_backfills = 4 > > > That's sane enough. > > > [client] > > rbd_cache = true > AFAIK that's the case with recent Ceph versions anyway. > > > rbd_cache_size = 268435456 > > Are you sure that you have 256MB per client to waste on RBD cache? > If so, bully for you, but you might find that depending on your use case a > smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be > more beneficial. 
> > > rbd_cache_max_dirty = 134217728 > > rbd_cache_max_dirty_age = 5 > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
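If the answer to (2) ends up being "go back to the defaults", a sketch of how that could be done on Ubuntu 16.04 (the osd id and the default value are only examples, double-check the 10.2.0 default before relying on it):

    # comment the tuned values out of the [osd] section of ceph.conf, then restart, e.g.
    systemctl restart ceph-osd@0
    # or push a single value back at runtime without a restart:
    ceph tell osd.* injectargs '--filestore_queue_max_ops 50'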
Re: [ceph-users] thanks for a double check on ceph's config
> rbd_cache_size = 268435456 Are you sure that you have 256MB per client to waste on RBD cache? If so, bully for you, but you might find that depending on your use case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be more beneficial. We have changed this value to 64MB. thanks. 2016-05-10 15:31 GMT+08:00 Christian Balzer : > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > Hello, > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB > > each), 64GB memory. > No journal SSDs? > What CPU(s) and network? > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > The whole cluster is new installed. > > > > Can you help check what the arguments we put in ceph.conf is reasonable > > or not? > > thanks. > > > > [osd] > > osd_data = /var/lib/ceph/osd/ceph-$id > > osd_journal_size = 2 > Overkill most likely, but not an issue. > > > osd_mkfs_type = xfs > > osd_mkfs_options_xfs = -f > > filestore_xattr_use_omap = true > > filestore_min_sync_interval = 10 > Are you aware what this does and have you actually tested this (IOPS AND > throughput) with various other setting on your hardware to arrive at this > number? > > > filestore_max_sync_interval = 15 > That's fine in and by itself, unlikely to ever be reached anyway. > > > filestore_queue_max_ops = 25000 > > filestore_queue_max_bytes = 10485760 > > filestore_queue_committing_max_ops = 5000 > > filestore_queue_committing_max_bytes = 1048576 > > journal_max_write_bytes = 1073714824 > > journal_max_write_entries = 1 > > journal_queue_max_ops = 5 > > journal_queue_max_bytes = 1048576 > Same as above, have you tested these setting (from filestore_queue_max_ops > onward) compared to the defaults? > > With HDDs only I'd expect any benefits to be small and/or things to become > very uneven once the HDDs are saturated. > > > osd_max_write_size = 512 > > osd_client_message_size_cap = 2147483648 > > osd_deep_scrub_stride = 131072 > > osd_op_threads = 8 > > osd_disk_threads = 4 > > osd_map_cache_size = 1024 > > osd_map_cache_bl_size = 128 > > osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier" > The nobarrier part is a a potential recipe for disaster unless you have all > on-disk caches disabled and every other cache battery backed. > > The only devices I trust to mount nobarrier are SSDs with powercaps that > have been proven to do the right thing (Intel DC S amongst them). > > > osd_recovery_op_priority = 4 > > osd_recovery_max_active = 10 > > osd_max_backfills = 4 > > > That's sane enough. > > > [client] > > rbd_cache = true > AFAIK that's the case with recent Ceph versions anyway. > > > rbd_cache_size = 268435456 > > Are you sure that you have 256MB per client to waste on RBD cache? > If so, bully for you, but you might find that depending on your use case a > smaller RBD cache but more VM memory (for pagecache, SLAB, etc) could be > more beneficial. > > > rbd_cache_max_dirty = 134217728 > > rbd_cache_max_dirty_age = 5 > > Christian > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
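One thing worth double-checking when shrinking the cache: rbd_cache_max_dirty has to stay below rbd_cache_size for writeback caching to work, and it is still 128MB in the config above. A sketch of how the [client] section might end up (the dirty value is only an example):

    [client]
    rbd_cache = true
    rbd_cache_size = 67108864        # 64MB
    rbd_cache_max_dirty = 50331648   # 48MB, must be < rbd_cache_size (0 means write-through)
    rbd_cache_max_dirty_age = 5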
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, On Tue, 10 May 2016 10:40:08 +0200 Yoann Moulin wrote: > Hello, > > I'd like some advices about the setup of a new ceph cluster. Here the > use case : > > RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. > Most of the access will be in read only mode. Write access will only be > done by the admin to update the datasets. > > We might use rbd some time to sync data as temp storage (when POSIX is > needed) but performance will not be an issue here. We might use cephfs > in the futur if that can replace a filesystem on rdb. > > We gonna start with 16 nodes (up to 24). The configuration of each node > is : > > CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > Memory : 128GB > OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) Dedicated OS SSDs aren't really needed, I tend to share OS and cache/journal SSDs. That's of course with more durable (S3610) models. Since you didn't mention dedicated MON nodes, make sure that if you plan to put MONs on storage servers to have fast SSDs in them for the leveldb (again DC S36xx or 37xx). This will also free up 2 more slots in your (likely Supermicro) chassis for OSD HDDs. > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) These SSDs do not exist according to the Intel site and the only references I can find for them are on "no longer available" European sites. Since you're in the land of rich chocolate bankers, I assume that this model is something that just happened in Europe. Without knowing the specifications for these SSDs, I can't recommend them. I'd use DC S3610 or 3710 instead, this very much depends on how much endurance (TPW) you need. > OSD Disk : 10 x HGST ultrastar-7k6000 6TB > Public Network : 1 x 10Gb/s > Private Network : 1 x 10Gb/s > OS : Ubuntu 16.04 > Ceph version : Jewel > > The question is : journal or cache tier (read only) on the SD 400GB > Intel S3300 DC ? > You said read-only, or read-mostly up there. So why journals (only helpful for writes) or cache tiers (your 2 SSDs may not be faster than your 10 HDDs for reads) at all? Mind, if you have the money, go for it! > Each disk is able to write sequentially at 220MB/s. SSDs can write at > ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the > bottleneck (1GB/s vs 2GB/s). Your filestore based OSDs will never write Ceph data at 220MB/s, 100 would be pushing it. So no, your journal SSDs won't be the limiting factor, though 5 journals on one SSD is pushing my comfort zone when it comes to SPoFs. > If we set the journal on OSDs, we can > expect a good throughput in read on the disk (in case of data not in the > cache) and write shouldn't be so bad too, even if we have random read on > the OSD during the write ? > > SSDs as cache tier seem to be a better usage than only 5 journal on > each ? Is that correct ? > Potentially, depends on your actual usage. Again, since you said read-mostly, the question with a cache-tier becomes, how much of your truly hot data can fit into it? Remember that super-hot objects are likely to come from the pagecache of the storage node in question anyway. If you do care about fast writes after all, consider de-coupling writes and reads as much as possible. As in, set your cache to "readforward" (undocumented, google for it), so all un-cached reads will go to the HDDs (they CAN read at near full speed), while all writes will go the cache pool (and eventually to the HDDs, you can time that with lowering the dirty ratio during off-peak hours). 
> We gonna use an EC pool for big files (jerasure 8+2 I think) and a > replicated pool for small files. > > If I check on http://ceph.com/pgcalc/, in this use case > > replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs > Ec pool : pg_num = 4096 > and pgp_num = pg_num > > Should I set the pg_num to 8192 or 16384 ? what is the impact on the > cluster if we set the pg_num to 16384 at the beginning ? 16384 is high, > isn't it ? > If 24 nodes is the absolute limit of your cluster, you want to set the target pg num to 100 in the calculator, which gives you 8192 again. Keep in mind that splitting PGs is an expensive operation, so if 24 isn't a hard upper limit, you might be better off starting big. Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
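For reference, a rough sketch of the commands involved if the readforward route gets taken (pool names and thresholds are placeholders, not a recommendation, and readforward is the undocumented mode Christian mentions):

    ceph osd tier add rgw-data rgw-cache
    ceph osd tier cache-mode rgw-cache readforward
    ceph osd tier set-overlay rgw-data rgw-cache
    ceph osd pool set rgw-cache hit_set_type bloom
    ceph osd pool set rgw-cache target_max_bytes 1500000000000   # leave headroom on the SSDs
    ceph osd pool set rgw-cache cache_target_dirty_ratio 0.4
    ceph osd pool set rgw-cache cache_target_full_ratio 0.8
    # off-peak: lower the dirty ratio to push dirty objects down to the HDDs
    ceph osd pool set rgw-cache cache_target_dirty_ratio 0.1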
Re: [ceph-users] thanks for a double check on ceph's config
Hello, On Tue, 10 May 2016 16:50:17 +0800 Geocast Networks wrote: > Hello Chris, > > We don't use SSD as journal. > each host has one intel E5-2620 CPU which is 6 cores. That should be enough. > the networking (both cluster and data networks) is 10Gbps. > 12 HDDs will barely saturate a 10Gb/s link during writes, if you care about fast reads you may be better off with a uniform, bonded 20Gb/s network. > My further questions include, > > (1) osd_mkfs_type = xfs > osd_mkfs_options_xfs = -f > filestore_xattr_use_omap = true > > for XFS filesystem, we should not enable filestore_xattr_use_omap = true, > is it? > You don't need to, AFAIK this switch doesn't cause any overhead if it isn't needed. Somebody actually using XFS or knowing the code may pipe up here. > (2) filestore_queue_max_ops = 25000 > filestore_queue_max_bytes = 10485760 > filestore_queue_committing_max_ops = 5000 > filestore_queue_committing_max_bytes = 1048576 > journal_max_write_bytes = 1073714824 > journal_max_write_entries = 1 > journal_queue_max_ops = 5 > journal_queue_max_bytes = 1048576 > > Since we don't have SSD as journals, all these setup are too large? what > are the better values? > You really want to test them against the defaults. And the defaults are designed for usage with HDD only OSDs, so they are probably your best bet unless you feel like empiric testing. > (3) osd_mount_options_xfs = > "rw,noexec,nodev,noatime,nodiratime,nobarrier" What's your suggested > options here? > As I said, loose the "nobarrier". Christian > Thanks a lot. > > > 2016-05-10 15:31 GMT+08:00 Christian Balzer : > > > On Tue, 10 May 2016 11:48:07 +0800 Geocast wrote: > > > > Hello, > > > > > We have 21 hosts for ceph OSD servers, each host has 12 SATA disks > > > (4TB each), 64GB memory. > > No journal SSDs? > > What CPU(s) and network? > > > > > ceph version 10.2.0, Ubuntu 16.04 LTS > > > The whole cluster is new installed. > > > > > > Can you help check what the arguments we put in ceph.conf is > > > reasonable or not? > > > thanks. > > > > > > [osd] > > > osd_data = /var/lib/ceph/osd/ceph-$id > > > osd_journal_size = 2 > > Overkill most likely, but not an issue. > > > > > osd_mkfs_type = xfs > > > osd_mkfs_options_xfs = -f > > > filestore_xattr_use_omap = true > > > filestore_min_sync_interval = 10 > > Are you aware what this does and have you actually tested this (IOPS > > AND throughput) with various other setting on your hardware to arrive > > at this number? > > > > > filestore_max_sync_interval = 15 > > That's fine in and by itself, unlikely to ever be reached anyway. > > > > > filestore_queue_max_ops = 25000 > > > filestore_queue_max_bytes = 10485760 > > > filestore_queue_committing_max_ops = 5000 > > > filestore_queue_committing_max_bytes = 1048576 > > > journal_max_write_bytes = 1073714824 > > > journal_max_write_entries = 1 > > > journal_queue_max_ops = 5 > > > journal_queue_max_bytes = 1048576 > > Same as above, have you tested these setting (from > > filestore_queue_max_ops onward) compared to the defaults? > > > > With HDDs only I'd expect any benefits to be small and/or things to > > become very uneven once the HDDs are saturated. 
> > > > > osd_max_write_size = 512 > > > osd_client_message_size_cap = 2147483648 > > > osd_deep_scrub_stride = 131072 > > > osd_op_threads = 8 > > > osd_disk_threads = 4 > > > osd_map_cache_size = 1024 > > > osd_map_cache_bl_size = 128 > > > osd_mount_options_xfs = > > > "rw,noexec,nodev,noatime,nodiratime,nobarrier" > > The nobarrier part is a a potential recipe for disaster unless you > > have all on-disk caches disabled and every other cache battery backed. > > > > The only devices I trust to mount nobarrier are SSDs with powercaps > > that have been proven to do the right thing (Intel DC S amongst them). > > > > > osd_recovery_op_priority = 4 > > > osd_recovery_max_active = 10 > > > osd_max_backfills = 4 > > > > > That's sane enough. > > > > > [client] > > > rbd_cache = true > > AFAIK that's the case with recent Ceph versions anyway. > > > > > rbd_cache_size = 268435456 > > > > Are you sure that you have 256MB per client to waste on RBD cache? > > If so, bully for you, but you might find that depending on your use > > case a smaller RBD cache but more VM memory (for pagecache, SLAB, etc) > > could be more beneficial. > > > > > rbd_cache_max_dirty = 134217728 > > > rbd_cache_max_dirty_age = 5 > > > > Christian > > -- > > Christian BalzerNetwork/Systems Engineer > > ch...@gol.com Global OnLine Japan/Rakuten Communications > > http://www.gol.com/ > > -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
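If it helps, the same mount line with only the dangerous part dropped would look something like this (inode64 is optional, but commonly used on filesystems over 1TB, as Udo's mail further down also shows):

    osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,inode64"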
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, >> I'd like some advices about the setup of a new ceph cluster. Here the >> use case : >> >> RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. >> Most of the access will be in read only mode. Write access will only be >> done by the admin to update the datasets. >> >> We might use rbd some time to sync data as temp storage (when POSIX is >> needed) but performance will not be an issue here. We might use cephfs >> in the futur if that can replace a filesystem on rdb. >> >> We gonna start with 16 nodes (up to 24). The configuration of each node >> is : >> >> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) >> Memory : 128GB >> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > > Dedicated OS SSDs aren't really needed, I tend to share OS and > cache/journal SSDs. > That's of course with more durable (S3610) models. I already have those 24 servers running 2 ceph cluster for test right now, so I cannot change anything. we were thinking about share journal but as I mention it below, MON will be on storage server, so that might use too much I/O to share levedb and journal on the same SSD. > Since you didn't mention dedicated MON nodes, make sure that if you plan > to put MONs on storage servers to have fast SSDs in them for the leveldb > (again DC S36xx or 37xx). Yes MON nodes will be shared on storage server. MONs use the SSD 240GB for the leveldb right now. > This will also free up 2 more slots in your (likely Supermicro) chassis > for OSD HDDs. It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in front. >> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > > These SSDs do not exist according to the Intel site and the only > references I can find for them are on "no longer available" European sites. I made a mistake, it's not 400 but 480GB, smartctl give me Model SSDSC2BB480H4 > Since you're in the land of rich chocolate bankers, I assume that this > model is something that just happened in Europe. I'm just a poor sysadmin with expensive toy in a University ;) > Without knowing the specifications for these SSDs, I can't recommend them. > I'd use DC S3610 or 3710 instead, this very much depends on how much > endurance (TPW) you need. As I write above, I already have those SSDs so I look for the best setup with the hardware I have. >> OSD Disk : 10 x HGST ultrastar-7k6000 6TB >> Public Network : 1 x 10Gb/s >> Private Network : 1 x 10Gb/s >> OS : Ubuntu 16.04 >> Ceph version : Jewel >> >> The question is : journal or cache tier (read only) on the SD 400GB >> Intel S3300 DC ? >> > You said read-only, or read-mostly up there. I mean, I think about using cache tier for read operation. No write operation gonna use the cache tier. I don't know yet wich mode I gonna use, I have to do some tests. > So why journals (only helpful for writes) or cache tiers (your 2 SSDs may > not be faster than your 10 HDDs for reads) at all? We plan to have eavy read access some time so we think about to have cache tier on SSD to speed up the throughput and decrease the I/O pressure on disk. I might be wrong on that. > Mind, if you have the money, go for it! I don't have the money, I have the hardware :) >> Each disk is able to write sequentially at 220MB/s. SSDs can write at >> ~500MB/s. if we set 5 journals on each SSDs, SSD will still be the >> bottleneck (1GB/s vs 2GB/s). > > Your filestore based OSDs will never write Ceph data at 220MB/s, 100 would > be pushing it. 
> So no, your journal SSDs won't be the limiting factor, though 5 journals > on one SSD is pushing my comfort zone when it comes to SPoFs. > >> If we set the journal on OSDs, we can >> expect a good throughput in read on the disk (in case of data not in the >> cache) and write shouldn't be so bad too, even if we have random read on >> the OSD during the write ? >> >> SSDs as cache tier seem to be a better usage than only 5 journal on >> each ? Is that correct ? >> > Potentially, depends on your actual usage. > > Again, since you said read-mostly, the question with a cache-tier becomes, > how much of your truly hot data can fit into it? That the biggest point, many datasets will fit into the cache, but some of them will definitely be too big (+100TB) but in that case, Our user know what going one. > Remember that super-hot objects are likely to come from the pagecache of > the storage node in question anyway. Yes I know that. > If you do care about fast writes after all, consider de-coupling writes > and reads as much as possible. Write operation will only be done by the admins for datasets update. those updates will be plan according the usage of the cluster and scheduled during low usage period. > As in, set your cache to "readforward" (undocumented, google for it), so > all un-cached reads will go to the HDDs (they CAN read at near full speed), > while all writes will go the cache pool (and ev
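A back-of-the-envelope check of the journal question, using only the numbers from this thread (rough figures, ignoring filesystem overhead and replication):

    journal on the same HDD : every write lands on the disk twice -> ~220 / 2 = 110 MB/s ceiling per OSD
    5 journals per SSD      : 5 OSDs × ~100 MB/s ≈ 500 MB/s, right at the SSD's sequential write limit
    10 HDDs reading         : 10 × 220 MB/s = 2.2 GB/s, already more than the 10Gb/s public link (~1.2 GB/s)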
[ceph-users] ceph 0.94.5 / Kernel 4.5.2 / assertion on OSD
Hello, we are running ceph 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) on a 4.5.2 kernel. Our cluster currently consists of 5 nodes, with 6 OSD's each. An issue has also been filed here (also containg logs, etc.): http://tracker.ceph.com/issues/15813 Last night we have observed a single OSD (osd.11) die with an assertion: 2016-05-10 03:16:30.718936 7fa5166dc700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa5166dc700 time 2016-05-10 03:16:30.688044 common/Mutex.cc: 100: FAILED assert(r == 0) ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0xb34520] 2: (Mutex::Lock(bool)+0x105) [0xae2395] 3: (DispatchQueue::discard_queue(unsigned long)+0x37) [0xbeff67] 4: (Pipe::fault(bool)+0x426) [0xc16256] 5: (Pipe::reader()+0x3f2) [0xc1d752] 6: (Pipe::Reader::entry()+0xd) [0xc2880d] 7: (()+0x7474) [0x7fa54c70b474] 8: (clone()+0x6d) [0x7fa54ac01acd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. --- begin dump of recent events --- -74> 2016-05-09 11:56:56.015711 7fa54d1fc7c0 5 asok(0x5644000) register_command perfcounters_dump hook 0x5624030 -73> 2016-05-09 11:56:56.015739 7fa54d1fc7c0 5 asok(0x5644000) register_command 1 hook 0x5624030 -72> 2016-05-09 11:56:56.015745 7fa54d1fc7c0 5 asok(0x5644000) register_command perf dump hook 0x5624030 -71> 2016-05-09 11:56:56.015751 7fa54d1fc7c0 5 asok(0x5644000) register_command perfcounters_schema hook 0x5624030 -70> 2016-05-09 11:56:56.015756 7fa54d1fc7c0 5 asok(0x5644000) register_command 2 hook 0x5624030 -69> 2016-05-09 11:56:56.015758 7fa54d1fc7c0 5 asok(0x5644000) register_command perf schema hook 0x5624030 -68> 2016-05-09 11:56:56.015763 7fa54d1fc7c0 5 asok(0x5644000) register_command perf reset hook 0x5624030 -67> 2016-05-09 11:56:56.015766 7fa54d1fc7c0 5 asok(0x5644000) register_command config show hook 0x5624030 -66> 2016-05-09 11:56:56.015770 7fa54d1fc7c0 5 asok(0x5644000) register_command config set hook 0x5624030 -65> 2016-05-09 11:56:56.015773 7fa54d1fc7c0 5 asok(0x5644000) register_command config get hook 0x5624030 -64> 2016-05-09 11:56:56.015776 7fa54d1fc7c0 5 asok(0x5644000) register_command config diff hook 0x5624030 -63> 2016-05-09 11:56:56.015779 7fa54d1fc7c0 5 asok(0x5644000) register_command log flush hook 0x5624030 -62> 2016-05-09 11:56:56.015783 7fa54d1fc7c0 5 asok(0x5644000) register_command log dump hook 0x5624030 -61> 2016-05-09 11:56:56.015786 7fa54d1fc7c0 5 asok(0x5644000) register_command log reopen hook 0x5624030 -60> 2016-05-09 11:56:56.017553 7fa54d1fc7c0 0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 76017 -59> 2016-05-09 11:56:56.027154 7fa54d1fc7c0 0 filestore(/var/lib/ceph/osd/ceph-11) backend xfs (magic 0x58465342) -58> 2016-05-09 11:56:56.028635 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work -57> 2016-05-09 11:56:56.028644 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option -56> 2016-05-09 11:56:56.042822 7fa54d1fc7c0 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) -55> 2016-05-09 11:56:56.043047 7fa54d1fc7c0 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature: extsize is supported and kernel 4.5.2-1-ARCH >= 3.5 -54> 2016-05-09 11:56:56.109483 7fa54d1fc7c0 0 
filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled -53> 2016-05-09 11:56:56.110110 7fa54d1fc7c0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway -52> 2016-05-09 11:56:56.110825 7fa54d1fc7c0 0 cls/hello/cls_hello.cc:271: loading cls_hello -51> 2016-05-09 11:56:56.114886 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for clients -50> 2016-05-09 11:56:56.114895 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608 was 8705, adjusting msgr requires for mons -49> 2016-05-09 11:56:56.114899 7fa54d1fc7c0 0 osd.11 9819 crush map has features 283675107524608, adjusting msgr requires for osds -48> 2016-05-09 11:56:56.114919 7fa54d1fc7c0 0 osd.11 9819 load_pgs -47> 2016-05-09 11:56:57.000584 7fa54d1fc7c0 0 osd.11 9819 load_pgs opened 55 pgs -46> 2016-05-09 11:56:57.000991 7fa54d1fc7c0 -1 osd.11 9819 log_to_monitors {default=true} -45> 2016-05-09 11:56:57.052319 7fa54d1fc7c0 0 osd.11 9819 done with init, starting boot process -44> 2016-05-09 11:57:01.103141 7fa50d84e700 0 -- [fd00:2380:0:21::3]:6806/76017 >> [fd00:2380:0:21::3]:6804/75598 pipe(0x11152000 sd=80 :6806 s=0 pgs=0 cs=0 l=0 c=0x10ec6580)
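Not a fix, but two things that usually help when attaching this kind of crash to a tracker issue (the binary path is the usual location, adjust if needed):

    # disassembly of the exact binary, as the assert message asks for
    objdump -rdS /usr/bin/ceph-osd > ceph-osd-0.94.5.objdump
    # raise messenger/osd logging on that OSD before trying to reproduce
    ceph tell osd.11 injectargs '--debug_ms 10 --debug_osd 20'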
[ceph-users] Cluster issue - pgs degraded, recovering, stale, etc.
Hello. I have a two node cluster with 4x replicas for all objects distributed between the two nodes (two copies on each node). I recently converted my OSDs from BTRFS to XFS (BTRFS was slow) by removing / preparing / activating the OSDs on each node (one at a time) as XFS, allowing the cluster to rebalance / recover itself. Now with this all complete, I have a better performing cluster and all data is intact, however I have the following status. How can I remedy this? Looking for guidance on steps / a troubleshooting starting point. There’s a bunch of seemingly different issues that likely stem from the same root cause.

health HEALTH_WARN
 11 pgs degraded
 7 pgs peering
 4 pgs recovering
 2 pgs recovery_wait
 885 pgs stale
 11 pgs stuck degraded
 60 pgs stuck inactive
 885 pgs stuck stale
 66 pgs stuck unclean
 recovery 3/24971148 objects degraded (0.000%)

Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
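Not an answer, but the usual starting points for narrowing down where the stale PGs think they live (the pgid below is just a placeholder for one of the PGs the cluster reports):

    ceph health detail
    ceph pg dump_stuck stale
    ceph pg dump_stuck inactive
    ceph pg 2.3f query        # for one of the stale PGs reported above
    ceph osd tree             # confirm every OSD really is up and under the expected host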
Re: [ceph-users] journal or cache tier on SSDs ?
Hello, On Tue, 10 May 2016 13:14:35 +0200 Yoann Moulin wrote: > Hello, > > >> I'd like some advices about the setup of a new ceph cluster. Here the > >> use case : > >> > >> RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. > >> Most of the access will be in read only mode. Write access will only > >> be done by the admin to update the datasets. > >> > >> We might use rbd some time to sync data as temp storage (when POSIX is > >> needed) but performance will not be an issue here. We might use cephfs > >> in the futur if that can replace a filesystem on rdb. > >> > >> We gonna start with 16 nodes (up to 24). The configuration of each > >> node is : > >> > >> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > >> Memory : 128GB > >> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > > > > Dedicated OS SSDs aren't really needed, I tend to share OS and > > cache/journal SSDs. > > That's of course with more durable (S3610) models. > > I already have those 24 servers running 2 ceph cluster for test right > now, so I cannot change anything. we were thinking about share journal > but as I mention it below, MON will be on storage server, so that might > use too much I/O to share levedb and journal on the same SSD. > Not really, the journal is sequential writes, the leveldb small, fast IOPS. Both of them on the same (decent) SSD should be fine. But as your HW is fixed, lets not speculate about that. > > Since you didn't mention dedicated MON nodes, make sure that if you > > plan to put MONs on storage servers to have fast SSDs in them for the > > leveldb (again DC S36xx or 37xx). > > Yes MON nodes will be shared on storage server. MONs use the SSD 240GB > for the leveldb right now. > Note that the lowest IP(s) become the MON leader, so if you put RADOSGW and other things on the storage nodes as well, spread things out accordingly. > > This will also free up 2 more slots in your (likely Supermicro) chassis > > for OSD HDDs. > > It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front > and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in > front. > That sounds like a SM chassis. ^o^ In fact, I can't find a chassis on Intel's page with 2 back 2.5 slots. > >> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > > > > These SSDs do not exist according to the Intel site and the only > > references I can find for them are on "no longer available" European > > sites. > > I made a mistake, it's not 400 but 480GB, smartctl give me Model > SSDSC2BB480H4 > OK, that's not good. Firstly, that model number still doesn't get us any hits from Intel, strangely enough. Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and matches the 3510 480GB model up to the last 2 characters. And that has an endurance of 275TBW, not something you want to use for either journals or cache pools. > > Since you're in the land of rich chocolate bankers, I assume that this > > model is something that just happened in Europe. > > I'm just a poor sysadmin with expensive toy in a University ;) > I know, I recognized the domain. ^.^ > > Without knowing the specifications for these SSDs, I can't recommend > > them. I'd use DC S3610 or 3710 instead, this very much depends on how > > much endurance (TPW) you need. > > As I write above, I already have those SSDs so I look for the best setup > with the hardware I have. 
> Unless they have at least an endurance of 3 DWPD like the 361x (and their model number, size and the 3300 naming suggests they do NOT), your 480GB SSDs aren't suited for intense Ceph usage. How much have you used them yet and what is their smartctl status, in particular these values (from a 800GB DC S3610 in my cache pool): --- 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000Old_age Always - 0 241 Host_Writes_32MiB 0x0032 100 100 000Old_age Always - 869293 242 Host_Reads_32MiB0x0032 100 100 000Old_age Always - 43435 243 NAND_Writes_32MiB 0x0032 100 100 000Old_age Always - 1300884 --- Not even 1% down after 40TBW, at which point your SSDs are likely to be 15% down... > >> OSD Disk : 10 x HGST ultrastar-7k6000 6TB > >> Public Network : 1 x 10Gb/s > >> Private Network : 1 x 10Gb/s > >> OS : Ubuntu 16.04 > >> Ceph version : Jewel > >> > >> The question is : journal or cache tier (read only) on the SD 400GB > >> Intel S3300 DC ? > >> > > You said read-only, or read-mostly up there. > > I mean, I think about using cache tier for read operation. No write > operation gonna use the cache tier. I don't know yet wich mode I gonna > use, I have to do some tests. > As I said, your HDDs are unlikely to be slower (for sufficient parallel access, not short, sequential reads) than those SSDs. > > So why journals (only helpf
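For anyone wanting to reproduce those numbers on their own drives, the counters come straight out of smartctl, and the *_32MiB raw values convert as follows (rough arithmetic):

    smartctl -a /dev/sdX | egrep 'Wearout|Reservd|Writes_32MiB|Reads_32MiB|Total_LBAs'

    host writes : 869293  × 32 MiB ≈ 26.5 TiB
    NAND writes : 1300884 × 32 MiB ≈ 39.7 TiB   (the ~40TBW mentioned above)
    write amplification ≈ 1300884 / 869293 ≈ 1.5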
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
> -Original Message- > From: Eric Eastman [mailto:eric.east...@keepertech.com] > Sent: 09 May 2016 23:09 > To: Nick Fisk > Cc: Ceph Users > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > lockfile > > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: > > Hi Eric, > > > >> > >> I am trying to do some similar testing with SAMBA and CTDB with the > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are you > >> kernel mounting the Ceph file system? > > > > I'm using the kernel client. I couldn't find any up to date information on > > if > the vfs plugin supported all the necessary bits and pieces. > > > > How is your testing coming along? I would be very interested in any > findings you may have come across. > > > > Nick > > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When from a > SAMBA client, I write a large file (about 2GB) to a gateway that is not the > holder of the CTDB lock file, and then kill that gateway server during the > write, the IP failover works as expected, and in most cases the file ends up > being the correct size after the new server finishes writing it, but the data > is > corrupt. The data in the file, from the point of the failover, is all zeros. > > I thought the issue may be with the kernel mount, so I looked into using the > SAMBA vfs_ceph module, but I need SAMBA with AD support and the > current vfs_ceph module, even in the SAMBA git master version, is lacking > ACL support for CephFS, as the vfs_ceph.c patches summited to the SAMBA > mail list are not yet available. See: > https://lists.samba.org/archive/samba-technical/2016-March/113063.html > > I tried using a FUSE mount of the CephFS, and it also fails setting ACLs. > See: > http://tracker.ceph.com/issues/15783. > > My current status is IP failover is working, but I am seeing data corruption > on > writes to the share when using kernel mounts. I am also seeing the issue you > reported when I kill the system holding the CTDB lock file. Are you verifying > your data after each failover? I must admit you are slightly ahead of me. I was initially trying to just get hard/soft failover working correctly. But your response has prompted me to test out the scenario you mentioned. I'm seeing slightly different results, my copy seems to error out when I do a node failover. I'm copying an ISO from a 2008 server to the CTDB/Samba share and when I reboot the active node, the copy pauses for a couple of seconds and then comes up with the error box. Clicking try again several times doesn't let it resume. I need to do a bit more digging to try and work out why this is happening. The share itself does seem to be in a working state when trying to click the try again button, so there is probably some sort of state/session problem. Do you have multiple vip's configured on your cluster or just a single IP? I have just the one at the moment. > > Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
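For what it's worth, the vfs_ceph route (for anyone who can live without the ACL support discussed above) only needs a small share definition. This is a sketch with a placeholder share name, path and cephx user, not a tested AD setup:

    [cephshare]
        path = /shares                      ; path relative to the CephFS root
        vfs objects = ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba                ; cephx user the gateway authenticates as
        kernel share modes = no             ; usually recommended with vfs_ceph
        read only = no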
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Nick Fisk > Sent: 10 May 2016 13:30 > To: 'Eric Eastman' > Cc: 'Ceph Users' > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > lockfile > > > -Original Message- > > From: Eric Eastman [mailto:eric.east...@keepertech.com] > > Sent: 09 May 2016 23:09 > > To: Nick Fisk > > Cc: Ceph Users > > Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on > > lockfile > > > > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: > > > Hi Eric, > > > > > >> > > >> I am trying to do some similar testing with SAMBA and CTDB with the > > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are > > >> you kernel mounting the Ceph file system? > > > > > > I'm using the kernel client. I couldn't find any up to date > > > information on if > > the vfs plugin supported all the necessary bits and pieces. > > > > > > How is your testing coming along? I would be very interested in any > > findings you may have come across. > > > > > > Nick > > > > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When > from > > a SAMBA client, I write a large file (about 2GB) to a gateway that is > > not the holder of the CTDB lock file, and then kill that gateway > > server during the write, the IP failover works as expected, and in > > most cases the file ends up being the correct size after the new > > server finishes writing it, but the data is corrupt. The data in the file, from > the point of the failover, is all zeros. > > > > I thought the issue may be with the kernel mount, so I looked into > > using the SAMBA vfs_ceph module, but I need SAMBA with AD support > and > > the current vfs_ceph module, even in the SAMBA git master version, is > > lacking ACL support for CephFS, as the vfs_ceph.c patches summited to > > the SAMBA mail list are not yet available. See: > > https://lists.samba.org/archive/samba-technical/2016-March/113063.html > > > > I tried using a FUSE mount of the CephFS, and it also fails setting ACLs. See: > > http://tracker.ceph.com/issues/15783. > > > > My current status is IP failover is working, but I am seeing data > > corruption on writes to the share when using kernel mounts. I am also > > seeing the issue you reported when I kill the system holding the CTDB > > lock file. Are you verifying your data after each failover? > > I must admit you are slightly ahead of me. I was initially trying to just get > hard/soft failover working correctly. But your response has prompted me to > test out the scenario you mentioned. I'm seeing slightly different results, my > copy seems to error out when I do a node failover. I'm copying an ISO from a > 2008 server to the CTDB/Samba share and when I reboot the active node, > the copy pauses for a couple of seconds and then comes up with the error > box. Clicking try again several times doesn't let it resume. I need to do a bit > more digging to try and work out why this is happening. The share itself does > seem to be in a working state when trying to click the try again button, so > there is probably some sort of state/session problem. > > Do you have multiple vip's configured on your cluster or just a single IP? I > have just the one at the moment. Just to add to this, I have just been reading this article https://nnc3.com/mags/LM10/Magazine/Archive/2009/105/030-035_SambaHA/article .html And the following paragraph seems to indicate that what I am seeing is the correct behaviour? 
I 'm wondering if this is not happening in your case and is why you are getting corruption? "It is important to understand that load balancing and client distribution over the client nodes are connection oriented. If an IP address is switched from one node to another, all the connections actively using this IP address are dropped and the clients have to reconnect. To avoid delays, CTDB uses a trick: When an IP is switched, the new CTDB node "tickles" the client with an illegal TCP ACK packet (tickle ACK) containing an invalid sequence number of 0 and an ACK number of 0. The client responds with a valid ACK packet, allowing the new IP address owner to close the connection with an RST packet, thus forcing the client to reestablish the connection to the new node." Nick > > > > > Eric > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
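For reference, a minimal sketch of the CTDB side of such a setup (paths, IPs and interface names are placeholders; whether to publish a single VIP or one per gateway is exactly the trade-off discussed above, since only clients on a failed-over address have to reconnect):

    # /etc/default/ctdb
    CTDB_RECOVERY_LOCK=/mnt/cephfs/ctdb/.ctdb_lock    # must sit on the shared CephFS mount
    CTDB_NODES=/etc/ctdb/nodes                        # one private IP per gateway
    CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
    CTDB_MANAGES_SAMBA=yes

    # /etc/ctdb/public_addresses - one VIP, or one per gateway to spread clients out
    192.168.10.200/24 eth0
    192.168.10.201/24 eth0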
Re: [ceph-users] thanks for a double check on ceph's config
Hi,

On 2016-05-10 05:48, Geocast wrote:

 Hi members,
 We have 21 hosts for ceph OSD servers, each host has 12 SATA disks (4TB each), 64GB memory.
 ceph version 10.2.0, Ubuntu 16.04 LTS
 The whole cluster is new installed.
 Can you help check what the arguments we put in ceph.conf is reasonable or not? thanks.

 [osd]
 osd_data = /var/lib/ceph/osd/ceph-$id
 osd_journal_size = 2
 osd_mkfs_type = xfs
 osd_mkfs_options_xfs = -f
 filestore_xattr_use_omap = true
 filestore_min_sync_interval = 10
 filestore_max_sync_interval = 15
 filestore_queue_max_ops = 25000
 filestore_queue_max_bytes = 10485760
 filestore_queue_committing_max_ops = 5000
 filestore_queue_committing_max_bytes = 1048576
 journal_max_write_bytes = 1073714824
 journal_max_write_entries = 1
 journal_queue_max_ops = 5
 journal_queue_max_bytes = 1048576
 osd_max_write_size = 512
 osd_client_message_size_cap = 2147483648
 osd_deep_scrub_stride = 131072
 osd_op_threads = 8
 osd_disk_threads = 4
 osd_map_cache_size = 1024
 osd_map_cache_bl_size = 128
 osd_mount_options_xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"

I have these settings (to avoid fragmentation):

osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
osd mkfs options xfs = "-f -i size=2048"

Udo
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
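If fragmentation is the concern, the current state of an OSD filesystem can be checked, and if needed fixed online, with the standard XFS tools (device and mount point below are only examples):

    xfs_db -r -c frag /dev/sdc1       # prints the fragmentation factor, read-only
    xfs_fsr /var/lib/ceph/osd/ceph-0  # optional online defragmentation of that mount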
Re: [ceph-users] Erasure pool performance expectations
To answer my own question it seems that you can change settings on the fly using ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) However the response seems to imply I can't change this setting. Is there an other way to change these settings? On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle wrote: > Hey guys, > > I noticed the merge request that fixes the switch around here > https://github.com/ceph/ceph/pull/8912 > > I had two questions: > > >- Does this effect my performance in any way? Could it explain the >slow requests I keep having? >- Can I modify these settings manually myself on my cluster? > > Thanks, > > Peter > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > wrote: > >> Hey Mark, >> >> Sorry I missed your message as I'm only subscribed to daily digests. >> >> >>> Date: Tue, 3 May 2016 09:05:02 -0500 >>> From: Mark Nelson >>> To: ceph-users@lists.ceph.com >>> Subject: Re: [ceph-users] Erasure pool performance expectations >>> Message-ID: >>> Content-Type: text/plain; charset=windows-1252; format=flowed >>> In addition to what nick said, it's really valuable to watch your cache >>> tier write behavior during heavy IO. One thing I noticed is you said >>> you have 2 SSDs for journals and 7 SSDs for data. >> >> >> I thought the hardware recommendations were 1 journal disk per 3 or 4 >> data disks but I think I might have misunderstood it. Looking at my journal >> read/writes they seem to be ok though: >> https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016-05-06%2009.55.30.png?dl=0 >> >> However I started running into a lot of slow requests (made a separate >> thread for those: Diagnosing slow requests) and now I'm hoping these >> could be related to my journaling setup. >> >> >>> If they are all of >>> the same type, you're likely bottlenecked by the journal SSDs for >>> writes, which compounded with the heavy promotions is going to really >>> hold you back. >>> What you really want: >>> 1) (assuming filestore) equal large write throughput between the >>> journals and data disks. >> >> How would one achieve that? >> >>> >>> 2) promotions to be limited by some reasonable fraction of the cache >>> tier and/or network throughput (say 70%). This is why the >>> user-configurable promotion throttles were added in jewel. >> >> Are these already in the docs somewhere? >> >>> >>> 3) The cache tier to fill up quickly when empty but change slowly once >>> it's full (ie limiting promotions and evictions). No real way to do >>> this yet. >>> Mark >> >> >> Thanks for your thoughts. >> >> Peter >> >> > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure pool performance expectations
> -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Peter Kerdisle > Sent: 10 May 2016 14:37 > Cc: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Erasure pool performance expectations > > To answer my own question it seems that you can change settings on the fly > using > > ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' > osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) > > However the response seems to imply I can't change this setting. Is there an > other way to change these settings? Sorry Peter, I missed your last email. You can also specify that setting in the ceph.conf, ie I have in mine osd_tier_promote_max_bytes_sec = 400 > > > On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle > wrote: > Hey guys, > > I noticed the merge request that fixes the switch around here > https://github.com/ceph/ceph/pull/8912 > > I had two questions: > > • Does this effect my performance in any way? Could it explain the slow > requests I keep having? > • Can I modify these settings manually myself on my cluster? > Thanks, > > Peter > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > wrote: > Hey Mark, > > Sorry I missed your message as I'm only subscribed to daily digests. > > Date: Tue, 3 May 2016 09:05:02 -0500 > From: Mark Nelson > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Erasure pool performance expectations > Message-ID: > Content-Type: text/plain; charset=windows-1252; format=flowed > In addition to what nick said, it's really valuable to watch your cache > tier write behavior during heavy IO. One thing I noticed is you said > you have 2 SSDs for journals and 7 SSDs for data. > > I thought the hardware recommendations were 1 journal disk per 3 or 4 data > disks but I think I might have misunderstood it. Looking at my journal > read/writes they seem to be ok > though: https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016- > 05-06%2009.55.30.png?dl=0 > > However I started running into a lot of slow requests (made a separate > thread for those: Diagnosing slow requests) and now I'm hoping these could > be related to my journaling setup. > > If they are all of > the same type, you're likely bottlenecked by the journal SSDs for > writes, which compounded with the heavy promotions is going to really > hold you back. > What you really want: > 1) (assuming filestore) equal large write throughput between the > journals and data disks. > How would one achieve that? > > 2) promotions to be limited by some reasonable fraction of the cache > tier and/or network throughput (say 70%). This is why the > user-configurable promotion throttles were added in jewel. > Are these already in the docs somewhere? > > 3) The cache tier to fill up quickly when empty but change slowly once > it's full (ie limiting promotions and evictions). No real way to do > this yet. > Mark > > Thanks for your thoughts. > > Peter > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
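So, as a sketch, the persistent variant would be the following (the value is simply the one from the injectargs example above; as far as I can tell the "(unchangeable)" reply means the OSDs have to be restarted before a new value takes effect):

    [osd]
    osd_tier_promote_max_bytes_sec = 5242880
    # its companion knob, osd_tier_promote_max_objects_sec, may want raising too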
[ceph-users] RGW user quota may not adjust on bucket removal
Hey,

we currently have a problem with our radosgw. The quota value of a user does not get updated after an admin manually deletes a bucket (via radosgw-admin). You can only circumvent this if you synced the user stats before the removal. So there are now users who cannot upload new objects although they should be able to.

There is already a bug filed for this: http://tracker.ceph.com/issues/14507

It looks like the corresponding merge commit got into Ceph v10.1.0 first:

"""
nick@nick-nine-virtual:~/git_repos/ceph$ git tag --contains 709ab2dd6e84abf152527e6a9177aabcf1a4c887
v10.1.0
v10.1.1
v10.1.2
v10.2.0
"""

We are using Ceph version 9.2.1. I will upgrade the cluster to Jewel in the next few days, but I guess my problem will stay the same :-)

So does anyone know if there is a method to let ceph recalculate the quota usage of a user or change it manually somewhere?

I had the same problem a few weeks ago and I did the following:
- create a new temp user with new temp buckets
- lock the old account
- copy all the objects with S3fuse from the old account to the new one
- delete the old account and recreate it
- copy the objects back
(I did this because it was not possible to change the ownership of a bucket to a new user)

This time it would take a long time to do this again as the users have a lot more objects in their buckets.

Thanks for any help or advice...

Cheers
Nick
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
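In case it saves someone a round trip: the stats sync referred to above can also be run after the fact, which is worth trying before rebuilding accounts (the uid is a placeholder, and I am not certain it picks up buckets that were already deleted behind the user's back):

    radosgw-admin user stats --uid=johndoe --sync-stats
    radosgw-admin user info --uid=johndoe     # check the updated stats/quota afterwards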
Re: [ceph-users] journal or cache tier on SSDs ?
Re, I'd like some advices about the setup of a new ceph cluster. Here the use case : RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most of the access will be in read only mode. Write access will only be done by the admin to update the datasets. We might use rbd some time to sync data as temp storage (when POSIX is needed) but performance will not be an issue here. We might use cephfs in the futur if that can replace a filesystem on rdb. We gonna start with 16 nodes (up to 24). The configuration of each node is : CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) Memory : 128GB OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) >>> >>> Dedicated OS SSDs aren't really needed, I tend to share OS and >>> cache/journal SSDs. >>> That's of course with more durable (S3610) models. >> >> I already have those 24 servers running 2 ceph cluster for test right >> now, so I cannot change anything. we were thinking about share journal >> but as I mention it below, MON will be on storage server, so that might >> use too much I/O to share levedb and journal on the same SSD. >> > Not really, the journal is sequential writes, the leveldb small, fast > IOPS. Both of them on the same (decent) SSD should be fine. > > But as your HW is fixed, lets not speculate about that. Ok. >>> Since you didn't mention dedicated MON nodes, make sure that if you >>> plan to put MONs on storage servers to have fast SSDs in them for the >>> leveldb (again DC S36xx or 37xx). >> >> Yes MON nodes will be shared on storage server. MONs use the SSD 240GB >> for the leveldb right now. >> > Note that the lowest IP(s) become the MON leader, so if you put RADOSGW > and other things on the storage nodes as well, spread things out > accordingly. Yes for sur, we gonna spread services over nodes. The 3 RadosGW won't be on the MONs nodes. >>> This will also free up 2 more slots in your (likely Supermicro) chassis >>> for OSD HDDs. >> >> It's not supermicro enclosure, it's Intel one with 12 slot 3.5" front >> and 2 slots 2.5" back, so I cannot add more disk. the 240GB SSDs are in >> front. > > That sounds like a SM chassis. ^o^ > In fact, I can't find a chassis on Intel's page with 2 back 2.5 slots. http://www.colfax-intl.com/nd/images/systems/servers/R2208WT-rear.gif Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) >>> >>> These SSDs do not exist according to the Intel site and the only >>> references I can find for them are on "no longer available" European >>> sites. >> >> I made a mistake, it's not 400 but 480GB, smartctl give me Model >> SSDSC2BB480H4 >> > OK, that's not good. > Firstly, that model number still doesn't get us any hits from Intel, > strangely enough. > > Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and > matches the 3510 480GB model up to the last 2 characters. > And that has an endurance of 275TBW, not something you want to use for > either journals or cache pools. I see, here the information from the resseler : "The S3300 series is the OEM version of S3510 and 1:1 the same drive" >>> Since you're in the land of rich chocolate bankers, I assume that this >>> model is something that just happened in Europe. >> >> I'm just a poor sysadmin with expensive toy in a University ;) >> > I know, I recognized the domain. ^.^ :) >>> Without knowing the specifications for these SSDs, I can't recommend >>> them. I'd use DC S3610 or 3710 instead, this very much depends on how >>> much endurance (TPW) you need. 
>> >> As I write above, I already have those SSDs so I look for the best setup >> with the hardware I have. >> > > Unless they have at least an endurance of 3 DWPD like the 361x (and their > model number, size and the 3300 naming suggests they do NOT), your 480GB > SSDs aren't suited for intense Ceph usage. > > How much have you used them yet and what is their smartctl status, in > particular these values (from a 800GB DC S3610 in my cache pool): > --- > 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail Always > - 0 > 233 Media_Wearout_Indicator 0x0032 100 100 000Old_age Always > - 0 > 241 Host_Writes_32MiB 0x0032 100 100 000Old_age Always > - 869293 > 242 Host_Reads_32MiB0x0032 100 100 000Old_age Always > - 43435 > 243 NAND_Writes_32MiB 0x0032 100 100 000Old_age Always > - 1300884 > --- > > Not even 1% down after 40TBW, at which point your SSDs are likely to be > 15% down... More or less the same value on the 10 hosts I have on my beta cluster : 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 233252 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 13
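For reference, a quick way to turn those 32MiB-unit counters into TiB written (a rough sketch; it assumes the raw values of attributes 241/243 are reported in 32MiB units as on these Intel DC drives, and that /dev/sdX is one of the journal SSDs):

  # print host and NAND writes in TiB; the raw value is column 10 of "smartctl -A"
  smartctl -A /dev/sdX | awk '$1 == 241 || $1 == 243 { printf "%s: %.1f TiB written\n", $2, $10 * 32 / 1024 / 1024 }'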
[ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
Hello, I just upgraded my cluster to version 10.1.2 and it worked well for a while, until I saw that systemctl ceph-disk@dev-sdc1.service had failed and I re-ran it. From there the OSD stopped working. This is Ubuntu 16.04. I connected to IRC looking for help, where people pointed me to one place or another, but none of the investigations helped to resolve it. My configuration is rather simple:

root@red-compute:~# ceph osd tree
ID WEIGHT  TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.0     root default
-4 1.0         rack rack-1
-2 1.0             host blue-compute
 0 1.0                 osd.0            down        0          1.0
 2 1.0                 osd.2            down        0          1.0
-3 1.0             host red-compute
 1 1.0                 osd.1            down        0          1.0
 3 0.5                 osd.3              up      1.0          1.0
 4 1.0                 osd.4            down        0          1.0

This is what I have got so far:

1. Once upgraded, I discovered that the daemons run under the ceph user. I just ran chown on the ceph directories and it worked.
2. The firewall is fully disabled. Checked connectivity with nc and nmap.
3. The configuration seems to be right. I can post it if you want.
4. Enabling logging on the OSDs shows that, for example, osd.1 is reconnecting all the time:

   2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
   2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 c=0x556f993b3a80).fault (0) Success
   2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 ms_handle_reset con 0x556f993b3a80 session 0

5. osd.3 stays OK because it was never marked out (Ceph restriction).
6. I rebooted all services at once so that all OSDs would be available at the same time and not get marked down. Didn't work.
7. I forced them up from the command line: ceph osd in 1-5. They appear as in for a while, then out.
8. We tried ceph-disk activate-all to boot everything. Didn't work.

The strange thing is that the cluster worked just fine right after the upgrade, but the systemctl command broke both servers.

root@blue-compute:~# ceph -w
    cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
     health HEALTH_ERR
            694 pgs are stuck inactive for more than 300 seconds
            694 pgs stale
            694 pgs stuck stale
            too many PGs per OSD (1528 > max 300)
            mds cluster is degraded
            crush map has straw_calc_version=0
     monmap e10: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
            election epoch 3600, quorum 0,1 red-compute,blue-compute
      fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
     osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
      pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
            87641 MB used, 212 GB / 297 GB avail
                 694 stale+active+clean
                  70 active+clean

2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has straw_calc_version=

cat /etc/ceph/ceph.conf
[global]
fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
mon_initial_members = blue-compute, red-compute
mon_host = 172.16.0.119, 172.16.0.100
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 172.16.0.0/24
osd_pool_default_pg_num = 100
osd_pool_default_pgp_num = 100
osd_pool_default_size = 2 # Write an object 3 times.
osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.

## Required upgrade
osd max object name len = 256
osd max object namespace len = 64

[mon.]
debug mon = 9
caps mon = "allow *"

Any help on this? Any clue of what's going wrong?
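For reference, the commands behind some of those points were roughly these (a sketch; ids and paths are from my setup, adjust as needed):

  chown -R ceph:ceph /var/lib/ceph /var/log/ceph   # point 1, after the jewel upgrade
  ceph daemon osd.1 status                         # check what the daemon itself reports
  ceph osd in 0 1 2 4                              # point 7, marking the down OSDs in
  ceph-disk activate-all                           # point 8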
Best regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] inconsistencies from read errors during scrub
On Thu, 21 Apr 2016, Dan van der Ster wrote:
> On Thu, Apr 21, 2016 at 1:23 PM, Dan van der Ster wrote:
> > Hi cephalapods,
> >
> > In our couple years of operating a large Ceph cluster, every single
> > inconsistency I can recall was caused by a failed read during
> > deep-scrub. In other words, deep scrub reads an object, the read fails
> > with dmesg reporting "Sense Key : Medium Error [current]", "Add.
> > Sense: Unrecovered read error", "blk_update_request: critical medium
> > error", but the ceph-osd keeps on running and serving up data.
>
> I forgot to mention that the OSD notices the read error. In jewel it prints:
> :head got -5 on read, read_error
> So why no assert?

I think this should be controlled by a config option, similar to how it is on read (filestore_fail_eio ... although we probably want a more generic option for that, too). The danger would be that if we fail the whole OSD due to a single failed read, we might fail too many osds too quickly, and availability drops. Ideally, if we saw an EIO we would do a graceful offload (mark the osd out or reweight to 0, drop primary_affinity; and then fail the osd when we are done).

sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
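(For reference, the graceful offload Sage sketches above can already be approximated by hand today; roughly, with osd.12 standing in for the OSD that threw the read error:

  ceph osd primary-affinity osd.12 0        # reads are no longer served from the failing disk
  ceph osd out 12                           # or: ceph osd reweight 12 0 -- start draining it
  # wait for backfill to finish ("ceph -s"), then retire the OSD:
  systemctl stop ceph-osd@12
  ceph osd crush remove osd.12 && ceph auth del osd.12 && ceph osd rm 12

This is only a manual sketch of the idea, not what the proposed config option would do automatically.)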
Re: [ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
Hello, I forgot to say that the nodes are in preboot status. Something seems strange to me. root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.1 status { "cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771", "osd_fsid": "adf9890a-e680-48e4-82c6-e96f4ed56889", "whoami": 1, "state": "preboot", "oldest_map": 1764, "newest_map": 2504, "num_pgs": 323 } root@red-compute:/var/lib/ceph/osd/ceph-1# ceph daemon osd.3 status { "cluster_fsid": "9028f4da-0d77-462b-be9b-dbdf7fa57771", "osd_fsid": "8dd085d4-0b50-4c80-a0ca-c5bc4ad972f7", "whoami": 3, "state": "preboot", "oldest_map": 1764, "newest_map": 2504, "num_pgs": 150 } 3 is up and in. On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado < gaguilar.delg...@gmail.com> wrote: > Hello, > > I just upgraded my cluster to the version 10.1.2 and it worked well for a > while until I saw that systemctl ceph-disk@dev-sdc1.service was failed > and I reruned it. > > From there the OSD stopped working. > > This is ubuntu 16.04. > > I connected to the IRC looking for help where people pointed me to one or > another place but none of the investigations helped to resolve. > > My configuration is rather simple: > > oot@red-compute:~# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 1.0 root default > -4 1.0 rack rack-1 > -2 1.0 host blue-compute > 0 1.0 osd.0down0 1.0 > 2 1.0 osd.2down0 1.0 > -3 1.0 host red-compute > 1 1.0 osd.1down0 1.0 > 3 0.5 osd.3 up 1.0 1.0 > 4 1.0 osd.4down0 1.0 > > > > This is what I got sofar: > > >1. Once upgraded I discovered that daemon runs under ceph. I just ran >chown on ceph directories. and it worked. >2. Firewall is fully disabled. Checked connectivity with nc and nmap. >3. Configuration seems to be right. I can post if you want. >4. Enabling logging on OSD shows that for example osd.1 is >reconnecting all the time. > 1. 2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 > c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0 >2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 > c=0x556f993b3a80).fault (0) Success >2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 > ms_handle_reset con 0x556f993b3a80 session 0 >5. OSD.3 goes ok because never left out because ceph restriction. >6. I rebooted all services at once for it to have available all OSD at >the same time and don't mark it down. Don't work. >7. I forced up from commandline. ceph osd in 1-5. They appear as in >for a while then out. >8. We tried ceph-disk activate-all to boot everything. Don't work. > > > The strange thing is that culster started worked just right after upgrade. > But the systemctrl command broke both servers. 
> > root@blue-compute:~# ceph -w > cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 > health HEALTH_ERR > 694 pgs are stuck inactive for more than 300 seconds > 694 pgs stale > 694 pgs stuck stale > too many PGs per OSD (1528 > max 300) > mds cluster is degraded > crush map has straw_calc_version=0 > monmap e10: 2 mons at {blue-compute= > 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0} > election epoch 3600, quorum 0,1 red-compute,blue-compute > fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay} > osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs > pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects > 87641 MB used, 212 GB / 297 GB avail > 694 stale+active+clean > 70 active+clean > > 2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck > inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too > many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has > straw_calc_version= > > cat /etc/ceph/ceph.conf > [global] > > fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771 > mon_initial_members = blue-compute, red-compute > mon_host = 172.16.0.119, 172.16.0.100 > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > filestore_xattr_use_omap = true > public_network = 172.16.0.0/24 > osd_pool_default_pg_num = 100 > osd_pool_default_pgp_num = 100 > osd_pool_default_size = 2 # Write an object 3 times. > osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state. > > ## Required upgrade > osd max object name len = 256 > osd max object namespace len = 64 > > [mon.] > > de
Re: [ceph-users] Ceph OSD not going up and joining the cluster. OSD does not go up. ceph version 10.1.2
I must also add that I just found in the log the following. I don't know if this has something to do with the problem. => ceph-osd.admin.log <== 2016-05-10 18:21:46.060278 7fa8f30cc8c0 0 ceph version 10.1.2 (4a2a6f72640d6b74a3bbd92798bb913ed380dcd4), process ceph-osd, pid 14135 2016-05-10 18:21:46.060460 7fa8f30cc8c0 -1 bluestore(/dev/sdc2) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding 2016-05-10 18:21:46.062949 7fa8f30cc8c0 1 journal _open /dev/sdc2 fd 4: 5367660544 bytes, block size 4096 bytes, directio = 0, aio = 0 2016-05-10 18:21:46.062991 7fa8f30cc8c0 1 journal close /dev/sdc2 2016-05-10 18:21:46.063026 7fa8f30cc8c0 0 probe_block_device_fsid /dev/sdc2 is filestore, 119a9f4e-73d8-4a1f-877c-d60b01840c96 2016-05-10 18:21:47.072082 7eff735598c0 0 ceph version 10.1.2 (4a2a6f72640d6b74a3bbd92798bb913ed380dcd4), process ceph-osd, pid 14177 2016-05-10 18:21:47.072285 7eff735598c0 -1 bluestore(/dev/sdf2) _read_bdev_label unable to decode label at offset 66: buffer::malformed_input: void bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end of struct encoding 2016-05-10 18:21:47.074799 7eff735598c0 1 journal _open /dev/sdf2 fd 4: 5367660544 bytes, block size 4096 bytes, directio = 0, aio = 0 2016-05-10 18:21:47.074844 7eff735598c0 1 journal close /dev/sdf2 2016-05-10 18:21:47.074881 7eff735598c0 0 probe_block_device_fsid /dev/sdf2 is filestore, fd069e6a-9a62-4286-99cb-d8a523bd946a r On Tue, May 10, 2016 at 6:07 PM, Gonzalo Aguilar Delgado < gaguilar.delg...@gmail.com> wrote: > Hello, > > I just upgraded my cluster to the version 10.1.2 and it worked well for a > while until I saw that systemctl ceph-disk@dev-sdc1.service was failed > and I reruned it. > > From there the OSD stopped working. > > This is ubuntu 16.04. > > I connected to the IRC looking for help where people pointed me to one or > another place but none of the investigations helped to resolve. > > My configuration is rather simple: > > oot@red-compute:~# ceph osd tree > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 1.0 root default > -4 1.0 rack rack-1 > -2 1.0 host blue-compute > 0 1.0 osd.0down0 1.0 > 2 1.0 osd.2down0 1.0 > -3 1.0 host red-compute > 1 1.0 osd.1down0 1.0 > 3 0.5 osd.3 up 1.0 1.0 > 4 1.0 osd.4down0 1.0 > > > > This is what I got sofar: > > >1. Once upgraded I discovered that daemon runs under ceph. I just ran >chown on ceph directories. and it worked. >2. Firewall is fully disabled. Checked connectivity with nc and nmap. >3. Configuration seems to be right. I can post if you want. >4. Enabling logging on OSD shows that for example osd.1 is >reconnecting all the time. > 1. 2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 > c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0 >2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 > >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 > c=0x556f993b3a80).fault (0) Success >2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 > ms_handle_reset con 0x556f993b3a80 session 0 >5. OSD.3 goes ok because never left out because ceph restriction. >6. I rebooted all services at once for it to have available all OSD at >the same time and don't mark it down. Don't work. >7. I forced up from commandline. ceph osd in 1-5. They appear as in >for a while then out. >8. We tried ceph-disk activate-all to boot everything. 
Don't work. > > > The strange thing is that culster started worked just right after upgrade. > But the systemctrl command broke both servers. > > root@blue-compute:~# ceph -w > cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 > health HEALTH_ERR > 694 pgs are stuck inactive for more than 300 seconds > 694 pgs stale > 694 pgs stuck stale > too many PGs per OSD (1528 > max 300) > mds cluster is degraded > crush map has straw_calc_version=0 > monmap e10: 2 mons at {blue-compute= > 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0} > election epoch 3600, quorum 0,1 red-compute,blue-compute > fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay} > osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs > pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects > 87641 MB used, 212 GB / 297 GB avail > 694 stale+active+clean > 70 active+clean > > 2016-05-10 17:03:55.822440 mon.
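(Regarding the _read_bdev_label messages above: that looks like the fsid probe trying bluestore first before concluding the partition is filestore, so it is probably noise rather than the cause. A quick way to double-check how those partitions are seen, as a sketch:

  ceph-disk list            # shows which partition is data / journal and for which OSD
  sgdisk -i 2 /dev/sdc      # partition type GUID for /dev/sdc2
)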
Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on lockfile
On Tue, May 10, 2016 at 6:48 AM, Nick Fisk wrote: > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Nick Fisk >> Sent: 10 May 2016 13:30 >> To: 'Eric Eastman' >> Cc: 'Ceph Users' >> Subject: Re: [ceph-users] CephFS + CTDB/Samba - MDS session timeout on >> lockfile >> > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk wrote: >> > > Hi Eric, >> > > >> > >> >> > >> I am trying to do some similar testing with SAMBA and CTDB with the >> > >> Ceph file system. Are you using the vfs_ceph SAMBA module or are >> > >> you kernel mounting the Ceph file system? >> > > >> > > I'm using the kernel client. I couldn't find any up to date >> > > information on if >> > the vfs plugin supported all the necessary bits and pieces. >> > > >> > > How is your testing coming along? I would be very interested in any >> > findings you may have come across. >> > > >> > > Nick >> > >> > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When >> from >> > a SAMBA client, I write a large file (about 2GB) to a gateway that is >> > not the holder of the CTDB lock file, and then kill that gateway >> > server during the write, the IP failover works as expected, and in >> > most cases the file ends up being the correct size after the new >> > server finishes writing it, but the data is corrupt. The data in the > file, from >> the point of the failover, is all zeros. >> > >> > I thought the issue may be with the kernel mount, so I looked into >> > using the SAMBA vfs_ceph module, but I need SAMBA with AD support >> and >> > the current vfs_ceph module, even in the SAMBA git master version, is >> > lacking ACL support for CephFS, as the vfs_ceph.c patches summited to >> > the SAMBA mail list are not yet available. See: >> > https://lists.samba.org/archive/samba-technical/2016-March/113063.html >> > >> > I tried using a FUSE mount of the CephFS, and it also fails setting > ACLs. See: >> > http://tracker.ceph.com/issues/15783. >> > >> > My current status is IP failover is working, but I am seeing data >> > corruption on writes to the share when using kernel mounts. I am also >> > seeing the issue you reported when I kill the system holding the CTDB >> > lock file. Are you verifying your data after each failover? >> >> I must admit you are slightly ahead of me. I was initially trying to just > get >> hard/soft failover working correctly. But your response has prompted me to >> test out the scenario you mentioned. I'm seeing slightly different > results, my >> copy seems to error out when I do a node failover. I'm copying an ISO from > a >> 2008 server to the CTDB/Samba share and when I reboot the active node, >> the copy pauses for a couple of seconds and then comes up with the error >> box. Clicking try again several times doesn't let it resume. I need to do > a bit >> more digging to try and work out why this is happening. The share itself > does >> seem to be in a working state when trying to click the try again button, > so >> there is probably some sort of state/session problem. >> >> Do you have multiple vip's configured on your cluster or just a single IP? > I >> have just the one at the moment. I have 4 HA addresses setup, and I am using my AD to do the round-robin DNS. The moving of IP addresses on failure or when a CTDB controlled SAMBA system comes on line works great. 
> > Just to add to this, I have just been reading this article > > https://nnc3.com/mags/LM10/Magazine/Archive/2009/105/030-035_SambaHA/article > .html > > And the following paragraph seems to indicate that what I am seeing is the > correct behaviour? I 'm wondering if this is not happening in your case and > is why you are getting corruption? > > "It is important to understand that load balancing and client distribution > over the client nodes are connection oriented. If an IP address is switched > from one node to another, all the connections actively using this IP address > are dropped and the clients have to reconnect. > > To avoid delays, CTDB uses a trick: When an IP is switched, the new CTDB > node "tickles" the client with an illegal TCP ACK packet (tickle ACK) > containing an invalid sequence number of 0 and an ACK number of 0. The > client responds with a valid ACK packet, allowing the new IP address owner > to close the connection with an RST packet, thus forcing the client to > reestablish the connection to the new node." > Nice article. I have been trying to figure out if data integrity is supported with CTDB on failover on any shared file system. From looking at various email posts on CTDB+GPFS, it looks like it may work, so I am going to continue to test it with various CephFS configurations. There is a new "witness protocol" in SMB3 to support failover, that is not yet supported in any released versions of SAMBA. I may have to wait for it to be implemented in SAMBA to get fully working failover. See: https://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
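For anyone who ends up trying the vfs_ceph route once the ACL patches land, a minimal share stanza might look roughly like this (a sketch only; the share name, path and cephx user are assumptions):

  [dataset]
      # path is interpreted inside the CephFS root
      path = /export
      vfs objects = ceph
      ceph:config_file = /etc/ceph/ceph.conf
      # cephx user the gateway authenticates as (assumed to exist)
      ceph:user_id = samba
      # there is no kernel file descriptor behind vfs_ceph
      kernel share modes = no
      read only = no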
Re: [ceph-users] Erasure pool performance expectations
Thanks Nick. I added it to my ceph.conf. I'm guessing this is an OSD setting and therefor I should restart my OSDs is that correct? On Tue, May 10, 2016 at 3:48 PM, Nick Fisk wrote: > > > > -Original Message- > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > > Peter Kerdisle > > Sent: 10 May 2016 14:37 > > Cc: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Erasure pool performance expectations > > > > To answer my own question it seems that you can change settings on the > fly > > using > > > > ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880' > > osd.0: osd_tier_promote_max_bytes_sec = '5242880' (unchangeable) > > > > However the response seems to imply I can't change this setting. Is > there an > > other way to change these settings? > > Sorry Peter, I missed your last email. You can also specify that setting > in the ceph.conf, ie I have in mine > > osd_tier_promote_max_bytes_sec = 400 > > > > > > > > > On Sun, May 8, 2016 at 2:37 PM, Peter Kerdisle > > > wrote: > > Hey guys, > > > > I noticed the merge request that fixes the switch around here > > https://github.com/ceph/ceph/pull/8912 > > > > I had two questions: > > > > • Does this effect my performance in any way? Could it explain the slow > > requests I keep having? > > • Can I modify these settings manually myself on my cluster? > > Thanks, > > > > Peter > > > > > > On Fri, May 6, 2016 at 9:58 AM, Peter Kerdisle > > > wrote: > > Hey Mark, > > > > Sorry I missed your message as I'm only subscribed to daily digests. > > > > Date: Tue, 3 May 2016 09:05:02 -0500 > > From: Mark Nelson > > To: ceph-users@lists.ceph.com > > Subject: Re: [ceph-users] Erasure pool performance expectations > > Message-ID: > > Content-Type: text/plain; charset=windows-1252; format=flowed > > In addition to what nick said, it's really valuable to watch your cache > > tier write behavior during heavy IO. One thing I noticed is you said > > you have 2 SSDs for journals and 7 SSDs for data. > > > > I thought the hardware recommendations were 1 journal disk per 3 or 4 > data > > disks but I think I might have misunderstood it. Looking at my journal > > read/writes they seem to be ok > > though: https://www.dropbox.com/s/er7bei4idd56g4d/Screenshot%202016- > > 05-06%2009.55.30.png?dl=0 > > > > However I started running into a lot of slow requests (made a separate > > thread for those: Diagnosing slow requests) and now I'm hoping these > could > > be related to my journaling setup. > > > > If they are all of > > the same type, you're likely bottlenecked by the journal SSDs for > > writes, which compounded with the heavy promotions is going to really > > hold you back. > > What you really want: > > 1) (assuming filestore) equal large write throughput between the > > journals and data disks. > > How would one achieve that? > > > > 2) promotions to be limited by some reasonable fraction of the cache > > tier and/or network throughput (say 70%). This is why the > > user-configurable promotion throttles were added in jewel. > > Are these already in the docs somewhere? > > > > 3) The cache tier to fill up quickly when empty but change slowly once > > it's full (ie limiting promotions and evictions). No real way to do > > this yet. > > Mark > > > > Thanks for your thoughts. > > > > Peter > > > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
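For reference, the two ways of applying the promotion throttles discussed above (a sketch; the values are just the examples from this thread, not recommendations):

  # persistent: add to the [osd] section of ceph.conf, then restart OSDs one at a time
  [osd]
  osd_tier_promote_max_bytes_sec = 5242880
  osd_tier_promote_max_objects_sec = 50

  # runtime, as already shown above:
  ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 5242880'
  # restarting a single OSD under systemd (jewel):
  systemctl restart ceph-osd@0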
[ceph-users] Adding an OSD to existing Ceph using ceph-deploy
All, I am trying to add another OSD to our cluster using ceph-deploy. This is running Jewel. I previously set up the other 12 OSDs on a fresh install using the command: ceph-deploy osd create :/dev/mapper/mpath:/dev/sda Those are all up and happy. On the systems /dev/sda is an SSD which I have created partitions on for journals. It seems to prepare everything fine (ceph-deploy osd prepare ceph-1-35a:/dev/mapper/mpathn:/dev/sda8), but when it comes time to activate, I am getting a Traceback: [2016-05-10 11:27:58,195][ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.2.1511 Core [2016-05-10 11:27:58,195][ceph_deploy.osd][DEBUG ] activating host ceph-1-35a disk /dev/mapper/mpathn [2016-05-10 11:27:58,195][ceph_deploy.osd][DEBUG ] will use init type: systemd [2016-05-10 11:27:58,196][ceph-1-35a][INFO ] Running command: ceph-disk -v activate --mark-init systemd --mount /dev/mapper/mpathn [2016-05-10 11:27:58,315][ceph-1-35a][WARNING] main_activate: path = /dev/mapper/mpathn [2016-05-10 11:27:58,315][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid path is /sys/dev/block/253:8/dm/uuid [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid is mpath-360001ff09070e00c8921000c [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid path is /sys/dev/block/253:8/dm/uuid [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] get_dm_uuid: get_dm_uuid /dev/mapper/mpathn uuid is mpath-360001ff09070e00c8921000c [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] Traceback (most recent call last): [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] File "/usr/sbin/ceph-disk", line 9, in [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')() [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4964, in run [2016-05-10 11:27:58,316][ceph-1-35a][WARNING] main(sys.argv[1:]) [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 4915, in main [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] args.func(args) [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3269, in main_activate [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] reactivate=args.reactivate, [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2979, in mount_activate [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] e, [2016-05-10 11:27:58,317][ceph-1-35a][WARNING] ceph_disk.main.FilesystemTypeError: Cannot discover filesystem type: device /dev/mapper/mpathn: Line is truncated: [2016-05-10 11:27:58,318][ceph-1-35a][ERROR ] RuntimeError: command returned non-zero exit status: 1 [2016-05-10 11:27:58,318][ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init systemd --mount /dev/mapper/mpathn This seems to be due to the command: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn is being run instead of: /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn1 Anyone have ideas on how to get these happy? 
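For reference, this is the comparison I mean (a sketch; depending on the multipath configuration the partition node may be named mpathn1 or mpathn-part1):

  # what ceph-disk runs (the whole multipath device, which has no filesystem):
  /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn
  # what I would expect it to probe (the data partition):
  /sbin/blkid -p -s TYPE -o value -- /dev/mapper/mpathn1
  # if the partition node is missing, kpartx can (re)create the device-mapper entries:
  kpartx -a /dev/mapper/mpathn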
Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Performance during disk rebuild - MercadoLibre
Hello All,

I'm writing to you because I'm trying to find a way to rebuild an OSD disk without impacting the performance of the cluster, because my applications are very latency sensitive.

1. I found a way to reuse an OSD ID and not rebalance the cluster every time I lose a disk, so my cluster is running with the noout flag set permanently. The point here is to do the disk change as fast as I can.

2. After reusing the OSD ID, I'm leaving the OSD up and running, but with zero weight. For example:

root@DC4-ceph03-dn03:/var/lib/ceph/osd/ceph-352# ceph osd tree | grep 352
352 1.81999 osd.352 up 0 1.0

At this point everything is good.

3. Starting the reweight, using "osd reweight" I'm not touching the crushmap, and I'm doing the reweight very gradually. Example:

ceph osd reweight 352 0.001

But even doing the reweight this way I'm still hurting latency sometimes. Depending on the amount of PGs the cluster is recovering, the impact is worse.

Tunings that I have already done:

ceph tell osd.* injectargs "--osd_max_backfills 1"
ceph tell osd.* injectargs "--osd_recovery_max_active 1"
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'

The question is: are there more parameters to change in order to make the OSD rebuild more gradual?

I really appreciate your help, thanks in advance.

Agustin Trolli
Storage Team
Mercadolibre.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
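The gradual reweight in step 3 of the message above can be scripted roughly like this (a sketch; the OSD id, the list of weight steps and the polling interval are only examples):

  #!/bin/bash
  # raise the reweight of osd.352 step by step, waiting for recovery to settle in between
  osd=352
  for w in 0.001 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
      ceph osd reweight $osd $w
      # wait until no recovery/backfill/degraded activity is reported any more
      while ceph health | grep -Eq 'recover|backfill|degraded'; do sleep 60; done
  done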
Re: [ceph-users] Performance during disk rebuild - MercadoLibre
Hello, As far as I know and can tell, you're doing everything that is possible for having a least impact OSD rebuild/replacement. If your cluster is still strongly, adversely impacted by this gradual and throttled approach, how about the following things: 1. Does scrub or deep_scrub also impact your performance so that your applications notice it? 2. Are there times when other cluster activity (like reboots or installs of new VMs, other large data movements created by clients) impacts your applications? If both or either of these are true, your cluster is at the limit of its capacity. And in general, a rebuild with throttled parameters like yours (and many others, including me) should not hurt things. If it does, it's time to improve your cluster performance. 1. Adding journal SSDs if not present already. 2. Adding more OSDs in general. 3. Adding a cache tier, this is particular effective if your latency sensitive applications do small writes or reads that easily fit into the cache. I was in a similar situation with hundreds of VMs running an application that had latency sensitive small writes and adding a cache tier completely solved the problem. Regards, Christian On Tue, 10 May 2016 16:30:00 -0300 Agustín Trolli wrote: > Hello All, > I´m writing to you because i´m trying to find the way to rebuild a osd > disk in a way to don´t impact the performance of the cluster. > That´s because my applications are very latency sensitive. > > 1_ I found the way to reuse a OSD ID and don´t rebalance the cluster > every time that I lost a disk. > So, my cluster is running with the noout check forever. > The point here is do the disk change as fast I can. > > 2_ after reuse de OSD ID, I´m living the OSD up and running, but with > CERO weight. > For example: > > root@DC4-ceph03-dn03:/var/lib/ceph/osd/ceph-352# ceph osd tree | grep 352 > *352 1.81999 osd.352 up0 > 1.0* > > At this point everything is good. > > 3_ Starting the reweight, using "osd reweigh" i´m not touching the > crushmap, and I´m doing the reweight very gradually. > Example: > *ceph osd reweight 352 0.001* > > But, anyway doing the reweight in this way i´m heating the latency > sometimes. > Depending of the amount of PGs that the cluster is recovering the impact > is worst. > > Tunings that I already have done: > > ceph tell osd.* injectargs "--osd_max_backfills 1" > ceph tell osd.* injectargs "--osd_recovery_max_active 1" > ceph tell osd.* injectargs '--osd-max-recovery-threads 1' > ceph tell osd.* injectargs '--osd-recovery-op-priority 1' > ceph tell osd.* injectargs '--osd-client-op-priority 63' > > The question is, there are more parameters to change in order to do more > gradually the OSD rebuild? > > I really appreciate your help, thanks in advance. > > Agustin Trolli > Storage Team > Mercadolibre.com -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
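If point 1 above turns out to be true (scrubs alone already hurt the latency sensitive applications), these scrub knobs are usually worth testing before anything else (a sketch; the values are examples only, and the begin/end hour options exist only on reasonably recent versions):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'           # pause between scrub chunks
  ceph tell osd.* injectargs '--osd_scrub_load_threshold 2'    # skip scrubs above this loadavg
  ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 6'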
Re: [ceph-users] journal or cache tier on SSDs ?
On Tue, 10 May 2016 17:51:24 +0200 Yoann Moulin wrote: [snip] > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > >>> > >>> These SSDs do not exist according to the Intel site and the only > >>> references I can find for them are on "no longer available" European > >>> sites. > >> > >> I made a mistake, it's not 400 but 480GB, smartctl give me Model > >> SSDSC2BB480H4 > >> > > OK, that's not good. > > Firstly, that model number still doesn't get us any hits from Intel, > > strangely enough. > > > > Secondly, it is 480GB (instead of 400, which denotes overprovisioning) > > and matches the 3510 480GB model up to the last 2 characters. > > And that has an endurance of 275TBW, not something you want to use for > > either journals or cache pools. > > I see, here the information from the resseler : > > "The S3300 series is the OEM version of S3510 and 1:1 the same drive" > Given the SMART output below, it seems to be 3500 based, but that doesn't change things. > >>> Without knowing the specifications for these SSDs, I can't recommend > >>> them. I'd use DC S3610 or 3710 instead, this very much depends on how > >>> much endurance (TPW) you need. > >> > >> As I write above, I already have those SSDs so I look for the best > >> setup with the hardware I have. > >> > > > > Unless they have at least an endurance of 3 DWPD like the 361x (and > > their model number, size and the 3300 naming suggests they do NOT), > > your 480GB SSDs aren't suited for intense Ceph usage. > > > > How much have you used them yet and what is their smartctl status, in > > particular these values (from a 800GB DC S3610 in my cache pool): > > --- > > 232 Available_Reservd_Space 0x0033 100 100 010Pre-fail > > Always - 0 233 Media_Wearout_Indicator 0x0032 100 > > 100 000Old_age Always - 0 241 > > Host_Writes_32MiB 0x0032 100 100 000Old_age > > Always - 869293 242 Host_Reads_32MiB0x0032 100 > > 100 000Old_age Always - 43435 243 > > NAND_Writes_32MiB 0x0032 100 100 000Old_age > > Always - 1300884 --- > > > > Not even 1% down after 40TBW, at which point your SSDs are likely to be > > 15% down... > > More or less the same value on the 10 hosts I have on my beta cluster : > > 232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0 > 233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0 > 241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 233252 > 242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 13 > >From the read count it's obvious that you used those as journals. ^.^ As I hinted above, if these were 3510 based they also should have the 243 attribute, as in my 3610 example. You may want to upgrade your smartctl and/or it's definition DB (on Debian that can be done with "update-smart-drivedb"). Intel's calculation of the media wearout always seems to be very fuzzy to me, given your 7TB written I'd expect it to be 98%, at least 99%. But then again a 200GB DC S3700 of mine has written 90TB out of 3650TB total and is at 99%, when I would expect it to be at 98%. Either way, those SSDs are designed for 275TBW (or 0.3 DWPD), and if they are used as journals they will expire quickly when those 100TB+ datasets get updated. They _might_ survive longer with a very carefully tuned cache tier (promote only really hot objects), but the risk of loosing SSDs there can be even higher than with journals. 
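To put a rough number on "expire quickly" (all figures assumed purely for illustration):

  # journal wear-out estimate for a 275 TBW drive:
  #   sustained ingest per node : 200 MB/s  (well below what 10 HDDs can sink)
  #   journal SSDs per node     : 2         -> ~100 MB/s written to each SSD
  #   100 MB/s * 86400 s/day    = ~8.6 TB/day through each journal
  #   275 TBW / 8.6 TB/day      = ~32 days of continuous writing at that rate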
[snap] Regards, Christian -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] journal or cache tier on SSDs ?
Hi, If we have 12 SATA disks, each 4TB as storage pool. Then how many SSD disks we should have for cache tier usage? thanks. 2016-05-10 16:40 GMT+08:00 Yoann Moulin : > Hello, > > I'd like some advices about the setup of a new ceph cluster. Here the use > case : > > RadowGW (S3 and maybe swift for hadoop/spark) will be the main usage. Most > of > the access will be in read only mode. Write access will only be done by the > admin to update the datasets. > > We might use rbd some time to sync data as temp storage (when POSIX is > needed) > but performance will not be an issue here. We might use cephfs in the > futur if > that can replace a filesystem on rdb. > > We gonna start with 16 nodes (up to 24). The configuration of each node is > : > > CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/48t) > Memory : 128GB > OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1) > Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no Raid) > OSD Disk : 10 x HGST ultrastar-7k6000 6TB > Public Network : 1 x 10Gb/s > Private Network : 1 x 10Gb/s > OS : Ubuntu 16.04 > Ceph version : Jewel > > The question is : journal or cache tier (read only) on the SD 400GB Intel > S3300 DC ? > > Each disk is able to write sequentially at 220MB/s. SSDs can write at > ~500MB/s. > if we set 5 journals on each SSDs, SSD will still be the bottleneck (1GB/s > vs > 2GB/s). If we set the journal on OSDs, we can expect a good throughput in > read > on the disk (in case of data not in the cache) and write shouldn't be so > bad > too, even if we have random read on the OSD during the write ? > > SSDs as cache tier seem to be a better usage than only 5 journal on each ? > Is > that correct ? > > We gonna use an EC pool for big files (jerasure 8+2 I think) and a > replicated > pool for small files. > > If I check on http://ceph.com/pgcalc/, in this use case > > replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs > Ec pool : pg_num = 4096 > and pgp_num = pg_num > > Should I set the pg_num to 8192 or 16384 ? what is the impact on the > cluster if > we set the pg_num to 16384 at the beginning ? 16384 is high, isn't it ? > > Thanks for your help > > -- > Yoann Moulin > EPFL IC-IT > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] journal or cache tier on SSDs ?
Hello,

On Wed, 11 May 2016 11:24:29 +0800 Geocast Networks wrote:
> Hi,
>
> If we have 12 SATA disks, each 4TB as storage pool.
> Then how many SSD disks we should have for cache tier usage?
>
That question makes no sense.

Firstly, you mentioned earlier that you have 21 of those hosts, which would be a significant factor when trying to determine cache-tier sizes, as it gives an idea of your overall storage needs. But the size of the cache-tier would totally depend on your use case and how big your hot data is. Nobody can answer that for you. A cache tier may also make no (financial) sense for you; speeding up things with SSD journals currently is the best first step.

Secondly, what I think you mean is the number of SSDs for JOURNAL usage, which is something completely different. You will want to read up more on Ceph concepts and explore the ML archives.

That said, 12 HDDs will be able to write about 1GB/s in total, so your journal SSDs should be around that (sequential write) speed as well. And they should be DC level SSDs (Intel DC S or respective Samsung models), with medium (3 DWPD) to large (10 DWPD) endurance.

Normally you will also want to avoid putting too many journals on one SSD, as a failure of the SSD will kill all associated HDD OSDs. However as you have 21 hosts and hopefully decent redundancy and distribution (CRUSH map), going with 2 SSDs (6 journals per SSD) should be fine.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
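For completeness, with 2 SSDs and 6 journals each, the placement described above usually ends up being expressed like this with ceph-deploy (a sketch; the hostname and device names are placeholders, and one journal partition per OSD is carved out of the SSD):

  ceph-deploy osd create host1:/dev/sdc:/dev/sdm   # data on HDD sdc, journal partition on SSD sdm
  ceph-deploy osd create host1:/dev/sdd:/dev/sdm
  ...                                              # six HDDs per journal SSD
  ceph-deploy osd create host1:/dev/sdi:/dev/sdn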
[ceph-users] rbd resize option
Hello, I wanted to resize an image using the 'rbd resize' option, but without data loss. For example: I have an image of 100 GB (thin provisioned), and this image has only 10GB of data. I want to resize this image to 11GB, so that the 10GB of data is safe and the image is resized. Can I do the above resize safely? If I try to resize to 5GB, does rbd throw an error saying that data will be lost, something like that? Any inputs here are appreciated. Thanks Swami ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
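For reference: growing an image never touches existing data, while shrinking simply truncates the image at the new size. rbd tracks offsets, not "used data", so the filesystem inside the image has to be shrunk below the new size first, and recent rbd versions refuse to shrink unless told explicitly. A sketch (pool/image names are placeholders, sizes are in MB):

  rbd resize --size 204800 rbdpool/myimage                  # growing (here to 200 GB) is safe
  rbd resize --size 11264 rbdpool/myimage --allow-shrink    # shrinking to 11 GB must be forced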
[ceph-users] wrong exit status if bucket already exists
Hi, I am using infernalis 9.2.1. While creating a bucket, if the bucket already exists, it still returns 0 as the exit status. Is this intentional for some reason, or a bug?

root@node1:~# ceph osd crush add-bucket rack1 rack
bucket 'rack1' already exists
root@node1:~# echo $?
0
root@node1:~#

-- Swapnil ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
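Until/unless that exit status changes, one way to script around it is an explicit existence check (a sketch):

  # only attempt the add when rack1 is not already in the crush map
  ceph osd crush dump | grep -q '"name": "rack1"' || ceph osd crush add-bucket rack1 rack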
Re: [ceph-users] Weighted Priority Queue testing
+ceph users

Hi, here is the first cut result. I can only manage a 128TB box for now.

Ceph code base | Capacity      | Drive capacity | Compute-nodes | Copies | Data set | Failure domain | Fault injected    | Degraded PGs | Full recovery time | Last 1% recovery time
Hammer         | 2X128TB IF150 | 8TB            | 2             | 2      | ~80TB    | Chassis        | One OSD node down | ~20%         | ~24 hours          | ~3-4 hours
Hammer         | 2X128TB IF150 | 8TB            | 4             | 2      | ~80TB    | Chassis        | One OSD node down | ~10%         | 10 hours 3 min     | ~3 hours
Hammer         | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 7 hours 5 min      | ~2.5 hours
Jewel          | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 6 hours 10 min     | ~1 hour 30 min
Jewel + wpq    | 2X128TB IF150 | 4TB            | 4             | 2      | ~100TB   | Chassis        | One OSD node down | ~12.5%       | 8 hours 30 min     | ~4 hours 30 min

Summary:

1. The first scenario is the only 4-node scenario, and since it is chassis level replication the single node remaining on the chassis takes all the traffic. That seems to be a bottleneck, as with host level replication on a similar setup the recovery time is much less (data not in this table).
2. In the second scenario I kept everything else the same but doubled the nodes per chassis. Recovery time is also halved.
3. For the third scenario I increased the cluster data and also doubled the number of OSDs in the cluster (since each drive is 4TB now). Recovery time came down further.
4. Moved to Jewel keeping everything else the same, got further improvement. Mostly because of improved write performance in Jewel (?).
5. The last scenario is interesting. I got better recovery speed than in any other scenario with WPQ: degraded PG % came down to 2% within 3 hours and ~0.6% within 4 hours 15 min, but the last 0.6% took ~4 hours, hurting the overall recovery time.
6. In fact, this long tail is hurting the overall recovery time in all the other scenarios as well. Related tracker I found is http://tracker.ceph.com/issues/15763

Any feedback much appreciated. We can discuss this in tomorrow's performance call if needed.

Thanks & Regards
Somnath

-Original Message-
From: Somnath Roy Sent: Wednesday, May 04, 2016 11:47 AM To: 'Mark Nelson'; Nick Fisk; Ben England; Kyle Bader Cc: Sage Weil; Samuel Just Subject: RE: Weighted Priority Queue testing

Thanks Mark, I will come back to you with some data on that. This is what I am planning to run.

1. One 2X IF150 chassis with 256 TB flash each and total 8 node cluster (4 servers on each). Will generate ~100TB of data on the cluster.
2. Will go for host and chassis level replication with 2 copies.
3. Heavy IO will be on (different block sizes 60% RW and 40% RR)

Hammer took me ~4 hours to complete recovery for a host level replication and single host down. ~12 hours when single host down with chassis level replication. Bear with me till I find all the HW for this :-) Let me know if you guys want to add something here..

Regards
Somnath

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com] Sent: Wednesday, May 04, 2016 8:40 AM To: Somnath Roy; Nick Fisk; Ben England; Kyle Bader Cc: Sage Weil; Samuel Just Subject: Weighted Priority Queue testing

Hi Guys, I think all of you have expressed some interest in recovery testing either now or in the past, so I wanted to get folks together to talk.
We need to get the new weighted priority queue tested to: a) see when/how it's breaking b) hopefully see better behavior It's available in Jewel through a simple ceph.conf change: osd_op_queue = wpq For those of you who have run cbt recovery tests in the past, it might be worth running some new stress tests comparing: a) jewel + wpq b) jewel + prio queue c) hammer In the past I've done this under various concurrent client workloads (say large sequential or small random writes). I think Kyle has done quite a bit of this kind of testing in the recent past with Intel as well, so he might have some insights as to where we've been hurting recently. Thanks, Mark PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
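(As an aside, for anyone reproducing the long-tail measurements in the table above, a crude way to log the degraded percentage over time during a recovery run, as a sketch:

  # append a timestamped copy of the degraded line from "ceph -s" every minute
  while sleep 60; do
      echo "$(date +%s) $(ceph -s | grep degraded)"
  done >> degraded.log
)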