Re: [ceph-users] Ceph Performance vs Entry Level San Arrays

2016-06-22 Thread Oliver Dzombic
Hi Denver,

It's like Christian said. On top of that, I would add that iSCSI is
a more native protocol. You don't have to go through as many
layers as you do -per design- with a software-defined storage.

So you can generally expect better performance from hardware-accelerated
iSCSI.

If you are looking for a cost-efficient solution, take Ceph.

If you have the money for this HP gear, and you are sure your needs
will never grow beyond its

(199) SFF-SAS/MDL SAS/SSD or
96 LFF SAS/MDL SAS

disk limit, then take the HP gear.



Another reason for the decision:

For sure, taking this HP gear means taking a solution which will
natively work with everything ( iSCSI ).

You will also have commercial support behind you ( if you buy it new /
with a support package ).

So, if you are not familiar with Ceph, and want to run important
data/services on your storage, maybe you should pick a solution which
comes with vendor support and less of a "configure everything yourself" approach.

The upside of that is, of course, that you have fewer possibilities to
kill your services yourself.
The downside is that you have very limited possibilities to
tweak and also debug your services.

If it works, fine. If not, you might end up being forced to wait for the
vendor support and just hope they will react fast and solve your case.
With a commercial product like this, the UI/utils/management
tooling is usually well matured, so the chance
that something goes wrong is probably not too big.

But if something does go wrong, all you can do is wait for help.

With Ceph and Linux, on the other hand, you usually have the chance to debug/do
something yourself ( IF you have the knowledge ).
If not, and if you don't have time to acquire that knowledge, then putting
productive data on Ceph, or any other system you don't know, is
just a BIG invitation for Murphy to give your life a nice kick :)

-

So, as usual: either you have the knowledge yourself and are able to

1. save money
2. debug stuff yourself

or you pay someone else to have it, and this way, if you picked the
right partner, you are on the safe(r) side that things will not get messy.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 22.06.2016 um 01:09 schrieb Denver Williams:
> HP MSA 2040 10GBe
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tiering with Same Cache Pool

2016-06-22 Thread Lazuardi Nasution
Hi Christian,

If I have several cache pools on the same SSD OSDs (by using the same ruleset),
those cache pools always show the same Max Available in the "ceph df detail"
output. What should I put into target_max_bytes of the cache tiering
configuration for each cache pool? Should it be the same and use the Max Available
size? If different, how can I know whether one cache pool needs more size than
another?

Best regards,

Date: Mon, 20 Jun 2016 09:34:05 +0900
> From: Christian Balzer 
> To: ceph-users@lists.ceph.com
> Cc: Lazuardi Nasution 
> Subject: Re: [ceph-users] Cache Tiering with Same Cache Pool
> Message-ID: <20160620093405.732f5...@batzmaru.gol.ad.jp>
> Content-Type: text/plain; charset=US-ASCII
>
> On Mon, 20 Jun 2016 00:14:55 +0700 Lazuardi Nasution wrote:
>
> > Hi,
> >
> > Is it possible to do cache tiering for some storage pools with the same
> > cache pool?
>
> As mentioned several times on this ML, no.
> There is a strict 1:1 relationship between base and cache pools.
> You can of course (if your SSDs/NVMes are large and fast enough) put more
> than one cache pool on them.
>
> > What will happen if cache pool is broken or at least doesn't
> > meet quorum when storage pool is OK?
> >
> With a read-only cache pool nothing should happen, as all writes are going
> to the base pool.
>
> In any other mode (write-back, read-forward or read-proxy) your hottest
> objects are likely to be ONLY on the cache pool and never getting flushed
> to the base pool.
> So that means, if your cache pool fails, your cluster is essentially dead
> or at the very least has suffered massive data loss.
>
> Something to very much think about when doing cache tiering.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance issue with jewel on ubuntu xenial (kernel)

2016-06-22 Thread Yoann Moulin
Hello Florian,

> On Tue, Jun 21, 2016 at 3:11 PM, Yoann Moulin  wrote:
>> Hello,
>>
>> I found a performance drop between kernel 3.13.0-88 (default kernel on Ubuntu
>> Trusty 14.04) and kernel 4.4.0.24.14 (default kernel on Ubuntu Xenial 16.04)
>>
>> ceph version is Jewel (10.2.2).
>> All tests have been done under Ubuntu 14.04
> 
> Knowing that you also have an infernalis cluster on almost identical
> hardware, can you please let the list know whether you see the same
> behavior (severely reduced throughput on a 4.4 kernel, vs. 3.13) on
> that cluster as well?

ceph version is infernalis (9.2.0)

Ceph osd Benchmark:

Kernel 3.13.0-88-generic : ceph tell osd.ID => average ~84MB/s
Kernel 4.2.0-38-generic  : ceph tell osd.ID => average ~90MB/s
Kernel 4.4.0-24-generic  : ceph tell osd.ID => average ~75MB/s

The slowdown is not as big as the one I see with Jewel, but it is still present.

Best Regards,

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph deployment

2016-06-22 Thread Fran Barrera
Hi all,

I have a couple of questions about the deployment of Ceph.


This is what I plan:

Private Net - 10.0.0.0/24
Public Net - 192.168.1.0/24

Ceph server:
 - eth1: 192.168.1.67
 - eth2: 10.0.0.67

Openstack server:
 - eth1: 192.168.1.65
 - eth2: 10.0.0.65


 ceph.conf
  - mon_host: 10.0.0.67
  - cluster_network - 10.0.0.0/24
  - public_network - 192.168.1.0/24

Now, I have some doubts:
 - If I configure Ceph with this configuration, could I connect to Ceph
from a client on the public net? I ask because mon_host is
10.0.0.67 in ceph.conf.
 - The private net was created for OpenStack, but I wonder whether I can use this
net for the Ceph cluster network or whether I need to create another one.

I want to connect Ceph with OpenStack through a private net and also have the
possibility to connect to Ceph from the public net.


Any suggestions?

Thanks,
Fran.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs

2016-06-22 Thread 施柏安
Hi,
You can use the command 'ceph pg <pgid> query' to check what's going on with the PGs
that have problems, and use "ceph-objectstore-tool" to recover the PG.

2016-06-21 19:09 GMT+08:00 Paweł Sadowski :

> Already restarted those OSDs and then the whole cluster (rack by rack,
> failure domain is rack in this setup).
> We would like to try the *ceph-objectstore-tool mark-complete* operation. Is
> there any way (other than checking mtimes on files and querying PGs) to
> determine which replica has the most up-to-date data?
>
> On 06/21/2016 12:37 PM, M Ranga Swami Reddy wrote:
> > Try to restart OSD 109 and 166? check if it help?
> >
> >
> > On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski  wrote:
> >> Thanks for response.
> >>
> >> All OSDs seems to be ok, they have been restarted, joined cluster after
> >> that, nothing weird in the logs.
> >>
> >> # ceph pg dump_stuck stale
> >> ok
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> pg_stat   state        up              up_primary   acting          acting_primary
> >> 3.2929    incomplete   [109,272,83]    109          [109,272,83]    109
> >> 3.1683    incomplete   [166,329,281]   166          [166,329,281]   166
> >>
> >> # ceph pg dump_stuck unclean
> >> ok
> >> pg_stat   state        up              up_primary   acting          acting_primary
> >> 3.2929    incomplete   [109,272,83]    109          [109,272,83]    109
> >> 3.1683    incomplete   [166,329,281]   166          [166,329,281]   166
> >>
> >>
> >> On OSD 166 there is 100 blocked ops (on 109 too), they all end on
> >> "event": "reached_pg"
> >>
> >> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok dump_ops_in_flight
> >> ...
> >> {
> >> "description": "osd_op(client.958764031.0:18137113
> >> rbd_data.392585982ae8944a.0ad4 [set-alloc-hint object_size
> >> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> >> ack+ondisk+retry+write+known_if_redirected e613241)",
> >> "initiated_at": "2016-06-21 10:19:59.894393",
> >> "age": 828.025527,
> >> "duration": 600.020809,
> >> "type_data": [
> >> "reached pg",
> >> {
> >> "client": "client.958764031",
> >> "tid": 18137113
> >> },
> >> [
> >> {
> >> "time": "2016-06-21 10:19:59.894393",
> >> "event": "initiated"
> >> },
> >> {
> >> "time": "2016-06-21 10:29:59.915202",
> >> "event": "reached_pg"
> >> }
> >> ]
> >> ]
> >> }
> >> ],
> >> "num_ops": 100
> >> }
> >>
> >>
> >>
> >> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
> >>> you can use the below cmds:
> >>> ==
> >>>
> >>> ceph pg dump_stuck stale
> >>> ceph pg dump_stuck inactive
> >>> ceph pg dump_stuck unclean
> >>> ===
> >>>
> >>> And the query the PG, which are in unclean or stale state, check for
> >>> any issue with a specific OSD.
> >>>
> >>> Thanks
> >>> Swami
> >>>
> >>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski 
> wrote:
>  Hello,
> 
>  We have an issue on one of our clusters. One node with 9 OSDs was down
>  for more than 12 hours. During that time the cluster recovered without
>  problems. When the host came back into the cluster we got two PGs in incomplete
>  state. We decided to mark the OSDs on this host as out, but the two PGs are
>  still in incomplete state. Trying to query those PGs hangs forever. We
>  already tried restarting OSDs. Is there any way to solve this issue
>  without losing data? Any help appreciated :)
> 
>  # ceph health detail | grep incomplete
>  HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs stuck
> unclean;
>  200 requests are blocked > 32 sec; 2 osds have slow requests;
>  noscrub,nodeep-scrub flag(s) set
>  pg 3.2929 is stuck inactive since forever, current state incomplete,
>  last acting [109,272,83]
>  pg 3.1683 is stuck inactive since forever, current state incomplete,
>  last acting [166,329,281]
>  pg 3.2929 is stuck unclean since forever, current state incomplete,
> last
>  acting [109,272,83]
>  pg 3.1683 is stuck unclean since forever, current state incomplete,
> last
>  acting [166,329,281]
>  pg 3.1683 is incomplete, acting [166,329,281] (reducing pool vms
>  min_size from 2 may help; search ceph.com/docs for 'incomplete')
>  pg 3.2929 is incomplete, acting [109,272,83] (reducing pool vms
> min_size
>  from 2 may help; search ceph.com/docs for 'incomplete')
> 
>  Directory for PG 3.1683 is present on OSD 166 and containes ~8GB.
> 
>  We didn't try setting min_size to 1 yet (we treat is as a last
> resort).
> 
> 
> 
>  Some cluster info:
>  # ceph --version
> 
>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 
>  # ceph -s

Re: [ceph-users] Ceph deployment

2016-06-22 Thread Oliver Dzombic
Hi Fran,

public_network = the network the clients use to access Ceph resources

cluster_network = the network Ceph uses to keep the OSDs synchronizing
among themselves



So if you want your Ceph cluster to be reachable from public internet
addresses, you will have to assign IPs from a real public network.

That means not 10.0.0.0 / 192.168.0.0 and so on. But that is a logical
network design question and has nothing to do with Ceph.

Of course you could, via iptables or whatever, create rules to
masquerade/forward public Ceph traffic to an internal, private network.
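
A minimal ceph.conf sketch for that kind of split, using the example
addresses from your mail (illustration only, adjust to your own layout):

[global]
public_network  = 192.168.1.0/24
cluster_network = 10.0.0.0/24
mon_host        = 192.168.1.67

With a layout like this the clients (including OpenStack) reach the
monitors and OSDs over the public network, while OSD replication traffic
stays on 10.0.0.0/24.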

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 22.06.2016 um 11:33 schrieb Fran Barrera:
> Hi all,
> 
> I have a couple of question about the deployment of Ceph.
> 
> 
> This is what I plan:
> 
> Private Net - 10.0.0.0/24 
> Public Net - 192.168.1.0/24 
> 
> Ceph server:
>  - eth1: 192.168.1.67
>  - eth2: 10.0.0.67
> 
> Openstack server:
>  - eth1: 192.168.1.65
>  - eth2: 10.0.0.65
>  
> 
>  ceph.conf
>   - mon_host: 10.0.0.67
>   - cluster_network - 10.0.0.0/24 
>   - public_network - 192.168.1.0/24 
> 
> Now, I have some doubts:
>  - If I configure Ceph with this configuration, could I connect to
> Ceph from a client on the public net? I ask because mon_host is
> 10.0.0.67 in ceph.conf.
>  - The private net was created for OpenStack, but I wonder whether I can use
> this net for the Ceph cluster network or whether I need to create another one.
>  
> I want to connect Ceph with OpenStack through a private net and also have the
> possibility to connect to Ceph from the public net.
> 
> 
> Any suggestions?
> 
> Thanks,
> Fran.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] stuck unclean since forever

2016-06-22 Thread min fang
Hi, I created a new Ceph cluster and created a pool, but I see "stuck unclean
since forever" errors (as shown below). Can you help point out the
possible reasons for this? Thanks.

ceph -s
cluster 602176c1-4937-45fc-a246-cc16f1066f65
 health HEALTH_WARN
8 pgs degraded
8 pgs stuck unclean
8 pgs undersized
too few PGs per OSD (2 < min 30)
 monmap e1: 1 mons at {ceph-01=172.0.0.11:6789/0}
election epoch 14, quorum 0 ceph-01
 osdmap e89: 3 osds: 3 up, 3 in
flags
  pgmap v310: 8 pgs, 1 pools, 0 bytes data, 0 objects
60112 MB used, 5527 GB / 5586 GB avail
   8 active+undersized+degraded

ceph health detail
HEALTH_WARN 8 pgs degraded; 8 pgs stuck unclean; 8 pgs undersized; too few
PGs per OSD (2 < min 30)
pg 5.0 is stuck unclean since forever, current state
active+undersized+degraded, last acting [3]
pg 5.1 is stuck unclean since forever, current state
active+undersized+degraded, last acting [3]
pg 5.2 is stuck unclean since forever, current state
active+undersized+degraded, last acting [3]
pg 5.3 is stuck unclean since forever, current state
active+undersized+degraded, last acting [4]
pg 5.7 is stuck unclean since forever, current state
active+undersized+degraded, last acting [3]
pg 5.6 is stuck unclean since forever, current state
active+undersized+degraded, last acting [2]
pg 5.5 is stuck unclean since forever, current state
active+undersized+degraded, last acting [4]
pg 5.4 is stuck unclean since forever, current state
active+undersized+degraded, last acting [4]
pg 5.7 is active+undersized+degraded, acting [3]
pg 5.6 is active+undersized+degraded, acting [2]
pg 5.5 is active+undersized+degraded, acting [4]
pg 5.4 is active+undersized+degraded, acting [4]
pg 5.3 is active+undersized+degraded, acting [4]
pg 5.2 is active+undersized+degraded, acting [3]
pg 5.1 is active+undersized+degraded, acting [3]
pg 5.0 is active+undersized+degraded, acting [3]
too few PGs per OSD (2 < min 30)

ceph osd tree
ID WEIGHT  TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 3.0 root default
-2 3.0 host ceph-01
 2 1.0 osd.2  up  1.0  1.0
 3 1.0 osd.3  up  1.0  1.0
 4 1.0 osd.4  up  1.0  1.0

 ceph osd crush tree
[
{
"id": -1,
"name": "default",
"type": "root",
"type_id": 10,
"items": [
{
"id": -2,
"name": "ceph-01",
"type": "host",
"type_id": 1,
"items": [
{
"id": 2,
"name": "osd.2",
"type": "osd",
"type_id": 0,
"crush_weight": 1.00,
"depth": 2
},
{
"id": 3,
"name": "osd.3",
"type": "osd",
"type_id": 0,
"crush_weight": 1.00,
"depth": 2
},
{
"id": 4,
"name": "osd.4",
"type": "osd",
"type_id": 0,
"crush_weight": 1.00,
"depth": 2
}
]
}
]
}
]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck unclean since forever

2016-06-22 Thread Oliver Dzombic
Hi Min,

as it's written there:

too few PGs per OSD (2 < min 30)

You have to raise the number of PGs.
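
For example (a sketch; pick a pg_num that fits your OSD count and
replication size, per the PG count guidelines in the docs):

ceph osd pool set <poolname> pg_num 128
ceph osd pool set <poolname> pgp_num 128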

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 22.06.2016 um 12:10 schrieb min fang:
> too few PGs per OSD (2 < min 30)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] stuck unclean since forever

2016-06-22 Thread Burkhard Linke

Hi,

On 06/22/2016 12:10 PM, min fang wrote:
Hi, I created a new ceph cluster, and create a pool, but see "stuck 
unclean since forever" errors happen(as the following), can help point 
out the possible reasons for this? thanks.


ceph -s
cluster 602176c1-4937-45fc-a246-cc16f1066f65
 health HEALTH_WARN
8 pgs degraded
8 pgs stuck unclean
8 pgs undersized
too few PGs per OSD (2 < min 30)
 monmap e1: 1 mons at {ceph-01=172.0.0.11:6789/0 
}

election epoch 14, quorum 0 ceph-01
 osdmap e89: 3 osds: 3 up, 3 in
flags
  pgmap v310: 8 pgs, 1 pools, 0 bytes data, 0 objects
60112 MB used, 5527 GB / 5586 GB avail
   8 active+undersized+degraded


*snipsnap*

With three OSDs and a single host you need to change the crush ruleset
for the pool, since by default it tries to distribute the data across 3
different _hosts_.
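
One way to do that (a rough sketch, assuming the default root and a pool
named "rbd") is to create a rule that only requires distinct OSDs instead
of distinct hosts and assign it to the pool:

ceph osd crush rule create-simple osd-only-rule default osd
ceph osd crush rule dump                      # note the ruleset number of the new rule
ceph osd pool set rbd crush_ruleset <ruleset>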


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inconsistent PGs

2016-06-22 Thread Paweł Sadowski
Querying those PGs hangs forever. We ended up using
*ceph-objectstore-tool mark-complete* on those PGs.
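
For anyone hitting this later, the invocation looks roughly like this (a
sketch; the OSD has to be stopped first, and the path/pgid below are just
the examples from this thread, osd.166 / pg 3.1683):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-166 \
    --journal-path /var/lib/ceph/osd/ceph-166/journal \
    --pgid 3.1683 --op mark-complete

and then start the OSD again.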

On 06/22/2016 11:45 AM, 施柏安 wrote:
> Hi,
> You can use command 'ceph pg query' to check what's going on with the
> pgs which have problem and use "ceph-objectstore-tool" to recover that pg.
>
> 2016-06-21 19:09 GMT+08:00 Paweł Sadowski  >:
>
> Already restarted those OSD and then whole cluster (rack by rack,
> failure domain is rack in this setup).
> We would like to try *ceph-objectstore-tool mark-complete*
> operation. Is
> there any way (other than checking mtime on file and querying PGs) to
> determine which replica has most up to date datas?
>
> On 06/21/2016 12:37 PM, M Ranga Swami Reddy wrote:
> > Try to restart OSD 109 and 166? check if it help?
> >
> >
> > On Tue, Jun 21, 2016 at 4:05 PM, Paweł Sadowski  > wrote:
> >> Thanks for response.
> >>
> >> All OSDs seems to be ok, they have been restarted, joined
> cluster after
> >> that, nothing weird in the logs.
> >>
> >> # ceph pg dump_stuck stale
> >> ok
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> pg_stat   state        up              up_primary   acting          acting_primary
> >> 3.2929    incomplete   [109,272,83]    109          [109,272,83]    109
> >> 3.1683    incomplete   [166,329,281]   166          [166,329,281]   166
> >>
> >> # ceph pg dump_stuck unclean
> >> ok
> >> pg_stat   state        up              up_primary   acting          acting_primary
> >> 3.2929    incomplete   [109,272,83]    109          [109,272,83]    109
> >> 3.1683    incomplete   [166,329,281]   166          [166,329,281]   166
> >>
> >>
> >> On OSD 166 there is 100 blocked ops (on 109 too), they all end on
> >> "event": "reached_pg"
> >>
> >> # ceph --admin-daemon /var/run/ceph/ceph-osd.166.asok
> dump_ops_in_flight
> >> ...
> >> {
> >> "description": "osd_op(client.958764031.0:18137113
> >> rbd_data.392585982ae8944a.0ad4 [set-alloc-hint
> object_size
> >> 4194304 write_size 4194304,write 2641920~8192] 3.d6195683 RETRY=15
> >> ack+ondisk+retry+write+known_if_redirected e613241)",
> >> "initiated_at": "2016-06-21 10:19:59.894393",
> >> "age": 828.025527,
> >> "duration": 600.020809,
> >> "type_data": [
> >> "reached pg",
> >> {
> >> "client": "client.958764031",
> >> "tid": 18137113
> >> },
> >> [
> >> {
> >> "time": "2016-06-21 10:19:59.894393",
> >> "event": "initiated"
> >> },
> >> {
> >> "time": "2016-06-21 10:29:59.915202",
> >> "event": "reached_pg"
> >> }
> >> ]
> >> ]
> >> }
> >> ],
> >> "num_ops": 100
> >> }
> >>
> >>
> >>
> >> On 06/21/2016 12:27 PM, M Ranga Swami Reddy wrote:
> >>> you can use the below cmds:
> >>> ==
> >>>
> >>> ceph pg dump_stuck stale
> >>> ceph pg dump_stuck inactive
> >>> ceph pg dump_stuck unclean
> >>> ===
> >>>
> >>> And the query the PG, which are in unclean or stale state,
> check for
> >>> any issue with a specific OSD.
> >>>
> >>> Thanks
> >>> Swami
> >>>
> >>> On Tue, Jun 21, 2016 at 3:02 PM, Paweł Sadowski
> mailto:c...@sadziu.pl>> wrote:
>  Hello,
> 
>  We have an issue on one of our clusters. One node with 9 OSD
> was down
>  for more than 12 hours. During that time cluster recovered
> without
>  problems. When host back to the cluster we got two PGs in
> incomplete
>  state. We decided to mark OSDs on this host as out but the
> two PGs are
>  still in incomplete state. Trying to query those pg hangs
> forever. We
>  already tried restarting OSDs. Is there any way to solve
> this issue
>  without losing data? Any help appreciated :)
> 
>  # ceph health detail | grep incomplete
>  HEALTH_WARN 2 pgs incomplete; 2 pgs stuck inactive; 2 pgs
> stuck unclean;
>  200 requests are blocked > 32 sec; 2 osds have slow requests;
>  noscrub,nodeep-scrub flag(s) set
>  pg 3.2929 is stuck inactive since forever, current state
> incomplete,
>  last acting [109,272,83]
>  pg 3.1683 is stuck inactive since forever, current state
> incomplete,
>  last acting [166,329,281]
>  pg 3.2929 is stuck unclean since forever, current state
> incomplete, last
>  acting [109,272,

Re: [ceph-users] stuck unclean since forever

2016-06-22 Thread min fang
Thanks. Actually I created a pool with more PGs and still hit this problem.
Following is my crush map; please help point out how to change the crush
ruleset? Thanks.

#begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 device0
device 1 device1
device 2 osd.2
device 3 osd.3
device 4 osd.4

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host redpower-ceph-01 {
id -2   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item osd.2 weight 1.000
item osd.3 weight 1.000
item osd.4 weight 1.000
}
root default {
id -1   # do not change unnecessarily
# weight 3.000
alg straw
hash 0  # rjenkins1
item redpower-ceph-01 weight 3.000
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
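
Would something along these lines be the right way to change it? (untested sketch)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# change "step chooseleaf firstn 0 type host" to "type osd" in replicated_ruleset
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new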


2016-06-22 18:27 GMT+08:00 Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de>:

> Hi,
>
> On 06/22/2016 12:10 PM, min fang wrote:
>
> Hi, I created a new ceph cluster, and create a pool, but see "stuck
> unclean since forever" errors happen(as the following), can help point out
> the possible reasons for this? thanks.
>
> ceph -s
> cluster 602176c1-4937-45fc-a246-cc16f1066f65
>  health HEALTH_WARN
> 8 pgs degraded
> 8 pgs stuck unclean
> 8 pgs undersized
> too few PGs per OSD (2 < min 30)
>  monmap e1: 1 mons at {ceph-01=172.0.0.11:6789/0}
> election epoch 14, quorum 0 ceph-01
>  osdmap e89: 3 osds: 3 up, 3 in
> flags
>   pgmap v310: 8 pgs, 1 pools, 0 bytes data, 0 objects
> 60112 MB used, 5527 GB / 5586 GB avail
>8 active+undersized+degraded
>
>
> *snipsnap*
>
> With three OSDs and a single host you need to change the crush ruleset for
> the pool, since it tries to distribute the data across 3 different _host_
> by default.
>
> Regards,
> Burkhard
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-release RPM has broken URL

2016-06-22 Thread Oleksandr Natalenko

Hello.

ceph-release-1-1.el7.noarch.rpm [1] is currently broken
because it contains a wrong baseurl:


===
baseurl=http://ceph.com/rpm-hammer/rhel7/$basearch
===

That leads to a 404 when yum tries to use it.

I believe "rhel7" should be replaced by "el7", and a
ceph-release-1-2.el7.noarch.rpm should be released containing this fix.
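
As a local workaround until a fixed package lands, something like this
should do (assuming the repo file is the usual /etc/yum.repos.d/ceph.repo
installed by ceph-release):

sed -i 's|rpm-hammer/rhel7|rpm-hammer/el7|' /etc/yum.repos.d/ceph.repo
yum clean metadata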


Thanks.

Regards,
  Oleksandr.

[1] 
http://download.ceph.com/rpm-hammer/el7/noarch/ceph-release-1-1.el7.noarch.rpm

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs snapshots

2016-06-22 Thread Kenneth Waegeman

Hi all,

In Jewel, CephFS snapshots are still experimental. Does someone have a 
clue when this will become stable, or how experimental it is?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Daniel Swarbrick
On 20/06/16 19:51, Gregory Farnum wrote:
> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>>
>> At this stage, I have a strong suspicion that it is the introduction of
>> "require_feature_tunables5 = 1" in the tunables. This seems to require
>> all RADOS connections to be re-established.
> 
> Do you have any evidence of that besides the one restart?
> 
> I guess it's possible that we aren't kicking requests if the crush map
> but not the rest of the osdmap changes, but I'd be surprised.
> -Greg

I think the key fact to take note of is that we had long-running Qemu
processes that had been started a few months ago, using Infernalis
librbd shared libs.

If Infernalis had no concept of require_feature_tunables5, then it seems
logical that these clients would block if the cluster were upgraded to
Jewel and this tunable became mandatory.

I have just upgraded our fourth and final cluster to Jewel. Prior to
applying optimal tunables, we upgraded our hypervisor nodes' librbd
also, and migrated all VMs at least once, to start a fresh Qemu process
for each (using the updated librbd).

We're seeing ~65% data movement due to chooseleaf_stable 0 => 1, but
other than that, so far so good. No clients are blocking indefinitely.
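
For reference, the tunables step itself was essentially just:

ceph osd crush show-tunables   # check the current profile first
ceph osd crush tunables optimal

after which the data movement mentioned above started.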

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] problems mounting from fstab on boot

2016-06-22 Thread Daniel Davidson
When I add my Ceph filesystem to fstab, I can mount it by referencing it, 
but when I restart the system it stops during boot because the mount 
failed.  I am guessing it is because fstab is processed before the network 
starts?  Using CentOS 7.


thanks for the help,

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] problems mounting from fstab on boot

2016-06-22 Thread David Riedl

You have to add _netdev to your line in fstab.

Example:

localhost:/data  /var/data  glusterfs  _netdev  0 0

from

https://blog.sleeplessbeastie.eu/2013/05/10/centos-6-_netdev-fstab-option-and-netfs-service/
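
For a kernel CephFS mount the fstab line would look roughly like this (a
sketch; monitor address, secret file and mount point are just examples):

192.168.1.10:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0  2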


On 22.06.2016 15:16, Daniel Davidson wrote:
When I add my ceph system to fstab, I can make mount by referencing 
it, but when I restart the system it stops during boot because the 
mount failed.  I am guessing it is because fstab is run before the 
network starts?  Using centos 7.


thanks for the help,

Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Mit freundlichen Grüßen

David Riedl



WINGcon GmbH Wireless New Generation - Consulting & Solutions

Phone: +49 (0) 7543 9661 - 26
E-Mail: david.ri...@wingcon.com
Web: http://www.wingcon.com

Sitz der Gesellschaft: Langenargen
Registergericht: ULM, HRB 632019
USt-Id.: DE232931635, WEEE-Id.: DE74015979
Geschäftsführer: Thomas Ehrle, Fritz R. Paul

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd create - how to detect changes

2016-06-22 Thread George Shuklin
I'm writing a Ceph playbook and I see some issues with the ceph osd create 
command. When I call it with all arguments, I can't distinguish whether it 
created a new OSD or just confirmed that this OSD already exists.


ceph osd create 5ecc7a8c-388a-11e6-b8ad-5f3ab2552b13 22; echo $?
22
0
ceph osd create 5ecc7a8c-388a-11e6-b8ad-5f3ab2552b13 22; echo $?
22
0
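
The only workaround I can think of is checking whether the UUID is already
present in the OSD map before calling create, something like (untested sketch):

UUID=5ecc7a8c-388a-11e6-b8ad-5f3ab2552b13
if ceph osd dump --format json | grep -q "$UUID"; then
    echo "OSD with uuid $UUID already exists, nothing created"
else
    ceph osd create "$UUID" 22
fi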

How are you solving this?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Andrei Mikhailovsky
Hi Daniel,

Many thanks for your useful tests and your results.

How much IO wait do you have on your client vms? Has it significantly increased 
or not?

Many thanks

Andrei

- Original Message -
> From: "Daniel Swarbrick" 
> To: "ceph-users" 
> Cc: "ceph-devel" 
> Sent: Wednesday, 22 June, 2016 13:43:37
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> On 20/06/16 19:51, Gregory Farnum wrote:
>> On Mon, Jun 20, 2016 at 8:33 AM, Daniel Swarbrick
>>>
>>> At this stage, I have a strong suspicion that it is the introduction of
>>> "require_feature_tunables5 = 1" in the tunables. This seems to require
>>> all RADOS connections to be re-established.
>> 
>> Do you have any evidence of that besides the one restart?
>> 
>> I guess it's possible that we aren't kicking requests if the crush map
>> but not the rest of the osdmap changes, but I'd be surprised.
>> -Greg
> 
> I think the key fact to take note of is that we had long-running Qemu
> processes that had been started a few months ago, using Infernalis
> librbd shared libs.
> 
> If Infernalis had no concept of require_feature_tunables5, then it seems
> logical that these clients would block if the cluster were upgraded to
> Jewel and this tunable became mandatory.
> 
> I have just upgraded our fourth and final cluster to Jewel. Prior to
> applying optimal tunables, we upgraded our hypervisor nodes' librbd
> also, and migrated all VMs at least once, to start a fresh Qemu process
> for each (using the updated librbd).
> 
> We're seeing ~65% data movement due to chooseleaf_stable 0 => 1, but
> other than that, so far so good. No clients are blocking indefinitely.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Daniel Swarbrick
On 22/06/16 17:54, Andrei Mikhailovsky wrote:
> Hi Daniel,
> 
> Many thanks for your useful tests and your results.
> 
> How much IO wait do you have on your client vms? Has it significantly 
> increased or not?
> 

Hi Andrei,

Bearing in mind that this cluster is tiny (four nodes, each with four
OSDs), our metrics may not be that meaningful. However, on a VM that is
running ElasticSearch, collecting logs from Graylog, we're seeing no
more than about 5% iowait for a 5s period, and most of the time it's
below 1%. This VM is really not writing a lot of data though.

The cluster as a whole is peaking at only about 1200 write op/s,
according to ceph -w.

Executing a "sync" in a VM does of course have a noticeable delay due to
the recovery happening in the background, but nothing is waiting for IO
long enough to trigger the kernel's 120s timer / warning.

The recovery has been running for about four hours now, and is down to
20% misplaced objects. So far we have not had any clients block
indefinitely, so I think the migration of VMs to Jewel-capable
hypervisors did the trick.

Best,
Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster down during backfilling, Jewel tunables and client IO optimisations

2016-06-22 Thread Andrei Mikhailovsky
Hi Daniel,

Many thanks, I will keep this in mind while performing the updates in the 
future.

Note to documentation manager - perhaps it makes sense to add this solution as a 
note/tip to the Upgrade section of the release notes?


Andrei

- Original Message -
> From: "Daniel Swarbrick" 
> To: "ceph-users" 
> Cc: "ceph-devel" 
> Sent: Wednesday, 22 June, 2016 17:09:48
> Subject: Re: [ceph-users] cluster down during backfilling, Jewel tunables and 
> client IO optimisations

> On 22/06/16 17:54, Andrei Mikhailovsky wrote:
>> Hi Daniel,
>> 
>> Many thanks for your useful tests and your results.
>> 
>> How much IO wait do you have on your client vms? Has it significantly 
>> increased
>> or not?
>> 
> 
> Hi Andrei,
> 
> Bearing in mind that this cluster is tiny (four nodes, each with four
> OSDs), our metrics may not be that meaningful. However, on a VM that is
> running ElasticSearch, collecting logs from Graylog, we're seeing no
> more than about 5% iowait for a 5s period, and most of the time it's
> below 1%. This VM is really not writing a lot of data though.
> 
> The cluster as a whole is peaking at only about 1200 write op/s,
> according to ceph -w.
> 
> Executing a "sync" in a VM does of course have a noticeable delay due to
> the recovery happening in the background, but nothing is waiting for IO
> long enough to trigger the kernel's 120s timer / warning.
> 
> The recovery has been running for about four hours now, and is down to
> 20% misplaced objects. So far we have not had any clients block
> indefinitely, so I think the migration of VMs to Jewel-capable
> hypervisors did the trick.
> 
> Best,
> Daniel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error EPERM when running ceph tell command

2016-06-22 Thread Andrei Mikhailovsky
Hi 

I am trying to run an osd level benchmark but get the following error: 

# ceph tell osd.3 bench 
Error EPERM: problem getting command descriptions from osd.3 

I am running Jewel 10.2.2 on Ubuntu 16.04 servers. Has the syntax changed, or do 
I have an issue? 

Cheers 
Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Use of legacy bobtail tunables and potential performance impact to "jewel"?

2016-06-22 Thread Yang X
Our Ceph clients are using the legacy bobtail tunables and in particular,
"chooseleaf_vary_r" is set to 0.

My question is how this would impact CRUSH, and hence performance, when
deploying "jewel" on the server side and also the experimental "bluestore" backend.

Does it only affect data placement, or could it actually fail to find a
mapping for certain objects and thus cause errors?

Thanks in advance,

Yang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] libceph dns resolution

2016-06-22 Thread Willi Fehler

Hello,

I'm trying to mount Ceph storage. It seems that libceph does not 
resolve hostnames from /etc/hosts?


libceph: parse_ips bad ip 'linsrv001.willi-net.local'

Regards - Willi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD object-map and discard in VM

2016-06-22 Thread Brian Andrus
I've created a downstream bug for this same issue.

https://bugzilla.redhat.com/show_bug.cgi?id=1349116

On Wed, Jun 15, 2016 at 6:23 AM,  wrote:

> Hello guys,
>
> We are currently testing Ceph Jewel with object-map feature enabled:
>
> rbd image 'disk-22920':
> size 102400 MB in 25600 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.7cfa2238e1f29
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff,
> deep-flatten
> flags:
>
> We use this RBD as a disk for a KVM virtual machine with virtio-scsi and
> discard=unmap. We noticed the following parameters in /sys/block:
>
> # cat /sys/block/sda/queue/discard_*
> 4096
> 1073741824
> 0 <- discard_zeroes_data
>
> While trying to do a mkfs.ext4 on the disk in the VM we noticed low
> performance when using discard.
>
> mkfs.ext4 -E nodiscard /dev/sda1 - takes 5 seconds to complete
> mkfs.ext4 -E discard /dev/sda1 - takes around 3 minutes
>
> When disabling the object-map, the mkfs with discard takes just 5 seconds.
>
> Do you have any idea what might cause this issue?
>
> Kernel: 4.2.0-35-generic #40~14.04.1-Ubuntu
> Ceph: 10.2.0
> Libvirt: 1.3.1
> QEMU: 2.5.0
>
> Thanks!
>
> Best regards,
> Jonas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Brian Andrus
Red Hat, Inc.
Storage Consultant, Global Storage Practice
Mobile +1 (530) 903-8487
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] issues with misplaced object and PG that won't clean

2016-06-22 Thread Mike Shaffer
Hi,

I'm running into an issue with a PG that will not become clean and seems to
be blocking requests.
When I restart an OSD that the PG is on, it never reaches recovery. Removing the
OSD in question only seems to move the problem.

Additionally ceph pg query hangs forever unless the OSD in question is
stopped.

I tried doing a ceph pg force_create_pg but that didn't seem to have an impact.

Really not sure how to go about correcting this issue. Currently running
a mix of 0.80.11 and 0.94.7 that I am in the process of upgrading but
that seems to be unrelated as this issue occasionally presented even
before the upgrade.

Any advice would be greatly appreciated.

Here is the output of ceph health, not sure what other outputs would be
useful but would be happy to attach any additional useful information:

HEALTH_WARN 1 pgs backfilling; 1 pgs degraded; 1 pgs stuck degraded; 1
pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized; 39 requests
are blocked > 32 sec; recovery 2/414113144 objects degraded (0.000%);
recovery 1/414113144 objects misplaced (0.000%); noout flag(s) set

Thanks,
Mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs snapshots

2016-06-22 Thread Gregory Farnum
On Wednesday, June 22, 2016, Kenneth Waegeman 
wrote:

> Hi all,
>
> In Jewel ceph fs snapshots are still experimental. Does someone has a clue
> when this would become stable, or how experimental this is ?
>

We're not sure yet. Probably it will follow stable multi-MDS; we're
thinking about redoing some of the core snapshot pieces still. :/

It's still pretty experimental in Jewel. Shen had been working on this and
I think it often works, but tends to fall apart under the failure of other
components (eg, restarting an MDS while snapshot work is happening).
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs snapshots

2016-06-22 Thread Gregory Farnum
[re-adding ceph-users]

Yes, it can corrupt the metadata and require use of filesystem repair
tools. I really don't recommend using snapshots except on toy clusters.

On Wednesday, June 22, 2016, Brady Deetz wrote:

> Snapshots would be excellent for a number of fairly obvious reasons. Are
> any of the known issues with snapshots ones that result in the loss of
> non-snapshot data or of a cluster?
> On Jun 22, 2016 2:16 PM, "Gregory Farnum"  wrote:
>
>> On Wednesday, June 22, 2016, Kenneth Waegeman 
>> wrote:
>>
>>> Hi all,
>>>
>>> In Jewel ceph fs snapshots are still experimental. Does someone has a
>>> clue when this would become stable, or how experimental this is ?
>>>
>>
>> We're not sure yet. Probably it will follow stable multi-MDS; we're
>> thinking about redoing some of the core snapshot pieces still. :/
>>
>> It's still pretty experimental in Jewel. Shen had been working on this
>> and I think it often works, but tends to fall apart under the failure of
>> other components (eg, restarting an MDS while snapshot work is happening).
>>
>
s/Shen/Zheng/
Silly autocorrect!




> -Greg
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD object-map and discard in VM

2016-06-22 Thread Jason Dillaman
I'm not sure why I never received the original list email, so I
apologize for the delay. Is /dev/sda1, from your example, fresh with
no data to actually discard or does it actually have lots of data to
discard?

Thanks,

On Wed, Jun 22, 2016 at 1:56 PM, Brian Andrus  wrote:
> I've created a downstream bug for this same issue.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1349116
>
> On Wed, Jun 15, 2016 at 6:23 AM,  wrote:
>>
>> Hello guys,
>>
>> We are currently testing Ceph Jewel with object-map feature enabled:
>>
>> rbd image 'disk-22920':
>> size 102400 MB in 25600 objects
>> order 22 (4096 kB objects)
>> block_name_prefix: rbd_data.7cfa2238e1f29
>> format: 2
>> features: layering, exclusive-lock, object-map, fast-diff,
>> deep-flatten
>> flags:
>>
>> We use this RBD as disk for a kvm virtual machine with virtio-scsi and
>> discard=unmap. We noticed the following paremeters in /sys/block:
>>
>> # cat /sys/block/sda/queue/discard_*
>> 4096
>> 1073741824
>> 0 <- discard_zeroes_data
>>
>> While trying to do a mkfs.ext4 on the disk in VM we noticed a low
>> performance with using discard.
>>
>> mkfs.ext4 -E nodiscard /dev/sda1 - tooks 5 seconds to complete
>> mkfs.ext4 -E discard /dev/sda1 - tooks around 3 monutes
>>
>> When disabling the object-map the mkfs with discard tooks just 5 seconds.
>>
>> Do you have any idea what might cause this issue?
>>
>> Kernel: 4.2.0-35-generic #40~14.04.1-Ubuntu
>> Ceph: 10.2.0
>> Libvirt: 1.3.1
>> QEMU: 2.5.0
>>
>> Thanks!
>>
>> Best regards,
>> Jonas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Brian Andrus
> Red Hat, Inc.
> Storage Consultant, Global Storage Practice
> Mobile +1 (530) 903-8487
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tiering with Same Cache Pool

2016-06-22 Thread Christian Balzer

Hello,

On Wed, 22 Jun 2016 15:40:40 +0700 Lazuardi Nasution wrote:

> Hi Christian,
> 
> If I have several cache pool on the same SSD OSDs (by using same ruleset)
> so those cache pool always show same Max. Available of "ceph df detail"
> output, 

That's true for all pools that share the same backing storage.

>what should I put on target_max_bytes of cache tiering
> configuration for each cache pool? should it be same and use Max
> Available size? 

Definitely not, you will want to at least subtract enough space from your
available size to avoid having one failed OSD generating a full disk
situation. Even more to cover a failed host scenario.
Then you want to divide the rest by the number of pools you plan to put on
there and set that as the target_max_bytes in the simplest case.
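
As a made-up example: with 4TB usable on the SSDs after leaving that
headroom and two cache pools sharing them, you would end up with something
like:

ceph osd pool set cache1 target_max_bytes 2199023255552   # ~2TB each
ceph osd pool set cache2 target_max_bytes 2199023255552

The same logic applies to target_max_objects if you use that as well.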

>If diffrent, how can I know if such cache pool need more
> size than other.
> 
By looking at df detail again, the usage is per pool after all.

But a cache pool will of course use all the space it has, so that's not a
good way to determine your needs.
Watching how fast they fill up may be more helpful.

You should have a decent idea of your needs before doing cache tiering,
by monitoring the pools (and their storage) you want to cache, again
with "df detail" (how many writes/reads?), "ceph -w", atop or iostat, etc.

Christian

> Best regards,
> 
> Date: Mon, 20 Jun 2016 09:34:05 +0900
> > From: Christian Balzer 
> > To: ceph-users@lists.ceph.com
> > Cc: Lazuardi Nasution 
> > Subject: Re: [ceph-users] Cache Tiering with Same Cache Pool
> > Message-ID: <20160620093405.732f5...@batzmaru.gol.ad.jp>
> > Content-Type: text/plain; charset=US-ASCII
> >
> > On Mon, 20 Jun 2016 00:14:55 +0700 Lazuardi Nasution wrote:
> >
> > > Hi,
> > >
> > > Is it possible to do cache tiering for some storage pools with the
> > > same cache pool?
> >
> > As mentioned several times on this ML, no.
> > There is a strict 1:1 relationship between base and cache pools.
> > You can of course (if your SSDs/NVMes are large and fast enough) put
> > more than one cache pool on them.
> >
> > > What will happen if cache pool is broken or at least doesn't
> > > meet quorum when storage pool is OK?
> > >
> > With a read-only cache pool nothing should happen, as all writes are
> > going to the base pool.
> >
> > In any other mode (write-back, read-forward or read-proxy) your hottest
> > objects are likely to be ONLY on the cache pool and never getting
> > flushed to the base pool.
> > So that means, if your cache pool fails, your cluster is essentially
> > dead or at the very least has suffered massive data loss.
> >
> > Something to very much think about when doing cache tiering.
> >
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Wade Holler
Based on everyone's suggestions: the first modification, to 50 / 16,
enabled our config to get to ~645 million objects before the behavior in
question was observed (~330 million was the previous ceiling).  A subsequent
modification to 50 / 24 has enabled us to get to 1.1 billion+.
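
For anyone searching the archives later, those numbers map to the filestore
options in ceph.conf, i.e. roughly (merge threshold first, split multiple
second):

[osd]
filestore merge threshold = 50
filestore split multiple = 24

followed by restarting the OSDs.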

Thank you all very much for your support and assistance.

Best Regards,
Wade


On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
>
>> Sorry, late to the party here. I agree, up the merge and split
>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>> One of those things you just have to find out as an operator since it's
>> not well documented :(
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>
>> We have over 200 million objects in this cluster, and it's still doing
>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>> journals. Having enough memory and dropping your vfs_cache_pressure
>> should also help.
>>
> Indeed.
>
> Since it was asked in that bug report and also my first suspicion, it
> would probably be good time to clarify that it isn't the splits that cause
> the performance degradation, but the resulting inflation of dir entries
> and exhaustion of SLAB and thus having to go to disk for things that
> normally would be in memory.
>
> Looking at Blair's graph from yesterday pretty much makes that clear, a
> purely split caused degradation should have relented much quicker.
>
>
>> Keep in mind that if you change the values, it won't take effect
>> immediately. It only merges them back if the directory is under the
>> calculated threshold and a write occurs (maybe a read, I forget).
>>
> If it's a read a plain scrub might do the trick.
>
> Christian
>> Warren
>>
>>
>> From: ceph-users
>> mailto:ceph-users-boun...@lists.ceph.com>>
>> on behalf of Wade Holler
>> mailto:wade.hol...@gmail.com>> Date: Monday, June
>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>> mailto:blair.bethwa...@gmail.com>>, Wido den
>> Hollander mailto:w...@42on.com>> Cc: Ceph Development
>> mailto:ceph-de...@vger.kernel.org>>,
>> "ceph-users@lists.ceph.com"
>> mailto:ceph-users@lists.ceph.com>> Subject:
>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>> in pool
>>
>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>> testing with different pg_num and filestore_split_multiple settings.
>> Early indications are  well not great. Regardless it is nice to
>> understand the symptoms better so we try to design around it.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>> mailto:blair.bethwa...@gmail.com>> wrote: On
>> 20 June 2016 at 09:21, Blair Bethwaite
>> mailto:blair.bethwa...@gmail.com>> wrote:
>> > slow request issues). If you watch your xfs stats you'll likely get
>> > further confirmation. In my experience xs_dir_lookups balloons (which
>> > means directory lookups are missing cache and going to disk).
>>
>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>> problem we had only ephemerally set the new filestore merge/split
>> values - oops. Here's what started happening when we upgraded and
>> restarted a bunch of OSDs:
>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>
>> Seemed to cause lots of slow requests :-/. We corrected it about
>> 12:30, then still took a while to settle.
>>
>> --
>> Cheers,
>> ~Blairo
>>
>> This email and any files transmitted with it are confidential and
>> intended solely for the individual or entity to whom they are addressed.
>> If you have received this email in error destroy it immediately. ***
>> Walmart Confidential ***
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] about image's largest size

2016-06-22 Thread Ops Cloud

We want to run a backup server which has huge storage as its backend.
If we use the RBD client to mount a block device from Ceph, how large can a 
single image be? xxx TB, or PB?

Thank you.

-- 
Ops Cloud
o...@19cloud.net___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Blair Bethwaite
Wade, good to know.

For the record, what does this work out to roughly per OSD? And how
much RAM and how many PGs per OSD do you have?

What's your workload? I wonder whether for certain workloads (e.g.
RBD) it's better to increase default object size somewhat before
pushing the split/merge up a lot...

Cheers,

On 23 June 2016 at 11:26, Wade Holler  wrote:
> Based on everyones suggestions; The first modification to 50 / 16
> enabled our config to get to ~645Mill objects before the behavior in
> question was observed (~330 was the previous ceiling).  Subsequent
> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>
> Thank you all very much for your support and assistance.
>
> Best Regards,
> Wade
>
>
> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer  wrote:
>>
>> Hello,
>>
>> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
>>
>>> Sorry, late to the party here. I agree, up the merge and split
>>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
>>> One of those things you just have to find out as an operator since it's
>>> not well documented :(
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>>>
>>> We have over 200 million objects in this cluster, and it's still doing
>>> over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
>>> journals. Having enough memory and dropping your vfs_cache_pressure
>>> should also help.
>>>
>> Indeed.
>>
>> Since it was asked in that bug report and also my first suspicion, it
>> would probably be good time to clarify that it isn't the splits that cause
>> the performance degradation, but the resulting inflation of dir entries
>> and exhaustion of SLAB and thus having to go to disk for things that
>> normally would be in memory.
>>
>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>> purely split caused degradation should have relented much quicker.
>>
>>
>>> Keep in mind that if you change the values, it won't take effect
>>> immediately. It only merges them back if the directory is under the
>>> calculated threshold and a write occurs (maybe a read, I forget).
>>>
>> If it's a read a plain scrub might do the trick.
>>
>> Christian
>>> Warren
>>>
>>>
>>> From: ceph-users
>>> mailto:ceph-users-boun...@lists.ceph.com>>
>>> on behalf of Wade Holler
>>> mailto:wade.hol...@gmail.com>> Date: Monday, June
>>> 20, 2016 at 2:48 PM To: Blair Bethwaite
>>> mailto:blair.bethwa...@gmail.com>>, Wido den
>>> Hollander mailto:w...@42on.com>> Cc: Ceph Development
>>> mailto:ceph-de...@vger.kernel.org>>,
>>> "ceph-users@lists.ceph.com"
>>> mailto:ceph-users@lists.ceph.com>> Subject:
>>> Re: [ceph-users] Dramatic performance drop at certain number of objects
>>> in pool
>>>
>>> Thanks everyone for your replies.  I sincerely appreciate it. We are
>>> testing with different pg_num and filestore_split_multiple settings.
>>> Early indications are  well not great. Regardless it is nice to
>>> understand the symptoms better so we try to design around it.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite
>>> mailto:blair.bethwa...@gmail.com>> wrote: On
>>> 20 June 2016 at 09:21, Blair Bethwaite
>>> mailto:blair.bethwa...@gmail.com>> wrote:
>>> > slow request issues). If you watch your xfs stats you'll likely get
>>> > further confirmation. In my experience xs_dir_lookups balloons (which
>>> > means directory lookups are missing cache and going to disk).
>>>
>>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
>>> preparation for Jewel/RHCS2. Turns out when we last hit this very
>>> problem we had only ephemerally set the new filestore merge/split
>>> values - oops. Here's what started happening when we upgraded and
>>> restarted a bunch of OSDs:
>>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
>>>
>>> Seemed to cause lots of slow requests :-/. We corrected it about
>>> 12:30, then still took a while to settle.
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>>
>>> This email and any files transmitted with it are confidential and
>>> intended solely for the individual or entity to whom they are addressed.
>>> If you have received this email in error destroy it immediately. ***
>>> Walmart Confidential ***
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com   Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Wade Holler
Blairo,

We'll speak in pre-replication numbers, replication for this pool is 3.

23.3 Million Objects / OSD
pg_num 2048
16 OSDs / Server
3 Servers
660 GB RAM Total, 179 GB Used (free -t) / Server
vm.swappiness = 1
vm.vfs_cache_pressure = 100

Workload is native librados with python.  ALL 4k objects.

Best Regards,
Wade


On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
 wrote:
> Wade, good to know.
>
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
>
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
>
> Cheers,
>
> On 23 June 2016 at 11:26, Wade Holler  wrote:
>> Based on everyones suggestions; The first modification to 50 / 16
>> enabled our config to get to ~645Mill objects before the behavior in
>> question was observed (~330 was the previous ceiling).  Subsequent
>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>
>> Thank you all very much for your support and assistance.
>>
>> Best Regards,
>> Wade
>>
>>
>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer  wrote:
>>>
>>> Hello,
>>>
>>> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
>>>
 Sorry, late to the party here. I agree, up the merge and split
 thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
 One of those things you just have to find out as an operator since it's
 not well documented :(

 https://bugzilla.redhat.com/show_bug.cgi?id=1219974

 We have over 200 million objects in this cluster, and it's still doing
 over 15000 write IOPS all day long with 302 spinning drives + SATA SSD
 journals. Having enough memory and dropping your vfs_cache_pressure
 should also help.

>>> Indeed.
>>>
>>> Since it was asked in that bug report and was also my first suspicion, it
>>> would probably be a good time to clarify that it isn't the splits that cause
>>> the performance degradation, but the resulting inflation of dir entries
>>> and exhaustion of SLAB and thus having to go to disk for things that
>>> normally would be in memory.
>>>
>>> Looking at Blair's graph from yesterday pretty much makes that clear, a
>>> purely split caused degradation should have relented much quicker.
>>>
>>>
 Keep in mind that if you change the values, it won't take effect
 immediately. It only merges them back if the directory is under the
 calculated threshold and a write occurs (maybe a read, I forget).

>>> If it's a read a plain scrub might do the trick.
>>>
>>> Christian
 Warren


 From: ceph-users <ceph-users-boun...@lists.ceph.com>
 on behalf of Wade Holler <wade.hol...@gmail.com>
 Date: Monday, June 20, 2016 at 2:48 PM
 To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
 Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
 Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

 Thanks everyone for your replies.  I sincerely appreciate it. We are
 testing with different pg_num and filestore_split_multiple settings.
 Early indications are, well, not great. Regardless, it is nice to
 understand the symptoms better so we try to design around it.

 Best Regards,
 Wade


 On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
 On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
 > slow request issues). If you watch your xfs stats you'll likely get
 > further confirmation. In my experience xs_dir_lookups balloons (which
 > means directory lookups are missing cache and going to disk).

 Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
 preparation for Jewel/RHCS2. Turns out when we last hit this very
 problem we had only ephemerally set the new filestore merge/split
 values - oops. Here's what started happening when we upgraded and
 restarted a bunch of OSDs:
 https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png

 Seemed to cause lots of slow requests :-/. We corrected it about
 12:30, then still took a while to settle.

 --
 Cheers,
 ~Blairo

 This email and any files transmitted with it are confidential and
 intended solely for the individual or entity to whom they are addressed.
 If you have received this email in error destroy it immediately. ***
 Walmart Confidential ***
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com   Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>
>
>
> --
> Cheers,
> ~Blairo
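
For readers following along, a minimal sketch of where the values discussed in
this thread live, assuming the first number quoted (e.g. 50) is
filestore_merge_threshold and the second (16, 24, ...) is
filestore_split_multiple. With filestore, a PG subdirectory is split once it
holds roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16
objects:

  # /etc/ceph/ceph.conf on the OSD hosts (restart OSDs afterwards)
  [osd]
      filestore merge threshold = 50
      filestore split multiple = 24
  # with these values splitting starts at roughly 24 * 50 * 16 = 19200 objects
  # per subdirectory; the numbers are the ones quoted above, not a general
  # recommendation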

Re: [ceph-users] Ceph 10.1.1 rbd map fail

2016-06-22 Thread Brad Hubbard
On Wed, Jun 22, 2016 at 3:20 PM, 王海涛  wrote:
> I find this message in dmesg:
> [83090.212918] libceph: mon0 192.168.159.128:6789 feature set mismatch, my
> 4a042a42 < server's 2004a042a42, missing 200
>
> According to
> "http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client";,
> this could mean that I need to upgrade kernel client up to 3.15 or disable
> tunable 3 features.
> Upgrading our cluster would not be convenient.
> Could you tell me how to disable tunable 3 features?

Can you show the output of the following command please?

# ceph osd crush show-tunables -f json-pretty

I believe you'll need to use "ceph osd crush tunables <profile>" to adjust this.
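
(As an illustration only, not a recommendation for this particular cluster:
the "chooseleaf_vary_r" value in the show-tunables output corresponds to the
tunables3 feature the old kernel lacks, and a profile that predates it, e.g.
bobtail, clears that requirement. Be aware that changing the tunables profile
triggers data movement.)

# ceph osd crush tunables bobtail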

>
> Thanks!
>
> Kind Regards,
> Haitao Wang
>
>
> At 2016-06-22 12:33:42, "Brad Hubbard"  wrote:
>>On Wed, Jun 22, 2016 at 1:35 PM, 王海涛  wrote:
>>> Hi All
>>>
>>> I'm using ceph-10.1.1 to map a rbd image ,but it dosen't work ,the error
>>> messages are:
>>>
>>> root@heaven:~#rbd map rbd/myimage --id admin
>>> 2016-06-22 11:16:34.546623 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> 2016-06-22 11:16:34.547166 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> 2016-06-22 11:16:34.549018 7fc87ca53d80 -1 WARNING: the following
>>> dangerous
>>> and experimental features are enabled: bluestore,rocksdb
>>> rbd: sysfs write failed
>>> rbd: map failed: (5) Input/output error
>>
>>Anything in dmesg, or anywhere, about "feature set mismatch" ?
>>
>>http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
>>
>>>
>>> Could someone tell me what's wrong?
>>> Thanks!
>>>
>>> Kind Regards,
>>> Haitao Wang
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>>--
>>Cheers,
>>Brad



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Blair Bethwaite
On 23 June 2016 at 11:41, Wade Holler  wrote:
> Workload is native librados with python.  ALL 4k objects.

Was that meant to be 4MB?

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Christian Balzer
On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:

> On 23 June 2016 at 11:41, Wade Holler  wrote:
> > Workload is native librados with python.  ALL 4k objects.
> 
> Was that meant to be 4MB?
> 
Nope, he means 4K, he's putting lots of small objects via a python script
into the cluster to test for exactly this problem.

See his original post.


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Wade Holler
No.  Our application writes very small objects.


On Wed, Jun 22, 2016 at 10:01 PM, Blair Bethwaite
 wrote:
> On 23 June 2016 at 11:41, Wade Holler  wrote:
>> Workload is native librados with python.  ALL 4k objects.
>
> Was that meant to be 4MB?
>
> --
> Cheers,
> ~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Blair Bethwaite
Hi Christian,

Ah ok, I didn't see object size mentioned earlier. But I guess direct
rados small objects would be a rarish use-case and explains the very
high object counts.

I'm interested in finding the right balance for RBD given object size
is another variable that can be tweaked there. I recall the
UnitedStack folks using 32MB.

Cheers,

On 23 June 2016 at 12:28, Christian Balzer  wrote:
> On Thu, 23 Jun 2016 12:01:38 +1000 Blair Bethwaite wrote:
>
>> On 23 June 2016 at 11:41, Wade Holler  wrote:
>> > Workload is native librados with python.  ALL 4k objects.
>>
>> Was that meant to be 4MB?
>>
> Nope, he means 4K, he's putting lots of small objects via a python script
> into the cluster to test for exactly this problem.
>
> See his original post.
>
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Christian Balzer
On Thu, 23 Jun 2016 11:33:05 +1000 Blair Bethwaite wrote:

> Wade, good to know.
> 
> For the record, what does this work out to roughly per OSD? And how
> much RAM and how many PGs per OSD do you have?
> 
> What's your workload? I wonder whether for certain workloads (e.g.
> RBD) it's better to increase default object size somewhat before
> pushing the split/merge up a lot...
> 
I'd posit that RBD is _least_ likely to encounter this issue in a
moderately balanced setup.
Think about it, a 4MB RBD object can hold literally hundreds of files.

While with CephFS or RGW, a file or S3 object is going to cost you about 2
RADOS objects each.

Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
the available space. 
Don't think I could hit this problem before running out of space.

Christian

> Cheers,
> 
> On 23 June 2016 at 11:26, Wade Holler  wrote:
> > Based on everyones suggestions; The first modification to 50 / 16
> > enabled our config to get to ~645Mill objects before the behavior in
> > question was observed (~330 was the previous ceiling).  Subsequent
> > modification to 50 / 24 has enabled us to get to 1.1 Billion+
> >
> > Thank you all very much for your support and assistance.
> >
> > Best Regards,
> > Wade
> >
> >
> > On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer 
> > wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
> >>
> >>> Sorry, late to the party here. I agree, up the merge and split
> >>> thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> >>> One of those things you just have to find out as an operator since
> >>> it's not well documented :(
> >>>
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>
> >>> We have over 200 million objects in this cluster, and it's still
> >>> doing over 15000 write IOPS all day long with 302 spinning drives +
> >>> SATA SSD journals. Having enough memory and dropping your
> >>> vfs_cache_pressure should also help.
> >>>
> >> Indeed.
> >>
> >> Since it was asked in that bug report and was also my first suspicion, it
> >> would probably be a good time to clarify that it isn't the splits that
> >> cause the performance degradation, but the resulting inflation of dir
> >> entries and exhaustion of SLAB and thus having to go to disk for
> >> things that normally would be in memory.
> >>
> >> Looking at Blair's graph from yesterday pretty much makes that clear,
> >> a purely split caused degradation should have relented much quicker.
> >>
> >>
> >>> Keep in mind that if you change the values, it won't take effect
> >>> immediately. It only merges them back if the directory is under the
> >>> calculated threshold and a write occurs (maybe a read, I forget).
> >>>
> >> If it's a read a plain scrub might do the trick.
> >>
> >> Christian
> >>> Warren
> >>>
> >>>
> >>> From: ceph-users <ceph-users-boun...@lists.ceph.com>
> >>> on behalf of Wade Holler <wade.hol...@gmail.com>
> >>> Date: Monday, June 20, 2016 at 2:48 PM
> >>> To: Blair Bethwaite <blair.bethwa...@gmail.com>, Wido den Hollander <w...@42on.com>
> >>> Cc: Ceph Development <ceph-de...@vger.kernel.org>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> >>> Subject: Re: [ceph-users] Dramatic performance drop at certain number of objects in pool
> >>>
> >>> Thanks everyone for your replies.  I sincerely appreciate it. We are
> >>> testing with different pg_num and filestore_split_multiple settings.
> >>> Early indications are, well, not great. Regardless, it is nice to
> >>> understand the symptoms better so we try to design around it.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
> >>> On 20 June 2016 at 09:21, Blair Bethwaite <blair.bethwa...@gmail.com> wrote:
> >>> > slow request issues). If you watch your xfs stats you'll likely get
> >>> > further confirmation. In my experience xs_dir_lookups balloons
> >>> > (which means directory lookups are missing cache and going to
> >>> > disk).
> >>>
> >>> Murphy's a bitch. Today we upgraded a cluster to latest Hammer in
> >>> preparation for Jewel/RHCS2. Turns out when we last hit this very
> >>> problem we had only ephemerally set the new filestore merge/split
> >>> values - oops. Here's what started happening when we upgraded and
> >>> restarted a bunch of OSDs:
> >>> https://au-east.erc.monash.edu.au/swift/v1/public/grafana-ceph-xs_dir_lookup.png
> >>>
> >>> Seemed to cause lots of slow requests :-/. We corrected it about
> >>> 12:30, then still took a while to settle.
> >>>
> >>> --
> >>> Cheers,
> >>> ~Blairo
> >>>
> >>> This email and any files transmitted with it are confidential and
> >>> intended solely for the individual or entity to whom they are addressed.

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Blair Bethwaite
On 23 June 2016 at 12:37, Christian Balzer  wrote:
> Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7% of
> the available space.
> Don't think I could hit this problem before running out of space.

Perhaps. However ~30TB per server is pretty low with present HDD
sizes. In the pool on our large cluster where we've seen this issue we
have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
testing at about 20% usage (with default 4MB objects). We went to 40 /
8. Then as I reported the other day we hit the issue again at
somewhere around 50% usage. Now we're at 50 / 12.

The boxes mentioned above are a couple of years old. Today we're
buying 2RU servers with 128TB in them (16x 8TB)!

Replacing our current NAS on RBD setup with CephFS is now starting to
scare me...

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-22 Thread Christian Balzer

Hello Blair, hello Wade (see below),

On Thu, 23 Jun 2016 12:55:17 +1000 Blair Bethwaite wrote:

> On 23 June 2016 at 12:37, Christian Balzer  wrote:
> > Case in point, my main cluster (RBD images only) with 18 5+TB OSDs on 3
> > servers (64GB RAM each) has 1.8 million 4MB RBD objects using about 7%
> > of the available space.
> > Don't think I could hit this problem before running out of space.
> 
> Perhaps. However ~30TB per server is pretty low with present HDD
> sizes. 
These are in fact 24 3TB HDDs per server, but in 6 RAID10s with 4 HDDs
each.

>In the pool on our large cluster where we've seen this issue we
> have 24x 4TB OSDs per server, and we first hit the problem in pre-prod
> testing at about 20% usage (with default 4MB objects). We went to 40 /
> 8. Then as I reported the other day we hit the issue again at
> somewhere around 50% usage. Now we're at 50 / 12.
> 
High-density storage servers have a number of other gotchas and tuning
requirements; I'd consider this simply another one.

As for increasing the default RBD object size, I'd be wary about
performance impacts, especially if you are ever going to have a cache-tier.

If there is definitely no cache-tier in your future, striping might
counteract the downsides of larger objects.

> The boxes mentioned above are a couple of years old. Today we're
> buying 2RU servers with 128TB in them (16x 8TB)!
> 
As people, myself included, have noticed and noted, large OSDs are pushing
things in more ways than just this issue.

I know very well how attractive it is from a cost and rack space (also
a cost factor of course) perspective to build dense storage nodes, but
most people need more IOPS than storage space and that's were smaller,
faster OSDs are better suited, as pointed out in the Ceph docs for a long
time.

> Replacing our current NAS on RBD setup with CephFS is now starting to
> scare me...
> 
If this is going to happen when Bluestore is stable, this _particular_
problem should be a non-issue hopefully.
I'm sure Murphy will find other amusing ways to keep us entertained
and high-stressed, though.
If nothing else, CephFS would scare me more than a by-now well-known
problem that can be tuned away.


A question/request for Wade: would it be possible to reformat your OSDs
with Ext4 (I know it's deprecated, but if you know what you're doing...) or BTRFS?
I'm wondering whether either avoids this behavior, or hits it at a
different point.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
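
To make the larger-object and striping options mentioned above concrete, a
hedged sketch using the Jewel-era rbd CLI. The pool/image names and sizes are
illustrative only; older clients take --order (log2 of the object size)
instead of --object-size, and non-default striping is a librbd feature that
the kernel RBD client may not support:

  # larger RADOS objects per image (librbd default is 4M):
  rbd create mypool/image-large --size 100G --object-size 32M

  # or keep 4M objects but stripe writes across a set of them:
  rbd create mypool/image-striped --size 100G --object-size 4M \
      --stripe-unit 1M --stripe-count 8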


Re: [ceph-users] performance issue with jewel on ubuntu xenial (kernel)

2016-06-22 Thread Florian Haas
On Wed, Jun 22, 2016 at 10:56 AM, Yoann Moulin  wrote:
> Hello Florian,
>
>> On Tue, Jun 21, 2016 at 3:11 PM, Yoann Moulin  wrote:
>>> Hello,
>>>
>>> I found a performance drop between kernel 3.13.0-88 (default kernel on 
>>> Ubuntu
>>> Trusty 14.04) and kernel 4.4.0.24.14 (default kernel on Ubuntu Xenial 16.04)
>>>
>>> ceph version is Jewel (10.2.2).
>>> All tests have been done under Ubuntu 14.04
>>
>> Knowing that you also have an infernalis cluster on almost identical
>> hardware, can you please let the list know whether you see the same
>> behavior (severely reduced throughput on a 4.4 kernel, vs. 3.13) on
>> that cluster as well?
>
> ceph version is infernalis (9.2.0)
>
> Ceph osd Benchmark:
>
> Kernel 3.13.0-88-generic : ceph tell osd.ID => average ~84MB/s
> Kernel 4.2.0-38-generic  : ceph tell osd.ID => average ~90MB/s
> Kernel 4.4.0-24-generic  : ceph tell osd.ID => average ~75MB/s
>
> The slow down is not as much as I have with Jewel but it is still present.

But this is not on precisely identical hardware, is it?

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
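
For anyone wanting to reproduce this comparison: the per-OSD figure quoted
here comes from the built-in OSD bench command. A minimal sketch, assuming the
default arguments (1 GB written in 4 MB chunks; the exact invocation used for
the numbers above is not shown):

  # single OSD
  ceph tell osd.0 bench

  # all OSDs in turn, printing the reported throughput
  for id in $(ceph osd ls); do
      echo -n "osd.$id "
      ceph tell osd.$id bench | grep bytes_per_sec
  done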


Re: [ceph-users] performance issue with jewel on ubuntu xenial (kernel)

2016-06-22 Thread Sarni Sofiane
Hi Florian,

All the benchmarks were run on strictly identical hardware setups per node.
Clusters differ slightly in sizes (infernalis vs jewel) but nodes and OSDs are 
identical.

Best regards,
Sofiane

On 23.06.16 06:25, "ceph-users on behalf of Florian Haas" 
 wrote:

>On Wed, Jun 22, 2016 at 10:56 AM, Yoann Moulin  wrote:
>> Hello Florian,
>>
>>> On Tue, Jun 21, 2016 at 3:11 PM, Yoann Moulin  wrote:
 Hello,

 I found a performance drop between kernel 3.13.0-88 (default kernel on 
 Ubuntu
 Trusty 14.04) and kernel 4.4.0.24.14 (default kernel on Ubuntu Xenial 
 16.04)

 ceph version is Jewel (10.2.2).
 All tests have been done under Ubuntu 14.04
>>>
>>> Knowing that you also have an infernalis cluster on almost identical
>>> hardware, can you please let the list know whether you see the same
>>> behavior (severely reduced throughput on a 4.4 kernel, vs. 3.13) on
>>> that cluster as well?
>>
>> ceph version is infernalis (9.2.0)
>>
>> Ceph osd Benchmark:
>>
>> Kernel 3.13.0-88-generic : ceph tell osd.ID => average ~84MB/s
>> Kernel 4.2.0-38-generic  : ceph tell osd.ID => average ~90MB/s
>> Kernel 4.4.0-24-generic  : ceph tell osd.ID => average ~75MB/s
>>
>> The slow down is not as much as I have with Jewel but it is still present.
>
>But this is not on precisely identical hardware, is it?
>
>Cheers,
>Florian
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com