[ceph-users] NAS on RBD

2014-09-09 Thread Blair Bethwaite
Hi folks,

In lieu of a prod ready Cephfs I'm wondering what others in the user
community are doing for file-serving out of Ceph clusters (if at all)?

We're just about to build a pretty large cluster - 2PB for file-based
NAS and another 0.5PB rgw. For the rgw component we plan to dip our
toes in and use an EC backing pool with a ~25TB (usable) 10K SAS + SSD
cache tier.

For the file storage we're looking at mounting RBDs (out of a standard
3-replica pool for now) on a collection of presentation nodes, which
will use ZFS to stripe together those RBD vdevs into a zpool which we
can then carve datasets out of for access from NFS & CIFS clients.
Those presentation servers will have some PCIe SSD in them for ZIL and
L2ARC devices, and clients will be split across them depending on what
ID domain they are coming from. Presentation server availability
issues will be handled by mounting the relevant zpool on a spare
server, so it won't be HA from a client perspective, but I can't see a
way to get this with an RBD backend.
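
For concreteness, the rough shape of it per presentation node would be
something like this (pool/image names, sizes and device paths are purely
illustrative):

rbd create nas/vol0 --size 10485760   # ~10TB image, one of several per zpool
rbd map nas/vol0                      # shows up as e.g. /dev/rbd0
zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3 \
    log /dev/nvme0n1p1 cache /dev/nvme0n1p2
zfs create -o sharenfs=on tank/projects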

Wondering what the collective wisdom has to offer on such a setup...

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] resizing the OSD

2014-09-09 Thread Martin B Nielsen
Hi,

Or did you mean some OSD are near full while others are under-utilized?

On Sat, Sep 6, 2014 at 5:04 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Fri, 05 Sep 2014 15:31:01 -0700 JIten Shah wrote:
>
> > Hello Cephers,
> >
> > We created a ceph cluster with 100 OSDs, 5 MONs and 1 MDS, and most of the
> > stuff seems to be working fine, but we are seeing some degradation on the
> > OSDs due to lack of space on the OSDs.
>
> Please elaborate on that degradation.
>
> > Is there a way to resize the
> > OSD without bringing the cluster down?
> >
>
> Define both "resize" and "cluster down".
>
> As in, resizing how?
> Are your current OSDs on disks/LVMs that are not fully used and thus could
> be grown?
> What is the size of your current OSDs?
>
> The normal way of growing a cluster is to add more OSDs.
> Preferably of the same size and same performance disks.
> This will not only simplify things immensely but also make them a lot more
> predictable.
> This of course depends on your use case and usage patterns, but often when
> running out of space you're also running out of other resources like CPU,
> memory or IOPS of the disks involved. So adding more instead of growing
> them is most likely the way forward.
>
> If you were to replace actual disks with larger ones, take them (the OSDs)
> out one at a time and re-add it. If you're using ceph-deploy, it will use
> the disk size as basic weight, if you're doing things manually make sure
> to specify that size/weight accordingly.
> Again, you do want to do this for all disks to keep things uniform.
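> For example, roughly along these lines (the OSD id, weight and host name
> are purely illustrative; the weight should roughly match the disk size in TB):
>
> ceph osd out osd.12
> # once the data has migrated off, remove it and re-add it on the new disk
> ceph osd crush remove osd.12
> ceph auth del osd.12
> ceph osd rm osd.12
> ceph osd crush add osd.12 4.0 host=node12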
>

Just want to emphasize this - if your disks already have high utilization
and you add a [much] larger drive that gets auto-weighted at, say, 2 or 3x the
other disks, that disk will see that much more utilization and will most
likely max out and bottleneck your cluster. So keep that in mind :).
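
If that does happen, you can always dial the weight back down by hand, e.g.
something like (id and weight purely illustrative):

ceph osd crush reweight osd.30 1.0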

Cheers,
Martin


>
> If your cluster (pools really) are set to a replica size of at least 2
> (risky!) or 3 (as per Firefly default), taking a single OSD out would of
> course never bring the cluster down.
> However taking an OSD out and/or adding a new one will cause data movement
> that might impact your cluster's performance.
>
> Regards,
>
> Christian
> --
Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitoring tool for monitoring end-user

2014-09-09 Thread pragya jain
Could somebody please reply and clarify this for me?

Regards
Pragya Jain


On Wednesday, 3 September 2014 12:14 PM, pragya jain  
wrote:
 

>
>
>hi all!
>
>
>Is there any monitoring tool for ceph which monitors end-user level usage and 
>data transfer for the ceph object storage service?
>
>
>Please share any information related to this. 
>
>
>Regards
>Pragya Jain
>
>___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Christian Balzer

Hello,

On Tue, 9 Sep 2014 17:05:03 +1000 Blair Bethwaite wrote:

> Hi folks,
> 
> In lieu of a prod ready Cephfs I'm wondering what others in the user
> community are doing for file-serving out of Ceph clusters (if at all)?
> 
> We're just about to build a pretty large cluster - 2PB for file-based
> NAS and another 0.5PB rgw. For the rgw component we plan to dip our
> toes in and use an EC backing pool with a ~25TB (usable) 10K SAS + SSD
> cache tier.
> 
> For the file storage we're looking at mounting RBDs (out of a standard
> 3-replica pool for now) on a collection of presentation nodes, which
> will use ZFS to stripe together those RBD vdevs into a zpool which we
> can then carve datasets out of for access from NFS & CIFS clients.
> Those presentation servers will have some PCIe SSD in them for ZIL and
> L2ARC devices, and clients will be split across them depending on what
> ID domain they are coming from. Presentation server availability
> issues will be handled by mounting the relevant zpool on a spare
> server, so it won't be HA from a client perspective, but I can't see a
> way to getting this with an RBD backend.
> 
> Wondering what the collective wisdom has to offer on such a setup...
> 
I have nearly no experience with ZFS, but I'm wondering why you'd pool
things at that level when Ceph is already supplying a redundant and
resizeable block device. 

Wanting to use ZFS because of checksumming, which is sorely missing in
Ceph, I can understand. 

Using a CoW filesystem on top of RBD might not be a great idea either:
since it is sparsely allocated, performance is likely to be bad until all
"blocks" have actually been allocated. Maybe somebody with experience in
that area can pipe up. 

Something that ties into the previous point: kernel-based RBD currently
does not support TRIM, so even if you were to use something other than
ZFS, you'd never be able to get that space back.

There are HA NFS cluster examples based on pacemaker (and usually backed
up by DRBD or a SAN) on the net and I think I've seen people here doing
things based on that, too.

I would start with that and coerce the Ceph developers to get that TRIM
support into the kernel after thinking about it for 2 years or so.

Another scenario might be running the NFS heads on VMs, thus using librbd
and having TRIM (with the correct disk device type). And again using
pacemaker to quickly fail things over.
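
As a rough sketch of what I mean by the correct device type, a libvirt disk
definition along these lines (pool/image and monitor names are made up, and
you'd also need a virtio-scsi controller plus the usual cephx auth section)
should expose discard inside the VM:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <source protocol='rbd' name='rbd/nfs-head-vol0'>
    <host name='mon1.example.com' port='6789'/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>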

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] number of PGs (global vs per pool)

2014-09-09 Thread Luis Periquito
I was reading on the number of PGs we should have for a cluster, and I
found the formula to place 100 PGs in each OSD (
http://ceph.com/docs/master/rados/operations/placement-groups/).

Now this formula has generated some discussion as to how many PGs we should
have in each pool.

Currently our main cluster is being used for S3, cephFS and RBD data type
usage. So we have 3 very big pools (data, .rgw.buckets and rbd) and 9 small
pools (all the remaining ones).

As we have a total of 60 OSDs we've been discussing how many PGs we should
really have. We are using a replication of 4.
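(That figure of ~1500 below comes straight from the formula on that page:
60 OSDs x 100 PGs / 4 replicas = 1500.)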

Should we have a total around 1500 PGs distributed over all the pools
(total PGs) or should we have the big pools each with 1500 PGs for a total
around 5000 PGs on the cluster?

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Ilya Dryomov
On Tue, Sep 9, 2014 at 12:33 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Tue, 9 Sep 2014 17:05:03 +1000 Blair Bethwaite wrote:
>
>> Hi folks,
>>
>> In lieu of a prod ready Cephfs I'm wondering what others in the user
>> community are doing for file-serving out of Ceph clusters (if at all)?
>>
>> We're just about to build a pretty large cluster - 2PB for file-based
>> NAS and another 0.5PB rgw. For the rgw component we plan to dip our
>> toes in and use an EC backing pool with a ~25TB (usable) 10K SAS + SSD
>> cache tier.
>>
>> For the file storage we're looking at mounting RBDs (out of a standard
>> 3-replica pool for now) on a collection of presentation nodes, which
>> will use ZFS to stripe together those RBD vdevs into a zpool which we
>> can then carve datasets out of for access from NFS & CIFS clients.
>> Those presentation servers will have some PCIe SSD in them for ZIL and
>> L2ARC devices, and clients will be split across them depending on what
>> ID domain they are coming from. Presentation server availability
>> issues will be handled by mounting the relevant zpool on a spare
>> server, so it won't be HA from a client perspective, but I can't see a
>> way to getting this with an RBD backend.
>>
>> Wondering what the collective wisdom has to offer on such a setup...
>>
> I have nearly no experience with ZFS, but I'm wondering why you'd pool
> things on the level when Ceph is already supplying a redundant and
> resizeable block device.
>
> Wanting to use ZFS because of checksumming, which is sorely missing in
> Ceph, I can understand.
>
> Using a CoW filesystem on top of RBD might not be a great idea either,
> since it is sparsely allocated, performance is likely to be bad until all
> "blocks" have been actually allocated. Maybe somebody with experience in
> that can pipe up.
>
> Something that ties into the previous point, kernel based RBD currently
> does not support TRIM, so even if you were to use something other than
> ZFS, you'd never be able to get that space back.

Initial discard support will be in the 3.18 kernel.  (We have it in
testing and, unless something critical comes up, 3.18 is our target.)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] number of PGs (global vs per pool)

2014-09-09 Thread Wido den Hollander

On 09/09/2014 10:42 AM, Luis Periquito wrote:

I was reading on the number of PGs we should have for a cluster, and I
found the formula to place 100 PGs in each OSD
(http://ceph.com/docs/master/rados/operations/placement-groups/).

Now this formula has generated some discussion as to how many PGs we
should have in each pool.

Currently our main cluster is being used for S3, cephFS and RBD data
type usage. So we have 3 very big pools (data, .rgw.buckets and rbd) and
9 small pools (all the remaining ones).

As we have a total of 60 OSDs we've been discussing how many PGs we
should really have. We are using a replication of 4.

Should we have a total around 1500 PGs distributed over all the pools
(total PGs) or should we have the big pools each with 1500 PGs for a
total around 5000 PGs on the cluster?



Balance it a bit. Give the big pools 1024 PGs and the smaller pools 256, 
for example.


Each PG will consume memory, so you have to make sure the total number 
of PGs isn't too high.
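
Per pool that would be something like this, for example (pg_num can only be
increased, and pgp_num should be bumped to match):

ceph osd pool set .rgw.buckets pg_num 1024
ceph osd pool set .rgw.buckets pgp_num 1024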



thanks,


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] number of PGs (global vs per pool)

2014-09-09 Thread Christian Balzer

Hello,

On Tue, 9 Sep 2014 09:42:13 +0100 Luis Periquito wrote:

> I was reading on the number of PGs we should have for a cluster, and I
> found the formula to place 100 PGs in each OSD (
> http://ceph.com/docs/master/rados/operations/placement-groups/).
> 
> Now this formula has generated some discussion as to how many PGs we
> should have in each pool.
> 
> Currently our main cluster is being used for S3, cephFS and RBD data type
> usage. So we have 3 very big pools (data, .rgw.buckets and rbd) and 9
> small pools (all the remaining ones).
> 
> As we have a total of 60 OSDs we've been discussing how many PGs we
> should really have. We are using a replication of 4.
> 
> Should we have a total around 1500 PGs distributed over all the pools
> (total PGs) or should we have the big pools each with 1500 PGs for a
> total around 5000 PGs on the cluster?
> 
As it says in the documentation you've linked up there:
---
When using multiple data pools for storing objects, you need to ensure
that you balance the number of placement groups per pool with the number
of placement groups per OSD so that you arrive at a reasonable total
number of placement groups that provides reasonably low variance per OSD
without taxing system resources or making the peering process too slow.
---

Also as stated on the same page, you will want to round up that 1500 to
2048 for starters.

With smaller clusters, it is beneficial to overprovision PGs for various
reasons (smoother data distribution, etc). 

The larger the cluster gets, the closer you will want to adhere to that 100
PGs per OSD, as the resource usage (memory, CPU, network peering traffic)
creeps up.

So as Wido just wrote (I'm clearly typing too slowly ^o^), balance it out
according to usage. I only use RBD, so my 2 other default pools stay at a
measly 64 PGs while RBD gets all the PG loving. 

In your case (it really depends on how much data is in the pools) you could
do 512 PGs for the 3 big ones and 64 PGs for the small ones and stay
within the recommended limits. 

However if you're planning on growing this cluster further and your
current hardware has plenty of reserves, I would go with the 1024 PGs for
big pools and 128 or 256 for the small ones.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85

2014-09-09 Thread Haomai Wang
Hi,

Thanks for your report, I will fix it
(https://github.com/ceph/ceph/pull/2429).

Because KeyValueStore is intended as an experimental backend, we still don't
have enough test suite coverage for it.



On Tue, Sep 9, 2014 at 11:02 AM, 廖建锋  wrote:

>  Looks like it doesn't work. I noticed that 0.85 added a superblock to the
> leveldb OSDs; the OSDs which I already have do not have a superblock.
> Can anybody tell me how to upgrade the OSDs?
>
>
>
>  *From:* ceph-users 
> *Sent:* 2014-09-09 10:32
> *To:* ceph-users 
> *Subject:* [ceph-users] mix ceph version with 0.80.5 and 0.85
>   Dear all,
>  As there are a lot of bugs in the keyvalue backend of the 0.80.5 Firefly
> version, I want to upgrade to 0.85 for some OSDs which are already down
> and unable to start, and keep the other OSDs on 0.80.5. I am wondering,
> will it work?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85

2014-09-09 Thread 廖建锋
I re-installed the whole cluster with ceph 0.85 and lost all my 10T of data.
Now I have another question: I have no way to re-create the pool.


264 => # ceph osd pool delete data data --yes-i-really-really-mean-it
Error EBUSY: pool 'data' is in use by CephFS



From: Haomai Wang
Sent: 2014-09-09 17:28
To: 廖建锋
Cc: ceph-users; 
ceph-users
Subject: Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85
Hi,

Thanks for your report, I will fix it (https://github.com/ceph/ceph/pull/2429).

Because KeyValueStore is intended as an experimental backend, we still don't have 
enough test suite coverage for it.



On Tue, Sep 9, 2014 at 11:02 AM, 廖建锋 <de...@f-club.cn> 
wrote:
Looks like it doesn't work. I noticed that 0.85 added a superblock to the leveldb 
OSDs; the OSDs which I already have do not have a superblock.
Can anybody tell me how to upgrade the OSDs?



From: ceph-users
Sent: 2014-09-09 10:32
To: ceph-users
Subject: [ceph-users] mix ceph version with 0.80.5 and 0.85
Dear all,
As there are a lot of bugs in the keyvalue backend of the 0.80.5 Firefly version,
I want to upgrade to 0.85 for some OSDs which are already down and unable to
start,
and keep the other OSDs on 0.80.5. I am wondering, will it work?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: Re: mix ceph version with 0.80.5 and 0.85

2014-09-09 Thread 廖建锋
I solved it by creating a new pool and then removing the old one.
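
For the record, roughly what I did (pool name and PG counts are just examples;
the pool ids for the newfs step come from 'ceph osd dump'):

ceph osd pool create data2 1024 1024
ceph mds newfs <metadata pool id> <data2 pool id> --yes-i-really-mean-it
ceph osd pool delete data data --yes-i-really-really-mean-it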

From: 廖建锋
Sent: 2014-09-09 17:39
To: haomaiwang
Cc: ceph-users; 
ceph-users
Subject: Re: Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85
I re-installed the whole cluster with ceph 0.85 and lost all my 10T of data.
Now I have another question: I have no way to re-create the pool.


264 => # ceph osd pool delete data data --yes-i-really-really-mean-it
Error EBUSY: pool 'data' is in use by CephFS



From: Haomai Wang
Sent: 2014-09-09 17:28
To: 廖建锋
Cc: ceph-users; 
ceph-users
Subject: Re: [ceph-users] Re: mix ceph version with 0.80.5 and 0.85
Hi,

Thanks for your report, I will fix it (https://github.com/ceph/ceph/pull/2429).

Because KeyValueStore is intended as an experimental backend, we still don't have 
enough test suite coverage for it.



On Tue, Sep 9, 2014 at 11:02 AM, 廖建锋 <de...@f-club.cn> 
wrote:
Looks like it doesn't work. I noticed that 0.85 added a superblock to the leveldb 
OSDs; the OSDs which I already have do not have a superblock.
Can anybody tell me how to upgrade the OSDs?



From: ceph-users
Sent: 2014-09-09 10:32
To: ceph-users
Subject: [ceph-users] mix ceph version with 0.80.5 and 0.85
Dear all,
As there are a lot of bugs in the keyvalue backend of the 0.80.5 Firefly version,
I want to upgrade to 0.85 for some OSDs which are already down and unable to
start,
and keep the other OSDs on 0.80.5. I am wondering, will it work?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-09 Thread BG
Loic Dachary  writes:

> 
> Hi,
> 
> It looks like your osd.0 is down and you only have one osd left (osd.1)
> which would explain why the cluster cannot get to a healthy state. The "size
> 2" in  "pool 0 'data' replicated size 2 ..." means the pool needs at
> least two OSDs up to function properly. Do you know why the osd.0 is not up ?
> 
> Cheers
> 

I've been trying unsuccessfully to get this up and running since then. I've added
another OSD but still can't get to the "active + clean" state. I'm not even sure if
the problems I'm having are related to the OS version, but I'm running out of
ideas, and unless somebody here can spot something obvious in the logs below I'm
going to try rolling back to CentOS 6.

$ echo "HEALTH" && ceph health && echo "STATUS" && ceph status && echo
"OSD_DUMP" && ceph osd dump
HEALTH
HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
STATUS
cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
 health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
 monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum
 0 hp09
 osdmap e43: 3 osds: 3 up, 3 in
  pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
15469 MB used, 368 GB / 383 GB avail
 129 peering
  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags 
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
[0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988
10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval
[7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval
[11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:
2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09@0(leader).osd e72
prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605

Sample of warnings from osd.2 log:
2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply
from osd.1 ever on either front or back, first ping sent 2014-09-09
10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly marked
me down
2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features
1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >>
10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1
c=0x3ad8580).fault



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Dan Van Der Ster
Hi Blair,

> On 09 Sep 2014, at 09:05, Blair Bethwaite  wrote:
> 
> Hi folks,
> 
> In lieu of a prod ready Cephfs I'm wondering what others in the user
> community are doing for file-serving out of Ceph clusters (if at all)?
> 
> We're just about to build a pretty large cluster - 2PB for file-based
> NAS and another 0.5PB rgw. For the rgw component we plan to dip our
> toes in and use an EC backing pool with a ~25TB (usable) 10K SAS + SSD
> cache tier.
> 
> For the file storage we're looking at mounting RBDs (out of a standard
> 3-replica pool for now) on a collection of presentation nodes, which
> will use ZFS to stripe together those RBD vdevs into a zpool which we
> can then carve datasets out of for access from NFS & CIFS clients.
> Those presentation servers will have some PCIe SSD in them for ZIL and
> L2ARC devices, and clients will be split across them depending on what
> ID domain they are coming from. Presentation server availability
> issues will be handled by mounting the relevant zpool on a spare
> server, so it won't be HA from a client perspective, but I can't see a
> way to getting this with an RBD backend.
> 
> Wondering what the collective wisdom has to offer on such a setup…
> 

We do this for some small scale NAS use-cases, with ZFS running in a VM with 
rbd volumes. The performance is not great (especially since we throttle the 
IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL — the 
SSD solves any performance problem we ever had with ZFS on RBD.

I would say though that this setup is rather adventurous. ZoL is not rock solid 
— we’ve had a few lockups in testing, all of which have been fixed in the 
latest ZFS code in git (my colleague in CC could elaborate if you’re 
interested).  One thing I’m not comfortable with is the idea of ZFS checking 
the data in addition to Ceph. Sure, ZFS will tell us if there is a checksum 
error, but without any redundancy at the ZFS layer there will be no way to 
correct that error. Of course, the hope is that RADOS will ensure 100% data 
consistency, but what happens if not?...

Personally, I think you’re very brave to consider running 2PB of ZoL on RBD. If 
I were you I would seriously evaluate the CephFS option. It used to be on the 
roadmap for ICE 2.0 coming out this fall, though I noticed it's not there 
anymore (??!!!). Anyway I would say that ZoL on kRBD is not necessarily a more 
stable solution than CephFS. Even Gluster striped on top of RBD would probably 
be more stable than ZoL on RBD.

Cheers, Dan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem with customized crush rule for EC pool

2014-09-09 Thread Lei Dong
Hi ceph users:

I want to create a customized crush rule for my EC pool (with replica_size = 
11) to distribute replicas into 6 different Racks.

I used the following rule at first:

Step take default   // root
Step choose firstn 6 type rack // 6 racks, I have and only have 6 racks
Step chooseleaf indep 2 type osd // 2 osds per rack
Step emit

It looks fine and works fine when the PG num is small.
But when the PG num increases, there are always some PGs which cannot take all the 
6 racks.
It looks like “Step choose firstn 6 type rack” sometimes returns only 5 racks.
After some investigation, I think it may be caused by collisions in the choices.

Then I came up with another solution to avoid the collisions, like this:

Step take rack0
Step chooseleaf indep 2 type osd
Step emit
Step take rack1
….
(manually take every rack)

This won’t cause rack collisions, because I specify each rack by name. But 
the problem is that an osd in rack0 will always be the primary osd, because I 
choose from rack0 first.

So the question is what is the recommended way to meet such a need (distribute 
11 replicas into 6 racks evenly in case of rack failure)?


Thanks!
LeiDong


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with customized crush rule for EC pool

2014-09-09 Thread Loic Dachary
Hi,

It is indeed possible that mapping fails if there are just enough racks to 
match the constraint. And the probability of a bad mapping increases when the 
number of PGs increases, because there is a need for more mappings. You can tell 
crush to try harder with 

step set_chooseleaf_tries 10

Be careful though: increasing this number will change the mappings. It will not 
just fix the bad mappings you're seeing, it will also change the mappings that 
succeeded with a lower value. Once you've set this parameter, it cannot be 
modified.
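
In your first rule it would go right at the top of the steps, roughly like
this (just a sketch; keep whatever rule header you already have):

step set_chooseleaf_tries 10
step take default
step choose firstn 6 type rack
step chooseleaf indep 2 type osd
step emit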

Would you mind sharing the erasure code profile you plan to work with ?

Cheers

On 09/09/2014 12:39, Lei Dong wrote:
> Hi ceph users:
> 
> I want to create a customized crush rule for my EC pool (with replica_size = 
> 11) to distribute replicas into 6 different Racks. 
> 
> I use the following rule at first:
> 
> Step take default  // root
> Step choose firstn 6 type rack// 6 racks, I have and only have 6 racks
> Step chooseleaf indep 2 type osd // 2 osds per rack 
> Step emit
> 
> I looks fine and works fine when PG num is small. 
> But when pg num increase, there are always some PGs which can not take all 
> the 6 racks. 
> It looks like “Step choose firstn 6 type rack” sometimes returns only 5 racks.
> After some investigation,  I think it may caused by collision of choices.
> 
> Then I come up with another solution to solve collision like this:
> 
> Step take rack0
> Step chooseleaf indep 2 type osd
> Step emit
> Step take rack1
> ….
> (manually take every rack)
> 
> This won’t cause rack collision, because I specify rack by name at first. But 
> the problem is that osd in rack0 will always be primary osd because I choose 
> from rack0 first.
> 
> So the question is what is the recommended way to meet such a need 
> (distribute 11 replicas into 6 racks evenly in case of rack failure)?
> 
> 
> Thanks!
> LeiDong
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Blair Bethwaite
Hi Christian,

On 09/09/2014 6:33 PM, "Christian Balzer"  wrote:
> I have nearly no experience with ZFS, but I'm wondering why you'd pool
> things on the level when Ceph is already supplying a redundant and
> resizeable block device.

That's really subject to further testing. At this stage I'm just
guessing that multiple vdevs (one to one rbd-vdev mapping) may give
ZFS more opportunity to parallelise workload to the cluster, and if we
do need to expand a pool it's obvious we could just add vdevs in xTB
blocks rather than growing non-redundant vdevs (don't even know if
that is possible). I'm sketchy about how that works in ZFS so will
need some testing to determine if that's really the best option.

The reason for leaning towards ZFS is inline compression and native
read cache and write-log device support. The rich set of dataset level
features also doesn't hurt.

> Using a CoW filesystem on top of RBD might not be a great idea either,
> since it is sparsely allocated, performance is likely to be bad until all
> "blocks" have been actually allocated. Maybe somebody with experience in
> that can pipe up.

That's an interesting observation, though I must admit I'm struggling to
visualise the problem.


> Another scenario might be running the NFS heads on VMs, thus using librbd
> an having TRIM (with the correct disk device type). And again use
> pacemaker to quickly fail over things.

Ah yes, I forgot to mention plans for KVM based presentation servers
in order to get librbd rather than krbd - that's a good point, I
hadn't specifically thought about TRIM but rather just the general lag
of the kernel. (Those VMs would have pci pass-through for the latency
sensitive devices - vNIC, ZIL, L2ARC.)

Also planning nightly backups of these filesystems to tape via TSM
(using the agent journal, which seems to work okay with ZoL from basic
tests).

Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with customized crush rule for EC pool

2014-09-09 Thread Lei Dong
Thanks loic!

Actually I've found that increasing choose_local_fallback_tries can 
help (chooseleaf_tries helps less significantly), but I'm afraid that when an osd 
failure happens and a new acting set needs to be found, it may fail to find enough 
racks again. So I'm trying to find a more guaranteed way to handle osd failures.

My profile is nothing special other than k=8 m=3. 
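
i.e. it was created with something along the lines of (the profile name is
just an example):

ceph osd erasure-code-profile set ec-k8m3 k=8 m=3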

Thanks again!

Leidong





> On 9 September 2014, at 7:53 PM, "Loic Dachary"  wrote:
> 
> Hi,
> 
> It is indeed possible that mapping fails if there are just enough racks to 
> match the constraint. And the probability of a bad mapping increases when the 
> number of PG increases because there is a need for more mapping. You can tell 
> crush to try harder with 
> 
> step set_chooseleaf_tries 10
> 
> Be careful though : increasing this number will change mapping. It will not 
> just fix the bad mappings you're seeing, it will also change the mappings 
> that succeeded with a lower value. Once you've set this parameter, it cannot 
> be modified.
> 
> Would you mind sharing the erasure code profile you plan to work with ?
> 
> Cheers
> 
>> On 09/09/2014 12:39, Lei Dong wrote:
>> Hi ceph users:
>> 
>> I want to create a customized crush rule for my EC pool (with replica_size = 
>> 11) to distribute replicas into 6 different Racks. 
>> 
>> I use the following rule at first:
>> 
>> Step take default  // root
>> Step choose firstn 6 type rack// 6 racks, I have and only have 6 racks
>> Step chooseleaf indep 2 type osd // 2 osds per rack 
>> Step emit
>> 
>> I looks fine and works fine when PG num is small. 
>> But when pg num increase, there are always some PGs which can not take all 
>> the 6 racks. 
>> It looks like “Step choose firstn 6 type rack” sometimes returns only 5 
>> racks.
>> After some investigation,  I think it may caused by collision of choices.
>> 
>> Then I come up with another solution to solve collision like this:
>> 
>> Step take rack0
>> Step chooseleaf indep 2 type osd
>> Step emit
>> Step take rack1
>> ….
>> (manually take every rack)
>> 
>> This won’t cause rack collision, because I specify rack by name at first. 
>> But the problem is that osd in rack0 will always be primary osd because I 
>> choose from rack0 first.
>> 
>> So the question is what is the recommended way to meet such a need 
>> (distribute 11 replicas into 6 racks evenly in case of rack failure)?
>> 
>> 
>> Thanks!
>> LeiDong
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Blair Bethwaite
Hi Dan,

Thanks for sharing!

On 9 September 2014 20:12, Dan Van Der Ster  wrote:
> We do this for some small scale NAS use-cases, with ZFS running in a VM with 
> rbd volumes. The performance is not great (especially since we throttle the 
> IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL — 
> the SSD solves any performance problem we ever had with ZFS on RBD.

That's good to hear. My limited experience doing this on a smaller
Ceph cluster (and without any SSD journals or cache devices for the ZFS
head) points to write latency being an immediate issue; decent PCIe
SLC SSD devices should pretty much sort that out, given the cluster
itself has plenty of write throughput available. Then there are further
MLC devices for L2ARC - not sure yet, but I'm guessing metadata-heavy
datasets might require primarycache=metadata and rely on L2ARC for
data cache. And all this should get better in the medium term with
performance improvements and RDMA capability (we're building this with
that option in the hole).

> I would say though that this setup is rather adventurous. ZoL is not rock 
> solid — we’ve had a few lockups in testing, all of which have been fixed in 
> the latest ZFS code in git (my colleague in CC could elaborate if you’re 
> interested).

Hmm okay, that's not great. The only problem I've experienced thus far
is when the ZoL repos stopped providing DKMS and borked an upgrade for
me until I figured out what had happened and cleaned up the old .ko
files. So yes, interested to hear elaboration on that.

>One thing I’m not comfortable with is the idea of ZFS checking the data in 
>addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>without any redundancy at the ZFS layer there will be no way to correct that 
>error. Of course, the hope is that RADOS will ensure 100% data consistency, 
>but what happens if not?...

The ZFS checksumming would tell us if there has been any corruption,
which as you've pointed out shouldn't happen anyway on top of Ceph.
But if we did have some awful disaster scenario where that happened
then we'd be restoring from tape, and it'd sure be good to know which
files actually needed restoring. I.e., if we lost a single PG at the
Ceph level then we don't want to have to blindly restore the whole
zpool or dataset.

> Personally, I think you’re very brave to consider running 2PB of ZoL on RBD. 
> If I were you I would seriously evaluate the CephFS option. It used to be on 
> the roadmap for ICE 2.0 coming out this fall, though I noticed its not there 
> anymore (??!!!).

Yeah, it's very disappointing that this was silently removed. And it's
particularly concerning that this happened post RedHat acquisition.
I'm an ICE customer and sure would have liked some input there for
exactly the reason we're discussing.

> Anyway I would say that ZoL on kRBD is not necessarily a more stable solution 
> than CephFS. Even Gluster striped on top of RBD would probably be more stable 
> than ZoL on RBD.

If we really have to we'll just run Gluster natively instead (or
perhaps XFS on RBD as the option before that) - the hardware needn't
change for that except to configure RAIDs rather than JBODs on the
servers.

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-09 Thread Michal Kozanecki
Network issue maybe? Have you checked your firewall settings? Iptables changed 
a bit in EL7 and might have broken any rules you normally use; try flushing the 
rules (iptables -F) and see if that fixes things. If it does, then you'll need 
to fix your firewall rules. 

I ran into a similar issue on EL7 where the OSDs appeared up and in, but were 
stuck peering, which was due to a few ports being blocked.
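
If it does turn out to be the firewall, the ports you need open are the monitor
port and the OSD range, e.g. rules roughly like these (6800-7300 is the default
OSD port range):

iptables -I INPUT -p tcp --dport 6789 -j ACCEPT
iptables -I INPUT -p tcp --dport 6800:7300 -j ACCEPT
service iptables save   # if you have iptables-services installed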

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of BG
Sent: September-09-14 6:05 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

Loic Dachary  writes:

> 
> Hi,
> 
> It looks like your osd.0 is down and you only have one osd left 
> (osd.1) which would explain why the cluster cannot get to a healthy 
> state. The "size 2" in  "pool 0 'data' replicated size 2 ..." means 
> the pool needs at least two OSDs up to function properly. Do you know why the 
> osd.0 is not up ?
> 
> Cheers
> 

I've been trying unsuccessfully to get this up and running since. I've added 
another OSD but still can't get to "active + clean" state. I'm not even sure if 
the problems I'm having are related to the OS version but I'm running out of 
ideas and unless somebody here can spot something obvious in the logs below I'm 
going to try rolling back to CentOS 6.

$ echo "HEALTH" && ceph health && echo "STATUS" && ceph status && echo 
"OSD_DUMP" && ceph osd dump HEALTH HEALTH_WARN 129 pgs peering; 129 pgs stuck 
unclean STATUS
cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
 health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
 monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum
 0 hp09
 osdmap e43: 3 osds: 3 up, 3 in
  pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
15469 MB used, 368 GB / 383 GB avail
 129 peering
  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins 
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
[0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988
10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval
[7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval
[11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:
2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09@0(leader).osd e72 
prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605

Sample of warnings from osd.2 log:
2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply 
from osd.1 ever on either front or back, first ping sent 2014-09-09
10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly marked 
me down
2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features 
1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >>
10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1 
c=0x3ad8580).fault



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-09 Thread Marco Garcês
Actually in EL7, iptables does not come installed by default; they use
firewalld instead... just remove firewalld and install iptables, and you are back
in the game! Or learn firewalld, that will work too! :)
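
If you do stick with firewalld, something along these lines should do it
(6800-7300 being the default OSD port range):

firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --permanent --add-port=6800-7300/tcp
firewall-cmd --reload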


*Marco Garcês*
*#sysadmin*
Maputo - Mozambique
*[Phone]* +258 84 4105579
*[Skype]* marcogarces

On Tue, Sep 9, 2014 at 3:10 PM, Michal Kozanecki 
wrote:

> Network issue maybe? Have you checked your firewall settings? Iptables
> changed a bit in EL7 and might of broken any rules your normally try and
> use, try flushing the rules (iptables -F) and see if that fixes things, if
> you then you'll need to fix your firewall rules.
>
> I ran into a similar issue on EL7 where the OSD's appeared up and in, but
> were stuck in peering which was due to a few ports being blocked.
>
> Cheers
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> BG
> Sent: September-09-14 6:05 AM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's
>
> Loic Dachary  writes:
>
> >
> > Hi,
> >
> > It looks like your osd.0 is down and you only have one osd left
> > (osd.1) which would explain why the cluster cannot get to a healthy
> > state. The "size 2" in  "pool 0 'data' replicated size 2 ..." means
> > the pool needs at least two OSDs up to function properly. Do you know
> why the osd.0 is not up ?
> >
> > Cheers
> >
>
> I've been trying unsuccessfully to get this up and running since. I've
> added another OSD but still can't get to "active + clean" state. I'm not
> even sure if the problems I'm having are related to the OS version but I'm
> running out of ideas and unless somebody here can spot something obvious in
> the logs below I'm going to try rolling back to CentOS 6.
>
> $ echo "HEALTH" && ceph health && echo "STATUS" && ceph status && echo
> "OSD_DUMP" && ceph osd dump HEALTH HEALTH_WARN 129 pgs peering; 129 pgs
> stuck unclean STATUS
> cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
>  health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
>  monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2,
> quorum
>  0 hp09
>  osdmap e43: 3 osds: 3 up, 3 in
>   pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
> 15469 MB used, 368 GB / 383 GB avail
>  129 peering
>   63 active+clean
> OSD_DUMP
> epoch 43
> fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
> created 2014-09-09 10:42:35.490711
> modified 2014-09-09 10:47:25.077178
> flags
> pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool
> crash_replay_interval 45 stripe_width 0 pool 1 'metadata' replicated size 3
> min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64
> last_change 1 flags hashpspool stripe_width 0 pool 2 'rbd' replicated size
> 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64
> last_change 1 flags hashpspool stripe_width 0 max_osd 3
> osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
> [0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988
> 10.119.16.14:6802/24988
> 10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
> osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36
> last_clean_interval
> [7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
> 10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
> 8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
> osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40
> last_clean_interval
> [11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
> 10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
> 5d398bba-59f5-41f8-9bd6-aed6a0204656
>
> Sample of warnings from monitor log:
> 2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09@0(leader).osd e72
> prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
> 10.119.16.16:6800/25605 is reporting failure:1
> 2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
> 10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605
>
> Sample of warnings from osd.2 log:
> 2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no
> reply from osd.1 ever on either front or back, first ping sent 2014-09-09
> 10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
> 2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly
> marked me down
> 2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features
> 1107558400, adjusting msgr requires for mons
> 2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >>
> 10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1
> c=0x3ad8580).fault
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Problem with customized crush rule for EC pool

2014-09-09 Thread Loic Dachary


On 09/09/2014 14:21, Lei Dong wrote:
> Thanks loic!
> 
> Actually I've found that increase choose_local_fallback_tries can 
> help(chooseleaf_tries helps not so significantly), but I'm afraid when osd 
> failure happen and need to find new acting set, it may be fail to find enough 
> racks again. So I'm trying to find a more guaranteed way in case of osd 
> failure.
> 
> My profile is nothing special other than k=8 m=3. 

So your goal is to make it so losing 3 OSDs simultaneously does not mean 
losing data. By forcing each rack to hold at most 2 OSDs for a given object, 
you make it so losing a full rack does not mean losing data. Are these racks 
in the same room in the datacenter ? In the event of a catastrophic failure 
that permanently destroys one rack, how realistic is it that the other racks are 
unharmed ? If the rack is destroyed by fire and is in a row with the other 
racks, there is a very high chance that the other racks will also be damaged. 
Note that I am not a system architect nor a system administrator : I may be 
completely wrong ;-) If it turns out that the probability of a single rack 
failing entirely and independently of the others is negligible, it may not be 
necessary to make a complex ruleset; instead you could use the default ruleset.

My 2cts
 
> 
> Thanks again!
> 
> Leidong
> 
> 
> 
> 
> 
>> On 9 September 2014, at 7:53 PM, "Loic Dachary"  wrote:
>>
>> Hi,
>>
>> It is indeed possible that mapping fails if there are just enough racks to 
>> match the constraint. And the probability of a bad mapping increases when 
>> the number of PG increases because there is a need for more mapping. You can 
>> tell crush to try harder with 
>>
>> step set_chooseleaf_tries 10
>>
>> Be careful though : increasing this number will change mapping. It will not 
>> just fix the bad mappings you're seeing, it will also change the mappings 
>> that succeeded with a lower value. Once you've set this parameter, it cannot 
>> be modified.
>>
>> Would you mind sharing the erasure code profile you plan to work with ?
>>
>> Cheers
>>
>>> On 09/09/2014 12:39, Lei Dong wrote:
>>> Hi ceph users:
>>>
>>> I want to create a customized crush rule for my EC pool (with replica_size 
>>> = 11) to distribute replicas into 6 different Racks. 
>>>
>>> I use the following rule at first:
>>>
>>> Step take default  // root
>>> Step choose firstn 6 type rack// 6 racks, I have and only have 6 racks
>>> Step chooseleaf indep 2 type osd // 2 osds per rack 
>>> Step emit
>>>
>>> I looks fine and works fine when PG num is small. 
>>> But when pg num increase, there are always some PGs which can not take all 
>>> the 6 racks. 
>>> It looks like “Step choose firstn 6 type rack” sometimes returns only 5 
>>> racks.
>>> After some investigation,  I think it may caused by collision of choices.
>>>
>>> Then I come up with another solution to solve collision like this:
>>>
>>> Step take rack0
>>> Step chooseleaf indep 2 type osd
>>> Step emit
>>> Step take rack1
>>> ….
>>> (manually take every rack)
>>>
>>> This won’t cause rack collision, because I specify rack by name at first. 
>>> But the problem is that osd in rack0 will always be primary osd because I 
>>> choose from rack0 first.
>>>
>>> So the question is what is the recommended way to meet such a need 
>>> (distribute 11 replicas into 6 racks evenly in case of rack failure)?
>>>
>>>
>>> Thanks!
>>> LeiDong
>>>
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about librbd io

2014-09-09 Thread yuelongguang
hi, josh.durgin:
 
I want to know how librbd launches IO requests.
 
Use case:
Inside the VM, I use fio to test the rbd disk's IO performance.
fio's parameters are bs=4k, direct IO, and qemu cache=none.
In this case, if librbd just sends what it gets from the VM (I mean no 
gather/scatter), is the ratio of IO inside the VM : IO at librbd : IO at the OSD 
filestore equal to 1:1:1?
 
 
 
thanks
 
fio
[global]
ioengine=libaio
buffered=0
rw=randrw
#size=3g
#directory=/data1
filename=/dev/vdb

[file0]
iodepth=1
bs=4k
time_based
runtime=300
stonewall
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Michal Kozanecki
Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite  wrote:
> Hi Dan,
>
> Thanks for sharing!
>
> On 9 September 2014 20:12, Dan Van Der Ster  wrote:
>> We do this for some small scale NAS use-cases, with ZFS running in a VM with 
>> rbd volumes. The performance is not great (especially since we throttle the 
>> IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL — 
>> the SSD solves any performance problem we ever had with ZFS on RBD.
>
> That's good to hear. My limited experience doing this on a smaller Ceph 
> cluster (and without any SSD journals or cache devices for ZFS
> head) points to write latency being an immediate issue, decent PCIe SLC SSD 
> devices should pretty much sort that out given the cluster itself has plenty 
> of write throughput available. Then there's further MLC devices for L2ARC - 
> not sure yet but guessing metadata heavy datasets might require 
> primarycache=metadata and rely of L2ARC for data cache. And all this should 
> get better in the medium term with performance improvements and RDMA 
> capability (we're building this with that option in the hole).
>

I'd love to go back and forth with you privately or on one of the ZFS 
mailing lists if you want to discuss ZFS tuning in depth, but I just want to 
mention that setting primarycache=metadata will also cause the L2ARC to ONLY 
store and accelerate metadata as well (despite whatever secondarycache is set 
to). I believe this is something that the ZFS developers are looking to improve 
eventually, but as-is that is currently how it works (the L2ARC only contains 
what was pushed out of the main in-memory ARC). 
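
In other words, even with settings like the following (pool/dataset name is
illustrative), the L2ARC will effectively end up holding metadata only:

zfs set primarycache=metadata tank/nfs
zfs set secondarycache=all tank/nfs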

>> I would say though that this setup is rather adventurous. ZoL is not rock 
>> solid — we’ve had a few lockups in testing, all of which have been fixed in 
>> the latest ZFS code in git (my colleague in CC could elaborate if you’re 
>> interested).
>
> Hmm okay, that's not great. The only problem I've experienced thus far is 
> when the ZoL repos stopped providing DKMS and borked an upgrade for me until 
> I figured out what had happened and cleaned up the old .ko files. So yes, 
> interested to hear elaboration on that.
>

You mentioned in one of your other emails that if you deployed this idea of a 
ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than 
krbd. If you're worried about ZoL stability and feel comfortable going outside 
Linux, you could always go with a *BSD or Illumos distro where ZFS support is 
much more stable/solid. 
In any case I haven't had any major show stopping issues with ZoL myself and I 
use it heavily. Still, unless you're really comfortable with ZoL or 
*BSD/Illumos (as I am), I'd likely recommend looking into other solutions.

>> One thing I’m not comfortable with is the idea of ZFS checking the data in 
>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>> without any redundancy at the ZFS layer there will be no way to correct that 
>> error. Of course, the hope is that RADOS will ensure 100% data consistency, 
>> but what happens if not?...
> 
> The ZFS checksumming would tell us if there has been any corruption, which as 
> you've pointed out shouldn't happen anyway on top of Ceph.

Just want to quickly address this - someone correct me if I'm wrong, but IIRC 
even with a replica value of 3 or more, ceph does not (currently) have any 
intelligence when it detects a corrupted/"incorrect" PG: it will always 
replace/repair the PG with whatever data is in the primary, meaning that if the 
primary PG is the one that’s corrupted/bit-rotted/"incorrect", it will replace 
the good replicas with the bad.  

> But if we did have some awful disaster scenario where that happened then we'd 
> be restoring from tape, and it'd sure be good to know which files actually 
> needed restoring. I.e., if we lost a single PG at the Ceph level then we 
> don't want to have to blindly restore the whole zpool or dataset.
>
>> Personally, I think you’re very brave to consider running 2PB of ZoL on RBD. 
>> If I were you I would seriously evaluate the CephFS option. It used to be on 
>> the roadmap for ICE 2.0 coming out this fall, though I noticed its not there 
>> anymore (??!!!).
>
> Yeah, it's very disappointing that this was silently removed. And it's 
> particularly concerning that this happened post RedHat acquisition.
> I'm an ICE customer and sure would have liked some input there for exactly 
> the reason we're discussing.
>

I'm looking forward to CephFS as well, and I agree, it's somewhat concerning 
that it happened post RedHat acquisition. I'm hoping RedHat pours more 
resources into Inktank and ceph, and doesn't instead leach resources away from them.

>> Anyway I would say that ZoL on kRBD is not necessarily a more stable 
>> solution than CephFS. Even Gluster striped on top of RBD would probably be 
>> more stable than ZoL on RBD.
>
> If we really have to we'll just run Gluster natively instead (or perhaps XFS 
> on RBD as the optio

Re: [ceph-users] NAS on RBD

2014-09-09 Thread Dan Van Der Ster

> On 09 Sep 2014, at 16:39, Michal Kozanecki  wrote:
> On 9 September 2014 08:47, Blair Bethwaite  wrote:
>> On 9 September 2014 20:12, Dan Van Der Ster  
>> wrote:
>>> One thing I’m not comfortable with is the idea of ZFS checking the data in 
>>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>>> without any redundancy at the ZFS layer there will be no way to correct 
>>> that error. Of course, the hope is that RADOS will ensure 100% data 
>>> consistency, but what happens if not?...
>> 
>> The ZFS checksumming would tell us if there has been any corruption, which 
>> as you've pointed out shouldn't happen anyway on top of Ceph.
> 
> Just want to quickly address this, someone correct me if I'm wrong, but IIRC 
> even with replica value of 3 or more, ceph does not(currently) have any 
> intelligence when it detects a corrupted/"incorrect" PG, it will always 
> replace/repair the PG with whatever data is in the primary, meaning that if 
> the primary PG is the one that’s corrupted/bit-rotted/"incorrect", it will 
> replace the good replicas with the bad.  

According to the "scrub error on firefly" thread, repair "tends to choose 
the copy with the lowest osd number which is not obviously corrupted.  Even 
with three replicas, it does not do any kind of voting at this time."
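
(As an aside, a rough manual workflow for checking the copies yourself before 
letting repair overwrite anything; the pg id, object name and FileStore paths 
below are placeholders rather than anything from this thread:

    ceph health detail        # lists e.g. "pg 3.45 is active+clean+inconsistent"
    ceph pg map 3.45          # shows which OSDs hold the copies
    # on each of those OSD hosts, checksum the on-disk copy of the suspect object
    find /var/lib/ceph/osd/ceph-*/current/3.45_head -name '*myobject*' -exec md5sum {} \;
    ceph pg repair 3.45       # only once you trust the primary's copy
)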

Cheers, Dan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-community] ceph replication and striping

2014-09-09 Thread m.channappa.negalur
Hello Aaron

Thanks for your answers!!

If my understanding is correct, then by default Ceph supports data 
replication and striping, and striping doesn't require any separate 
configuration.

Please correct me if I am wrong.





From: Aaron Ten Clay [mailto:aaro...@aarontc.com]
Sent: Wednesday, August 27, 2014 1:55 AM
To: Channappa Negalur, M.
Cc: ceph-commun...@lists.ceph.com; ceph-users@lists.ceph.com
Subject: Re: [Ceph-community] ceph replication and striping

On Tue, Aug 26, 2014 at 5:07 AM, m.channappa.nega...@accenture.com 
wrote:
Hello all,

I have configured a ceph storage cluster.

1. I created the volume .I would like to know that  replication of data will 
happen automatically in ceph ?
2. how to configure striped volume using ceph ?


Regards,
Malleshi CN


If I understand your position and questions correctly... the replication level 
is configured per-pool, so whatever your "size" parameter is set to for the 
pool you created the volume in will dictate how many copies are stored. 
(Default is 3, IIRC.)
RADOS block device volumes are always striped across 4 MiB objects. I don't 
believe that is configurable (at least not yet.)
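
A quick CLI illustration of both points (the pool and image names are just 
examples):

    ceph osd pool get rbd size              # show the replica count for the pool
    ceph osd pool set rbd size 3            # keep three copies of every object
    rbd create rbd/myvolume --size 10240    # 10 GiB image, striped over 4 MiB objects by default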

FYI, this list is intended for discussion of Ceph community concerns. These 
kinds of questions are better handled on the ceph-users list, and I've 
forwarded your message accordingly.
-Aaron



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] one stuck pg

2014-09-09 Thread Erwin Lubbers
Hi,

My cluster is giving one stuck pg which seems to be backfilling for days now. 
Any suggestions on how to solve it?

HEALTH_WARN 1 pgs backfilling; 1 pgs stuck unclean; recovery 32/5989217 
degraded (0.001%)
pg 206.3f is stuck unclean for 294420.424122, current state 
active+remapped+backfilling, last acting [24,28,3,44]
pg 206.3f is active+remapped+backfilling, acting [24,28,3,44]
recovery 32/5989217 degraded (0.001%)

Regards,
Erwin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph data consistency

2014-09-09 Thread 池信泽
hi, everyone:

  when I read filestore.cc, I find that Ceph uses a CRC to check the data.
Why does it need to check the data?

  To my knowledge, the disk has an error-correcting code (ECC) for each sector.
Looking at the wiki (http://en.wikipedia.org/wiki/Disk_sector): "In disk drives,
each physical sector is made up of three basic parts, the sector header, the data
area and the error-correcting code (ECC)". So if the data is not correct, the
disk can recover it or return an I/O error.

  Can anyone explain it?

 Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] number of PGs

2014-09-09 Thread Luis Periquito
I was reading on the number of PGs we should have for a cluster, and I
found the formula to place 100 PGs in each OSD (
http://ceph.com/docs/master/rados/operations/placement-groups/).

Now this formula has generated some discussion as to how many PGs we should
have in each pool.

Currently our main cluster is being used for S3, cephFS and RBD data type
usage. So we have 3 very big pools (data, .rgw.buckets and rbd) and 9 small
pools (all the remaining ones).

As we have a total of 60 OSDs we've been discussing how many PGs we should
really have. We are using a replication of 4.

Should we have a total around 1500 PGs distributed over all the pools
(total PGs) or should we have the big pools each with 1500 PGs for a total
around 5000 PGs on the cluster?
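
For concreteness, plugging our numbers into the formula from that page:

    (100 x 60 OSDs) / 4 replicas = 1500 PGs, rounded up to the next power of two = 2048

which is where the ~1500 figure above comes from.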

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ceph-calamari] RFC: A preliminary Chinese version of Calamari

2014-09-09 Thread Gregory Meno
Li Wang,

Thank you for doing this!

Would you please change this so that calamari will display the correct
locale?

As it exists your branch will not display an all english version.

To merge upstream I would expect this to work so that both en and zh_CN
versions could work.

I made a branch that moves strings.l20n to the correct locale
https://github.com/ceph/calamari-clients/tree/zh_CN_version

Please let me know if there is anything I can do to help.

best regards,
Gregory


On Fri, Aug 29, 2014 at 5:45 AM, Li Wang  wrote:

> Hi,
>   We have set up a preliminary Chinese version of Calamari at
> https://github.com/xiaoxianxia/calamari-clients-cn, major jobs done
> are translating the English words on the web interface into Chinese,
> we did not change the localization infrastructure, any help
> in this direction are appreciated,  also any suggestions, tests
> and technical involvement are welcome, to make it ready to
> be merged to the upstream.
>
> Cheers,
> Li Wang
> ___
> ceph-calamari mailing list
> ceph-calam...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] max_bucket limit -- safe to disable?

2014-09-09 Thread Daniel Schneller
Hi list!

Under 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html 
I found a situation not unlike ours, but unfortunately either 
the list archive fails me or the discussion ended without a 
conclusion, so I dare to ask again :)

We currently have a setup of 4 servers with 12 OSDs each, 
combined journal and data. No SSDs.

We develop a document management application that accepts user
uploads of all kinds of documents and processes them in several
ways. For any given document, we might create anywhere from 10s
to several hundred dependent artifacts.

We are now preparing to move from Gluster to a Ceph based
backend. The application uses the Apache JClouds Library to 
talk to the Rados Gateways that are running on all 4 of these 
machines, load balanced by haproxy. 

We currently intend to create one container for each document 
and put all the dependent and derived artifacts as objects into
that container. 
This gives us a nice compartmentalization per document, also 
making it easy to remove a document and everything that is
connected with it.

During the first test runs we ran into the default limit of
1000 containers per user. In the thread mentioned above that 
limit was removed (setting the max_buckets value to 0). We did 
that and now can upload more than 1000 documents.
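
(For reference, the knob in question; the uid below is a placeholder:

    radosgw-admin user modify --uid=<uid> --max-buckets=0

which lifts the per-user bucket limit entirely.)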

I just would like to understand

a) if this design is recommended, or if there are reasons to go
   about the whole issue in a different way, potentially giving
   up the benefit of having all document artifacts under one
   convenient handle.

b) is there any absolute limit for max_buckets that we will run
   into? Remember we are talking about 10s of millions of 
   containers over time.

c) are any performance issues to be expected with this design
   and can we tune any parameters to alleviate this?

Any feedback would be very much appreciated.

Regards,
Daniel

-- 
Daniel Schneller
Mobile Development Lead
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Remaped osd at remote restart

2014-09-09 Thread Eduard Kormann

Hello,

have I missed something, or is this a feature? When I restart an OSD on the 
server it belongs to, it restarts normally:


root@cephosd10:~# service ceph restart osd.76
=== osd.76 ===
=== osd.76 ===
Stopping Ceph osd.76 on cephosd10...kill 799176...done
=== osd.76 ===
create-or-move updating item name 'osd.76' weight 3.64 at location 
{host=cephosd10,root=default} to crush map

Starting Ceph osd.76 on cephosd10...
starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 
/var/lib/ceph/osd/ceph-76/journal


But if I try to restart the OSD from the admin server...:

root@cephadmin:/etc/ceph# service ceph -a restart osd.76
=== osd.76 ===
=== osd.76 ===
Stopping Ceph osd.76 on cephosd10...kill 800262...kill 800262...done
=== osd.76 ===
df: `/var/lib/ceph/osd/ceph-76/.': No such file or directory
df: no file systems processed
create-or-move updating item name 'osd.76' weight 1 at location 
{host=cephadmin,root=default} to crush map

Starting Ceph osd.76 on cephosd10...
starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76 
/var/lib/ceph/osd/ceph-76/journal


...it will be associated with the admin server in the crush map:
-17 0   host cephadmin
76  0   osd.76  up  0

Before, each OSD could be started from an arbitrary server with the "-a" 
option. Apparently that no longer works.

How do I start any OSD from any server without error messages?

I'm running Firefly 0.80.5.

BR
Eduard Kormann
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph data consistency

2014-09-09 Thread Christian Balzer
On Thu, 4 Sep 2014 16:31:12 +0800 池信泽 wrote:

> hi, everyone:
> 
>   when I read the filestore.cc, I find the ceph use crc the check the
> data. Why should check the data?
> 
It should do even more, it should also do checksums for all replicas:

>   In my knowledge,  the disk has error-correcting code
>  (ECC) for each
> sector. Looking at wiki: http://en.wikipedia.org/wiki/Disk_sector, "In
> disk drives, each physical sector is made up of three basic parts, the
> sectorheader , the data
> area and the error-correcting
> code  (ECC)".  So if
> the data is not correct. the disk can recovery it or  return i/o error.
> 
>   Does anyone can explain it?
> 
http://en.wikipedia.org/wiki/Data_corruption#Silent_data_corruption


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph data consistency

2014-09-09 Thread 池信泽
hi, guys:
  
  when I read filestore.cc, I find that Ceph uses a CRC to check the data. Why 
does it need to check the data?


  To my knowledge, the disk has an error-correcting code (ECC) for each sector. 
Looking at the wiki (http://en.wikipedia.org/wiki/Disk_sector): "In disk drives, 
each physical sector is made up of three basic parts, the sector header, the 
data area and the error-correcting code (ECC)". So if the data is not correct, 
the disk can recover it or return an I/O error.

  Can anyone explain it?


 Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Filesystem - Production?

2014-09-09 Thread James Devine
The issue isn't so much mounting the ceph client as it is the mounted ceph
client becoming unusable requiring a remount.  So far so good though.


On Fri, Sep 5, 2014 at 5:53 PM, JIten Shah  wrote:

> We ran into the same issue where we could not mount the filesystem on the
> clients because it had 3.9. Once we upgraded the kernel on the client node,
> we were able to mount it fine. FWIW, you need kernel 3.14 and above.
>
> --jiten
>
> On Sep 5, 2014, at 6:55 AM, James Devine  wrote:
>
> No messages in dmesg, I've updated the two clients to 3.16, we'll see if
> that fixes this issue.
>
>
> On Fri, Sep 5, 2014 at 12:28 AM, Yan, Zheng  wrote:
>
>> On Fri, Sep 5, 2014 at 8:42 AM, James Devine  wrote:
>> > I'm using 3.13.0-35-generic on Ubuntu 14.04.1
>> >
>>
>> Was there any kernel message when the hang happened?  We have fixed a
>> few bugs since 3.13 kernel, please use 3.16 kernel if possible.
>>
>> Yan, Zheng
>>
>> >
>> > On Thu, Sep 4, 2014 at 6:08 PM, Yan, Zheng  wrote:
>> >>
>> >> On Fri, Sep 5, 2014 at 3:24 AM, James Devine 
>> wrote:
>> >> > It took a week to happen again, I had hopes that it was fixed but
>> alas
>> >> > it is
>> >> > not.  Looking at top logs on the active mds server, the load average
>> was
>> >> > 0.00 the whole time and memory usage never changed much, it is using
>> >> > close
>> >> > to 100% and some swap but since I changed memory.swappiness swap
>> usage
>> >> > hasn't gone up but has been slowly coming back down.  Same symptoms,
>> the
>> >> > mount on the client is unresponsive and a cat on
>> >> > /sys/kernel/debug/ceph/*/mdsc had a whole list of entries.  A umount
>> and
>> >> > remount seems to fix it.
>> >> >
>> >>
>> >> which version of kernel do you use ?
>> >>
>> >> Yan, Zheng
>> >>
>> >> >
>> >> > On Fri, Aug 29, 2014 at 11:26 AM, James Devine 
>> >> > wrote:
>> >> >>
>> >> >> I am running active/standby and it didn't swap over to the
>> standby.  If
>> >> >> I
>> >> >> shutdown the active server it swaps to the standby fine though.
>> When
>> >> >> there
>> >> >> were issues, disk access would back up on the webstats servers and a
>> >> >> cat of
>> >> >> /sys/kernel/debug/ceph/*/mdsc would have a list of entries whereas
>> >> >> normally
>> >> >> it would only list one or two if any.  I have 4 cores and 2GB of
>> ram on
>> >> >> the
>> >> >> mds machines.  Watching it right now it is using most of the ram and
>> >> >> some of
>> >> >> swap although most of the active ram is disk cache.  I lowered the
>> >> >> memory.swappiness value to see if that helps.  I'm also logging top
>> >> >> output
>> >> >> if it happens again.
>> >> >>
>> >> >>
>> >> >> On Thu, Aug 28, 2014 at 8:22 PM, Yan, Zheng 
>> wrote:
>> >> >>>
>> >> >>> On Fri, Aug 29, 2014 at 8:36 AM, James Devine 
>> >> >>> wrote:
>> >> >>> >
>> >> >>> > On Thu, Aug 28, 2014 at 1:30 PM, Gregory Farnum <
>> g...@inktank.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> On Thu, Aug 28, 2014 at 10:36 AM, Brian C. Huffman
>> >> >>> >>  wrote:
>> >> >>> >> > Is Ceph Filesystem ready for production servers?
>> >> >>> >> >
>> >> >>> >> > The documentation says it's not, but I don't see that
>> mentioned
>> >> >>> >> > anywhere
>> >> >>> >> > else.
>> >> >>> >> > http://ceph.com/docs/master/cephfs/
>> >> >>> >>
>> >> >>> >> Everybody has their own standards, but Red Hat isn't supporting
>> it
>> >> >>> >> for
>> >> >>> >> general production use at this time. If you're brave you could
>> test
>> >> >>> >> it
>> >> >>> >> under your workload for a while and see how it comes out; the
>> known
>> >> >>> >> issues are very much workload-dependent (or just general
>> concerns
>> >> >>> >> over
>> >> >>> >> polish).
>> >> >>> >> -Greg
>> >> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> >> >>> >> ___
>> >> >>> >> ceph-users mailing list
>> >> >>> >> ceph-users@lists.ceph.com
>> >> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > I've been testing it with our webstats since it gets live hits
>> but
>> >> >>> > isn't
>> >> >>> > customer affecting.  Seems the MDS server has problems every few
>> >> >>> > days
>> >> >>> > requiring me to umount and remount the ceph disk to resolve.  Not
>> >> >>> > sure
>> >> >>> > if
>> >> >>> > the issue is resolved in development versions but as of 0.80.5 we
>> >> >>> > seem
>> >> >>> > to be
>> >> >>> > hitting it.  I set the log verbosity to 20 so there's tons of
>> logs
>> >> >>> > but
>> >> >>> > ends
>> >> >>> > with
>> >> >>>
>> >> >>> The cephfs client is supposed to be able to handle MDS takeover.
>> >> >>> what's symptom makes you umount and remount the cephfs ?
>> >> >>>
>> >> >>> >
>> >> >>> > 2014-08-24 07:10:19.682015 7f2b575e7700 10 mds.0.14  laggy,
>> >> >>> > deferring
>> >> >>> > client_request(client.92141:6795587 getattr pAsLsXsFs
>> #1026dc1)
>> >> >>> > 2014-08-24 07:10:19.682021 7f2b575e7700  5 mds.0.14 is_laggy
>> >> >>> > 19.324963
>> >> 

[ceph-users] CephFS roadmap (was Re: NAS on RBD)

2014-09-09 Thread Sage Weil
On Tue, 9 Sep 2014, Blair Bethwaite wrote:
> > Personally, I think you're very brave to consider running 2PB of ZoL 
> > on RBD. If I were you I would seriously evaluate the CephFS option. It 
> > used to be on the roadmap for ICE 2.0 coming out this fall, though I 
> > noticed its not there anymore (??!!!).
> 
> Yeah, it's very disappointing that this was silently removed. And it's
> particularly concerning that this happened post RedHat acquisition.
> I'm an ICE customer and sure would have liked some input there for
> exactly the reason we're discussing.

A couple quick comments:

1) We have more developers actively working on CephFS today than we have 
ever had before.  It is a huge priority for me and the engineering team to 
get it into a state where it is ready for general purpose production 
workloads.

2) As a scrappy startup like Inktank we were very fast and loose about 
what went into the product roadmap and what claims we made.  Red Hat is 
much more cautious about forward looking statements in their enterprise 
products.  Do not read too much into the presence or non-presence of 
CephFS in the ICE roadmap.  Also note that Red Hat Storage today is 
shipping a fully production-ready and stable distributed file system 
(GlusterFS).

3) We've recently moved to CephFS in the sepia QA lab for archiving all of 
our test results.  This dogfooding exercise has helped us identify several 
general usability and rough edges that have resulted in changes for giant.  
We identified and fixed two kernel client bugs that went into 3.16 or 
thereabouts.  The biggest problem we had we finally tracked down and 
turned out to be an old bug due to an old kernel client that we forgot was 
mounting the cluster.  Overall, I'm pretty pleased.  CephFS in Giant is 
going to be pretty good.  We are still lacking fsck, so be careful, and 
there are several performance issues we need to address, but I encourage 
anyone who is interested to give Giant CepHFS a go in any environment you 
have were you can tolerate the risk.  We are *very* keen to get feedback 
on performance, stability, robustness, and usability.

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-09 Thread Craig Lewis
On Sat, Sep 6, 2014 at 7:50 AM, Dan van der Ster 
wrote:

>
> BTW, do you happen to know, _if_ we re-use an OSD after the journal has
> failed, are any object inconsistencies going to be found by a
> scrub/deep-scrub?
>

I haven't tested this, but I did something I *think* is similar.  I deleted
an OSD, removed it from the crushmap, marked it lost, then added it back
without reformatting.  It got the same OSD ID.  I think I spent about 10
minutes doing it.  I don't remember exactly why... I think I was trying to
force_pg_create or something.

If I recall correctly, the backfill was much faster than I expected.  It
should have taken >24 hours.  IIRC, it completed in about 2 hours.  It
wasn't as fast as marking the OSD out and in, but much faster than a
freshly formatted OSD.

It's possible that this only worked because the PGs hadn't completed
backfilling.  Despite my marking the OSD lost, the OSD was still listed in
the pg query, in the osds to probe section.


I want to experiment with losing an SSD.  I'm trying to think of a way to
run the test using VMs, but I haven't come up with anything yet.  All of my
test clusters are virtual, and I'm not ready to test this on a production
cluster yet.

I *think* losing an SSD will be similar to the above, possibly followed by
some inconsistencies found during scrub and deep-scrub.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph data consistency

2014-09-09 Thread Sage Weil
On Thu, 4 Sep 2014, 池信泽 wrote:
> 
> hi, guys:
>  
>   when I read the filestore.cc, I find the ceph use crc the check the data.
> Why should check the data?
> 
>   In my knowledge,  the disk has error-correcting code (ECC) for each
> sector. Looking at wiki: http://en.wikipedia.org/wiki/Disk_sector, "In disk
> drives, each physical sector is made up of three basic parts, the sector
> header, the data area and the error-correcting code (ECC)".  So if the data
> is not correct. the disk can recovery it or  return i/o error.
>  
>   Does anyone can explain it?

These checksums are not terribly reliable, and I think not even present on 
all disks.

Note that the CRC checks in FileStore today are opportunistic and there 
primarily for our QA environment; we don't recommend enabling them in 
production environments right now because we're not sure what the 
performance implications are.
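
(If anyone wants to experiment anyway, the relevant knobs are, to the best of 
my knowledge, the FileStore "sloppy crc" options, e.g. in ceph.conf:

    [osd]
        filestore sloppy crc = true
        filestore sloppy crc block size = 65536

but again, treat them as a QA aid rather than something to enable in 
production.)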

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-09 Thread Craig Lewis
On Sat, Sep 6, 2014 at 9:27 AM, Christian Balzer  wrote:

> On Sat, 06 Sep 2014 16:06:56 + Scott Laird wrote:
>
> > Backing up slightly, have you considered RAID 5 over your SSDs?
> >  Practically speaking, there's no performance downside to RAID 5 when
> > your devices aren't IOPS-bound.
> >
>
> Well...
> For starters with RAID5 you would loose 25% throughput in both Dan's and
> my case (4 SSDs) compared to JBOD SSD journals.
> In Dan's case that might not matter due to other bottlenecks, in my case
> it certainly would.
>

It's a trade off between lower performance all the time, or much lower
performance while you're backfilling those OSDs.  To me, this seems like a
somewhat reasonable idea for a small cluster, where losing one SSD could
lose >5% of the OSDs.  It doesn't seem worth the effort for a large
cluster, where losing one SSD would lose < 1% of the OSDs.


>
> And while you're quite correct when it comes to IOPS, doing RAID5 will
> either consume significant CPU resource in a software RAID case or require
> a decent HW RAID controller.
>
> Christian


 I haven't worried about CPU with software RAID5 in a very long time...
maybe Pentium 4 days?  It's so rare to actually have 0% Idle CPU, even
under high loads.

Most of my RAID5 is ZFS, but the CPU hasn't been the limiting factor on my
database or NFS servers.  I'm even doing software crypto, without CPU
support, with only a 10% performance penalty.  If the CPU has AES support,
crypto is free.  Obviously, RAID0 (or fully parallel JBOD) will be faster
than RAID5, but RAID5 is faster than RAID10 for all but the most heavily
read biased workloads.  Surprised the hell out of me.  I'll be converting
all of my database servers from RAID10 to RAIDZ.  Of course, benchmarks
that match your workload trump some random yahoo on the internet.  :-)


Ceph OSD nodes are a bit different though.  They're one of the few beasts
I've dealt with that are CPU, Disk, and network bound all at the same time.
 If you have some idle CPU during a big backfill, then I'd consider
Software RAID5 a possibility.  If you ever sustain 0% idle, then I wouldn't
try it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + Postfix/Zimbra

2014-09-09 Thread Patrick McGarry
Hey Oscar,

Sorry for the delay on this, it looks like my reply got stuck in the outbox.

I am moving this over to Ceph-User for discussion as the community
will probably have more experience and opinions to offer than just the
couple of us community guys.  Let me know if you don't get what you
need though.  Thanks!


Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


On Tue, Aug 26, 2014 at 5:30 PM, Oscar Mas  wrote:
> Hi Community
>
> Excuseme my poor English
>
> in my company, we are interested in implementing Ceph, on various platforms:
>
> 1.- Migrate 30,000 email accounts ( Postfix ) of a filesystem OCFS with DRBD
> to Ceph system.
> 2.- Email systems ( Zimbra ) virtualized with VMWare. The storage now are:
> MD300 and DataCore.
>
> the second point, we have many doubts. I can:
>
> 2a.- RBD install in zimbra servers and so would connect to Ceph
> 2b.- Connect Ceph with iSCSI to VMWare
>
>
> which may be the best solution: 2a or 2b  ?
>
> If everything works... I like the t-shirt ;-)
>
> Thanks a lot
> --
>
> LinkedIn: http://es.linkedin.com/in/oscarmash
> Blog: http://www.jorgedelacruz.es/author/oscar.mas/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Remaped osd at remote restart

2014-09-09 Thread Gregory Farnum
On Mon, Sep 8, 2014 at 6:33 AM, Eduard Kormann  wrote:
> Hello,
>
> have I missed something or is it a feature: When I restart a osd on the
> belonging server so it restarts normally:
>
> root@cephosd10:~# service ceph restart osd.76
> === osd.76 ===
> === osd.76 ===
> Stopping Ceph osd.76 on cephosd10...kill 799176...done
> === osd.76 ===
> create-or-move updating item name 'osd.76' weight 3.64 at location
> {host=cephosd10,root=default} to crush map
> Starting Ceph osd.76 on cephosd10...
> starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76
> /var/lib/ceph/osd/ceph-76/journal
>
> But if I trie to restart osd on the admin server...:
>
> root@cephadmin:/etc/ceph# service ceph -a restart osd.76
> === osd.76 ===
> === osd.76 ===
> Stopping Ceph osd.76 on cephosd10...kill 800262...kill 800262...done
> === osd.76 ===
> df: `/var/lib/ceph/osd/ceph-76/.': No such file or directory
> df: no file systems processed
> create-or-move updating item name 'osd.76' weight 1 at location
> {host=cephadmin,root=default} to crush map
> Starting Ceph osd.76 on cephosd10...
> starting osd.76 at :/0 osd_data /var/lib/ceph/osd/ceph-76
> /var/lib/ceph/osd/ceph-76/journal
>
> ...it will associated with the admin server in the crush map:
> -17 0   host cephadmin
> 76  0   osd.76  up  0
>
> Before that each osd could be started from arbitrary server with option
> "-a". Apparently it no longer works.
> How do I run any osd from any server without error messages?

...huh. I didn't realize the "-a" option still existed. You should be
able to prevent this from happening by adding "osd crush update on
start = false" to the global section of your ceph.conf on any nodes
which you are going to use to restart OSDs from other nodes with.
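Concretely, that is a one-line addition to ceph.conf on those nodes, e.g.:

    [global]
        osd crush update on start = false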

I created a ticket to address this issue: http://tracker.ceph.com/issues/9407
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] max_bucket limit -- safe to disable?

2014-09-09 Thread Gregory Farnum
On Tue, Sep 9, 2014 at 9:11 AM, Daniel Schneller
 wrote:
> Hi list!
>
> Under 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html
> I found a situation not unlike ours, but unfortunately either
> the list archive fails me or the discussion ended without a
> conclusion, so I dare to ask again :)
>
> We currently have a setup of 4 servers with 12 OSDs each,
> combined journal and data. No SSDs.
>
> We develop a document management application that accepts user
> uploads of all kinds of documents and processes them in several
> ways. For any given document, we might create anywhere from 10s
> to several hundred dependent artifacts.
>
> We are now preparing to move from Gluster to a Ceph based
> backend. The application uses the Apache JClouds Library to
> talk to the Rados Gateways that are running on all 4 of these
> machines, load balanced by haproxy.
>
> We currently intend to create one container for each document
> and put all the dependent and derived artifacts as objects into
> that container.
> This gives us a nice compartmentalization per document, also
> making it easy to remove a document and everything that is
> connected with it.
>
> During the first test runs we ran into the default limit of
> 1000 containers per user. In the thread mentioned above that
> limit was removed (setting the max_buckets value to 0). We did
> that and now can upload more than 1000 documents.
>
> I just would like to understand
>
> a) if this design is recommended, or if there are reasons to go
>about the whole issue in a different way, potentially giving
>up the benefit of having all document artifacts under one
>convenient handle.
>
> b) is there any absolute limit for max_buckets that we will run
>into? Remember we are talking about 10s of millions of
>containers over time.
>
> c) are any performance issues to be expected with this design
>and can we tune any parameters to alleviate this?
>
> Any feedback would be very much appreciated.

Yehuda can talk about this with more expertise than I can, but I think
it should be basically fine. By creating so many buckets you're
decreasing the effectiveness of RGW's metadata caching, which means
the initial lookup in a particular bucket might take longer.
The big concern is that we do maintain a per-user list of all their
buckets — which is stored in a single RADOS object — so if you have an
extreme number of buckets that RADOS object could get pretty big and
become a bottleneck when creating/removing/listing the buckets. You
should run your own experiments to figure out what the limits are
there; perhaps you have an easy way of sharding up documents into
different users.
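A minimal sketch of that last idea, i.e. hashing each document id onto one of a 
fixed set of RGW users so no single per-user bucket list grows without bound 
(the user names, shard count and hash are made up for illustration; this is not 
an RGW API):

    // shard_user.cc -- pick a deterministic RGW user per document.
    #include <cstdint>
    #include <cstdio>
    #include <string>

    static const uint64_t kNumShardUsers = 16;  // e.g. doc-user-00 .. doc-user-15, created beforehand

    // FNV-1a: a small, stable hash so every host maps a document id the same way.
    static uint64_t fnv1a(const std::string& s) {
      uint64_t h = 1469598103934665603ULL;
      for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
      }
      return h;
    }

    // The application would then use this user's S3 credentials (via jclouds)
    // when creating the document's bucket and objects.
    static std::string shard_user_for(const std::string& document_id) {
      char buf[32];
      std::snprintf(buf, sizeof(buf), "doc-user-%02llu",
                    (unsigned long long)(fnv1a(document_id) % kNumShardUsers));
      return std::string(buf);
    }

    int main() {
      std::printf("%s\n", shard_user_for("invoice-2014-0912").c_str());  // e.g. doc-user-07
      return 0;
    }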
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OpTracker optimization

2014-09-09 Thread Somnath Roy
Hi Sam/Sage,
As we discussed earlier, enabling the present OpTracker code degrades 
performance severely. For example, in my setup a single OSD node with 10 
clients reaches ~103K read iops with IO served from memory while optracking 
is disabled, but with optracker enabled it is reduced to ~39K iops. Probably, 
running OSDs without enabling OpTracker is not an option for many Ceph users.
Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist 
ops_in_flight) and removing some other bottlenecks, I am able to match the 
performance of an OpTracking-enabled OSD with OpTracking disabled, at the 
expense of ~1 extra cpu core.
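
To illustrate the idea only (this is not the actual patch; "Op" and the shard 
count are placeholders):

    #include <functional>
    #include <list>
    #include <mutex>
    #include <vector>

    struct Op { /* request state lives here */ };

    // Instead of one global lock protecting one ops_in_flight list, keep N
    // independent shards and pick one per op, so concurrent
    // register/unregister calls rarely contend on the same mutex.
    class ShardedOpTracker {
      struct Shard {
        std::mutex lock;
        std::list<Op*> ops_in_flight;
      };
      std::vector<Shard> shards_;

      size_t shard_of(const Op* op) const {
        // Hash the pointer to spread ops evenly over the shards.
        return std::hash<const Op*>()(op) % shards_.size();
      }

     public:
      explicit ShardedOpTracker(size_t nshards = 32) : shards_(nshards) {}

      void register_inflight_op(Op* op) {
        Shard& s = shards_[shard_of(op)];
        std::lock_guard<std::mutex> g(s.lock);
        s.ops_in_flight.push_back(op);
      }

      void unregister_inflight_op(Op* op) {
        Shard& s = shards_[shard_of(op)];
        std::lock_guard<std::mutex> g(s.lock);
        s.ops_in_flight.remove(op);  // real code would keep an iterator to make this O(1)
      }
    };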
In this process I have also fixed the following tracker.

http://tracker.ceph.com/issues/9384


and probably http://tracker.ceph.com/issues/8885 too.



I have created following pull request for the same. Please review it.



https://github.com/ceph/ceph/pull/2440



Thanks & Regards

Somnath





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Quenten Grasso

We have been using the NFS/Pacemaker/RBD Method for a while explains it a bit 
better here, http://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
PS: Thanks Sebastien,

Our use case is VMware storage, so as I mentioned we've been running it for 
some time and we've had pretty mixed results.
Pros: when it works, it works really well!
Cons: when it doesn't, I've had a couple of instances where the XFS volumes 
needed fsck, and this took about 3 hours on a 4TB volume. (Lesson learnt: use 
smaller volumes.)
 

ZFS RaidZ Option could be interesting but expensive if using say 3 Pools with 
2x replicas with a RBD volume from each and a RaidZ on top of that. (I assume 
you would use 3 Pools here so we don't end up with data in the same PG which 
may be corrupted.)


Currently we also use FreeNAS VMs, which are backed via RBD w/ 3 replicas and 
ZFS striped volumes, with iSCSI/NFS out of these. While not really HA, it seems 
to mostly work, though FreeNAS iSCSI can get a bit cranky at times.

We are moving towards another KVM hypervisor such as Proxmox for these VMs, 
which don't quite fit into our OpenStack environment, instead of having to use 
"RBD proxies".

Regards,
Quenten Grasso

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dan 
Van Der Ster
Sent: Wednesday, 10 September 2014 12:54 AM
To: Michal Kozanecki
Cc: ceph-users@lists.ceph.com; Blair Bethwaite
Subject: Re: [ceph-users] NAS on RBD


> On 09 Sep 2014, at 16:39, Michal Kozanecki  wrote:
> On 9 September 2014 08:47, Blair Bethwaite  wrote:
>> On 9 September 2014 20:12, Dan Van Der Ster  
>> wrote:
>>> One thing I’m not comfortable with is the idea of ZFS checking the data in 
>>> addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
>>> without any redundancy at the ZFS layer there will be no way to correct 
>>> that error. Of course, the hope is that RADOS will ensure 100% data 
>>> consistency, but what happens if not?...
>> 
>> The ZFS checksumming would tell us if there has been any corruption, which 
>> as you've pointed out shouldn't happen anyway on top of Ceph.
> 
> Just want to quickly address this, someone correct me if I'm wrong, but IIRC 
> even with replica value of 3 or more, ceph does not(currently) have any 
> intelligence when it detects a corrupted/"incorrect" PG, it will always 
> replace/repair the PG with whatever data is in the primary, meaning that if 
> the primary PG is the one that’s corrupted/bit-rotted/"incorrect", it will 
> replace the good replicas with the bad.  

According to the the "scrub error on firefly” thread, repair "tends to choose 
the copy with the lowest osd number which is not obviously corrupted.  Even 
with three replicas, it does not do any kind of voting at this time.”

Cheers, Dan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph data consistency

2014-09-09 Thread Chen, Xiaoxi
Yes, but usually a system has several layers of error-detecting/recovering 
mechanisms at different granularities.

Disk CRC works at the sector level, Ceph's CRC mostly works at the object 
level, and we also have replication/erasure coding at the system level.
The CRC in Ceph mainly handles this case: imagine you have an object there, but 
part of the object has been mistakenly rewritten by some other process; since 
the disk itself works fine, the disk CRC cannot provide any help.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 池信泽
Sent: Thursday, September 4, 2014 4:31 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] ceph data consistency

hi, everyone:

  when I read filestore.cc, I find that Ceph uses a CRC to check the data. Why 
does it need to check the data?

  To my knowledge, the disk has an error-correcting code (ECC) for each sector. 
Looking at the wiki (http://en.wikipedia.org/wiki/Disk_sector): "In disk drives, 
each physical sector is made up of three basic parts, the sector header, the 
data area and the error-correcting code (ECC)". So if the data is not correct, 
the disk can recover it or return an I/O error.

  Can anyone explain it?

 Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)

2014-09-09 Thread Blair Bethwaite
Hi Sage,

Thanks for weighing into this directly and allaying some concerns.

It would be good to get a better understanding about where the rough
edges are - if deployers have some knowledge of those then they can be
worked around to some extent. E.g., for our use-case it may be that
whilst Inktank/RedHat won't provide support for CephFS that we are
better off using it in a tightly controlled fashion (e.g., no
snapshots, restricted set of native clients acting as presentation
layer with others coming in via SAMBA & Ganesha, no dynamic metadata
tree/s, ???) where we're less likely to run into issues.

Related, given there is no fsck, how would one go about backing up the
metadata in order to facilitate DR? Is there even a way for that to
make sense given the decoupling of data & metadata pools...?

Cheers,

On 10 September 2014 03:47, Sage Weil  wrote:
> On Tue, 9 Sep 2014, Blair Bethwaite wrote:
>> > Personally, I think you're very brave to consider running 2PB of ZoL
>> > on RBD. If I were you I would seriously evaluate the CephFS option. It
>> > used to be on the roadmap for ICE 2.0 coming out this fall, though I
>> > noticed its not there anymore (??!!!).
>>
>> Yeah, it's very disappointing that this was silently removed. And it's
>> particularly concerning that this happened post RedHat acquisition.
>> I'm an ICE customer and sure would have liked some input there for
>> exactly the reason we're discussing.
>
> A couple quick comments:
>
> 1) We have more developers actively working on CephFS today than we have
> ever had before.  It is a huge priority for me and the engineering team to
> get it into a state where it is ready for general purpose production
> workloads.
>
> 2) As a scrappy startup like Inktank we were very fast and loose about
> what went into the product roadmap and what claims we made.  Red Hat is
> much more cautious about forward looking statements in their enterprise
> products.  Do not read too much into the presence or non-presence of
> CephFS in the ICE roadmap.  Also note that Red Hat Storage today is
> shipping a fully production-ready and stable distributed file system
> (GlusterFS).
>
> 3) We've recently moved to CephFS in the sepia QA lab for archiving all of
> our test results.  This dogfooding exercise has helped us identify several
> general usability and rough edges that have resulted in changes for giant.
> We identified and fixed two kernel client bugs that went into 3.16 or
> thereabouts.  The biggest problem we had we finally tracked down and
> turned out to be an old bug due to an old kernel client that we forgot was
> mounting the cluster.  Overall, I'm pretty pleased.  CephFS in Giant is
> going to be pretty good.  We are still lacking fsck, so be careful, and
> there are several performance issues we need to address, but I encourage
> anyone who is interested to give Giant CepHFS a go in any environment you
> have were you can tolerate the risk.  We are *very* keen to get feedback
> on performance, stability, robustness, and usability.
>
> Thanks!
> sage



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd unexpected error by leveldb

2014-09-09 Thread Haomai Wang
Please show your Ceph version. There are some known bugs in Firefly.


On Fri, Sep 5, 2014 at 9:12 AM, derek <908429...@qq.com> wrote:

> Dear CEPH ,
> Urgent question, I met a "FAILED assert(0 == "unexpected error")"  
> yesterday
> , Now i have not way to start this OSDS
> I have attached my logs in the attachment, and some  ceph configurations
>  as below
>
>
> osd_pool_default_pgp_num = 300
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 300
> mon_host = 10.1.0.213,10.1.0.214
> osd_crush_chooseleaf_type = 1
> mds_cache_size = 50
> osd objectstore = keyvaluestore-dev‍
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best practices on Filesystem recovery on RBD block volume?

2014-09-09 Thread Keith Phua
Dear ceph-users,

Recently we encountered XFS filesystem corruption on a NAS box.  
After repairing the filesystem, we discovered the files were gone.  This 
triggered some questions with regard to filesystems on RBD, on which I hope the 
community can enlighten me.

1.  If a local filesystem on an RBD block device is corrupted, is it fair to say 
that regardless of how many replicated copies we specified for the pool, unless 
the filesystem is properly repaired and recovered, we may not get our data back?

2.  If the above statement is true, does it mean that severe filesystem 
corruption on an RBD block device constitutes a single point of failure, since 
filesystem corruption can happen when the RBD client is not properly shut down, 
or due to a kernel bug?

3.  Other than existing best practices for filesystem recovery, does Ceph 
have any other best practices for filesystems on RBD which we can adopt for 
data recovery?


Thanks in advance.

Regards,

Keith
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with customized crush rule for EC pool

2014-09-09 Thread Lei Dong
Yes, my goal is to make it so that losing 3 OSDs does not lose data.

My 6 racks may not be in different rooms, but they use 6 different
switches, so I want my data to still be accessible when any switch is down
or unreachable. I don't think that's an unrealistic requirement.
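
For reference, a rough, untested sketch of the combined rule discussed in this
thread (the rule name, ruleset id and tries values are placeholders):

    rule ecpool_six_racks {
            ruleset 1
            type erasure
            min_size 11
            max_size 11
            step set_chooseleaf_tries 10
            step set_choose_tries 100
            step take default
            step choose indep 6 type rack
            step chooseleaf indep 2 type osd
            step emit
    }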


Thanks!

LeiDong.

On 9/9/14, 10:02 PM, "Loic Dachary"  wrote:

>
>
>On 09/09/2014 14:21, Lei Dong wrote:
>> Thanks loic!
>> 
>> Actually I've found that increase choose_local_fallback_tries can
>>help(chooseleaf_tries helps not so significantly), but I'm afraid when
>>osd failure happen and need to find new acting set, it may be fail to
>>find enough racks again. So I'm trying to find a more guaranteed way in
>>case of osd failure.
>> 
>> My profile is nothing special other than k=8 m=3.
>
>So your goal is to make it so loosing 3 OSD simultaneously does not mean
>loosing data. By forcing each rack to hold at most 2 OSDs for a given
>object, you make it so loosing a full rack does not mean loosing data.
>Are these racks in the same room in the datacenter ? In the event of a
>catastrophic failure that permanently destroy one rack, how realistic is
>it that the other racks are unharmed ? If the rack is destroyed by fire
>and is in a row with the six other racks, there is a very high chance
>that the other racks will also be damaged. Note that I am not a system
>architect nor a system administrator : I may be completely wrong ;-) If
>it turns out that the probability of a single rack to fail entirely and
>independently of the others is negligible, it may not be necessary to
>make a complex ruleset and instead use the default ruleset.
>
>My 2cts
> 
>> 
>> Thanks again!
>> 
>> Leidong
>> 
>> 
>> 
>> 
>> 
>>> On 9 Sep 2014, at 7:53 PM, "Loic Dachary"  wrote:
>>>
>>> Hi,
>>>
>>> It is indeed possible that mapping fails if there are just enough
>>>racks to match the constraint. And the probability of a bad mapping
>>>increases when the number of PG increases because there is a need for
>>>more mapping. You can tell crush to try harder with
>>>
>>> step set_chooseleaf_tries 10
>>>
>>> Be careful though : increasing this number will change mapping. It
>>>will not just fix the bad mappings you're seeing, it will also change
>>>the mappings that succeeded with a lower value. Once you've set this
>>>parameter, it cannot be modified.
>>>
>>> Would you mind sharing the erasure code profile you plan to work with ?
>>>
>>> Cheers
>>>
 On 09/09/2014 12:39, Lei Dong wrote:
 Hi ceph users:

 I want to create a customized crush rule for my EC pool (with
replica_size = 11) to distribute replicas into 6 different Racks.

 I use the following rule at first:

 Step take default  // root
 Step choose firstn 6 type rack// 6 racks, I have and only have 6 racks
 Step chooseleaf indep 2 type osd // 2 osds per rack
 Step emit

 I looks fine and works fine when PG num is small.
 But when pg num increase, there are always some PGs which can not
take all the 6 racks.
 It looks like “Step choose firstn 6 type rack” sometimes returns only
5 racks.
 After some investigation,  I think it may caused by collision of
choices.

 Then I come up with another solution to solve collision like this:

 Step take rack0
 Step chooseleaf indep 2 type osd
 Step emit
 Step take rack1
 ….
 (manually take every rack)

 This won’t cause rack collision, because I specify rack by name at
first. But the problem is that osd in rack0 will always be primary osd
because I choose from rack0 first.

 So the question is what is the recommended way to meet such a need
(distribute 11 replicas into 6 racks evenly in case of rack failure)?


 Thanks!
 LeiDong




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>
>-- 
>Loïc Dachary, Artisan Logiciel Libre
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-deploy bug; CentOS 7, Firefly

2014-09-09 Thread Piers Dawson-Damer
Ceph-deploy wants:

ceph-release-1-0.el7.noarch.rpm

But the contents of ceph.com/rpm-firefly/el7/noarch only include the file:

ceph-release-1-0.el7.centos.noarch.rpm

Piers



[stor][DEBUG ] Retrieving 
http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
[stor][WARNIN] curl: (22) The requested URL returned error: 404 Not Found
[stor][WARNIN] error: skipping 
http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm - 
transfer failed
[stor][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: rpm -Uvh 
--replacepkgs 
http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
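
(A possible manual workaround, assuming the .el7.centos package is the intended 
one, is to install the release package under its actual name and then re-run 
ceph-deploy:

    rpm -Uvh --replacepkgs http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.centos.noarch.rpm
)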




Linux stor.domain 3.16.2-1.el7.elrepo.x86_64 #1 SMP Sat Sep 6 11:34:36 EDT 2014 
x86_64 x86_64 x86_64 GNU/Linux

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD journal deployment experiences

2014-09-09 Thread Christian Balzer
On Tue, 9 Sep 2014 10:57:26 -0700 Craig Lewis wrote:

> On Sat, Sep 6, 2014 at 9:27 AM, Christian Balzer  wrote:
> 
> > On Sat, 06 Sep 2014 16:06:56 + Scott Laird wrote:
> >
> > > Backing up slightly, have you considered RAID 5 over your SSDs?
> > >  Practically speaking, there's no performance downside to RAID 5 when
> > > your devices aren't IOPS-bound.
> > >
> >
> > Well...
> > For starters with RAID5 you would loose 25% throughput in both Dan's
> > and my case (4 SSDs) compared to JBOD SSD journals.
> > In Dan's case that might not matter due to other bottlenecks, in my
> > case it certainly would.
> >
> 
> It's a trade off between lower performance all the time, or much lower
> performance while you're backfilling those OSDs.  To me, this seems like
> a somewhat reasonable idea for a small cluster, where losing one SSD
> could lose >5% of the OSDs.  It doesn't seem worth the effort for a large
> cluster, where losing one SSD would lose < 1% of the OSDs.
> 
A good point, but for example in my case (4x DC3700s 100GB with 2 journals
each in front of 8 HDDs) the SSDs are already the limiting factor (and one
I willingly accept). 
Lowering that by another 25% just doesn't feel worth it, given the
reliability/durability of the Intel SSDs.

> 
> >
> > And while you're quite correct when it comes to IOPS, doing RAID5 will
> > either consume significant CPU resource in a software RAID case or
> > require a decent HW RAID controller.
> >
> > Christian
> 
> 
>  I haven't worried about CPU with software RAID5 in a very long time...
> maybe Pentium 4 days?  It's so rare to actually have 0% Idle CPU, even
> under high loads.
>
True in most cases indeed...
 
> Most of my RAID5 is ZFS, but the CPU hasn't been the limiting factor on
> my database or NFS servers.  I'm even doing software crypto, without CPU
> support, with only a 10% performance penalty.  If the CPU has AES
> support, crypto is free.  Obviously, RAID0 (or fully parallel JBOD) will
> be faster than RAID5, but RAID5 is faster than RAID10 for all but the
> most heavily read biased workloads.  Surprised the hell out of me.  I'll
> be converting all of my database servers from RAID10 to RAIDZ.  Of
> course, benchmarks that match your workload trump some random yahoo on
> the internet.  :-)
> 
RAID5 (I won't deploy any below RAID6 with more than 4 drives anymore,
FWIW) will indeed be faster than an equally sized RAID10, given that is
more data disks to play with. However that speed (bandwidth) does not
necessarily translate to IOPS, especially with a software RAID as opposed
to a HW RAID with a large HW cache.

> 
> Ceph OSD nodes are a bit different though.  They're one of the few beasts
> I've dealt with that are CPU, Disk, and network bound all at the same
> time. If you have some idle CPU during a big backfill, then I'd consider
> Software RAID5 a possibility.  If you ever sustain 0% idle, then I
> wouldn't try it.

Precisely and the reason I mentioned it in this context.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] question about RGW

2014-09-09 Thread baijia...@126.com
When I read the RGW code, I can't understand master_ver inside struct 
rgw_bucket_dir_header.
Can anyone explain this struct, in particular master_ver and stats? Thanks.




baijia...@126.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com