Re: [ceph-users] Could not find module rbd. CentOs 6.4

2014-07-28 Thread Karan Singh
Yes, you can use other features such as CephFS and the Object Store on the
kernel release that you are running.

- Karan Singh 
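
(For anyone landing on this thread: a quick, generic way to check whether the
running kernel ships the rbd module at all; none of this is specific to the
cluster discussed here.)

uname -r                     # kernel RBD needs 2.6.34 or newer
modinfo rbd                  # fails if no rbd module is available for this kernel
modprobe rbd && lsmod | grep rbd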


On 28 Jul 2014, at 07:45, Pratik Rupala  wrote:

> Hi Karan,
> 
> I have basic setup of Ceph storage cluster in active+clean state on Linux 
> kernel 2.6.32. As per your suggestion, RBD support starts from 2.6.34 kernel.
> So, can I use other facilities like object store and Cephfs on this setup 
> with 2.6.32 or they are also not supported for this kernel version and is 
> there any way to have Ceph block devices on Linux kernel 2.6.32?
> 
> Regards,
> Pratik Rupala
> 
> 
> On 7/25/2014 5:51 PM, Karan Singh wrote:
>> Hi Pratik
>> 
>> Ceph RBD support was added to the mainline Linux kernel starting with 2.6.34.
>> The following errors show that the RBD module is not present in your kernel.
>> 
>> It is advisable to run the latest stable kernel release if you need RBD to
>> work.
>> 
>>> ERROR: modinfo: could not find module rbd
>>> FATAL: Module rbd not found.
>>> rbd: modprobe rbd failed! (256)
>> 
>> 
>> 
>> - Karan -
>> 
>> On 25 Jul 2014, at 14:52, Pratik Rupala  wrote:
>> 
>>> Hi,
>>> 
>>> I am deploying firefly version on CentOs 6.4. I am following quick 
>>> installation instructions available at ceph.com.
>>> I have my customized kernel version in CentOs 6.4 which is 2.6.32.
>>> 
>>> I am able to create basic Ceph storage cluster with active+clean state. Now 
>>> I am trying to create block device image on ceph client but it is giving 
>>> messages as shown below:
>>> 
>>> [ceph@ceph-client1 ~]$ rbd create foo --size 1024
>>> 2014-07-25 22:31:48.519218 7f6721d43700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x6a7c50 sd=4 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x6a8050).fault
>>> 2014-07-25 22:32:18.536771 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718006310 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f6718006580).fault
>>> 2014-07-25 22:33:09.598763 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f67180063e0 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f6718007e70).fault
>>> 2014-07-25 22:34:08.621655 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:35:19.581978 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:36:23.694665 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:37:28.868293 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:38:29.159830 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:39:28.854441 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718001db0 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f6718006990).fault
>>> 2014-07-25 22:40:14.581055 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718001ac0 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f671800c950).fault
>>> 2014-07-25 22:41:03.794903 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718004d30 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f671800c950).fault
>>> 2014-07-25 22:42:12.537442 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x6a4640 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x6a4a00).fault
>>> 2014-07-25 22:43:18.912430 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718008300 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180080e0).fault
>>> 2014-07-25 22:44:24.129258 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718008300 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f6718008f80).fault
>>> 2014-07-25 22:45:29.174719 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f671800a150 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f671800a620).fault
>>> 2014-07-25 22:46:34.032246 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718008390 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f671800a620).fault
>>> 2014-07-25 22:47:39.551973 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718008390 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f67180077e0).fault
>>> 2014-07-25 22:48:39.342226 7f6721b41700  0 -- 172.17.35.20:0/1003053 >> 
>>> 172.17.35.22:6800/1875 pipe(0x7f6718001db0 sd=5 :0 s=1 pgs=0 cs=0 l=1 
>>> c=0x7f6718003040).fault
>>> 
>>> I am not sure whether block device image has been created or not. Further I 
>>> tried below command which fails:
>>> [ceph@ceph-client1 ~]$ sudo rbd map foo
>>> ERROR: modinfo: could not find module rb

Re: [ceph-users] firefly osds stuck in state booting

2014-07-28 Thread Karan Singh
The output that you have provided shows that the OSDs are not IN. Try the
following:

ceph osd in osd.0
ceph osd in osd.1

service ceph start osd.0
service ceph start osd.1

If you have one more host with one disk, add it; starting with Ceph Firefly
the default replication size is 3.
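
(A slightly fuller sketch of the same idea, assuming the stock sysvinit init
script and that the noup/nodown/noout flags mentioned earlier in the thread may
still be set; adjust to your init system:)

ceph osd unset noup
ceph osd unset nodown
ceph osd unset noout

service ceph start osd.0
service ceph start osd.1

ceph osd in osd.0
ceph osd in osd.1

ceph osd tree    # both OSDs should now report up and in
ceph -s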


- Karan -

On 27 Jul 2014, at 11:17, 10 minus  wrote:

> Hi Sage, 
> 
> I have unset all the flags and even restarted the OSDs.
> No dice .. the OSDs are still stuck.
> 
> 
> 
> --snip--
> ceph daemon osd.0 status
> { "cluster_fsid": "99babb8f-c880-4b32-a227-94aa483d4871",
>   "osd_fsid": "1ad28bde-c23c-44ba-a3b7-0fd3372e",
>   "whoami": 0,
>   "state": "booting",
>   "oldest_map": 1,
>   "newest_map": 24,
>   "num_pgs": 0}
> 
> [root@ceph2 ~]# ceph daemon osd.1 status
> { "cluster_fsid": "99babb8f-c880-4b32-a227-94aa483d4871",
>   "osd_fsid": "becc3252-6977-47d6-87af-7b1337e591d8",
>   "whoami": 1,
>   "state": "booting",
>   "oldest_map": 1,
>   "newest_map": 21,
>   ...
> --snip--
> 
> --snip-- 
> ceph osd tree 
>  
> # id    weight  type name       up/down reweight
> -1      2       root default
> -3      1               host ceph1
> 0       1                       osd.0   down    0
> -2      1               host ceph2
> 1       1                       osd.1   down    0
> 
>  --snip--
> 
> --snip--
>  ceph -s
> cluster 2929fa80-0841-4cb6-a133-90b2098fc802
>  health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
>  monmap e2: 3 mons at 
> {ceph0=10.0.12.220:6789/0,ceph1=10.0.12.221:6789/0,ceph2=10.0.12.222:6789/0}, 
> election epoch 50, quorum 0,1,2 ceph0,ceph1,ceph2
>  osdmap e24: 2 osds: 0 up, 0 in
>   pgmap v25: 192 pgs, 3 pools, 0 bytes data, 0 objects
> 0 kB used, 0 kB / 0 kB avail
>  192 creating
> --snip--
> 
> 
> 
> 
> On Sat, Jul 26, 2014 at 5:57 PM, Sage Weil  wrote:
> On Sat, 26 Jul 2014, 10 minus wrote:
> > Hi,
> >
> > I just setup a test ceph installation on 3 node Centos 6.5  .
> > two of the nodes are used for hosting osds and the third acts as mon .
> >
> > Please note I'm using LVM so had to set up the osd using the manual install
> > guide.
> >
> > --snip--
> > ceph -s
> > cluster 2929fa80-0841-4cb6-a133-90b2098fc802
> >  health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean;
> > noup,nodown,noout flag(s) set
> >  monmap e2: 3 mons 
> > at {ceph0=10.0.12.220:6789/0,ceph1=10.0.12.221:6789/0,ceph2=10.0.12.222:6789/0
> > }, election epoch 46, quorum 0,1,2 ceph0,ceph1,ceph2
> >  osdmap e21: 2 osds: 0 up, 0 in
> > flags noup,nodown,noout
> 
> 
> Do 'ceph osd unset noup' and they should start up.  You likely also want
> to clear nodown and noout as well.
> 
> sage
> 
> 
> >   pgmap v22: 192 pgs, 3 pools, 0 bytes data, 0 objects
> > 0 kB used, 0 kB / 0 kB avail
> >  192 creating
> > --snip--
> >
> > osd tree
> >
> > --snip--
> > ceph osd tree
> > # id    weight  type name       up/down reweight
> > -1      2       root default
> > -3      1               host ceph1
> > 0       1                       osd.0   down    0
> > -2      1               host ceph2
> > 1       1                       osd.1   down    0
> > --snip--
> >
> > --snip--
> >  ceph daemon osd.0 status
> > { "cluster_fsid": "99babb8f-c880-4b32-a227-94aa483d4871",
> >   "osd_fsid": "1ad28bde-c23c-44ba-a3b7-0fd3372e",
> >   "whoami": 0,
> >   "state": "booting",
> >  

Re: [ceph-users] OSD weight 0

2014-07-28 Thread Karan Singh

It looks like osd.1 has a valid auth ID that was defined previously.

Assuming this is your test cluster, try this:

ceph osd crush rm osd.1
ceph osd rm osd.1
ceph auth del osd.1

Then try to add osd.1 again using ceph-deploy (the prepare and then activate
commands), and check the logs carefully for any other clues.
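
(Roughly, the full remove-and-recreate cycle could look like the sketch below;
node-1 and /dev/sdX are placeholders for your host and disk:)

ceph osd crush rm osd.1
ceph auth del osd.1
ceph osd rm osd.1

ceph-deploy osd prepare node-1:/dev/sdX
ceph-deploy osd activate node-1:/dev/sdX1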

- Karan  Singh -

On 25 Jul 2014, at 12:49, Kapil Sharma  wrote:

> Hi,
> 
> I am using ceph-deploy to deploy my cluster. Whenever I try to add more
> than one OSD on a node, every OSD except the first gets a weight of 0
> and ends up in a state of down and out.
> 
> So, if I have three nodes in my cluster, I can successfully add one OSD
> on each of the three nodes, but the moment I try to add a second OSD on
> any of the nodes, it gets a weight of 0 and goes down and out.
> 
> The capacity of all the disks is the same.
> 
> 
> cephdeploy@node-1:~/cluster> ceph osd tree
> # id    weight  type name       up/down reweight
> -1      1.82    root default
> -2      1.82            host node-1
> 0       1.82                    osd.0   up      1
> 1       0                       osd.1   down    0
> 
> There is no error as such after I run ceph-deploy activate command.
> 
> Has anyone seen this issue before ? 
> 
> 
> 
> Kind Regards,
> Kapil.
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Could not find module rbd. CentOs 6.4

2014-07-28 Thread Pratik Rupala

Hi Karan,

So that means I can't have RBD on 2.6.32. Do you know where I can find the
source for rbd.ko for other kernel versions like 2.6.34?
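
(For reference, rbd.ko lives in the mainline kernel tree under
drivers/block/rbd.c. A rough, untested sketch of building it as a module on a
2.6.34+ source tree, not specific to any distribution:)

make menuconfig      # Device Drivers -> Block devices -> Rados block device (RBD) = M  (CONFIG_BLK_DEV_RBD=m)
make modules
make modules_install
modprobe rbd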


Regards,
Pratik Rupala

On 7/28/2014 12:32 PM, Karan Singh wrote:
Yes, you can use other features such as CephFS and the Object Store on the
kernel release that you are running.


- Karan Singh


On 28 Jul 2014, at 07:45, Pratik Rupala > wrote:



Hi Karan,

I have basic setup of Ceph storage cluster in active+clean state on 
Linux kernel 2.6.32. As per your suggestion, RBD support starts from 
2.6.34 kernel.
So, can I use other facilities like object store and Cephfs on this 
setup with 2.6.32 or they are also not supported for this kernel 
version and is there any way to have Ceph block devices on Linux 
kernel 2.6.32?


Regards,
Pratik Rupala


On 7/25/2014 5:51 PM, Karan Singh wrote:

Hi Pratik

Ceph RBD support was added to the mainline Linux kernel starting with
2.6.34. The following errors show that the RBD module is not
present in your kernel.


It is advisable to run the latest stable kernel release if you need RBD
to work.



ERROR: modinfo: could not find module rbd
FATAL: Module rbd not found.
rbd: modprobe rbd failed! (256)



- Karan -

On 25 Jul 2014, at 14:52, Pratik Rupala 
mailto:pratik.rup...@calsoftinc.com>> 
wrote:



Hi,

I am deploying firefly version on CentOs 6.4. I am following quick 
installation instructions available at ceph.com .

I have my customized kernel version in CentOs 6.4 which is 2.6.32.

I am able to create basic Ceph storage cluster with active+clean 
state. Now I am trying to create block device image on ceph client 
but it is giving messages as shown below:


[ceph@ceph-client1 ~]$ rbd create foo --size 1024
2014-07-25 22:31:48.519218 7f6721d43700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 pipe(0x6a7c50 sd=4 
:0 s=1 pgs=0 cs=0 l=1 c=0x6a8050).fault
2014-07-25 22:32:18.536771 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718006310 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6718006580).fault
2014-07-25 22:33:09.598763 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f67180063e0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6718007e70).fault
2014-07-25 22:34:08.621655 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:35:19.581978 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:36:23.694665 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:37:28.868293 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:38:29.159830 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718007e70 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:39:28.854441 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718001db0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6718006990).fault
2014-07-25 22:40:14.581055 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718001ac0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f671800c950).fault
2014-07-25 22:41:03.794903 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718004d30 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f671800c950).fault
2014-07-25 22:42:12.537442 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 pipe(0x6a4640 sd=5 
:0 s=1 pgs=0 cs=0 l=1 c=0x6a4a00).fault
2014-07-25 22:43:18.912430 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718008300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180080e0).fault
2014-07-25 22:44:24.129258 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718008300 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6718008f80).fault
2014-07-25 22:45:29.174719 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f671800a150 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f671800a620).fault
2014-07-25 22:46:34.032246 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718008390 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f671800a620).fault
2014-07-25 22:47:39.551973 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718008390 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f67180077e0).fault
2014-07-25 22:48:39.342226 7f6721b41700  0 -- 
172.17.35.20:0/1003053 >> 172.17.35.22:6800/1875 
pipe(0x7f6718001db0 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6718003040).fault


I am not sure whether block device image has been created or not. 
Further I tried below command which fails:

[ceph@ceph-client1 ~]$ sudo rbd map foo
ERROR: modinfo: could not find module rbd
FATAL: Module rbd not found.
rbd: modpro

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Christian Balzer

Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

> Hello Christian,
> 
> Let me supply more info and answer some questions.
> 
> * Our main concern is high availability, not speed.
> Our storage requirements are not huge.
> However we want good keyboard response 99.99% of the time.   We mostly do
> data entry and reporting.   20-25  users doing mostly order , invoice
> processing and email.
> 
> * DRBD has been very reliable , but I am the SPOF .   Meaning that when
> split brain occurs [ every 18-24 months ] it is me or no one who knows
> what to do. Try to explain how to deal with split brain in advance
> For the future ceph looks like it will be easier to maintain.
> 
The DRBD people would of course tell you to configure things in a way that
a split brain can't happen. ^o^

Note that given the right circumstances (too many OSDs down, MONs down)
Ceph can wind up in a similar state.

> * We use Proxmox . So ceph and mons will share each node. I've used
> proxmox for a few years and like the kvm / openvz management.
> 
I tried it some time ago, but at that time it was still stuck with 2.6.32
due to OpenVZ and that wasn't acceptable to me for various reasons. 
I think it still is, too.

> * Ceph hardware:
> 
> Four  hosts .  8 drives each.
> 
> OPSYS: raid-1  on ssd .
> 
Good, that should be sufficient for running MONs (you will want 3).

> OSD: four disk raid 10 array using  2-TB drives.
> 
> Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
> 128MB Cache SAS 6Gb/s
> 
> the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB
> Cache SATA 6.0Gb/s   drives.
> 
> Journal: 200GB Intel DC S3700 Series
> 
> Spare disk for raid.
> 
> * more questions.
> you wrote:
> "In essence, if your current setup can't handle the loss of a single
> disk, what happens if a node fails?
> You will need to design (HW) and configure (various Ceph options) your
> cluster to handle these things because at some point a recovery might be
> unavoidable.
> 
> To prevent recoveries based on failed disks, use RAID, for node failures
> you could permanently set OSD noout or have a monitoring software do that
> when it detects a node failure."
> 
> I'll research  'OSD noout' .
>
You might well be happy with the "mon osd downout subtree limit" set
to "host" as well.
In that case you will need to manually trigger a rebuild (set that
node/OSD to out) if you can't repair a failed node in a short time and
keep your redundancy levels.
 
> Are there other setting I should read up on / consider?
> 
> For node reboots due to kernel upgrades -  how is that handled?   Of
> course that would be scheduled for off hours.
> 
Set noout before a planned downtime or live dangerously and assume it
comes back within the timeout period (5 minutes IIRC).
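
(As a sketch, the pieces mentioned above could look like this; the subtree-limit
line goes into ceph.conf on the monitors, and the values are examples rather
than recommendations:)

[mon]
    mon osd downout subtree limit = host

# before a planned reboot / kernel upgrade
ceph osd set noout
# ... maintenance, node comes back ...
ceph osd unset noout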

> Any other suggestions?
> 
Test your cluster extensively before going into production.

Fill it with enough data to be close to what you're expecting and fail one
node/OSD. 

See how bad things become, try to determine where any bottlenecks are with
tools like atop.

While you've done pretty much everything to prevent that scenario from a
disk failure with the RAID10 and by keeping nodes from being set out by
whatever means you choose ("mon osd downout subtree limit = host" seems to
work, I just tested it), having a cluster that doesn't melt down when
recovering or at least knowing how bad things will be in such a scenario
helps a lot.

Regards,

Christian

> thanks for the suggestions,
> Rob
> 
> 
> On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > actually replying in the other thread was fine by me, it was after all
> > relevant in a sense to it.
> > And you mentioned something important there, which you didn't mention
> > below, that you're coming from DRBD with a lot of experience there.
> >
> > So do I and Ceph/RBD simply isn't (and probably never will be) an
> > adequate replacement for DRBD in some use cases.
> > I certainly plan to keep deploying DRBD where it makes more sense
> > (IOPS/speed), while migrating everything else to Ceph.
> >
> > Anyway, lets look at your mail:
> >
> > On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
> >
> > > I've a question regarding advice from these threads:
> > >
> > https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > >
> > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
> > >
> > >
> > >
> > >  Our current setup has 4 osd's per node.When a drive  fails   the
> > > cluster is almost unusable for data entry.   I want to change our
> > > set up so that under no circumstances ever happens.
> > >
> >
> > While you can pretty much avoid this from happening, your cluster
> > should be able to handle a recovery.
> > While Ceph is a bit more hamfisted than DRBD and definitely needs more
> > controls and tuning to make recoveries have less of an impact you would
> > see something similar with DRBD and badly config

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Robert Fantini
I have 3 hosts that I want to use to test the new setup...

Currently they have 3-4 OSDs each.

Could you suggest a fast way to remove all the OSDs?




On Mon, Jul 28, 2014 at 3:49 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
>
> > Hello Christian,
> >
> > Let me supply more info and answer some questions.
> >
> > * Our main concern is high availability, not speed.
> > Our storage requirements are not huge.
> > However we want good keyboard response 99.99% of the time.   We mostly do
> > data entry and reporting.   20-25  users doing mostly order , invoice
> > processing and email.
> >
> > * DRBD has been very reliable , but I am the SPOF .   Meaning that when
> > split brain occurs [ every 18-24 months ] it is me or no one who knows
> > what to do. Try to explain how to deal with split brain in advance
> > For the future ceph looks like it will be easier to maintain.
> >
> The DRBD people would of course tell you to configure things in a way that
> a split brain can't happen. ^o^
>
> Note that given the right circumstances (too many OSDs down, MONs down)
> Ceph can wind up in a similar state.
>
> > * We use Proxmox . So ceph and mons will share each node. I've used
> > proxmox for a few years and like the kvm / openvz management.
> >
> I tried it some time ago, but at that time it was still stuck with 2.6.32
> due to OpenVZ and that wasn't acceptable to me for various reasons.
> I think it still is, too.
>
> > * Ceph hardware:
> >
> > Four  hosts .  8 drives each.
> >
> > OPSYS: raid-1  on ssd .
> >
> Good, that should be sufficient for running MONs (you will want 3).
>
> > OSD: four disk raid 10 array using  2-TB drives.
> >
> > Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
> > 128MB Cache SAS 6Gb/s
> >
> > the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB
> > Cache SATA 6.0Gb/s   drives.
> >
> > Journal: 200GB Intel DC S3700 Series
> >
> > Spare disk for raid.
> >
> > * more questions.
> > you wrote:
> > "In essence, if your current setup can't handle the loss of a single
> > disk, what happens if a node fails?
> > You will need to design (HW) and configure (various Ceph options) your
> > cluster to handle these things because at some point a recovery might be
> > unavoidable.
> >
> > To prevent recoveries based on failed disks, use RAID, for node failures
> > you could permanently set OSD noout or have a monitoring software do that
> > when it detects a node failure."
> >
> > I'll research  'OSD noout' .
> >
> You probably might be happy with the "mon osd downout subtree limit" set
> to "host" as well.
> In that case you will need to manually trigger a rebuild (set that
> node/OSD to out) if you can't repair a failed node in a short time and
> keep your redundancy levels.
>
> > Are there other setting I should read up on / consider?
> >
> > For node reboots due to kernel upgrades -  how is that handled?   Of
> > course that would be scheduled for off hours.
> >
> Set noout before a planned downtime or live dangerously and assume it
> comes back within the timeout period (5 minutes IIRC).
>
> > Any other suggestions?
> >
> Test your cluster extensively before going into production.
>
> Fill it with enough data to be close to what you're expecting and fail one
> node/OSD.
>
> See how bad things become, try to determine where any bottlenecks are with
> tools like atop.
>
> While you've done pretty much everything to prevent that scenario from a
> disk failure with the RAID10 and by keeping nodes from being set out by
> whatever means you choose ("mon osd downout subtree limit = host" seems to
> work, I just tested it), having a cluster that doesn't melt down when
> recovering or at least knowing how bad things will be in such a scenario
> helps a lot.
>
> Regards,
>
> Christian
>
> > thanks for the suggestions,
> > Rob
> >
> >
> > On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > actually replying in the other thread was fine by me, it was after
> > > relevant in a sense to it.
> > > And you mentioned something important there, which you didn't mention
> > > below, that you're coming from DRBD with a lot of experience there.
> > >
> > > So do I and Ceph/RBD simply isn't (and probably never will be) an
> > > adequate replacement for DRBD in some use cases.
> > > I certainly plan to keep deploying DRBD where it makes more sense
> > > (IOPS/speed), while migrating everything else to Ceph.
> > >
> > > Anyway, lets look at your mail:
> > >
> > > On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:
> > >
> > > > I've a question regarding advice from these threads:
> > > >
> > >
> https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
> > > >
> > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
> > > >
> > > >
> > > >
> > > >  Our current setup has 4 osd's per node.When a drive  fails   the
> > > > clus

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Christian Balzer

On Mon, 28 Jul 2014 04:19:16 -0400 Robert Fantini wrote:

> I have 3 hosts that i want to use to test new setup...
> 
> Currently they have 3-4 OSD's each.
>
How did you create the current cluster?

ceph-deploy or something within Proxmox?
 
> Could you suggest a fast way to remove all the OSD's ?
> 
There is documentation on how to remove OSDs in the manual deployment
section.

If you can (have no data on it), why not start from scratch?
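
(For reference, the manual-removal steps from the docs boil down to roughly the
following per OSD; osd.0 / id 0 is just an example, and this assumes the data
really is disposable:)

ceph osd out 0
service ceph stop osd.0        # or stop the daemon however Proxmox manages it
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0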

Christian
> 
> 
> 
> On Mon, Jul 28, 2014 at 3:49 AM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> >
> > > Hello Christian,
> > >
> > > Let me supply more info and answer some questions.
> > >
> > > * Our main concern is high availability, not speed.
> > > Our storage requirements are not huge.
> > > However we want good keyboard response 99.99% of the time.   We
> > > mostly do data entry and reporting.   20-25  users doing mostly
> > > order , invoice processing and email.
> > >
> > > * DRBD has been very reliable , but I am the SPOF .   Meaning that
> > > when split brain occurs [ every 18-24 months ] it is me or no one
> > > who knows what to do. Try to explain how to deal with split brain in
> > > advance For the future ceph looks like it will be easier to
> > > maintain.
> > >
> > The DRBD people would of course tell you to configure things in a way
> > that a split brain can't happen. ^o^
> >
> > Note that given the right circumstances (too many OSDs down, MONs down)
> > Ceph can wind up in a similar state.
> >
> > > * We use Proxmox . So ceph and mons will share each node. I've used
> > > proxmox for a few years and like the kvm / openvz management.
> > >
> > I tried it some time ago, but at that time it was still stuck with
> > 2.6.32 due to OpenVZ and that wasn't acceptable to me for various
> > reasons. I think it still is, too.
> >
> > > * Ceph hardware:
> > >
> > > Four  hosts .  8 drives each.
> > >
> > > OPSYS: raid-1  on ssd .
> > >
> > Good, that should be sufficient for running MONs (you will want 3).
> >
> > > OSD: four disk raid 10 array using  2-TB drives.
> > >
> > > Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
> > > 128MB Cache SAS 6Gb/s
> > >
> > > the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM
> > > 64MB Cache SATA 6.0Gb/s   drives.
> > >
> > > Journal: 200GB Intel DC S3700 Series
> > >
> > > Spare disk for raid.
> > >
> > > * more questions.
> > > you wrote:
> > > "In essence, if your current setup can't handle the loss of a single
> > > disk, what happens if a node fails?
> > > You will need to design (HW) and configure (various Ceph options)
> > > your cluster to handle these things because at some point a recovery
> > > might be unavoidable.
> > >
> > > To prevent recoveries based on failed disks, use RAID, for node
> > > failures you could permanently set OSD noout or have a monitoring
> > > software do that when it detects a node failure."
> > >
> > > I'll research  'OSD noout' .
> > >
> > You probably might be happy with the "mon osd downout subtree limit"
> > set to "host" as well.
> > In that case you will need to manually trigger a rebuild (set that
> > node/OSD to out) if you can't repair a failed node in a short time and
> > keep your redundancy levels.
> >
> > > Are there other setting I should read up on / consider?
> > >
> > > For node reboots due to kernel upgrades -  how is that handled?   Of
> > > course that would be scheduled for off hours.
> > >
> > Set noout before a planned downtime or live dangerously and assume it
> > comes back within the timeout period (5 minutes IIRC).
> >
> > > Any other suggestions?
> > >
> > Test your cluster extensively before going into production.
> >
> > Fill it with enough data to be close to what you're expecting and fail
> > one node/OSD.
> >
> > See how bad things become, try to determine where any bottlenecks are
> > with tools like atop.
> >
> > While you've done pretty much everything to prevent that scenario from
> > a disk failure with the RAID10 and by keeping nodes from being set out
> > by whatever means you choose ("mon osd downout subtree limit = host"
> > seems to work, I just tested it), having a cluster that doesn't melt
> > down when recovering or at least knowing how bad things will be in
> > such a scenario helps a lot.
> >
> > Regards,
> >
> > Christian
> >
> > > thanks for the suggestions,
> > > Rob
> > >
> > >
> > > On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer 
> > > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > actually replying in the other thread was fine by me, it was after
> > > > relevant in a sense to it.
> > > > And you mentioned something important there, which you didn't
> > > > mention below, that you're coming from DRBD with a lot of
> > > > experience there.
> > > >
> > > > So do I and Ceph/RBD simply isn't (and probably never will be) an
> > > > adequate replacement for DRBD in some use cases.
> > > > I certainly plan to keep deploying

Re: [ceph-users] Issues compiling Ceph (master branch) on Debian Wheezy (armhf)

2014-07-28 Thread Joao Eduardo Luis

On 07/25/2014 04:54 AM, Deven Phillips wrote:

Hi all,

 I am in the process of installing and setting up Ceph on a group of
Allwinner A20 SoC mini computers. They are armhf devices and I have
installed Cubian (http://cubian.org/), which is a port of Debian Wheezy.
I tried to follow the instructions at:

http://ceph.com/docs/master/install/build-ceph/

But I found that some needed dependencies were not installed. Below is a
list of the items I had to install in order to compile Ceph for these
devices:

uuid-dev
libblkid-dev
libudev-dev
libatomic-ops-dev
libsnappy-dev
libleveldb-dev
xfslibs-dev
libboost-all-dev

I also had to specify --without-tcmalloc because I could not find a
package which implements that for the armhf platform.
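
(A rough sketch of the resulting build sequence, assuming the standard autotools
flow from the build-ceph instructions; the package list is the one above:)

sudo apt-get install uuid-dev libblkid-dev libudev-dev libatomic-ops-dev \
    libsnappy-dev libleveldb-dev xfslibs-dev libboost-all-dev

./autogen.sh
./configure --without-tcmalloc
make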


http://ceph.com/packages/google-perftools/debian/

FWIW, I recall this working fine on the Cubietruck.  Can't recall though 
if there was any foo involved, but I don't think so.


  -Joao

--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Robert Fantini
The OSDs were created using the Proxmox web page.

There is no data that I want to save.

So I'd like to start from scratch, but without doing a reinstall of the
operating system.

I'll check the documentation that you mentioned.


On Mon, Jul 28, 2014 at 4:38 AM, Christian Balzer  wrote:

>
> On Mon, 28 Jul 2014 04:19:16 -0400 Robert Fantini wrote:
>
> > I have 3 hosts that i want to use to test new setup...
> >
> > Currently they have 3-4 OSD's each.
> >
> How did you create the current cluster?
>
> ceph-deploy or something within Proxmox?
>
> > Could you suggest a fast way to remove all the OSD's ?
> >
> There is documentation on how to remove OSDs in the manual deployment
> section.
>
> If you can (have no data on it), why not start from scratch?
>
> Christian
> >
> >
> >
> > On Mon, Jul 28, 2014 at 3:49 AM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > > > Hello Christian,
> > > >
> > > > Let me supply more info and answer some questions.
> > > >
> > > > * Our main concern is high availability, not speed.
> > > > Our storage requirements are not huge.
> > > > However we want good keyboard response 99.99% of the time.   We
> > > > mostly do data entry and reporting.   20-25  users doing mostly
> > > > order , invoice processing and email.
> > > >
> > > > * DRBD has been very reliable , but I am the SPOF .   Meaning that
> > > > when split brain occurs [ every 18-24 months ] it is me or no one
> > > > who knows what to do. Try to explain how to deal with split brain in
> > > > advance For the future ceph looks like it will be easier to
> > > > maintain.
> > > >
> > > The DRBD people would of course tell you to configure things in a way
> > > that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs down)
> > > Ceph can wind up in a similar state.
> > >
> > > > * We use Proxmox . So ceph and mons will share each node. I've used
> > > > proxmox for a few years and like the kvm / openvz management.
> > > >
> > > I tried it some time ago, but at that time it was still stuck with
> > > 2.6.32 due to OpenVZ and that wasn't acceptable to me for various
> > > reasons. I think it still is, too.
> > >
> > > > * Ceph hardware:
> > > >
> > > > Four  hosts .  8 drives each.
> > > >
> > > > OPSYS: raid-1  on ssd .
> > > >
> > > Good, that should be sufficient for running MONs (you will want 3).
> > >
> > > > OSD: four disk raid 10 array using  2-TB drives.
> > > >
> > > > Two of the systems will use Seagate Constellation ES.3  2TB 7200 RPM
> > > > 128MB Cache SAS 6Gb/s
> > > >
> > > > the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM
> > > > 64MB Cache SATA 6.0Gb/s   drives.
> > > >
> > > > Journal: 200GB Intel DC S3700 Series
> > > >
> > > > Spare disk for raid.
> > > >
> > > > * more questions.
> > > > you wrote:
> > > > "In essence, if your current setup can't handle the loss of a single
> > > > disk, what happens if a node fails?
> > > > You will need to design (HW) and configure (various Ceph options)
> > > > your cluster to handle these things because at some point a recovery
> > > > might be unavoidable.
> > > >
> > > > To prevent recoveries based on failed disks, use RAID, for node
> > > > failures you could permanently set OSD noout or have a monitoring
> > > > software do that when it detects a node failure."
> > > >
> > > > I'll research  'OSD noout' .
> > > >
> > > You probably might be happy with the "mon osd downout subtree limit"
> > > set to "host" as well.
> > > In that case you will need to manually trigger a rebuild (set that
> > > node/OSD to out) if you can't repair a failed node in a short time and
> > > keep your redundancy levels.
> > >
> > > > Are there other setting I should read up on / consider?
> > > >
> > > > For node reboots due to kernel upgrades -  how is that handled?   Of
> > > > course that would be scheduled for off hours.
> > > >
> > > Set noout before a planned downtime or live dangerously and assume it
> > > comes back within the timeout period (5 minutes IIRC).
> > >
> > > > Any other suggestions?
> > > >
> > > Test your cluster extensively before going into production.
> > >
> > > Fill it with enough data to be close to what you're expecting and fail
> > > one node/OSD.
> > >
> > > See how bad things become, try to determine where any bottlenecks are
> > > with tools like atop.
> > >
> > > While you've done pretty much everything to prevent that scenario from
> > > a disk failure with the RAID10 and by keeping nodes from being set out
> > > by whatever means you choose ("mon osd downout subtree limit = host"
> > > seems to work, I just tested it), having a cluster that doesn't melt
> > > down when recovering or at least knowing how bad things will be in
> > > such a scenario helps a lot.
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > thanks for the suggestions,
> > > > Rob
> > > >
> > > >
> > > > On Sat, Jul 26, 2014

[ceph-users] Not able to upload object using Horizon(Openstack Dashboard) to Ceph

2014-07-28 Thread Ashish Chandra
Hi Cephers,

I have configured Ceph RadosGW for Swift. I have also set up authentication
using Keystone. Using the Swift CLI I can do everything: uploading containers
and objects, listing, and so on. But while using the Dashboard I am able to do
everything except upload an object.

While uploading an object I am getting a "411 Length Required" error in the
backend.

Any idea what could be going wrong? Please help.


-- 

.-  -..--.  ,---.  .-=<>=-.
   /_-\'''/-_\  / / '' \ \ |,-.| /____\
  |/  o) (o  \|| | ')(' | |   /,'-'.\   |/ (')(') \|
   \   ._.   /  \ \/ /   {_/(') (')\_}   \   __   /
   ,>-_,,,_-<.   >'=jf='< `.   _   .','--__--'.
 /  .  \/\ /'-___-'\/:|\
(_) . (_)  /  \   / \  (_)   :|   (_)
 \_-'--/  (_)(_) (_)___(_)   |___:||
  \___/ || \___/ |_|


Thanks and Regards

Ashish Chandra

Openstack Developer, Cloud Engineering

Reliance Jio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD weight 0

2014-07-28 Thread Kapil Sharma
It's fixed now. Apparently we cannot share a journal across different
OSDs. I added a journal /dev/sdc1 (20GB) with my first OSD. I was trying
to use the same journal for my second OSD, and that was causing the issue.
Once I added the second OSD with a new journal, it worked fine.
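
(In other words, every OSD needs its own journal device or partition; with
ceph-deploy that could look roughly like this, where node-1, sdb, sdd and the
sdc partitions are placeholders:)

ceph-deploy osd prepare  node-1:/dev/sdb:/dev/sdc1    # first OSD, journal on sdc1
ceph-deploy osd prepare  node-1:/dev/sdd:/dev/sdc2    # second OSD, its own journal on sdc2
ceph-deploy osd activate node-1:/dev/sdb1:/dev/sdc1
ceph-deploy osd activate node-1:/dev/sdd1:/dev/sdc2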


Thanks,
Kapil.




On Mon, 2014-07-28 at 10:16 +0300, Karan Singh wrote:
> 
> 
> Looks like osd.1 has a valid auth ID , which was defined previously.
> 
> 
> Trust this is your test cluster , try this
> 
> 
> ceph osd crush rm osd.1
> ceph osd rm osd.1
> ceph auth del osd.1
> 
> 
> Once again try to add osd.1 using ceph-deploy ( prepare and then
> activate commands ) , check the logs carefully for any other clues.
> 
> 
> - Karan  Singh -
> 
> On 25 Jul 2014, at 12:49, Kapil Sharma  wrote:
> 
> > Hi,
> > 
> > I am using ceph-deploy to deploy my cluster. Whenever I try to add
> > more than one OSD on a node, every OSD except the first gets a
> > weight of 0 and ends up in a state of down and out.
> > 
> > So, if I have three nodes in my cluster, I can successfully add one
> > OSD on each of the three nodes, but the moment I try to add a second
> > OSD on any of the nodes, it gets a weight of 0 and goes down and out.
> > 
> > The capacity of all the disks is the same.
> > 
> > 
> > cephdeploy@node-1:~/cluster> ceph osd tree
> > # id    weight  type name       up/down reweight
> > -1      1.82    root default
> > -2      1.82            host node-1
> > 0       1.82                    osd.0   up      1
> > 1       0                       osd.1   down    0
> > 
> > There is no error as such after I run ceph-deploy activate command.
> > 
> > Has anyone seen this issue before ? 
> > 
> > 
> > 
> > Kind Regards,
> > Kapil.
> > 
> > 
> > 
> > 
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not able to upload object using Horizon(Openstack Dashboard) to Ceph

2014-07-28 Thread Ashish Chandra
Hi Karan,

Once you are able to put objects using a RadosGW-created user name and
password, 90% of the job is done. We only have to follow these steps
afterwards:
1. Put in place the configuration specified in
http://ceph.com/docs/master/radosgw/keystone/
2. Make sure you create the object-store service and endpoint in keystone:
 $ keystone service-create --name swift --type object-store
 $ keystone endpoint-create --region RegionOne \
     --service-id 1132aee9446b4efa83d41375daf231c5 \
     --publicurl http://firefly-master.ashish.com/swift/v1 \
     --internalurl http://firefly-master.ashish.com/swift/v1 \
     --adminurl http://firefly-master.ashish.com/swift/v1

After these two steps we are good to go. I am using OpenStack master,
installed using Devstack. Due to a recent patch (11th July), keystone runs
behind apache2, so we have to put "WSGIChunkedRequest On" in rgw.conf.
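
(For completeness, the ceph.conf side of step 1 looks roughly like the snippet
below, per the keystone doc linked above; the section name, keystone host, admin
token, roles and NSS path are placeholders for your own values:)

[client.radosgw.gateway]
    rgw keystone url = http://keystone-host:35357
    rgw keystone admin token = {keystone admin token}
    rgw keystone accepted roles = Member, admin
    rgw keystone token cache size = 500
    rgw keystone revocation interval = 600
    rgw s3 auth use keystone = true
    nss db path = /var/ceph/nss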

Please refer to :
https://github.com/ashishchandra1/swift_ceph_operations/blob/master/ceph_swift_keystone_authentication

Please don't hesitate to connect with me on Hangouts anytime if you get stuck
on any issue; I will be more than happy to help make it work for you, and who
knows, you'll provide me a solution for my issue.



On Mon, Jul 28, 2014 at 3:04 PM, Karan Singh  wrote:

> Hey Ashish
>
> Sorry i can’t help you in your issue , as you are ahead of me . But i
> require your help in setting up what you have already done.
>
> I need your help in setting keystone authentication for Ceph RGW and want
> to upload containers , objects , listing from Swift CLI. I have a running
> Ceph cluster with RGW configured , i can put objects from Swift CLI to Ceph
> Cluster using username and password.
> Now i want to use keystone authentication so that i do not need to supply
> username and password to store objects.
>
> Requesting your help with the steps i should take to make this happen.
>
>
> 
> Karan Singh
> Systems Specialist , Storage Platforms
> CSC - IT Center for Science,
> Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
> mobile: +358 503 812758
> tel. +358 9 4572001
> fax +358 9 4572302
> http://www.csc.fi/
> 
>
> On 28 Jul 2014, at 12:20, Ashish Chandra 
> wrote:
>
>
> Hi Cephers,
>
> I have configured Ceph RadosGW for Swift. I have also set authentication
> using keystone. Using Swift CLI I can do all stuff viz. uploading
> container, object, listing. But while using Dashboard I am able to do all
> the stuff apart from uploading an object.
>
> While uploading an object I am getting 411 length required error in
> backend.
>
> Any idea what could be going wrong. Please help.
>
>
> --
>
> .-  -..--.  ,---.  .-=<>=-.
>/_-\'''/-_\  / / '' \ \ |,-.| /____\
>   |/  o) (o  \|| | ')(' | |   /,'-'.\   |/ (')(') \|
>\   ._.   /  \ \/ /   {_/(') (')\_}   \   __   /
>,>-_,,,_-<.   >'=jf='< `.   _   .','--__--'.
>  /  .  \/\ /'-___-'\/:|\
> (_) . (_)  /  \   / \  (_)   :|   (_)
>  \_-'--/  (_)(_) (_)___(_)   |___:||
>   \___/ || \___/ |_|
>
>
> Thanks and Regards
>
> Ashish Chandra
>
> Openstack Developer, Cloud Engineering
>
> Reliance Jio
>
>  ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>


-- 

.-  -..--.  ,---.  .-=<>=-.
   /_-\'''/-_\  / / '' \ \ |,-.| /____\
  |/  o) (o  \|| | ')(' | |   /,'-'.\   |/ (')(') \|
   \   ._.   /  \ \/ /   {_/(') (')\_}   \   __   /
   ,>-_,,,_-<.   >'=jf='< `.   _   .','--__--'.
 /  .  \/\ /'-___-'\/:|\
(_) . (_)  /  \   / \  (_)   :|   (_)
 \_-'--/  (_)(_) (_)___(_)   |___:||
  \___/ || \___/ |_|


Thanks and Regards

Ashish Chandra

Openstack Developer, Cloud Engineering

Reliance Jio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Joao Eduardo Luis

On 07/28/2014 08:49 AM, Christian Balzer wrote:


Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:


Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed.
Our storage requirements are not huge.
However we want good keyboard response 99.99% of the time.   We mostly do
data entry and reporting.   20-25  users doing mostly order , invoice
processing and email.

* DRBD has been very reliable , but I am the SPOF .   Meaning that when
split brain occurs [ every 18-24 months ] it is me or no one who knows
what to do. Try to explain how to deal with split brain in advance
For the future ceph looks like it will be easier to maintain.


The DRBD people would of course tell you to configure things in a way that
a split brain can't happen. ^o^

Note that given the right circumstances (too many OSDs down, MONs down)
Ceph can wind up in a similar state.



I am not sure what you mean by ceph winding up in a similar state.  If 
you mean regarding 'split brain' in the usual sense of the term, it does 
not occur in Ceph.  If it does, you have surely found a bug and you 
should let us know with lots of CAPS.


What you can incur though if you have too many monitors down is cluster 
downtime.  The monitors will ensure you need a strict majority of 
monitors up in order to operate the cluster, and will not serve requests 
if said majority is not in place.  The monitors will only serve requests 
when there's a formed 'quorum', and a quorum is only formed by (N/2)+1 
monitors, N being the total number of monitors in the cluster (via the 
monitor map -- monmap).


This said, if out of 3 monitors you have 2 monitors down, your cluster
will cease functioning (no admin commands, no writes or reads served).
Since there is no configuration in which you can have two strict
majorities, no two partitions of the cluster are able to function at the
same time, so you do not incur split brain.
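
(A quick worked example of that (N/2)+1 rule: with N=3 a quorum needs 2
monitors, with N=5 it needs 3, and with N=2 it still needs 2, so an even number
of monitors buys no extra failure tolerance over the next smaller odd number.)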


If you are a creative admin however, you may be able to enforce split 
brain by modifying monmaps.  In the end you'd obviously end up with two 
distinct monitor clusters, but if you so happened to not inform the 
clients about this there's a fair chance that it would cause havoc with 
unforeseen effects.  Then again, this would be the operator's fault, not 
Ceph itself -- especially because rewriting monitor maps is not trivial 
enough for someone to mistakenly do something like this.


  -Joao


--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recommendation to safely avoid problems with osd-failure

2014-07-28 Thread Josef Johansson
Hi,

I'm trying to put together a strategy to avoid performance problems when OSDs
or OSD hosts fail.

Right now, if a rebalance of one OSD kicks in during mid-day, performance
suffers; if I could see the issue and let it rebalance during the evening
instead, that would be great.

Likewise, if two OSD hosts die around the same time, I suspect the clients
would suffer greatly.

Currently the OSDs have the following settings:

 osd max backfills = 1
 osd recovery max active = 1

Is there any general guidance or recommendation for unexpected outages?
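
(A hedged sketch of the kind of knobs usually involved here, on top of the
ceph.conf settings above; the commands are standard, but osd id 12 is only an
example and the right values depend on the cluster:)

# keep backfill/recovery throttled at run time (matches the settings above)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'

# an OSD or host dies during the day: hold off the automatic re-balance
ceph osd set noout

# in the evening, let the data move off the failed OSD, then clear the flag
ceph osd out 12
ceph osd unset noout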

Cheers,
Josef Johansson
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Joao Eduardo Luis

(CC'ing ceph-users)

On 07/28/2014 12:34 PM, Marc wrote:

Hi,


This said, if out of 3 monitors you have 2 monitors down, your cluster
will cease functioning (no admin commands, no writes or reads served).


This is not entirely true. (At least) RBDs will continue being fully
functional even if the mon quorum is lost. This only applies to RBDs
that are already mounted (qemu) at the time of quorum loss though.

Meaning: (K)VMs running off of Ceph will remain fully functional even if
the mon quorum is lost (assuming you havent lost too many OSDs at the
same time).


True.  Clients will maintain the connections they have to OSDs for about 
15 minutes or so, at which point timeouts will go off and all work will 
be halted.  New clients won't be able to do this though, as they have to 
grab maps from the monitors prior to connecting to OSDs, and the monitor 
will not serve those requests if quorum is not in place.


  -Joao



On 28/07/2014 12:22, Joao Eduardo Luis wrote:

On 07/28/2014 08:49 AM, Christian Balzer wrote:


Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:


Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed.
Our storage requirements are not huge.
However we want good keyboard response 99.99% of the time.   We
mostly do
data entry and reporting.   20-25  users doing mostly order , invoice
processing and email.

* DRBD has been very reliable , but I am the SPOF .   Meaning that when
split brain occurs [ every 18-24 months ] it is me or no one who knows
what to do. Try to explain how to deal with split brain in advance
For the future ceph looks like it will be easier to maintain.


The DRBD people would of course tell you to configure things in a way
that
a split brain can't happen. ^o^

Note that given the right circumstances (too many OSDs down, MONs down)
Ceph can wind up in a similar state.



I am not sure what you mean by ceph winding up in a similar state.  If
you mean regarding 'split brain' in the usual sense of the term, it does
not occur in Ceph.  If it does, you have surely found a bug and you
should let us know with lots of CAPS.

What you can incur though if you have too many monitors down is cluster
downtime.  The monitors will ensure you need a strict majority of
monitors up in order to operate the cluster, and will not serve requests
if said majority is not in place.  The monitors will only serve requests
when there's a formed 'quorum', and a quorum is only formed by (N/2)+1
monitors, N being the total number of monitors in the cluster (via the
monitor map -- monmap).

This said, if out of 3 monitors you have 2 monitors down, your cluster
will cease functioning (no admin commands, no writes or reads served).
As there is no configuration in which you can have two strict
majorities, thus no two partitions of the cluster are able to function
at the same time, you do not incur in split brain.

If you are a creative admin however, you may be able to enforce split
brain by modifying monmaps.  In the end you'd obviously end up with two
distinct monitor clusters, but if you so happened to not inform the
clients about this there's a fair chance that it would cause havoc with
unforeseen effects.  Then again, this would be the operator's fault, not
Ceph itself -- especially because rewriting monitor maps is not trivial
enough for someone to mistakenly do something like this.

   -Joao







--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Robert Fantini
Is the '15 minutes or so '  something that can be configured at run time?


On Mon, Jul 28, 2014 at 8:44 AM, Joao Eduardo Luis 
wrote:

> (CC'ing ceph-users)
>
> On 07/28/2014 12:34 PM, Marc wrote:
>
>> Hi,
>>
>>
>>  This said, if out of 3 monitors you have 2 monitors down, your cluster
>>> will cease functioning (no admin commands, no writes or reads served).
>>>
>>
>> This is not entirely true. (At least) RBDs will continue being fully
>> functional even if the mon quorum is lost. This only applies to RBDs
>> that are already mounted (qemu) at the time of quorum loss though.
>>
>> Meaning: (K)VMs running off of Ceph will remain fully functional even if
>> the mon quorum is lost (assuming you havent lost too many OSDs at the
>> same time).
>>
>
> True.  Clients will maintain the connections they have to OSDs for about
> 15 minutes or so, at which point timeouts will go off and all work will be
> halted.  New clients won't be able to do this though, as they have to grab
> maps from the monitors prior to connecting to OSDs, and the monitor will
> not serve those requests if quorum is not in place.
>
>   -Joao
>
>
>
>> On 28/07/2014 12:22, Joao Eduardo Luis wrote:
>>
>>> On 07/28/2014 08:49 AM, Christian Balzer wrote:
>>>

 Hello,

 On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

  Hello Christian,
>
> Let me supply more info and answer some questions.
>
> * Our main concern is high availability, not speed.
> Our storage requirements are not huge.
> However we want good keyboard response 99.99% of the time.   We
> mostly do
> data entry and reporting.   20-25  users doing mostly order , invoice
> processing and email.
>
> * DRBD has been very reliable , but I am the SPOF .   Meaning that when
> split brain occurs [ every 18-24 months ] it is me or no one who knows
> what to do. Try to explain how to deal with split brain in advance
> For the future ceph looks like it will be easier to maintain.
>
>  The DRBD people would of course tell you to configure things in a way
 that
 a split brain can't happen. ^o^

 Note that given the right circumstances (too many OSDs down, MONs down)
 Ceph can wind up in a similar state.

>>>
>>>
>>> I am not sure what you mean by ceph winding up in a similar state.  If
>>> you mean regarding 'split brain' in the usual sense of the term, it does
>>> not occur in Ceph.  If it does, you have surely found a bug and you
>>> should let us know with lots of CAPS.
>>>
>>> What you can incur though if you have too many monitors down is cluster
>>> downtime.  The monitors will ensure you need a strict majority of
>>> monitors up in order to operate the cluster, and will not serve requests
>>> if said majority is not in place.  The monitors will only serve requests
>>> when there's a formed 'quorum', and a quorum is only formed by (N/2)+1
>>> monitors, N being the total number of monitors in the cluster (via the
>>> monitor map -- monmap).
>>>
>>> This said, if out of 3 monitors you have 2 monitors down, your cluster
>>> will cease functioning (no admin commands, no writes or reads served).
>>> As there is no configuration in which you can have two strict
>>> majorities, thus no two partitions of the cluster are able to function
>>> at the same time, you do not incur in split brain.
>>>
>>> If you are a creative admin however, you may be able to enforce split
>>> brain by modifying monmaps.  In the end you'd obviously end up with two
>>> distinct monitor clusters, but if you so happened to not inform the
>>> clients about this there's a fair chance that it would cause havoc with
>>> unforeseen effects.  Then again, this would be the operator's fault, not
>>> Ceph itself -- especially because rewriting monitor maps is not trivial
>>> enough for someone to mistakenly do something like this.
>>>
>>>-Joao
>>>
>>>
>>>
>>
>
> --
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Joao Eduardo Luis

On 07/28/2014 02:07 PM, Robert Fantini wrote:

Is the '15 minutes or so '  something that can be configured at run time?


Someone who knows this better than I do should probably chime in, but 
from a quick look throughout the code it seems to be 
'client_mount_interval', which by default is 300 seconds (5 minutes) 
instead of 15 minutes.


As with all (or most?) options, this can be adjusted at run time via 
injectargs (via 'ceph tell') or 'config set' (via the admin socket).


Please bear in mind that just because you can adjust it doesn't mean 
that you should.  Keeping live connections alive should not be a 
problem, but given I haven't given much thought to it there's a chance 
that I'm missing something.


  -Joao




On Mon, Jul 28, 2014 at 8:44 AM, Joao Eduardo Luis
mailto:joao.l...@inktank.com>> wrote:

(CC'ing ceph-users)

On 07/28/2014 12:34 PM, Marc wrote:

Hi,


This said, if out of 3 monitors you have 2 monitors down,
your cluster
will cease functioning (no admin commands, no writes or
reads served).


This is not entirely true. (At least) RBDs will continue being fully
functional even if the mon quorum is lost. This only applies to RBDs
that are already mounted (qemu) at the time of quorum loss though.

Meaning: (K)VMs running off of Ceph will remain fully functional
even if
the mon quorum is lost (assuming you havent lost too many OSDs
at the
same time).


True.  Clients will maintain the connections they have to OSDs for
about 15 minutes or so, at which point timeouts will go off and all
work will be halted.  New clients won't be able to do this though,
as they have to grab maps from the monitors prior to connecting to
OSDs, and the monitor will not serve those requests if quorum is not
in place.

   -Joao



On 28/07/2014 12:22, Joao Eduardo Luis wrote:

On 07/28/2014 08:49 AM, Christian Balzer wrote:


Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

Hello Christian,

Let me supply more info and answer some questions.

* Our main concern is high availability, not speed.
Our storage requirements are not huge.
However we want good keyboard response 99.99% of the
time.   We
mostly do
data entry and reporting.   20-25  users doing
mostly order , invoice
processing and email.

* DRBD has been very reliable , but I am the SPOF .
   Meaning that when
split brain occurs [ every 18-24 months ] it is me
or no one who knows
what to do. Try to explain how to deal with split
brain in advance
For the future ceph looks like it will be easier to
maintain.

The DRBD people would of course tell you to configure
things in a way
that
a split brain can't happen. ^o^

Note that given the right circumstances (too many OSDs
down, MONs down)
Ceph can wind up in a similar state.



I am not sure what you mean by ceph winding up in a similar
state.  If
you mean regarding 'split brain' in the usual sense of the
term, it does
not occur in Ceph.  If it does, you have surely found a bug
and you
should let us know with lots of CAPS.

What you can incur though if you have too many monitors down
is cluster
downtime.  The monitors will ensure you need a strict
majority of
monitors up in order to operate the cluster, and will not
serve requests
if said majority is not in place.  The monitors will only
serve requests
when there's a formed 'quorum', and a quorum is only formed
by (N/2)+1
monitors, N being the total number of monitors in the
cluster (via the
monitor map -- monmap).

This said, if out of 3 monitors you have 2 monitors down,
your cluster
will cease functioning (no admin commands, no writes or
reads served).
As there is no configuration in which you can have two strict
majorities, thus no two partitions of the cluster are able
to function
at the same time, you do not incur in split brain.

If you are a creative admin however, you may be able to
enforce split
brain by modifying monmaps.  In the end you'd obviously end
up with two
distinct

Re: [ceph-users] Pool size 2 min_size 1 Advisability?

2014-07-28 Thread Edward Huyer
> > Ceph has a default pool size of 3. Is it a bad idea to run a pool of
> > size 2? What about size 2 min_size 1?
> >
> min_size 1 is sensible, 2 obviously won't protect you against dual disk 
> failures.
> Which happen and happen with near certainty once your cluster gets big
> enough.

I thought I saw somewhere in the docs that there could be issues with min_size 
1, but I can't seem to find it now.

> > I have a cluster I'm moving data into (on RBDs) that is full enough
> > with size 3 that I'm bumping into nearfull warnings. Part of that is
> > because of the amount of data, part is probably because of suboptimal
> > tuning (Proxmox VE doesn't support all the tuning options), and part
> > is probably because of unbalanced drive distribution and multiple
> > drive sizes.
> >
> > I'm hoping I'll be able to solve the drive size/distribution issue,
> > but in the mean time, what problems could the size and min_size
> > changes create (aside from the obvious issue of fewer replicas)?
> 
> I'd address all those issues (setting the correct weight for your OSDs).
> Because it is something you will need to do anyway down the road.
> Alternatively add more nodes and OSDs.

I don't think it's a weighting issue.  My weights seem sane (e.g., they are 
scaled according to drive size).  I think it's more an artifact arising from a 
combination of factors:
- A relatively small number of nodes
- Some of the nodes having additional OSDs
- Those additional OSDs being 500GB drives compared to the other OSDs being 1TB 
and 3TB drives
- Having to use older CRUSH tuneables
- The cluster being around 72% full with that pool set to size 3

Running ' ceph osd reweight-by-utilization' clears the issue up temporarily, 
but additional data inevitably causes certain OSDs to be overloaded again.
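
For reference, the knobs involved look roughly like this (a sketch; the
OSD id and threshold are just examples):

  ceph osd tree                            # CRUSH weights per OSD
  ceph osd reweight-by-utilization 110     # reweight OSDs more than 10% above the average utilization
  ceph osd reweight 7 0.85                 # or manually lower the override weight of one overloaded OSD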

> While setting the replica down to 2 will "solve" your problem, it will also
> create another one besides the reduced redundancy:
> It will reshuffle all your data, slowing down your cluster (to the point of
> becoming unresponsive if it isn't designed and configured well).
> 
> Murphy might take those massive disk reads and writes as a clue to provide
> you with a double disk failure as well. ^o^

I actually already did the size 2 change on that pool before I sent my original 
email.  It was the only way I would get the data moved.  It didn't result in 
any data movement, just deletion.  When I get new drives I'll turn that knob 
back up.

Thanks for your input, by the way.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Sage Weil
On Mon, 28 Jul 2014, Joao Eduardo Luis wrote:
> On 07/28/2014 02:07 PM, Robert Fantini wrote:
> > Is the '15 minutes or so '  something that can be configured at run time?
> 
> Someone who knows this better than I do should probably chime in, but from a
> quick look throughout the code it seems to be 'client_mount_interval', which
> by default is 300 seconds (5 minutes) instead of 15 minutes.
> 
> As with all (or most?) options, this can be adjust at run time via injectargs
> (via 'ceph tell') or 'config set' (via the admin socket).
> 
> Please bear in mind that just because you can adjust it doesn't mean that you
> should.  Keeping live connections alive should not be a problem, but given I
> haven't given much thought to it there's a chance that I'm missing something.

I think connected clients will continue to function much longer 
than client_mount_interval... it should be as long as 
auth_service_ticket_ttl (default is 1h), or somewhere between 1x and 2x 
that interval, when cephx is in use.  The real limitation is that if the 
mons lose quorum you can't have new clients authenticate, and there won't 
be any cluster state changes (e.g., an OSD can't go down or come up).  A 
few other random operations will also fail (snap creation, 'df', etc.).
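
(Both values can be read back from a running daemon's admin socket. A
sketch, assuming a monitor named mon.a and the default socket path:

  ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config get auth_service_ticket_ttl
  ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config get client_mount_interval
)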

sage


 > 
>   -Joao
> 
> > 
> > 
> > On Mon, Jul 28, 2014 at 8:44 AM, Joao Eduardo Luis
> > <joao.l...@inktank.com> wrote:
> > 
> > (CC'ing ceph-users)
> > 
> > On 07/28/2014 12:34 PM, Marc wrote:
> > 
> > Hi,
> > 
> > 
> > This said, if out of 3 monitors you have 2 monitors down,
> > your cluster
> > will cease functioning (no admin commands, no writes or
> > reads served).
> > 
> > 
> > This is not entirely true. (At least) RBDs will continue being fully
> > functional even if the mon quorum is lost. This only applies to RBDs
> > that are already mounted (qemu) at the time of quorum loss though.
> > 
> > Meaning: (K)VMs running off of Ceph will remain fully functional
> > even if
> > the mon quorum is lost (assuming you havent lost too many OSDs
> > at the
> > same time).
> > 
> > 
> > True.  Clients will maintain the connections they have to OSDs for
> > about 15 minutes or so, at which point timeouts will go off and all
> > work will be halted.  New clients won't be able to do this though,
> > as they have to grab maps from the monitors prior to connecting to
> > OSDs, and the monitor will not serve those requests if quorum is not
> > in place.
> > 
> >-Joao
> > 
> > 
> > 
> > On 28/07/2014 12:22, Joao Eduardo Luis wrote:
> > 
> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > 
> > 
> > Hello,
> > 
> > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > 
> > Hello Christian,
> > 
> > Let me supply more info and answer some questions.
> > 
> > * Our main concern is high availability, not speed.
> > Our storage requirements are not huge.
> > However we want good keyboard response 99.99% of the
> > time.   We
> > mostly do
> > data entry and reporting.   20-25  users doing
> > mostly order , invoice
> > processing and email.
> > 
> > * DRBD has been very reliable , but I am the SPOF .
> >Meaning that when
> > split brain occurs [ every 18-24 months ] it is me
> > or no one who knows
> > what to do. Try to explain how to deal with split
> > brain in advance
> > For the future ceph looks like it will be easier to
> > maintain.
> > 
> > The DRBD people would of course tell you to configure
> > things in a way
> > that
> > a split brain can't happen. ^o^
> > 
> > Note that given the right circumstances (too many OSDs
> > down, MONs down)
> > Ceph can wind up in a similar state.
> > 
> > 
> > 
> > I am not sure what you mean by ceph winding up in a similar
> > state.  If
> > you mean regarding 'split brain' in the usual sense of the
> > term, it does
> > not occur in Ceph.  If it does, you have surely found a bug
> > and you
> > should let us know with lots of CAPS.
> > 
> > What you can incur though if you have too many monitors down
> > is cluster
> > downtime.  The monitors will ensure you need a strict
> > majority of
> > monitors up in order to operate the cluster, and will not
> > serve requests
> >  

[ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-07-28 Thread Brian Lovett

I'm installing the latest firefly on a fresh centos 7 machine using the rhel 
7 yum repo. I'm getting a few dependency issues when using ceph-deploy 
install. Mostly it looks like it doesn't like python 2.7.

[monitor01][DEBUG ] --> Processing Dependency: libboost_system-mt.so.5()
(64bit) for package: librbd1-0.80.4-0.el6.x86_64
[monitor01][DEBUG ] ---> Package mesa-libgbm.x86_64 0:9.2.5-5.20131218.el7 
will be installed
[monitor01][DEBUG ] ---> Package mesa-libglapi.x86_64 0:9.2.5-5.20131218.el7 
will be installed
[monitor01][DEBUG ] ---> Package python-ceph.x86_64 0:0.80.4-0.el6 will be 
installed
[monitor01][DEBUG ] --> Processing Dependency: python(abi) = 2.6 for 
package: python-ceph-0.80.4-0.el6.x86_64
[monitor01][DEBUG ] --> Finished Dependency Resolution
[monitor01][DEBUG ]  You could try using --skip-broken to work around the 
problem
[monitor01][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
[monitor01][WARNIN] Error: Package: ceph-common-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_thread-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: python-ceph-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: python(abi) = 2.6
[monitor01][WARNIN]Installed: python-2.7.5-16.el7.x86_64 
(@anaconda)
[monitor01][WARNIN]python(abi) = 2.7
[monitor01][WARNIN]python(abi) = 2.7
[monitor01][WARNIN] Error: Package: librados2-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_system-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: libcephfs1-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_system-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: ceph-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_system-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: librbd1-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_system-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: librbd1-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_thread-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: ceph-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_thread-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: librados2-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_thread-mt.so.5()(64bit)
[monitor01][WARNIN] Error: Package: libcephfs1-0.80.4-0.el6.x86_64 (Ceph)
[monitor01][WARNIN]Requires: libboost_thread-mt.so.5()(64bit)
[monitor01][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y 
install ceph
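
For what it's worth, the resolver above is pulling el6 builds (note the
.el6.x86_64 suffixes), which expect python(abi) 2.6 and the older boost
.so.5 libraries that CentOS 7 no longer ships. One workaround is to point
yum at an el7 build of firefly explicitly rather than letting ceph-deploy
pick the repo. A sketch of such a repo file; the baseurl is an assumption
and needs to be checked against the path ceph.com actually publishes for
el7/rhel7:

  # /etc/yum.repos.d/ceph.repo
  [ceph]
  name=Ceph firefly packages
  baseurl=http://ceph.com/rpm-firefly/el7/x86_64/   # assumed path, verify before use
  enabled=1
  gpgcheck=0   # enable and add the Ceph release key in production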


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recommendation to safely avoid problems with osd-failure

2014-07-28 Thread Christian Balzer

See the anti-cephalopod thread on this ML.

On Mon, 28 Jul 2014 12:23:02 +0200 Josef Johansson wrote:

> Hi,
> 
> I'm trying to compile a strategy to avoid performance problems if osds
> or osd hosts fails.
> 
> If I encounter a re-balance of one OSD during mid-day, there'll be
> problems with performance right now, if I could see the issue and let it
> re-balance during evening, that would be great.
> 
> I.e. if two OSD hosts dies around the same time I suspect that the
> clients would suffer greatly.
> 
> Currently the osd has the following settings
> 
>  osd max backfills = 1
>  osd recovery max active = 1
> 
> Is there any general guidance or recommendation for unexpected outages?
> 
> Cheers,
> Josef Johansson
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool size 2 min_size 1 Advisability?

2014-07-28 Thread Christian Balzer
On Mon, 28 Jul 2014 14:24:02 + Edward Huyer wrote:

> > > Ceph has a default pool size of 3. Is it a bad idea to run a pool of
> > > size 2? What about size 2 min_size 1?
> > >
> > min_size 1 is sensible, 2 obviously won't protect you against dual
> > disk failures. Which happen and happen with near certainty once your
> > cluster gets big enough.
> 
> I though I saw somewhere in the docs that there could be issues with
> min_size 1, but I can't seem to find it now.
> 
> > > I have a cluster I'm moving data into (on RBDs) that is full enough
> > > with size 3 that I'm bumping into nearfull warnings. Part of that is
> > > because of the amount of data, part is probably because of suboptimal
> > > tuning (Proxmox VE doesn't support all the tuning options), and part
> > > is probably because of unbalanced drive distribution and multiple
> > > drive sizes.
> > >
> > > I'm hoping I'll be able to solve the drive size/distribution issue,
> > > but in the mean time, what problems could the size and min_size
> > > changes create (aside from the obvious issue of fewer replicas)?
> > 
> > I'd address all those issues (setting the correct weight for your
> > OSDs). Because it is something you will need to do anyway down the
> > road. Alternatively add more nodes and OSDs.
> 
> I don't think it's a weighting issue.  My weights seem sane (e.g., they
> are scaled according to drive size).  I think it's more an artifact
> arising from a combination of factors:
> - A relatively small number of nodes
> - Some of the nodes having additional OSDs
> - Those additional OSDs being 500GB drives compared to the other OSDs
> being 1TB and 3TB drives
> - Having to use older CRUSH tuneables
> - The cluster being around 72% full with that pool set to size 3
> 
> Running ' ceph osd reweight-by-utilization' clears the issue up
> temporarily, but additional data inevitably causes certain OSDs to be
> overloaded again.
> 
The only time I've ever seen this kind of uneven distribution is when
using too few PGs/PG_NUMs (and using the default formula with few OSDs
might still give you too few).

Did you look into that?
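
(For reference, the usual starting point is roughly (number of OSDs * 100)
divided by the replica count, rounded up to the next power of two, and
pg_num can only ever be increased. A sketch, with the pool name and count
as examples only:

  ceph osd pool get rbd pg_num
  ceph osd pool set rbd pg_num 512
  ceph osd pool set rbd pgp_num 512   # must follow pg_num or data won't actually rebalance
)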


> > While setting the replica down to 2 will "solve" your problem, it will
> > also create another one besides the reduced redundancy:
> > It will reshuffle all your data, slowing down your cluster (to the
> > point of becoming unresponsive if it isn't designed and configured
> > well).
> > 
> > Murphy might take those massive disk reads and writes as a clue to
> > provide you with a double disk failure as well. ^o^
> 
> I actually already did the size 2 change on that pool before I sent my
> original email.  It was the only way I would get the data moved.  It
> didn't result in any data movement, just deletion.  When I get new
> drives I'll turn that knob back up.
> 
Ahahaha, there you go. 
I actually changed my test cluster from 2 to 3 and was going to change it
back when the data dance stopped, but you did beat me to it.

This is quite (pleasantly) surprising, as fiddling with any CRUSH knob
usually makes CEPH go into data shuffling overdrive.

> Thanks for your input, by the way.
> 
You're quite welcome, glad to hear it worked out that way.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-07-28 Thread Simon Ironside

Hi Brian,

I have a fresh install working on RHEL 7 running the same version of 
python as you. I did have trouble installing from the ceph.com yum repos 
though and worked around it by creating and installing from my own local 
yum repos instead.


I then skip the ceph-deploy install step, as I've already done this bit 
on each of my ceph nodes. This also stops ceph-deploy from overwriting 
my own repo definitions.
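
A minimal sketch of that kind of local repo (the paths are examples, not
the exact setup described here):

  # on a host that can reach ceph.com: download the rpms, then
  mkdir -p /srv/repos/ceph
  cp *.rpm /srv/repos/ceph/
  createrepo /srv/repos/ceph            # from the createrepo package

  # /etc/yum.repos.d/ceph-local.repo on each node
  [ceph-local]
  name=Local Ceph packages
  baseurl=file:///srv/repos/ceph
  enabled=1
  gpgcheck=0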


HTH,
Simon.

On 28/07/14 16:46, Brian Lovett wrote:


I'm installing the latest firefly on a fresh centos 7 machine using the rhel
7 yum repo. I'm getting a few dependency issues when using ceph-deploy
install. Mostly it looks like it doesn't like python 2.7.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-28 Thread Steve Anthony
While searching for more information I happened across the following
post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
I've been experiencing. I ran tcpdump and noticed what appeared to be a
high number of retransmissions on the host where the images are mounted
during a read from a Ceph rbd, so I ran iperf3 to get some concrete numbers:

Server: nas4 (where rbd images are mapped)
Client: ceph2 (currently not in the cluster, but configured identically to the 
other nodes)

Start server on nas4:
iperf3 -s

On ceph2, connect to server nas4, send 4096MB of data, report on 1 second 
intervals. Add -R to reverse the client/server roles.
iperf3 -c nas4 -n 4096M -i 1

Summary of traffic going out the 1Gb interface to a switch

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-36.53  sec   4.00 GBytes  941 Mbits/sec   15     sender
[  5]   0.00-36.53  sec   4.00 GBytes  940 Mbits/sec          receiver

Reversed, summary of traffic going over the fabric extender

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-80.84  sec   4.00 GBytes  425 Mbits/sec   30756  sender
[  5]   0.00-80.84  sec   4.00 GBytes  425 Mbits/sec          receiver


It appears that the issue is related to the network topology employed.
The private cluster network and nas4's public interface are both
connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
Nexus 7000. This was meant as a temporary solution until our network
team could finalize their design and bring up the Nexus 6001 for the
cluster. From what our network guys have said, the FEX has been much
more limited than they anticipated and they haven't been pleased with it
as a solution in general. The 6001 is supposed to be ready this week, so
once it's online I'll move the cluster to that switch and re-test to see
if this fixes the issues I've been experiencing.

-Steve

On 07/24/2014 05:59 PM, Steve Anthony wrote:
> Thanks for the information!
>
> Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I
> was under the impression that rbd cache options wouldn't apply, since
> presumably the kernel is handling the caching. I'll have to toggle some
> of those values and see if they make a difference in my setup.
>
> I did some additional testing today. If I limit the write benchmark to 1
> concurrent operation I see a lower bandwidth number, as I expected.
> However, when writing to the XFS filesystem on an rbd image I see
> transfer rates closer to 400MB/s.
>
> # rados -p bench bench 300 write --no-cleanup -t 1
>
> Total time run: 300.105945
> Total writes made:  1992
> Write size: 4194304
> Bandwidth (MB/sec): 26.551
>
> Stddev Bandwidth:   5.69114
> Max bandwidth (MB/sec): 40
> Min bandwidth (MB/sec): 0
> Average Latency:0.15065
> Stddev Latency: 0.0732024
> Max latency:0.617945
> Min latency:0.097339
>
> # time cp -a /mnt/local/climate /mnt/ceph_test1
>
> real2m11.083s
> user0m0.440s
> sys1m11.632s
>
> # du -h --max-depth=1 /mnt/local
> 53G/mnt/local/climate
>
> This seems to imply that there is more than one concurrent operation
> when writing into the filesystem on top of the rbd image. However, given
> that the filesystem read speeds and the rados benchmark read speeds are
> much closer in reported bandwidth, it's as if reads are occurring as a
> single operation.
>
> # time cp -a /mnt/ceph_test2/isos /mnt/local/
>
> real36m2.129s
> user0m1.572s
> sys3m23.404s
>
> # du -h --max-depth=1 /mnt/ceph_test2/
> 68G/mnt/ceph_test2/isos
>
> Is this apparent single-thread read and multi-thread write with the rbd
> kernel module the expected mode of operation? If so, could someone
> explain the reason for this limitation?
>
> Based on the information on data striping in
> http://ceph.com/docs/next/architecture/#data-striping I would assume
> that a format 1 image would stripe a file larger than the 4MB object
> size over multiple objects and that those objects would be distributed
> over multiple OSDs. This would seem to indicate that reading a file back
> would be much faster since even though Ceph is only reading the primary
> replica, the read is still distributed over multiple OSDs. At worst I
> would expect something near the read bandwidth of a single OSD, which
> would still be much higher than 30-40MB/s.
>
> -Steve
>
> On 07/24/2014 04:07 PM, Udo Lembke wrote:
>   
>> Hi Steve,
>> I'm also looking for improvements of single-thread-reads.
>>
>> A little bit higher values (twice?) should be possible with your config.
>> I have 5 nodes with 60 4-TB hdds and got following:
>> rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
>> Total time run:60.066934
>> Total reads made: 863
>> Read size:4194304
>> Bandwidth (MB/sec):57.469
>> Average Latency:   0.0695964
>> Max latency:   0.434677
>> Min latency:   0

Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-07-28 Thread Brian Lovett
Simon Ironside  writes:

> 
> Hi Brian,
> 
> I have a fresh install working on RHEL 7 running the same version of 
> python as you. I did have trouble installing from the ceph.com yum repos 
> though and worked around it by creating and installing from my own local 
> yum repos instead.
> 
> I then skip the ceph-deploy install step, as I've already done this bit 
> on each of my ceph nodes. This also stops ceph-deploy from overwriting 
> my own repo definitions.
> 
> HTH,
> Simon.
> 
> On 28/07/14 16:46, Brian Lovett wrote:
> >
> > I'm installing the latest firefly on a fresh centos 7 machine using the rhel
> > 7 yum repo. I'm getting a few dependency issues when using ceph-deploy
> > install. Mostly it looks like it doesn't like python 2.7.
> 
Thank you Simon, I was hoping to avoid anything custom. That's why we moved 
away from centos 6.5. The kernel was too old to support rbd out of the box, so 
rather than use a custom kernel, I thought we would give centos 7 a try. Looks 
like another bag of headaches.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph metrics

2014-07-28 Thread Sage Weil
On Mon, 28 Jul 2014, James Eckersall wrote:
> Hi,
> I'm trying to understand what a lot of the values mean that are reported by
> "perf dump" on the ceph admin socket.  I have a collectd plugin which sends
> all of these values to graphite.
> 
> Does anyone have a cross-reference list that explains what they are in more
> detail?  You can only glean so much from the names, but I'm struggling to
> determine what a lot of them are.
> 
> One in particular, I'm trying to graph the client ops across the cluster
> (similar to what ceph -w reports).  Does anyone know which values are used
> to generate this figure?

op (which is op_r + op_w).  There's also op_bytes for throughput.

There isn't a nice document describing them all, unfortunately.  I think 
we want to do that properly when we do it at all, though.  There was a 
discussion during the giant CDS about this, FWIW, but nobody has picked it 
up yet.
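
(A quick way to eyeball those counters on one OSD; the socket path assumes
the defaults, and the python one-liner is just one way of slicing the JSON.
The counters are cumulative, so rates are the delta between two samples:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \
    python -c 'import json,sys; d=json.load(sys.stdin)["osd"]; print d["op"], d["op_r"], d["op_w"]'
)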

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph metrics

2014-07-28 Thread James Eckersall
Hi,

I'm trying to understand what a lot of the values mean that are reported by
"perf dump" on the ceph admin socket.  I have a collectd plugin which sends
all of these values to graphite.

Does anyone have a cross-reference list that explains what they are in more
detail?  You can only glean so much from the names, but I'm struggling to
determine what a lot of them are.

One in particular, I'm trying to graph the client ops across the cluster
(similar to what ceph -w reports).  Does anyone know which values are used
to generate this figure?

Thanks

J
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Christian Balzer
On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:

> On 07/28/2014 08:49 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> >
> >> Hello Christian,
> >>
> >> Let me supply more info and answer some questions.
> >>
> >> * Our main concern is high availability, not speed.
> >> Our storage requirements are not huge.
> >> However we want good keyboard response 99.99% of the time.   We
> >> mostly do data entry and reporting.   20-25  users doing mostly
> >> order , invoice processing and email.
> >>
> >> * DRBD has been very reliable , but I am the SPOF .   Meaning that
> >> when split brain occurs [ every 18-24 months ] it is me or no one who
> >> knows what to do. Try to explain how to deal with split brain in
> >> advance For the future ceph looks like it will be easier to
> >> maintain.
> >>
> > The DRBD people would of course tell you to configure things in a way
> > that a split brain can't happen. ^o^
> >
> > Note that given the right circumstances (too many OSDs down, MONs down)
> > Ceph can wind up in a similar state.
> 
> 
> I am not sure what you mean by ceph winding up in a similar state.  If 
> you mean regarding 'split brain' in the usual sense of the term, it does 
> not occur in Ceph.  If it does, you have surely found a bug and you 
> should let us know with lots of CAPS.
> 
> What you can incur though if you have too many monitors down is cluster 
> downtime.  The monitors will ensure you need a strict majority of 
> monitors up in order to operate the cluster, and will not serve requests 
> if said majority is not in place.  The monitors will only serve requests 
> when there's a formed 'quorum', and a quorum is only formed by (N/2)+1 
> monitors, N being the total number of monitors in the cluster (via the 
> monitor map -- monmap).
> 
> This said, if out of 3 monitors you have 2 monitors down, your cluster 
> will cease functioning (no admin commands, no writes or reads served). 
> As there is no configuration in which you can have two strict 
> majorities, thus no two partitions of the cluster are able to function 
> at the same time, you do not incur in split brain.
> 
I wrote similar state, not "same state".

From a user perspective it is purely semantics how and why your shared
storage has seized up, the end result is the same.

And yes, that MON example was exactly what I was aiming for, your cluster
might still have all the data (another potential failure mode of course),
but is inaccessible. 

DRBD will see and call it a split brain, Ceph will call it a Paxos voting
failure, it doesn't matter one iota to the poor sod relying on that
particular storage.

My point was and is, when you design a cluster of whatever flavor, make
sure you understand how it can (and WILL) fail, how to prevent that from
happening if at all possible and how to recover from it if not.

Potentially (hopefully) in the case of Ceph it would be just to get a
missing MON back up.
But given that the failed MON might have a corrupted leveldb (it happened
to me), that would put Robert back at square one, as in, a highly qualified
engineer has to deal with the issue. 
I.e somebody who can say "screw this dead MON, lets get a new one in" and
is capable of doing so.

Regards,

Christian

> If you are a creative admin however, you may be able to enforce split 
> brain by modifying monmaps.  In the end you'd obviously end up with two 
> distinct monitor clusters, but if you so happened to not inform the 
> clients about this there's a fair chance that it would cause havoc with 
> unforeseen effects.  Then again, this would be the operator's fault, not 
> Ceph itself -- especially because rewriting monitor maps is not trivial 
> enough for someone to mistakenly do something like this.
> 
>-Joao
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deployment scenario with 2 hosts

2014-07-28 Thread Don Pinkster
Hi,

Currently I am evaluating multiple distributed storage solutions with an
S3-like interface.
We have two huge machines with big amounts of storage. Is it possible to
let these two behave exactly the same with Ceph? My idea is running both
MON and OSD on these two machines.

With quick tests the cluster is degraded after a reboot of 1 host and is
not able to recover from the reboot.

Thanks in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw monitoring

2014-07-28 Thread Craig Lewis
(Sorry for the duplicate email, I forgot to CC the list)

Assuming you're using the default setup (RadosGW, FastCGI, and Apache),
it's the same as monitoring a web site.  On every node, verify that a request
for / returns a 200.  If the RadosGW agent is down, or FastCGI is
mis-configured, the request will return a 500 error.  If Apache is down,
you won't be able to connect.

I'm also monitoring my load balancer (HAProxy).  I added alerts if HAProxy
marks a node offline.


That's the basics, but you can get more complicated if you want.  You could
add a heartbeat file, and verify it's being updated.  You can monitor the
performance stats returned by /usr/bin/ceph --admin-daemon
/var/run/ceph/radosgw.asok --format=json perf dump.

I'm not doing a heartbeat, but I am monitoring performance.  If the latency
per operation gets too high, I alert on that too.  It's really noisy during
recovery, but useful when the cluster is healthy.
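
A minimal version of that check, with the gateway host name as a
placeholder:

  # expect an HTTP 200 from each gateway; anything else raises an alert
  curl -s -o /dev/null -w '%{http_code}\n' http://rgw-host/

  # and pull the perf counters for trending/latency alerts
  ceph --admin-daemon /var/run/ceph/radosgw.asok --format=json perf dump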


On Sat, Jul 26, 2014 at 2:58 AM, pragya jain  wrote:

> Thanks zhu qiang for your response
>
> that means there are only the logs with the help of which we can monitor
> radosgw instances for coming user request traffic for uploading and
> downloading the stored data and also for monitoring other features of
> radosgw
> no external monitoring tool, such as calamari, nagios collectd, zabbix
> etc., provide the functionality to monitor radosgw instances.
>
> Am I right?
>
> Thanks again
> Pragya Jain
>
>
>   On Friday, 25 July 2014 8:12 PM, zhu qiang 
> wrote:
>
>
>
> Hi,
>May be you can try the ways below:
>   1. Set “debug rgw = 2” ,then view the radosgw daemon’s log, also can use
> ‘sed,grep,awk’,get  the infos you want.
>   2. timely rum “ceph daemon client.radosgw.X perf dump” command to get
> the statics message of radosgw daemon.
>
> This is all I know, may this will be usefull for you.
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *pragya jain
> *Sent:* Friday, July 25, 2014 6:39 PM
> *To:* ceph-users@lists.ceph.com
> *Subject:* [ceph-users] radosgw monitoring
>
> Hi all,
>
> Please suggest me some open source monitoring tools which can monitor
> radosgw instances for coming user request traffic for uploading and
> downloading the stored data and also for monitoring other features of
> radosgw
>
> Regards
> Pragya Jain
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deployment scenario with 2 hosts

2014-07-28 Thread Craig Lewis
That's expected.  You need > 50% of the monitors up.  If you only have 2
machines, rebooting one means that 50% are up, so the cluster halts
operations.  That's done on purpose to avoid problems when the cluster is
divided in exactly half, and both halves continue to run thinking the other
half is down.  Monitors don't need a lot of resources.  I'd recommend that
you add a small box as a third monitor.  A VM is fine, as long as it has
enough IOPS to its disks.

It's best to have 3 storage nodes.  A new, out of the box install tries to
store data on at least 3 separate hosts.  You can lower the replication
level to 2, or change the rules so that it will store data on 3 separate
disks.  It might store all 3 copies on the same host though, so lowering
the replication level to 2 is probably better.

I think it's possible to require data stored on 3 disks, with 2 of the
disks coming from different nodes.  Editing the CRUSH rules is a bit
advanced: http://ceph.com/docs/master/rados/operations/crush-map/
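
(Lowering the replication level is done per pool, e.g., assuming a pool
named "data"; the pool names are whatever exists on the cluster:

  ceph osd pool set data size 2
  ceph osd pool set data min_size 1
  ceph osd pool get data size          # confirm the change
)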




On Mon, Jul 28, 2014 at 9:59 AM, Don Pinkster  wrote:

> Hi,
>
> Currently I am evalutating multiple distributed storage solutions with an
> S3-like interface.
> We have two huge machines with big amounts of storage. Is it possible to
> let these two behave exactly the same with Ceph? My idea is runninng both
> MON and OSD on these two machines.
>
> With quick tests the cluster is degrated after a reboot of 1 host and is
> not able to recover from the reboot.
>
> Thanks in advance!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow read speeds from kernel rbd (Firefly 0.80.4)

2014-07-28 Thread Mark Nelson

On 07/28/2014 11:28 AM, Steve Anthony wrote:

While searching for more information I happened across the following
post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
I've been experiencing. I ran tcpdump and noticed what appeared to be a
high number of retransmissions on the host where the images are mounted
during a read from a Ceph rbd, so I ran iperf3 to get some concrete numbers:


Very interesting that you are seeing retransmissions.



Server: nas4 (where rbd images are mapped)
Client: ceph2 (currently not in the cluster, but configured identically to the 
other nodes)

Start server on nas4:
iperf3 -s

On ceph2, connect to server nas4, send 4096MB of data, report on 1 second 
intervals. Add -R to reverse the client/server roles.
iperf3 -c nas4 -n 4096M -i 1

Summary of traffic going out the 1Gb interface to a switch

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-36.53  sec   4.00 GBytes  941 Mbits/sec   15     sender
[  5]   0.00-36.53  sec   4.00 GBytes  940 Mbits/sec          receiver

Reversed, summary of traffic going over the fabric extender

[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-80.84  sec   4.00 GBytes  425 Mbits/sec   30756  sender
[  5]   0.00-80.84  sec   4.00 GBytes  425 Mbits/sec          receiver


Definitely looks suspect!




It appears that the issue is related to the network topology employed.
The private cluster network and nas4's public interface are both
connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
Nexus 7000. This was meant as a temporary solution until our network
team could finalize their design and bring up the Nexus 6001 for the
cluster. From what our network guys have said, the FEX has been much
more limited than they anticipated and they haven't been pleased with it
as a solution in general. The 6001 is supposed be ready this week, so
once it's online I'll move the cluster to that switch and re-test to see
if this fixes the issues I've been experiencing.


If it's not the hardware, one other thing you might want to test is to 
make sure it's not something similar to the autotuning issues we used to 
see.  I don't think this should be an issue at this point given the code 
changes we made to address it, but it would be easy to test.  Doesn't 
seem like it should be happening with simple iperf tests though so the 
hardware is maybe the better theory.


http://www.spinics.net/lists/ceph-devel/msg05049.html



-Steve

On 07/24/2014 05:59 PM, Steve Anthony wrote:

Thanks for the information!

Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I
was under the impression that rbd cache options wouldn't apply, since
presumably the kernel is handling the caching. I'll have to toggle some
of those values and see it they make a difference in my setup.

I did some additional testing today. If I limit the write benchmark to 1
concurrent operation I see a lower bandwidth number, as I expected.
However, when writing to the XFS filesystem on an rbd image I see
transfer rates closer to to 400MB/s.

# rados -p bench bench 300 write --no-cleanup -t 1

Total time run: 300.105945
Total writes made:  1992
Write size: 4194304
Bandwidth (MB/sec): 26.551

Stddev Bandwidth:   5.69114
Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency:0.15065
Stddev Latency: 0.0732024
Max latency:0.617945
Min latency:0.097339

# time cp -a /mnt/local/climate /mnt/ceph_test1

real2m11.083s
user0m0.440s
sys1m11.632s

# du -h --max-deph=1 /mnt/local
53G/mnt/local/climate

This seems to imply that the there is more than one concurrent operation
when writing into the filesystem on top of the rbd image. However, given
that the filesystem read speeds and the rados benchmark read speeds are
much closer in reported bandwidth, it's as if reads are occurring as a
single operation.

# time cp -a /mnt/ceph_test2/isos /mnt/local/

real36m2.129s
user0m1.572s
sys3m23.404s

# du -h --max-deph=1 /mnt/ceph_test2/
68G/mnt/ceph_test2/isos

Is this apparent single-thread read and multi-thread write with the rbd
kernel module the expected mode of operation? If so, could someone
explain the reason for this limitation?

Based on the information on data striping in
http://ceph.com/docs/next/architecture/#data-striping I would assume
that a format 1 image would stripe a file larger than the 4MB object
size over multiple objects and that those objects would be distributed
over multiple OSDs. This would seem to indicate that reading a file back
would be much faster since even though Ceph is only reading the primary
replica, the read is still distributed over multiple OSDs. At worst I
would expect something near the read bandwidth of a single OSD, which
would still be much higher than 30-40MB/s.

-Steve

On 07/24/2014 04:07 PM, Udo Lembke wrote:


Hi Steve,
I'm also looking for improvements of single-thread-reads.


Re: [ceph-users] fs as btrfs and ceph journal

2014-07-28 Thread Gregory Farnum
It still helps; the journal does just as much work. Less of the work
*can* be in the critical path for IO, but for most of the applications
it will be.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sat, Jul 26, 2014 at 2:18 AM, Cristian Falcas wrote:
> Hello,
>
> I'm using btrfs for OSDs and want to know if it still helps to have the
> journal on a faster drive. From what I've read I'm under the impression that
> with btrfs journal, the OSD journal doesn't do much work anymore.
>
> Best regards,
> Cristian Falcas
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Robert Fantini
OK, for higher availability then, 5 nodes is better than 3.  So we'll run 5.
However we want normal operations with just 2 nodes.   Is that possible?

Eventually 2 nodes will be in the next building 10 feet away, with a brick wall
in between, connected with Infiniband or better. So one room can go off
line and the other will stay on.   The flip of the coin means the 3 node room
will probably be the one to go down.
 All systems will have dual power supplies connected to different UPSes.
In addition we have a power generator. Later we'll have a 2nd generator,
and then the UPSes will use different lines attached to those generators
somehow.
Also of course we never count on one cluster to have our data.  We have
2 co-locations with backups going there often using zfs send/receive and/or
rsync.

So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or is
that a bad idea?


PS:  any other ideas on how to increase availability are welcome.








On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer  wrote:

> On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>
> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> > >
> > >> Hello Christian,
> > >>
> > >> Let me supply more info and answer some questions.
> > >>
> > >> * Our main concern is high availability, not speed.
> > >> Our storage requirements are not huge.
> > >> However we want good keyboard response 99.99% of the time.   We
> > >> mostly do data entry and reporting.   20-25  users doing mostly
> > >> order , invoice processing and email.
> > >>
> > >> * DRBD has been very reliable , but I am the SPOF .   Meaning that
> > >> when split brain occurs [ every 18-24 months ] it is me or no one who
> > >> knows what to do. Try to explain how to deal with split brain in
> > >> advance For the future ceph looks like it will be easier to
> > >> maintain.
> > >>
> > > The DRBD people would of course tell you to configure things in a way
> > > that a split brain can't happen. ^o^
> > >
> > > Note that given the right circumstances (too many OSDs down, MONs down)
> > > Ceph can wind up in a similar state.
> >
> >
> > I am not sure what you mean by ceph winding up in a similar state.  If
> > you mean regarding 'split brain' in the usual sense of the term, it does
> > not occur in Ceph.  If it does, you have surely found a bug and you
> > should let us know with lots of CAPS.
> >
> > What you can incur though if you have too many monitors down is cluster
> > downtime.  The monitors will ensure you need a strict majority of
> > monitors up in order to operate the cluster, and will not serve requests
> > if said majority is not in place.  The monitors will only serve requests
> > when there's a formed 'quorum', and a quorum is only formed by (N/2)+1
> > monitors, N being the total number of monitors in the cluster (via the
> > monitor map -- monmap).
> >
> > This said, if out of 3 monitors you have 2 monitors down, your cluster
> > will cease functioning (no admin commands, no writes or reads served).
> > As there is no configuration in which you can have two strict
> > majorities, thus no two partitions of the cluster are able to function
> > at the same time, you do not incur in split brain.
> >
> I wrote similar state, not "same state".
>
> From a user perspective it is purely semantics how and why your shared
> storage has seized up, the end result is the same.
>
> And yes, that MON example was exactly what I was aiming for, your cluster
> might still have all the data (another potential failure mode of cause),
> but is inaccessible.
>
> DRBD will see and call it a split brain, Ceph will call it a Paxos voting
> failure, it doesn't matter one iota to the poor sod relying on that
> particular storage.
>
> My point was and is, when you design a cluster of whatever flavor, make
> sure you understand how it can (and WILL) fail, how to prevent that from
> happening if at all possible and how to recover from it if not.
>
> Potentially (hopefully) in the case of Ceph it would be just to get a
> missing MON back up.
> But given that the failed MON might have a corrupted leveldb (it happened
> to me) will put Robert back into square one, as in, a highly qualified
> engineer has to deal with the issue.
> I.e somebody who can say "screw this dead MON, lets get a new one in" and
> is capable of doing so.
>
> Regards,
>
> Christian
>
> > If you are a creative admin however, you may be able to enforce split
> > brain by modifying monmaps.  In the end you'd obviously end up with two
> > distinct monitor clusters, but if you so happened to not inform the
> > clients about this there's a fair chance that it would cause havoc with
> > unforeseen effects.  Then again, this would be the operator's fault, not
> > Ceph itself -- especially because rewriting monitor maps is not trivial
> > enough for someone to mistakenly do something like this.
> >
> >-Joao
> >
> >
>
>
> --
> Christian Balz

Re: [ceph-users] Pool size 2 min_size 1 Advisability?

2014-07-28 Thread Gregory Farnum
On Mon, Jul 28, 2014 at 12:14 PM, Christian Balzer  wrote:
> On Mon, 28 Jul 2014 14:24:02 + Edward Huyer wrote:
>
>> > > Ceph has a default pool size of 3. Is it a bad idea to run a pool of
>> > > size 2? What about size 2 min_size 1?
>> > >
>> > min_size 1 is sensible, 2 obviously won't protect you against dual
>> > disk failures. Which happen and happen with near certainty once your
>> > cluster gets big enough.
>>
>> I though I saw somewhere in the docs that there could be issues with
>> min_size 1, but I can't seem to find it now.
>>
>> > > I have a cluster I'm moving data into (on RBDs) that is full enough
>> > > with size 3 that I'm bumping into nearfull warnings. Part of that is
>> > > because of the amount of data, part is probably because of suboptimal
>> > > tuning (Proxmox VE doesn't support all the tuning options), and part
>> > > is probably because of unbalanced drive distribution and multiple
>> > > drive sizes.
>> > >
>> > > I'm hoping I'll be able to solve the drive size/distribution issue,
>> > > but in the mean time, what problems could the size and min_size
>> > > changes create (aside from the obvious issue of fewer replicas)?
>> >
>> > I'd address all those issues (setting the correct weight for your
>> > OSDs). Because it is something you will need to do anyway down the
>> > road. Alternatively add more nodes and OSDs.
>>
>> I don't think it's a weighting issue.  My weights seem sane (e.g., they
>> are scaled according to drive size).  I think it's more an artifact
>> arising from a combination of factors:
>> - A relatively small number of nodes
>> - Some of the nodes having additional OSDs
>> - Those additional OSDs being 500GB drives compared to the other OSDs
>> being 1TB and 3TB drives
>> - Having to use older CRUSH tuneables
>> - The cluster being around 72% full with that pool set to size 3
>>
>> Running ' ceph osd reweight-by-utilization' clears the issue up
>> temporarily, but additional data inevitably causes certain OSDs to be
>> overloaded again.
>>
> The only time I've ever seen this kind of uneven distribution is when
> using too little (and using the default formula with few OSDs might still
> be too little) PGs/PG_NUMs.
>
> Did you look into that?
>
>
>> > While setting the replica down to 2 will "solve" your problem, it will
>> > also create another one besides the reduced redundancy:
>> > It will reshuffle all your data, slowing down your cluster (to the
>> > point of becoming unresponsive if it isn't designed and configured
>> > well).
>> >
>> > Murphy might take those massive disk reads and writes as a clue to
>> > provide you with a double disk failure as well. ^o^
>>
>> I actually already did the size 2 change on that pool before I sent my
>> original email.  It was the only way I would get the data moved.  It
>> didn't result in any data movement, just deletion.  When I get new
>> drives I'll turn that knob back up.
>>
> Ahahaha, there you go.
> I actually changed my test cluster from 2 to 3 and was going to change it
> back when the data dance stopped, but you did beat me to it.
>
> This is quite (pleasantly) surprising, as fiddling with any CRUSH knob
> usually makes CEPH go into data shuffling overdrive.

Yep, this is deliberate — the sizing knobs aren't used as CRUSH
inputs; it just impacts how often the CRUSH calculation is run.
Scaling that value up or down adds or removes values to the end of the
set of OSDs hosting a PG, but doesn't change the order they appear in.
Things that do shuffle data:
1) changing weights (obviously)
2) changing internal CRUSH parameters (for most users, this means
changing the tunables)
3) changing how the map looks (i.e., adding OSDs)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fs as btrfs and ceph journal

2014-07-28 Thread Mark Nelson
Perhaps Cristian is thinking of the clone from journal work that we were 
talking about last year:


http://wiki.ceph.com/Planning/Sideboard/osd%3A_clone_from_journal_on_btrfs

I think we never did much beyond Sage's test branch, and it didn't seem 
to help as much as you would hope. Speaking of which, I believe this 
would open us up to horrible journal fragmentation, especially with rbd 
on btrfs.


Mark

On 07/28/2014 12:37 PM, Gregory Farnum wrote:

It still helps; the journal does just as much work. Less of the work
*can* be in the critical path for IO, but for most of the applications
it will be.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sat, Jul 26, 2014 at 2:18 AM, Cristian Falcas wrote:

Hello,

I'm using btrfs for OSDs and want to know if it still helps to have the
journal on a faster drive. From what I've read I'm under the impression that
with btrfs journal, the OSD journal doesn't do much work anymore.

Best regards,
Cristian Falcas


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool size 2 min_size 1 Advisability?

2014-07-28 Thread Edward Huyer
> >> Running ' ceph osd reweight-by-utilization' clears the issue up
> >> temporarily, but additional data inevitably causes certain OSDs to be
> >> overloaded again.
> >>
> > The only time I've ever seen this kind of uneven distribution is when
> > using too little (and using the default formula with few OSDs might
> > still be too little) PGs/PG_NUMs.
> >
> > Did you look into that?

A bit, yeah.  It was one of the first things I tried.  It didn't seem to have 
much, if any, effect.  I did see a reference in an older list discussion about 
wide variations in OSD sizes causing unbalanced usage, so that's my current 
operating theory.

> Yep, this is deliberate — the sizing knobs aren't used as CRUSH inputs; it 
> just
> impacts how often the CRUSH calculation is run.
> Scaling that value up or down adds or removes values to the end of the set of
> OSDs hosting a PG, but doesn't change the order they appear in.
> Things that do shuffle data:
> 1) changing weights (obviously)
> 2) changing internal CRUSH parameters (for most users, this means changing
> the tunables)
> 3) changing how the map looks (i.e., adding OSDs)

Makes sense.  Good to know.  Thanks.
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Desktop Ceph Cluster up for grabs!

2014-07-28 Thread Patrick McGarry
Hey cephers,

Less than four days left to tweet a photo of "how you're celebrating
Ceph's 10th birthday" (be bold, be creative, be awesome) to @Ceph.
Include the hashtag #cephturns10 and the best photo will win a desktop
Ceph cluster built by our own Mark Nelson.

https://wiki.ceph.com/Community/Contests/Ceph_Turns_10_Twitter_Photo_Contest

You could be the envy of your coworkers and start down the path of
ending up on a digital hoarders reality TV show, all because you
tweeted a random picture of a cupcake.  Don't wait! Spam us today!



Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Michael
If you've two rooms then I'd go for two OSD nodes in each room, a target 
replication level of 3 with a min of 1 across the node level, then have 
5 monitors and put the last monitor outside of either room (The other 
MON's can share with the OSD nodes if needed). Then you've got 'safe' 
replication for OSD/node replacement on failure with some 'shuffle' room 
for when it's needed and either room can be down while the external last 
monitor allows the decisions required to allow a single room to operate.


There's no way you can do a 3/2 MON split that doesn't risk the two 
nodes being up and unable to serve data while the three are down so 
you'd need to find a way to make it a 2/2/1 split instead.


-Michael

On 28/07/2014 18:41, Robert Fantini wrote:
OK for higher availability then  5 nodes is better then 3 .  So we'll 
run 5 .  However we want normal operations with just 2 nodes.   Is 
that possible?


Eventually 2 nodes will be next building 10 feet away , with a brick 
wall in between.  Connected with Infiniband or better. So one room can 
go off line the other will be on.   The flip of the coin means the 3 
node room will probably go down.
 All systems will have dual power supplies connected to different 
UPS'.   In addition we have a power generator. Later we'll have a 2-nd 
generator. and then  the UPS's will use different lines attached to 
those generators somehow..
Also of course we never count on one  cluster  to have our data.  We 
have 2  co-locations with backup going to often using zfs send receive 
and or rsync .


So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or 
is that a bad idea?



PS:  any other idea on how to increase availability are welcome .








On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer wrote:


On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:

> On 07/28/2014 08:49 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> >
> >> Hello Christian,
> >>
> >> Let me supply more info and answer some questions.
> >>
> >> * Our main concern is high availability, not speed.
> >> Our storage requirements are not huge.
> >> However we want good keyboard response 99.99% of the time.   We
> >> mostly do data entry and reporting. 20-25  users doing mostly
> >> order , invoice processing and email.
> >>
> >> * DRBD has been very reliable , but I am the SPOF .   Meaning
that
> >> when split brain occurs [ every 18-24 months ] it is me or no
one who
> >> knows what to do. Try to explain how to deal with split brain in
> >> advance For the future ceph looks like it will be easier to
> >> maintain.
> >>
> > The DRBD people would of course tell you to configure things
in a way
> > that a split brain can't happen. ^o^
> >
> > Note that given the right circumstances (too many OSDs down,
MONs down)
> > Ceph can wind up in a similar state.
>
>
> I am not sure what you mean by ceph winding up in a similar
state.  If
> you mean regarding 'split brain' in the usual sense of the term,
it does
> not occur in Ceph.  If it does, you have surely found a bug and you
> should let us know with lots of CAPS.
>
> What you can incur though if you have too many monitors down is
cluster
> downtime.  The monitors will ensure you need a strict majority of
> monitors up in order to operate the cluster, and will not serve
requests
> if said majority is not in place.  The monitors will only serve
requests
> when there's a formed 'quorum', and a quorum is only formed by
(N/2)+1
> monitors, N being the total number of monitors in the cluster
(via the
> monitor map -- monmap).
>
> This said, if out of 3 monitors you have 2 monitors down, your
cluster
> will cease functioning (no admin commands, no writes or reads
served).
> As there is no configuration in which you can have two strict
> majorities, thus no two partitions of the cluster are able to
function
> at the same time, you do not incur in split brain.
>
I wrote similar state, not "same state".

From a user perspective it is purely semantics how and why your shared
storage has seized up, the end result is the same.

And yes, that MON example was exactly what I was aiming for, your
cluster might still have all the data (another potential failure mode
of course), but is inaccessible.

DRBD will see and call it a split brain, Ceph will call it a Paxos
voting failure, it doesn't matter one iota to the poor sod relying on
that particular storage.

My point was and is, when you design a cluster of whatever flavor, make
sure you understand how it can (and WILL) fail, how to prevent that from
happening if at all possible and how to recover

Re: [ceph-users] Deployment scenario with 2 hosts

2014-07-28 Thread Michael
You can use multiple "steps" in your CRUSH map to do things like choose 
two different hosts and then choose a further OSD on one of those hosts 
for an additional replica. That way you can get three replicas onto two 
hosts without risking ending up with all three replicas on a single node.
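
For illustration, a minimal sketch of such a rule in decompiled crushmap
syntax (the rule name, ruleset number and the "default" root are
assumptions for the example, adjust to your own map):

rule replicated_two_hosts {
        ruleset 1
        type replicated
        min_size 1
        max_size 4
        step take default
        # pick two distinct hosts first...
        step choose firstn 2 type host
        # ...then up to two OSDs inside each of them
        step choose firstn 2 type osd
        step emit
}

With a pool size of 3 this yields up to four candidate OSDs on two hosts
and only the first three are used, so the replicas always span both
hosts. Something like "ceph osd pool set rbd crush_ruleset 1" would then
point an existing pool (the default rbd pool assumed here) at that rule.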


On 28/07/2014 18:14, Craig Lewis wrote:
That's expected.  You need > 50% of the monitors up.  If you only have 
2 machines, rebooting one means that 50% are up, so the cluster halts 
operations.  That's done on purpose to avoid problems when the cluster 
is divided in exactly half, and both halves continue to run thinking 
the other half is down.  Monitors don't need a lot of resources.  I'd 
recommend that you add a small box as a third monitor.  A VM is fine, 
as long as it has enough IOPS to its disks.
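
If the cluster was rolled out with ceph-deploy (as in the quick-start
docs), adding that third monitor could look roughly like the sketch
below; the hostname "mon3" is made up and the exact subcommand varies by
ceph-deploy version, so treat it as an outline only:

ceph-deploy install mon3
ceph-deploy mon add mon3      # older versions: ceph-deploy mon create mon3
ceph quorum_status            # confirm all three monitors form a quorum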


It's best to have 3 storage nodes.  A new, out of the box install 
tries to store data on at least 3 separate hosts.  You can lower the 
replication level to 2, or change the rules so that it will store data 
on 3 separate disks.  It might store all 3 copies on the same host 
though, so lowering the replication level to 2 is probably better.
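
As a concrete sketch (the pool name "rbd" is just an example; apply it
to whichever pools you actually use), lowering the replication level of
an existing pool and of pools created later might look like:

ceph osd pool set rbd size 2          # keep 2 copies of each object
ceph osd pool set rbd min_size 1      # keep serving I/O with only 1 copy left

and in ceph.conf for future pools:

[global]
osd pool default size = 2
osd pool default min size = 1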


I think it's possible to require data stored on 3 disks, with 2 of the 
disks coming from different nodes.  Editing the CRUSH rules is a bit 
advanced: http://ceph.com/docs/master/rados/operations/crush-map/





On Mon, Jul 28, 2014 at 9:59 AM, Don Pinkster wrote:


Hi,

Currently I am evaluating multiple distributed storage solutions
with an S3-like interface.
We have two huge machines with large amounts of storage. Is it
possible to have these two behave exactly the same with Ceph? My
idea is running both MON and OSD on these two machines.

With quick tests the cluster is degraded after a reboot of 1 host
and is not able to recover from the reboot.

Thanks in advance!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Robert Fantini
"target replication level of 3"
" with a min of 1 across the node level"

After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/
I assume that to accomplish that I should set these in ceph.conf?

osd pool default size = 3
osd pool default min size = 1







On Mon, Jul 28, 2014 at 2:56 PM, Michael  wrote:

>  If you've two rooms then I'd go for two OSD nodes in each room, a target
> replication level of 3 with a min of 1 across the node level, then have 5
> monitors and put the last monitor outside of either room (The other MON's
> can share with the OSD nodes if needed). Then you've got 'safe' replication
> for OSD/node replacement on failure with some 'shuffle' room for when it's
> needed and either room can be down while the external last monitor allows
> the decisions required to allow a single room to operate.
>
> There's no way you can do a 3/2 MON split that doesn't risk the two nodes
> being up and unable to serve data while the three are down so you'd need to
> find a way to make it a 2/2/1 split instead.
>
> -Michael
>
>
> On 28/07/2014 18:41, Robert Fantini wrote:
>
>  OK for higher availability then  5 nodes is better then 3 .  So we'll
> run 5 .  However we want normal operations with just 2 nodes.   Is that
> possible?
>
>  Eventually 2 nodes will be next building 10 feet away , with a brick wall
> in between.  Connected with Infiniband or better. So one room can go off
> line the other will be on.   The flip of the coin means the 3 node room
> will probably go down.
>  All systems will have dual power supplies connected to different UPS'.
> In addition we have a power generator. Later we'll have a 2-nd generator.
> and then  the UPS's will use different lines attached to those generators
> somehow..
> Also of course we never count on one  cluster  to have our data.  We have
> 2  co-locations with backup going to often using zfs send receive and or
> rsync .
>
>  So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or
> is that a bad idea?
>
>
>  PS:  any other idea on how to increase availability are welcome .
>
>
>
>
>
>
>
>
> On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer  wrote:
>
>>  On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
>>
>> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
>> > >
>> > > Hello,
>> > >
>> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
>> > >
>> > >> Hello Christian,
>> > >>
>> > >> Let me supply more info and answer some questions.
>> > >>
>> > >> * Our main concern is high availability, not speed.
>> > >> Our storage requirements are not huge.
>> > >> However we want good keyboard response 99.99% of the time.   We
>> > >> mostly do data entry and reporting.   20-25  users doing mostly
>> > >> order , invoice processing and email.
>> > >>
>> > >> * DRBD has been very reliable , but I am the SPOF .   Meaning that
>> > >> when split brain occurs [ every 18-24 months ] it is me or no one who
>> > >> knows what to do. Try to explain how to deal with split brain in
>> > >> advance For the future ceph looks like it will be easier to
>> > >> maintain.
>> > >>
>> > > The DRBD people would of course tell you to configure things in a way
>> > > that a split brain can't happen. ^o^
>> > >
>> > > Note that given the right circumstances (too many OSDs down, MONs
>> down)
>> > > Ceph can wind up in a similar state.
>> >
>> >
>> > I am not sure what you mean by ceph winding up in a similar state.  If
>> > you mean regarding 'split brain' in the usual sense of the term, it does
>> > not occur in Ceph.  If it does, you have surely found a bug and you
>> > should let us know with lots of CAPS.
>> >
>> > What you can incur though if you have too many monitors down is cluster
>> > downtime.  The monitors will ensure you need a strict majority of
>> > monitors up in order to operate the cluster, and will not serve requests
>> > if said majority is not in place.  The monitors will only serve requests
>> > when there's a formed 'quorum', and a quorum is only formed by (N/2)+1
>> > monitors, N being the total number of monitors in the cluster (via the
>> > monitor map -- monmap).
>> >
>> > This said, if out of 3 monitors you have 2 monitors down, your cluster
>> > will cease functioning (no admin commands, no writes or reads served).
>> > As there is no configuration in which you can have two strict
>> > majorities, thus no two partitions of the cluster are able to function
>> > at the same time, you do not incur in split brain.
>> >
>>  I wrote similar state, not "same state".
>>
>> From a user perspective it is purely semantics how and why your shared
>> storage has seized up, the end result is the same.
>>
>> And yes, that MON example was exactly what I was aiming for, your cluster
>> might still have all the data (another potential failure mode of cause),
>> but is inaccessible.
>>
>> DRBD will see and call it a split brain, Ceph will call it a Paxos voting
>> failure, it doesn't matter one iota to the poor so

Re: [ceph-users] anti-cephalopod question

2014-07-28 Thread Christian Balzer

On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:

> "target replication level of 3"
> " with a min of 1 across the node level"
> 
> After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/
> ,   I assume that to accomplish that then set these in ceph.conf   ?
> 
> osd pool default size = 3
> osd pool default min size = 1
> 
Not really, the min size specifies how few replicas need to be online
for Ceph to accept IO.
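
For what it's worth, those ceph.conf lines only act as defaults for
pools created afterwards; for pools that already exist the equivalent
would be the following (a sketch, with the pool name left as a
placeholder):

ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 1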

These (the current Firefly defaults) settings with the default crush map
will have 3 sets of data spread over 3 OSDs and not use the same node
(host) more than once.
So with 2 nodes in each location, there will always be a replica in both locations.
However if you add more nodes, all of them could wind up in the same
building.

To prevent this, you have location qualifiers beyond host and you can
modify the crush map to enforce that at least one replica is in a
different rack, row, room, region, etc.
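
As a rough sketch only (bucket and rule names and the ruleset number are
made up for illustration, and your CRUSH map first needs "room" buckets
with the hosts moved under them), room-level separation could look like:

ceph osd crush add-bucket room1 room
ceph osd crush add-bucket room2 room
ceph osd crush move room1 root=default
ceph osd crush move room2 root=default
ceph osd crush move node1 room=room1
# (repeat the move for the remaining hosts)

and then a rule in the decompiled map along these lines:

rule replicated_across_rooms {
        ruleset 2
        type replicated
        min_size 1
        max_size 4
        step take default
        # pick two rooms, then one OSD on each of two hosts per room
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}

With size = 3 the first three of the four candidates are used, so no
single room ever holds all the replicas.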

Advanced material, but one really needs to understand this:
http://ceph.com/docs/master/rados/operations/crush-map/

Christian


> 
> 
> 
> 
> 
> 
> On Mon, Jul 28, 2014 at 2:56 PM, Michael 
> wrote:
> 
> >  If you've two rooms then I'd go for two OSD nodes in each room, a
> > target replication level of 3 with a min of 1 across the node level,
> > then have 5 monitors and put the last monitor outside of either room
> > (The other MON's can share with the OSD nodes if needed). Then you've
> > got 'safe' replication for OSD/node replacement on failure with some
> > 'shuffle' room for when it's needed and either room can be down while
> > the external last monitor allows the decisions required to allow a
> > single room to operate.
> >
> > There's no way you can do a 3/2 MON split that doesn't risk the two
> > nodes being up and unable to serve data while the three are down so
> > you'd need to find a way to make it a 2/2/1 split instead.
> >
> > -Michael
> >
> >
> > On 28/07/2014 18:41, Robert Fantini wrote:
> >
> >  OK for higher availability then  5 nodes is better then 3 .  So we'll
> > run 5 .  However we want normal operations with just 2 nodes.   Is that
> > possible?
> >
> >  Eventually 2 nodes will be next building 10 feet away , with a brick
> > wall in between.  Connected with Infiniband or better. So one room can
> > go off line the other will be on.   The flip of the coin means the 3
> > node room will probably go down.
> >  All systems will have dual power supplies connected to different UPS'.
> > In addition we have a power generator. Later we'll have a 2-nd
> > generator. and then  the UPS's will use different lines attached to
> > those generators somehow..
> > Also of course we never count on one  cluster  to have our data.  We
> > have 2  co-locations with backup going to often using zfs send receive
> > and or rsync .
> >
> >  So for the 5 node cluster,  how do we set it so 2 nodes up = OK ?   Or
> > is that a bad idea?
> >
> >
> >  PS:  any other idea on how to increase availability are welcome .
> >
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer 
> > wrote:
> >
> >>  On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:
> >>
> >> > On 07/28/2014 08:49 AM, Christian Balzer wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:
> >> > >
> >> > >> Hello Christian,
> >> > >>
> >> > >> Let me supply more info and answer some questions.
> >> > >>
> >> > >> * Our main concern is high availability, not speed.
> >> > >> Our storage requirements are not huge.
> >> > >> However we want good keyboard response 99.99% of the time.   We
> >> > >> mostly do data entry and reporting.   20-25  users doing mostly
> >> > >> order , invoice processing and email.
> >> > >>
> >> > >> * DRBD has been very reliable , but I am the SPOF .   Meaning
> >> > >> that when split brain occurs [ every 18-24 months ] it is me or
> >> > >> no one who knows what to do. Try to explain how to deal with
> >> > >> split brain in advance For the future ceph looks like it
> >> > >> will be easier to maintain.
> >> > >>
> >> > > The DRBD people would of course tell you to configure things in a
> >> > > way that a split brain can't happen. ^o^
> >> > >
> >> > > Note that given the right circumstances (too many OSDs down, MONs
> >> down)
> >> > > Ceph can wind up in a similar state.
> >> >
> >> >
> >> > I am not sure what you mean by ceph winding up in a similar state.
> >> > If you mean regarding 'split brain' in the usual sense of the term,
> >> > it does not occur in Ceph.  If it does, you have surely found a bug
> >> > and you should let us know with lots of CAPS.
> >> >
> >> > What you can incur though if you have too many monitors down is
> >> > cluster downtime.  The monitors will ensure you need a strict
> >> > majority of monitors up in order to operate the cluster, and will
> >> > not serve requests if said majority is not in place.  The monitors
> >> > will only serve requests when there's a formed '

Re: [ceph-users] Optimal OSD Configuration for 45 drives?

2014-07-28 Thread Christian Balzer

Re-added ML.


On Mon, 28 Jul 2014 20:38:37 +1000 Matt Harlum wrote:
> 
> On 27 Jul 2014, at 1:45 am, Christian Balzer  wrote:
> 
> > On Sat, 26 Jul 2014 20:49:46 +1000 Matt Harlum wrote:
> > 
> >> 
> >> On 25 Jul 2014, at 5:54 pm, Christian Balzer  wrote:
> >> 
> >>> On Fri, 25 Jul 2014 13:31:34 +1000 Matt Harlum wrote:
> >>> 
>  Hi,
>  
>  I’ve purchased a couple of 45Drives enclosures and would like to
>  figure out the best way to configure these for ceph?
>  
> >>> That's the second time within a month somebody mentions these 45
> >>> drive chassis. 
> >>> Would you mind elaborating which enclosures these are precisely?
> >>> 
> >>> I'm wondering especially about the backplane, as 45 is such an odd
> >>> number.
> >>> 
> >> 
> >> The Chassis is from 45drives.com. it has 3 rows of 15 direct wire sas
> >> connectors connected to two highpoint rocket 750s using 12 SFF-8087
> >> Connectors. I’m considering replacing the highpoints with 3x LSI
> >> 9201-16I cards The chassis’ are loaded up with 45 Seagate 4TB drives,
> >> and separate to the 45 large drives are the two boot drives in raid 1.
> >> 
> > Oh, Backblaze inspired!
> > I stared at the originals a couple of years ago. ^.^
> > And yeah, replacing the Highpoint controllers sounds like a VERY good
> > idea. ^o^
> > 
> > You might want to get 2 (large and thus fast) Intel DC 3700 SSDs for
> > the OS drives and put the journals on those (OS MD RAID1, journals on
> > individual partitions). 
> 
> The fact that I have a failure domain containing 180TB of data terrifies
> me to be honest. If the whole host dies I’ll be pretty boned I’m guessing,
> and I’m going to lose sleep worrying about it, but I will eventually
> have 10Gbit for the replication network, just waiting on the switches.
> 
Well, if you go for 4) it won't be quite as big, at most 160TB. ^o^

See the other current threads in the ML on how to avoid unwanted
(untimely) recovery events.
Using any of these methods you will have time to bring your host (or OSD)
back online and if that shouldn't be possible at least control WHEN the
recovery kicks in.
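
For reference, the knobs usually mentioned in those threads are the
noout flag and the down/out timers; a sketch (double-check the option
names against your release's documentation):

ceph osd set noout       # planned maintenance: down OSDs are not marked out
ceph osd unset noout     # re-enable automatic out-marking afterwards

and in ceph.conf on the monitors:

[mon]
mon osd down out interval = 600          # wait 10 minutes before marking out
mon osd down out subtree limit = host    # never auto-out a whole host at once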

> I’m glad you mentioned OS + journal partitions! I didn’t install any
> SSDs initially because I didn’t want to deal with losing a bunch of OSDs
> at once due to journal failure. because even at 4x Raid 6 OSDs I’ve got
> 36TB per OSD to replicate in case of an issue. Combining the journals
> into partitions on a Raided set is a great idea! Not sure I can get the
> boss to spring for some S3700’s but I’ll see :)
> 
Actually my suggestion was just to RAID the OS and not the journals, for
the obvious performance reasons. 
Also see above, recoveries can be controlled and those 36TB would be the
worst case of course.
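
A possible layout sketch for that (device names, partition sizes and the
ceph-deploy syntax below are assumptions, not from this thread): mirror
only the OS slice of the two SSDs and leave one raw partition per OSD
journal on each of them.

parted -s /dev/sda mklabel gpt
parted -s /dev/sda mkpart os 1MiB 50GiB          # OS slice, mirrored below
parted -s /dev/sda mkpart journal0 50GiB 60GiB   # one journal partition per OSD
# (add further journal partitions as needed; same layout on /dev/sdb)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# journals stay un-mirrored; point each OSD at its own partition, e.g.
# ceph-deploy osd create host1:/dev/sdc:/dev/sda2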

Remember that you won't be able to write faster than the journal speed to
your OSDs.
So if you were to get 2 400GB DC3700 SSDs that would be just shy of 1GB/s,
which is definitely less than what your HDDs can scribble away.
But it will deal with bursty IO much nicer.
It boils down to what that storage is used for, since you said backups
we're looking more at sequential writes and reads than anything else (of
course if those backups come in in parallel...).
The DC3700 400GB SSD is rated for 4TB/day for 5 years, so if you're
thinking of writing more than that per day a RAID controller with lots of
cache is probably the better choice. 
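
Back-of-the-envelope, with assumed per-device figures (roughly 460 MB/s
sequential write for a 400GB DC S3700, ~100 MB/s per 7200RPM disk):

2  x 460 MB/s ~  920 MB/s   aggregate journal bandwidth ("just shy of 1GB/s")
45 x 100 MB/s ~ 4500 MB/s   what the spinners could absorb sequentially

so sustained sequential ingest is capped by the journals, while bursty
I/O still benefits from them.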


> > 
> >>> Also if you don't mind, specify "a couple" and what your net storage
> >>> requirements are.
> >>> 
> >> 
> >> Total is 3 of these 45drives.com enclosures for 3 replicas of our
> >> data, 
> >> 
> > If you're going to use RAID6, a replica of 2 will be fine.
> 
> Awesome, Should give me a bunch of extra space then :)
>
And higher speed, too.
 
> > 
> >>> In fact, read this before continuing:
> >>> ---
> >>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html
> >>> ---
> >>> 
>  Mainly I was wondering if it was better to set up multiple raid
>  groups and then put an OSD on each rather than an OSD for each of
>  the 45 drives in the chassis? 
>  
> >>> Steve already towed the conservative Ceph party line here, let me
> >>> give you some alternative views and options on top of that and to
> >>> recap what I wrote in the thread above.
> >>> 
> >>> In addition to his links, read this:
> >>> ---
> >>> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> >>> ---
> >>> 
> >>> Lets go from cheap and cheerful to "comes with racing stripes".
> >>> 
> >>> 1) All spinning rust, all the time. Plunk in 45 drives, as JBOD
> >>> behind the cheapest (and densest) controllers you can get. Having
> >>> the journal on the disks will halve their performance, but you just
> >>> wanted the space and are not that pressed for IOPS. 
> >>> The best you can expect per node with this setup is something around
> >>> 2300 IOPS with normal (7200RPM) disks.
> >>> 
> >>> 2) Same as 1), but use controllers with a large HW cache (4GB Areca
> >>> comes t