[ceph-users] Online resizing RBD kernel module

2013-04-05 Thread Laurent Barbe
Hello,

I'm trying online resizing with RBD + XFS. But when I try to run
xfs_growfs, it doesn't see the new size. I don't use a partition table; the
OS is Debian Squeeze / kernel 3.8.4 / Ceph 0.56.4.
It seems that the mounted file system prevents the block device size from
being updated?

If the file system is not mounted, or if I unmount + mount, xfs_growfs
works as expected.

 ORIGINAL SIZE 
  # parted /dev/rbd1 print
  Model: Unknown (unknown)
  Disk /dev/rbd1: *105MB*
  Sector size (logical/physical): 512B/512B
  Partition Table: loop

  Number  Start  EndSize   File system  Flags
  1  0,00B  105MB  105MB  xfs

 RBD RESIZE 
  # rbd resize rbdxfs --size=200
  Resizing image: 100% complete...done.

 SIZE DOES NOT CHANGE IF FS ON RBD1 IS MOUNTED 
  # parted /dev/rbd1 print
  Model: Unknown (unknown)
  Disk /dev/rbd1: *105MB*
  Sector size (logical/physical): 512B/512B
  Partition Table: loop

  Number  Start  EndSize   File system  Flags
  1  0,00B  105MB  105MB  xfs

 UMOUNT FS --> SIZE OK 
  # umount /mnt/rbdxfs
  # parted /dev/rbd1 print
  Model: Unknown (unknown)
  Disk /dev/rbd1: *210MB*
  Sector size (logical/physical): 512B/512B
  Partition Table: loop

  Number  Start  EndSize   File system  Flags
  1  0,00B  210MB  210MB  xfs


Any ideas?
Thanks

--
Laurent Barbe


Re: [ceph-users] Online resizing RBD kernel module

2013-04-05 Thread Wido den Hollander

On 04/05/2013 12:34 PM, Laurent Barbe wrote:

Hello,

I'm trying online resizing with RBD + XFS. But when I try to run
xfs_growfs, it doesn't see the new size. I don't use a partition table;
the OS is Debian Squeeze / kernel 3.8.4 / Ceph 0.56.4.
It seems that the mounted file system prevents the block device size
from being updated?

If the file system is not mounted, or if I unmount + mount, xfs_growfs
works as expected.



When a block device is in use its size can't change. When you unmount it, the 
block device is no longer in use and the new size can be detected.


This is not an RBD limitation; it's something that goes for all 
block devices in Linux.


I've seen some patches floating around that could do this online, but 
I'm not sure if they are in the kernel.


You could try this:

$ blockdev --rereadpt /dev/rbd1

Or

$ partprobe -s /dev/rbd1
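
For what it's worth, here is a minimal sketch of the full online-resize
workflow this would enable, assuming the kernel does pick up the new size
(device and mount point taken from your example):

  # rbd resize rbdxfs --size=200       # grow the image on the Ceph side
  # blockdev --rereadpt /dev/rbd1      # ask the kernel to refresh the device
  # blockdev --getsize64 /dev/rbd1     # verify the kernel now reports ~200 MB
  # xfs_growfs /mnt/rbdxfs             # grow XFS while it is still mounted

xfs_growfs operates on the mount point of a mounted filesystem, so the only
missing piece is getting the kernel to refresh the block device size.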


--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on




Re: [ceph-users] CephFS - mount from computer not administered by cluster admin

2013-04-05 Thread Wido den Hollander

On 04/05/2013 05:50 AM, Vanja Z wrote:

I have been testing CephFS on our computational cluster of about 30 computers. 
I want users to be able to access the file-system from their personal machines. 
At the moment, we simply allow the same NFS exports to be mounted from users' 
personal machines. As far as I can tell, it is not possible to mount CephFS from 
a machine without allowing the root user of that machine root access to the 
file-system, thus allowing them unlimited access to other users' files; is this 
correct?



No, this is not possible at the moment.

There is an issue in the tracker for it, though: 
http://tracker.ceph.com/issues/1237


No progress has been made here yet since other MDS features (like 
getting it stable) have a higher priority.




Any help would be greatly appreciated!





--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] CephFS - limit space available to individual users.

2013-04-05 Thread Wido den Hollander

On 04/05/2013 05:47 AM, Vanja Z wrote:

I have been testing CephFS on our computational cluster of about 30 computers. 
I've got 4 machines, 4 disks, 4 osd, 4 mon and 1 mds at the moment for testing. 
The testing has been going very well apart from one problem that needs to be 
resolved before we can use Ceph in place of our existing 'system' of NFS 
exports.

Our users run simulations that are easily capable of writing out data at a rate limited 
only by the storage device. These jobs also often run for days or weeks unattended. This 
unfortunately means that, using CephFS, if a user doesn't set up their simulation carefully 
enough or if their code has some bug, they are able to fill the entire filesystem (shared 
by around 10 other users) in around a day, leaving no room for any other users and 
potentially crashing the entire cluster. I've read the FAQ entry about quotas but I'm not 
sure what to make of it. Is it correct that you can only have one "CephFS" per 
cluster? I guess I was imagining creating a separate file-system of known size for each 
user.



The talk about quotas was indeed about user quotas, but nothing about 
enforcing them. The first step is to do accounting, and maybe in a later 
stage soft and hard enforcement can be added.


I don't think it's on the roadmap currently.



Any help would be greatly appreciated and since this is my first message, 
thanks to Sage and everyone else involved for creating this excellent project!




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


[ceph-users] Ceph mon quorum

2013-04-05 Thread Alexis GÜNST HORN
Hello to all,

I have a Ceph cluster composed of 4 nodes in 2 different rooms.

room A : osd.1, osd.3, mon.a, mon.c
room B : osd.2, osd.4, mon.b

My CRUSH rule is made to place replicas across rooms.
So normally, if I shut down the whole of room A, my cluster should stay usable.

... but in fact, no.
When I switch off room A, mon.b does not succeed in managing the cluster.
Here is the log of mon.b:

2013-04-05 11:46:11.842267 7f42e61fc700  0 mon.b@1(peon) e1
handle_command mon_command(status v 0) v1
2013-04-05 11:46:12.746317 7f42e61fc700  0 mon.b@1(peon) e1
handle_command mon_command(status v 0) v1
2013-04-05 11:46:17.684378 7f42e46f3700  0 -- 10.0.3.2:6789/0 >>
10.0.3.1:6789/0 pipe(0x7f42d4002c80 sd=26 :6789 s=2 pgs=47 cs=1
l=0).fault, initiating reconnect
2013-04-05 11:46:17.685624 7f42f0e93700  0 -- 10.0.3.2:6789/0 >>
10.0.3.1:6789/0 pipe(0x7f42d4002c80 sd=19 :35755 s=1 pgs=47 cs=2
l=0).fault
2013-04-05 11:46:17.721214 7f4266eee700  0 -- 10.0.3.2:6789/0 >>
10.0.3.3:6789/0 pipe(0x2b4c480 sd=17 :58791 s=2 pgs=26 cs=1 l=0).fault
with nothing to send, going to standby
2013-04-05 11:46:18.453162 7f42e61fc700  0 mon.b@1(peon) e1
handle_command mon_command(status v 0) v1
2013-04-05 11:46:25.638744 7f42ec80d700  0 -- 10.0.3.2:6789/0 >>
10.0.3.3:6789/0 pipe(0x2b4c480 sd=17 :58791 s=1 pgs=26 cs=2 l=0).fault



What I understand is that, yes, mon.b knows that mon.a and mon.c are
down, but it can't join the quorum. Why ?

Thanks for your answers.


Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Wido den Hollander

Hi,

On 04/05/2013 01:57 PM, Alexis GÜNST HORN wrote:


What I understand is that, yes, mon.b knows that mon.a and mon.c are
down, but it can't join the quorum. Why?



You always need a majority of your monitors to be up. In this case you 
lose 66% of your monitors, so mon.b can't get a majority.


With 3 monitors you need at least 2 to be up to have your cluster working.
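
To make the majority rule concrete: quorum needs floor(n/2) + 1 of the
configured monitors, for example:

  monitors   needed for quorum   failures tolerated
  1          1                   0
  3          2                   1
  5          3                   2

That is why a 2 + 1 split across two rooms can never survive losing the
room that holds two of the three monitors.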

Wido


Thanks for your answers.




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] Online resizing RBD kernel module

2013-04-05 Thread Laurent Barbe
Thanks for your answer.
No luck with blockdev --rereadpt or partprobe -s either. :(


2013/4/5 Wido den Hollander 

> On 04/05/2013 12:34 PM, Laurent Barbe wrote:
>
>> Hello,
>>
>> I'm trying online resizing with RBD + XFS. But when I try to run
>> xfs_growfs, it doesn't see the new size. I don't use a partition table;
>> the OS is Debian Squeeze / kernel 3.8.4 / Ceph 0.56.4.
>> It seems that the mounted file system prevents the block device size
>> from being updated?
>>
>> If the file system is not mounted, or if I unmount + mount, xfs_growfs
>> works as expected.
>>
>>
> When a block device is in use its size can't change. When you unmount it,
> the block device is no longer in use and the new size can be detected.
>
> This is not an RBD limitation; it's something that goes for all block
> devices in Linux.
>
> I've seen some patches floating around that could do this online, but I'm
> not sure if they are in the kernel.
>
> You could try this:
>
> $ blockdev --rereadpt /dev/rbd1
>
> Or
>
> $ partprobe -s /dev/rbd1
>
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on


Re: [ceph-users] CephFS - limit space available to individual users.

2013-04-05 Thread Vanja Z
Thanks Wido, I have to admit it's slightly disappointing (but completely 
understandable) since it basically means it's not safe for us to use CephFS :(


Without "userquotas", it would be sufficient to have multiple CephFS 
filesystems and to be able to set the size of each one.

Is it part of the core design that there can only be one filesystem in the 
cluster? This seems like a 'single point of failure'.


> The talk about quotas was indeed about user quotas, but nothing about 
> enforcing them. The first step is to do accounting, and maybe in a later 
> stage soft and hard enforcement can be added.
> 
> I don't think it's on the roadmap currently.


[ceph-users] Question about Backing Up RBD Volumes in Openstack

2013-04-05 Thread Dave Spano
If I pause my instances in OpenStack, then snapshot and clone my volumes, I 
should have a consistent backup, correct? Is freezing on snapshot creation, 
like LVM does, a potential future feature? 

I've considered Sebastien's method here 
(http://www.sebastien-han.fr/blog/2012/12/10/openstack-perform-consistent-snapshots/), 
but I would prefer to back up via Ceph and outside of the VM.
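
For reference, a rough sketch of what the Ceph-side steps could look like
(pool, image and snapshot names are made up; this says nothing about
consistency inside the guest):

  # rbd snap create volumes/volume-foo@backup-20130405
  # rbd snap protect volumes/volume-foo@backup-20130405   # needed before cloning a format 2 image
  # rbd clone volumes/volume-foo@backup-20130405 backups/volume-foo-20130405
  # rbd export volumes/volume-foo@backup-20130405 /backup/volume-foo.img   # or export instead of clone

Whether the data in the snapshot is consistent still depends on the guest
having quiesced its writes, which is what pausing the instance or Sebastien's
in-guest approach is for.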

Dave Spano 
Optogenics 
Systems Administrator 




Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Dimitri Maziuk

On 4/5/2013 7:57 AM, Wido den Hollander wrote:


You always need a majority of your monitors to be up. In this case you
lose 66% of your monitors, so mon.b can't get a majority.

With 3 monitors you need at least 2 to be up to have your cluster working.


That's kinda useless, isn't it? I'd've thought "2 copies on-site and one 
off-site, and if the main site room's down we can work off the off-site 
server" is a basic enough HA setup -- we've had it here for some time. 
Now you tell me ceph won't even do that?


Dima



Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Dino Yancey
If, in the case above, you have a monitor per room (a, b) and one in a
third location outside of either (c), you would have the ability to
take down the entirety of either room and still maintain monitor
quorum (a,c or b,c). The cluster would continue to work.

On Fri, Apr 5, 2013 at 10:02 AM, Dimitri Maziuk  wrote:
> On 4/5/2013 7:57 AM, Wido den Hollander wrote:
>
>> You always need a majority of your monitors to be up. In this case you
>> lose 66% of your monitors, so mon.b can't get a majority.
>>
>> With 3 monitors you need at least 2 to be up to have your cluster working.
>
>
> That's kinda useless, isn't it? I'd've thought "2 copies on-site and one
> off-site, and if the main site room's down we can work off the off-site
> server" is a basic enough HA setup -- we've had it here for some time. Now
> you tell me ceph won't even do that?
>
> Dima
>
>



-- 
__
Dino Yancey
2GNT.com Admin


Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Wido den Hollander

On 04/05/2013 05:02 PM, Dimitri Maziuk wrote:

On 4/5/2013 7:57 AM, Wido den Hollander wrote:


You always need a majority of your monitors to be up. In this case you
lose 66% of your monitors, so mon.b can't get a majority.

With 3 monitors you need at least 2 to be up to have your cluster
working.


That's kinda useless, isn't it? I'd've thought "2 copies on-site and one
off-site, and if the main site room's down we can work off the off-site
server" is a basic enough HA setup -- we've had it here for some time.
Now you tell me ceph won't even do that?



It's a design principle that you need a majority.

Think about it this way. You have two racks and the network connection 
between them fails. If both racks keep operating because they can still 
reach that single monitor in their rack, you will end up with data 
inconsistency.


The majority requirement is there to prevent you from having "rogue" 
parts which still operate without a network connection to the rest.


So it's not useless, it's a way to keep everything consistent.

You should place mon.c outside rack A or B to keep you up and running in 
this situation.
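
If you want to see which monitors currently form the quorum, something like
this should show it (a sketch; exact output varies by version):

  $ ceph quorum_status
  $ ceph mon stat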



Dima




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] File extension

2013-04-05 Thread Noah Watkins

On Apr 4, 2013, at 3:06 AM, Waed Bataineh  wrote:

> Hello, 
> 
> I'm using Ceph as object storage, where it puts the whole file, whatever its 
> size, in one object (correct me if I'm wrong).  
> I used it for multiple files that have different extensions (.txt, .mp3, 
> etc.); I can store the files and retrieve them smoothly. 
> 
> My questions are:
> 1. When I get the file back using rados get obj_name file_path 
> --pool=pool_name, 
> how would I know the extension of the file that was mapped into the object 
> in the first place?

You can include the file extension in the name/key of an object, or use 
metadata facilities such as object extended attributes to keep this information.
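
For example, a minimal sketch using extended attributes (pool and object
names are made up):

  # rados -p mypool put song.mp3 ./song.mp3
  # rados -p mypool setxattr song.mp3 extension mp3
  # rados -p mypool getxattr song.mp3 extension
  mp3
  # rados -p mypool get song.mp3 /tmp/restored.mp3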

-Noah


Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Dimitri Maziuk
On 04/05/2013 10:12 AM, Wido den Hollander wrote:

> Think about it this way. You have two racks and the network connection
> between them fails. If both racks keep operating because they can still
> reach that single monitor in their rack you will end up with data
> inconsistency.

Yes. In DRBD land it's called 'split brain' and they have (IIRC) an entire
chapter in the user manual about picking up the pieces. It's not a new
problem.

> You should place mon.c outside rack A or B to keep you up and running in
> this situation.

It's not about racks, it's about rooms, but let's say rack == room ==
colocation facility. And I have two of those.

Are you saying I need a 3rd colo with all associated overhead to have a
usable replica of my data in colo #2?

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Gregory Farnum
On Fri, Apr 5, 2013 at 10:28 AM, Dimitri Maziuk  wrote:
> On 04/05/2013 10:12 AM, Wido den Hollander wrote:
>
>> Think about it this way. You have two racks and the network connection
>> between them fails. If both racks keep operating because they can still
>> reach that single monitor in their rack you will end up with data
>> inconsistency.
>
> Yes. In DRBD land it's called 'split brain' and they have (IIRC) entire
> chapter in the user manual about picking up the pieces. It's not a new
> problem.
>
>> You should place mon.c outside rack A or B to keep you up and running in
>> this situation.
>
> It's not about racks, it's about rooms, but let's say rack == room ==
> colocation facility. And I have two of those.
>
> Are you saying I need a 3rd colo with all associated overhead to have a
> usable replica of my data in colo #2?

Or just a VM running somewhere that's got a VPN connection to your
room-based monitors, yes. Ceph is a strongly consistent system and
you're not going to get split brains, period. This is the price you
pay for that.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Jeff Anderson-Lee

On 4/5/2013 10:32 AM, Gregory Farnum wrote:

Or just a VM running somewhere that's got a VPN connection to your
room-based monitors, yes. Ceph is a strongly consistent system and
you're not going to get split brains, period. This is the price you
pay for that.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
The point is, I believe, that you don't need a 3rd replica of everything, 
just a 3rd MON running somewhere else.
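
As a sketch of what that can look like in ceph.conf (hostnames and the
off-site address are made up; mon.c could be a small VM in a third location):

  [mon.a]
      host = room-a-node1
      mon addr = 10.0.3.1:6789

  [mon.b]
      host = room-b-node1
      mon addr = 10.0.3.2:6789

  [mon.c]
      host = offsite-vm
      mon addr = 10.0.4.1:6789

The third monitor doesn't need any OSDs behind it; it only has to stay
reachable from the other two so a majority can always be formed.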


Jeff



Re: [ceph-users] Ceph mon quorum

2013-04-05 Thread Dimitri Maziuk
On 04/05/2013 12:38 PM, Jeff Anderson-Lee wrote:

> The point is I believe that you don't need a 3rd replica of everything,
> just a 3rd MON running somewhere else.

Bear in mind that you still need a physical machine somewhere in that
"somewhere else".

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





[ceph-users] RBD performance test (write) problem

2013-04-05 Thread Kelvin_Huang
Hi all,

I have some questions after my RBD performance test.

Setup:
Linux kernel: 3.6.11
OS: Ubuntu 12.04
RAID card: LSI MegaRAID SAS 9260-4i (each HDD as a single-drive RAID0; Write 
Policy: Write Back with BBU; Read Policy: ReadAhead; IO Policy: Direct)
Storage server number: 1
Storage server:
8 * HDD (each storage server has 8 OSDs; 7200 rpm, 2 TB each)
4 * SSD (every 2 OSDs share 1 SSD as journal; each SSD is divided into two 
partitions, sdx1 and sdx2)

Ceph version: 0.56.4
Replicas: 2
Monitor number: 1


The write speed of HDD:
# dd if=/dev/zero of=/dev/sdd bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 69.3961 s, 151 MB/s

The write speed of SSD:
# dd if=/dev/zero of=/dev/sdb bs=1024k count=10000 oflag=direct
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 40.8671 s, 257 MB/s


Then we use the RADOS benchmark and collectl to observed write performance

# rados -p rbd bench 300 write -t 256

2013-04-05 14:31:13.732737 min lat: 4.28207 max lat: 5.92085 avg lat: 4.78598
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   300 256 16043 15787   210.455   196  5.91   4.78598
Total time run: 300.588962
Total writes made:  16043
Write size: 4194304
Bandwidth (MB/sec): 213.488

Stddev Bandwidth:   40.6795
Max bandwidth (MB/sec): 288
Min bandwidth (MB/sec): 0
Average Latency:4.75647
Stddev Latency: 0.37182
Max latency:5.93183
Min latency:0.590936



collectl on OSDs :
#collectl  --iosize -sCDN --dskfilt "sd(c|d|e|f|g|h|i|j)"

# DISK STATISTICS (/sec)
#  
<-reads-><-writes-> 
Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  
Wait SvcTim Util
sdc  0  000   76848563  460  167 16712
26  0   42
sdd  0  000   45100  0  165  273 273 6
36  1   30
sde  0  000   73800  0  270  273 273 3
14  1   41
sdf  0  000   73800  0  270  273 27317
64  1   33
sdg  0  000   41000  0  150  273 273 1 
7  0   10
sdh  0  000   57400  0  210  273 273 4
20  1   27
sdi  0  000   36904  0  136  271 271 0 
5  07
sdj  0  000   6  0  285  273 27228
87  1   48


collectl on SSDs :
#collectl  --iosize -sCDN --dskfilt "sd(b|k|l|m)"

# DISK STATISTICS (/sec)
#  
<-reads-><-writes-> 
Pct
#Name   KBytes Merged  IOs Size  KBytes Merged  IOs Size  RWSize  QLen  
Wait SvcTim Util
sdb  0  000  115552  0  388  298 29775   
159  2   77
sdk  0  000  114592  0  389  295 29412
33  0   38
sdl  0  000  100364  0  334  300 30035   
148  2   69
sdm  0  000  101644  0  345  295 294   245   
583  2   99 <= almost 99%



My questions are:
1. Is the rados benchmark write a random write?

2. Why does the write bandwidth hit a bottleneck at 213 MB/s even if I increase 
the concurrency (-t 512)?
   It looks a bit odd, because collectl shows the SSDs' write throughput at only 
100-120 MB/s, but the SSDs should be able to do 250 MB/s.

3. Why is one SSD (sdm) at almost 99% [Util]? Does that mean the data written to 
the OSDs is not evenly distributed?

4. If the SSDs are not the write performance bottleneck, what could the 
bottleneck be?

5. How can I improve write performance?

Thanks!!

- Kelvin
