Re: [ceph-users] Cuttlefish VS Bobtail performance series

2013-07-11 Thread Erwan Velu

On 10/07/2013 18:01, Mark Nelson wrote:

Hello again!

Part 2 is now out!  We've got a whole slew of results for 4K FIO tests 
on RBD:


http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-2-4k-rbd-performance/ 



Hey Mark,

I'm really fond of this way of plotting performance results. You may have 
seen my work on this topic, which is about to be upstreamed in fio: 
http://www.spinics.net/lists/fio/msg02140.html (one tool to automate 
the fio jobs, one for graphing). You may find it interesting.


My only comment about those graphs would be to maintain comparable 
ranges across the same benchmark types.


For example, in this picture, 
http://ceph.com/wp-content/uploads/2014/07/cuttlefish-rbd_btrfs-write-0004K.png, 
we have 6 runs with the same IO profile, but each result is plotted in 
its own context, meaning the Z-axis range and its associated colorbar (CB) 
change from graph to graph. That makes it very difficult to compare runs, 
as a given color doesn't have the same meaning everywhere.


I usually force both zrange & cbrange (it looks like you use gnuplot) 
to be constant across graphs.

The syntax is (adjust the values as needed):
set cbrange [0:12]
set zrange [0:12]

Pros:
- the same color means the same raw value everywhere
- a quick look at a graph series gives an immediate understanding of the 
overall performance (which run is good, which one is bad)
- it speeds up graph reading

Cons:
- it could visually flatten some low values
- it requires estimating the range before plotting

My 2 cents,
Erwan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change of Monitors IP Addresses

2013-07-11 Thread Joachim . Tork
Hi folks,

I am facing the difficulty that I have to change the IP addresses of the 
monitors in the public network.

What needs to be done besides changing ceph.conf?

Best regards

Joachim Tork
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Live migration of VM using librbd and OpenStack

2013-07-11 Thread Maciej Gałkiewicz
On 12 March 2013 21:38, Josh Durgin  wrote:

> Yes, it works with true live migration just fine (even with caching). You
> can use "virsh migrate" or even do it through the virt-manager gui.
> Nova is just doing a check that doesn't make sense for volume-backed
> instances with live migration there.
>
> Unfortunately I haven't had the time to look at that problem in
> nova since that message, but I suspect the same issue is still
> there.
>

Could you point out how to enable rbd caching in openstack?

regards
Maciej Galkiewicz
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Tom Verdaat
Hi guys,

We want to use our Ceph cluster to create a shared disk file system to host
VM's. Our preference would be to use CephFS but since it is not considered
stable I'm looking into alternatives.

The most appealing alternative seems to be to create a RBD volume, format
it with a cluster file system and mount it on all the VM host machines.

Obvious file system candidates would be OCFS2 and GFS2 but I'm having
trouble finding recent and reliable documentation on the performance,
features and reliability of these file systems, especially related to our
specific use case. The specifics I'm trying to keep in mind are:

   - Using it to host VM ephemeral disks means the file system needs to
   perform well with few but very large files and usually machines don't try
   to compete for access to the same file, except for during live migration.
   - Needs to handle scale well (large number of nodes, manage a volume of
   tens of terabytes and file sizes of tens or hundreds of gigabytes) and
   handle online operations like increasing the volume size.
   - Since the cluster FS is already running on a distributed storage
   system (Ceph), the file system does not need to concern itself with things
   like replication. Just needs to not get corrupted and be fast of course.


Anybody here that can help me shed some light on the following questions:

   1. Are there other cluster file systems to consider besides OCFS2 and
   GFS2?
   2. Which one would yield the best performance for our use case?
   3. Is anybody doing this already and willing to share their experience?
   4. Is there anything important that you think we might have missed?


Your help is very much appreciated!

Thanks!

Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Change of Monitors IP Addresses

2013-07-11 Thread Joao Eduardo Luis

On 07/11/2013 09:03 AM, joachim.t...@gad.de wrote:

Hi folks,

I am facing the difficulty that I have to change the IP addresses of the
monitors in the public network.

What needs to be done besides changing ceph.conf?


ceph.conf is only used by other daemons (that aren't the monitors) and 
clients.  The monitors however use the monmap (monitor map) to assess 
where all the monitors are and who they are -- this map is kept by the 
monitors themselves.


In other words, solely changing ceph.conf and firing up the monitors 
with their new IPs will result in the monitors committing suicide, as 
they will not find themselves (given their new IPs) in the monmap.


I would advise you to read through the ceph docs' "Adding/Removing 
monitors", section "Changing a Monitor's IP Address" [1].



Although the appropriate way to do this would be to keep the current 
monitors in place, adding new monitors with new IPs (one at a time) and 
letting the cluster become healthy, this assumes two things: 1) the new 
monitors' locations are able to communicate with the current monitors' 
locations, and the latencies are small enough for it to work; and 2) you 
have spare hardware on which you can run the new monitors as you're 
adding them.


If one of these conditions is not satisfiable, you may want to do it 
the dirty way:  generate a new monmap with the new monitor locations, 
shut down your monitors, move them over to their new location, inject the 
new monmap, and restart the monitors.  This is also described in [1].
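
Roughly, scripted out, that looks like the sketch below (monitor IDs and 
addresses are placeholders; double-check every step against [1] for your 
setup before running anything):

# Hypothetical sketch of the monmap-injection procedure described above.
# Monitor IDs/addresses are placeholders; in reality the inject step runs
# on each monitor's host, not all from one machine.
import subprocess

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.check_call(cmd)

new_addrs = {"a": "192.168.1.10:6789", "b": "192.168.1.11:6789"}
monmap = "/tmp/monmap"

# 1. Grab the current monmap while the cluster is still reachable.
run(["ceph", "mon", "getmap", "-o", monmap])

# 2. Rewrite it: drop the old entries, re-add the monitors with new addresses.
for mon_id in new_addrs:
    run(["monmaptool", "--rm", mon_id, monmap])
for mon_id, addr in new_addrs.items():
    run(["monmaptool", "--add", mon_id, addr, monmap])

# 3. With the monitors stopped and moved to their new location, inject the
#    rewritten map into each one and restart them.
for mon_id in new_addrs:
    run(["ceph-mon", "-i", mon_id, "--inject-monmap", monmap])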



Hope this helps, and let us know if you run into problems.

  -Joao


[1] - http://ceph.com/docs/master/rados/operations/add-or-rm-mons/



Best regards

Joachim Tork


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cuttlefish VS Bobtail performance series

2013-07-11 Thread Mark Nelson

On 07/11/2013 02:36 AM, Erwan Velu wrote:

On 10/07/2013 18:01, Mark Nelson wrote:

Hello again!

Part 2 is now out!  We've got a whole slew of results for 4K FIO tests
on RBD:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-2-4k-rbd-performance/



Hey Mark,

I'm really fond of this way of plotting performance results. You may
have seen my work on this topic, which is about to be upstreamed in fio:
http://www.spinics.net/lists/fio/msg02140.html (one tool to automate
the fio jobs, one for graphing). You may find it interesting.

My only comment about those graphs would be to maintain comparable
ranges across the same benchmark types.

For example, in this picture,
http://ceph.com/wp-content/uploads/2014/07/cuttlefish-rbd_btrfs-write-0004K.png,
we have 6 runs with the same IO profile, but each result is plotted in
its own context, meaning the Z-axis range and its associated colorbar (CB)
change from graph to graph. That makes it very difficult to compare runs,
as a given color doesn't have the same meaning everywhere.

I usually force both zrange & cbrange (it looks like you use gnuplot)
to be constant across graphs.
The syntax is (adjust the values as needed):
set cbrange [0:12]
set zrange [0:12]

Pros:
- the same color means the same raw value everywhere
- a quick look at a graph series gives an immediate understanding of the
overall performance (which run is good, which one is bad)
- it speeds up graph reading

Cons:
- it could visually flatten some low values
- it requires estimating the range before plotting

My 2 cents,
Erwan




Hi Erwan,

Yes I agree 100%!  I wanted to do that, but ran out of time.   I think 
(as you stated) the only way to pull it off is to globally figure out 
the maximum z-axis range for the group, then manually pass it in to 
gnuplot for the graph generation.  Unfortunately when I was creating the 
plot file I wasn't planning at the time to glue the graphs together with 
montage, so I didn't really think to design it to work that way.  I'm 
discovering that I probably need to rely less on doing things internally 
in the plot file and more on wrapping the plot generation in something 
else like python.
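
Something along these lines is what I have in mind, purely as a sketch (it 
assumes plain whitespace-separated x/y/z data files laid out for pm3d, which 
isn't exactly what my current plot files consume):

# Sketch of a Python wrapper: scan the whole group of results for the global
# z maximum, then emit gnuplot scripts with a fixed zrange/cbrange so the
# colors mean the same thing on every graph in the series.
import glob
import subprocess

files = sorted(glob.glob("results/*.dat"))

# Pass 1: find the global maximum z value across the whole series.
zmax = 0.0
for path in files:
    with open(path) as f:
        for line in f:
            cols = line.split()
            if len(cols) < 3 or line.lstrip().startswith("#"):
                continue
            zmax = max(zmax, float(cols[2]))

# Pass 2: plot every file with the same fixed ranges.
for path in files:
    script = "\n".join([
        "set terminal png size 800,600",
        "set output '%s.png'" % path,
        "set zrange [0:%g]" % zmax,
        "set cbrange [0:%g]" % zmax,
        "set pm3d map",   # assumes grid-ordered data (blank line between rows)
        "splot '%s' using 1:2:3 notitle" % path,
    ])
    subprocess.run(["gnuplot"], input=script.encode(), check=True)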


Your graphs look very similar to the ones I made!  You'd almost think we 
did it together. :)  I see you are also running into the same 
overlapping labels issue that I've hit:


http://www.flickr.com/photos/ennael/9101313574/in/set-72157634249027122

Anyway, glad someone else is as crazy as I am for wanting to graph so 
much data. ;)


Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Including pool_id in the crush hash ? FLAG_HASHPSPOOL ?

2013-07-11 Thread Sylvain Munaut
Hi,


I'd like the pool_id to be included in the hash used for the PG, to
try and improve the data distribution. (I have 10 pools.)

I see that there is a flag named FLAG_HASHPSPOOL. Is it possible to
enable it on an existing pool?


Cheers,

 Sylvain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hadoop/Ceph and DFS IO tests

2013-07-11 Thread Noah Watkins
On Wed, Jul 10, 2013 at 6:23 PM, ker can  wrote:
>
> Now separating out the journal from data disk ...
>
> HDFS write numbers (3 disks/data node)
> Average execution time: 466
> Best execution time : 426
> Worst execution time   : 508
>
> ceph write numbers (3 data disks/data node + 3 journal disks/data node)
> Average execution time: 610
> Best execution time : 593
> Worst execution time   : 635
>
> So ceph was about 1.3x slower for the average case when journal & data are
> separated .. a 70% improvement over the case where journal + data are on the
> same disk - but still a bit off from the HDFS performance.

Were you running 3 OSDs per node (an OSD per data/journal drive pair)?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hadoop/Ceph and DFS IO tests

2013-07-11 Thread ker can
Yep, that's right: 3 OSD daemons per node.


On Thu, Jul 11, 2013 at 9:16 AM, Noah Watkins wrote:

> On Wed, Jul 10, 2013 at 6:23 PM, ker can  wrote:
> >
> > Now separating out the journal from data disk ...
> >
> > HDFS write numbers (3 disks/data node)
> > Average execution time: 466
> > Best execution time : 426
> > Worst execution time   : 508
> >
> > ceph write numbers (3 data disks/data node + 3 journal disks/data node)
> > Average execution time: 610
> > Best execution time : 593
> > Worst execution time   : 635
> >
> > So ceph was about 1.3x slower for the average case when journal & data
> are
> > separated .. a 70% improvement over the case where journal + data are on
> the
> > same disk - but still a bit off from the HDFS performance.
>
> Were you running 3 OSDs per node (an OSD per data/journal drive pair)?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] storage pools ceph (bobtail) auth failure in xenserver SR creation

2013-07-11 Thread Dave Scott
[sorry I didn't manage to reply to the original message; I only just joined 
this list.
Sorry if this breaks your threading!]

On 10 Jul 2013 at 16:01 John Shen wrote:

> I was following the tech preview of libvirt/ceph integration in xenserver, 
> but ran
> into an issue with ceph auth in setting up the SR. any help would be greatly
> appreciated.

I must confess that I've disabled auth in my test environment. Clearly I should
go back and enable it again :-)

> uuid was generated per: http://eu.ceph.com/docs/wip-dump/rbd/libvirt/
>
> according to inktank, storage pool auth syntax differs slightly from block 
> device
> attachment. I tried both format but got the same error.
>
> Ref:
>
> http://xenserver.org/blog/entry/tech-preview-of-xenserver-libvirt-ceph.html
>
> [root@xen01 ~]# xe sr-create type=libvirt name-label=ceph 
> device-config:xml-filename=ceph.xml
> Error code: libvirt
> Error parameters: libvirt: VIR_ERR_65: VIR_FROM_30: Invalid secret: 
> virSecretFree

The "xe sr-create" call is handled by "xapi" which calls "xapi-libvirt-storage" 
which
uses the libvirt API directly to create the pool. It _should_ do the same as 
running

virsh pool-create ceph.xml

Could you try the "virsh pool-create" and see if that works? If it does, then 
we need
to figure out what the "virsh" CLI is doing that my Pool.create function call 
isn't. If
it doesn't then there might be some other missing step. Did you have to 
pre-create
a secret (is that "virsh secret-create"?)
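
For reference, the rough equivalent of those steps through the libvirt Python 
bindings would be something like the sketch below (the connection URI, monitor 
hostname, pool name and key handling are placeholder assumptions on my part, 
and the XML is the generic ceph secret/pool form rather than your exact 
ceph.xml):

# Hedged sketch of defining a ceph secret and creating an RBD pool via the
# libvirt Python bindings; all names/hosts/URIs below are placeholders.
import base64
import libvirt

SECRET_XML = """
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.admin secret</name>
  </usage>
</secret>
"""

POOL_XML = """
<pool type='rbd'>
  <name>ceph</name>
  <source>
    <name>rbd</name>
    <host name='mon1.example.com' port='6789'/>
    <auth username='admin' type='ceph'>
      <secret uuid='%s'/>
    </auth>
  </source>
</pool>
"""

# Connection URI is a placeholder; use whatever your libvirt stack talks to.
conn = libvirt.open("qemu:///system")

# Equivalent of "virsh secret-define".
secret = conn.secretDefineXML(SECRET_XML, 0)

# Equivalent of "virsh secret-set-value --base64": the API wants the raw key,
# so decode the base64 string printed by "ceph auth list".
with open("client.admin.key") as f:
    secret.setValue(base64.b64decode(f.read().strip()), 0)

# Equivalent of "virsh pool-create ceph.xml".
pool = conn.storagePoolCreateXML(POOL_XML % secret.UUIDString(), 0)
print("created pool:", pool.name())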

Cheers,
Dave Scott

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cuttlefish VS Bobtail performance series

2013-07-11 Thread Mark Nelson

And We've now got part 3 out showing 128K FIO results:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-3-128k-rbd-performance/

Mark

On 07/10/2013 11:01 AM, Mark Nelson wrote:

Hello again!

Part 2 is now out!  We've got a whole slew of results for 4K FIO tests
on RBD:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-2-4k-rbd-performance/


Mark

On 07/09/2013 08:41 AM, Mark Nelson wrote:

Hi Guys,

Just wanted to let everyone know that we've released part 1 of a series
of performance articles that looks at Cuttlefish vs Bobtail on our
Supermicro test chassis.  We'll be looking at both RADOS bench and RBD
performance with a variety of IO sizes, IO patterns, concurrency levels,
file systems, and more!

Every day this week we'll be releasing a new part in the series.  Here's
a link to part 1:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-1-introduction-and-rados-bench/



Thanks!
Mark




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Logging

2013-07-11 Thread Derek Yarnell
>> It will never log anything to /var/log/ceph/radosgw.log.  I am looking
>> for the debug output which I have seen people post, does anyone have a
>> pointer to what could be going on?
> 
> You don't need the 'rgw_enable_ops_log' to have debug logs. The
> log_file param should be enough. Do you have permissions to write into
> /var/log/ceph?

Hi,

Weirdly I had thought it was a permissions issue and had tried all sorts
of combinations yesterday.  Removing the rgw_enable_ops_log and again
chowning the directory to the apache user seemed to fix this. Thanks for 
the note, and sorry for the dumb question.

Thanks,
derek

-- 
---
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Tuning options for 10GE ethernet and ceph

2013-07-11 Thread Mihály Árva-Tóth
Hello,

We are planning to use Intel 10 GE Ethernet between the OSD nodes. The host
operating system will be Ubuntu 12.04 x86_64. Are there any recommended
tuning options (e.g. sysctl and ceph)?

Thank you,
Mihaly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tuning options for 10GE ethernet and ceph

2013-07-11 Thread Mark Nelson

On 07/11/2013 10:04 AM, Mihály Árva-Tóth wrote:

Hello,

We are planning to use Intel 10 GE Ethernet between the OSD nodes. The host
operating system will be Ubuntu 12.04 x86_64. Are there any recommended
tuning options (e.g. sysctl and ceph)?

Thank you,
Mihaly


Hi,

Generally if performance and latency look good with something like iperf 
and a couple of parallel streams you should be able to get good 
performance with Ceph.  You may find that using jumbo frames can help in 
some circumstances.  In some cases we've seen that TCP autotuning can 
cause issues (primarily with reads!), but I think we've basically got 
that solved through a ceph tunable now.


Mark




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tuning options for 10GE ethernet and ceph

2013-07-11 Thread Mihály Árva-Tóth
2013/7/11 Mark Nelson 

> On 07/11/2013 10:04 AM, Mihály Árva-Tóth wrote:
>
>> Hello,
>>
>> We are planning to use Intel 10 GE ethernet between nodes of OSDs. Host
>> operation system will be Ubuntu 12.04 x86_64. Are there any
>> recommendations available to tuning options (ex. sysctl and ceph)?
>>
>> Thank you,
>> Mihaly
>>
>
> Hi,
>
> Generally if performance and latency look good with something like iperf
> and a couple of parallel streams you should be able to get good performance
> with Ceph.  You may find that using jumbo frames can help in some
> circumstances.  In some cases we've seen that TCP autotuning can cause
> issues (primarily with reads!), but I think we've basically got that solved
> through a ceph tunable now.


Hi Mark,

Thank you. So are there no Ceph-related configuration options that I can
tune for good performance on a 10GE network? Where can I read more about
the TCP autotuning issues?

Regards,
Mihaly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Tuning options for 10GE ethernet and ceph

2013-07-11 Thread Mark Nelson

On 07/11/2013 10:27 AM, Mihály Árva-Tóth wrote:

2013/7/11 Mark Nelson <mark.nel...@inktank.com>

On 07/11/2013 10:04 AM, Mihály Árva-Tóth wrote:

Hello,

We are planning to use Intel 10 GE ethernet between nodes of
OSDs. Host
operation system will be Ubuntu 12.04 x86_64. Are there any
recommendations available to tuning options (ex. sysctl and ceph)?

Thank you,
Mihaly


Hi,

Generally if performance and latency look good with something like
iperf and a couple of parallel streams you should be able to get
good performance with Ceph.  You may find that using jumbo frames
can help in some circumstances.  In some cases we've seen that TCP
autotuning can cause issues (primarily with reads!), but I think
we've basically got that solved through a ceph tunable now.


Hi Mark,

Thank you. So are there no Ceph-related configuration options that I can
tune for good performance on a 10GE network? Where can I read more about
the TCP autotuning issues?


Nothing really comes to mind as far as Ceph goes.  You may want to use a 
separate front and back network if you have the ports/switches 
available.  Having said that, I've got a test setup where I used a 
bonded 10GbE interface, and with RADOS bench was able to achieve 2GB/s 
with no special Ceph network options beyond specifying that I wanted to 
use the 10GbE network.  Of course you'll need the clients, concurrency, 
and backend disks to really get that.


The tcp autotuning issues were first discovered by Jim Schutt about a 
year ago and reported on ceph-devel:


http://www.spinics.net/lists/ceph-devel/msg05049.html

And our workaround:

http://tracker.ceph.com/issues/2100




Regards,
Mihaly


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] storage pools ceph (bobtail) auth failure in xenserver SR creation

2013-07-11 Thread John Shen
Hi Dave, Thank you so much for getting back to me.

the command returns the same errors:

[root@xen02 ~]# virsh pool-create ceph.xml
error: Failed to create pool from ceph.xml
error: Invalid secret: virSecretFree

[root@xen02 ~]#

The secret was pre-created for the admin user that I use elsewhere with no
issues (rbd mount, cephfs, etc.), and per the ceph documentation, I just set
the secret value with this command:

 virsh secret-set-value $(cat uuid) --base64 $(cat client.admin.key)

where the key is obtained from

 ceph auth list

and uuid is generated by

virsh secret-define --file secret.xml

# cat secret.xml


client.admin $(cat client.admin.key)





On Thu, Jul 11, 2013 at 7:22 AM, Dave Scott wrote:

> [sorry I didn't manage to reply to the original message; I only just
> joined this list.
> Sorry if this breaks your threading!]
>
> On 10 Jul 2013 at 16:01 John Shen wrote:
>
> > I was following the tech preview of libvirt/ceph integration in
> xenserver, but ran
> > into an issue with ceph auth in setting up the SR. any help would be
> greatly
> > appreciated.
>
> I must confess that I've disabled auth in my test environment. Clearly I
> should
> go back and enable it again :-)
>
> > uuid was generated per: http://eu.ceph.com/docs/wip-dump/rbd/libvirt/
> >
> > according to inktank, storage pool auth syntax differs slightly from
> block device
> > attachment. I tried both format but got the same error.
> >
> > Ref:
> >
> >
> http://xenserver.org/blog/entry/tech-preview-of-xenserver-libvirt-ceph.html
> >
> > [root@xen01 ~]# xe sr-create type=libvirt name-label=ceph
> device-config:xml-filename=ceph.xml
> > Error code: libvirt
> > Error parameters: libvirt: VIR_ERR_65: VIR_FROM_30: Invalid secret:
> virSecretFree
>
> The "xe sr-create" call is handled by "xapi" which calls
> "xapi-libvirt-storage" which
> uses the libvirt API directly to create the pool. It _should_ do the same
> as running
>
> virsh pool-create ceph.xml
>
> Could you try the "virsh pool-create" and see if that works? If it does,
> then we need
> to figure out what the "virsh" CLI is doing that my Pool.create function
> call isn't. If
> it doesn't then there might be some other missing step. Did you have to
> pre-create
> a secret (is that "virsh secret-create"?)
>
> Cheers,
> Dave Scott
>
>


-- 
--John Shen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cuttlefish VS Bobtail performance series

2013-07-11 Thread Erwan Velu

On 11/07/2013 16:56, Mark Nelson wrote:

And We've now got part 3 out showing 128K FIO results:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-3-128k-rbd-performance/ 


Hey Mark,

Since you speak about 10GbE at the end of your document, I have the 
following questions for you.


What kind of network switch are you using? That's not listed in the 
hardware setup.

Did you configure anything notable on it?

Did you measure the network bandwidth between your hosts to see if you 
reach 10GbE?
On my setup (I'm close to releasing a set of tools for benchmarking & 
graphing those), I needed to use jumbo frames with an MTU of 7500. Is that 
your case too? If so, it would be lovely to understand your tuning.

At MTU=1500 I got only 6 Gbps, while 7500 gave me more than 9 Gbps.

That could be very valuable to others doing benchmarking or those 
who want to optimize their setup.


Thanks for your great work,
Erwan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cuttlefish VS Bobtail performance series

2013-07-11 Thread Mark Nelson

On 07/11/2013 11:16 AM, Erwan Velu wrote:

On 11/07/2013 16:56, Mark Nelson wrote:

And We've now got part 3 out showing 128K FIO results:

http://ceph.com/performance-2/ceph-cuttlefish-vs-bobtail-part-3-128k-rbd-performance/


Hey Mark,


Hi!



Since you speak about 10GbE at the end of your document, I have the
following questions for you.

What kind of network switch are you using? That's not listed in the
hardware setup.


I cheated and directly connected the NICs in each node with 3' SFP+ 
cables.  The bonding is just linux round-robin.  This is probably about 
as good as it gets from a throughput and latency perspective!



Did you configure anything notable on it?


Not too much beyond the craziness of getting a bridge working on top of 
bonded 10GbE interfaces.  I did tweak TCP reordering to help out:


net.ipv4.tcp_reordering=127



Did you measure the network bandwidth between your hosts to see if you
reach 10GbE?


I ran iperf on the bonded link and was sitting right around 2GB/s in 
both directions with multiple streams.  I also did some iperf tests from 
individual VMs and was able to get similar (maybe slightly less) 
throughput.  Now that I think about it, I'm not sure I did a test with 
parallel concurrent iperfs from all VMs, which would have been a good 
test to do.



On my setup (I'm close to releasing a set of tools for benchmarking &
graphing those), I needed to use jumbo frames with an MTU of 7500. Is that
your case too? If so, it would be lovely to understand your tuning.
At MTU=1500 I got only 6 Gbps, while 7500 gave me more than 9 Gbps.


I'm actually using MTU=1500.  I suspect I can get away with it because 
the cards are directly connected.  I fiddled with increasing it up to 
9000, but ran into some strange issues with the bonding/bridge and had 
worse performance and stability so I returned it back to 1500.  The 
bonding/bridge setup was pretty finicky to get working.




That could be very valuable to others doing benchmarking or those
who want to optimize their setup.


I think the best advice here is to know your network, what your hardware 
is capable of doing, and read the documentation in the kernel src.  The 
impression I've gotten over the years is that network tuning is as much 
of an art as disk IO tuning.  You really need to know what your software 
is doing, what's happening at the hardware/driver level, and what's 
happening at the switches.  On big deployments, just dealing with 
bisection bandwidth issues on supposed fat-tree topology switches can be 
a project by itself!




Thanks for your great work,
Erwan


Thank you!  I really like to hear that people are enjoying the articles.

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Gilles Mocellin

On 11/07/2013 12:08, Tom Verdaat wrote:

Hi guys,

We want to use our Ceph cluster to create a shared disk file system to 
host VM's. Our preference would be to use CephFS but since it is not 
considered stable I'm looking into alternatives.


The most appealing alternative seems to be to create a RBD volume, 
format it with a cluster file system and mount it on all the VM host 
machines.


Obvious file system candidates would be OCFS2 and GFS2 but I'm having 
trouble finding recent and reliable documentation on the performance, 
features and reliability of these file systems, especially related to 
our specific use case. The specifics I'm trying to keep in mind are:


  * Using it to host VM ephemeral disks means the file system needs to
perform well with few but very large files and usually machines
don't try to compete for access to the same file, except for
during live migration.
  * Needs to handle scale well (large number of nodes, manage a volume
of tens of terabytes and file sizes of tens or hundreds of
gigabytes) and handle online operations like increasing the volume
size.
  * Since the cluster FS is already running on a distributed storage
system (Ceph), the file system does not need to concern itself
with things like replication. Just needs to not get corrupted and
be fast of course.


Anybody here that can help me shed some light on the following questions:

 1. Are there other cluster file systems to consider besides OCFS2 and
GFS2?
 2. Which one would yield the best performance for our use case?
 3. Is anybody doing this already and willing to share their experience?
 4. Is there anything important that you think we might have missed?



Hello,

Yes, you missed that qemu can use a RADOS volume directly.
Look here:
http://ceph.com/docs/master/rbd/qemu-rbd/

Create :
qemu-img create -f rbd rbd:data/squeeze 10G

Use :

qemu -m 1024 -drive format=raw,file=rbd:data/squeeze


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] storage pools ceph (bobtail) auth failure in xenserver SR creation

2013-07-11 Thread Wido den Hollander

Hi.

So, the problem here is a couple of things.

First: libvirt doesn't handle RBD storage pools without auth. That's my 
bad, but I never resolved that bug: http://tracker.ceph.com/issues/3493


For now, make sure cephx is enabled.

Also, the commands you are using don't seem to be right.

It should be:

$ virsh secret-set-value $(cat uuid) 

Could you try again with cephx enabled and setting the secret value like 
mentioned above?


Wido

On 07/11/2013 06:00 PM, John Shen wrote:

Hi Dave, Thank you so much for getting back to me.

the command returns the same errors:

[root@xen02 ~]# virsh pool-create ceph.xml
error: Failed to create pool from ceph.xml
error: Invalid secret: virSecretFree

[root@xen02 ~]#

the secret was precreated for the user admin that I use elsewhere with
no issues (rbd mount, cephfs etc.), and per the ceph documentation, i
just set the secret value with this command

  virsh secret-set-value $(cat uuid) --base64 $(cat client.admin.key)

where the key is obtained from

  ceph auth list

and uuid is generated by

virsh secret-define --file secret.xml

# cat secret.xml

 
 client.admin $(cat client.admin.key)
 




On Thu, Jul 11, 2013 at 7:22 AM, Dave Scott <dave.sc...@eu.citrix.com> wrote:

[sorry I didn't manage to reply to the original message; I only just
joined this list.
Sorry if this breaks your threading!]

On 10 Jul 2013 at 16:01 John Shen wrote:

 > I was following the tech preview of libvirt/ceph integration in
xenserver, but ran
 > into an issue with ceph auth in setting up the SR. any help would
be greatly
 > appreciated.

I must confess that I've disabled auth in my test environment.
Clearly I should
go back and enable it again :-)

 > uuid was generated per: http://eu.ceph.com/docs/wip-dump/rbd/libvirt/
 >
 > according to inktank, storage pool auth syntax differs slightly
from block device
 > attachment. I tried both format but got the same error.
 >
 > Ref:
 >
 >
http://xenserver.org/blog/entry/tech-preview-of-xenserver-libvirt-ceph.html
 >
 > [root@xen01 ~]# xe sr-create type=libvirt name-label=ceph
device-config:xml-filename=ceph.xml
 > Error code: libvirt
 > Error parameters: libvirt: VIR_ERR_65: VIR_FROM_30: Invalid
secret: virSecretFree

The "xe sr-create" call is handled by "xapi" which calls
"xapi-libvirt-storage" which
uses the libvirt API directly to create the pool. It _should_ do the
same as running

virsh pool-create ceph.xml

Could you try the "virsh pool-create" and see if that works? If it
does, then we need
to figure out what the "virsh" CLI is doing that my Pool.create
function call isn't. If
it doesn't then there might be some other missing step. Did you have
to pre-create
a secret (is that "virsh secret-create"?)

Cheers,
Dave Scott




--
--John Shen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] storage pools ceph (bobtail) auth failure in xenserver SR creation

2013-07-11 Thread John Shen
Wido, Thanks! I tried again with your command syntax but the result is the
same.

[root@xen02 ~]# virsh secret-set-value $(cat uuid) $(cat client.admin.key)
Secret value set

[root@xen02 ~]# xe sr-create type=libvirt name-label=ceph
device-config:xml-filename=ceph.xml
Error code: libvirt
Error parameters: libvirt: VIR_ERR_65: VIR_FROM_30: Invalid secret:
virSecretFree
[root@xen02 ~]#  virsh pool-create ceph.xml
error: Failed to create pool from ceph.xml
error: Invalid secret: virSecretFree

[root@xen02 ~]#



On Thu, Jul 11, 2013 at 1:14 PM, Wido den Hollander  wrote:

> Hi.
>
> So, the problem here is a couple of things.
>
> First: libvirt doesn't handle RBD storage pools without auth. That's my
> bad, but I never resolved that bug: 
> http://tracker.ceph.com/issues/3493
>
> For now, make sure cephx is enabled.
>
> Also, the commands you are using don't seem to be right.
>
> It should be:
>
> $ virsh secret-set-value $(cat uuid) 
>
> Could you try again with cephx enabled and setting the secret value like
> mentioned above?
>
> Wido
>
>
> On 07/11/2013 06:00 PM, John Shen wrote:
>
>> Hi Dave, Thank you so much for getting back to me.
>>
>> the command returns the same errors:
>>
>> [root@xen02 ~]# virsh pool-create ceph.xml
>> error: Failed to create pool from ceph.xml
>> error: Invalid secret: virSecretFree
>>
>> [root@xen02 ~]#
>>
>> the secret was precreated for the user admin that I use elsewhere with
>> no issues (rbd mount, cephfs etc.), and per the ceph documentation, i
>> just set the secret value with this command
>>
>>   virsh secret-set-value $(cat uuid) --base64 $(cat client.admin.key)
>>
>> where the key is obtained from
>>
>>   ceph auth list
>>
>> and uuid is generated by
>>
>> virsh secret-define --file secret.xml
>>
>> # cat secret.xml
>> 
>>  
>>  client.admin $(cat client.admin.key)
>>  
>> 
>>
>>
>>
>> On Thu, Jul 11, 2013 at 7:22 AM, Dave Scott > > wrote:
>>
>> [sorry I didn't manage to reply to the original message; I only just
>> joined this list.
>> Sorry if this breaks your threading!]
>>
>> On 10 Jul 2013 at 16:01 John Shen wrote:
>>
>>  > I was following the tech preview of libvirt/ceph integration in
>> xenserver, but ran
>>  > into an issue with ceph auth in setting up the SR. any help would
>> be greatly
>>  > appreciated.
>>
>> I must confess that I've disabled auth in my test environment.
>> Clearly I should
>> go back and enable it again :-)
>>
>>  > uuid was generated per: http://eu.ceph.com/docs/wip-dump/rbd/libvirt/
>>  >
>>  > according to inktank, storage pool auth syntax differs slightly
>> from block device
>>  > attachment. I tried both format but got the same error.
>>  >
>>  > Ref:
>>  >
>>  >
>> http://xenserver.org/blog/entry/tech-preview-of-xenserver-libvirt-ceph.html
>>  >
>>  > [root@xen01 ~]# xe sr-create type=libvirt name-label=ceph
>> device-config:xml-filename=ceph.xml
>>  > Error code: libvirt
>>  > Error parameters: libvirt: VIR_ERR_65: VIR_FROM_30: Invalid
>> secret: virSecretFree
>>
>> The "xe sr-create" call is handled by "xapi" which calls
>> "xapi-libvirt-storage" which
>> uses the libvirt API directly to create the pool. It _should_ do the
>> same as running
>>
>> virsh pool-create ceph.xml
>>
>> Could you try the "virsh pool-create" and see if that works? If it
>> does, then we need
>> to figure out what the "virsh" CLI is doing that my Pool.create
>> function call isn't. If
>> it doesn't then there might be some other missing step. Did you have
>> to pre-create
>> a secret (is that "virsh secret-create"?)
>>
>> Cheers,
>> Dave Scott
>>
>>
>>
>>
>> --
>> --John Shen
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
--John Shen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Alex Bligh

On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
> 
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
> 
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
> 
> Use :
> 
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same 
filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

-- 
Alex Bligh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Check creating

2013-07-11 Thread Mandell Degerness
Is there any command (in shell or python API), that can tell me if
ceph is still creating pgs other than actually attempting a
modification of the pg_num or pgp_num of a pool?  I would like to
minimize the number of errors I get and not keep trying the commands
until success, if possible.

Right now, I run the command "ceph osd pool set  pg_num "
and immediately attempt the "ceph osd pool set  pgp_num "
command, repeating the latter until success happens or I get an error
other than "still creating pgs"

It would be nice to run the first command, wait for the cluster to be
ready again, then run the second.

As a bonus, if there is a reasonable way to do this (expand pgs)
entirely in python without running any shell commands, that would be
awesome.

-Mandell
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread McNamara, Bradley
Correct me if I'm wrong, I'm new to this, but I think the distinction between 
the two methods is that using 'qemu-img create -f rbd' creates an RBD for 
either a VM to boot from, or for mounting within a VM.  Whereas, the OP wants a 
single RBD, formatted with a cluster file system, to use as a place for 
multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in 
different ways people have implemented their VM infrastructure using RBD.  I 
guess one of the advantages of using 'qemu-img create -f rbd' is that a 
snapshot of a single RBD would capture just the changed RBD data for that VM, 
whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, 
would capture changes of all the VM's, not just one.  It might provide more 
administrative agility to use the former.

Also, I guess another question would be: when an RBD is expanded, does the 
underlying VM that is created using 'qemu-img create -f rbd' need to be 
rebooted to "see" the additional space?  My guess would be yes.

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
> 
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
> 
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
> 
> Use :
> 
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same 
filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

-- 
Alex Bligh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Possible bug with image.list_lockers()

2013-07-11 Thread Mandell Degerness
I'm not certain what the correct behavior should be in this case, so
maybe it is not a bug, but here is what is happening:

When an OSD becomes full, a process fails and we unmount the rbd and
attempt to remove the lock associated with the rbd for the process.
The unmount works fine, but removing the lock is failing right now
because the list_lockers() function call never returns.

Here is a code snippet I tried with a fake rbd lock on a test cluster:

import rbd
import rados

with rados.Rados(conffile='/etc/ceph/ceph.conf') as cluster:
    with cluster.open_ioctx('rbd') as ioctx:
        with rbd.Image(ioctx, 'msd1') as image:
            image.list_lockers()

The process never returns, even after the ceph cluster is returned to
healthy.  The only indication of the error is an error in the
/var/log/messages file:

Jul 11 23:25:05 node-172-16-0-13 python: 2013-07-11 23:25:05.826793
7ffc66d72700  0 client.6911.objecter  FULL, paused modify
0x7ffc687c6050 tid 2

Any help would be greatly appreciated.

ceph version:

ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Tom Verdaat
You are right, I do want a single RBD, formatted with a cluster file
system, to use as a place for multiple VM image files to reside.

Doing everything straight from volumes would be more effective with regard
to snapshots, using CoW, etc., but unfortunately for now OpenStack nova
insists on having an ephemeral disk and copying it to its local
/var/lib/nova/instances directory. If you want to be able to do live
migrations and such, you need to mount a cluster filesystem at that path
on every host machine.

And that's what my questions were about!

Tom



2013/7/12 McNamara, Bradley 

> Correct me if I'm wrong, I'm new to this, but I think the distinction
> between the two methods is that using 'qemu-img create -f rbd' creates an
> RBD for either a VM to boot from, or for mounting within a VM.  Whereas,
> the OP wants a single RBD, formatted with a cluster file system, to use as
> a place for multiple VM image files to reside.
>
> I've often contemplated this same scenario, and would be quite interested
> in different ways people have implemented their VM infrastructure using
> RBD.  I guess one of the advantages of using 'qemu-img create -f rbd' is
> that a snapshot of a single RBD would capture just the changed RBD data for
> that VM, whereas a snapshot of a larger RBD with OCFS2 and multiple VM
> images on it, would capture changes of all the VM's, not just one.  It
> might provide more administrative agility to use the former.
>
> Also, I guess another question would be, when a RBD is expanded, does the
> underlying VM that is created using 'qemu-img  create -f rbd' need to be
> rebooted to "see" the additional space.  My guess would be, yes.
>
> Brad
>
> -Original Message-
> From: ceph-users-boun...@lists.ceph.com [mailto:
> ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
> Sent: Thursday, July 11, 2013 2:03 PM
> To: Gilles Mocellin
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?
>
>
> On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
>
> > Hello,
> >
> > Yes, you missed that qemu can use directly RADOS volume.
> > Look here :
> > http://ceph.com/docs/master/rbd/qemu-rbd/
> >
> > Create :
> > qemu-img create -f rbd rbd:data/squeeze 10G
> >
> > Use :
> >
> > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
>
> I don't think he did. As I read it he wants his VMs to all access the same
> filing system, and doesn't want to use cephfs.
>
> OCFS2 on RBD I suppose is a reasonable choice for that.
>
> --
> Alex Bligh
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Tom Verdaat
Hi Alex,

We're planning to deploy OpenStack Grizzly using KVM. I agree that running
every VM directly from RBD devices would be preferable, but booting from
volumes is not one of OpenStack's strengths and configuring nova to make
boot from volume the default method that works automatically is not really
feasible yet.

So the alternative is to mount a shared filesystem
on /var/lib/nova/instances of every compute node. Hence the RBD +
OCFS2/GFS2 question.

Tom

p.s. yes I've read the
rbd-openstack page
which covers images and persistent volumes, not running instances which is
what my question is about.


2013/7/12 Alex Bligh 

> Tom,
>
> On 11 Jul 2013, at 22:28, Tom Verdaat wrote:
>
> > Actually I want my running VMs to all be stored on the same file system,
> so we can use live migration to move them between hosts.
> >
> > QEMU is not going to help because we're not using it in our
> virtualization solution.
>
> Out of interest, what are you using in your virtualization solution? Most
> things (including modern Xen) seem to use Qemu for the back end. If your
> virtualization solution does not use qemu as a back end, you can use kernel
> rbd devices straight which I think will give you better performance than
> OCFS2 on RBD devices.
>
> A
>
> >
> > 2013/7/11 Alex Bligh 
> >
> > On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
> >
> > > Hello,
> > >
> > > Yes, you missed that qemu can use directly RADOS volume.
> > > Look here :
> > > http://ceph.com/docs/master/rbd/qemu-rbd/
> > >
> > > Create :
> > > qemu-img create -f rbd rbd:data/squeeze 10G
> > >
> > > Use :
> > >
> > > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
> >
> > I don't think he did. As I read it he wants his VMs to all access the
> same filing system, and doesn't want to use cephfs.
> >
> > OCFS2 on RBD I suppose is a reasonable choice for that.
> >
> > --
> > Alex Bligh
> >
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Alex Bligh
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Darryl Bond



Tom,
I'm no expert as I didn't set it up, but we are using OpenStack Grizzly with KVM/QEMU and RBD volumes for VMs.
We boot the VMs from the RBD volumes and it all seems to work just fine.
Migration works perfectly, although live (no-break) migration only works from the command-line tools. The GUI uses the pause, migrate, then un-pause mode.
Layered snapshot/cloning works just fine through the GUI. I would say Grizzly has pretty good integration with Ceph.

Regards
Darryl

On 07/12/13 09:41, Tom Verdaat wrote:


Hi Alex,


We're planning to deploy OpenStack Grizzly using KVM. I agree that running every VM directly from RBD devices would be preferable, but booting from volumes is not one of OpenStack's strengths
 and configuring nova to make boot from volume the default method that works automatically is not really feasible yet.


So the alternative is to mount a shared filesystem on /var/lib/nova/instances of every compute node. Hence the RBD + OCFS2/GFS2 question.


Tom


p.s. yes I've read the rbd-openstack page which covers images and persistent volumes,
 not running instances which is what my question is about.



2013/7/12 Alex Bligh 

Tom,

On 11 Jul 2013, at 22:28, Tom Verdaat wrote:

> Actually I want my running VMs to all be stored on the same file system, so we can use live migration to move them between hosts.
>
> QEMU is not going to help because we're not using it in our virtualization solution.


Out of interest, what are you using in your virtualization solution? Most things (including modern Xen) seem to use Qemu for the back end. If your virtualization solution does not use qemu as a back end, you can use kernel rbd devices straight which I think
 will give you better performance than OCFS2 on RBD devices.


A

>
> 2013/7/11 Alex Bligh 
>
> On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:
>
> > Hello,
> >
> > Yes, you missed that qemu can use directly RADOS volume.
> > Look here :
> > 
http://ceph.com/docs/master/rbd/qemu-rbd/
> >
> > Create :
> > qemu-img create -f rbd rbd:data/squeeze 10G
> >
> > Use :
> >
> > qemu -m 1024 -drive format=raw,file=rbd:data/squeeze
>
> I don't think he did. As I read it he wants his VMs to all access the same filing system, and doesn't want to use cephfs.
>
> OCFS2 on RBD I suppose is a reasonable choice for that.
>
> --
> Alex Bligh
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Alex Bligh















___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-deploy

2013-07-11 Thread SUNDAY A. OLUTAYO
I would love to know the difference between "ceph-deploy new host" and 
"ceph-deploy new mon". I would appreciate your help.

Sent from my LG Mobile

"McNamara, Bradley"  wrote:

Correct me if I'm wrong, I'm new to this, but I think the distinction between 
the two methods is that using 'qemu-img create -f rbd' creates an RBD for 
either a VM to boot from, or for mounting within a VM.  Whereas, the OP wants a 
single RBD, formatted with a cluster file system, to use as a place for 
multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in 
different ways people have implemented their VM infrastructure using RBD.  I 
guess one of the advantages of using 'qemu-img create -f rbd' is that a 
snapshot of a single RBD would capture just the changed RBD data for that VM, 
whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, 
would capture changes of all the VM's, not just one.  It might provide more 
administrative agility to use the former.

Also, I guess another question would be, when a RBD is expanded, does the 
underlying VM that is created using 'qemu-img  create -f rbd' need to be 
rebooted to "see" the additional space.  My guess would be, yes.

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
> 
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
> 
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
> 
> Use :
> 
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same 
filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

-- 
Alex Bligh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

2013-07-11 Thread Youd, Douglas
Depending on which hypervisor he's using, it may not be possible to mount the 
RBDs natively.

For instance, the elephant in the room... ESXi.

I've pondered several architectures for presenting Ceph to ESXi which may 
be related to this thread.

1) Large RBDs (2TB-512B), re-presented through an iSCSI gateway (hopefully in 
an HA config pair). VMFS, with VMDKs on top.
        * Seems to have been done a couple of times already; not sure of the 
success.
        * Small number of RBDs required, so not a frequent task. Perhaps the 
dev time spent on provisioning automation can be reduced.

2) Large CephFS volumes (20+ TB), re-presented through NFS gateways. VMDKs on 
top.
        * Fewer abstraction layers, hopefully better pass-through of commands.
        * Any improvements to CephFS should be available to VMware 
(de-dupe, for instance).
        * Easy to manage from a VMware perspective; NFS is pretty commonly 
deployed, large volumes.
        * No multi-MDS means this is not viable... yet.

3) Small RBDs (10s-100s of GB), re-presented through an iSCSI gateway, RDM to 
VMs directly.
        * Possibly more appropriate for Ceph (lots of small RBDs)
        * Harder to manage; more automation will be required for provisioning
        * Cloning of templates, etc. may be harder.

Just my 2c anyway

Douglas Youd
Cloud Solution Architect
ZettaGrid



-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of McNamara, Bradley
Sent: Friday, 12 July 2013 8:19 AM
To: Alex Bligh; Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?

Correct me if I'm wrong, I'm new to this, but I think the distinction between 
the two methods is that using 'qemu-img create -f rbd' creates an RBD for 
either a VM to boot from, or for mounting within a VM.  Whereas, the OP wants a 
single RBD, formatted with a cluster file system, to use as a place for 
multiple VM image files to reside.

I've often contemplated this same scenario, and would be quite interested in 
different ways people have implemented their VM infrastructure using RBD.  I 
guess one of the advantages of using 'qemu-img create -f rbd' is that a 
snapshot of a single RBD would capture just the changed RBD data for that VM, 
whereas a snapshot of a larger RBD with OCFS2 and multiple VM images on it, 
would capture changes of all the VM's, not just one.  It might provide more 
administrative agility to use the former.

Also, I guess another question would be, when a RBD is expanded, does the 
underlying VM that is created using 'qemu-img  create -f rbd' need to be 
rebooted to "see" the additional space.  My guess would be, yes.

Brad

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Bligh
Sent: Thursday, July 11, 2013 2:03 PM
To: Gilles Mocellin
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OCFS2 or GFS2 for cluster filesystem?


On 11 Jul 2013, at 19:25, Gilles Mocellin wrote:

> Hello,
>
> Yes, you missed that qemu can use directly RADOS volume.
> Look here :
> http://ceph.com/docs/master/rbd/qemu-rbd/
>
> Create :
> qemu-img create -f rbd rbd:data/squeeze 10G
>
> Use :
>
> qemu -m 1024 -drive format=raw,file=rbd:data/squeeze

I don't think he did. As I read it he wants his VMs to all access the same 
filing system, and doesn't want to use cephfs.

OCFS2 on RBD I suppose is a reasonable choice for that.

--
Alex Bligh




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Turning off ceph journaling with xfs ?

2013-07-11 Thread ker can
Hi,

Is it possible to turn off ceph journaling if I switch to xfs?
For using it as a storage layer for Hadoop, we're concerned about the
additional requirement for separate SSDs ($$), etc.  In our testing we're
seeing a performance hit when using the same disk for both journal and data
... so we're investigating the possibility of being able to turn it off
completely, since xfs is already a journaled FS.

thanks
kc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Check creating

2013-07-11 Thread Sage Weil
On Thu, 11 Jul 2013, Mandell Degerness wrote:
> Is there any command (in shell or python API), that can tell me if
> ceph is still creating pgs other than actually attempting a
> modification of the pg_num or pgp_num of a pool?  I would like to
> minimize the number of errors I get and not keep trying the commands
> until success, if possible.
> 
> Right now, I run the command "ceph osd pool set  pg_num "
> and immediately attempt he "ceph osd pool set  pgp_num "
> command, repeating the latter until success happens or I get an error
> other than "still creating pgs"
> 
> It would be nice to run the first command, wait for the cluster to be
> ready again, then run the second.

You can get this now from 'ceph status --format=json' but, looking at it 
now, the json needs to be restructured; we'll do that now so that dumpling 
will have something more sane.  For now, just look for 'creating' in the 
'pgmap' key.  I'll fix it to print all that info in a structured way.
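
For the Python part of your question, a throwaway sketch of that polling loop 
might look like the following (it shells out to the CLI and just greps the 
pgmap section for 'creating', so expect to adjust it once the JSON is 
restructured; pool name and pg counts are placeholders):

# Minimal sketch: wait until no PGs are reported as "creating" before
# bumping pgp_num, instead of retrying the command until it succeeds.
import json
import subprocess
import time

def pgs_still_creating():
    out = subprocess.check_output(["ceph", "status", "--format=json"])
    status = json.loads(out)
    # Crude but tolerant of layout changes: look for "creating" anywhere
    # in the pgmap section.
    return "creating" in json.dumps(status.get("pgmap", status))

def wait_for_pg_creation(poll=5, timeout=600):
    deadline = time.time() + timeout
    while pgs_still_creating():
        if time.time() > deadline:
            raise RuntimeError("timed out waiting for PG creation to finish")
        time.sleep(poll)

subprocess.check_call(["ceph", "osd", "pool", "set", "mypool", "pg_num", "2048"])
wait_for_pg_creation()
subprocess.check_call(["ceph", "osd", "pool", "set", "mypool", "pgp_num", "2048"])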

> As a bonus, if there is a reasonable way to do this (expand pgs)
> entirely in python without running any shell commands, that would be
> awesome.

The REST API endpoint just got merged this week and can do everything that 
the CLI can do.  It will be in dumpling as well.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Turning off ceph journaling with xfs ?

2013-07-11 Thread Mark Nelson

Hi Ker,

Unfortunately no.  Ceph uses the journal for internal consistency and 
atomicity and it can't use the XFS journal for it.  On the BTRFS side, 
we've been investigating allowing the Ceph journal to be on the same 
disk as the OSD and doing a clone() operation to effectively reduce the 
journal write penalty, but that feature hasn't been implemented yet.


Mark

On 07/11/2013 08:12 PM, ker can wrote:

Hi,

Is it possible to turn off ceph journaling if I switch to xfs?
For using it as a storage layer for Hadoop, we're concerned about the
additional requirement for separate SSDs ($$), etc.  In our testing
we're seeing a performance hit when using the same disk for both journal
and data... so we're investigating the possibility of being able to turn
it off completely, since xfs is already a journaled FS.

thanks
kc



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Turning off ceph journaling with xfs ?

2013-07-11 Thread Sage Weil
Note that you *can* disable the journal if you use btrfs, but your write 
latency will tend to be pretty terrible.  This is only viable for 
bulk-storage use cases where throughput trumps all and latency is not an 
issue at all (it may be seconds).

We are planning on eliminating the double-write for at least large writes 
when using btrfs by cloning data out of the journal and into the target 
file.  This is not a hugely complex task (although it is non-trivial) but 
it hasn't made it to the top of the priority list yet.

sage


On Thu, 11 Jul 2013, Mark Nelson wrote:

> Hi Ker,
> 
> Unfortunately no.  Ceph uses the journal for internal consistency and
> atomicity and it can't use the XFS journal for it.  On the BTRFS side, we've
> been investigating allowing the Ceph journal to be on the same disk as the OSD
> and doing a clone() operation to effectively reduce the journal write penalty,
> but that feature hasn't been implemented yet.
> 
> Mark
> 
> On 07/11/2013 08:12 PM, ker can wrote:
> > Hi,
> > 
> > Is it possible to turn off ceph journaling if I switch to xfs  ?
> > For using it as a storage layer for hadoop we're concerned about the
> > additional requirements for separate SSDs ($$) etc.  In our testing
> > we're seeing a performance hit when using the same disk for both journal
> > + data ... so we're investigating the possibility of being able to turn
> > it off completely, since xfs is already a journaled FS.
> > 
> > thanks
> > kc
> > 
> > 
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] latency when OSD falls out of cluster

2013-07-11 Thread Edwin Peer

Hi there,

We've been noticing nasty multi-second cluster-wide latencies if an OSD 
drops out of an active cluster (due to power failure, or even being 
stopped cleanly). We've also seen this problem occur when an OSD is 
inserted back into the cluster.


Obviously, this has the effect of freezing all VMs doing I/O across the 
cluster for several seconds when a single node fails. Is this behaviour 
expected? Or have I perhaps got something configured wrong?


We're trying very hard to eliminate all single points of failure in our 
architecture, is there anything that can be done about this?


Regards,
Edwin Peer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph-deploy

2013-07-11 Thread SUNDAY A. OLUTAYO
I am on my first exploration of ceph, and I need help understanding these terms: 
"ceph-deploy new Host", "ceph-deploy new MON Host" and "ceph-deploy mon create Host". 
I would appreciate your help.

Sent from my LG Mobile
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Num of PGs

2013-07-11 Thread Stefan Priebe - Profihost AG
Hello,

is this calculation for the number of PGs correct?

36 OSDs, Replication Factor 3

36 * 100 / 3 => 1200 PGs

But I then read that it should be a power of 2, so it should be 2048?
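
Working the rule of thumb through as a quick sketch (100 PGs per OSD divided 
by the replica count, rounded up to the next power of two, is just the 
commonly quoted heuristic):

# Quick sketch of the usual heuristic: ~100 PGs per OSD, divided by the
# replication factor, then rounded up to the next power of two.
def suggested_pg_num(num_osds, replication, pgs_per_osd=100):
    raw = num_osds * pgs_per_osd // replication
    rounded = 1
    while rounded < raw:
        rounded *= 2
    return raw, rounded

print(suggested_pg_num(36, 3))   # -> (1200, 2048)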

Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com