Re: [ceph-users] RBD Export-Diff With Children Snapshots

2014-06-10 Thread Josh Durgin
On Fri, 6 Jun 2014 17:34:56 -0700
Tyler Wilson  wrote:

> Hey All,
> 
> Simple question, does 'rbd export-diff' work with children snapshot
> aka;
> 
> root:~# rbd children images/03cb46f7-64ab-4f47-bd41-e01ced45f0b4@snap
> compute/2b65c0b9-51c3-4ab1-bc3c-6b734cc796b8_disk
> compute/54f3b23c-facf-4a23-9eaa-9d221ddb7208_disk
> compute/592065d1-264e-4f7d-8504-011c2ea3bce3_disk
> compute/9ce6d6af-c4df-442c-b433-be2bb1cef9f6_disk
> compute/f0714add-683a-4ba2-a6f3-ded7dbf193eb_disk
> 
> Could I export a diff from that image snapshot vs one of the compute
> disks?
> 
> Thanks for your help!

The rbd diff-related commands compare points in time of a single
image. Since children are identical to their parent when they're cloned,
if you create a snapshot right after cloning, you can export the diff
between the child's later state and the parent. Something like:

rbd clone parent@snap child
rbd snap create child@base

# ... time passes, the child is used and modified ...

rbd snap create child@changed
rbd export-diff --from-snap base child@changed child_changes.diff

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: CEPH Multitenancy and Data Isolation

2014-06-10 Thread Josh Durgin

On 06/10/2014 01:56 AM, Vilobh Meshram wrote:

How does Ceph guarantee data isolation for volumes which are not meant
to be shared within an OpenStack tenant?

When used with OpenStack, data isolation is provided at the OpenStack
level, so that all users who are part of the same tenant can
access/share the volumes created by users in that tenant.
Consider a case where we have one pool named “Volumes” for all the
tenants, and all the tenants use the same keyring to access the volumes
in the pool.

 1. How do we guarantee that one user can’t see the contents of the
    volumes created by another user, if the volume is not meant to be
    shared?


OpenStack users or tenants have no access to the keyring. Cinder tracks
volume ownership and checks permissions when a volume is attached, and
qemu prevents users from seeing anything outside of their vm, including 
the keyring.



 2. If some malicious user gets access to the keyring (which we use
    as an authentication mechanism between the client/OpenStack and
    Ceph), how does Ceph guarantee that the malicious user can’t
    access the volumes in that pool?


The keyring gives a user access to the cluster. If someone has a valid 
keyring, Ceph treats them as a valid user, since there is no information

to say otherwise. Ceph can't tell whether the user of a keyring is
malicious.


 3. Let's say our Cinder services are running on the OpenStack API
    node. How does the Ceph keyring information get transferred from
    the API node to the hypervisor node? Is the keyring passed
    through the message queue? If yes, can a malicious user look at
    the message queue and grab the keyring information? If not,
    how does it reach the hypervisor node from the API node?


The keyring is static and configured by the administrator on the nodes
running cinder-volume and nova-compute. It's not sent over the network,
and is not needed by nova or cinder api nodes.
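
For reference, the kind of restricted keyring an administrator creates
for this setup might look like the following (the client and pool names
follow the upstream OpenStack guide and are placeholders):

ceph auth get-or-create client.cinder mon 'allow r' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes' \
  -o /etc/ceph/ceph.client.cinder.keyring

The resulting keyring file is then copied out-of-band to the
cinder-volume and nova-compute hosts.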

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent failed to parse

2014-07-07 Thread Josh Durgin

On 07/04/2014 08:36 AM, Peter wrote:

i am having issues running radosgw-agent to sync data between two
radosgw zones. As far as i can tell both zones are running correctly.

My issue is when i run the radosgw-agent command:



radosgw-agent -v --src-access-key  --src-secret-key
 --dest-access-key  --dest-secret-key
 --src-zone us-master http://us-secondary.example.com:80


i get the following error:

DEBUG:boto:Using access key provided by client.
DEBUG:boto:Using secret key provided by client.
DEBUG:boto:StringToSign:
GET

Fri, 04 Jul 2014 15:25:53 GMT
/admin/config
DEBUG:boto:Signature:
AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=
DEBUG:boto:url = 'http://us-secondary.example.comhttp://us-secondary.example.com/admin/config'
params={}
headers={'Date': 'Fri, 04 Jul 2014 15:25:53 GMT', 'Content-Length': '0',
'Authorization': 'AWS EA20YO07DA8JJJX7ZIPJ:WbykwyXu5m5IlbEsBzo8bKEGIzg=',
'User-Agent': 'Boto/2.20.1 Python/2.7.6 Linux/3.13.0-24-generic'}
data=None
ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 269, in main
    region_map = client.get_region_map(dest_conn)
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 391, in get_region_map
    region_map = request(connection, 'get', 'admin/config')
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 153, in request
    result = handler(url, params=params, headers=request.headers, data=data)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 349, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 287, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 287, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 334, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "/usr/lib/python2.7/dist-packages/urllib3/util.py", line 390, in parse_url
    raise LocationParseError("Failed to parse: %s" % url)
LocationParseError: Failed to parse: Failed to parse: us-secondary.example.comhttp:


Is this a bug? Or is my setup wrong? I can navigate to
http://us-secondary.example.com/admin/config and it correctly outputs
the zone details.


It seems like an issue with your environment. What version of
radosgw-agent and which distro is this running on?

Are there any special characters in the access or secret keys that
might need to be escaped on the command line?


DEBUG:boto:url =
'http://us-secondary.example.comhttp://us-secondary.example.com/admin/config'


Should the url be repeated like that?


No, and it's rather strange since it should be the url passed on the
command line, parsed, and with /admin/config added.

Could you post the result of running this in a python interpreter:

import urlparse
result = urlparse.urlparse('http://us-secondary.example.com:80')
print result.hostname, result.port

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multipart upload on ceph 0.8 doesn't work?

2014-07-07 Thread Josh Durgin

On 07/07/2014 05:41 AM, Patrycja Szabłowska wrote:

OK, the mystery is solved.

 From https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10368.html
"During a multi part upload you can't upload parts smaller than 5M"

I've tried to upload smaller chunks, like 10KB. I've changed chunk size
to 5MB and it works now.

It's a pity that the Ceph docs don't mention the limit (or I couldn't
find it anywhere), and that the error wasn't helpful at all.


Glad you figured it out. This is in the s3 docs [1], but the lack of
error message is a regression. I added a couple tickets for this:

http://tracker.ceph.com/issues/8764
http://tracker.ceph.com/issues/8766
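
For anyone else hitting this, here's a minimal sketch of a compliant
multipart upload with boto v2 (as used in this thread); the connection
details, bucket, and object names are placeholders:

import boto
import boto.s3.connection
from cStringIO import StringIO

conn = boto.connect_s3(
    aws_access_key_id='ACCESS', aws_secret_access_key='SECRET',
    host='rgw.example.com', is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat())

bucket = conn.get_bucket('mybucket')
part_size = 5 * 1024 * 1024  # every part except the last must be >= 5MB

mp = bucket.initiate_multipart_upload('bigobject')
with open('bigfile', 'rb') as f:
    part_num = 0
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        part_num += 1
        mp.upload_part_from_file(StringIO(chunk), part_num)
mp.complete_upload()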

Josh

[1] http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about librbd io

2014-09-10 Thread Josh Durgin

On 09/09/2014 07:06 AM, yuelongguang wrote:

hi, josh.durgin:
I want to know how librbd issues io requests.
use case:
inside the vm, I use fio to test the rbd disk's io performance.
fio's parameters are bs=4k, direct io, qemu cache=none.
in this case, if librbd just sends on what it gets from the vm, i.e. no
gather/scatter, is the ratio of io inside the vm : io at librbd : io at
the osd filestore = 1:1:1?


If the rbd image is not a clone, the io issued from the vm's block
driver will match the io issued by librbd. With caching disabled
as you have it, the io from the OSDs will be similar, with some
small amount extra for OSD bookkeeping.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] command to flush rbd cache?

2015-02-04 Thread Josh Durgin

On 02/05/2015 07:44 AM, Udo Lembke wrote:

Hi all,
is there any command to flush the rbd cache like the
"echo 3 > /proc/sys/vm/drop_caches" for the os cache?


librbd exposes it as rbd_invalidate_cache(), and qemu uses it
internally, but I don't think you can trigger that via any user-facing
qemu commands.

Exposing it through the admin socket would be pretty simple though:

http://tracker.ceph.com/issues/2468

You can also just detach and reattach the device to flush the rbd cache.
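
For completeness, here's a sketch of flushing and invalidating the cache
from the librbd python bindings - assuming a librbd new enough that the
bindings expose flush() and invalidate_cache(); pool and image names are
placeholders. Note this only affects the cache of the process opening
the image, so it won't touch a cache held inside a running qemu:

import rados, rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
image = rbd.Image(ioctx, 'myimage')
image.flush()             # write back any dirty cache contents
image.invalidate_cache()  # then drop the now-clean cache
image.close()
ioctx.close()
cluster.shutdown()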

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] wider rados namespace support?

2015-02-12 Thread Josh Durgin

On 02/10/2015 07:54 PM, Blair Bethwaite wrote:

Just came across this in the docs:
"Currently (i.e., firefly), namespaces are only useful for
applications written on top of librados. Ceph clients such as block
device, object storage and file system do not currently support this
feature."

Then found:
https://wiki.ceph.com/Planning/Sideboard/rbd%3A_namespace_support

Is there any progress or plans to address this (particularly for rbd
clients but also cephfs)?


No immediate plans for rbd. That blueprint still seems like a
reasonable way to implement it to me.

The one part I'm less sure about is the OpenStack or other higher level
integration, which would need to start adding secret keys to libvirt
dynamically.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD on RBD (KVM)

2015-02-18 Thread Josh Durgin
> From: "Logan Barfield" 
> We've been running some tests to try to determine why our FreeBSD VMs
> are performing much worse than our Linux VMs backed by RBD, especially
> on writes.
> 
> Our current deployment is:
> - 4x KVM Hypervisors (QEMU 2.0.0+dfsg-2ubuntu1.6)
> - 2x OSD nodes (8x SSDs each, 10Gbit links to hypervisors, pool has 2x
> replication across nodes)
> - Hypervisors have "rbd_cache enabled"
> - All VMs use "cache=none" currently.

If you don't have rbd cache writethrough until flush = true, this
configuration is unsafe - with cache=none, qemu will not send flushes.
 
> In testing we were getting ~30MB/s writes, and ~100MB/s reads on
> FreeBSD 10.1.  On Linux VMs we're seeing ~150+MB/s for writes and
> reads (dd if=/dev/zero of=output bs=1M count=1024 oflag=direct).

I'm not very familiar with FreeBSD, but I'd guess it's sending smaller
I/Os for some reason. This could be due to trusting the sector size
qemu reports (this can be changed, though I don't remember the syntax
offhand), lower fs block size, or scheduler or block subsystem
configurables.  It could also be related to differences in block
allocation strategies by whatever FS you're using in the guest and
Linux filesystems. What FS are you using in each guest?

You can check the I/O sizes seen by rbd by adding something like this
to ceph.conf on a node running qemu:

[client]
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

This will show the offset and length of requests in lines containing
aio_read and aio_write. If you're using giant you could instead gather
a trace of I/O to rbd via lttng.
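
For example, something like this would pull out just those lines (log
path as configured above):

grep -E 'aio_(read|write)' /path/writeable/by/qemu.*.log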

> I tested several configurations on both RBD and local SSDs, and the
> only time FreeBSD performance was comparable to Linux was with the
> following configuration:
> - Local SSD
> - Qemu cache=writeback
> - GPT journaling enabled
> 
> We did see some performance improvement (~50MB/s writes instead of
> 30MB/s) when using cache=writeback on RBD.
> 
> I've read several threads regarding cache=none vs cache=writeback.
> cache=none is apparently safer for live migration, but cache=writeback
> is recommended by Ceph to prevent data loss.  Apparently there was a
> patch submitted for Qemu a few months ago to make cache=writeback
> safer for live migrations as well: http://tracker.ceph.com/issues/2467

RBD caching is already safe with live migration without this patch. It
just makes sure that it will continue to be safe in case of future
QEMU changes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] wider rados namespace support?

2015-02-18 Thread Josh Durgin

On 02/12/2015 05:59 PM, Blair Bethwaite wrote:

My particular interest is for a less dynamic environment, so manual
key distribution is not a problem. Re. OpenStack, it's probably good
enough to have the Cinder host creating them as needed (presumably
stored in its DB) and just send the secret keys over the message bus
to compute hosts as needed - if your infrastructure network is not
trusted then you've got bigger problems to worry about. It's true that
a lot of clouds would end up logging the secrets in various places,
but then they are only useful on particular hosts.

I guess there is nothing special about the default "" namespace
compared to any other as far as cephx is concerned. It would be nice
to have something of a nested auth, so that the client requires
explicit permission to read the default namespace (configured
out-of-band when setting up compute hosts) and further permission for
particular non-default namespaces (managed by the cinder rbd driver),
that way leaking secrets from cinder gives less exposure - but I guess
that would be a bit of a change from the current namespace
functionality.


You can restrict client access to the default namespace like this with
the existing ceph capabilities. For the proposed rbd usage of
namespaces, for example, you could allow read-only access to the
rbd_id.* objects in the default namespace, and full access to other
specific namespaces. Something like:

mon 'allow r' osd 'allow r class-read pool=foo namespace="" 
object_prefix rbd_id, allow rwx pool=foo namespace=bar'
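
Applied with the ceph auth tool, that might look like (the client name
is a placeholder):

ceph auth caps client.qemu mon 'allow r' \
  osd 'allow r class-read pool=foo namespace="" object_prefix rbd_id, allow rwx pool=foo namespace=bar'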


Cinder or other management layers would still want broader access, but
these more restricted keys could be the only ones exposed to QEMU.

Josh


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-03-04 Thread Josh Durgin

On 03/03/2015 03:28 PM, Ken Dreyer wrote:

On 03/03/2015 04:19 PM, Sage Weil wrote:

Hi,

This is just a heads up that we've identified a performance regression in
v0.80.8 from previous firefly releases.  A v0.80.9 is working its way
through QA and should be out in a few days.  If you haven't upgraded yet
you may want to wait.

Thanks!
sage


Hi Sage,

I've seen a couple Redmine tickets on this (eg
http://tracker.ceph.com/issues/9854 ,
http://tracker.ceph.com/issues/10956). It's not totally clear to me
which of the 70+ unreleased commits on the firefly branch fix this
librbd issue.  Is it only the three commits in
https://github.com/ceph/ceph/pull/3410 , or are there more?


Those are the only ones needed to fix the librbd performance
regression, yes.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread Josh Durgin

On 03/02/2015 04:16 AM, koukou73gr wrote:


Hello,

Today I thought I'd experiment with snapshots and cloning. So I did:

rbd import --image-format=2 vm-proto.raw rbd/vm-proto
rbd snap create rbd/vm-proto@s1
rbd snap protect rbd/vm-proto@s1
rbd clone rbd/vm-proto@s1 rbd/server

And then proceeded to create a qemu-kvm guest with rbd/server as its
backing store. The guest booted but as soon as it got to mount the root
fs, things got weird:


What does the qemu command line look like?


[...]
scsi2 : Virtio SCSI HBA
scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0
ANSI: 5
sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
  sda: sda1 sda2
sd 2:0:0:0: [sda] Attached SCSI disk
dracut: Scanning devices sda2  for LVM logical volumes vg_main/lv_swap
vg_main/lv_root
dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit
dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit
EXT4-fs (dm-1): INFO: recovery required on readonly filesystem


This suggests the disk is being exposed as read-only via QEMU,
perhaps via qemu's snapshot or other options.

You can use a clone in exactly the same way as any other rbd image.
If you're running QEMU manually, for example, something like:

qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback

is fine for using the clone. QEMU is supposed to be unaware of any
snapshots, parents, etc. at the rbd level.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-04 Thread Josh Durgin

On 03/04/2015 01:36 PM, koukou73gr wrote:

On 03/03/2015 05:53 PM, Jason Dillaman wrote:

Your procedure appears correct to me.  Would you mind re-running your
cloned image VM with the following ceph.conf properties:

[client]
rbd cache off
debug rbd = 20
log file = /path/writeable/by/qemu.$pid.log

If you recreate the issue, would you mind opening a ticket at
http://tracker.ceph.com/projects/rbd/issues?

Jason,

Thanks for the reply. Recreating the issue is not a problem, I can
reproduce it any time.
The log file was getting a bit large, I destroyed the guest after
letting it thrash for about ~3 minutes, plenty of time to hit the
problem. I've uploaded it at:

http://paste.scsys.co.uk/468868 (~19MB)


It looks like your libvirt rados user doesn't have access to whatever
pool the parent image is in:


librbd::AioRequest: write 0x7f1ec6ad6960 
rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r = -1


-1 is EPERM, for operation not permitted.

Check the libvirt user capabilities shown in 'ceph auth list' - it should
have at least r and class-read access to the pool storing the parent
image. You can update it via the 'ceph auth caps' command.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm and cloned rbd image

2015-03-05 Thread Josh Durgin

On 03/05/2015 12:46 AM, koukou73gr wrote:

On 03/05/2015 03:40 AM, Josh Durgin wrote:


It looks like your libvirt rados user doesn't have access to whatever
pool the parent image is in:


librbd::AioRequest: write 0x7f1ec6ad6960
rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r
= -1

-1 is EPERM, for operation not permitted.

Check the libvirt user capabilites shown in ceph auth list - it should
have at least r and class-read access to the pool storing the parent
image. You can update it via the 'ceph auth caps' command.


Josh,

All images - parent, snapshot and clone - reside in the same pool
(libvirt-pool *) and the user (libvirt) seems to have the proper
capabilities. See:

client.libvirt
 key: 
 caps: [mon] allow r
 caps: [osd] allow class-read object_prefix rbd_children, allow rw
class-read pool=rbd


This includes everything except class-write on the pool you're using.
You'll need that so that a copy_up call (used just for clones) works.
That's what was getting a permissions error. You can use rwx for short.
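
For example, updating the existing caps might look like (pool name taken
from the caps above; adjust it to the pool actually holding your images):

ceph auth caps client.libvirt mon 'allow r' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=rbd'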

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?

2015-03-26 Thread Josh Durgin

On 03/26/2015 10:46 AM, Gregory Farnum wrote:

I don't know why you're mucking about manually with the rbd directory;
the rbd tool and rados handle cache pools correctly as far as I know.


That's true, but the rados tool should be able to manipulate binary data 
more easily. It should probably be able to read from a file or stdin for 
this.


Josh



On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke  wrote:

Hi Greg,
ok!

It's looks like, that my problem is more setomapval-related...

I must do something like
rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
"\0x0f\0x00\0x00\0x00"2cfc7ce74b0dc51

but "rados setomapval" doesn't interpret the hex values - instead I got
rados -p ssd-archiv listomapvals rbd_directory
name_vm-409-disk-2
value: (35 bytes) :
 : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\
0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d
0020 : 63 35 31: c51


hmm, strange. With "rados -p ssd-archiv getomapval rbd_directory 
name_vm-409-disk-2 name_vm-409-disk-2"
I got the binary value written into the file name_vm-409-disk-2, but the
reverse, "rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 
name_vm-409-disk-2",
fills the value with the string name_vm-409-disk-2 and not with the
content of the file...

Are there other tools for the rbd_directory?

regards

Udo

Am 26.03.2015 15:03, schrieb Gregory Farnum:

You shouldn't rely on "rados ls" when working with cache pools. It
doesn't behave properly and is a silly operation to run against a pool
of any size even when it does. :)

More specifically, "rados ls" is invoking the "pgls" operation. Normal
read/write ops will go query the backing store for objects if they're
not in the cache tier. pgls is different — it just tells you what
objects are present in the PG on that OSD right now. So any objects
which aren't in cache won't show up when listing on the cache pool.
-Greg

On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke  wrote:

Hi all,
due an very silly approach, I removed the cache tier of an filled EC pool.

After recreate the pool and connect with the EC pool I don't see any content.
How can I see the rbd_data and other files through the new ssd cache tier?

I think, that I must recreate the rbd_directory (and fill with setomapval), but 
I don't see anything yet!

$ rados ls -p ecarchiv | more
rbd_data.2e47de674b0dc51.00390074
rbd_data.2e47de674b0dc51.0020b64f
rbd_data.2fbb1952ae8944a.0016184c
rbd_data.2cfc7ce74b0dc51.00363527
rbd_data.2cfc7ce74b0dc51.0004c35f
rbd_data.2fbb1952ae8944a.0008db43
rbd_data.2cfc7ce74b0dc51.0015895a
rbd_data.31229f0238e1f29.000135eb
...

$ rados ls -p ssd-archiv
 nothing 

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500


rule ssd {
 ruleset 5
 type replicated
 min_size 1
 max_size 10
 step take ssd
 step choose firstn 0 type osd
 step emit
}


Is there any "magic" (or which command did I miss?) to see the existing data 
through the cache tier?


regards - and hoping for answers

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error DATE 1970

2015-04-02 Thread Josh Durgin

On 04/01/2015 02:42 AM, Jimmy Goffaux wrote:

English Version :

Hello,

I found a strange behavior in Ceph. This behavior is visible on buckets
(RGW) and pools (RBD).
pools:

``
root@:~# qemu-img info rbd:pool/kibana2
image: rbd:pool/kibana2
file format: raw
virtual size: 30G (32212254720 bytes)
disk size: unavailable
Snapshot list:
ID                      TAG                     VM SIZE  DATE                 VM CLOCK
snap2014-08-26-kibana2  snap2014-08-26-kibana2  30G      1970-01-01 01:00:00  00:00:00.000
snap2014-09-05-kibana2  snap2014-09-05-kibana2  30G      1970-01-01 01:00:00  00:00:00.000
``

As you can see, all the dates are set to 1970-01-01?


The reason for this is simple for rbd: it doesn't store the date for
snapshots, and that's just the default value qemu shows.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration fails with image on ceph

2015-04-06 Thread Josh Durgin

Like the last comment on the bug says, the message about block migration (drive 
mirroring) indicates that nova is telling libvirt to copy the virtual disks, 
which is not what should happen for ceph or other shared storage.

For ceph just plain live migration should be used, not block migration. It's 
either a configuration issue or a bug in nova.
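
For reference, with plain virsh the difference looks something like this
(domain and destination are placeholders; nova drives the equivalent
libvirt flags internally):

# plain live migration - correct for shared storage like rbd
virsh migrate --live guest1 qemu+ssh://dest-host/system
# block migration - copies the disks, wrong for rbd
virsh migrate --live --copy-storage-all guest1 qemu+ssh://dest-host/system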

Josh


From: "Yuming Ma (yumima)" 
Sent: Apr 3, 2015 1:27 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] live migration fails with image on ceph


Problem: live-migrating a VM, the migration will complete but cause a VM to 
become unstable.  The VM may become unreachable on the network, or go through a 
cycle where it hangs for ~10 mins at a time. A hard-reboot is the only way to 
resolve this.

Related libvirt logs:

2015-03-30 01:18:23.429+: 244411: warning : 
qemuMigrationCancelDriveMirror:1383 : Unable to stop block job on 
drive-virtio-disk0

2015-03-30 01:17:41.899+: 244408: warning : 
qemuDomainObjEnterMonitorInternal:1175 : This thread seems to be the async job 
owner; entering monitor without asking for a nested job is dangerous


Nova env: 
Kernel : 3.11.0-26-generic

libvirt-bin : 1.1.1-0ubuntu11 

ceph-common : 0.67.9-1precise


Ceph:

Kernel: 3.13.0-36-generic

ceph        : 0.80.7-1precise 

ceph-common : 0.80.7-1precise  



Saw a post here (https://bugs.dogfood.paddev.net/mos/+bug/1371130) that this 
might have something to do with libvirt migration with an RBD image, but it's 
not clear exactly how Ceph is related and how to resolve it - has anyone seen 
this before?


Thanks.


— Yuming
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Number of ioctx per rados connection

2015-04-08 Thread Josh Durgin

Yes, you can use multiple ioctxs with the same underlying rados connection. 
There's no hard limit on how many, it depends on your usage if/when a single 
rados connection becomes a bottleneck.

It's safe to use different ioctxs from multiple threads. IoCtxs have some local 
state like namespace, object locator key and snapshot that limit what you can 
do safely with multiple threads using the same IoCtx. librados.h has more 
details, but it's simplest to use a separate ioctx for each thread.
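
As an illustration, a sketch using the python bindings (the same pattern
applies to the C API with rados_ioctx_create()/rados_ioctx_destroy();
the pool and object names are placeholders):

import threading
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()  # one connection shared by all threads

def worker(obj_name):
    ioctx = cluster.open_ioctx('data')  # a separate ioctx per thread
    try:
        ioctx.write_full(obj_name, 'hello')  # per-thread state stays isolated
    finally:
        ioctx.close()

threads = [threading.Thread(target=worker, args=('obj-%d' % i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
cluster.shutdown()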

Josh


From: Michel Hollands 
Sent: Apr 8, 2015 6:54 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Number of ioctx per rados connection

> Hello,
>
> This is a question about the C API for librados. Can you use multiple “IO 
> contexts” (ioctx) per rados connection and if so how many ? Can these then be 
> used by multiple threads ? 
>
> Thanks in advance,
>
> Michel___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] long blocking with writes on rbds

2015-04-08 Thread Josh Durgin

On 04/08/2015 11:40 AM, Jeff Epstein wrote:

Hi, thanks for answering. Here are the answers to your questions.
Hopefully they will be helpful.

On 04/08/2015 12:36 PM, Lionel Bouton wrote:

I probably won't be able to help much, but people knowing more will
need at least: - your Ceph version, - the kernel version of the host
on which you are trying to format /dev/rbd1, - which hardware and
network you are using for this cluster (CPU, RAM, HDD or SSD models,
network cards, jumbo frames, ...).


ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

Linux 3.18.4pl2 #3 SMP Thu Jan 29 21:11:23 CET 2015 x86_64 GNU/Linux

The hardware is an Amazon AWS c3.large. So, a (virtual) Xeon(R) CPU
E5-2680 v2 @ 2.80GHz, 3845992 kB RAM, plus whatever other virtual
hardware Amazon provides.


AWS will cause some extra perf variance, but...


There's only one thing surprising me here: you have only 6 OSDs, 1504GB
(~ 250G / osd) and a total of 4400 pgs ? With a replication of 3 this is
2200 pgs / OSD, which might be too much and unnecessarily increase the
load on your OSDs.

Best regards,

Lionel Bouton


Our workload involves creating and destroying a lot of pools. Each pool
has 100 pgs, so it adds up. Could this be causing the problem? What
would you suggest instead?


...this is most likely the cause. Deleting a pool causes the data and
pgs associated with it to be deleted asynchronously, which can be a lot
of background work for the osds.

If you're using the cfq scheduler you can try decreasing the priority of 
these operations with the "osd disk thread ioprio..." options:


http://ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
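
For example, in ceph.conf on the osd nodes (only effective with the cfq
scheduler, as the docs note; the values here are illustrative):

[osd]
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7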

If that doesn't help enough, deleting data from pools before deleting
the pools might help, since you can control the rate more finely. And of
course not creating/deleting so many pools would eliminate the hidden
background cost of deleting the pools.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration fails with image on ceph

2015-04-10 Thread Josh Durgin

On 04/08/2015 09:37 PM, Yuming Ma (yumima) wrote:

Josh,

I think we are using plain live migration and not mirroring block drives
as the other test did.


Do you have the migration flags or more from the libvirt log? Also
which versions of qemu is this?

The libvirt log message about qemuMigrationCancelDriveMirror from your
first email is suspicious. Being unable to stop it may mean it was not 
running (fine, but libvirt shouldn't have tried to stop it), or it kept 
running (bad esp. if it's trying to copy to the same rbd).



What are the chances, or in what scenario, could the disk image
be corrupted during live migration, given that both source and target
are connected to the same volume and RBD caching is turned on:


Generally rbd caching with live migration is safe. The way to get
corruption is to have drive-mirror try to copy over the rbd on the
destination while the source is still using the disk...

Did you observe fs corruption after a live migration, or just other odd
symptoms? Since a reboot fixed it, it sounds more like memory corruption
to me, unless it was fsck'd during reboot.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-04-14 Thread Josh Durgin
I don't see any commits that would be likely to affect that between 0.80.7 and 
0.80.9.

Is this after upgrading an existing cluster?
Could this be due to fs aging beneath your osds?

How are you measuring create/delete performance?

You can try increasing rbd concurrent management ops in ceph.conf on the cinder 
node. This affects delete speed, since rbd tries to delete each object in a 
volume.

Josh


From: shiva rkreddy 
Sent: Apr 14, 2015 5:53 AM
To: Josh Durgin
Cc: Ken Dreyer; Sage Weil; Ceph Development; ceph-us...@ceph.com
Subject: Re: v0.80.8 and librbd performance

> Hi Josh,
>
> We are using firefly 0.80.9 and see both cinder create/delete numbers slow 
> down compared to 0.80.7.
> I don't see any specific tuning requirements and our cluster is run pretty 
> much on default configuration.
> Do you recommend any tuning or can you please suggest some log signatures we 
> need to be looking at?
>
> Thanks
> shiva
>
> On Wed, Mar 4, 2015 at 1:53 PM, Josh Durgin  wrote:
>>
>> On 03/03/2015 03:28 PM, Ken Dreyer wrote:
>>>
>>> On 03/03/2015 04:19 PM, Sage Weil wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is just a heads up that we've identified a performance regression in
>>>> v0.80.8 from previous firefly releases.  A v0.80.9 is working it's way
>>>> through QA and should be out in a few days.  If you haven't upgraded yet
>>>> you may want to wait.
>>>>
>>>> Thanks!
>>>> sage
>>>
>>>
>>> Hi Sage,
>>>
>>> I've seen a couple Redmine tickets on this (eg
>>> http://tracker.ceph.com/issues/9854 ,
>>> http://tracker.ceph.com/issues/10956). It's not totally clear to me
>>> which of the 70+ unreleased commits on the firefly branch fix this
>>> librbd issue.  Is it only the three commits in
>>> https://github.com/ceph/ceph/pull/3410 , or are there more?
>>
>>
>> Those are the only ones needed to fix the librbd performance
>> regression, yes.
>>
>> Josh
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.80.8 and librbd performance

2015-04-15 Thread Josh Durgin

On 04/14/2015 08:01 PM, shiva rkreddy wrote:

The clusters are in a test environment, so it's a new deployment of 0.80.9.
OS on the cluster nodes is reinstalled as well, so there shouldn't be
any fs aging unless the disks are slowing down.

The perf measurement is done by initiating multiple cinder create/delete
commands and tracking when the volume becomes available or is completely
gone from the "cinder list" output.

Even running the "rbd rm " command from the cinder node results in similar
behaviour.

I'll try increasing rbd_concurrent_management in ceph.conf.
Is the param name rbd_concurrent_management or rbd-concurrent-management?


'rbd concurrent management ops' - spaces, hyphens, and underscores are
equivalent in ceph configuration.
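
For example (20 is an illustrative value; the default is 10):

[client]
rbd concurrent management ops = 20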

A log with 'debug ms = 1' and 'debug rbd = 20' from 'rbd rm' on both 
versions might give clues about what's going slower.


Josh


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-02 Thread Josh Durgin

On 06/01/2015 03:41 AM, Jan Schermer wrote:

Thanks, that’s it exactly.
But I think that’s really too much work for now, that’s why I really would like 
to see a quick-win by using the local RBD cache for now - that would suffice 
for most workloads (not too many people run big databases on CEPH now, those 
who do must be aware of this).

The issue is - and I have not yet seen an answer to that - would it be safe as 
it is now if the flushes were ignored (rbd cache = unsafe) or will it 
completely b0rk the filesystem when not flushed properly?


Generally the latter. Right now flushes are the only thing enforcing
ordering for rbd. As a block device it doesn't guarantee that e.g. the
extent at offset 0 is written before the extent at offset 4096 unless
it sees a flush between the writes.

As suggested earlier in this thread, maintaining order during writeback
would make not sending flushes (via mount -o nobarrier in the guest or
cache=unsafe for qemu) safer from a crash-consistency point of view.

An fs or database on top of rbd would still have to replay their
internal journal, and could lose some writes, but should be able to
end up in a consistent state that way. This would make larger caches
more useful, and would be a simple way to use a large local cache
devices as an rbd cache backend. Live migration should still work in
such a system because qemu will still tell rbd to flush data at that
point.

A distributed local cache like [1] might be better long term, but
much more complicated to implement.

Josh

[1] 
https://www.usenix.org/conference/fast15/technical-sessions/presentation/bhagwat


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph asok filling nova open files

2015-06-03 Thread Josh Durgin

On 06/03/2015 02:31 PM, Robert LeBlanc wrote:

We are experiencing a problem where nova is opening up all kinds of
sockets like:

nova-comp 20740 nova 1996u  unix 0x8811b3116b40  0t0 41081179
/var/run/ceph/ceph-client.volumes.20740.81999792.asok

hitting the open file limits rather quickly and preventing any new
work from happening in Nova.

The thing is, there isn't even that many volumes in the pool. Any ideas?


This is http://tracker.ceph.com/issues/11535 combined with nova
checking storage utilization periodically. Backports to hammer and
firefly are ready but not in a point release yet.

You can turn off the admin socket option in the ceph.conf that nova
uses to work around it.
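
That is, in the ceph.conf that nova reads, remove or comment out the
line enabling it (the path shown is illustrative):

[client]
; admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok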

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph asok filling nova open files

2015-06-03 Thread Josh Durgin

On 06/03/2015 03:15 PM, Robert LeBlanc wrote:


Thank you for pointing to the information. I'm glad a fix is already
ready. I can't tell from https://github.com/ceph/ceph/pull/4657, will
this be included in the next point release of hammer?


It'll be in 0.94.3.

0.94.2 is close to release already: http://tracker.ceph.com/issues/11492

Josh


Thanks,
- 
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jun 3, 2015 at 4:00 PM, Josh Durgin  wrote:

On 06/03/2015 02:31 PM, Robert LeBlanc wrote:


We are experiencing a problem where nova is opening up all kinds of
sockets like:

nova-comp 20740 nova 1996u  unix 0x8811b3116b40  0t0 41081179
/var/run/ceph/ceph-client.volumes.20740.81999792.asok

hitting the open file limits rather quickly and preventing any new
work from happening in Nova.

The thing is, there isn't even that many volumes in the pool. Any ideas?



This is http://tracker.ceph.com/issues/11535 combined with nova
checking storage utilization periodically. Backports to hammer and
firefly are ready but not in a point release yet.

You can turn off the admin socket option in the ceph.conf that nova
uses to work around it.

Josh





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Synchronous writes - tuning and some thoughts about them?

2015-06-04 Thread Josh Durgin

On 06/03/2015 04:15 AM, Jan Schermer wrote:

Thanks for a very helpful answer.
So if I understand it correctly then what I want (crash consistency with RPO>0) 
isn’t possible now in any way.
If there is no ordering in RBD cache then ignoring barriers sounds like a very 
bad idea also.


Yes, that's why the default rbd cache configuration in hammer stays in
writethrough mode until it sees a flush from the guest.


Any thoughts on ext4 with journal_async_commit? That should be safe in any 
circumstance, but it’s pretty hard to test that assumption…


It doesn't sound incredibly well-tested in general. It does something
like what you want, allowing some data to be lost but theoretically
preventing fs corruption, but I wouldn't trust it without a lot of
testing.

It seems like db-specific options for controlling how much data they
can lose may be best for your use case right now.


Is there someone running big database (OLTP) workloads on Ceph? What did you do 
to make them run? Out of box we are all limited to the same ~100 tqs/s (with 
5ms write latency)…


There is a lot of work going on to improve performance, and latency in
particular:

http://pad.ceph.com/p/performance_weekly

If you haven't seen them, Mark has a config optimized for latency at
the end of this:

http://nhm.ceph.com/Ceph_SSD_OSD_Performance.pdf

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache + libvirt

2015-06-08 Thread Josh Durgin

On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote:

Hi,

looking at the latest version of QEMU,


It seems that it has already behaved this way since the addition of
rbd_cache parsing in rbd.c by Josh in 2012:

http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85


I'll do tests on my side tomorrow to be sure.


It seems like we should switch the order so ceph.conf is overridden by
qemu's cache settings. I don't remember a good reason to have it the
other way around.

Josh


- Mail original -
De: "Jason Dillaman" 
À: "Arnaud Virlet" 
Cc: "ceph-users" 
Envoyé: Lundi 8 Juin 2015 17:50:53
Objet: Re: [ceph-users] rbd cache + libvirt

Hmm ... looking at the latest version of QEMU, it appears that the RBD cache settings are 
changed prior to reading the configuration file instead of overriding the value after the 
configuration file has been read [1]. Try specifying the path to a new configuration file 
via the "conf=/path/to/my/new/ceph.conf" QEMU parameter where the RBD cache is 
explicitly disabled [2].


[1] 
http://git.qemu.org/?p=qemu.git;a=blob;f=block/rbd.c;h=fbe87e035b12aab2e96093922a83a3545738b68f;hb=HEAD#l478
[2] http://ceph.com/docs/master/rbd/qemu-rbd/#usage



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache + libvirt

2015-06-12 Thread Josh Durgin

On 06/08/2015 09:23 PM, Alexandre DERUMIER wrote:

In the short-term, you can remove the "rbd cache" setting from your ceph.conf


That's not true: you need to remove the ceph.conf file entirely.
Removing rbd_cache is not enough, or the default rbd_cache=false will apply.


I have done tests; here is the result matrix:

host ceph.conf: no rbd_cache      guest cache=writeback : result: no cache (wrong)
host ceph.conf: rbd_cache=false   guest cache=writeback : result: no cache (wrong)
host ceph.conf: rbd_cache=true    guest cache=writeback : result: cache
host ceph.conf: no rbd_cache      guest cache=none      : result: no cache
host ceph.conf: rbd_cache=false   guest cache=none      : result: no cache
host ceph.conf: rbd_cache=true    guest cache=none      : result: cache (wrong)


QEMU patch 3/4 fixes this:

http://comments.gmane.org/gmane.comp.emulators.qemu.block/2500

Josh


- Mail original -
De: "Jason Dillaman" 
À: "Andrey Korolyov" 
Cc: "Josh Durgin" , "aderumier" , 
"ceph-users" 
Envoyé: Lundi 8 Juin 2015 22:29:10
Objet: Re: [ceph-users] rbd cache + libvirt


On Mon, Jun 8, 2015 at 10:43 PM, Josh Durgin  wrote:

On 06/08/2015 11:19 AM, Alexandre DERUMIER wrote:


Hi,


looking at the latest version of QEMU,



It's seem that it's was already this behaviour since the add of rbd_cache
parsing in rbd.c by josh in 2012


http://git.qemu.org/?p=qemu.git;a=blobdiff;f=block/rbd.c;h=eebc3344620058322bb53ba8376af4a82388d277;hp=1280d66d3ca73e552642d7a60743a0e2ce05f664;hb=b11f38fcdf837c6ba1d4287b1c685eb3ae5351a8;hpb=166acf546f476d3594a1c1746dc265f1984c5c85


I'll do tests on my side tomorrow to be sure.



It seems like we should switch the order so ceph.conf is overridden by
qemu's cache settings. I don't remember a good reason to have it the
other way around.

Josh



Erm, doesn't this code *already* represent the right priorities?
The cache=none setting should set BDRV_O_NOCACHE, which effectively
disables the cache in the mentioned snippet.



Yes, the override is applied (correctly) based upon your QEMU cache settings. However, it then reads your configuration 
file and re-applies the "rbd_cache" setting based upon what is in the file (if it exists). So in the case 
where a configuration file has "rbd cache = true", the override of "rbd cache = false" derived from 
your QEMU cache setting would get wiped out. The long term solution would be to, as Josh noted, switch the order (so 
long as there wasn't a use-case for applying values in this order). In the short-term, you can remove the "rbd 
cache" setting from your ceph.conf so that QEMU controls it (i.e. it cannot get overridden when reading the 
configuration file) or use a different ceph.conf for a drive which requires different cache settings from the default 
configuration's settings.

Jason



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backing Hadoop with Ceph ??

2015-07-17 Thread Josh Durgin

On 07/15/2015 11:48 AM, Shane Gibson wrote:


Somnath - thanks for the reply ...

:-)  Haven't tried anything yet - just starting to gather
info/input/direction for this solution.

Looking at the S3 API info [2] - there is no mention of support for the
"S3a" API extensions - namely "rename" support.  The problem with
backing via S3 API - if you need to rename a large (say multi GB) data
object - you have to copy to new name and delete - this is a very IO
expensive operation - and something we do a lot of.  That in and of
itself might be a deal breaker ...   Any idea/input/intention of
supporting the S3a extensions within the RadosGW S3 API implementation?


I see you're trying out cephfs now, and I think that makes sense.

I just wanted to mention that at CDS a couple weeks ago Yehuda noted
that RGW's rename is cheap, since it does not require copying the data,
just updating its location [1].

Josh

[1] http://pad.ceph.com/p/hadoop-over-rgw


Plus - it seems like it's considered a "bad idea" to back Hadoop via S3
(and indirectly Ceph via RGW) [3]; though not sure if the architectural
differences from Amazon's S3 implementation and the far superior Ceph
make it more palatable?

~~shane

[2] http://ceph.com/docs/master/radosgw/s3/
[3] https://wiki.apache.org/hadoop/AmazonS3




On 7/15/15, 9:50 AM, "Somnath Roy" <somnath@sandisk.com> wrote:

Did you try to integrate ceph +rgw+s3 with Hadoop?

Sent from my iPhone

On Jul 15, 2015, at 8:58 AM, Shane Gibson <shane_gib...@symantec.com> wrote:




We are in the (very) early stages of considering testing backing
Hadoop via Ceph - as opposed to HDFS.  I've seen a few very vague
references to doing that, but haven't found any concrete info
(architecture, configuration recommendations, gotchas, lessons
learned, etc...).   I did find the ceph.com/docs/ info [1] which
discusses use of CephFS for
backing Hadoop - but this would be foolish for production clusters
given that CephFS isn't yet considered production quality/grade.

Does anyone in the ceph-users community have experience with this
that they'd be willing to share?   Preferably ... via use of Ceph
- not via CephFS...but I am interested in any CephFS related
experiences too.

If we were to do this, and Ceph proved out as a backing store to
Hadoop - there is the potential to be creating a fairly large
multi-Petabyte (100s ??) class backing store for Ceph.  We do a
very large amount of analytics on a lot of data sets for security
trending correlations, etc...

Our current Ceph experience is limited to a few small (90 x 4TB
OSD size) clusters - which we are working towards putting in
production for Glance/Cinder backing and for Block storage for
various large storage need platforms (eg software and package
repo/mirrors, etc...).

Thanks in  advance for any input, thoughts, or pointers ...

~~shane

[1] http://ceph.com/docs/master/cephfs/hadoop/



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best method to limit snapshot/clone space overhead

2015-07-23 Thread Josh Durgin

On 07/23/2015 06:31 AM, Jan Schermer wrote:

Hi all,
I am looking for a way to alleviate the overhead of RBD snapshots/clones for 
some time.

In our scenario there are a few “master” volumes that contain production data, 
and are frequently snapshotted and cloned for dev/qa use. Those 
snapshots/clones live for a few days to a few weeks before they get dropped, 
and they sometimes grow very fast (databases, etc.).

With the default 4MB object size there seems to be huge overhead involved with 
this, could someone give me some hints on how to solve that?

I have some hope in

1) FIEMAP
I’ve calculated that files on my OSDs are approx. 30% filled with NULLs - I 
suppose this is what it could save (best-scenario) and it should also make COW 
operations much faster.
But there are lots of bugs in FIEMAP in kernels (i saw some reference to CentOS 
6.5 kernel being buggy - which is what we use) and filesystems (like XFS). No 
idea about ext4 which we’d like to use in the future.

Is enabling FIEMAP a good idea at all? I saw some mention of it being replaced 
with SEEK_DATA and SEEK_HOLE.


fiemap (and ceph's use of it) has been buggy on all fses in the past.
SEEK_DATA and SEEK_HOLE are the proper interfaces to use for these
purposes. That said, it's not incredibly well tested since it's off by
default, so I wouldn't recommend using it without careful testing on
the fs you're using. I wouldn't expect it to make much of a difference
if you use small objects.


2) object size < 4MB for clones
I did some quick performance testing and setting this lower for production is 
probably not a good idea. My sweet spot is 8MB object size, however this would 
make the overhead for clones even worse than it already is.
But I could make the cloned images with a different block size from the 
snapshot (at least according to docs). Does someone use it like that? Any 
caveats? That way I could have the production data with 8MB block size but make 
the development snapshots with for example 64KiB granularity, probably at 
expense of some performance, but most of the data would remain in the (faster) 
master snapshot anyway. This should drop overhead tremendously, maybe even more 
than neabling FIEMAP. (Even better when working in tandem I suppose?)


Since these clones are relatively short-lived this seems like a better
way to go in the short term. 64k may be extreme, but if there aren't
too many of these clones it's not a big deal. There is more overhead
for recovery and scrub with smaller objects, so I wouldn't recommend
using tiny objects in general.
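
For example, creating a dev clone with 64KiB objects might look like
this (order 16 = 64KiB; pool and image names are placeholders):

rbd clone --order 16 pool/master@snap pool/dev-clone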

It'll be interesting to see your results. I'm not sure many folks
have looked at optimizing this use case.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] readonly snapshots of live mounted rbd?

2015-08-04 Thread Josh Durgin

On 08/01/2015 07:52 PM, pixelfairy wrote:

Id like to look at a read-only copy of running virtual machines for
compliance and potentially malware checks that the VMs are unaware of.

the first note on http://ceph.com/docs/master/rbd/rbd-snapshot/ warns
that the filesystem has to be in a consistent state. does that just mean
you might get a "crashed" filesystem, or will some other bad thing happen
if you snapshot a running filesystem that hasn't synced? would telling
the os to sync just before help?


Ideally you would do xfs_freeze -f, snap, xfs_freeze -u to get a
consistent fs for your snapshot. Despite the name this works on all
linux filesystems.
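
A sketch of that sequence (mount point and image names are placeholders;
the freeze/unfreeze run inside the guest, the snap from a ceph client):

xfs_freeze -f /mnt/data              # inside the guest: quiesce the fs
rbd snap create rbd/vm-image@scan    # on a ceph client node
xfs_freeze -u /mnt/data              # inside the guest: resume writes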

If you don't do this, like you said you get a crash-consistent snapshot,
which might require fs jounal replay (writing to the image). This is
doable using a clone of the snapshot, but it's a bit more complicated
to manage.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Warning regarding LTTng while checking status or restarting service

2015-08-06 Thread Josh Durgin

On 08/06/2015 03:10 AM, Daleep Bais wrote:

Hi,

Whenever I restart or check the logs for OSD, MON, I get below warning
message..

I am running a test cluster of 09 OSD's and 03 MON nodes.

[ceph-node1][WARNIN] libust[3549/3549]: Warning: HOME environment
variable not set. Disabling LTTng-UST per-user tracing. (in
setup_local_apps() at lttng-ust-comm.c:375)


In short: this is harmless, you can ignore it.

liblttng-ust tries to listen for control commands from lttng-sessiond
in a few places by default, including under $HOME. It does this via a
shared mmaped file. If you were interested in tracing as a non-root
user, you could set LTTNG_HOME to a place that was usable, like /var/lib
/ceph/. Since ceph daemons run as root today, this is irrelevant, and
you can still use lttng as root just fine. Unfortunately there's no
simple way to silence liblttng-ust about this.
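
For example, before starting a daemon (a sketch; any directory writable by
the daemon's user would do):

export LTTNG_HOME=/var/lib/ceph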

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How can I fetch librbd debug logs?

2013-07-10 Thread Josh Durgin

On 07/06/2013 04:51 AM, Xue, Chendi wrote:

Hi, all

I want to fetch debug librbd and debug rbd logs when I am using a VM to read/
write.

Details:
I created a volume from ceph and attached it to a VM.
So I suppose that when I do reads/writes in the VM, I can get some rbd debug 
logs on the host where the VM runs.
But in fact, after everything I could try, I still cannot get those logs.

Below is the configuration on qemu side
;global
[global]
 ; allow ourselves to open a lot of files
 max open files = 131072
 auth cluster required = none
 auth service required = none
 auth client required = none
 ; set log file
 log file = /var/log/ceph/$name.log
 ; set up pid files
 pid file = /var/run/ceph/$name.pid

;client
[client]
 rbd cache = true
 rbd cache size = 21474836480
 rbd cache max dirty = 0
 ; set qemu log file
 log file = /var/log/ceph/ceph.client.log
#   debug ms = 1
#   debug client = 20
 debug rbd = 20
 debug librbd = 20
 debug objectcacher = 20

; monitors
;  You need at least one.  You need at least three if you want to
;  tolerate any node failures.  Always create an odd number.
[mon]
 mon data = /data/$name
[mon.Ceph-N3]
 host = node3
 mon addr = 192.168.1.13:6789


This configuration is fine.


Something I tried:
I can fetch librbd logs when using rbd commands like "rbd list pool_name" and 
"rbd -p pool_name bench-write image_name --io-size 4096 --io-threads 1 --io-total 4096 
--io-pattern rand"

But if I use dd in the VM on the virtual volume created by ceph, there are still 
no rbd debug logs.


Can the unix user running qemu write to /var/log/ceph to create 
/var/log/ceph/ceph.client.log?


SELinux or apparmor may be preventing it, if not usual unix permissions.
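
A quick check is to try creating the log file as the qemu user yourself (a
sketch; the account name varies by distro, e.g. 'qemu' or 'libvirt-qemu'):

ls -ld /var/log/ceph
sudo -u libvirt-qemu touch /var/log/ceph/ceph.client.log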

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] feature set mismatch

2013-07-16 Thread Josh Durgin

On 07/16/2013 06:06 PM, Gaylord Holder wrote:

Now whenever I try to map an RBD to a machine, mon0 complains:

feature set mismatch, my 2 < server's 2040002, missing 204
missing required protocol features.


Your cluster is using newer crush tunables to get better data
distribution, but your kernel client doesn't support that.

You'll need to upgrade to linux 3.9, or set the tunables
to 'legacy', which your kernel understands [1].

Josh

[1] http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] feature set mismatch

2013-07-17 Thread Josh Durgin

[please keep replies on the list]

On 07/17/2013 04:04 AM, Gaylord Holder wrote:



On 07/16/2013 09:22 PM, Josh Durgin wrote:

On 07/16/2013 06:06 PM, Gaylord Holder wrote:

Now whenever I try to map an RBD to a machine, mon0 complains:

feature set mismatch, my 2 < server's 2040002, missing 204
missing required protocol features.


Your cluster is using newer crush tunables to get better data
distribution, but your kernel client doesn't support that.

You'll need to upgrade to linux 3.9, or set the tunables
to 'legacy', which your kernel understands [1].

Josh

[1] http://ceph.com/docs/master/rados/operations/crush-map/#tuning-crush



Josh,

That was certainly the trick.

  ceph osd crush tunables legacy

now allows me to map the rbd.


To be clear, did you change the tunables before? If the upgrade enabled
them somehow without your intervention, it would be a bug.


Who needs to be running 3.9?  Just the machines mounting the rbd, or
everyone?


Just the machines mounting it.



Is there a better place in the documentation to track the recommended
kernel version than

   http://ceph.com/docs/next/install/os-recommendations/


That and the release notes are the best places to look.
Nothing incompatible with old kernels should be enabled by default,
but some new features (like the crush tunables) may require newer
kernel clients.

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt, quemu, ceph write cache settings

2013-07-17 Thread Josh Durgin

On 07/17/2013 05:59 AM, Maciej Gałkiewicz wrote:

Hello

Is there any way to verify that the cache is enabled? My machine is running
with the following parameters:

qemu-system-x86_64 -machine accel=kvm:tcg -name instance-0302 -S
-machine pc-i440fx-1.5,accel=kvm,usb=off -cpu
Westmere,+rdtscp,+avx,+osxsave,+xsave,+tsc-deadline,+pcid,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid
a5a1406b-6899-4f2e-9d86-d52814b7e6ff -smbios
type=1,manufacturer=OpenStack Foundation,product=OpenStack
Nova,version=2013.1.2,serial=1b059fc0-5bcb-11d9-9bb6-c860005ff2fe,uuid=a5a1406b-6899-4f2e-9d86-d52814b7e6ff
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0302.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=rbd:cinder_volumes/volume-2dbad334-1dd5-4e26-8c6f-0bd79dd43d98:id=cinder_volumes:key=XXX:auth_supported=cephx\;none,if=none,id=drive-virtio-disk0,format=raw,serial=2dbad334-1dd5-4e26-8c6f-0bd79dd43d98,cache=writeback
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive
file=rbd:cinder_volumes/volume-5c310c6a-942f-4039-9594-074ee633656a:id=cinder_volumes:key=XXX:auth_supported=cephx\;none,if=none,id=drive-virtio-disk1,format=raw,serial=5c310c6a-942f-4039-9594-074ee633656a,cache=writeback
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:e8:34:4c,bus=pci.0,addr=0x3
-chardev
file,id=charserial0,path=/var/lib/nova/instances/a5a1406b-6899-4f2e-9d86-d52814b7e6ff/console.log
-device isa-serial,chardev=charserial0,id=serial0 -chardev
pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1
-device usb-tablet,id=input0 -vnc 127.0.0.1:1  -k
en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6


You can verify that rbd_cache = true is set with the admin socket
command 'config get rbd_cache'. You'd need to set

   admin socket = /path/to/socket

in ceph.conf, then reattach the rbd device. librados will create the
admin socket as long as the user running qemu isn't prevented by unix
permissions, selinux, apparmor, etc.
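
For example (a sketch, using the admin socket path configured above):

ceph --admin-daemon /path/to/socket config get rbd_cache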


ceph.conf has entry:
[client]
   rbd_cache = true


The ceph.conf entry isn't needed since qemu 1.2 when you've got
cache=setting in the qemu command line.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt, quemu, ceph write cache settings

2013-07-18 Thread Josh Durgin

On 07/17/2013 11:39 PM, Maciej Gałkiewicz wrote:

I have created a VM with KVM 1.1.2 and all I had was rbd_cache configured
in ceph.conf. The cache option in libvirt was set to "none":


[libvirt domain XML stripped by the mail archive: two rbd-backed <disk>
elements with driver cache='none', with serials
f81d6108-d8c9-4e06-94ef-02b1943a873d and
9ab3e9b3-e153-447c-ab1d-2f8f9bae095c]

Config settings received from the admin socket show that the cache is enabled. I
thought that without configuring libvirt with cache options it was not
possible to force kvm to use it. Can you explain a little why it
works, or claims to work?


Setting rbd_cache=true in ceph.conf will make librbd turn on the cache
regardless of qemu. Setting qemu to cache=none tells qemu that it
doesn't need to send flush requests to the underlying storage, so it
does not do so. This means librbd is caching data, but qemu isn't
telling it to persist that data when the guest requests it. This is
the same as qemu's cache=unsafe mode, which makes it easy to get a
corrupt fs if the guest isn't shut down cleanly.

There's a ceph option to make this safer -
rbd_cache_writethrough_until_flush. If this and rbd_cache are true,
librbd will operate with the cache in writethrough mode until it is
sure that the guest using it is capable of sending flushes (i.e. qemu
has cache=writeback). Perhaps we should enable this by default so
people are less likely to accidentally use an unsafe configuration.
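
A sketch of the safer librbd configuration in ceph.conf:

[client]
    rbd cache = true
    rbd cache writethrough until flush = true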

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Libvirt, quemu, ceph write cache settings

2013-07-18 Thread Josh Durgin

On 07/18/2013 11:32 AM, Maciej Gałkiewicz wrote:

On 18 Jul 2013 20:25, "Josh Durgin" <josh.dur...@inktank.com> wrote:
 > Setting rbd_cache=true in ceph.conf will make librbd turn on the cache
 > regardless of qemu. Setting qemu to cache=none tells qemu that it
 > doesn't need to send flush requests to the underlying storage, so it
 > does not do so. This means librbd is caching data, but qemu isn't
 > telling it to persist that data when the guest requests it. This is
 > the same as qemu's cache=unsafe mode, which makes it easy to get a
 > corrupt fs if the guest isn't shut down cleanly.
 >
 > There's a ceph option to make this safer -
 > rbd_cache_writethrough_until_flush. If this and rbd_cache are true,
 > librbd will operate with the cache in writethrough mode until it is
 > sure that the guest using it is capable of sending flushes (i.e. qemu
 > has cache=writeback). Perhaps we should enable this by default so
 > people are less likely to accidentally use an unsafe configuration.

Ok. Now it makes sense. So the last question is how to make sure that
qemu actually operates with cache=writeback with rbd?



If the setting is in the qemu command line, it'll send flushes,
and you can verify that librbd is seeing them by doing a 'perf dump'
on the admin socket and looking at the aio_flush count there.
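
For example (a sketch; the socket path depends on your admin socket setting):

ceph --admin-daemon /var/run/ceph/guest.asok perf dump
# look for the aio_flush count in the librbd section of the output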

This makes me notice that the synchronous flush perf counter went
missing, so it'll always read 0 [1].

Josh

[1] http://tracker.ceph.com/issues/5668
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Kernel's rbd in 3.10.1

2013-07-25 Thread Josh Durgin

On 07/24/2013 09:37 PM, Mikaël Cluseau wrote:

Hi,

I have a bug in the 3.10 kernel under debian, whether with a self-compiled
linux-stable from git (built with make-kpkg) or with sid's package.

I'm using format-2 images (ceph version 0.61.6
(59ddece17e36fef69ecf40e239aeffad33c9db35)) to make snapshots and clones
of a database for development purposes. So I have a replay of the
database's logs on a ceph volume and I take snapshots at fixed points
in time: mount -> recover database until a given time -> umount ->
snapshot -> go back to 1.

In both cases, it works for a while (mount/umount cycles) and after some
time it gives me the following error on mount:

Jul 25 15:20:46 **host** kernel: [14623.808604] [ cut here
]
Jul 25 15:20:46 **host** kernel: [14623.808622] kernel BUG at
/build/linux-dT6LW0/linux-3.10.1/net/ceph/osd_client.c:2103!
Jul 25 15:20:46 **host** kernel: [14623.808641] invalid opcode: 
[#1] SMP
Jul 25 15:20:46 **host** kernel: [14623.808657] Modules linked in: cbc
rbd libceph nfsd auth_rpcgss oid_registry nfs_acl nfs lockd sunrpc
sha256_generic hmac nls_utf8 cifs dns_resolver fscache bridge stp llc
xfs loop coretemp kvm_intel kvm crc32c_intel psmouse serio_raw snd_pcm
snd_page_alloc snd_timer snd soundcore iTCO_wdt iTCO_vendor_support
i2c_i801 i7core_edac microcode pcspkr lpc_ich mfd_core joydev ioatdma
evdev edac_core acpi_cpufreq mperf button processor thermal_sys ext4
crc16 jbd2 mbcache btrfs xor zlib_deflate raid6_pq crc32c libcrc32c
raid1 ohci_hcd hid_generic usbhid hid sr_mod sg cdrom sd_mod crc_t10dif
dm_mod md_mod ata_generic ata_piix libata uhci_hcd ehci_pci ehci_hcd
scsi_mod usbcore usb_common igb i2c_algo_bit i2c_core dca ptp pps_core
Jul 25 15:20:46 **host** kernel: [14623.809005] CPU: 6 PID: 9583 Comm:
mount Not tainted 3.10-1-amd64 #1 Debian 3.10.1-1
Jul 25 15:20:46 **host** kernel: [14623.809024] Hardware name:
Supermicro X8DTU/X8DTU, BIOS 2.1b   12/30/2011
Jul 25 15:20:46 **host** kernel: [14623.809041] task: 88082dfa2840
ti: 88080e2c2000 task.ti: 88080e2c2000
Jul 25 15:20:46 **host** kernel: [14623.809059] RIP:
0010:[]  []
ceph_osdc_build_request+0x370/0x3e9 [libceph]
Jul 25 15:20:46 **host** kernel: [14623.809092] RSP:
0018:88080e2c39b8  EFLAGS: 00010216
Jul 25 15:20:46 **host** kernel: [14623.809120] RAX: 88082e589a80
RBX: 88082e589b72 RCX: 0007
Jul 25 15:20:46 **host** kernel: [14623.809151] RDX: 88082e589b6f
RSI: 88082afd9078 RDI: 88082b308258
Jul 25 15:20:46 **host** kernel: [14623.809182] RBP: 1000
R08: 88082e10a400 R09: 88082afd9000
Jul 25 15:20:46 **host** kernel: [14623.809213] R10: 8806bfb1cd60
R11: 88082d153c01 R12: 88080e88e988
Jul 25 15:20:46 **host** kernel: [14623.809243] R13: 0001
R14: 88080eb874d8 R15: 88080eb875b8
Jul 25 15:20:46 **host** kernel: [14623.809275] FS:
7f2c893b77e0() GS:88083fc4() knlGS:
Jul 25 15:20:46 **host** kernel: [14623.809322] CS:  0010 DS:  ES:
 CR0: 8005003b
Jul 25 15:20:46 **host** kernel: [14623.809350] CR2: ff600400
CR3: 0006bfbc6000 CR4: 07e0
Jul 25 15:20:46 **host** kernel: [14623.809381] DR0: 
DR1:  DR2: 
Jul 25 15:20:46 **host** kernel: [14623.809413] DR3: 
DR6: 0ff0 DR7: 0400
Jul 25 15:20:46 **host** kernel: [14623.809442] Stack:
Jul 25 15:20:46 **host** kernel: [14623.814598]  2201
88080e2c3a30 1000 88042edf2240
Jul 25 15:20:46 **host** kernel: [14623.814656]  0024a05cbb01
 88082e84f348 88080e2c3a58
Jul 25 15:20:46 **host** kernel: [14623.814710]  88080eb874d8
88080e9aa290 88027abc6000 1000
Jul 25 15:20:46 **host** kernel: [14623.814765] Call Trace:
Jul 25 15:20:46 **host** kernel: [14623.814793]  [] ?
rbd_osd_req_format_write+0x81/0x8c [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814827]  [] ?
rbd_img_request_fill+0x679/0x74f [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814865]  [] ?
should_resched+0x5/0x23
Jul 25 15:20:46 **host** kernel: [14623.814896]  [] ?
rbd_request_fn+0x180/0x226 [rbd]
Jul 25 15:20:46 **host** kernel: [14623.814929]  [] ?
__blk_run_queue_uncond+0x1e/0x26
Jul 25 15:20:46 **host** kernel: [14623.814960]  [] ?
blk_queue_bio+0x299/0x2e8
Jul 25 15:20:46 **host** kernel: [14623.814990]  [] ?
generic_make_request+0x96/0xd5
Jul 25 15:20:46 **host** kernel: [14623.815021]  [] ?
submit_bio+0x10a/0x13b
Jul 25 15:20:46 **host** kernel: [14623.815053]  [] ?
bio_alloc_bioset+0xd0/0x172
Jul 25 15:20:46 **host** kernel: [14623.815083]  [] ?
_submit_bh+0x1b7/0x1d4
Jul 25 15:20:46 **host** kernel: [14623.815117]  [] ?
__sync_dirty_buffer+0x4e/0x7b
Jul 25 15:20:46 **host** kernel: [14623.815164]  [] ?
ext4_commit_super+0x192/0x1db [ext4]
Jul 25 15:20:46 **host** kernel: [14623.815206]  [] ?
ext4_setup_super+0xff/0x146 [ext4]
Jul 25 15:20:46 *

Re: [ceph-users] Mounting RBD or CephFS on Ceph-Node?

2013-07-25 Thread Josh Durgin

On 07/23/2013 06:09 AM, Oliver Schulz wrote:

Dear Ceph Experts,

I remember reading that, at least in the past, it wasn't recommended
to mount Ceph storage on a Ceph cluster node. Given a recent kernel
(3.8/3.9) and sufficient CPU and memory resources on the nodes,
would it now be safe to

* Mount RBD or CephFS on a Ceph cluster node?


This will probably always be unsafe for kernel clients [1] [2].


* Run a VM that is based on RBD storage (libvirt?) and/or mounts
   CephFS on a Ceph node?


Using libvirt/qemu+librbd or ceph-fuse is fine, since they are
userspace. Using a kernel client inside a VM would work too.

Josh

[1] http://wiki.ceph.com/03FAQs/01General_FAQ#How_Can_I_Give_Ceph_a_Try.3F
[2] http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/1673
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack glance ceph rbd_store_user authentification problem

2013-08-08 Thread Josh Durgin

On 08/08/2013 06:01 AM, Steffen Thorhauer wrote:

Hi,
recently I had a problem with openstack glance and ceph.
I used the
http://ceph.com/docs/master/rbd/rbd-openstack/#configuring-glance
documentation and
http://docs.openstack.org/developer/glance/configuring.html documentation
I'm using ubuntu 12.04 LTS with grizzly from Ubuntu Cloud Archive and
ceph 61.7.

glance-api.conf had following config options

default_store = rbd
rbd_store_user=images
rbd_store_pool = images
rbd_store_ceph_conf = /etc/ceph/ceph.conf


Every time I do a glance image create I get errors. In the glance
api log I only found errors like

2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images Traceback (most
recent call last):
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/glance/api/v1/images.py", line 444, in
_upload
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images image_meta['size'])
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/glance/store/rbd.py", line 241, in add
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images with
rados.Rados(conffile=self.conf_file, rados_id=self.user) as conn:
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/rados.py", line 134, in __enter__
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images self.connect()
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images   File
"/usr/lib/python2.7/dist-packages/rados.py", line 192, in connect
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images raise
make_ex(ret, "error calling connect")
2013-08-08 10:25:38.021 5725 TRACE glance.api.v1.images ObjectNotFound:
error calling connect

This trace message didn't help me very much :-(
My google search for "glance.api.v1.images ObjectNotFound: error calling
connect" only found
http://irclogs.ceph.widodh.nl/index.php?date=2012-10-26
This points me to a ceph authentication problem. But the ceph tools
worked fine for me.
Then I tried the debug option in glance-api.conf and I found the following
entry.

DEBUG glance.common.config [-] rbd_store_pool = images
log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485
DEBUG glance.common.config [-] rbd_store_user = glance
log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1485

The glance-api service  did not use my rbd_store_user = images option!!
Then I configured a client.glance auth and it worked with the
"implicit" glance user!!!

Now my question: Am I the only one with this problem??


I've seen people have this issue before due to the way the 
glance-api.conf can have multiple sections.


Make sure those rbd settings are in the [DEFAULT] section, not just
at the bottom of the file (which may be a different section).


Regards,
   Steffen Thorhauer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

2013-08-08 Thread Josh Durgin

On 08/08/2013 05:40 AM, Oliver Francke wrote:

Hi Josh,

I have a session logged with:

 debug_ms=1:debug_rbd=20:debug_objectcacher=30

as you requested from Mike, even if I think we have another story
here, anyway.

Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is
3.2.0-51-amd...

Do you want me to open a ticket for that stuff? I have about 5MB
compressed logfile waiting for you ;)


Yes, that'd be great. If you could include the time when you saw the 
guest hang that'd be ideal. I'm not sure if this is one or two bugs,
but it seems likely it's a bug in rbd and not qemu.

Thanks!
Josh


Thnx in advance,

Oliver.

On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:

On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:

On 02.08.2013 at 23:47, Mike Dawson wrote:

We can "un-wedge" the guest by opening a NoVNC session or running a
'virsh screenshot' command. After that, the guest resumes and runs
as expected. At that point we can examine the guest. Each time we'll
see:

If virsh screenshot works then this confirms that QEMU itself is still
responding.  Its main loop cannot be blocked since it was able to
process the screendump command.

This supports Josh's theory that a callback is not being invoked.  The
virtio-blk I/O request would be left in a pending state.

Now here is where the behavior varies between configurations:

On a Windows guest with 1 vCPU, you may see the symptom that the guest no
longer responds to ping.

On a Linux guest with multiple vCPUs, you may see the hung task message
from the guest kernel because other vCPUs are still making progress.
Just the vCPU that issued the I/O request and whose task is in
UNINTERRUPTIBLE state would really be stuck.

Basically, the symptoms depend not just on how QEMU is behaving but also
on the guest kernel and how many vCPUs you have configured.

I think this can explain how both problems you are observing, Oliver and
Mike, are a result of the same bug.  At least I hope they are :).

Stefan





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

2013-08-10 Thread Josh Durgin

On 08/09/2013 08:03 AM, Stefan Hajnoczi wrote:

On Fri, Aug 09, 2013 at 03:05:22PM +0100, Andrei Mikhailovsky wrote:

I can confirm that I am having similar issues with ubuntu vm guests using fio 
with bs=4k direct=1 numjobs=4 iodepth=16. Occasionally I see hung tasks, 
occasionally the guest vm stops responding without leaving anything in the logs, and 
sometimes I see a kernel panic on the console. I typically set the runtime of 
the fio test to 60 minutes and it tends to stop responding after about 10-30 
mins.

I am on ubuntu 12.04 with 3.5 kernel backport and using ceph 0.61.7 with qemu 
1.5.0 and libvirt 1.0.2


Oliver's logs show one aio_flush() never getting completed, which
means it's an issue with aio_flush in librados when rbd caching isn't
used.

Mike's log is from a qemu without aio_flush(), and with caching turned 
on, and shows all flushes completing quickly, so it's a separate bug.



Josh,
In addition to the Ceph logs you can also use QEMU tracing with the
following events enabled:
virtio_blk_handle_write
virtio_blk_handle_read
virtio_blk_rw_complete

See docs/tracing.txt for details on usage.

Inspecting the trace output will let you observe the I/O request
submission/completion from the virtio-blk device perspective.  You'll be
able to see whether requests are never being completed in some cases.


Thanks for the info. That may be the best way to check what's happening
when caching is enabled. Mike, could you recompile qemu with tracing
enabled and get a trace of the hang you were seeing, in addition to
the ceph logs?
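
A rough sketch of how that might look with qemu's simple trace backend (the
configure flag and -trace syntax vary by qemu version, so treat this as an
outline only):

./configure --enable-trace-backend=simple ...
cat > /tmp/events <<EOF
virtio_blk_handle_write
virtio_blk_handle_read
virtio_blk_rw_complete
EOF
qemu-system-x86_64 -trace events=/tmp/events ...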


This bug seems like a corner case or race condition since most requests
seem to complete just fine.  The problem is that eventually the
virtio-blk device becomes unusable when it runs out of descriptors (it
has 128).  And before that limit is reached the guest may become
unusable due to the hung I/O requests.


It seems only one request hung from an important kernel thread in
Oliver's case, but it's good to be aware of the descriptor limit.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-12 Thread Josh Durgin

On 08/12/2013 10:19 AM, PJ wrote:

Hi All,

Before going into the issue description, here are our hardware configurations:
- Physical machine * 3: each has quad-core CPU * 2, 64+ GB RAM, HDD * 12
(500GB ~ 1TB per drive; 1 for system, 11 for OSDs). The ceph OSDs are on
the physical machines.
- Each physical machine runs 5 virtual machines. One VM acts as a ceph MON
(i.e. 3 MONs in total), and the other 4 VMs provide either iSCSI or FTP/NFS
service.
- Physical machines and virtual machines are based on the same software
configuration: Ubuntu 12.04 + kernel 3.6.11, ceph v0.61.7


The issues we met are,

1. Right after ceph installation, creating a pool, then creating an image and
mapping it is no problem. But if we do not use the environment for more than half
a day, the same process (create pool -> create image -> map image) will
return an error: no such file or directory (ENOENT). Once the issue occurs,
it can easily be reproduced by the same process. But the issue may
disappear if we wait 10+ minutes after pool creation. Rebooting the system also
avoids it.


This sounds similar to http://tracker.ceph.com/issues/5925 - and
your case suggests it may be a monitor bug, since that test is userspace
and you're using the kernel client. Could you reproduce
this with logs from your monitors from the time of pool creation to
after the map fails with ENOENT, and these log settings on all mons:

debug ms = 1
debug mon = 20
debug paxos = 10
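
These can also be injected into the running monitors without a restart, e.g.
(a sketch; repeat for each mon id):

ceph tell mon.0 injectargs '--debug-ms 1 --debug-mon 20 --debug-paxos 10'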

If you could attach those logs to the bug or otherwise make them
available that'd be great.


I logged straces of both a successful and a failed run on the same virtual
machine (the one that provides FTP/NFS):
success: https://www.dropbox.com/s/u8jc4umak24kr1y/rbd_done.txt
failed: https://www.dropbox.com/s/ycuupmmrlc4d0ht/rbd_failed.txt


Unfortunately these won't tell us much since the kernel is doing all the
work with rbd map.


2. The second issue is to create two images (AAA and BBB) under one pool
(xxx), if we map "rbd -p xxx image AAA", the result is success but it
shows BBB under /dev/rbd/xxx/. Use "rbd showmapped", it shows "AAA" of
pool xxx is mapped. I am not sure which one is really mapped because
both images are empty. This issue is hard to reproduce, but once it happens
/dev/rbd/ is messed up.


That sounds very strange, since 'rbd showmapped' and the udev rule that
creates the /dev/rbd/pool/image symlinks use the same data source -
/sys/bus/rbd/N/name. This sounds like a race condition where sysfs is
being read (and reading stale memory) before the kernel finishes
populating it. Could you file this in the tracker? Checking whether
it still occurs in linux 3.10 would be great too. It doesn't seem
possible with the current code.


One more question, but not about the rbd map issues. Our usage is to map one
rbd device and mount it in several places (in one virtual machine) for
iSCSI, FTP and NFS; does that cause any problems for ceph operation?


If it's read-only everywhere, it's fine, but otherwise you'll run into
problems unless you've got something on top of rbd managing access to
it, like ocfs2. You could use nfs on top of one rbd device, but having
multiple nfs servers on top of the same rbd device won't work unless
they can coordinate with each other. The same applies to iscsi and ftp.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-12 Thread Josh Durgin

[re-adding ceph-users so others can benefit from the archives]

On 08/12/2013 07:18 PM, PJ wrote:

2013/8/13 Josh Durgin :

On 08/12/2013 10:19 AM, PJ wrote:


Hi All,

Before going into the issue description, here are our hardware configurations:
- Physical machine * 3: each has quad-core CPU * 2, 64+ GB RAM, HDD * 12
(500GB ~ 1TB per drive; 1 for system, 11 for OSDs). The ceph OSDs are on
the physical machines.
- Each physical machine runs 5 virtual machines. One VM acts as a ceph MON
(i.e. 3 MONs in total), and the other 4 VMs provide either iSCSI or FTP/NFS
service.
- Physical machines and virtual machines are based on the same software
configuration: Ubuntu 12.04 + kernel 3.6.11, ceph v0.61.7


The issues we met are,

1. Right after ceph installation, creating a pool, then creating an image and
mapping it is no problem. But if we do not use the environment for more than half
a day, the same process (create pool -> create image -> map image) will
return an error: no such file or directory (ENOENT). Once the issue occurs,
it can easily be reproduced by the same process. But the issue may
disappear if we wait 10+ minutes after pool creation. Rebooting the system also
avoids it.



This sounds similar to http://tracker.ceph.com/issues/5925 - and
your case suggests it may be a monitor bug, since that test is userspace
and you're using the kernel client. Could you reproduce
this with logs from your monitors from the time of pool creation to
after the map fails with ENOENT, and these log settings on all mons:

debug ms = 1
debug mon = 20
debug paxos = 10

If you could attach those logs to the bug or otherwise make them
available that'd be great.



We will add these settings to gather the log. By the way, we try to
avoid this issue by using the default pool (rbd) only. Will it be
useful in this case?


No, the case I'm interested in is when the 'rbd map' fails because
there's a new pool.




I logged straces of both a successful and a failed run on the same virtual
machine (the one that provides FTP/NFS):
success: https://www.dropbox.com/s/u8jc4umak24kr1y/rbd_done.txt
failed: https://www.dropbox.com/s/ycuupmmrlc4d0ht/rbd_failed.txt



Unfortunately these won't tell us much since the kernel is doing all the
work with rbd map.



2. The second issue is to create two images (AAA and BBB) under one pool
(xxx), if we map "rbd -p xxx image AAA", the result is success but it
shows BBB under /dev/rbd/xxx/. Use "rbd showmapped", it shows "AAA" of
pool xxx is mapped. I am not sure which one is really mapped because
both images are empty. This issue is hard to reproduce, but once it happens
/dev/rbd/ is messed up.



That sounds very strange, since 'rbd showmapped' and the udev rule that
creates the /dev/rbd/pool/image symlinks use the same data source -
/sys/bus/rbd/N/name. This sounds like a race condition where sysfs is
being read (and reading stale memory) before the kernel finishes
populating it. Could you file this in the tracker?


I will file to tracker.


Checking whether it still occurs in linux 3.10 would be great too. It doesn't 
seem
possible with the current code.



Current code means Linux kernel 3.10 or 3.6?


Current code in 3.10 doesn't look like this issue is possible, unless
I'm missing something. There's been a lot of refactoring since 3.6
though, so it's possible the bug was fixed accidentally.


One more question, but not about the rbd map issues. Our usage is to map one
rbd device and mount it in several places (in one virtual machine) for
iSCSI, FTP and NFS; does that cause any problems for ceph operation?



If it's read-only everywhere, it's fine, but otherwise you'll run into
problems unless you've got something on top of rbd managing access to
it, like ocfs2. You could use nfs on top of one rbd device, but having
multiple nfs servers on top of the same rbd device won't work unless
they can coordinate with each other. The same applies to iscsi and ftp.



If the target rbd device is only mapped on one virtual machine, we format it as
ext4 and mount it in two places:
   mount /dev/rbd0 /nfs --> for nfs server usage
   mount /dev/rbd0 /ftp  --> for ftp server usage
The nfs and ftp servers run on the same virtual machine. Will the file system
(ext4) help to handle the simultaneous access from nfs and ftp?


I doubt that'll work perfectly on a normal disk, although rbd should
behave the same in this case. There are bound to be some
issues when the same files are modified at once by the ftp and nfs
servers. You could run ftp on an nfs client on a different machine
safely.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Glance image upload errors after upgrading to Dumpling

2013-08-14 Thread Josh Durgin

On 08/14/2013 02:22 PM, Michael Morgan wrote:

Hello Everyone,

  I have a Ceph test cluster doing storage for an OpenStack Grizzly platform
(also testing). Upgrading to 0.67 went fine on the Ceph side with the cluster
showing healthy but suddenly I can't upload images into Glance anymore. The
upload fails and glance-api throws an error:

2013-08-14 15:19:55.898 ERROR glance.api.v1.images 
[4dcd9de0-af65-4902-a36d-afc5497605e7 3867c65db6cc48398a0f57ce53144e69 
5dbca756421c4a3eb0a1cc2f1ee3c67c] Failed to upload image
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images Traceback (most recent 
call last):
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images   File 
"/usr/lib/python2.6/site-packages/glance/api/v1/images.py", line 444, in _upload
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images image_meta['size'])
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images   File 
"/usr/lib/python2.6/site-packages/glance/store/rbd.py", line 241, in add
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images with 
rados.Rados(conffile=self.conf_file, rados_id=self.user) as conn:
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images   File 
"/usr/lib/python2.6/site-packages/rados.py", line 195, in __init__
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images raise Error("Rados(): 
can't supply both rados_id and name")
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images Error: Rados(): can't 
supply both rados_id and name
2013-08-14 15:19:55.898 24740 TRACE glance.api.v1.images


This would be a backwards-compatibility regression in the librados
python bindings - a fix is in the dumpling branch, and a point
release is in the works. You could add name=None to that rados.Rados()
call in glance to work around it in the meantime.
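
That is, the line from the traceback above would become something like:

with rados.Rados(conffile=self.conf_file, rados_id=self.user, name=None) as conn: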

Josh


  I'm not sure if there's a patch I need to track down for Glance or if I missed
a change in the necessary Glance/Ceph setup. Is anyone else seeing this
behavior? Thanks!

-Mike


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD and balanced reads

2013-08-20 Thread Josh Durgin

On 08/19/2013 11:24 AM, Gregory Farnum wrote:

On Mon, Aug 19, 2013 at 9:07 AM, Sage Weil  wrote:

On Mon, 19 Aug 2013, Sébastien Han wrote:

Hi guys,

While reading a developer doc, I came across the following options:

* osd balance reads = true
* osd shed reads = true
* osd shed reads min latency
* osd shed reads min latency diff

The problem is that I can't find any of these options in config_opts.h.


These are left over from an old unimplemented experiment and were removed
a while back.


Loic Dachary also gave me a flag that he found from the code.

m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS)

So my questions are:

* Which from the above flags are correct?
* Do balanced reads really exist in RBD?


For localized reads you want

OPTION(rbd_balance_snap_reads, OPT_BOOL, false)
OPTION(rbd_localize_snap_reads, OPT_BOOL, false)

Note that the 'localize' logic is still very primitive (it matches by IP
address).  There is a blueprint to improve this:

 
http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librados%2F%2Fobjecter%3A_smarter_localized_reads


Also, there are some issues with read/write consistency when using
localized reads because the replicas do not provide the ordering
guarantees that primaries will. See
http://tracker.ceph.com/issues/5388
At present localized reads are really only suitable for spreading the
load on write-once, read-many workloads.


To be clear, those issues don't apply to these rbd options, since they 
only affect reading from snapshots, which are inherently read-only.

This is particularly useful for cloned images reading from their parent
snapshot.
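
A sketch of enabling this on the client side in ceph.conf:

[client]
    rbd balance snap reads = true
    ; or, to prefer a replica with a matching IP:
    ; rbd localize snap reads = true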

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpenStack Cinder + Ceph, unable to remove unattached volumes, still watchers

2013-08-20 Thread Josh Durgin

On 08/20/2013 11:20 AM, Vincent Hurtevent wrote:



I'm not the end user. It's possible that the volume has been detached
without unmounting.

As the volume is unattached and the initial kvm instance is down, I was
expecting the rbd volume to be properly unlocked even if the guest unmount
hasn't been done, like a physical disk in fact.


Yes, detaching the volume will remove the watch regardless of the guest
having it mounted.


Which part of the Ceph system is always locked or marked in use? Do we
have to go to the rados object level?
The data can be destroyed.


It's a watch on the rbd header object, registered when the rbd volume
is attached, and unregistered when it is detached or 30 seconds after
the qemu/kvm process using it dies.

From rbd info you can get the id of the image (part of the
block_name_prefix), and use the rados tool to see what ip is watching
the volume's header object, i.e.:

$ rbd info volume-name | grep prefix
block_name_prefix: rbd_data.102f74b0dc51
$ rados -p rbd listwatchers rbd_header.102f74b0dc51
watcher=192.168.106.222:0/1029129 client.4152 cookie=1


Could rebooting the compute nodes clean the librbd layer and clear the watchers?


Yes, because this would kill all the qemu/kvm processes.

Josh



De : Don Talton (dotalton) [dotal...@cisco.com]
Date d'envoi : mardi 20 août 2013 19:57
À : HURTEVENT VINCENT
Objet : RE: [ceph-users] OpenStack Cinder + Ceph, unable to remove
unattached volumes, still watchers

Did you unmounts them in the guest before detaching?

 > -Original Message-
 > From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
 > boun...@lists.ceph.com] On Behalf Of Vincent Hurtevent
 > Sent: Tuesday, August 20, 2013 10:33 AM
 > To: ceph-us...@ceph.com
 > Subject: [ceph-users] OpenStack Cinder + Ceph, unable to remove
 > unattached volumes, still watchers
 >
 > Hello,
 >
 > I'm using Ceph as a Cinder backend. Actually it's working pretty well
 > and some users are using this cloud platform for a few weeks, but I came
 > back from vacation and I've got some errors removing volumes, errors I
 > didn't have a few weeks ago.
 >
 > Here's the situation :
 >
 > Volumes are unattached, but Ceph is telling Cinder (or me, when I try
 > to remove them through the rbd tools) that the volume still has watchers.
 >
 > rbd --pool cinder rm volume-46e241ee-ed3f-446a-87c7-1c9df560d770
 > Removing image: 99% complete...failed.
 > rbd: error: image still has watchers
 > This means the image is still open or the client using it crashed. Try
 > again after closing/unmapping it or waiting 30s for the crashed client
 > to timeout.
 > 2013-08-20 19:17:36.075524 7fedbc7e1780 -1 librbd: error removing
 > header: (16) Device or resource busy
 >
 >
 > The kvm instances on which the volumes have been attached are now
 > terminated. There's no lock on the volume using 'rbd lock list'.
 >
 > I restarted all the monitors (3) one by one, with no better success.
 >
 > From the OpenStack PoV, these volumes are indeed unattached.
 >
 > How can I unlock the volumes or trace back the watcher/process? These
 > could be on several different compute nodes.
 >
 >
 > Thank you for any hint,
 >
 >

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] locking rbd device

2013-08-26 Thread Josh Durgin

On 08/26/2013 12:03 AM, Wolfgang Hennerbichler wrote:

hi list,

I realize there's a command called "rbd lock" to lock an image. Can libvirt use 
this to prevent virtual machines from being started simultaneously on different 
virtualisation containers?

wogri


Yes - that's the reason for lock command's existence. You have
to be careful with things like live migration though, which will have
the device open in two places while migration completes. If libvirt
could use rbd locks like it uses its sanlock plugin, it would be
able to deal with this correctly.
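
For reference, the basic commands look like this (a sketch with made-up
names; the locker id for remove comes from the lock list output):

rbd lock add pool/image my-lock-id
rbd lock list pool/image
rbd lock remove pool/image my-lock-id client.4152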

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] locking rbd device

2013-08-26 Thread Josh Durgin

On 08/26/2013 01:49 PM, Josh Durgin wrote:

On 08/26/2013 12:03 AM, Wolfgang Hennerbichler wrote:

hi list,

I realize there's a command called "rbd lock" to lock an image. Can
libvirt use this to prevent virtual machines from being started
simultaneously on different virtualisation containers?

wogri


Yes - that's the reason for lock command's existence. You have
to be careful with things like live migration though, which will have
the device open in two places while migration completes. If libvirt
could use rbd locks like it uses its sanlock plugin, it would be
able to deal with this correctly.


To be clear, libvirt doesn't use rbd locking at all right now, but it
could probably be patched to do so without too much effort.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real size of rbd image

2013-08-27 Thread Josh Durgin

On 08/27/2013 01:39 PM, Timofey Koolin wrote:

Is there a way to know the real size of an rbd image and its snapshots?
rbd ls -l shows the declared size of the image, but I want to know the real size.


You can sum the sizes of the extents reported by:

rbd diff pool/image[@snap] [--format json]

That's the difference since the beginning of time, so it reports all
used extents.
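
For example, a one-liner to sum them (assuming the JSON output is a list of
extents with 'length' fields):

rbd diff pool/image --format json | python -c \
  'import json,sys; print(sum(e["length"] for e in json.load(sys.stdin)))'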

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Location field empty in Glance when instance to image

2013-08-30 Thread Josh Durgin

On 08/30/2013 03:40 AM, Toni F. [ackstorm] wrote:

Sorry, wrong list

Anyway I'll take this opportunity to ask two questions:

Does somebody know how I can download an image or snapshot?


Cinder has no way to export them, but you can use:

rbd export pool/image@snap /path/to/file


How is the direct URL built?

rbd://9ed296cb-e9a7-4d36-b728-0ddc5f249ca0/images/7729788f-b80a-4d90-b3c7-6f61f5ebd535/snap


The format is rbd://fsid/pool/image/snapshot

fsid is a unique id for a ceph cluster.
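
The fsid can be read from a running cluster, e.g.:

ceph fsid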


This is from an image.

I need to build this direct URL for a snapshot and I don't know how.


In this case it's a cinder snapshot, and you've already found it in the
rbd snap ls output.

Josh


Thanks
Regards

On 30/08/13 12:27, Toni F. [ackstorm] wrote:

Hi all,

With a running boot-from-volume instance backed by ceph, I launch the
command to create an image from the instance. All seems to work fine, but
if I look in the database I notice that the location is empty

mysql> select * from images where
id="b7674970-5d60-41da-bbb9-2ef10955fbbe" \G;
*** 1. row ***
  id: b7674970-5d60-41da-bbb9-2ef10955fbbe
name: snapshot_athena326
size: 0
  status: active
   is_public: 1
*location: NULL*
  created_at: 2013-08-29 14:41:16
  updated_at: 2013-08-29 14:41:16
  deleted_at: NULL
 deleted: 0
 disk_format: raw
container_format: bare
checksum: 8e79e146ce5d2c71807362730e7b5a3b
   owner: 36d462972b1d49c5850ca864b6f39d05
min_disk: 0
 min_ram: 0
   protected: 0
1 row in set (0.00 sec)

Bug?

Aditional info

# glance index

ID                                   Name                    Disk Format Container Format Size
------------------------------------ ----------------------- ----------- ---------------- ----------
7729788f-b80a-4d90-b3c7-6f61f5ebd535 Ubuntu 12.04 LTS 32bits raw         bare             2147483648
b0692408-6bcf-40b1-94c6-672154d5d7eb Ubuntu 12.04 LTS 64bits raw         bare             2147483648

I created a instance from image 7729788f-b80a-4d90-b3c7-6f61f5ebd535

#nova list

+--------------------------------------+-----------+--------+----------------------------------------+
| ID                                   | Name      | Status | Networks                               |
+--------------------------------------+-----------+--------+----------------------------------------+
| bffd1b30-5690-4d2f-9347-1f0b7202ee6d | athena326 | ACTIVE | Private_15=10.128.3.195, 88.87.208.155 |
+--------------------------------------+-----------+--------+----------------------------------------+


#nova image-create bffd1b30-5690-4d2f-9347-1f0b7202ee6d snapshot_athena326

///LOGS in cinder_volume

2013-08-29 16:41:16 INFO cinder.volume.manager
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: creating
2013-08-29 16:41:16 DEBUG cinder.volume.manager
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: creating
create_snapshot
/usr/lib/python2.7/dist-packages/cinder/volume/manager.py:234
2013-08-29 16:41:16 DEBUG cinder.utils
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
Running cmd (subprocess): rbd snap create --pool volumes --snap
snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68
volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd execute
/usr/lib/python2.7/dist-packages/cinder/utils.py:167
2013-08-29 16:41:17 DEBUG cinder.utils
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
Running cmd (subprocess): rbd --help execute
/usr/lib/python2.7/dist-packages/cinder/utils.py:167
2013-08-29 16:41:17 DEBUG cinder.utils
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
Running cmd (subprocess): rbd snap protect --pool volumes --snap
snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68
volume-1b1e9684-05fa-4d8b-90a3-5bd2031c28bd execute
/usr/lib/python2.7/dist-packages/cinder/utils.py:167
2013-08-29 16:41:17 DEBUG cinder.volume.manager
[req-8fc22aae-a516-4f62-a836-99f63f86f144
55b70876b2d24eb393da5119cb2b8ee4 36d462972b1d49c5850ca864b6f39d05]
snapshot snapshot-7a41d848-6d35-47a6-b3ce-7be1d3643e68: created
successfully create_snapshot
/usr/lib/python2.7/dist-packages/cinder/volume/manager.py:249

///LOGS in cinder_volume

root@nova-volume-lnx001:/home/ackstorm# glance index

ID                                   Name               Disk Format Container Format Size
------------------------------------ ------------------ ----------- ---------------- ----
b7674970-5d60-41da-bbb9-2ef10955fbbe snapshot_athena326 raw         bare             0
7729788f-

Re: [ceph-users] ceph and incremental backups

2013-08-30 Thread Josh Durgin

On 08/30/2013 02:22 PM, Oliver Daudey wrote:

Hey Mark,

On vr, 2013-08-30 at 13:04 -0500, Mark Chaney wrote:

Full disclosure, I have zero experience with openstack and ceph so far.

If I am going to use a Ceph RBD cluster to store my kvm instances, how
should I be doing backups?

1) I would prefer them to be incremental so that a whole backup doesnt
have to happen every night.


Others, correct me if I'm wrong, just starting to play with this feature
myself. :-)

You create a base snapshot, which you re-create and back up, say, every
month, and then you create another snapshot, say, every day.  Ceph RBD
then allows you to generate a file with just the deltas between the two
snapshots.  You back up those deltas and then either throw away the
daily snapshot, or keep it around for online restores, if you want.  In
case of catastrophic failure of your RBD pool, you first recover the
base snapshots and then merge in the most recent daily deltas for that
month to get at your most recent state.


Yes, that's right. You may be interested in openstack cinder's backup
service as well, which recently got support for doing incremental
backups like this between ceph clusters [1]. In the future it could
potentially store the diffs somewhere else, instead of using a separate
ceph cluster.

Josh

[1] https://blueprints.launchpad.net/cinder/+spec/cinder-backup-to-ceph


See: http://ceph.com/dev-notes/incremental-snapshots-with-rbd/
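
A minimal sketch of that workflow, with made-up image names and paths:

# monthly full backup
rbd snap create pool/img@base
rbd export-diff pool/img@base /backup/img-base.diff
# daily incremental
rbd snap create pool/img@day1
rbd export-diff --from-snap base pool/img@day1 /backup/img-day1.diff
# restore onto a fresh image
rbd create --size 102400 pool/img-restore
rbd import-diff /backup/img-base.diff pool/img-restore
rbd import-diff /backup/img-day1.diff pool/img-restore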


2) I would also like the instances to obviously stay online during the
backup


No problem.  Before starting your backup of an image, you first make a
snapshot of it and then backup the snapshot instead.  Assuming you use a
journaled filesystem on your VMs, your general data should be consistent
in almost all cases.  Specific applications, like databases, might still
not recover very well and may need to be put in a consistent state
within the VM before generating the snapshot.  Snapshot-generation on
RBD is very fast and when finished, you can resume the VM and do your
backups off the snapshot.  Test what happens if you try to restore and
use your backups.  Regularly.. :-)


3) Backups will be stored off the ceph cluster on slower sata drives on
another storage server.


Sounds like a good idea.  Store it off-site or at least in another
location in the building, if possible.



Regards,

   Oliver

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] from whom and when will rbd_cache* be read

2013-09-01 Thread Josh Durgin

On 09/01/2013 03:35 AM, Kasper Dieter wrote:

Hi,

under
http://eu.ceph.com/docs/wip-rpm-doc/config-cluster/rbd-config-ref/
I found a good description about RBD cache parameters.


You're looking at an old branch there - the current description is a bit 
more clear that this doesn't affect rbd.ko at all:


http://eu.ceph.com/docs/master/rbd/rbd-config-ref/


But, I am missing information
- by whom these parameters are evaluated and
- when will this happen ?


They will be read by librbd when an image is opened. For qemu,
this means detaching and reattaching a disk will re-read the settings.

They can't be changed on the fly right now, though it's possible to
make that work with librbd itself. Note that qemu needs to think
it's doing writeback caching in order to pass guest flushes through
to librbd.
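
As a sketch, librbd options can also be passed per-image in the qemu drive
string, alongside the qemu cache mode:

qemu-system-x86_64 ... \
  -drive file=rbd:pool/image:rbd_cache=true,format=raw,cache=writeback,if=virtio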

Josh


My assumption:
- the rbd_cache* parameter will be read by MONs
- rbd.ko will contact MON during 'insmod' or during 'open /dev/rbd*'
and get the rbd_cache info

Please confirm / correct.


Can I change the parameter only via ceph.conf + restart of MONs,
or can I change rbd_cache on runtime?
e.g.


# ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok   config show | grep 
rbd_cache
   "rbd_cache": "false",
   "rbd_cache_writethrough_until_flush": "false",
   "rbd_cache_size": "33554432",
   "rbd_cache_max_dirty": "25165824",
   "rbd_cache_target_dirty": "16777216",
   "rbd_cache_max_dirty_age": "1",
   "rbd_cache_block_writes_upfront": "false",

# ceph mon tell   0 injectargs '--rbd_cache true'

# ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok   config show | grep 
rbd_cache
   "rbd_cache": "true",
   "rbd_cache_writethrough_until_flush": "false",
   "rbd_cache_size": "33554432",
   "rbd_cache_max_dirty": "25165824",
   "rbd_cache_target_dirty": "16777216",
   "rbd_cache_max_dirty_age": "1",
   "rbd_cache_block_writes_upfront": "false",


Regards,
-Dieter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cp copies of sparse files become fully allocated

2013-09-09 Thread Josh Durgin

On 09/09/2013 04:57 AM, Andrey Korolyov wrote:

May I also suggest the same for export/import mechanism? Say, if image
was created by fallocate we may also want to leave holes upon upload
and vice-versa for export.


Import and export already omit runs of zeroes. They could detect
smaller runs (currently they look at object size chunks), and export
might be more efficient if it used diff_iterate() instead of
read_iterate(). Have you observed them misbehaving with sparse images?


On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil  wrote:

On Sat, 7 Sep 2013, Oliver Daudey wrote:

Hey all,

This topic has been partly discussed here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html

Tested on Ceph version 0.67.2.

If you create a fresh empty image of, say, 100GB in size on RBD and then
use "rbd cp" to make a copy of it, even though the image is sparse, the
command will attempt to read every part of it and take far more time
than expected.

After reading the above thread, I understand why the copy of an
essentially empty sparse image on RBD would take so long, but it doesn't
explain why the copy won't be sparse itself.  If I use "rbd cp" to copy
an image, the copy will take its full allocated size on disk, even if
the original was empty.  If I use the QEMU "qemu-img" tool's
"convert" option to convert the original image to the copy without
changing the format, essentially only making a copy, it takes its time
as well, but will be faster than "rbd cp" and the resulting copy will be
sparse.

Example-commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3

Shouldn't "rbd cp" at least have an option to attempt to sparsify the
copy, or copy the sparse parts as sparse?  Same goes for "rbd clone",
BTW.


Yep, this is in fact a bug.  Opened http://tracker.ceph.com/issues/6257.

Thanks!
sage


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] blockdev --setro cannot set krbd to readonly

2013-09-09 Thread Josh Durgin

On 09/08/2013 01:14 AM, Da Chun Ng wrote:

I mapped an image to a system, and used blockdev to make it readonly.
But it failed.
[root@ceph0 mnt]# blockdev --setro /dev/rbd2
[root@ceph0 mnt]# blockdev --getro /dev/rbd2
0

It's on Centos6.4 with kernel 3.10.6 .
Ceph 0.61.8 .

Any idea?


For reasons I can't understand right now, calling set_device_ro(bdev, ro)
in the driver seems to prevent future BLKROSET ioctls from having any
effect, even though they should be calling exactly the same function.
The rbd driver always calls set_device_ro() right now, which causes
the problem.

Presumably there's some cached information that isn't updated if the
driver set the flags during device initialization. There's no reason
you shouldn't be able to change it for non-snapshot mappings though.

I added http://tracker.ceph.com/issues/6265 to track this.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] status of glance/cinder/nova integration in openstack grizzly

2013-09-10 Thread Josh Durgin

On 09/10/2013 01:50 PM, Darren Birkett wrote:

One last question: I presume the fact that the 'volume_image_metadata'
field is not populated when cloning a glance image into a cinder volume
is a bug?  It means that the cinder client doesn't show the volume as
bootable, though I'm not sure what other detrimental effect it actually
has (clearly the volume can be booted from).


I think this is populated in Havana, but nothing actually uses that
field still afaik. It's just a proxy for 'was this volume created from
an image'.

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cp copies of sparse files become fully allocated

2013-09-10 Thread Josh Durgin

On 09/10/2013 01:51 AM, Andrey Korolyov wrote:

On Tue, Sep 10, 2013 at 3:03 AM, Josh Durgin  wrote:

On 09/09/2013 04:57 AM, Andrey Korolyov wrote:


May I also suggest the same for export/import mechanism? Say, if image
was created by fallocate we may also want to leave holes upon upload
and vice-versa for export.



Import and export already omit runs of zeroes. They could detect
smaller runs (currently they look at object size chunks), and export
might be more efficient if it used diff_iterate() instead of
read_iterate(). Have you observed them misbehaving with sparse images?




Did you mean dumpling? When I checked some months ago, cuttlefish
did not have such a feature.


It's been there at least since bobtail. Export to stdout can't be sparse
though, since you can't seek stdout. Import and export haven't changed
much in a while, and the sparse detection certainly still works on
master (just tried with an empty 1G file).
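
A quick way to check this yourself (a sketch):

rbd create --size 1024 rbd/empty
rbd export rbd/empty /tmp/empty.img
ls -lh /tmp/empty.img   # apparent size: 1G
du -sh /tmp/empty.img   # blocks actually allocated: close to zero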


On Mon, Sep 9, 2013 at 8:45 AM, Sage Weil  wrote:


On Sat, 7 Sep 2013, Oliver Daudey wrote:


Hey all,

This topic has been partly discussed here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000799.html

Tested on Ceph version 0.67.2.

If you create a fresh empty image of, say, 100GB in size on RBD and then
use "rbd cp" to make a copy of it, even though the image is sparse, the
command will attempt to read every part of it and take far more time
than expected.

After reading the above thread, I understand why the copy of an
essentially empty sparse image on RBD would take so long, but it doesn't
explain why the copy won't be sparse itself.  If I use "rbd cp" to copy
an image, the copy will take its full allocated size on disk, even if
the original was empty.  If I use the QEMU "qemu-img"-tool's
"convert"-option to convert the original image to the copy without
changing the format, essentially only making a copy, it takes its time
as well, but will be faster than "rbd cp" and the resulting copy will be
sparse.

Example-commands:
rbd create --size 102400 test1
rbd cp test1 test2
qemu-img convert -p -f rbd -O rbd rbd:rbd/test1 rbd:rbd/test3

Shouldn't "rbd cp" at least have an option to attempt to sparsify the
copy, or copy the sparse parts as sparse?  Same goes for "rbd clone",
BTW.



Yep, this is in fact a bug.  Opened http://tracker.ceph.com/issues/6257.

Thanks!
sage





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] live migration with rbd/cinder/nova - not supported?

2013-09-12 Thread Josh Durgin

On 09/12/2013 11:33 AM, Darren Birkett wrote:

Hi Maciej,

That's interesting.  The following also seems to suggest that nova has
those shared storage dependencies for live migration that I spoke about:

http://tracker.ceph.com/issues/5938


That's obsolete for Grizzly. True live migration works fine with an
rbd-backed instance. I haven't checked Havana yet, particularly with
libvirt_image_type using rbd. If anyone's interested in checking out
the new libvirt_image_type option, now is a good time to catch
any bugs before Havana is released.

Josh


Thanks
Darren


On 12 September 2013 17:01, Maciej Gałkiewicz mailto:mac...@shellycloud.com>> wrote:

On 12 September 2013 15:15, Darren Birkett mailto:darren.birk...@gmail.com>> wrote:
 > Hi Maciej,
 >
 > I'm using Grizzly, but the live migration doesn't appear to be
changed even
 > in trunk.  It seems to check if you are using shared storage by
writing a
 > test file on the destination host (in /var/lib/nova/instances)
and then
 > trying to read it on the source host, and will fail if this test
does not
 > succeed.  Given that we do not use a shared filesystem as such
when using
 > RBD backed instances, I don't understand how this can succeed:
 >
 >

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L3726-L3761

Apparently the code is not used in the case of rbd. I didn't notice any
issues caused by this method.

 > Seems like I might be missing something and am probably reading
the code
 > wrong, as it sounds like you have it working.  Are there any
particular
 > settings you had to add to nova to make it work?

Nova does not require any changes. Only cinder.

regards
--
Maciej Gałkiewicz
Shelly Cloud Sp. z o. o., Sysadmin
http://shellycloud.com/, mac...@shellycloud.com

KRS: 440358 REGON: 101504426


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance with 8K blocks.

2013-09-17 Thread Josh Durgin

Also enabling rbd writeback caching will allow requests to be merged,
which will help a lot for small sequential I/O.
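For reference, the cache is a client-side setting, e.g. in the ceph.conf
used by the qemu process (option names as of dumpling; the sizes shown
are the defaults):

    [client]
        rbd cache = true
        rbd cache size = 33554432       # 32 MB cache
        rbd cache max dirty = 25165824  # writeback starts above this many dirty bytes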

On 09/17/2013 02:03 PM, Gregory Farnum wrote:

Try it with oflag=dsync instead? I'm curious what kind of variation
these disks will provide.

Anyway, you're not going to get the same kind of performance with
RADOS on 8k sync IO that you will with a local FS. It needs to
traverse the network and go through work queues in the daemon; your
primary limiter here is probably the per-request latency that you're
seeing (average ~30 ms, looking at the rados bench results). The good
news is that means you should be able to scale out to a lot of
clients, and if you don't force those 8k sync IOs (which RBD won't,
unless the application asks for them by itself using directIO or
frequent fsync or whatever) your performance will go way up.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Sep 17, 2013 at 1:47 PM, Jason Villalta  wrote:


Here are the stats with direct io.

dd of=ddbenchfile if=/dev/zero bs=8K count=100 oflag=direct
819200 bytes (8.2 GB) copied, 68.4789 s, 120 MB/s

dd if=ddbenchfile of=/dev/null bs=8K
819200 bytes (8.2 GB) copied, 19.7318 s, 415 MB/s

These numbers are still overall much faster than when using RADOS bench.
The replica is set to 2.  The Journals are on the same disk but separate 
partitions.

I kept the block size the same 8K.




On Tue, Sep 17, 2013 at 11:37 AM, Campbell, Bill 
 wrote:


As Gregory mentioned, your 'dd' test looks to be reading from the cache (you 
are writing 8GB in, and then reading that 8GB out, so the reads are all cached 
reads) so the performance is going to seem good.  You can add the 
'oflag=direct' to your dd test to try and get a more accurate reading from that.

RADOS performance from what I've seen is largely going to hinge on replica size 
and journal location.  Are your journals on separate disks or on the same disk 
as the OSD?  What is the replica size of your pool?


From: "Jason Villalta" 
To: "Bill Campbell" 
Cc: "Gregory Farnum" , "ceph-users" 

Sent: Tuesday, September 17, 2013 11:31:43 AM

Subject: Re: [ceph-users] Ceph performance with 8K blocks.

Thanks for you feed back it is helpful.

I may have been wrong about the default windows block size.  What would be the 
best tests to compare native performance of the SSD disks at 4K blocks vs Ceph 
performance with 4K blocks?  It just seems their is a huge difference in the 
results.


On Tue, Sep 17, 2013 at 10:56 AM, Campbell, Bill 
 wrote:


Windows default (NTFS) is a 4k block.  Are you changing the allocation unit to 
8k as a default for your configuration?


From: "Gregory Farnum" 
To: "Jason Villalta" 
Cc: ceph-users@lists.ceph.com
Sent: Tuesday, September 17, 2013 10:40:09 AM
Subject: Re: [ceph-users] Ceph performance with 8K blocks.


Your 8k-block dd test is not nearly the same as your 8k-block rados bench or 
SQL tests. Both rados bench and SQL require the write to be committed to disk 
before moving on to the next one; dd is simply writing into the page cache. So 
you're not going to get 460 or even 273MB/s with sync 8k writes regardless of 
your settings.

However, I think you should be able to tune your OSDs into somewhat better 
numbers -- that rados bench is giving you ~300IOPs on every OSD (with a small 
pipeline!), and an SSD-based daemon should be going faster. What kind of 
logging are you running with and what configs have you set?

Hopefully you can get Mark or Sam or somebody who's done some performance 
tuning to offer some tips as well. :)
-Greg

On Tuesday, September 17, 2013, Jason Villalta wrote:


Hello all,
I am new to the list.

I have a single machine set up for testing Ceph.  It has dual 6-core 
processors (12 cores total) and 128GB of RAM.  I also have 3 Intel 520 240GB 
SSDs and an OSD setup on each disk with the OSD and Journal in separate 
partitions formatted with ext4.

My goal here is to prove just how fast Ceph can go and what kind of performance 
to expect when using it as a back-end storage for virtual machines mostly 
windows.  I would also like to try to understand how it will scale IO by 
removing one disk of the three and doing the benchmark tests.  But that is 
secondary.  So far here are my results.  I am aware this is all sequential, I 
just want to know how fast it can go.

DD IO test of SSD disks:  I am testing 8K blocks since that is the default 
block size of windows.
  dd of=ddbenchfile if=/dev/zero bs=8K count=100
819200 bytes (8.2 GB) copied, 17.7953 s, 460 MB/s

dd if=ddbenchfile of=/dev/null bs=8K
819200 bytes (8.2 GB) copied, 2.94287 s, 2.8 GB/s

RADOS bench test with 3 SSD disks and 4MB object size(Default):
rados --no-cleanup bench -p pbench 30 write
Total writes made:  2061
Write size: 4194304
Bandwidth (MB/sec): 273.004

Stddev Bandwidth:   67.5237
Max bandwidth (MB/sec): 352
Min ba

Re: [ceph-users] Scaling RBD module

2013-09-18 Thread Josh Durgin

On 09/17/2013 03:30 PM, Somnath Roy wrote:

Hi,
I am running Ceph on a 3 node cluster and each of my server nodes is running 10 
OSDs, one for each disk. I have one admin node and all the nodes are connected 
with 2 X 10G networks. One network is for the cluster and the other one is 
configured as the public network.

Here is the status of my cluster.

~/fio_test# ceph -s

   cluster b2e0b4db-6342-490e-9c28-0aadf0188023
health HEALTH_WARN clock skew detected on mon. , mon. 

monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
=xxx.xxx.xxx.xxx:6789/0, =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, 
quorum 0,1,2 ,,
osdmap e391: 30 osds: 30 up, 30 in
 pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
11145 GB / 11172 GB avail
mdsmap e1: 0/0/1 up


I started with rados bench command to benchmark the read performance of this 
Cluster on a large pool (~10K PGs) and found that each rados client has a 
limitation. Each client can only drive up to a certain mark. Each server  node 
cpu utilization shows it is  around 85-90% idle and the admin node (from where 
rados client is running) is around ~80-85% idle. I am trying with 4K object 
size.


Note that rados bench with 4k objects is different from rbd with
4k-sized I/Os - rados bench sends each request to a new object,
while rbd objects are 4M by default.


Now, I started running more clients on the admin node and the performance is 
scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
idle. With small object size I must say that the ceph per osd cpu utilization 
is not promising!

After this, I started testing the rados block interface with kernel rbd module 
from my admin node.
I have created 8 images mapped on the pool having around 10K PGs and I am not 
able to scale up the performance by running fio (either by creating a software 
raid or running on individual /dev/rbd* instances). For example, running 
multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  the 
performance I am getting is half of what I am getting if running one instance. 
Here is my fio job script.

[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G
numjobs=64

Let me know if I am following the proper procedure or not.

But, If my understanding is correct, kernel rbd module is acting as a client to 
the cluster and in one admin node I can run only one of such kernel instance.
If so, I am then limited to the client bottleneck that I stated earlier. The 
cpu utilization of the server side is around 85-90% idle, so, it is clear that 
client is not driving.

My question is, is there any way to hit the cluster with more clients from a 
single box while testing the rbd module?


You can run multiple librbd instances easily (for example with
multiple runs of the rbd bench-write command).

The kernel rbd driver uses the same rados client instance for multiple
block devices by default. There's an option (noshare) to use a new
rados client instance for a newly mapped device, but it's not exposed
by the rbd cli. You need to use the sysfs interface that 'rbd map' uses
instead.

Once you've used rbd map once on a machine, the kernel will already
have the auth key stored, and you can use:

echo '1.2.3.4:6789 name=admin,key=client.admin,noshare poolname 
imagename' > /sys/bus/rbd/add


Where 1.2.3.4:6789 is the address of a monitor, and you're connecting
as client.admin.

You can use 'rbd unmap' as usual.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scaling RBD module

2013-09-19 Thread Josh Durgin

On 09/19/2013 12:04 PM, Somnath Roy wrote:

Hi Josh,
Thanks for the information. I am trying to add the following but hitting some 
permission issue.

root@emsclient:/etc# echo :6789,:6789,:6789 
name=admin,key=client.admin,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted


If you check dmesg, it will probably show an error trying to
authenticate to the cluster.

Instead of key=client.admin, you can pass the base64 secret value as
shown in 'ceph auth list' with the secret=X option.
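For example (monitor address, pool, and image names are from your earlier
attempt; the secret shown is a placeholder for the base64 string from
'ceph auth list'):

    echo '<mon-ip>:6789 name=admin,secret=<base64-key>,noshare test_rbd ceph_block_test' > /sys/bus/rbd/add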

BTW, there's a ticket for adding the noshare option to rbd map so using
the sysfs interface like this is never necessary:

http://tracker.ceph.com/issues/6264

Josh


Here is the contents of rbd directory..

root@emsclient:/sys/bus/rbd# ll
total 0
drwxr-xr-x  4 root root0 Sep 19 11:59 ./
drwxr-xr-x 30 root root0 Sep 13 11:41 ../
--w---  1 root root 4096 Sep 19 11:59 add
drwxr-xr-x  2 root root0 Sep 19 12:03 devices/
drwxr-xr-x  2 root root0 Sep 19 12:03 drivers/
-rw-r--r--  1 root root 4096 Sep 19 12:03 drivers_autoprobe
--w---  1 root root 4096 Sep 19 12:03 drivers_probe
--w---  1 root root 4096 Sep 19 12:03 remove
--w---  1 root root 4096 Sep 19 11:59 uevent


I checked: even when logged in as root, I can't write anything under /sys.

Here is the Ubuntu version I am using..

root@emsclient:/etc# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 13.04
Release:13.04
Codename:   raring

Here is the mount information

root@emsclient:/etc# mount
/dev/mapper/emsclient--vg-root on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/cgroup type tmpfs (rw)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
/dev/sda1 on /boot type ext2 (rw)
/dev/mapper/emsclient--vg-home on /home type ext4 (rw)


Any idea what went wrong here ?

Thanks & Regards
Somnath

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, September 18, 2013 6:10 PM
To: Somnath Roy
Cc: Sage Weil; ceph-de...@vger.kernel.org; Anirban Ray; 
ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Scaling RBD module

On 09/17/2013 03:30 PM, Somnath Roy wrote:

Hi,
I am running Ceph on a 3 node cluster and each of my server nodes is running 10 
OSDs, one for each disk. I have one admin node and all the nodes are connected 
with 2 X 10G networks. One network is for the cluster and the other one is 
configured as the public network.

Here is the status of my cluster.

~/fio_test# ceph -s

cluster b2e0b4db-6342-490e-9c28-0aadf0188023
 health HEALTH_WARN clock skew detected on mon. , mon. 

 monmap e1: 3 mons at {=xxx.xxx.xxx.xxx:6789/0, 
=xxx.xxx.xxx.xxx:6789/0, =xxx.xxx.xxx.xxx:6789/0}, election epoch 64, 
quorum 0,1,2 ,,
 osdmap e391: 30 osds: 30 up, 30 in
  pgmap v5202: 30912 pgs: 30912 active+clean; 8494 MB data, 27912 MB used, 
11145 GB / 11172 GB avail
 mdsmap e1: 0/0/1 up


I started with rados bench command to benchmark the read performance of this 
Cluster on a large pool (~10K PGs) and found that each rados client has a 
limitation. Each client can only drive up to a certain mark. Each server  node 
cpu utilization shows it is  around 85-90% idle and the admin node (from where 
rados client is running) is around ~80-85% idle. I am trying with 4K object 
size.


Note that rados bench with 4k objects is different from rbd with 4k-sized I/Os 
- rados bench sends each request to a new object, while rbd objects are 4M by 
default.


Now, I started running more clients on the admin node and the performance is 
scaling till it hits the client cpu limit. Server still has the cpu of 30-35% 
idle. With small object size I must say that the ceph per osd cpu utilization 
is not promising!

After this, I started testing the rados block interface with kernel rbd module 
from my admin node.
I have created 8 images mapped on the pool having around 10K PGs and I am not 
able to scale up the performance by running fio (either by creating a software 
raid or running on individual /dev/rbd* instances). For example, running 
multiple fio instances (one in /dev/rbd1 and the other in /dev/rbd2)  the 
performance I am getting is half of what I am getting if running one instance. 
Here is my fio job script.

[random-reads]
ioengine=libaio
iodepth=32
filename=/dev/rbd1
rw=randread
bs=4k
direct=1
size=2G

Re: [ceph-users] Best practices for managing S3 objects store

2013-09-30 Thread Josh Durgin

On 09/29/2013 07:34 PM, Aniket Nanhe wrote:

Hi,
We have a Ceph cluster set up and are trying to evaluate Ceph for its S3
compatible object storage. I came across this best practices document for
Amazon S3, which goes over how naming keys in a particular way can improve
performance of object GET and PUT operations (
http://aws.amazon.com/articles/1904/).
I wonder if this also applies to the object store in Ceph. I am also
curious about the best strategy to organize objects in buckets i.e. whether
it's a good idea to distribute objects to predefined number of  buckets
(say for instance 256 or 1024 buckets) or it just doesn't matter how many
objects you put in a bucket (i.e. just put all objects in a single bucket).
We have objects of size ranging from 50KB to 10 MB.


You probably want to shard them over many buckets. See 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/000595.html
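A minimal sketch of one way to shard, mapping each key to one of a fixed
set of buckets (the bucket count and naming scheme are arbitrary choices):

    import hashlib

    NUM_BUCKETS = 256

    def bucket_for(key):
        h = int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)
        return 'data-%03d' % (h % NUM_BUCKETS)

    # objects then go to bucket_for('some/object/key') via your S3 client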


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] authentication trouble

2013-09-30 Thread Josh Durgin

On 09/26/2013 10:11 AM, Jogi Hofmüller wrote:

Dear all,

I am fairly new to ceph and just in the process of testing it using
several virtual machines.

Now I tried to create a block device on a client and fumbled with
settings for about an hour or two until the command line

   rbd --id dovecot create home --size=1024

finally succeeded.  The keyring is /etc/ceph/ceph.keyring and I thought
the name [client.dovecot] would be used by rbd.


That's right, the [client.dovecot] section is read, and settings there
take precendence over settings in the [client] or [global] sections,
which also apply.

/etc/ceph/ceph.keyring is one of the default keyring locations, so you
shouldn't need any special settings to use it though. It just needs to
be readable by the unix user you're running commands as.


I would appreciated any hint on how to configure the client.NAME in the
config to ease operation.


You can also use /etc/ceph/ceph.client.$id.keyring, where $id = dovecot
in this case, so that different clients don't need to share the
/etc/ceph/ceph.keyring file.
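For example, a per-client keyring file just contains the section and key
(the key below is a placeholder; copy the real one from 'ceph auth list'):

    [client.dovecot]
        key = AQxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==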

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Snap removal priority

2013-09-30 Thread Josh Durgin

On 09/27/2013 09:25 AM, Travis Rhoden wrote:

Hello everyone,

I'm running a Cuttlefish cluster that hosts a lot of RBDs.  I recently
removed a snapshot of a large one (rbd snap rm -- 12TB), and I noticed
that all of the clients had markedly decreased performance.  Looking
at iostat on the OSD nodes had most disks pegged at 100% util.

I know there are thread priorities that can be set for clients vs
recovery, but I'm not sure what deleting a snapshot falls under.  I
couldn't really find anything relevant.  Is there anything I can tweak
to lower the priority of such an operation?  I didn't need it to
complete fast, as "rbd snap rm" returns immediately and the actual
deletion is done asynchronously.  I'd be fine with it taking longer at
a lower priority, but as it stands now it brings my cluster to a crawl
and is causing issues with several VMs.


There are message priorities for client vs recovery operations, but
unfortunately there's no setting for snapshot deletion yet. It is
called snap trimming internally, but the thread timeout option is just
for making sure the osd stops operating if the fs or disk beneath it
fails by blocking for a very long time.


I see an "osd snap trim thread timeout" option in the docs -- Is the
operation occuring here what you would call snap trimming?  If so, any
chance of adding an option for "osd snap trim priority" just like
there is for osd client op and osd recovery op?


There's an open issue to fix this:

http://tracker.ceph.com/issues/5844


Hope what I am saying makes sense...


Yes, thanks for the report!

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Loss of connectivity when using client caching with libvirt

2013-10-02 Thread Josh Durgin

On 10/02/2013 10:45 AM, Oliver Daudey wrote:

Hey Robert,

On 02-10-13 14:44, Robert van Leeuwen wrote:

Hi,

I'm running a test setup with Ceph (dumpling) and Openstack (Grizzly) using libvirt to 
"patch" the ceph disk directly to the qemu instance.
I'm using SL6 with the patched qemu packages from the Ceph site (which the 
latest version is still cuttlefish):
http://www.ceph.com/packages/ceph-extras

When I turn on client caching strange things start to happen:
I run filebench to test the performance.
During the filebench the virtual machine will have intermittently really slow 
network connections:
I'm talking here about ping reply's taking 30 SECONDS so effectively losing the 
network.

This is what I set in the ceph client:
[client]
 rbd cache = true
 rbd cache writethrough until flush = true

Anyone else noticed this behaviour before or have some troubleshooting tips?


I noticed exactly the same thing when trying RBD-caching on libvirt for
some KVM-instances (in combination with writeback-caching in
libvirt/KVM, as recommended).  Even with moderate disk-access, it did
exactly what you described.  Had to disable the caching again because of
this.

Using KVM 1.1.2 with libvirt 0.9.12, patched to auto-enable RBD-caching
on RBD with "cache=writeback", which is how I used it.  AFAIK, this
patch made it into the later official versions.  Haven't really started
debugging this yet.


The behavior you both are seeing is fixed by making flush requests
asynchronous in the qemu driver. This was fixed upstream in qemu 1.4.2
and 1.5.0. If you've installed from ceph-extras, make sure you're using
the .async rpms [1] (we should probably remove the non-async ones at
this point).

The cuttlefish qemu rpms should work fine with dumpling. They're only
separate from the bobtail ones to be able to use newer functions in
librbd.

Josh

[1] 
http://www.ceph.com/packages/ceph-extras/rpm/centos6/x86_64/qemu-kvm-0.12.1.2-2.355.el6.2.cuttlefish.async.x86_64.rpm

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Loss of connectivity when using client caching with libvirt

2013-10-02 Thread Josh Durgin

On 10/02/2013 03:16 PM, Blair Bethwaite wrote:

Hi Josh,


Message: 3
Date: Wed, 02 Oct 2013 10:55:04 -0700
From: Josh Durgin 
To: Oliver Daudey , ceph-users@lists.ceph.com,
 robert.vanleeu...@spilgames.com
Subject: Re: [ceph-users] Loss of connectivity when using client
 caching with libvirt
Message-ID: <524c5df8.6000...@inktank.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

The behavior you both are seeing is fixed by making flush requests
asynchronous in the qemu driver. This was fixed upstream in qemu 1.4.2
and 1.5.0. If you've installed from ceph-extras, make sure you're using
the .async rpms [1] (we should probably remove the non-async ones at
this point).

The cuttlefish qemu rpms should work fine with dumpling. They're only
separate from the bobtail ones to be able to use newer functions in
librbd.


The OP piqued my interest with this as we are looking at caching options on
Ubuntu Precise (Ceph and Cloud) with Dumpling. Do the same caveats apply
for qemu-kvm on Precise? Presumably with just read caching there is no such
problem?


The version base of qemu in precise has the same problem. It only
affects writeback caching.

You can get qemu 1.5 (which fixes the issue) for precise from ubuntu's
cloud archive.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Loss of connectivity when using client caching with libvirt

2013-10-02 Thread Josh Durgin

On 10/02/2013 06:26 PM, Blair Bethwaite wrote:

Josh,

On 3 October 2013 10:36, Josh Durgin  wrote:

The version base of qemu in precise has the same problem. It only
affects writeback caching.

You can get qemu 1.5 (which fixes the issue) for precise from ubuntu's
cloud archive.


Thanks for the pointer! I had not realised there were newer than 1.0
qemu-kvm packages available anywhere for Precise. We'll definitely look
into that for other reasons too, especially better live-migration.

I know it's not specifically Ceph related, but are you aware of any
problems with these against Grizzly?


I'm not aware of any. libvirt maintains a stable interface so it
shouldn't be an issue to use newer versions of qemu and libvirt
with older versions of openstack. If you upgrade qemu, you may
need the newer libvirt in the cloud archive as well.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-kvm with rbd mem slow leak

2013-10-14 Thread Josh Durgin

On 10/13/2013 07:43 PM, alan.zhang wrote:

CPU: Intel(R) Xeon(R) CPU   E5620  @ 2.40GHz *2
MEM: 32GB
KVM: qemu-kvm-0.12.1.2-2.355.el6.2.cuttlefish.async.x86_64
Host: CentOS 6.4, kernel 2.6.32-358.14.1.el6.x86_64
Guest: CentOS 6.4, kernel 2.6.32-279.14.1.el6.x86_64
Ceph: ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
Opennebula: 4.2


top -M info:
top - 10:35:31 up 7 days,  9:19,  1 user,  load average: 0.85, 1.63, 1.40
Tasks: 454 total,   2 running, 452 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.5%us,  6.6%sy,  0.0%ni, 84.2%id,  0.6%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:  32865800k total, 32191072k used,   674728k free,59984k buffers
Swap: 10485752k total, 10134076k used,   351676k free,  3474176k cached

   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
20135 oneadmin  20   0 6381m 3.4g 9120 S  2.3 10.8 104:00.48 qemu-kvm
29171 oneadmin  20   0 6452m 3.2g 9072 S  2.0 10.2 168:02.06 qemu-kvm
  8857 oneadmin  20   0 6338m 2.9g 4504 S  2.3  9.3 289:14.48 qemu-kvm
12283 oneadmin  20   0 6591m 2.9g 4464 S  1.3  9.2 268:57.30 qemu-kvm
  6612 oneadmin  20   0 5050m 2.0g 4472 S 12.9  6.3 191:23.51 qemu-kvm
12006 oneadmin  20   0 5532m 1.9g 4468 S  4.3  6.1 236:43.50 qemu-kvm
  7216 oneadmin  20   0 3600m 1.9g 4680 S  1.3  6.1 159:40.53 qemu-kvm
10602 oneadmin  20   0 5333m 1.6g 4636 S  1.3  5.1 208:54.52 qemu-kvm
13162 oneadmin  20   0 3400m 989m 4528 S 50.3  3.1   4151:19 qemu-kvm
  5273 oneadmin  20   0 5168m 842m 4464 S  5.3  2.6 468:20.65 qemu-kvm
  6287 oneadmin  20   0 3150m 761m 4472 S 37.4  2.4 150:32.89 qemu-kvm
  6081 root  20   0 1732m 504m 5744 S  6.3  1.6 243:17.00 ceph-osd
11729 oneadmin  20   0 3541m 498m 4468 S  0.7  1.6  66:48.52 qemu-kvm
12503 oneadmin  20   0 3832m 428m 9336 S  0.3  1.3  19:58.78 qemu-kvm


such as 20135 process command line:
ps -ef | grep 20135
oneadmin 20135 1  2 Oct11 ?01:44:01 /usr/libexec/qemu-kvm
-name one-18 -S -M rhel6.4.0 -enable-kvm -m 2048 -smp
2,sockets=2,cores=1,threads=1 -uuid c40fe8a4-f4fa-9e02-cf2d-6eaaf5062440
-nodefconfig -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-18.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=rbd:one/one-0-18-0:auth_supported=none,if=none,id=drive-virtio-disk0,format=raw,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-drive
file=rbd:one/one-2:auth_supported=none,if=none,id=drive-virtio-disk1,format=raw,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-drive
file=/var/lib/one/datastores/0/18/disk.1,if=none,media=cdrom,id=drive-ide0-0-0,readonly=on,format=raw
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
-netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:c0:a8:0a:3b,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -vnc 0.0.0.0:18 -vga cirrus
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

I have only given it 2GB, but as you can see, VIRT/RES is 6381m/3.4g.


Does the resident memory continue increasing, or does it stay constant?

How does this compare with using only local files instead of rbd with
that qemu package?


I think it must be a memory leak.

Could anyone give me a hand?


If you do observe continued increasing memory usage with rbd, but not
with local files, gathering some heap snaphshots via massif would help
figure out what's leaking. (http://tracker.ceph.com/issues/6494 is a
good example of getting massif output).
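A sketch of gathering such a snapshot (massif and ms_print ship with
valgrind; substitute your VM's real qemu-kvm command line):

    valgrind --tool=massif qemu-kvm <your usual qemu arguments>
    ms_print massif.out.<pid>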

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is there a way to query RBD usage

2013-10-16 Thread Josh Durgin

On 10/15/2013 08:56 PM, Blair Bethwaite wrote:


 > Date: Wed, 16 Oct 2013 16:06:49 +1300
 > From: Mark Kirkwood mailto:mark.kirkw...@catalyst.net.nz>>
 > To: Wido den Hollander mailto:w...@42on.com>>,
ceph-users@lists.ceph.com 
 > Subject: Re: [ceph-users] Is there a way to query RBD usage
 > Message-ID: <525e02c9.9050...@catalyst.net.nz
>
 > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
 >
 > On 16/10/13 15:53, Wido den Hollander wrote:
 > > On 10/16/2013 03:15 AM, Blair Bethwaite wrote:
 > >> I.e., can we see what the actual allocated/touched size of an RBD
is in
 > >> relation to its provisioned size?
 > >>
 > >
 > > No, not an easy way. The only way would be to probe which RADOS
 > > objects exist, but that's a heavy operation you don't want to do with
 > > large images or with a large number of RBD images.
 > >
 >
 > So maybe a 'df' arg for rbd would be a nice addition to blueprints?

Yes, I think so. It does seem a little conflicting to promote Ceph as
doing thin-provisioned volumes, but then not actually be able to
interrogate their real usage against the provisioned size. As a cloud
admin using Ceph as my block-storage layer I really want to be able to
look at several metrics in relation to volumes and tenants:
total GB quota, GB provisioned (i.e., total size of volumes&snaps), GB
allocated
When users come crying for more quota I need to know whether they're making
efficient use of what they've got.

This actually leads into more of a conversation around the quota model
of dishing out storage. IMHO it would be much more preferable to do
things in a more EBS oriented fashion, where we're able to see actual
usage in the backend. Especially true with snapshots - users are
typically dismayed that their snapshots count towards their quota for
the full size of the originally provisioned volume (despite the fact the
snapshot could usually be truncated/shrunk by a factor of two or more).


You can see the space written in the image and between snapshots (not
including fs overhead on the osds) since cuttlefish:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/3684
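The approach from that post boils down to summing the extent lengths that
'rbd diff' reports (pool/image names are examples):

    rbd diff rbd/myimage | awk '{ sum += $2 } END { print sum/1024/1024 " MB" }'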

It'd be nice to wrap that in a df or similar command though.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mounting RBD in linux containers

2013-10-18 Thread Josh Durgin

On 10/18/2013 10:04 AM, Kevin Weiler wrote:

The kernel is 3.11.4-201.fc19.x86_64, and the image format is 1. I did,
however, try a map with an RBD that was format 2. I got the same error.


To rule out any capability drops as the culprit, can you map an rbd
image on the same host outside of a container?
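One quick way to check for dropped capabilities inside the container
(capsh comes with the libcap package):

    grep CapEff /proc/self/status
    capsh --decode=<hex mask from above>   # cap_sys_admin should appear if not dropped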

Josh


--

*Kevin Weiler*

IT

IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
60606 | http://imc-chicago.com/

Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail:
_kevin.wei...@imc-chicago.com _


From: Gregory Farnum mailto:g...@inktank.com>>
Date: Friday, October 18, 2013 10:26 AM
To: Omar Marquez mailto:omar.marq...@imc-chicago.com>>
Cc: Kyle Bader mailto:kyle.ba...@gmail.com>>,
Kevin Weiler mailto:kevin.wei...@imc-chicago.com>>, "ceph-users@lists.ceph.com
" mailto:ceph-users@lists.ceph.com>>, Khalid Goudeaux
mailto:khalid.goude...@imc-chicago.com>>
Subject: Re: [ceph-users] mounting RBD in linux containers

What kernel are you running, and which format is the RBD image? I
thought we had a special return code for when the kernel doesn't support
the features used by that image, but that could be the problem.
-Greg

On Thursday, October 17, 2013, Omar Marquez wrote:


Strace produces below:

…

futex(0xb5637c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xb56378,
{FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0xb562f8, FUTEX_WAKE_PRIVATE, 1)  = 1
add_key(0x424408, 0x7fff82c4e210, 0x7fff82c4e140, 0x22,
0xfffe) = 607085216
stat("/sys/bus/rbd", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
*open("/sys/bus/rbd/add", O_WRONLY)  = 3*
*write(3, "10.198.41.6:6789
,10.198.41.8:678
"..., 96) = -1 EINVAL (Invalid argument)*
close(3)= 0
rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7fbf8a7efa90},
{SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER,
0x7fbf8a7efa90}, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [PIPE], 8) = 0
clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD,
parent_tidptr=0x7fff82c4e040) = 22
wait4(22, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 22
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7fbf8a7efa90},
NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER,
0x7fbf8a7efa90}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [PIPE], NULL, 8) = 0
write(2, "rbd: add failed: ", 17rbd: add failed: )   = 17
write(2, "(22) Invalid argument", 21(22) Invalid argument)   = 21
write(2, "\n", 1
)   = 1
exit_group(1)   = ?
+++ exited with 1 +++


The app is run inside the container with setuid = 0 and the
container is able to mount all required filesystems … could this
still be a capability problem ? Also I do not see any call to
capset() in the strafe log …

--
Om


From: Kyle Bader 
Date: Thursday, October 17, 2013 5:08 PM
To: Kevin Weiler 
Cc: "ceph-users@lists.ceph.com" , Omar
Marquez , Khalid Goudeaux

Subject: Re: [ceph-users] mounting RBD in linux containers

My first guess would be that it's due to LXC dropping capabilities,
I'd investigate whether CAP_SYS_ADMIN is being dropped. You need
CAP_SYS_ADMIN for mount and block ioctls, if the container doesn't
have those privs a map will likely fail. Maybe try tracing the
command with strace?

On Thu, Oct 17, 2013 at 2:45 PM, Kevin Weiler
 wrote:

Hi all,

We're trying to mount an rbd image inside of a linux container
that has been created with docker (https://www.docker.io/). We
seem to have access to the rbd kernel module from inside the
container:

# lsmod | grep ceph
libceph   218854  1 rbd
libcrc32c  12603  3 xfs,libceph,dm_persistent_data

And we can query the pool for available rbds and create rbds
from inside the container:

# rbd -p dockers --id dockers --keyring
/etc/ceph/ceph.client.dockers.keyring create lxctest --size 51200
# rbd -p dockers --id dockers --keyring
/etc/ceph/ceph.client.dockers.keyring ls
lxctest

But for some reason, we can't seem to map the device to the
container:

# rbd -p dockers --id dockers --keyring
/etc/ceph/ceph.client.dockers.keyring map lxctest
rbd: add failed: (22) Invalid argument

I don't see anything particularly interesting in dmesg or
messages on either the container or the host box. Any ideas on
how to troubleshoot this?

Thanks!


--

*Kevin Weiler*

IT

IMC Financial Mar

Re: [ceph-users] poor read performance on rbd+LVM, LVM overload

2013-10-20 Thread Josh Durgin

On 10/20/2013 08:18 AM, Ugis wrote:

output follows:
#pvs -o pe_start /dev/rbd1p1
   1st PE
 4.00m
# cat /sys/block/rbd1/queue/minimum_io_size
4194304
# cat /sys/block/rbd1/queue/optimal_io_size
4194304


Well, the parameters are being set at least.  Mike, is it possible that
having minimum_io_size set to 4m is causing some read amplification
in LVM, translating a small read into a complete fetch of the PE (or
something along those lines)?

Ugis, if your cluster is on the small side, it might be interesting to see
what requests the client is generating in the LVM and non-LVM cases by
setting 'debug ms = 1' on the osds (e.g., ceph tell osd.* injectargs
'--debug-ms 1') and then looking at the osd_op messages that appear in
/var/log/ceph/ceph-osd*.log.  It may be obvious that the IO pattern is
different.


Sage, the debug output follows. I am no pro at reading this, but it
seems the read block sizes differ (or what is that number following the ~ sign)?


Yes, that's the I/O length. LVM is sending requests for 4k at a time,
while plain kernel rbd is sending 128k.




How should I proceed with tuning read performance on LVM? Is there some
change needed in the ceph/LVM code, or does my config need to be tuned?
If what is shown in the logs means a 4k read block in the LVM case, then it
seems I need to tell LVM (or does xfs on top of LVM dictate the read block
size?) that the io block should rather be 4m?


It's a client side issue of sending much smaller requests than it needs
to. Check the queue minimum and optimal sizes for the lvm device - it
sounds like they might be getting set to 4k for some reason.
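Something along these lines should show what the LVM device advertises
(the device-mapper names are examples; find yours with 'lsblk' or
'dmsetup ls'):

    blockdev --getiomin --getioopt /dev/mapper/vg-lv
    cat /sys/block/dm-0/queue/minimum_io_size
    cat /sys/block/dm-0/queue/optimal_io_size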

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot from volume with Dumpling on RDO/CentOS 6 (using backported QEMU 0.12)

2013-10-21 Thread Josh Durgin

On 10/21/2013 09:03 AM, Andrew Richards wrote:

Hi Everybody,

I'm attempting to get Ceph working for CentOS 6.4 running RDO Havana for
Cinder volume storage and boot-from-volume, and I keep bumping into a
very unhelpful errors on my nova-compute test node and my cinder
controller node.

Here is what I see on my cinder-volume controller (Node #1) when I try
to attach a RBD-backed Cinder volume to a Nova VM using either the GUI
or nova volume-attach (/var/log/cinder/volume.log):

2013-10-20 18:21:05.880 13668 ERROR cinder.openstack.common.rpc.amqp
[req-bd62cb07-42e7-414a-86dc-f26f7a569de6
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
Exception during message handling
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
Traceback (most recent call last):
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/rpc/amqp.py",
line 441, in _process_data
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
**args)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/rpc/dispatcher.py",
line 148, in dispatch
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return getattr(proxyobj, method)(ctxt, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/utils.py", line 808, in
wrapper
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return func(self, *args, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/manager.py", line
624, in initialize_connection
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
conn_info = self.driver.initialize_connection(volume, connector)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/rbd.py",
line 665, in initialize_connection
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
hosts, ports = self._get_mon_addrs()
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/rbd.py",
line 312, in _get_mon_addrs
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
out, _ = self._execute(*args)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/utils.py", line 142, in
execute
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return processutils.execute(*cmd, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py",
line 158, in execute
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
shell=shell)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/eventlet/green/subprocess.py",
line 25, in __init__
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
subprocess_orig.Popen.__init__(self, args, 0, *argss, **kwds)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
errread, errwrite)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
raise child_exception
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
OSError: [Errno 2] No such file or directory
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
2013-10-20 18:21:05.883 13668 ERROR cinder.openstack.common.rpc.common
[req-bd62cb07-42e7-414a-86dc-f26f7a569de6
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
Returning exception [Errno 2] No such file or directory to caller


Here is what I see on my nova-compute node (Node #2) when I try to boot
from volume (/var/log/nova/compute.log):

ERROR nova.compute.manager [req-ced59268-4766-4f57-9cdb-4ba451b0faaa
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
[instance: c80a053f-b84c-401c-8e29-022d4c6f56a0] Error: The server has
either erred or is incapable of performing the requested operation.
(HTTP 500) (Request-ID: req-44557bfa-6777-41a6-8183-e08dedf0611b)
2013-10-17 15:01:45.060 18546 TRACE nova.compute.manager [instance:
c80a053f-b84c-401c-8e29-022d4c6f56a0] Traceback (most recent call last):
2013-10-17 15:01:45.060 18546 TRACE nova.compute.manager [instance:
c80a053f-b84c-401c-8e29-022d4c6f56a0]   File
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1028,
in _build_instance
2013-10-17 15:01:45.060 18546 TRACE nova.compute.manager [instance:

Re: [ceph-users] Boot from volume with Dumpling on RDO/CentOS 6 (using backported QEMU 0.12)

2013-10-21 Thread Josh Durgin

On 10/21/2013 10:35 AM, Andrew Richards wrote:

Thanks for the response Josh!

If the Ceph CLI tool still needs to be there for Cinder in Havana, then
am I correct in assuming that I still also need to export
"CEPH_ARGS='--id volumes'" in my cinder init script for the sake of
cephx like I had to do in Grizzly?


No, that's no longer necessary.
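As an aside: the OSError: [Errno 2] in the traceback above is subprocess
failing to find the binary being executed (the Havana rbd driver's
_get_mon_addrs() shells out to the ceph CLI), so the first thing to check
on the cinder-volume node is that the CLI is installed; a sketch:

    which ceph || yum install ceph
    ceph mon dump    # sanity check that the CLI can reach the cluster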

Josh


Thanks,
Andy

On Oct 21, 2013, at 12:26 PM, Josh Durgin mailto:josh.dur...@inktank.com>> wrote:


On 10/21/2013 09:03 AM, Andrew Richards wrote:

Hi Everybody,

I'm attempting to get Ceph working for CentOS 6.4 running RDO Havana for
Cinder volume storage and boot-from-volume, and I keep bumping into
very unhelpful errors on my nova-compute test node and my cinder
controller node.

Here is what I see on my cinder-volume controller (Node #1) when I try
to attach a RBD-backed Cinder volume to a Nova VM using either the GUI
or nova volume-attach (/var/log/cinder/volume.log):

2013-10-20 18:21:05.880 13668 ERROR cinder.openstack.common.rpc.amqp
[req-bd62cb07-42e7-414a-86dc-f26f7a569de6
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
Exception during message handling
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
Traceback (most recent call last):
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/rpc/amqp.py",
line 441, in _process_data
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
**args)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/rpc/dispatcher.py",
line 148, in dispatch
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return getattr(proxyobj, method)(ctxt, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/utils.py", line 808, in
wrapper
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return func(self, *args, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/manager.py", line
624, in initialize_connection
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
conn_info = self.driver.initialize_connection(volume, connector)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/rbd.py",
line 665, in initialize_connection
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
hosts, ports = self._get_mon_addrs()
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/rbd.py",
line 312, in _get_mon_addrs
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
out, _ = self._execute(*args)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/cinder/utils.py", line 142, in
execute
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
return processutils.execute(*cmd, **kwargs)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File
"/usr/lib/python2.6/site-packages/cinder/openstack/common/processutils.py",
line 158, in execute
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
shell=shell)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib/python2.6/site-packages/eventlet/green/subprocess.py",
line 25, in __init__
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
subprocess_orig.Popen.__init__(self, args, 0, *argss, **kwds)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
errread, errwrite)
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
raise child_exception
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
OSError: [Errno 2] No such file or directory
2013-10-20 18:21:05.880 13668 TRACE cinder.openstack.common.rpc.amqp
2013-10-20 18:21:05.883 13668 ERROR cinder.openstack.common.rpc.common
[req-bd62cb07-42e7-414a-86dc-f26f7a569de6
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
Returning exception [Errno 2] No such file or directory to caller


Here is what I see on my nova-compute node (Node #2) when I try to boot
from volume (/var/log/nova/compute.log):

ERROR nova.compute.manager [req-ced59268-4766-4f57-9cdb-4ba451b0faaa
9bfee22cd15b4dc0a2e203d7c151edbc 8431635821f84285afdd0f5faf1ce1aa]
[instance: c80a

Re: [ceph-users] CloudStack + KVM(Ubuntu 12.04, Libvirt 1.0.2) + Ceph [Seeking Help]

2013-10-21 Thread Josh Durgin

On 10/16/2013 04:25 PM, Kelcey Jamison Damage wrote:

Hi,

I have gotten so close to having Ceph work in my cloud, but I have reached
a roadblock. Any help would be greatly appreciated.

I receive the following error when trying to get KVM to run a VM with an
RBD volume:

Libvirtd.log:

2013-10-16 22:05:15.516+: 9814: error :
qemuProcessReadLogOutput:1477 : internal error Process exited while
reading console log output:
char device redirected to /dev/pts/3
kvm: -drive
file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=

10.0.1.83\:6789,if=none,id=drive-ide0-0-1: error connecting
kvm: -drive
file=rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=

10.0.1.83\:6789,if=none,id=drive-ide0-0-1: could not open disk image
rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh
/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789:
Invalid argument


This looks correct, there could be a firewall or something else
preventing the connection from working. Could you share the output of:

qemu-img info -f raw 
'rbd:libvirt-pool/new-libvirt-image:id=libvirt:key=+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==:auth_supported=cephx\;none:mon_host=10.0.1.83\:6789:debug_ms=1'



Ceph Pool showing test volume exists:

root@ubuntu-test-KVM-RBD:/opt# rbd -p libvirt-pool ls
new-libvirt-image

Ceph Auth:

client.libvirt
key: AQBx+F5ScBQlLhAAYCH8qhGEh/gjKW+NpziAlA==
caps: [mon] allow r
caps: [osd] allow class-read object_prefix rbd_children, allow rwx
pool=libvirt-pool

KVM Drive Support:

root@ubuntu-test-KVM-RBD:/opt# kvm --drive format=?
Supported formats: vvfat vpc vmdk vdi sheepdog rbd raw host_cdrom
host_floppy host_device file qed qcow2 qcow parallels nbd dmg tftp ftps ft
p https http cow cloop bochs blkverify blkdebug


These settings all look fine too.

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent error

2013-10-30 Thread Josh Durgin

On 10/30/2013 01:54 AM, Mark Kirkwood wrote:

On 29/10/13 20:53, lixuehui wrote:

Hi list,
From the documentation, a radosgw-agent's correct output should look like this:

INFO:radosgw_agent.sync:Starting incremental sync
INFO:radosgw_agent.worker:17910 is processing shard number 0
INFO:radosgw_agent.worker:shard 0 has 0 entries after ''
INFO:radosgw_agent.worker:finished processing shard 0
INFO:radosgw_agent.worker:17910 is processing shard number 1
INFO:radosgw_agent.sync:1/64 shards processed
INFO:radosgw_agent.worker:shard 1 has 0 entries after ''
INFO:radosgw_agent.worker:finished processing shard 1
INFO:radosgw_agent.sync:2/64 shards processed

my radosgw-agent returns an error like:

   out = request(connection, 'get', '/admin/log',
dict(type=shard_type))
   File
"/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 76,
in request
 return result.json()
AttributeError: 'Response' object has no attribute 'json'
ERROR:root:error doing incremental sync, trying again later
Traceback (most recent call last):
   File
"/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 247, in
main
 args.max_entries)
   File
"/usr/lib/python2.7/dist-packages/radosgw_agent/sync.py", line 22, in
sync_incremental
 num_shards = client.num_log_shards(self.src_conn,
self._type)
   File
"/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 142,
in num_log_shards
 out = request(connection, 'get', '/admin/log',
dict(type=shard_type))
   File
"/usr/lib/python2.7/dist-packages/radosgw_agent/client.py", line 76,
in request
Has anyone encountered the same problem? Any help
is appreciated!



I received this error too - although I was attempting a 'full' sync at
the time. I surmised that maybe the response object == None at that
point? But otherwise I had no idea.


This particular error is coming from a too-old version of the
python-requests package. We weren't setting a lower bound for that
library version before, but are now. If you install with the bootstrap
script you should get a new enough version in a virtualenv, and you can
run ./radosgw-agent from your git checkout.
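To check which version you have, and to upgrade inside whatever
environment the agent runs from:

    python -c 'import requests; print(requests.__version__)'
    pip install --upgrade requests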


I was also confused about:
- was this even supposed to work with ceph 0.71?


No, there ended up being a bug and an admin api change, so if you want
to try it early you can use the next branch. You'll need to restart the
osds and radosgw if you're upgrading. It'll be backported to dumpling
as well, but the backport hasn't been finished yet.


- which radosgw-agent to use:
   * https://github.com/ceph/radosgw-agent


This one.


   * https://github.com/jdurgin/radosgw-agent

Given that the newly updated docs:
http://ceph.com/docs/wip-doc-radosgw/radosgw/federated-config/ suggest
ceph 0.72, I'm wondering if we just need to be more patient?


Note that the wip in the url means it's a work-in-progress branch,
so it's not totally ready yet either. If anything is confusing or
missing, let us know.


However - Inktank folks - there is a lot of interest in the feature, so
forgive us if we are jumping the gun, but also the current state of play
is murky and some clarification would not go amiss!


It's great people are interested in trying this early. It's very
helpful to find issues sooner (like the requests library version).

Thanks!
Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "rbd map" says "bat option at rw"

2013-11-01 Thread Josh Durgin

On 11/01/2013 03:07 AM, nicolasc wrote:

Hi every one,

I finally and happily managed to get my Ceph cluster (3 monitors among 8
nodes, each with 9 OSDs) running on version 0.71, but the "rbd map"
command shows a weird behaviour.

I can list pools, create images and snapshots, alleluia!
However, mapping to a device with "rbd map" is not working. When I try
this from one of my nodes, the kernel says:
 libceph: bad option at 'rw'
Which "rbd" translates into:
 add failed: (22) Invalid argument

Any idea of what that could indicate?


Thanks for the report! The rw option was added in linux 3.7. In ceph
0.71, rbd map passes the 'rw' and 'ro' options to tell the kernel
whether to map the image as read-only or read/write. This will be fixed
in 0.72 by:

https://github.com/ceph/ceph/pull/807

Josh


I am using a basic config: no authentication, default crushmap (I just
changed some weights), and basic network config (public net, cluster
net). I have tried both image formats, different sizes and pools.

Moreover, I have a client running rbd from Ceph version 0.61.9, and from
there everything works fine with "rbd map" on the same image. Both nodes
(Ceph 0.61.9 and 0.71) are running Linux kernel 3.2 for Debian.

Hope you can provide some hints. Best regards,

Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Havana & RBD - a few problems

2013-11-07 Thread Josh Durgin

On 11/08/2013 12:15 AM, Jens-Christian Fischer wrote:

Hi all

we have installed a Havana OpenStack cluster with RBD as the backing
storage for volumes, images and the ephemeral images. The code as
delivered in
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/imagebackend.py#L498
 fails,
because RBD.path is not set. I have patched it to read:


Using libvirt_image_type=rbd to replace ephemeral disks is new with
Havana, and unfortunately some bug fixes did not make it into the
release. I've backported the current fixes on top of the stable/havana
branch here:

https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd


@@ -419,10 +419,12 @@ class Rbd(Image):
         if path:
             try:
                 self.rbd_name = path.split('/')[1]
+                self.path = path
             except IndexError:
                 raise exception.InvalidDevicePath(path=path)
         else:
             self.rbd_name = '%s_%s' % (instance['name'], disk_name)
+            self.path = 'volumes/%s' % self.rbd_name
         self.snapshot_name = snapshot_name
         if not CONF.libvirt_images_rbd_pool:
             raise RuntimeError(_('You should specify'


but am not sure this is correct. I have the following problems:

1) can't inject data into image

2013-11-07 16:59:25.251 24891 INFO nova.virt.libvirt.driver
[req-f813ef24-de7d-4a05-ad6f-558e27292495
c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308]
[instance: 2fa02e4f-f804-4679-9507-736eeebd9b8d] Injecting key into
  image fc8179d4-14f3-4f21-a76d-72b03b5c1862
2013-11-07 16:59:25.269 24891 WARNING nova.virt.disk.api
[req-f813ef24-de7d-4a05-ad6f-558e27292495
c66a737acf0545fdb9a0a920df0794d9 2096e25f5e814882b5907bc5db342308]
Ignoring error injecting data into image (Error mounting
volumes/instance-0089_disk with libguestfs (volumes/instance-0089_disk:
No such file or directory))

possibly the self.path = … is wrong - but what are the correct values?


Like Dinu mentioned, I'd suggest disabling file injection and using
the metadata service + cloud-init instead. We should probably change
nova to log an error about this configuration when ephemeral volumes
are rbd.
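
For reference, file injection can be turned off in nova.conf; a minimal
sketch using the Havana option names (worth double-checking against your
release):

  [DEFAULT]
  libvirt_inject_password = false
  libvirt_inject_key = false
  # -2 disables disk injection entirely
  libvirt_inject_partition = -2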


2) Creating a new instance from an ISO image fails completely - no
bootable disk found, says the KVM console. Related?


This sounds like a bug in the ephemeral rbd code - could you file
it in launchpad if you can reproduce with file injection disabled?
I suspect it's not being attached as a cdrom.


3) When creating a new instance from an image (non ISO images work), the
disk is not resized to the size specified in the flavor (but left at the
size of the original image)


This one is fixed in the backports already.


I would be really grateful if those people who have Grizzly/Havana
running with an RBD backend could pipe in here…


You're seeing some issues in the ephemeral rbd code, which is new
in Havana. None of these affect non-ephemeral rbd, or Grizzly.
Thanks for reporting them!

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block Storage QoS

2013-11-07 Thread Josh Durgin

On 11/08/2013 03:50 AM, Wido den Hollander wrote:

On 11/07/2013 08:42 PM, Gruher, Joseph R wrote:

Is there any plan to implement some kind of QoS in Ceph?  Say I want to
provide service level assurance to my OpenStack VMs and I might have to
throttle bandwidth to some to provide adequate bandwidth to others - is
anything like that planned for Ceph?  Generally with regard to block
storage (rbds), not object or filesystem.

Or is there already a better way to do this elsewhere in the OpenStack
cloud?



I don't know if OpenStack supports it, but in CloudStack we recently
implemented the I/O throttling mechanism of Qemu via libvirt.

That might be a solution if OpenStack implements that as well?


Indeed, that was implemented in OpenStack Havana. I think the docs
haven't been updated yet, but one of the related blueprints is:

https://blueprints.launchpad.net/cinder/+spec/pass-ratelimit-info-to-nova
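
For example, a front-end (qemu-enforced) limit can be defined and tied to
a volume type roughly like this (a sketch; names and values are
placeholders, and the ids come from the qos-create/type-create output):

  cinder qos-create limited-io consumer="front-end" read_iops_sec=1000 write_iops_sec=500
  cinder type-create limited
  cinder qos-associate <qos-spec-id> <volume-type-id>
  cinder create --volume-type limited 10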


Thanks,

Joe



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent failed to sync object

2013-11-07 Thread Josh Durgin

On 11/07/2013 09:48 AM, lixuehui wrote:

Hi all:
After we built a region with two zones distributed across two ceph
clusters, we started the agent and it works!
But what we find in the radosgw-agent stdout is that it fails to sync
objects all the time. Pasting the info:
  (env)root@ceph-rgw41:~/myproject# ./radosgw-agent -c cluster-data-sync.conf -q

region map is: {u'us': [u'us-west', u'us-east']}
ERROR:radosgw_agent.worker:failed to sync object 
new-east-bucket/new-east.json: state is error
ERROR:radosgw_agent.worker:failed to sync object 
new-east-bucket/new-east.json: state is error
ERROR:radosgw_agent.worker:failed to sync object 
new-east-bucket/new-east.json: state is error
ERROR:radosgw_agent.worker:failed to sync object 
new-east-bucket/new-east.json: state is error
ERROR:radosgw_agent.worker:failed to sync object 
new-east-bucket/new-east.json: state is error

Metadata has already been copied from the master zone. I'd like to
know the reason, and what 'state is error' means!


This means the destination radosgw failed to fetch the object from
the source radosgw. Does the system user from the secondary zone exist
in the master zone?

If you enable 'debug rgw=30' for both radosgw and share the logs we
can see why the sync is failing.
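
For example, in each gateway's ceph.conf (the section name should match
your rgw instance; a sketch):

  [client.radosgw.us-east-1]
  debug rgw = 30
  debug ms = 1

then restart the gateways and reproduce the failing sync.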

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block Storage QoS

2013-11-07 Thread Josh Durgin

On 11/08/2013 03:13 PM, ja...@peacon.co.uk wrote:


On 2013-11-08 03:20, Haomai Wang wrote:

On Fri, Nov 8, 2013 at 9:31 AM, Josh Durgin 
wrote:

I'll just list the commands below to help users understand:

cinder qos-create high_read_low_write consumer="front-end"
read_iops_sec=1000 write_iops_sec=10



Does this have any normalisation of the IO units, for example to 8K or
something?  In VMware we've had similar controls for ages, but they're not
useful, as a Windows server will throw out 4MB IOs and skew all the
metrics.


I don't think it does any normalization, but you could have different
limits for different volume types, and use one volume type for windows
and one volume type for non-windows. This might not make sense for all
deployments, but it may be a usable workaround for that issue.
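
A sketch of that workaround, with hypothetical type and spec names (the
ids come from the output of the create commands):

  cinder qos-create windows-qos consumer="front-end" write_iops_sec=250
  cinder qos-create default-qos consumer="front-end" write_iops_sec=1000
  cinder type-create windows
  cinder type-create general
  cinder qos-associate <windows-qos-id> <windows-type-id>
  cinder qos-associate <default-qos-id> <general-type-id>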

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help v.72 configure federated gateway failed

2013-11-19 Thread Josh Durgin

Sorry for the delay, I'm still catching up since the openstack
conference.

Does the system user for the destination zone exist with the same
access secret and key in the source zone?

If you enable debug rgw = 30 on the destination you can see why the
copy_obj from the source zone is failing.

Josh

On 11/11/2013 12:52 AM, maoqi1982 wrote:

Hi list
The ceph version is the latest v0.72 emperor. Following the
http://ceph.com/docs/master/radosgw/federated-config/ doc, I deployed two
ceph clusters (one ceph cluster per data site) to form a region (a master
zone, a slave zone). The metadata seems to be synced OK, but objects
fail to sync.

The error is as follows:
INFO:radosgw_agent.worker:6053 is processing shard number 47
INFO:radosgw_agent.worker:finished processing shard 47
INFO:radosgw_agent.sync:48/128 items processed
INFO:radosgw_agent.worker:6053 is processing shard number 48
INFO:radosgw_agent.worker:bucket instance "east-bucket:us-east.4139.1"
has 5 entries after "002.2.3"
INFO:radosgw_agent.worker:syncing bucket "east-bucket"
ERROR:radosgw_agent.worker:failed to sync object east-bucket/驽??docx:
ERROR:radosgw_agent.worker:failed to sync object east-bucket/sss.py:
state is error
ERROR:radosgw_agent.worker:failed to sync object east-bucket/Nfg.docx:
state is error
INFO:radosgw_agent.worker:finished processing shard 48
INFO:radosgw_agent.worker:6053 is processing shard number 49
INFO:radosgw_agent.sync:49/128 items processed
INFO:radosgw_agent.sync:50/128 items processed
INFO:radosgw_agent.worker:finished processing shard 49
INFO:radosgw_agent.worker:6053 is processing shard number 50
INFO:radosgw_agent.worker:finished processing shard 50
INFO:radosgw_agent.sync:51/128 items processed
INFO:radosgw_agent.worker:6053 is processing shard number 51
INFO:radosgw_agent.worker:finished processing shard 51
INFO:radosgw_agent.sync:52/128 items processed
INFO:radosgw_agent.worker:6053 is processing shard number 52
INFO:radosgw_agent.sync:53/128 items processed
INFO:radosgw_agent.worker:finished processing shard 52

thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-agent AccessDenied 403

2013-11-19 Thread Josh Durgin

On 11/13/2013 09:06 PM, lixuehui wrote:

And on the slave zone gateway instance, the info is like this:

2013-11-14 12:54:24.516840 7f51e7fef700  1 ====== starting new request req=0xb1e3b0 =====
2013-11-14 12:54:24.526640 7f51e7fef700  1 ====== req done req=0xb1e3b0 http_status=200 ======
2013-11-14 12:54:24.545440 7f51e4fe9700  1 ====== starting new request req=0xb1c690 =====
2013-11-14 12:54:24.551696 7f51e4fe9700  0 WARNING: couldn't find acl header for bucket, generating default
2013-11-14 12:54:24.566005 7f51e4fe9700  0 > HTTP_DATE -> Thu Nov 14 04:54:24 2013
2013-11-14 12:54:24.566046 7f51e4fe9700  0 > HTTP_X_AMZ_COPY_SOURCE -> sss%2Frgwconf
2013-11-14 12:54:24.607998 7f51e4fe9700  1 ====== req done req=0xb1c690 http_status=403 ======
2013-11-14 12:54:24.626466 7f51e27e4700  1 ====== starting new request req=0xb24260 =====

Could anyone help find the problem? Does it mean we should set an acl
for the bucket? In fact, the information stays the same as before,
even after setting an acl for the bucket.
bucket-name sss
object-name rgwconf
Or is there something wrong with either the "HTTP_DATE" or
"HTTP_X_AMZ_COPY_SOURCE"?


Those headers are fine, and it's unrelated to acls since the gateway is
using a system user for cross-zone copies, which has full access.

Does the system user for the destination zone exist with the same
access secret and key in the source zone?
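
Roughly, the same system user has to be created in both zones with
identical credentials; a sketch (uid, keys, and names are placeholders):

  radosgw-admin user create --uid=us-sync --display-name="sync user" \
    --access-key=<key> --secret=<secret> --system

and 'radosgw-admin user info --uid=us-sync' run against each cluster
should show the same access key and secret.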

Josh



lixuehui
From: lixuehui 
Sent: 2013-11-13 16:16
To: ceph-users 
Subject: radosgw-agent AccessDenied 403
Hi list,
We reported before that radosgw-agent failed to sync data all the
time. We paste the relevant log here to seek help now.

  application/json; charset=UTF-8
Wed, 13 Nov 2013 07:24:45 GMT
x-amz-copy-source:sss%2Frgwconf
/sss/rgwconf
2013-11-13T15:24:45.510 11171:DEBUG:boto:Signature:
AWS CQHH7O4XULLINBNQQSPB:9ktSGas0/iuekklDmHRuU+OItek=
2013-11-13T15:24:45.511 11171:DEBUG:boto:url = 
'http://ceph-rgw41.com/sss/rgwconf'
params={'rgwx-op-id': 'ceph-rgw41:11160:2', 
'rgwx-source-zone': u'us-east', 'rgwx-client-id': 'radosgw-agent'}
headers={'Content-Length': '0', 'User-Agent': 
'Boto/2.16.0 Python/2.7.3 Linux/3.5.0-23-generic', 'x-amz-copy-source': 
'sss%2Frgwconf', 'Date': 'Wed, 13 Nov 2013 07:24:45 GMT', 'Content-Type': 
'application/json; charset=UTF-8', 'Authorization': 'AWS 
CQHH7O4XULLINBNQQSPB:9ktSGas0/iuekklDmHRuU+OItek='}
data=None
2013-11-13T15:24:45.519 
11171:INFO:requests.packages.urllib3.connectionpool:Starting new HTTP 
connection (1): ceph-rgw41.com
2013-11-13T15:24:45.580 
11171:DEBUG:requests.packages.urllib3.connectionpool:"PUT 
/sss/rgwconf?rgwx-op-id=ceph-rgw41%3A11160%3A2&rgwx-source-zone=us-east&rgwx-client-id=radosgw-agent
 HTTP/1.1" 403 78
2013-11-13T15:24:45.584 11171:DEBUG:radosgw_agent.worker:exception during sync: Http error code 403 
content AccessDenied
2013-11-13T15:24:45.587 11171:DEBUG:boto:StringToSign:
GET
Wed, 13 Nov 2013 07:24:45 GMT
/admin/opstate
2013-11-13T15:24:45.589 11171:DEBUG:boto:Signature:
AWS CQHH7O4XULLINBNQQSPB:JZwaFKhZEsQUj50jLxjNzni8n5Q=
2013-11-13T15:24:45.590 11171:DEBUG:boto:url = 
'http://ceph-rgw41.com/admin/opstate'
params={'client-id': 'radosgw-agent', 'object': 
'sss/rgwconf', 'op-id': 'ceph-rgw41:11160:2'}
headers={'Date': 'Wed, 13 Nov 2013 07:24:45 GMT', 
'Content-Length': '0', 'Authorization': 'AWS 
CQHH7O4XULLINBNQQSPB:JZwaFKhZEsQUj50jLxjNzni8n5Q=', 'User-Agent': 'Boto/2.16.0 
Python/2.7.3 Linux/3.5.0-23-generic'}
data=None
2013-11-13T15:24:45.594 
11171:INFO:requests.packages.urllib3.connectionpool:Starting new HTTP 
connection (1): ceph-rgw41.com
2013-11-13T15:24:45.607 
11171:DEBUG:requests.packages.urllib3.connectionpool:"GET 
/admin/opstate?client-id=radosgw-agent&object=sss%2Frgwconf&op-id=ceph-rgw41%3A11160%3A2 
HTTP/1.1" 200 None
2013-11-13T15:24:45.612 
11171:DEBUG:radosgw_agent.worker:op state is [{u'timestamp': u'2013-11-13 
07:24:45.561401Z', u'op_id': u'ceph-rgw41:11160:2', u'object': u'sss/rgwconf', 
u'state': u'error', u'client_id': u'radosgw-agent'}]
2013-11-13T15:24:45.614 11171:ERROR:radosgw_agent.worker:failed to sync object sss/rgwconf: state is error

Re: [ceph-users] Ephemeral RBD with Havana and Dumpling

2013-11-19 Thread Josh Durgin

On 11/14/2013 09:54 AM, Dmitry Borodaenko wrote:

On Thu, Nov 14, 2013 at 6:00 AM, Haomai Wang  wrote:

We are using the nova fork by Josh Durgin
https://github.com/jdurgin/nova/commits/havana-ephemeral-rbd - are there
more patches that need to be integrated?

I hope I can release or push commits to this branch contains live-migration,
incorrect filesystem size fix and ceph-snapshort support in a few days.


Can't wait to see this patch! Are you getting rid of the shared
storage requirement for live-migration?


Yes, that's what Haomai's patch will fix for rbd-based ephemeral
volumes (bug https://bugs.launchpad.net/nova/+bug/1250751).

Note that volume-backed instances work with live migration just fine
without a shared fs for ephemeral disks since Grizzly.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Librados Error Codes

2013-11-19 Thread Josh Durgin

On 11/19/2013 05:28 AM, Behar Veliqi wrote:

Hi,

when using the librados c library, the documentation of the different functions
just says that they return a negative error code on failure,
e.g. the rados_read function
(http://ceph.com/docs/master/rados/api/librados/#rados_read).

Is there any further documentation on which error code is returned under
which condition, and how to know _why_ an operation has failed?


For some functions there is, but for most of them there are many
common errors that aren't listed, and some errors depend on
the OSD backend being used.

The error codes are all negative POSIX errno values, so many of them
should be self-explanatory (i.e. -ENOENT when an object doesn't exist,
-EPERM when you don't have access to a pool, -EROFS if you try to write
to a snapshot, etc.). It would be good to document these though.

If you're looking into librados more, the C header has some more detail
in @defgroup blocks that aren't parsed into the web docs:

https://github.com/ceph/ceph/blob/master/src/include/rados/librados.h#L279
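
To make the errno mapping concrete, here's a minimal sketch using the
python-rados bindings, which translate those negative return values into
exceptions (the pool name is a placeholder):

  import errno
  import rados

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  ioctx = cluster.open_ioctx('rbd')
  try:
      ioctx.read('no-such-object')
  except rados.ObjectNotFound:
      # the C API would have returned -ENOENT here
      print('read failed: object does not exist (ENOENT=%d)' % errno.ENOENT)
  finally:
      ioctx.close()
      cluster.shutdown()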

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Size of RBD images

2013-11-20 Thread Josh Durgin

On 11/20/2013 06:53 AM, nicolasc wrote:

Thank you Bernhard and Wogri. My old kernel version also explains the
format issue. Once again, sorry to have mixed that into the problem.

Back to my original inquiries, I hope someone can help me understand why:
* it is possible to create an RBD image larger than the total capacity
of the cluster


There's simply no checking of the size of the cluster by librbd.
rbd does not know whether you're about to add a bunch of capacity to the 
cluster, or whether you want your storage overcommitted (and by how much).


Higher level tools like openstack cinder can provide that kind of logic, 
but 'rbd create' is more of a low level tool at this time.



* a large empty image takes longer to shrink/delete than a small one


rbd doesn't keep an index of which objects exist (since doing so would
hurt write performance). The downside is, as you saw, that when shrinking
or deleting an image it must look for all objects above the shrink size
(deleting is like shrinking to 0 of course).

In dumpling or later rbd can do this in parallel controlled by 
--rbd-concurrent-management-ops, which defaults to 10.


If you've never written to the image, you can just delete the rbd_header
and rbd_id objects for it (or just the $imagename.rbd object for format 1
images), then 'rbd rm' will be fast since it'll just remove its entry from
the rbd_directory object.
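
A sketch of that shortcut (only safe if the image was never written to;
pool and image names are placeholders):

  rbd info pool/img                  # shows the format and block_name_prefix
  rados -p pool rm img.rbd           # format 1 header object
  # or, for a format 2 image whose block_name_prefix is rbd_data.<id>:
  rados -p pool rm rbd_id.img
  rados -p pool rm rbd_header.<id>
  rbd rm pool/img                    # now only removes the directory entry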

Josh


Best regards,

Nicolas Canceill
Scalable Storage Systems
SURFsara (Amsterdam, NL)



On 11/20/2013 01:56 PM, Bernhard Glomm wrote:

That might be; the manpage of ceph version 0.72.1 tells me it isn't,
though. Anyhow, still running kernel 3.8.xx.

Bernhard

Am 19.11.2013 20:10:04, schrieb Wolfgang Hennerbichler:


On Nov 19, 2013, at 3:47 PM, Bernhard Glomm
 wrote:

Hi Nicolas
just fyi
rbd format 2 is not supported yet by the linux kernel (module)


I believe this is wrong. I think linux has supported rbd format 2
images since 3.10.

wogri




--

Ecologic Institute | Bernhard Glomm
IT Administration

Phone:  +49 (30) 86880 134
Fax:    +49 (30) 86880 100
Skype:  bernhard.glomm.ecologic

Ecologic Institut gemeinnützige GmbH | Pfalzburger Str. 43/44 | 10717
Berlin | Germany
GF: R. Andreas Kraemer | AG: Charlottenburg HRB 57947 | USt/VAT-IdNr.:
DE811963464
Ecologic™ is a Trade Mark (TM) of Ecologic Institut gemeinnützige GmbH






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tracker.ceph.com - public email address visibility?

2013-11-27 Thread Josh Durgin

On 11/27/2013 07:21 AM, James Pearce wrote:

I was going to add something to the bug tracker, but it looks to me that
contributor email addresses all have public (unauthenticated)
visibility?  Can this be set in user preferences?


Yes, it can be hidden here: http://tracker.ceph.com/my/account
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real size of rbd image

2013-11-27 Thread Josh Durgin

On 11/26/2013 02:22 PM, Stephen Taylor wrote:

From ceph-users archive 08/27/2013:

On 08/27/2013 01:39 PM, Timofey Koolin wrote:

Is there a way to know the real size of rbd images and rbd snapshots?
rbd ls -l writes the declared size of an image, but I want to know the real size.

You can sum the sizes of the extents reported by:

  rbd diff pool/image[@snap] [--format json]

That's the difference since the beginning of time, so it reports all
used extents.

Josh

I don’t seem to be able to find any documentation supporting the [@snap]
parameter for this call, but it seems to work, at least in part. I have
a requirement to find the size of a snapshot relative to another
snapshot. Here is what I’ve used:

 rbd diff pool/image@snap2 --from-snap snap1


Most rbd commands work on snapshots too. The help text could certainly
be improved - suggestions welcome!


The returned list of extents seems to include all changes since snap1,
not just those up to snap2, but those that have been written after snap2
are labeled “zero” rather than as “data” extents. If I ignore the “zero”
extents and sum the lengths of the “data” extents, it seems to give me
an accurate relative snapshot size. Is this expected behavior and the
correct way to calculate the size I’m looking for?


Do you have discard/trim enabled for whatever's using the image?
The diff will include discarded extents as "zero". For calculating
size, it's fine to ignore them. It would be unexpected if these
aren't listed when you leave out the @snap2 portion though.
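
If it helps, the same calculation can also be scripted against librbd; a
minimal sketch with the python-rbd bindings (assuming a version that
exposes diff_iterate; pool, image, and snapshot names are placeholders):

  import rados
  import rbd

  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  ioctx = cluster.open_ioctx('pool')
  # open the image at the ending snapshot, then diff from the starting one
  image = rbd.Image(ioctx, 'image', snapshot='snap2')

  total = [0]
  def count_data(offset, length, exists):
      # exists=True corresponds to 'data' extents; zero extents are False
      if exists:
          total[0] += length

  image.diff_iterate(0, image.size(), 'snap1', count_data)
  print('%d bytes of data extents between snap1 and snap2' % total[0])

  image.close()
  ioctx.close()
  cluster.shutdown()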

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not get rbd cache perf counter

2013-11-27 Thread Josh Durgin

On 11/27/2013 01:31 AM, Shu, Xinxin wrote:

Recently I wanted to test the performance benefit of rbd cache. I cannot
get an obvious performance benefit on my setup, so I tried to make sure
rbd cache is enabled, but I cannot get the rbd cache perf counters. In
order to identify how to enable the rbd cache perf counters, I set up a
simple configuration (one client hosting vms, one ceph cluster with two
OSDs, each osd with an SSD partition for its journal), then built ceph-0.67.4.

My ceph.conf shows as bellows:

[global]

 debug default = 0

 log file = /var/log/ceph/$name.log

 max open files = 131072

 auth cluster required = none

 auth service required = none

 auth client required = none

 rbd cache = true

[mon.a]

 host = {monitor_host_name}

mon addr = {monitor_addr}

[osd.0]

 host = {osd.0_hostname}

 public addr = {public_addr}

 cluster addr = {cluster_addr}

 osd mkfs type = xfs

 devs = /dev/sdb1

 osd journal = /dev/sdd5

[osd.1]

 host = {osd.1_hostname}

 public addr = {public_addr}

 cluster addr = {cluster_addr}

 osd mkfs type = xfs

 devs = /dev/sdc1

 osd journal = /dev/sdd6

after the ceph cluster is built, I create an rbd image with rbd create --size
10240 --new-format test

then use virsh to start a vm, below is my vm xml file



[libvirt domain XML stripped by the archive; the recoverable settings:
domain name "test", memory 524288 KiB, 1 vcpu, hvm, emulator
/usr/bin/qemu-system-x86_64, on_poweroff=destroy, on_reboot=restart,
on_crash=destroy, with an rbd-backed disk]

Then I add a rbd admin socket in ceph.conf on my client, below is the config

[global]

 auth cluster required = none

 auth service required = none

 auth client required = none

 rbd cache = true

 rbd cache writethrough until flush = true

[client]

 admin socket=/var/run/ceph/rbd-$pid.asok

[mon.a]

 host = {monitor_host_name}

 mon addr = {monitor_host_addr}

then I checked the rbd cache perf counters through this socket, but the
output did not include any rbd cache statistics

ceph --admin-daemon /var/run/ceph/rbd-3526.asok perf dump output

{ "objecter": { "op_active": 0,

   "op_laggy": 0,

   "op_send": 0,

   "op_send_bytes": 0,

   "op_resend": 0,

   "op_ack": 0,

   "op_commit": 0,

   "op": 0,

   "op_r": 0,

   "op_w": 0,

   "op_rmw": 0,

   "op_pg": 0,

   "osdop_stat": 0,

   "osdop_create": 0,

   "osdop_read": 0,

   "osdop_write": 0,

   "osdop_writefull": 0,

   "osdop_append": 0,

   "osdop_zero": 0,

   "osdop_truncate": 0,

   "osdop_delete": 0,

   "osdop_mapext": 0,

   "osdop_sparse_read": 0,

   "osdop_clonerange": 0,

   "osdop_getxattr": 0,

   "osdop_setxattr": 0,

   "osdop_cmpxattr": 0,

   "osdop_rmxattr": 0,

   "osdop_resetxattrs": 0,

   "osdop_tmap_up": 0,

   "osdop_tmap_put": 0,

   "osdop_tmap_get": 0,

   "osdop_call": 0,

   "osdop_watch": 0,

   "osdop_notify": 0,

   "osdop_src_cmpxattr": 0,

   "osdop_pgls": 0,

   "osdop_pgls_filter": 0,

   "osdop_other": 0,

   "linger_active": 0,

   "linger_send": 0,

   "linger_resend": 0,

   "poolop_active": 0,

   "poolop_send": 0,

   "poolop_resend": 0,

   "poolstat_active": 0,

   "poolstat_send": 0,

   "poolstat_resend": 0,

   "statfs_active": 0,

   "statfs_send": 0,

   "statfs_resend": 0,

   "command_active": 0,

   "command_send": 0,

   "command_resend": 0,

   "map_epoch": 0,

   "map_full": 0,

   "map_inc": 0,

   "osd_sessions": 0,

   "osd_session_open": 0,

   "osd_session_close": 0,

   "osd_laggy": 0},

   "throttle-msgr_dispatch_throttler-radosclient": { "val": 0,

   "max": 104857600,

   "get": 11,

   "get_sum": 5655,

   "get_or_fail_fail": 0,

   "get_or_fail_success": 0,

   "take": 0,

   "take_sum": 0,

   "put": 11,

   "put_sum": 5655,

   "wait": { "avgcount": 0,

   "sum": 0.0}},

   "throttle-objecter_bytes": { "val": 0,

   "max": 104857600,

   "get": 0,

   "get_sum": 0,

   "get_or_fail_fail": 0,

   "get_or_fail_success": 0,

   "take": 0,

   "take_sum": 0,

   "put": 0,

   "put_sum": 0,

   "wait": { "avgcount": 0,

   "sum": 0.0}},

   "throttle-objecter_ops": { "val": 0,

   "max": 1024,

   "get": 0,

   "get_sum": 0,

   "get_or_fail_fail": 0,

   "get_or_fail_success": 0,

   "take": 0,

   "take_sum": 0,

   "put": 0,

   "put_sum": 0,

   "wait": { "avgcount": 0,

   "sum": 0.0}}}

Qemu version:  qemu-system-x86_64 --version

QEMU emulator version 1.2.0 (qemu-kvm-1.2.0+noroms-0ubuntu2.12.10.5,
Debian), Copyright (c) 2003-2008 Fabric

Re: [ceph-users] [Big Problem?] Why not using Device'UUID in ceph.conf

2013-11-27 Thread Josh Durgin

On 11/26/2013 01:14 AM, Ta Ba Tuan wrote:

Hi James,

The problem is: why does Ceph not recommend using the device's UUID in
ceph.conf, when the above error can occur?


I think with the newer-style configuration, where your disks have
partition ids set up by ceph-disk instead of entries in ceph.conf, it
doesn't matter if they change names, as long as the mount point stays
the same.

Josh


--
TuanTaBa


On 11/26/2013 04:04 PM, James Harper wrote:

Hi all

I have 3 OSDs, named sdb, sdc, sdd.
Suppose the OSD on device /dev/sdc dies => my server has only sdb and sdc
at the moment, because device /dev/sdd has taken the name /dev/sdc.

Can you just use one of the /dev/disk/by-*/ symlinks?

Eg
/dev/disk/by-uuid/153cf32b-e46b-4d31-95ef-749db3a88d02
/dev/disk/by-id/scsi-SATA_WDC_WD10EACS-00D_WD-WCAU66606660

Your distribution should allow for such things automatically, and if
not you should be able to add some udev rules to do it.

James


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] can not get rbd cache perf counter

2013-11-27 Thread Josh Durgin

[re-adding the list]

It's not related to the version of qemu. When qemu starts up, it
creates the admin socket file, but it needs write access to do that.

Does the user running qemu (libvirt-qemu on ubuntu) have write access
to /var/run/ceph? It may be unix permissions blocking it, or apparmor
or selinux if those are enabled.
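
A quick way to check, assuming Ubuntu's libvirt-qemu user (user, group,
and paths may differ on other distros):

  ls -ld /var/run/ceph                   # writable by the qemu user?
  sudo chown libvirt-qemu /var/run/ceph  # or loosen permissions for a test
  sudo aa-status | grep libvirt          # is apparmor confining qemu?

After fixing the permissions, restart the vm and the rbd-<pid>.asok file
should appear.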

On 11/27/2013 07:20 PM, Shu, Xinxin wrote:

Hi Josh,
   Thanks for your reply. The pid in the filename did not match the kvm
process. Since I added the option in ceph.conf for the rbd admin socket,
why doesn't qemu create this admin socket? Is this because qemu is not
installed correctly, or does the rbd admin socket depend on a specific
qemu package?

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Thursday, November 28, 2013 11:01 AM
To: Shu, Xinxin; ceph-us...@ceph.com
Subject: Re: [ceph-users] can not get rbd cache perf counter

On 11/27/2013 01:31 AM, Shu, Xinxin wrote:

Recently,  I want to test performance benefit of rbd cache, i cannot
get obvious performance benefit at my setup, then I  try to make sure
rbd cache is enabled, but I cannot get rbd cache perf counter. In
order to identify how to enable rbd cache perf counter, I setup a
simple setup(one client hosted vms, one ceph cluster with two OSDs,
each osd has a SSD partition for journal.), then build ceph-0.67.4.

My ceph.conf shows as bellows:

[global]

  debug default = 0

  log file = /var/log/ceph/$name.log

  max open files = 131072

  auth cluster required = none

  auth service required = none

  auth client required = none

  rbd cache = true

[mon.a]

  host = {monitor_host_name}

mon addr = {monitor_addr}

[osd.0]

  host = {osd.0_hostname}

  public addr = {public_addr}

  cluster addr = {cluster_addr}

  osd mkfs type = xfs

  devs = /dev/sdb1

  osd journal = /dev/sdd5

[osd.1]

  host = {osd.1_hostname}

  public addr = {public_addr}

  cluster addr = {cluster_addr}

  osd mkfs type = xfs

  devs = /dev/sdc1

  osd journal = /dev/sdd6

after the ceph cluster is built, I create an rbd image with rbd create
--size 10240 --new-format test

then use virsh to start a vm, below is my vm xml file



[libvirt domain XML stripped by the archive; the recoverable settings:
domain name "test", memory 524288 KiB, 1 vcpu, hvm, emulator
/usr/bin/qemu-system-x86_64, on_poweroff=destroy, on_reboot=restart,
on_crash=destroy, with an rbd-backed disk]

Then I add a rbd admin socket in ceph.conf on my client, below is the
config

[global]

  auth cluster required = none

  auth service required = none

  auth client required = none

  rbd cache = true

  rbd cache writethrough until flush = true

[client]

  admin socket=/var/run/ceph/rbd-$pid.asok

[mon.a]

  host = {monitor_host_name}

  mon addr = {monitor_host_addr}

then I checked rbd cache perf counter by this socket, but the output
did not get any rbd cache statistics

ceph --admin-daemon /var/run/ceph/rbd-3526.asok perf dump output

{ "objecter": { "op_active": 0,

"op_laggy": 0,

"op_send": 0,

"op_send_bytes": 0,

"op_resend": 0,

"op_ack": 0,

"op_commit": 0,

"op": 0,

"op_r": 0,

"op_w": 0,

"op_rmw": 0,

"op_pg": 0,

"osdop_stat": 0,

"osdop_create": 0,

"osdop_read": 0,

"osdop_write": 0,

"osdop_writefull": 0,

"osdop_append": 0,

"osdop_zero": 0,

"osdop_truncate": 0,

"osdop_delete": 0,

"osdop_mapext": 0,

"osdop_sparse_read": 0,

"osdop_clonerange": 0,

"osdop_getxattr": 0,

"osdop_setxattr": 0,

"osdop_cmpxattr": 0,

"osdop_rmxattr": 0,

"osdop_resetxattrs": 0,

"osdop_tmap_up": 0,

"osdop_tmap_put": 0,

"osdop_tmap_get": 0,

"osdop_call": 0,

"osdop_watch": 0,

"osdop_notify": 0,

"osdop_src_cmpxattr": 0,

"osdop_pgls": 0,

"osdop_pgls_filter": 0,

"osdop_other": 0,

"linger_active": 0,

"linger_send": 0,

"linger_resend": 0,

"poolop_active": 0,

"poolop_send": 0,

"poolop_resend": 0,

"poolstat_active": 0,

"poolstat_send": 0,

"poolstat_resend": 0

Re: [ceph-users] Real size of rbd image

2013-12-02 Thread Josh Durgin

On 12/02/2013 08:10 AM, Stephen Taylor wrote:

I had not enabled discard/trim for the filesystem using the image, but I have
done so this morning. It doesn't appear to make a difference.

The extents I'm seeing reported as "zero" are not actually discarded extents. They are extents that 
contain data that was added after the ending snapshot I'm giving to the diff command. If I specify a later 
ending snapshot or none, then the same extents are reported as "data" rather than as 
"zero" extents. Ignoring those seems to give me the correct size when I'm calculating a diff size 
from one snapshot to another, but I wanted to make sure that this situation is expected before I rely on it, 
although continuing to ignore these extents in the future if they simply go away seems easy enough.


Ah, I see where these are coming from now. It's confusing but
technically correct output for objects that do not exist in the
given snapshots, but do in the current version of the image.
We should probably omit these extents from the output. They don't
affect correctness of size calculations or applying diffs, so it
should be backwards compatible to do so. Issue created [1].

Like you said, ignoring any extents marked 'zero' is always fine
for this size calculation.

Josh

[1] http://tracker.ceph.com/issues/6926


Again, I appreciate your help.

Steve

-Original Message-
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Wednesday, November 27, 2013 7:51 PM
To: Stephen Taylor; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Real size of rbd image

On 11/26/2013 02:22 PM, Stephen Taylor wrote:

  From ceph-users archive 08/27/2013:

On 08/27/2013 01:39 PM, Timofey Koolin wrote:


Is way to know real size of rbd image and rbd snapshots?

rbd ls -l write declared size of image, but I want to know real size.


You can sum the sizes of the extents reported by:

   rbd diff pool/image[@snap] [--format json]

That's the difference since the beginning of time, so it reports all

used extents.

Josh

I don't seem to be able to find any documentation supporting the
[@snap] parameter for this call, but it seems to work, at least in
part. I have a requirement to find the size of a snapshot relative to
another snapshot. Here is what I've used:

  rbd diff pool/image@snap2 --from-snap snap1


Most rbd commands work on snapshots too. The help text could certainly be 
improved - suggestions welcome!


The returned list of extents seems to include all changes since snap1,
not just those up to snap2, but those that have been written after
snap2 are labeled "zero" rather than as "data" extents. If I ignore the "zero"
extents and sum the lengths of the "data" extents, it seems to give me
an accurate relative snapshot size. Is this expected behavior and the
correct way to calculate the size I'm looking for?


Do you have discard/trim enabled for whatever's using the image?
The diff will include discarded extents as "zero". For calculating size, it's 
fine to ignore them. It would be unexpected if these aren't listed when you leave out the 
@snap2 portion though.

Josh



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Granularity/efficiency of copy-on-write?

2013-12-03 Thread Josh Durgin

On 12/02/2013 03:26 PM, Bill Eldridge wrote:

Hi all,

We're looking at using Ceph's copy-on-write for a ton of users'
replicated cloud image environments,
and are wondering how efficient Ceph is for adding user data to base
images -
is data added in normal 4kB or 64kB sizes, or can you specify block size
for volumes
(so you can have video partitions with large content and email/chat/web
cache partitions with small files)


Copy-on-write is currently implemented at object granularity. Object 
size is determined when you create an rbd image, and defaults to 4MB.



Is Ceph's behavior & configuration for copy-on-write documented well
somewhere?


For a detailed description of the current copy-on-write implementation 
see [1]. You may also be interested in rbd's striping configuration, 
which uses the same parameters as cephfs [2].
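
For instance, the object size is chosen at image creation time via
--order, and striping via the stripe options (format 2 only); a sketch,
assuming this era's rbd CLI where --size is in MB and --stripe-unit in
bytes:

  rbd create --size 10240 --image-format 2 --order 23 pool/bigobj    # 8MB objects
  rbd create --size 10240 --image-format 2 --order 22 \
      --stripe-unit 65536 --stripe-count 8 pool/striped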


Josh

[1] http://ceph.com/docs/master/dev/rbd-layering/
[2] http://ceph.com/docs/master/dev/file-striping/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD import slow

2014-09-25 Thread Josh Durgin

On 09/24/2014 04:57 PM, Brian Rak wrote:

I've been doing some testing of importing virtual machine images, and
I've found that 'rbd import' is at least 2x as slow as 'qemu-img
convert'.  Is there anything I can do to speed this process up?  I'd
like to use rbd import because it gives me a little additional flexibility.

My test setup was a 40960MB LVM volume, and I used the following two
commands:

rbd import /dev/lvmtest/testvol test
qemu-img convert /dev/lvmtest/testvol rbd:test/test

rbd import took 13 minutes, qemu-img took 5.

I'm at a loss to explain this, I would have expected rbd import to be
faster.

This is with ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)


rbd import was doing one synchronous I/O after another. Recently import
and export were parallelized according to
--rbd-concurrent-management-ops (default 10), which helps quite a bit.
This will be in giant.
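
On giant, the import could then be pushed harder, e.g. (a sketch; config
options like this can be passed on the rbd command line):

  rbd import --rbd-concurrent-management-ops 20 /dev/lvmtest/testvol test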

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados crash in nova-compute

2014-10-24 Thread Josh Durgin

On 10/24/2014 08:21 AM, Xu (Simon) Chen wrote:

Hey folks,

I am trying to enable OpenStack to use RBD as image backend:
https://bugs.launchpad.net/nova/+bug/1226351

For some reason, nova-compute segfaults due to librados crash:

./log/SubsystemMap.h: In function 'bool
ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread
7f1b477fe700 time 2014-10-24 03:20:17.382769
./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6)
1: (()+0x42785) [0x7f1b4c4db785]
2: (ObjectCacher::flusher_entry()+0xfda) [0x7f1b4c53759a]
3: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f1b4c54a16d]
4: (()+0x6b50) [0x7f1b6ea93b50]
5: (clone()+0x6d) [0x7f1b6df3e0ed]
NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

I feel that there is some concurrency issue, since this sometimes happens
before and sometimes after this line:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/rbd_utils.py#L208

Any idea what are the potential causes of the crash?

Thanks.
-Simon


This is http://tracker.ceph.com/issues/8912, fixed in the latest
firefly and dumpling releases.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Double-mounting of RBD

2014-12-17 Thread Josh Durgin

On 12/17/2014 03:49 PM, Gregory Farnum wrote:

On Wed, Dec 17, 2014 at 2:31 PM, McNamara, Bradley
 wrote:

I have a somewhat interesting scenario.  I have an RBD of 17TB formatted
using XFS.  I would like it accessible from two different hosts, one
mapped/mounted read-only, and one mapped/mounted as read-write.  Both are
shared using Samba 4.x.  One Samba server gives read-only access to the
world for the data.  The other gives read-write access to a very limited set
of users who occasionally need to add data.


However, when testing this, when changes are made to the read-write Samba
server the changes don’t seem to be seen by the read-only Samba server.  Is
there some file system caching going on that will eventually be flushed?



Am I living dangerously doing what I have set up?  I thought I would avoid
most/all potential file system corruption by making sure there is only one
read-write access method.  Thanks for any answers.


Well, you'll avoid corruption by only having one writer, but the other
reader is still caching data in-memory that will prevent it from
seeing the writes on the disk.
Plus I have no idea if mounting xfs read-only actually prevents it
from making any writes to the disk; I think some FSes will do stuff
like defragment internal data structures in that mode, maybe?
-Greg


FSes mounted read-only still do tend to do things like journal replay,
but since the block device is mapped read-only that won't be a problem
in this case.
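
For reference, the read-only side of such a setup could look roughly like
this (a sketch; pool, image, and mount point are placeholders):

  rbd map --read-only rbd/share
  mount -o ro,norecovery /dev/rbd0 /mnt   # norecovery: don't replay the XFS log

Note the reader will still serve stale cached data until it drops caches
or remounts.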
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Block device and Trim/Discard

2014-12-18 Thread Josh Durgin

On 12/18/2014 10:49 AM, Travis Rhoden wrote:

One question re: discard support for kRBD -- does it matter which format
the RBD is?  Are Format 1 and Format 2 both okay, or is it just Format 2?


It shouldn't matter which format you use.
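
For anyone wanting to verify, a minimal sketch (assuming a kernel whose
krbd supports discard and an image mapped at /dev/rbd0):

  mkfs.xfs /dev/rbd0
  mount -o discard /dev/rbd0 /mnt   # online discard
  fstrim -v /mnt                    # or batched discard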

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:45 PM, Robert LeBlanc wrote:

Seems like a message bus would be nice. Each opener of an RBD could
subscribe for messages on the bus for that RBD. Anytime the map is
modified a message could be put on the bus to update the others. That
opens up a whole other can of worms though.


Rados' watch/notify functions are used as a limited form of this. That's
how rbd can notice that e.g. snapshots are created or disks are resized
online. With the object map code the idea is to funnel all management
operations like that through a single client that's locked the image
for write access (all handled automatically by librbd).

Using watch/notify to coordinate multi-client access would get complex
and inefficient pretty fast, and in general is best left to cephfs
rather than rbd.

Josh


On Jan 6, 2015 5:35 PM, "Josh Durgin" mailto:josh.dur...@inktank.com>> wrote:

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of
code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it know which OSDs to contact (thus saving a
round trip
to the OSD), or only for the OSD to know which objects exist on it's
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh

    On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin
    <josh.dur...@inktank.com> wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have
an object then
it is a noop and should be pretty quick. The number of
outstanding
operations can be limited to 100 or a 1000 which would
provide a
balance between speed and performance impact if there is
data to be
trimmed. I'm not a big fan of a "--skip-trimming" option
as there is
the potential to leave some orphan objects that may not
be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This
trimming
actually is parallelized (10 ops at once by default,
changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a
map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing
images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700


On Tue, Jan 6, 2015 at 8:09 AM, Jake Young <jak3...@gmail.com> wrote:




On Monday, January 5, 2015, Chen, Xiaoxi <xiaoxi.c...@intel.com> wrote:



When you shrinking the RBD, most of the time was
spent on
librbd/internal.cc::trim_image(), in this
function, client will iterator
all
unnecessary objects(no matter whether it exists)
and delete them.



So in this case,  when Edwin shrinking his RBD
from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) *
1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will
definitely take a long
time
since rbd client need to send a delete request
to OSD, OSD need to find
out
the object context and delete(or doesn’t exist
at all). The time needed
to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy
your VM's file system to
the new image, then delete the old one will NOT help in general, just
because deleting the old volume will take exactly the same time as
shrinking; they both need to call trim_image().

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 04:19 PM, Robert LeBlanc wrote:

The bitmap certainly sounds like it would help shortcut a lot of code
that Xiaoxi mentions. Is the idea that the client caches the bitmap
for the RBD so it know which OSDs to contact (thus saving a round trip
to the OSD), or only for the OSD to know which objects exist on it's
disk?


It's purely at the rbd level, so librbd caches it and maintains its
consistency. The idea is that since it's kept consistent, librbd can do
things like delete exactly the objects that exist without any
extra communication with the osds. Many things that were
O(size of image) become O(written objects in image).

The only restriction is that keeping the object map consistent requires
a single writer, so this does not work for the rare case of e.g. ocfs2
on top of rbd, where there are multiple clients writing to the same
rbd image at once.

Josh


On Tue, Jan 6, 2015 at 4:19 PM, Josh Durgin  wrote:

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:


Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.



Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700



On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:




On Monday, January 5, 2015, Chen, Xiaoxi  wrote:



When you shrinking the RBD, most of the time was spent on
librbd/internal.cc::trim_image(), in this function, client will iterator
all
unnecessary objects(no matter whether it exists) and delete them.



So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will definitely take a long
time
since rbd client need to send a delete request to OSD, OSD need to find
out
the object context and delete(or doesn’t exist at all). The time needed
to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy your VM's file system to
the new image, then delete the old one will  NOT help in general, just
because delete the old volume will take exactly the same time as
shrinking ,
they both need to call trim_image().



The solution in my mind may be we can provide a “—skip-triming” flag to
skip the trimming. When the administrator absolutely sure there is no
written have taken place in the shrinking area(that means there is no
object
created in these area), they can use this flag to skip the time
consuming
trimming.



How do you think?




That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use rbd info <image> to see the block_name_prefix; the
object name looks like <block_name_prefix>.<object index>, so for
example rb.0.ff53.3d1b58ba.e6ad should be the 0xe6ad'th object of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

   $ rbd info huge
  rbd image 'huge':
   size 1024 TB in 268435456 objects
   order 22 (4096 kB objects)
   block_name_prefix: rb.0.8a14.2ae8944a
   format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:




On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com> wrote:

  Hi,

  If it's the only thing in your pool, you could try deleting the
  pool instead.

  I found that to be

Re: [ceph-users] rbd resize (shrink) taking forever and a day

2015-01-07 Thread Josh Durgin

On 01/06/2015 10:24 AM, Robert LeBlanc wrote:

Can't this be done in parallel? If the OSD doesn't have an object then
it is a noop and should be pretty quick. The number of outstanding
operations can be limited to 100 or a 1000 which would provide a
balance between speed and performance impact if there is data to be
trimmed. I'm not a big fan of a "--skip-trimming" option as there is
the potential to leave some orphan objects that may not be cleaned up
correctly.


Yeah, a --skip-trimming option seems a bit dangerous. This trimming
actually is parallelized (10 ops at once by default, changeable via
--rbd-concurrent-management-ops) since dumpling.
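
So Edwin's shrink could be pushed harder by raising that, e.g. (a sketch;
the option can be passed on the command line like any config option):

  rbd resize --allow-shrink --size 665600 \
      --rbd-concurrent-management-ops 20 client-disk-img0/vol-x318644f-0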

What will really help without being dangerous is keeping a map of
object existence [1]. This will avoid any unnecessary trimming
automatically, and it should be possible to add to existing images.
It should be in hammer.

Josh

[1] https://github.com/ceph/ceph/pull/2700


On Tue, Jan 6, 2015 at 8:09 AM, Jake Young  wrote:



On Monday, January 5, 2015, Chen, Xiaoxi  wrote:


When you shrinking the RBD, most of the time was spent on
librbd/internal.cc::trim_image(), in this function, client will iterator all
unnecessary objects(no matter whether it exists) and delete them.



So in this case,  when Edwin shrinking his RBD from 650PB to 650GB,
there are[ (650PB * 1024GB/PB -650GB) * 1024MB/GB ] / 4MB/Object =
170,227,200 Objects need to be deleted.That will definitely take a long time
since rbd client need to send a delete request to OSD, OSD need to find out
the object context and delete(or doesn’t exist at all). The time needed to
trim an image is ratio to the size needed to trim.



make another image of the correct size and copy your VM's file system to
the new image, then delete the old one will  NOT help in general, just
because delete the old volume will take exactly the same time as shrinking ,
they both need to call trim_image().



The solution in my mind may be we can provide a “—skip-triming” flag to
skip the trimming. When the administrator absolutely sure there is no
written have taken place in the shrinking area(that means there is no object
created in these area), they can use this flag to skip the time consuming
trimming.



How do you think?



That sounds like a good solution. Like doing "undo grow image"





From: Jake Young [mailto:jak3...@gmail.com]
Sent: Monday, January 5, 2015 9:45 PM
To: Chen, Xiaoxi
Cc: Edwin Peer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day





On Sunday, January 4, 2015, Chen, Xiaoxi  wrote:

You could use rbd info <image> to see the block_name_prefix; the
object name looks like <block_name_prefix>.<object index>, so for
example rb.0.ff53.3d1b58ba.e6ad should be the 0xe6ad'th object of
the volume with block_name_prefix rb.0.ff53.3d1b58ba.

  $ rbd info huge
 rbd image 'huge':
  size 1024 TB in 268435456 objects
  order 22 (4096 kB objects)
  block_name_prefix: rb.0.8a14.2ae8944a
  format: 1

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Edwin Peer
Sent: Monday, January 5, 2015 3:55 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd resize (shrink) taking forever and a day

Also, which rbd objects are of interest?


ganymede ~ # rados -p client-disk-img0 ls | wc -l
1672636


And, all of them have cryptic names like:

rb.0.ff53.3d1b58ba.e6ad
rb.0.6d386.1d545c4d.00011461
rb.0.50703.3804823e.1c28
rb.0.1073e.3d1b58ba.b715
rb.0.1d76.2ae8944a.022d

which seem to bear no resemblance to the actual image names that the rbd
command line tools understands?

Regards,
Edwin Peer

On 01/04/2015 08:48 PM, Jake Young wrote:



On Sunday, January 4, 2015, Dyweni - Ceph-Users
<6exbab4fy...@dyweni.com> wrote:

 Hi,

 If it's the only thing in your pool, you could try deleting the
 pool instead.

 I found that to be faster in my testing; I had created 500TB when
 I meant to create 500GB.

 Note for the Devs: It would be nice if rbd create/resize would
 accept sizes with units (i.e. MB, GB, TB, PB, etc).




 On 2015-01-04 08:45, Edwin Peer wrote:

 Hi there,

 I did something stupid while growing an rbd image. I
accidentally
 mistook the units of the resize command for bytes instead of
 megabytes
 and grew an rbd image to 650PB instead of 650GB. This all
happened
 instantaneously enough, but trying to rectify the mistake is
 not going
 nearly as well.

 
 ganymede ~ # rbd resize --size 665600 --allow-shrink
 client-disk-img0/vol-x318644f-0
 Resizing image: 1% complete...
 

 It took a couple days before it started showing 1% complete
 and has
 been stuck on 1% for a couple more. At this rate, I should be
 able to
 shrink the image back to the intended size in about 2016.

   

Re: [ceph-users] Ephemeral RBD with Havana and Dumpling

2013-12-06 Thread Josh Durgin

On 12/05/2013 02:37 PM, Dmitry Borodaenko wrote:

Josh,

On Tue, Nov 19, 2013 at 4:24 PM, Josh Durgin  wrote:

I hope I can release or push commits to this branch contains live-migration,
incorrect filesystem size fix and ceph-snapshort support in a few days.


Can't wait to see this patch! Are you getting rid of the shared
storage requirement for live-migration?


Yes, that's what Haomai's patch will fix for rbd-based ephemeral
volumes (bug https://bugs.launchpad.net/nova/+bug/1250751).


We've got a version of a Nova patch that makes live migrations work
for non volume-backed instances, and hopefully addresses the concerns
raised in code review in https://review.openstack.org/56527, along
with a bunch of small bugfixes, e.g. missing max_size parameter in
direct_fetch, and a fix for http://tracker.ceph.com/issues/6693. I
have submitted it as a pull request to your nova fork on GitHub:

https://github.com/jdurgin/nova/pull/1


Thanks!


Our changes depend on the rest of commits on your havana-ephemeral-rbd
branch, and the whole patchset is now at 7 commits, which is going to
be rather tedious to submit to the OpenStack Gerrit as a series of
dependent changes. Do you think we should keep the current commit
history in its current form, or would it be easier to squash it down
to a more manageable number of patches?


As discussed on irc yesterday, most of these are submitted to icehouse
already in slightly different form, since this branch is based on
stable/havana.

I'd prefer to keep the commits small and self contained in this branch
at least. If it takes too long to get them upstream, I'm fine with
having them squashed for faster upstream review.

Josh

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

