Re: [ceph-users] RHEL 7.1 ceph-disk failures creating OSD

2015-06-27 Thread Loic Dachary
Hi Bruce,

I think the problem comes from using /dev/disk/by-id/wwn-0x53959bd02f56 
instead of /dev/sdw for the data disk, because ceph-disk's device name 
parsing logic only works with /dev/XXX names. Could you run the ceph-disk prepare 
command again with --verbose to confirm? If that's the case, it should fail with an 
explicit error instead of, as it appears now, only doing part of the work.
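
Something along these lines (a rough sketch, assuming the data disk really is 
/dev/sdw; the cluster UUID and journal partition are taken from your command):

  # wipe the data disk and make sure the kernel re-reads the partition table
  ceph-disk zap /dev/sdw
  partprobe /dev/sdw

  # prepare again, naming the data disk as /dev/XXX, with verbose output
  ceph-disk --verbose prepare --cluster ceph \
      --cluster-uuid b2c2e866-ab61-4f80-b116-20fa2ea2ca94 \
      --fs-type xfs /dev/sdw /dev/disk/by-id/wwn-0x500080d91010024b-part1

  # then activate the data partition that prepare created
  ceph-disk activate /dev/sdw1

If the by-id path is the culprit, the --verbose output should show where the 
device name parsing goes wrong.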

Cheers

On 26/06/2015 18:56, Bruce McFarland wrote:
> Loic,
> Thank you very much for the partprobe workaround. I rebuilt the cluster using 
> 0.94.2. 
> 
> I've created partitions on the journal SSDs with parted and then used 
> ceph-disk prepare as below. I'm not seeing all of the disks with the tmp 
> mounts when I check 'mount', and I also don't see any of the OSD mount 
> points under /var/lib/ceph/osd. I see the following output from prepare. 
> When I attempt to 'activate', it errors out saying the devices don't exist.
> 
> ceph-disk prepare --cluster ceph --cluster-uuid 
> b2c2e866-ab61-4f80-b116-20fa2ea2ca94 --fs-type xfs --zap-disk 
> /dev/disk/by-id/wwn-0x53959bd02f56 
> /dev/disk/by-id/wwn-0x500080d91010024b-part1
> Caution: invalid backup GPT header, but valid main header; regenerating
> backup header from main header.
> 
> 
> Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
> verification and recovery are STRONGLY recommended.
> 
> GPT data structures destroyed! You may now partition the disk using fdisk or
> other utilities.
> Creating new GPT entries.
> The operation has completed successfully.
> partx: specified range <1:0> does not make sense
> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same 
> device as the osd data
> WARNING:ceph-disk:Journal /dev/disk/by-id/wwn-0x500080d91010024b-part1 was 
> not prepared with ceph-disk. Symlinking directly.
> The operation has completed successfully.
> partx: /dev/disk/by-id/wwn-0x53959bd02f56: error adding partition 1
> meta-data=/dev/sdw1  isize=2048   agcount=4, agsize=244188597 blks
>  =   sectsz=512   attr=2, projid32bit=1
>  =   crc=0finobt=0
> data =   bsize=4096   blocks=976754385, imaxpct=5
>  =   sunit=0  swidth=0 blks
> naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
> log  =internal log   bsize=4096   blocks=476930, version=2
>  =   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none   extsz=4096   blocks=0, rtextents=0
> The operation has completed successfully.
> partx: /dev/disk/by-id/wwn-0x53959bd02f56: error adding partition 1
> 
> 
> [root@ceph0 ceph]# ceph -v
> ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> [root@ceph0 ceph]# rpm -qa | grep ceph
> ceph-radosgw-0.94.2-0.el7.x86_64
> libcephfs1-0.94.2-0.el7.x86_64
> ceph-common-0.94.2-0.el7.x86_64
> python-cephfs-0.94.2-0.el7.x86_64
> ceph-0.94.2-0.el7.x86_64
> [root@ceph0 ceph]#
> 
> 
> 
>> -Original Message-
>> From: Loic Dachary [mailto:l...@dachary.org]
>> Sent: Friday, June 26, 2015 3:29 PM
>> To: Bruce McFarland; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] RHEL 7.1 ceph-disk failures creating OSD
>>
>> Hi,
>>
>> Prior to firefly v0.80.8 ceph-disk zap did not call partprobe and that was
>> causing the kind of problems you're experiencing. It was fixed by
>> https://github.com/ceph/ceph/commit/e70a81464b906b9a304c29f474e6726762b63a7c
>> and is described in more detail at
>> http://tracker.ceph.com/issues/9665. Rebooting the machine ensures the
>> partition table is up to date and that's what you probably want to do after
>> that kind of failure. You can, however, avoid the failure by running:
>>
>>  * ceph-disk zap
>>  * partprobe
>>  * ceph-disk prepare
>>
>> Cheers
>>
>> P.S. The "partx: /dev/disk/by-id/wwn-0x53959ba80a4e: error adding
>> partition 1" can be ignored, it does not actually matter. A message was
>> added later to avoid confusion with a real error.
>> .
>> On 26/06/2015 17:09, Bruce McFarland wrote:
>>> I have moved storage nodes to RHEL 7.1 and used the basic server install. I
>> installed ceph-deploy and used the ceph.repo/epel.repo for installation of
>> ceph 0.80.7. I have tried ceph-disk with "zap" on the same command
>> line as "prepare" and on a separate command line immediately before the
>> ceph-disk prepare. I consistently run into the partition errors and am unable
>> to create OSDs on RHEL 7.1.
>>>
>>>
>>>
>>> ceph-disk prepare --cluster ceph --cluster-uuid 373a09f7-2070-4d20-8504-
>> c8653fb6db80 --fs-type xfs --zap-disk /dev/disk/by-id/wwn-
>> 0x53959ba80a4e /dev/disk/by-id/wwn-0x500080d9101001d6-part1
>>>
>>> Caution: invalid backup GPT header, but valid main header; regeneratin

Re: [ceph-users] Trying to understand Cache Pool behavior

2015-06-27 Thread Nick Fisk
Hi Reid,

Yes, they will, but if the object the user is writing to (a disk block, if
using RBD, which then maps to an object) has never been written to before,
Ceph won't have to promote the object from the base pool before being able to
write it.

However, as you write each object, once the cache pool is full another
object will be demoted down to the base tier.

As long as you don't mind slow performance, using the cache tier should be
OK. Otherwise, wait until the next release, as there will be several
improvements.
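
In case it helps, a minimal sketch of a write-back tier (pool names 'archive'
and 'archive-cache' are just placeholders, and the sizes below would need
tuning to your hardware):

  ceph osd tier add archive archive-cache
  ceph osd tier cache-mode archive-cache writeback
  ceph osd tier set-overlay archive archive-cache

  # the cache pool needs hit-set tracking and a size limit, otherwise the
  # flush/evict (and the demotion described above) never kicks in
  ceph osd pool set archive-cache hit_set_type bloom
  ceph osd pool set archive-cache hit_set_count 1
  ceph osd pool set archive-cache hit_set_period 3600
  ceph osd pool set archive-cache target_max_bytes 1099511627776
  ceph osd pool set archive-cache cache_target_dirty_ratio 0.4
  ceph osd pool set archive-cache cache_target_full_ratio 0.8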

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Reid Kelley
> Sent: 27 June 2015 00:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Trying to understand Cache Pool behavior
> 
> Have been reading the docs and trying to wrap my head around the idea of a
> "write miss" with a cache tier in write-back mode.
> 
> My use case is a large media archive, with write activity on file ingest
> (previews and thumbs generated) followed by very cold, limited read
> access. Seems to fit the cache model.
> 
> What I am confused about is the write-miss. Would a user uploading a new
> file ever experience a write-miss?
> 
> Thanks,
> Reid
> 
> 






Re: [ceph-users] kernel 3.18 io bottlenecks?

2015-06-27 Thread Stefan Priebe

Dear Ilya,

On 25.06.2015 at 14:07, Ilya Dryomov wrote:

On Wed, Jun 24, 2015 at 10:29 PM, Stefan Priebe  wrote:


On 24.06.2015 at 19:53, Ilya Dryomov wrote:


On Wed, Jun 24, 2015 at 8:38 PM, Stefan Priebe 
wrote:



On 24.06.2015 at 16:55, Nick Fisk wrote:



That kernel probably has the bug where tcp_nodelay is not enabled. That
is fixed in Kernel 4.0+, however also in 4.0 blk-mq was introduced which
brings two other limitations:-




blk-mq is terribly slow. That's correct.



Is that a general sentiment or your experience with rbd?  If the
latter, can you describe your workload and provide some before and
after blk-mq numbers?  We'd be very interested in identifying and
fixing any performance regressions you might have on blk-mq rbd.



Oh, I'm sorry. I accidentally compiled blk-mq into the kernel when 3.18.1 came
out and was wondering why the I/O waits on my Ceph OSDs were doubled or
even tripled. After reverting back to cfq everything was fine again. I
didn't dig deeper into it as I thought blk-mq was experimental in 3.18.


That doesn't make sense - rbd was switched to blk-mq in 4.0.  Or did
you try to apply the patch from the mailing list to 3.18?


I'm talking about the ceph-osd process/side, not about the rbd client side.
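
(For reference, a rough sketch of checking and reverting the elevator on an
OSD data disk - /dev/sdb here is just a placeholder:

  cat /sys/block/sdb/queue/scheduler
  # with blk-mq/scsi-mq this reports "none" and the legacy schedulers are
  # unavailable; otherwise the active scheduler is shown in brackets
  echo cfq > /sys/block/sdb/queue/scheduler

On a kernel built with scsi-mq, it can also be disabled at boot with the
scsi_mod.use_blk_mq=0 parameter instead of recompiling.)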


If you're willing to assist, I can give it a try, but I need the patches you
mention first (git commit IDs?).


No commit IDs, as the patches are not upstream yet.  I have everything
gathered in the testing+blk-mq-plug branch of ceph-client.git:

https://github.com/ceph/ceph-client/tree/testing%2Bblk-mq-plug

A deb (ubuntu, debian, etc):

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing_blk-mq-plug/linux-image.deb

An rpm (fedora, centos, rhel):

http://gitbuilder.ceph.com/kernel-rpm-centos7-x86_64-basic/ref/testing_blk-mq-plug/kernel.x86_64.rpm

These are built with slightly stripped down distro configs so it should
boot most boxes.

Thanks,

 Ilya




Re: [ceph-users] Trying to understand Cache Pool behavior

2015-06-27 Thread Reid Kelley
Sounds good - thanks for the info; I will wait and test with the next releases. 





> On Jun 27, 2015, at 9:24 AM, Nick Fisk  wrote:
> 
> Hi Reid,
> 
> Yes they will, but if the object which the user is writing to (Disk Block if
> using RBD, which then maps to an object) has never been written to before,
> it won't have to promote the object from the base pool before being able to
> write it.
> 
> However as you write each object, once the cache pool is full, another
> object will be demoted down to the base tier.
> 
> As long as you don't mind slow performance, using the cache tier should be
> ok. Otherwise wait until the next release as there will be several
> improvements.
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Reid Kelley
>> Sent: 27 June 2015 00:04
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Trying to understand Cache Pool behavior
>> 
>> Have been reading the docs and trying to wrap my head around the idea of a
>> "write miss" with a cache tier in write-back mode.
>> 
>> My use case is a large media archive, with write activity on file ingest
>> (previews and thumbs generated) followed by very cold, limited read
>> access. Seems to fit the cache model.
>> 
>> What I am confused about is the write-miss. Would a user uploading a new
>> file ever experience a write-miss?
>> 
>> Thanks,
>> Reid
>> 
>> 
> 
> 
> 
> 


Re: [ceph-users] kernel 3.18 io bottlenecks?

2015-06-27 Thread Ilya Dryomov
On Sat, Jun 27, 2015 at 6:20 PM, Stefan Priebe  wrote:
> Dear Ilya,
>
> On 25.06.2015 at 14:07, Ilya Dryomov wrote:
>>
>> On Wed, Jun 24, 2015 at 10:29 PM, Stefan Priebe 
>> wrote:
>>>
>>>
>>> On 24.06.2015 at 19:53, Ilya Dryomov wrote:


 On Wed, Jun 24, 2015 at 8:38 PM, Stefan Priebe 
 wrote:
>
>
>
> On 24.06.2015 at 16:55, Nick Fisk wrote:
>>
>>
>>
>> That kernel probably has the bug where tcp_nodelay is not enabled.
>> That
>> is fixed in Kernel 4.0+, however also in 4.0 blk-mq was introduced
>> which
>> brings two other limitations:-
>
>
>
>
> blk-mq is terribly slow. That's correct.



 Is that a general sentiment or your experience with rbd?  If the
 latter, can you describe your workload and provide some before and
 after blk-mq numbers?  We'd be very interested in identifying and
 fixing any performance regressions you might have on blk-mq rbd.
>>>
>>>
>>>
>>> Oh, I'm sorry. I accidentally compiled blk-mq into the kernel when 3.18.1
>>> came
>>> out and was wondering why the I/O waits on my Ceph OSDs were doubled or
>>> even tripled. After reverting back to cfq everything was fine again. I
>>> didn't dig deeper into it as I thought blk-mq was experimental in 3.18.
>>
>>
>> That doesn't make sense - rbd was switched to blk-mq in 4.0.  Or did
>> you try to apply the patch from the mailing list to 3.18?
>
>
> I'm talking about the ceph-osd process/side, not about the rbd client side.

Ah, sorry - Nick was clearly talking about the kernel client and
I replied to his mail.  The kernel you run your OSDs on shouldn't
matter much, as long as it's not something ancient (except when you
need to work around a particular filesystem bug), so I just assumed
you and German were talking about the kernel client.

Thanks,

Ilya


[ceph-users] Redundant networks in Ceph

2015-06-27 Thread Alex Gorbachev
The current network design in Ceph
(http://ceph.com/docs/master/rados/configuration/network-config-ref)
uses non-redundant networks for both cluster and public communication.
Ideally, in a high-load environment these will be 10 or 40+ GbE
networks.  For cost reasons, most such installations will use the same
switch hardware and separate Ceph traffic using VLANs.

Networking is complex, and situations are possible where switches and
routers drop traffic.  We ran into one of those at one of our sites -
connections to hosts stay up (so bonding NICs does not help), yet OSD
communication gets disrupted, client IO hangs, and failures cascade to
client applications.

My understanding is that if OSDs cannot connect for some time over the
cluster network, IO will hang and time out.  The document states:

"If you specify more than one IP address and subnet mask for either the
public or the cluster network, the subnets within the network must be
capable of routing to each other."

Which in the real world means a complicated Layer 3 routing setup and is
not practical in many configurations.
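
For illustration, the single-network setup today is just the following
ceph.conf fragment (subnets are placeholders); per the passage quoted above,
listing two comma-separated subnets per role only works if they can route to
each other:

  [global]
  public network = 10.10.1.0/24
  cluster network = 10.10.2.0/24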

What if there was an option for "cluster 2" and "public 2" networks,
to which OSDs and MONs would go in either active/backup or
active/active mode (cluster 1 and cluster 2 exist separately and do not
route to each other)?

The difference between this setup and bonding is that here the decision to
fail over and try the other network is at the OSD/MON level, and it brings
resilience to faults within the switch core, which are really only
detectable at the application layer.

Am I missing an already existing feature?  Please advise.

Best regards,
Alex Gorbachev
Intelligent Systems Services Inc.


Re: [ceph-users] Redundant networks in Ceph

2015-06-27 Thread Nick Fisk
Hi Alex,

I think the answer is that you do one of two things. You either design your
network so that it is fault tolerant in every way and network interruption is
not possible, or go with non-redundant networking but design your CRUSH map
around the failure domains of the network.
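
As a rough illustration of the second option (bucket names and hierarchy are
hypothetical), the CRUSH map could carry one bucket per network failure domain
and the rule would replicate across them, so losing one domain never holds
every copy of an object:

  # excerpt of a decompiled map (crushtool -d compiled-map -o map.txt)
  rack fabric-a { ... hosts reached via switch/fabric A ... }
  rack fabric-b { ... hosts reached via switch/fabric B ... }

  rule replicated_across_fabrics {
          ruleset 1
          type replicated
          min_size 1
          max_size 10
          step take default
          step chooseleaf firstn 0 type rack
          step emit
  }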

I'm interested in your example of where OSDs were unable to communicate.
What happened? Would it be possible to redesign the network to stop this
happening?

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Alex Gorbachev
> Sent: 27 June 2015 19:02
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Redundant networks in Ceph
> 
> The current network design in Ceph
> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> uses nonredundant networks for both cluster and public communication.
> Ideally, in a high load environment these will be 10 or 40+ GbE networks.
For
> cost reasons, most such installations will use the same switch hardware and
> separate Ceph traffic using VLANs.
> 
> Networking is complex, and situations are possible where switches and
> routers drop traffic.  We ran into one of those at one of our sites -
> connections to hosts stay up (so bonding NICs does not help), yet OSD
> communication gets disrupted, client IO hangs and failures cascade to
client
> applications.
> 
> My understanding is that if OSDs cannot connect for some time over the
> cluster network, that IO will hang and time out.  The document states "
> 
> If you specify more than one IP address and subnet mask for either the
> public or the cluster network, the subnets within the network must be
> capable of routing to each other."
> 
> Which in real world means complicated Layer 3 setup for routing and is not
> practical in many configurations.
> 
> What if there was an option for "cluster 2" and "public 2" networks, to
which
> OSDs and MONs would go either in active/backup or active/active mode
> (cluster 1 and cluster 2 exist separately do not route to each other)?
> 
> The difference between this setup and bonding is that here the decision to
> fail over and try the other network is at the OSD/MON level, and it brings
> resilience to faults within the switch core, which are really only detectable
> at the application layer.
> 
> Am I missing an already existing feature?  Please advise.
> 
> Best regards,
> Alex Gorbachev
> Intelligent Systems Services Inc.






[ceph-users] Ubuntu -Juno Openstack - Ceph integrated - Istalling ubuntu server instance

2015-06-27 Thread Teclus Dsouza -X (teclus - TECH MAHINDRA LIM at Cisco)
Hello everyone,

I created a bootable volume on OpenStack and am trying to boot an ubuntu-server 
14.04 image. I am able to get to the initial setup screen of Ubuntu, but 
after asking for the timezone and location I am not able to proceed further, as 
it says it's not able to access the CD-ROM drive.

The backend storage is connected to Ceph. Does anyone have any solutions or 
workarounds for this issue?


cinder list
+--------------------------------------+-----------+------------------+------+-------------+----------+--------------------------------------+
|                  ID                  |   Status  |   Display Name   | Size | Volume Type | Bootable |             Attached to              |
+--------------------------------------+-----------+------------------+------+-------------+----------+--------------------------------------+
| 002762cc-2e4b-417d-9d33-c90d6e87e758 |   in-use  |   my-boot-vol    |  10  |     None    |   true   | 78b80f78-07e9-46ed-8b0c-0e807a4c0805 |
| 94164dcb-7e7a-4d87-89b7-f535edf299b6 | available | cinder-ceph-vol1 |  10  |     None    |  false   |                                      |
| c1f39f7d-82ef-48cd-884a-86431d251e43 | available | ubuntu-boot-vol2 |  40  |     None    |   true   |                                      |
| ef346116-8ed8-48d0-9401-78b5c120a4ef | available | cinder-ceph-vol2 | 100  |     None    |  false   |                                      |
| f67f79d0-cafb-4575-b88a-765fc631aa42 |   in-use  | ubuntu-boot-vol  |  10  |     None    |   true   | c1289143-9bed-4d21-a12d-18c22b01163e |
+--------------------------------------+-----------+------------------+------+-------------+----------+--------------------------------------+

nova list
+--------------------------------------+------------------+---------+------------+-------------+----------------------+
|                  ID                  |       Name       |  Status | Task State | Power State |       Networks       |
+--------------------------------------+------------------+---------+------------+-------------+----------------------+
| 78b80f78-07e9-46ed-8b0c-0e807a4c0805 | dsl-linux        | SHUTOFF | -          | Shutdown    | ext-net=10.11.12.158 |
| b6738a95-9bf6-441f-9cd5-e0665111c040 | testvm-1         | ACTIVE  | -          | Running     | ext-net=10.11.12.156 |
| 01dd21c3-c84d-4572-93ef-c3904a6f877e | ubuntu-server-14 | ACTIVE  | -          | Running     | ext-net=10.11.12.160 |
+--------------------------------------+------------------+---------+------------+-------------+----------------------+



Regards
Teclus Dsouza


Re: [ceph-users] Redundant networks in Ceph

2015-06-27 Thread Alex Gorbachev
Hi Nick,

Thank you for writing back:

> I think the answer is you do 1 of 2 things. You either design your network
> so that it is fault tolerant in every way so that network interruption is
> not possible. Or go with non-redundant networking, but design your crush map
> around the failure domains of the network.

We'll redesign the network shortly - the problem, in general, is that I
am finding it is possible, even in well-designed redundant networks,
for packet loss to occur for various reasons (maintenance, cables,
protocol issues, etc.).  So while there is not an interruption (defined
as 100% service loss), there may be occasional packet loss and
high-latency situations, even when the backbone is very fast.

The CRUSH map idea sounds interesting.  But there are still concerns,
such as massive East-West data relocations (between racks in a
leaf-spine architecture such as
https://community.mellanox.com/docs/DOC-1475), should there be an
outage in the spine.  Plus, such issues are enormously hard to
troubleshoot.

> I'm interested in your example of where OSDs were unable to communicate.
> What happened? Would it be possible to redesign the network to stop this
> happening?

Our SuperCore design uses Ceph OSD nodes to provide storage to LIO
Target iSCSI nodes, which then deliver it to ESXi hosts.  LIO is
sensitive to hangs, and we often see an RBD hang translate into an iSCSI
timeout, which causes ESXi to abort connections, hang, and crash
applications.  This only happens at one site, where it is likely there
is a switch issue somewhere.  These issues are sporadic and come and
go in storms - so far all Ceph analysis has pointed to network
disruptions, from which the RBD client is unable to recover.  The
network vendor still cannot find anything wrong.

We'll replace the whole network, but having seen such issues at a few
other sites, I was thinking that a "B bus" for networking would be a
good design for OSDs.  This approach is commonly used in traditional
SANs, where the "A bus" and "B bus" are not connected, so they cannot
possibly cross-contaminate in any way.

Another reference is multipathing, where IO can be sent via redundant
paths - most storage vendors recommend using application- (higher-)
level multipathing (aka MPIO) over network redundancy (such as
bonding).  We find this to be a valid recommendation, as clients run
into issues less.  Somewhat related:
http://serverfault.com/questions/510882/why-mpio-instead-of-802-3ad-team-for-iscsi
- to quote, "MPIO detects and handles path failures, whereas 802.3ad
can only compensate for a link failure".

I see OSD connections as paths, rather than links, as these are higher
level object storage exchanges.

Thank you,
Alex

>
> Nick
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Alex Gorbachev
>> Sent: 27 June 2015 19:02
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] Redundant networks in Ceph
>>
>> The current network design in Ceph
>> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
>> uses nonredundant networks for both cluster and public communication.
>> Ideally, in a high load environment these will be 10 or 40+ GbE networks.
> For
>> cost reasons, most such installation will use the same switch hardware and
>> separate Ceph traffic using VLANs.
>>
>> Networking in complex, and situations are possible when switches and
>> routers drop traffic.  We ran into one of those at one of our sites -
>> connections to hosts stay up (so bonding NICs does not help), yet OSD
>> communication gets disrupted, client IO hangs and failures cascade to
> client
>> applications.
>>
>> My understanding is that if OSDs cannot connect for some time over the
>> cluster network, that IO will hang and time out.  The document states "
>>
>> If you specify more than one IP address and subnet mask for either the
>> public or the cluster network, the subnets within the network must be
>> capable of routing to each other."
>>
>> Which in real world means complicated Layer 3 setup for routing and is not
>> practical in many configurations.
>>
>> What if there was an option for "cluster 2" and "public 2" networks, to
> which
>> OSDs and MONs would go either in active/backup or active/active mode
>> (cluster 1 and cluster 2 exist separately do not route to each other)?
>>
>> The difference between this setup and bonding is that here decision to
> fail
>> over and try the other network is at OSD/MON level, and it bring
> resilience to
>> faults within the switch core, which is really only detectable at
> application
>> layer.
>>
>> Am I missing an already existing feature?  Please advise.
>>
>> Best regards,
>> Alex Gorbachev
>> Intelligent Systems Services Inc.
>
>
>
>

Re: [ceph-users] Redundant networks in Ceph

2015-06-27 Thread Nick Fisk




> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 27 June 2015 21:55
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Redundant networks in Ceph
> 
> Hi Nick,
> 
> Thank you for writing back:
> 
> > I think the answer is you do 1 of 2 things. You either design your
> > network so that it is fault tolerant in every way so that network
> > interruption is not possible. Or go with non-redundant networking, but
> > design your crush map around the failure domains of the network.
> 
> We'll redesign the network shortly - the problem is in general that I am
> finding it is possible, in even well designed redundant networks, to have
> packet loss occur for various reasons (maintenance, cables, protocol issues
> etc.).  So while there is not an interruption (defined as 100% service loss),
> there may be occasional packet loss issues and high latency situations, even
> when the backbone is very fast.

I know what you mean - no matter how hard you try, something unexpected always 
happens. That said, I think OSD timeouts should be higher than HSRP and 
spanning-tree convergence times, so it should survive most incidents that I can 
think of. 

> 
> The CRUSH map idea sounds interesting.  But there are still concerns, such as
> massive data relocations East-West (between racks in a leaf-spine
> architecture such as
> https://community.mellanox.com/docs/DOC-1475 , should there be an
> outage in the spine.  Plus such issues are enormously hard to troubleshoot.

You can set the maximum CRUSH grouping that will allow OSDs to be marked out. 
You can use this to stop unwanted data movement from occurring during outages.
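
For example (a sketch - assuming a host's uplink is the failure domain you are
worried about), in ceph.conf on the monitors:

  [mon]
  mon osd down out subtree limit = host

With that, if every OSD on a host drops at once - which usually points to a
network/uplink problem rather than dead disks - those OSDs are not
automatically marked out, so no mass rebalance is triggered (the default for
this option is rack).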

> 
> > I'm interested in your example of where OSD's where unable to
> communicate.
> > What happened? Would it possible to redesign the network to stop this
> > happening?
> 
> Our SuperCore design uses Ceph OSD nodes to provide storage to LIO Target
> iSCSI nodes, which then deliver it to ESXi hosts.  LIO is sensitive to hangs, 
> and
> often we see an RBD hang translate into iSCSI timeout, which causes ESXi to
> abort connections, hang and crash applications.  This only happens at one
> site, where it is likely there is a switch issue somewhere.  These issues are
> sporadic and come and go as storms - so far all Ceph analysis pointed to
> network disruptions, from which the RBD client is unable to recover.  The
> network vendor still cannot find anything wrong.

Ah, yeah, been there with LIO and ESXi and gave up on it. I found any pause 
longer than around 10 seconds would send both of them into a death spiral. I 
know you currently only see it due to some networking blip, but you will most 
likely also see it when disks fail, etc. For me, I couldn't have all my 
datastores going down every time something blipped or got a little slow. There 
are ongoing discussions about it on the target mailing list and Mike Christie 
from Red Hat is looking into the problem, so hopefully it will get sorted at 
some point. For what it's worth, both SCST and TGT seem to be immune to this.

> 
> We'll replace the whole network, but I was thinking, having seen such issues
> at a few other sites, if a "B-bus" for networking would be a good design for
> OSDs.  This approach is commonly used in traditional SANs, where the "A
> bus" and "B bus" are not connected,so they cannot possibly cross
> contaminate in any way.

Probably implementing something like Multipath TCP would be the best bet to 
mirror the traditional dual-fabric SAN design. 

> 
> Another reference is multipathing, where IO can be send via redundant
> paths - most storage vendors recommend using application (higher) level
> multipathing (aka MPIO) vs. network redundancy (such as bonding).  We find
> this to be a valid recommendation as clients run into issues less.  Somewhat
> related to http://serverfault.com/questions/510882/why-mpio-instead-of-
> 802-3ad-team-for-iscsi
> to quote - "MPIO detects and handles path failures, whereas 802.3ad can
> only compensate for a link failure".
> 
> I see OSD connections as paths, rather than links, as these are higher level
> object storage exchanges.
> 
> Thank you,
> Alex
> 
> >
> > Nick
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Alex Gorbachev
> >> Sent: 27 June 2015 19:02
> >> To: ceph-users@lists.ceph.com
> >> Subject: [ceph-users] Redundant networks in Ceph
> >>
> >> The current network design in Ceph
> >> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> >> uses nonredundant networks for both cluster and public communication.
> >> Ideally, in a high load environment these will be 10 or 40+ GbE networks.
> > For
> >> cost reasons, most such installation will use the same switch
> >> hardware and separate Ceph traffic using VLANs.
> >>
> >> Networking in complex, and situations are possible when switches and
> >> routers drop traffic.  We ran

[ceph-users] SSL Certificate failure when attaching volume to VM

2015-06-27 Thread Johanni Thunstrom
Dear Ceph Community,

We are trying to integrate Ceph with OpenStack and are facing certificate issues 
when attaching a Cinder volume to a Nova VM.  We have the environment variable 
OS_CACERT set to the correct certificate path, which is read to set cacert. 
The certificate is verified successfully when creating images, volumes, and VMs. 
However, when the compute VM tries to communicate with the controller, the 
certificate fails to verify. Is there a configuration variable that must be set 
for the certificate to verify correctly? Any advice is much appreciated.


2015-06-26 23:15:41.526 1437 TRACE oslo.messaging.rpc.dispatcher
2015-06-26 23:15:41.528 1437 ERROR oslo.messaging._drivers.common [-] Returning 
exception Unable to establish connection: [Errno 1] _ssl.c:492: 
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify 
failed to caller
2015-06-26 23:15:41.528 1437 ERROR oslo.messaging._drivers.common [-] 
['Traceback (most recent call last):\n', '  File 
"/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 133, 
in _dispatch_and_reply\nincoming.message))\n', '  File 
"/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 176, 
in _dispatch\nreturn self._do_dispatch(endpoint, method, ctxt, args)\n', '  
File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 
122, in _do_dispatch\nresult = getattr(endpoint, method)(ctxt, 
**new_args)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 393, in 
decorated_function\nreturn function(self, context, *args, **kwargs)\n', '  
File "/usr/lib/python2.6/site-packages/nova/exception.py", line 88, in 
wrapped\npayload)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, 
in __exit__\nsix.reraise(self.type_, self.value, self.tb)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/exception.py", line 71, in wrapped\n
return f(self, context, *args, **kw)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 274, in 
decorated_function\npass\n', '  File 
"/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, 
in __exit__\nsix.reraise(self.type_, self.value, self.tb)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 260, in 
decorated_function\nreturn function(self, context, *args, **kwargs)\n', '  
File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 303, in 
decorated_function\ne, sys.exc_info())\n', '  File 
"/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, 
in __exit__\nsix.reraise(self.type_, self.value, self.tb)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 290, in 
decorated_function\nreturn function(self, context, *args, **kwargs)\n', '  
File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 4167, in 
attach_volume\nbdm.destroy(context)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 68, 
in __exit__\nsix.reraise(self.type_, self.value, self.tb)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 4164, in 
attach_volume\nreturn self._attach_volume(context, instance, 
driver_bdm)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 4185, in 
_attach_volume\nself.volume_api.unreserve_volume(context, 
bdm.volume_id)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/volume/cinder.py", line 173, in 
wrapper\nres = method(self, ctx, volume_id, *args, **kwargs)\n', '  File 
"/usr/lib/python2.6/site-packages/nova/volume/cinder.py", line 249, in 
unreserve_volume\ncinderclient(context).volumes.unreserve(volume_id)\n', '  
File "/usr/lib/python2.6/site-packages/cinderclient/v1/volumes.py", line 293, 
in unreserve\nreturn self._action(\'os-unreserve\', volume)\n', '  File 
"/usr/lib/python2.6/site-packages/cinderclient/v1/volumes.py", line 250, in 
_action\nreturn self.api.client.post(url, body=body)\n', '  File 
"/usr/lib/python2.6/site-packages/cinderclient/client.py", line 223, in post\n  
  return self._cs_request(url, \'POST\', **kwargs)\n', '  File 
"/usr/lib/python2.6/site-packages/cinderclient/client.py", line 212, in 
_cs_request\nraise exceptions.ConnectionError(msg)\n', 'ConnectionError: 
Unable to establish connection: [Errno 1] _ssl.c:492: error:14090086:SSL 
routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed\n']

Sincerely,
Johanni B. Thunstrom