Re: [ceph-users] ceph + vmware

Frédéric Nass Fri, 22 Jul 2016 03:19:37 -0700


On 22/07/2016 11:48, Nick Fisk wrote:


*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40

*To:* n...@fisk.me.uk; 'Jake Young' <jak3...@gmail.com>; 'Jan Schermer' <j...@schermer.cz>

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

On 22/07/2016 10:23, Nick Fisk wrote:

    *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
    Behalf Of *Frédéric Nass
    *Sent:* 22 July 2016 09:10
    *To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
    <jak3...@gmail.com> <mailto:jak3...@gmail.com>; 'Jan Schermer'
    <j...@schermer.cz> <mailto:j...@schermer.cz>
    *Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    *Subject:* Re: [ceph-users] ceph + vmware

    On 22/07/2016 09:47, Nick Fisk wrote:

        *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
        *On Behalf Of *Frédéric Nass
        *Sent:* 22 July 2016 08:11
        *To:* Jake Young <jak3...@gmail.com>
        <mailto:jak3...@gmail.com>; Jan Schermer <j...@schermer.cz>
        <mailto:j...@schermer.cz>
        *Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
        *Subject:* Re: [ceph-users] ceph + vmware

        On 20/07/2016 21:20, Jake Young wrote:



            On Wednesday, July 20, 2016, Jan Schermer <j...@schermer.cz
            <mailto:j...@schermer.cz>> wrote:


                > On 20 Jul 2016, at 18:38, Mike Christie
                <mchri...@redhat.com <mailto:mchri...@redhat.com>> wrote:
                >
                > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
                >>
                >> Hi Mike,
                >>
                >> Thanks for the update on the RHCS iSCSI target.
                >>
                >> Will RHCS 2.1 iSCSI target be compliant with VMWare
                ESXi client ? (or is
                >> it too early to say / announce).
                >
                > No HA support for sure. We are looking into non HA
                support though.
                >
                >>
                >> Knowing that HA iSCSI target was on the roadmap, we
                chose iSCSI over NFS
                >> so we'll just have to remap RBDs to RHCS targets
                when it's available.
                >>
                >> So we're currently running :
                >>
                >> - 2 LIO iSCSI targets exporting the same RBD
                images. Each iSCSI target
                >> has all VAAI primitives enabled and runs the same
                configuration.
                >> - RBD images are mapped on each target using the
                kernel client (so no
                >> RBD cache).
                >> - 6 ESXi hosts. Each ESXi can access the same LUNs
                through both targets,
                >> but in a failover manner so that each ESXi always
                accesses the same LUN
                >> through one target at a time.
                >> - LUNs are VMFS datastores and VAAI primitives are
                enabled client side
                >> (except UNMAP as per default).
                >>
                >> Do you see anything risky regarding this configuration ?
                >
                > If you use an application that uses SCSI persistent
                reservations then you
                > could run into troubles, because some apps expect
                the reservation info
                > to be on the failover nodes as well as the active ones.
                >
                > Depending on how you do failover and the issue
                that caused the
                > failover, IO could be stuck on the old active node
                and cause data
                > corruption. If the initial active node loses its
                network connectivity
                > and you failover, you have to make sure that the
                initial active node is
                > fenced off and IO stuck on that node will never be
                executed. So do
                > something like add it to the ceph monitor blacklist
                and make sure IO on
                > that node is flushed and failed before
                unblacklisting it.
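
A minimal sketch of the ordering described above, as a toy model rather than a real resource agent. The `TargetNode` class and the `fail_over` / `unfence` helpers are hypothetical names introduced for illustration; the only real commands referred to (in comments) are `ceph osd blacklist add` and `ceph osd blacklist rm`, and how stale IO is actually failed on the old node is left out because the thread does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class TargetNode:
    """Toy model of one iSCSI target node (hypothetical, for illustration)."""
    name: str
    blacklisted: bool = False
    in_flight: list = field(default_factory=list)  # IO queued but not yet written

    def fail_in_flight(self) -> list:
        """Drop queued IO so it can never reach the cluster later."""
        dropped, self.in_flight = self.in_flight, []
        return dropped


def fail_over(old: TargetNode, new: TargetNode) -> list:
    """Return the ordered steps of a fenced failover from `old` to `new`."""
    steps = []
    # 1. Fence the old node first, e.g. `ceph osd blacklist add <addr>`.
    old.blacklisted = True
    steps.append(f"blacklist {old.name}")
    # 2. Only then let initiators use the new path.
    steps.append(f"activate {new.name}")
    return steps


def unfence(old: TargetNode) -> str:
    """Refuse to lift the fence while stale IO is still queued on the node."""
    if old.in_flight:
        raise RuntimeError(f"{old.name} still holds {len(old.in_flight)} stale IO(s)")
    # e.g. `ceph osd blacklist rm <addr>`
    old.blacklisted = False
    return f"unblacklist {old.name}"


if __name__ == "__main__":
    a = TargetNode("target-a", in_flight=["write lba=100", "write lba=101"])
    b = TargetNode("target-b")
    print(fail_over(a, b))
    a.fail_in_flight()  # stale IO must be failed before the fence is lifted
    print(unfence(a))
```

The point of the model is the guard in `unfence`: the blacklist entry is removed only after the old node's queued IO has been discarded, which is the condition Mike describes.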
                >

                With iSCSI you can't really do hot failover unless you
                only use synchronous IO.

            VMware does only use synchronous IO. Since the hypervisor
            can't tell what type of data the VMs are writing, all IO
            is treated as needing to be synchronous.

                (With any of the open-source target software available.)
                Flushing the buffers doesn't really help because you
                don't know which in-flight IO completed before the outage
                and which didn't. You could end up with only part of the
                "transaction" written to persistent storage.

                If you only use synchronous IO all the way from client
                to the persistent storage shared between
                iSCSI targets then all should be fine, otherwise YMMV -
                some people run it like that without realizing
                the dangers and have never had a problem, so it may be
                strictly theoretical, and it all depends on how often
                you need to do the
                failover and what data you are storing - corrupting a
                few images on a gallery site could be fine but corrupting
                a large database tablespace is no fun at all.

            No, it's not. VMFS corruption is pretty bad too and there
            is no fsck for VMFS...


                Some (non-open-source) solutions exist. Solaris
                supposedly does this in some(?) way; maybe some iSCSI guru
                can chime in and tell us what magic they do, but I don't
                think it's possible without client support
                (you essentially have to do something like transactions
                and replay the last transaction on failover). Maybe
                something can be enabled in the protocol to make the iSCSI
                IO synchronous, or at least make it wait for some sort
                of ACK from the
                server (which would require some sort of cache
                mirroring between the targets) without making it
                synchronous all the way.

            This is why the SAN vendors wrote their own clients and
            drivers. It is not possible to dynamically make all OS's
            do what your iSCSI target expects.

            Something like VMware does the right thing pretty much all
            the time (there are some iSCSI initiator bugs in earlier
            ESXi 5.x).  If you have control of your ESXi hosts then
            attempting to set up HA iSCSI targets is possible.

            If you have a mixed client environment with various
            versions of Windows connecting to the target, you may be
            better off buying some SAN appliances.


                The one time I had to use it I resorted to simply
                mirroring it via mdraid on the client side over two
                targets sharing the same
                DAS, and this worked fine during testing but never
                went to production in the end.

                Jan

                >
                >>
                >> Would you recommend LIO or STGT (with rbd bs-type)
                target for ESXi
                >> clients ?
                >
                > I can't say, because I have not used stgt with rbd
                bs-type support enough.

            For starters, STGT doesn't implement VAAI properly and you
            will need to disable VAAI in ESXi.

            LIO does seem to implement VAAI properly, but performance
            is not nearly as good as STGT even with VAAI's benefits.
            The assumption for the cause is that LIO currently uses
            kernel rbd mapping and kernel rbd performance is not as
            good as librbd.

            I recently did a simple test of creating an 80GB eager
            zeroed disk with STGT (VAAI disabled, no rbd client
            cache) and LIO (VAAI enabled) and found that STGT was
            actually slightly faster.

            I think we're all holding our breath waiting for LIO
            librbd support via TCMU, which seems to be right around
            the corner. That solution will combine the performance
            benefits of librbd with the more feature-full LIO iSCSI
            interface. The lrbd configuration tool for LIO from SUSE
            is pretty cool and it makes configuring LIO easier than STGT.


        Hi Jake,

        The problem we're facing with LIO is that ESXi hosts
        regularly disconnect from vCenter. This is a result of
        the iSCSI datastore becoming unreachable.
        It happens randomly. The last time, there was almost no VM
        activity at all (only 6 VMs in the lab), but ESX had requested
        a write to the '.iormstats.sf' file, which I suppose is related
        to Storage I/O Control, though I'm not sure of that.

        Setting VMFS3.UseATSForHBOnVMFS5 to 0 didn't help. Restarting
        the LIO target almost instantly solves it.

        Has any of you ever encountered this issue with the LIO target ?

        Yes, this is a current known problem that will hopefully be
        resolved soon. When there is a delay servicing IO, ESXi asks
        the target to cancel the IO, LIO tries to do this, but from
        what I understand, the RBD doesn’t have the API to allow LIO
        to reach into the Ceph cluster and cancel the in flight IO.
        LIO responds back, saying I can’t do this and then ESXi asks
        again. And so LIO and ESXi enter a loop forever.
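
The exchange described above can be sketched as a toy simulation. This is not initiator or target code: `abort_loop` and its parameters are hypothetical, and `max_attempts` exists only so the simulation terminates, whereas the real exchange is described as looping indefinitely.

```python
def abort_loop(can_cancel: bool, max_attempts: int = 5) -> list:
    """Toy model of the ESXi/LIO abort exchange.

    can_cancel:   whether the backstore is able to cancel an in-flight IO.
    max_attempts: hypothetical cap so that the simulation stops.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        log.append(f"initiator: abort task (attempt {attempt})")
        if can_cancel:
            log.append("target: task aborted")
            return log
        # The target has no way to reach into the cluster and cancel the IO,
        # so it reports failure and the initiator asks again.
        log.append("target: abort failed, IO still in flight")
    log.append("simulation stopped after max_attempts")
    return log


if __name__ == "__main__":
    for line in abort_loop(can_cancel=False, max_attempts=3):
        print(line)
```

With `can_cancel=True` the exchange ends after one round trip; with `can_cancel=False` every attempt produces the same failure, which is the loop Nick describes.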

    Hi Nick,

    Thanks for this explanation.

    Are you aware of any workaround or ESXi initiator option to tweak
    (like an I/O timeout value) to avoid that ?

    Or does this make the LIO target unusable with ESXi as of now ?

    Is STGT also affected or does it respond better with the rbd
    (librbd) backstore ?

    Check out my response in this thread

    
http://ceph-users.ceph.narkive.com/JFwme605/suse-enterprise-storage3-rbd-lio-vmware-performance-bad
    


Nick,

What a great post (#5) ! :-)

It clearly states what I'm hitting with LIO (vmkernel.log) :

2016-07-21T07:33:38.544Z cpu26:386324)WARNING: ScsiPath: 7154: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba40:C2:T1:L0
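
For anyone wanting to count these warnings per path, here is a minimal sketch of a parser. The pattern is written only against the single line quoted above, so treat it as an assumption about the format rather than a description of vmkernel.log in general; `ABORT_RE` and `parse_abort_warning` are hypothetical names.

```python
import re

# Matches only the "failed TaskMgmt abort" warning quoted above.
# Optional whitespace (\s*) tolerates spaces lost when the archive joined lines.
ABORT_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu\d+:\d+\)WARNING: ScsiPath: \d+: "
    r"Set\s*retry timeout for failed TaskMgmt abort for CmdSN "
    r"(?P<cmdsn>0x[0-9a-fA-F]+), status (?P<status>\w+),\s*path (?P<path>\S+)$"
)


def parse_abort_warning(line: str):
    """Return timestamp, CmdSN, status and path as a dict, or None if no match."""
    match = ABORT_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "2016-07-21T07:33:38.544Z cpu26:386324)WARNING: ScsiPath: 7154: "
        "Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, "
        "status Failure, path vmhba40:C2:T1:L0"
    )
    print(parse_abort_warning(sample))
```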


Have you tried STGT (with rbd backstore) ? I'll give SCST a try...

Yep, but see my point about being unable to stop it when there is ongoing IO. This makes clustering hard, as you have to start adding resource agents to block/manipulate TCP packets to drain iSCSI connections. I gave up trying to get it to work 100% reliably.




When you say 'NFS is very easy to configure for HA', how so ?

I thought it was something hard to achieve, involving clustering software such as Corosync, Pacemaker, DRBD or GFS2. Am I missing something ? (NFS-Ganesha ?)

Easy compared to iSCSI. Yes, you have to use pacemaker/corosync, but that’s the easy part of the whole process.


Ok. So this would be an active / passive scenario, right ?

The hard part seems to be setting up the right fencing with the right commands on each NFS node. :-/ It's not really clear to me whether an active NFS server under load will agree to shut down gracefully, so that you can safely unmap the RBD and have it remapped on the other node.


Frederic.

There are a lot of things that can go wrong with clustered iSCSI, whereas I have found NFS to be much simpler. ESXi seems to handle NFS failure better. With iSCSI, unless you catch it quickly, everything goes APD/PDL and you end up with all sorts of problems. NFS seems to be able to disappear and then pop back with no drama, from what I have seen so far.

Again, thanks for your help,

Frederic.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
