On 22/07/2016 09:47, Nick Fisk wrote:

*From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of* Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young <jak3...@gmail.com>; Jan Schermer <j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

On 20/07/2016 21:20, Jake Young wrote:



    On Wednesday, July 20, 2016, Jan Schermer <j...@schermer.cz
    <mailto:j...@schermer.cz>> wrote:


        > On 20 Jul 2016, at 18:38, Mike Christie <mchri...@redhat.com
        <mailto:mchri...@redhat.com>> wrote:
        >
        > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
        >>
        >> Hi Mike,
        >>
        >> Thanks for the update on the RHCS iSCSI target.
        >>
        >> Will the RHCS 2.1 iSCSI target be compatible with VMware ESXi
        >> clients? (Or is it too early to say / announce?)
        >
        > No HA support for sure. We are looking into non-HA support
        > though.
        >
        >>
        >> Knowing that an HA iSCSI target was on the roadmap, we chose
        >> iSCSI over NFS, so we'll just have to remap the RBDs to RHCS
        >> targets when it's available.
        >>
        >> So we're currently running:
        >>
        >> - 2 LIO iSCSI targets exporting the same RBD images. Each
        >> iSCSI target has all VAAI primitives enabled and runs the
        >> same configuration.
        >> - RBD images are mapped on each target using the kernel
        >> client (so no RBD cache).
        >> - 6 ESXi hosts. Each ESXi host can access the same LUNs
        >> through both targets, but in a failover manner, so that each
        >> ESXi host always accesses a given LUN through one target at
        >> a time.
        >> - LUNs are VMFS datastores and VAAI primitives are enabled
        >> client side (except UNMAP, as per default).
        >>
        >> Do you see anything risky in this configuration?
        >
        > If you use an application that uses SCSI persistent
        > reservations then you could run into trouble, because some
        > apps expect the reservation info to be on the failover nodes
        > as well as the active ones.
        >
        > Depending on how you do failover and the issue that caused
        > the failover, IO could be stuck on the old active node and
        > cause data corruption. If the initial active node loses its
        > network connectivity and you fail over, you have to make sure
        > that the initial active node is fenced off and that IO stuck
        > on that node will never be executed. So do something like add
        > it to the ceph monitor blacklist and make sure IO on that node
        > is flushed and failed before unblacklisting it.
        >
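The fencing order described above can be sketched as a small Python helper. This is only an illustration under stated assumptions, not a tested procedure: the address format and the `flush-and-fail-stuck-io` step are hypothetical placeholders, since how stuck IO is flushed and failed depends on the target setup. Only the two `ceph osd blacklist` commands are real Ceph CLI calls (later Ceph releases renamed the subcommand to `blocklist`).

```python
from typing import List


def fencing_plan(old_active_addr: str, expire_seconds: int = 3600) -> List[List[str]]:
    """Return the ordered commands for fencing a failed iSCSI gateway node.

    The function only builds the command lines; it does not run anything.
    """
    return [
        # 1. Fence first: the cluster rejects IO from the old active node,
        #    so writes stuck on it can never land after failover.
        ["ceph", "osd", "blacklist", "add", old_active_addr, str(expire_seconds)],
        # 2. HYPOTHETICAL placeholder: flush and fail whatever IO is still
        #    queued on the old node (the real step is setup specific).
        ["flush-and-fail-stuck-io", old_active_addr],
        # 3. Only once the stuck IO is gone, let the node talk to Ceph again.
        ["ceph", "osd", "blacklist", "rm", old_active_addr],
    ]


if __name__ == "__main__":
    # "192.168.0.10:0/0" is a made-up client address for illustration.
    for cmd in fencing_plan("192.168.0.10:0/0"):
        print(" ".join(cmd))
```

The point of returning a list rather than executing the commands is that the order is what matters: removing the blacklist entry before the stuck IO has been failed would reopen the corruption window.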

        With iSCSI you can't really do hot failover unless you only
        use synchronous IO.

    VMware does only use synchronous IO. Since the hypervisor can't
    tell what type of data the VMs are writing, all IO is treated as
    needing to be synchronous.

        (With any of the open-source target software available.)
        Flushing the buffers doesn't really help because you don't
        know which in-flight IO completed before the outage
        and which didn't. You could end up with only part of the
        "transaction" written to persistent storage.

        If you only use synchronous IO all the way from the client to
        the persistent storage shared between the iSCSI targets then
        all should be fine, otherwise YMMV. Some people run it like
        that without realizing the dangers and have never had a
        problem, so the risk may be strictly theoretical, and it all
        depends on how often you need to do the failover and what data
        you are storing - corrupting a few images on a gallery site
        could be fine, but corrupting a large database tablespace is
        no fun at all.

    No, it's not. VMFS corruption is pretty bad too and there is no
    fsck for VMFS...
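
Jan's point about acknowledged but unpersisted writes can be illustrated with a toy Python model. It is a deliberately simplified sketch, not a model of any real iSCSI target: the `Target` class, its volatile `cache`, and the `run` helper are all invented for illustration, and "failover" is reduced to the active node disappearing along with its cache.

```python
from typing import List


class Target:
    """Toy iSCSI target in front of storage shared with a standby target."""

    def __init__(self, storage: List[str], synchronous: bool) -> None:
        self.storage = storage            # shared persistent storage
        self.synchronous = synchronous
        self.cache: List[str] = []        # acknowledged but not yet persisted

    def write(self, block: str) -> str:
        if self.synchronous:
            self.storage.append(block)    # persist before acknowledging
        else:
            self.cache.append(block)      # acknowledge from volatile cache
        return "ACK"

    def flush(self) -> None:
        self.storage.extend(self.cache)
        self.cache.clear()


def run(synchronous: bool) -> List[str]:
    """Write a two-part transaction, then lose the active target."""
    storage: List[str] = []
    active = Target(storage, synchronous)
    active.write("txn-part-1")
    active.flush()                        # first part reaches shared storage
    active.write("txn-part-2")            # client has already received ACK
    # The active node dies here: its cache is gone and the standby only
    # sees what is on the shared storage.
    return storage


if __name__ == "__main__":
    print("synchronous :", run(True))
    print("asynchronous:", run(False))
```

In the asynchronous case the client believes both parts were written, but only the first one survives the failover, which is the partial "transaction" described above.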


        Some (non-open-source) solutions exist. Solaris supposedly does
        this in some(?) way, and maybe some iSCSI guru can chime in and
        tell us what magic they do, but I don't think it's possible
        without client support (you essentially have to do something
        like transactions and replay the last transaction on
        failover). Maybe something can be enabled in the protocol to
        make the iSCSI IO synchronous, or make it at least wait for
        some sort of ACK from the server (which would require some sort
        of cache mirroring between the targets) without making it
        synchronous all the way.

    This is why the SAN vendors wrote their own clients and drivers.
    It is not possible to dynamically make all OS's do what your iSCSI
    target expects.

    Something like VMware does the right thing pretty much all the
    time (there are some iSCSI initiator bugs in earlier ESXi 5.x).
    If you have control of your ESXi hosts then attempting to set up
    HA iSCSI targets is possible.

    If you have a mixed client environment with various versions of
    Windows connecting to the target, you may be better off buying
    some SAN appliances.


        The one time I had to use it I resorted to simply mirroring
        via mdraid on the client side over two targets sharing the same
        DAS, and this worked fine during testing but never went to
        production in the end.

        Jan

        >
        >>
        >> Would you recommend LIO or STGT (with rbd bs-type) target
        >> for ESXi clients?
        >
        > I can't say, because I have not used stgt with rbd bs-type
        > support enough.

    For starters, STGT doesn't implement VAAI properly and you will
    need to disable VAAI in ESXi.

    LIO does seem to implement VAAI properly, but performance is not
    nearly as good as STGT even with VAAI's benefits. The assumption
    for the cause is that LIO currently uses kernel rbd mapping and
    kernel rbd performance is not as good as librbd.

    I recently did a simple test of creating an 80GB eager zeroed disk
    with STGT (VAAI disabled, no rbd client cache) and LIO (VAAI
    enabled) and found that STGT was actually slightly faster.

    I think we're all holding our breath waiting for LIO librbd
    support via TCMU, which seems to be right around the corner. That
    solution will combine the performance benefits of librbd with the
    more feature-full LIO iSCSI interface. The lrbd configuration tool
    for LIO from SUSE is pretty cool and it makes configuring LIO
    easier than STGT.


Hi Jake,

The problem we're facing with LIO is that ESXi hosts regularly disconnect from vCenter. This is a result of the iSCSI datastore becoming unreachable. It happens randomly. The last time, there was almost no VM activity at all (only 6 VMs in the lab), and it occurred when ESXi requested a write to the '.iormstats.sf' file, which I suppose is related to Storage I/O Control, but I'm not sure of that.

Setting VMFS3.UseATSForHBOnVMFS5 to 0 didn't help. Restarting the LIO target almost instantly solves it.

Has any of you ever encountered this issue with the LIO target?

Yes, this is a known problem that will hopefully be resolved soon. When there is a delay servicing IO, ESXi asks the target to cancel the IO. LIO tries to do this but, from what I understand, RBD doesn’t have an API that allows LIO to reach into the Ceph cluster and cancel the in-flight IO. LIO responds that it can’t do this, and then ESXi asks again. And so LIO and ESXi loop forever.



Hi Nick,

Thanks for this explanation.

Are you aware of any workaround or ESXi initiator option to tweak (like an I/O timeout value) to avoid that?

Or does this make the LIO target unusable with ESXi as of now?

Is STGT also affected, or does it respond better with the rbd (librbd) backstore?

Frederic.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
