+ Rakesh from Veritas
On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
>>
>>
>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>>
>> On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>> > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>> > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
>> > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>> > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>> > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonz...@redhat.com> wrote:
>> > > > >On 23/11/2016 23:09, ashish mittal wrote:
>> > > > >> On the topic of protocol security -
>> > > > >>
>> > > > >> Would it be enough for the first patch to implement only
>> > > > >> authentication and not encryption?
>> > > > >
>> > > > >Yes, of course. However, as we introduce more and more QEMU-specific
>> > > > >characteristics to a protocol that is already QEMU-specific (it doesn't
>> > > > >do failover, etc.), I am still not sure of the actual benefit of using
>> > > > >libqnio versus having an NBD server or FUSE driver.
>> > > > >
>> > > > >You have already mentioned performance, but the design has changed so
>> > > > >much that I think one of the two things has to change: either failover
>> > > > >moves back to QEMU and there is no (closed source) translator running
>> > > > >on the node, or the translator needs to speak a well-known and
>> > > > >already-supported protocol.
>> > > >
>> > > > IMO the design has not changed; the implementation has changed
>> > > > significantly. I would propose that we keep the resiliency/failover
>> > > > code out of the QEMU driver and implement it entirely in libqnio, as
>> > > > planned, in a subsequent revision. The VxHS server does not need to
>> > > > understand/handle failover at all.
>> > > >
>> > > > Today libqnio gives us significantly better performance than any
>> > > > NBD/FUSE implementation. We know because we have prototyped with both.
>> > > > Significant improvements to libqnio are also in the pipeline which will
>> > > > use cross memory attach calls to further boost performance. Of course a
>> > > > big reason for the performance is also the HyperScale storage backend,
>> > > > but we believe this method of IO tapping/redirecting can be leveraged
>> > > > by other solutions as well.
>> > >
>> > > By "cross memory attach" do you mean
>> > > process_vm_readv(2)/process_vm_writev(2)?
>> > >
>> > > Ketan> Yes.
>> > >
>> > > That puts us back to square one in terms of security. You have
>> > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>> > > another process on the same machine. That process is therefore also
>> > > untrusted and may only process data for one guest so that guests stay
>> > > isolated from each other.
>> > >
>> > > Ketan> Understood, but this will be no worse than the current
>> > > network-based communication between qnio and the vxhs server. And
>> > > although we have questions around QEMU trust/vulnerability issues, we
>> > > are looking to implement a basic authentication scheme between libqnio
>> > > and the vxhs server.
>> >
>> > This is incorrect.
>> >
>> > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
>> > It means process A reads/writes directly from/to process B's memory. Both
>> > processes must have the same uid/gid. There is no trust boundary
>> > between them.
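
For anyone following along who hasn't used the mechanism, here is a minimal
sketch of what a cross-memory-attach read looks like (this is not libqnio
code; the pid and remote address are placeholders passed on the command line):

  /* Minimal sketch of a cross-memory-attach read: copy one page-sized
   * buffer out of another process's address space. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  int main(int argc, char **argv)
  {
      if (argc < 3) {
          fprintf(stderr, "usage: %s <pid> <remote-addr>\n", argv[0]);
          return 1;
      }
      pid_t pid = (pid_t)atoi(argv[1]);
      void *remote_addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 0);
      static char buf[4096];

      struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
      struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

      /* Fails with EPERM unless the caller passes the same ptrace-style
       * access check a debugger would (matching uid or CAP_SYS_PTRACE). */
      ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
      if (n < 0) {
          perror("process_vm_readv");
          return 1;
      }
      printf("read %zd bytes from pid %d\n", n, (int)pid);
      return 0;
  }
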
>> >
>> > Ketan> Not if the vxhs server is running as root and initiating the cross
>> > memory attach. This is also why we are proposing a basic authentication
>> > mechanism between QEMU and vxhs. But anyway, the cross memory attach is
>> > for a near-future implementation.
>> >
>> > Network communication does not require both processes to have the same
>> > uid/gid. If you want multiple QEMU processes talking to a single server
>> > there must be a trust boundary between client and server. The server
>> > can validate the input from the client and reject undesired operations.
>> >
>> > Ketan> This is what we are trying to propose. With the addition of
>> > authentication between QEMU and the vxhs server, we should be able to
>> > achieve this. The question is, would that be acceptable?
>> >
>> > Hope this makes sense now.
>> >
>> > Two architectures that implement the QEMU trust model correctly are:
>> >
>> > 1. Cross memory attach: each QEMU process has a dedicated vxhs server
>> >    process to prevent guests from attacking each other. This is where I
>> >    said you might as well put the code inside QEMU since there is no
>> >    isolation anyway. From what you've said it sounds like the vxhs
>> >    server needs a host-wide view and is responsible for all guests
>> >    running on the host, so I guess we have to rule out this
>> >    architecture.
>> >
>> > 2. Network communication: one vxhs server process and multiple guests.
>> >    Here you might as well use NBD or iSCSI because it already exists and
>> >    the vxhs driver doesn't add any unique functionality over existing
>> >    protocols.
>> >
>> > Ketan> NBD does not give us the performance we are trying to achieve.
>> > Besides, NBD does not have any authentication support.
>>
>> NBD over TCP supports TLS with X.509 certificate authentication. I
>> think Daniel Berrange mentioned that.
>>
>> Ketan> I saw the patch to NBD that was merged in 2015. Before that, NBD did
>> not have any auth, as Daniel Berrange mentioned.
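
For concreteness, this is roughly what the TLS-enabled NBD path looks like
with the QEMU 2.6+ tooling. Treat the option spellings as illustrative (the
image path, certificate directory and host name below are placeholders; see
the qemu-nbd(8) and qemu(1) documentation for the authoritative syntax):

  # server: export an image over NBD/TCP with x509 TLS credentials
  qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/etc/pki/qemu \
           --tls-creds=tls0 --format=raw /var/lib/vxhs/guest1.img

  # client: QEMU connects with matching client-side credentials
  qemu-system-x86_64 ... \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/etc/pki/qemu \
      -drive driver=nbd,host=nbd-server.example.com,port=10809,tls-creds=tls0,if=virtio
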
>>
>> NBD over AF_UNIX does not need authentication because it relies on file
>> permissions for access control. Each guest should have its own UNIX
>> domain socket that it connects to. That socket can only see exports
>> that have been assigned to the guest.
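
As a sketch of that per-guest AF_UNIX setup (the paths, image names and
ownership scheme below are made up for illustration), each guest gets its own
socket whose file permissions limit who can connect:

  # one qemu-nbd instance and one private socket per guest
  qemu-nbd --socket=/run/vxhs/guest1.sock --format=raw /var/lib/vxhs/guest1.img &
  chown qemu-guest1:qemu-guest1 /run/vxhs/guest1.sock
  chmod 0600 /run/vxhs/guest1.sock

  # the guest's QEMU connects over the UNIX socket instead of TCP
  qemu-system-x86_64 ... \
      -drive file=nbd:unix:/run/vxhs/guest1.sock,format=raw,if=virtio
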
>>
>> > There is a hybrid 2.a approach which uses both 1 & 2, but I’d keep that
>> > for a later discussion.
>>
>> Please discuss it now so everyone gets on the same page. I think there
>> is a big gap and we need to communicate so that progress can be made.
>>
>> Ketan> The approach was to use cross memory attach for the IO path and a
>> simplified network IO lib for resiliency/failover. We did not want to derail
>> the current discussion, hence the suggestion to take it up later.
>
> Why does the client have to know about failover if it's connected to a
> server process on the same host? I thought the server process manages
> networking issues (like the actual protocol to speak to other VxHS nodes
> and for failover).
>
>> > > There's an easier way to get even better performance: get rid of
>> > > libqnio and the external process. Move the code from the external
>> > > process into QEMU to eliminate the
>> > > process_vm_readv(2)/process_vm_writev(2) and context switching.
>> > >
>> > > Can you remind me why there needs to be an external process?
>> > >
>> > > Ketan> Apart from virtualizing the available direct-attached storage on
>> > > the compute node, the vxhs storage backend (the external process)
>> > > provides features such as storage QoS, resiliency, efficient use of
>> > > direct-attached storage, automatic storage recovery points (snapshots),
>> > > etc. Implementing this in QEMU is not practical and is not the purpose
>> > > of proposing this driver.
>> >
>> > This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
>> > etc) already do. It brings to mind a third architecture:
>> >
>> > 3. A Linux driver or file system. Then QEMU opens a raw block device.
>> >    This is what the Ceph rbd block driver in Linux does. This
>> >    architecture has a kernel-userspace boundary so vxhs does not have to
>> >    trust QEMU.
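
Using the Ceph example above, that architecture looks like this from the
host's point of view (the pool/image and device names are placeholders; a
hypothetical vxhs kernel driver would slot in the same way):

  # the kernel driver exposes the image as an ordinary block device ...
  rbd map mypool/guest1-disk          # e.g. creates /dev/rbd0

  # ... and QEMU just opens it raw, with no storage-specific code in QEMU
  qemu-system-x86_64 ... -drive file=/dev/rbd0,format=raw,if=virtio
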
>> >
>> > I suggest Architecture #2. You'll be able to deploy on existing systems
>> > because QEMU already supports NBD or iSCSI. Use the time you gain from
>> > switching to this architecture on benchmarking and optimizing NBD or
>> > iSCSI so performance is closer to your goal.
>> >
>> > Ketan> We have made a choice to go with the QEMU driver approach after
>> > serious evaluation of most, if not all, standard IO tapping mechanisms,
>> > including NFS, NBD and FUSE. None of these has been able to deliver the
>> > performance that we have set ourselves to achieve. Hence the effort to
>> > propose this new IO tap, which we believe will provide an alternative to
>> > the existing mechanisms and hopefully benefit the community.
>>
>> I thought the VxHS block driver was another network block driver like
>> GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
>> with the goal of better performance.
>>
>> Ketan> The VxHS block driver is a new local IO tap with the goal of better
>> performance, specifically when used with the VxHS server. This, coupled with
>> shared-memory IPC (like cross memory attach), could be a much better IO tap
>> option for QEMU users. It would also avoid the context switches between
>> QEMU, the network stack and the service that happen today with NBD.
>>
>>
>> Please share fio(1) or other standard benchmark configuration files and
>> performance results.
>>
>> Ketan> We have fio results with the VxHS storage backend which I am not sure
>> I can share in a public forum.
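
To make the request concrete, the kind of fio job file that usually
accompanies such numbers looks like the following (the device path, block
size, queue depth and runtime here are placeholders, not anything from the
VxHS runs):

  [global]
  ioengine=libaio
  direct=1
  time_based
  runtime=60
  group_reporting

  [randread-4k]
  filename=/dev/vdb
  rw=randread
  bs=4k
  iodepth=32
  numjobs=4

Publishing the job file alongside the results would let others reproduce the
comparison against NBD on the same hardware.
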
>>
>> NBD and libqnio wire protocols have comparable performance
>> characteristics. There is no magic that should give either one a
>> fundamental edge over the other. Am I missing something?
>>
>> Ketan> I have not seen the NBD code, but a few things which we considered
>> and which are part of libqnio (though not exclusively) are low protocol
>> overhead, the threading model, queueing, latencies, memory pools, zero data
>> copies in user-land, scatter-gather reads/writes, etc. Again, these are not
>> exclusive to libqnio but could give one protocol the edge over the other.
>> Part of the “magic” is also in the VxHS storage backend, which is able to
>> ingest the IOs with lower latencies.
>>
>> The main performance difference is probably that libqnio opens 8
>> simultaneous connections but that's not unique to the wire protocol.
>> What happens when you run 8 simultaneous NBD TCP connections?
>>
>> Ketan> Possibly. We have not benchmarked this.
>
> There must be benchmark data if you want to add a new feature or modify
> existing code for performance reasons. This rule is followed in QEMU so
> that performance changes are justified.
>
> I'm afraid that when you look into the performance you'll find that any
> performance difference between NBD and this VxHS patch series is due to
> implementation differences that can be ported across to QEMU NBD, rather
> than wire protocol differences.
>
> If that's the case then it would save a lot of time to use NBD over
> AF_UNIX for now. You could focus efforts on achieving the final
> architecture you've explained with cross memory attach.
>
> Please take a look at vhost-user-scsi, which folks from Nutanix are
> currently working on. See "[PATCH v2 0/3] Introduce vhost-user-scsi and
> sample application" on qemu-devel. It is a true zero-copy local I/O tap
> because it shares guest RAM. This is more efficient than cross memory
> attach's single memory copy. It does not require running the server as
> root. This is the #1 thing you should evaluate for your final
> architecture.
>
> vhost-user-scsi works on the virtio-scsi emulation level. That means
> the server must implement the virtio-scsi vring and device emulation.
> It is not a block driver. By hooking in at this level you can achieve
> the best performance but you lose all QEMU block layer functionality and
> need to implement your own SCSI target. You also need to consider live
> migration.
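
For reference, once the proposed vhost-user-scsi device is merged, the QEMU
side of such a setup would look roughly like the sketch below (socket path,
ids and memory sizing are placeholders; the SCSI target process on the other
end of the socket is what the Nutanix series provides). Note that vhost-user
needs guest RAM in a shareable memory backend so the external process can map
it:

  qemu-system-x86_64 ... \
      -m 4G \
      -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on \
      -numa node,memdev=mem0 \
      -chardev socket,id=vus0,path=/var/run/vhost-user-scsi.sock \
      -device vhost-user-scsi-pci,chardev=vus0
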
>
> Stefan