+ Rakesh from Veritas
On Mon, Nov 28, 2016 at 6:17 AM, Stefan Hajnoczi <stefa...@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 10:23:41AM +0000, Ketan Nilangekar wrote:
>>
>>
>> On 11/25/16, 5:05 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>>
>> On Fri, Nov 25, 2016 at 08:27:26AM +0000, Ketan Nilangekar wrote:
>> > On 11/24/16, 9:38 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>> > On Thu, Nov 24, 2016 at 11:31:14AM +0000, Ketan Nilangekar wrote:
>> > > On 11/24/16, 4:41 PM, "Stefan Hajnoczi" <stefa...@gmail.com> wrote:
>> > > On Thu, Nov 24, 2016 at 05:44:37AM +0000, Ketan Nilangekar wrote:
>> > > > On 11/24/16, 4:07 AM, "Paolo Bonzini" <pbonz...@redhat.com> wrote:
>> > > > >On 23/11/2016 23:09, ashish mittal wrote:
>> > > > >> On the topic of protocol security -
>> > > > >>
>> > > > >> Would it be enough for the first patch to implement only
>> > > > >> authentication and not encryption?
>> > > > >
>> > > > >Yes, of course. However, as we introduce more and more QEMU-specific
>> > > > >characteristics to a protocol that is already QEMU-specific (it doesn't
>> > > > >do failover, etc.), I am still not sure of the actual benefit of using
>> > > > >libqnio versus having an NBD server or FUSE driver.
>> > > > >
>> > > > >You have already mentioned performance, but the design has changed so
>> > > > >much that I think one of the two things has to change: either failover
>> > > > >moves back to QEMU and there is no (closed source) translator running
>> > > > >on the node, or the translator needs to speak a well-known and
>> > > > >already-supported protocol.
>> > > >
>> > > > IMO the design has not changed; the implementation has changed
>> > > > significantly. I would propose that we keep the resiliency/failover
>> > > > code out of the QEMU driver and implement it entirely in libqnio, as
>> > > > planned, in a subsequent revision. The VxHS server does not need to
>> > > > understand/handle failover at all.
>> > > >
>> > > > Today libqnio gives us significantly better performance than any
>> > > > NBD/FUSE implementation. We know because we have prototyped with both.
>> > > > Significant improvements to libqnio are also in the pipeline which will
>> > > > use cross memory attach calls to further boost performance. Of course a
>> > > > big reason for the performance is also the HyperScale storage backend,
>> > > > but we believe this method of IO tapping/redirecting can be leveraged
>> > > > by other solutions as well.
>> > >
>> > > By "cross memory attach" do you mean
>> > > process_vm_readv(2)/process_vm_writev(2)?
>> > >
>> > > Ketan> Yes.
>> > >
>> > > That puts us back to square one in terms of security. You have
>> > > (untrusted) QEMU + (untrusted) libqnio directly accessing the memory of
>> > > another process on the same machine. That process is therefore also
>> > > untrusted and may only process data for one guest so that guests stay
>> > > isolated from each other.
>> > >
>> > > Ketan> Understood, but this will be no worse than the current
>> > > network-based communication between qnio and the vxhs server. And
>> > > although we have questions around QEMU trust/vulnerability issues, we
>> > > are looking to implement a basic authentication scheme between libqnio
>> > > and the vxhs server.
>> >
>> > This is incorrect.
>> >
>> > Cross memory attach is equivalent to ptrace(2) (i.e. debugger) access.
>> > It means process A reads/writes directly from/to process B's memory. Both
>> > processes must have the same uid/gid. There is no trust boundary
>> > between them.
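
For anyone following along who hasn't used the mechanism, here is a minimal
sketch of what a cross-memory-attach read looks like (this is not libqnio
code; the pid and remote address are placeholders passed on the command line):

  /* Minimal sketch of a cross-memory-attach read: copy one page-sized
   * buffer out of another process's address space. */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  int main(int argc, char **argv)
  {
      if (argc < 3) {
          fprintf(stderr, "usage: %s <pid> <remote-addr>\n", argv[0]);
          return 1;
      }
      pid_t pid = (pid_t)atoi(argv[1]);
      void *remote_addr = (void *)(uintptr_t)strtoull(argv[2], NULL, 0);
      static char buf[4096];

      struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
      struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

      /* Fails with EPERM unless the caller passes the same ptrace-style
       * access check a debugger would (matching uid or CAP_SYS_PTRACE). */
      ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
      if (n < 0) {
          perror("process_vm_readv");
          return 1;
      }
      printf("read %zd bytes from pid %d\n", n, (int)pid);
      return 0;
  }
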
>> >
>> > Ketan> Not if the vxhs server is running as root and initiating the cross
>> > memory attach. This is also why we are proposing a basic authentication
>> > mechanism between QEMU and vxhs. But anyway, the cross memory attach is
>> > for a near-future implementation.
>> >
>> > Network communication does not require both processes to have the same
>> > uid/gid. If you want multiple QEMU processes talking to a single server
>> > there must be a trust boundary between client and server. The server
>> > can validate the input from the client and reject undesired operations.
>> >
>> > Ketan> This is what we are trying to propose. With the addition of
>> > authentication between QEMU and the vxhs server, we should be able to
>> > achieve this. The question is, would that be acceptable?
>> >
>> > Hope this makes sense now.
>> >
>> > Two architectures that implement the QEMU trust model correctly are:
>> >
>> > 1. Cross memory attach: each QEMU process has a dedicated vxhs server
>> >    process to prevent guests from attacking each other. This is where I
>> >    said you might as well put the code inside QEMU since there is no
>> >    isolation anyway. From what you've said it sounds like the vxhs
>> >    server needs a host-wide view and is responsible for all guests
>> >    running on the host, so I guess we have to rule out this
>> >    architecture.
>> >
>> > 2. Network communication: one vxhs server process and multiple guests.
>> >    Here you might as well use NBD or iSCSI because it already exists and
>> >    the vxhs driver doesn't add any unique functionality over existing
>> >    protocols.
>> >
>> > Ketan> NBD does not give us the performance we are trying to achieve.
>> > Besides, NBD does not have any authentication support.
>>
>> NBD over TCP supports TLS with X.509 certificate authentication. I
>> think Daniel Berrange mentioned that.
>>
>> Ketan> I saw the patch to NBD that was merged in 2015. Before that, NBD did
>> not have any auth, as Daniel Berrange mentioned.
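
For concreteness, this is roughly what the TLS-enabled NBD path looks like
with the QEMU 2.6+ tooling. Treat the option spellings as illustrative (the
image path, certificate directory and host name below are placeholders; see
the qemu-nbd(8) and qemu(1) documentation for the authoritative syntax):

  # server: export an image over NBD/TCP with x509 TLS credentials
  qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/etc/pki/qemu \
           --tls-creds=tls0 --format=raw /var/lib/vxhs/guest1.img

  # client: QEMU connects with matching client-side credentials
  qemu-system-x86_64 ... \
      -object tls-creds-x509,id=tls0,endpoint=client,dir=/etc/pki/qemu \
      -drive driver=nbd,host=nbd-server.example.com,port=10809,tls-creds=tls0,if=virtio
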
>>
>> NBD over AF_UNIX does not need authentication because it relies on file
>> permissions for access control. Each guest should have its own UNIX
>> domain socket that it connects to. That socket can only see exports
>> that have been assigned to the guest.
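
As a sketch of that per-guest AF_UNIX setup (the paths, image names and
ownership scheme below are made up for illustration), each guest gets its own
socket whose file permissions limit who can connect:

  # one qemu-nbd instance and one private socket per guest
  qemu-nbd --socket=/run/vxhs/guest1.sock --format=raw /var/lib/vxhs/guest1.img &
  chown qemu-guest1:qemu-guest1 /run/vxhs/guest1.sock
  chmod 0600 /run/vxhs/guest1.sock

  # the guest's QEMU connects over the UNIX socket instead of TCP
  qemu-system-x86_64 ... \
      -drive file=nbd:unix:/run/vxhs/guest1.sock,format=raw,if=virtio
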
>>
>> > There is a hybrid 2.a approach which uses both 1 & 2, but I’d keep that
>> > for a later discussion.
>>
>> Please discuss it now so everyone gets on the same page. I think there
>> is a big gap and we need to communicate so that progress can be made.
>>
>> Ketan> The approach was to use cross memory attach for the IO path and a
>> simplified network IO lib for resiliency/failover. We did not want to derail
>> the current discussion, hence the suggestion to take it up later.
>
> Why does the client have to know about failover if it's connected to a
> server process on the same host? I thought the server process manages
> networking issues (like the actual protocol to speak to other VxHS nodes
> and for failover).
>
>> > > There's an easier way to get even better performance: get rid of
>> > > libqnio and the external process. Move the code from the external
>> > > process into QEMU to eliminate the
>> > > process_vm_readv(2)/process_vm_writev(2) and context switching.
>> > >
>> > > Can you remind me why there needs to be an external process?
>> > >
>> > > Ketan> Apart from virtualizing the available direct-attached storage on
>> > > the compute node, the vxhs storage backend (the external process)
>> > > provides features such as storage QoS, resiliency, efficient use of
>> > > direct-attached storage, automatic storage recovery points (snapshots),
>> > > etc. Implementing this in QEMU is not practical and is not the purpose
>> > > of proposing this driver.
>> >
>> > This sounds similar to what QEMU and Linux (file systems, LVM, RAID,
>> > etc) already do. It brings to mind a third architecture:
>> >
>> > 3. A Linux driver or file system. Then QEMU opens a raw block device.
>> >    This is what the Ceph rbd block driver in Linux does. This
>> >    architecture has a kernel-userspace boundary so vxhs does not have to
>> >    trust QEMU.
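
Using the Ceph example above, that architecture looks like this from the
host's point of view (the pool/image and device names are placeholders; a
hypothetical vxhs kernel driver would slot in the same way):

  # the kernel driver exposes the image as an ordinary block device ...
  rbd map mypool/guest1-disk          # e.g. creates /dev/rbd0

  # ... and QEMU just opens it raw, with no storage-specific code in QEMU
  qemu-system-x86_64 ... -drive file=/dev/rbd0,format=raw,if=virtio
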
>> >
>> > I suggest Architecture #2. You'll be able to deploy on existing systems
>> > because QEMU already supports NBD or iSCSI. Use the time you gain from
>> > switching to this architecture on benchmarking and optimizing NBD or
>> > iSCSI so performance is closer to your goal.
>> >
>> > Ketan> We have made a choice to go with the QEMU driver approach after
>> > serious evaluation of most, if not all, standard IO tapping mechanisms,
>> > including NFS, NBD and FUSE. None of these has been able to deliver the
>> > performance that we have set ourselves to achieve. Hence the effort to
>> > propose this new IO tap, which we believe will provide an alternative to
>> > the existing mechanisms and hopefully benefit the community.
>>
>> I thought the VxHS block driver was another network block driver like
>> GlusterFS or Sheepdog but you are actually proposing a new local I/O tap
>> with the goal of better performance.
>>
>> Ketan> The VxHS block driver is a new local IO tap with the goal of better
>> performance, specifically when used with the VxHS server. This, coupled with
>> shared-memory IPC (like cross memory attach), could be a much better IO tap
>> option for QEMU users. It would also avoid the context switches between
>> QEMU, the network stack and the service that happen today with NBD.
>>
>>
>> Please share fio(1) or other standard benchmark configuration files and
>> performance results.
>>
>> Ketan> We have fio results with the VxHS storage backend which I am not sure
>> I can share in a public forum.
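
To make the request concrete, the kind of fio job file that usually
accompanies such numbers looks like the following (the device path, block
size, queue depth and runtime here are placeholders, not anything from the
VxHS runs):

  [global]
  ioengine=libaio
  direct=1
  time_based
  runtime=60
  group_reporting

  [randread-4k]
  filename=/dev/vdb
  rw=randread
  bs=4k
  iodepth=32
  numjobs=4

Publishing the job file alongside the results would let others reproduce the
comparison against NBD on the same hardware.
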
>>
>> NBD and libqnio wire protocols have comparable performance
>> characteristics. There is no magic that should give either one a
>> fundamental edge over the other. Am I missing something?
>>
>> Ketan> I have not seen the NBD code, but a few things which we considered
>> and which are part of libqnio (though not exclusively) are low protocol
>> overhead, the threading model, queueing, latencies, memory pools, zero data
>> copies in user-land, scatter-gather reads/writes, etc. Again, these are not
>> exclusive to libqnio but could give one protocol the edge over the other.
>> Part of the “magic” is also in the VxHS storage backend, which is able to
>> ingest the IOs with lower latencies.
>>
>> The main performance difference is probably that libqnio opens 8
>> simultaneous connections but that's not unique to the wire protocol.
>> What happens when you run 8 simultaneous NBD TCP connections?
>>
>> Ketan> Possibly. We have not benchmarked this.
>
> There must be benchmark data if you want to add a new feature or modify
> existing code for performance reasons. This rule is followed in QEMU so
> that performance changes are justified.
>
> I'm afraid that when you look into the performance you'll find that any
> performance difference between NBD and this VxHS patch series is due to
> implementation differences that can be ported across to QEMU NBD, rather
> than wire protocol differences.
>
> If that's the case then it would save a lot of time to use NBD over
> AF_UNIX for now. You could focus efforts on achieving the final
> architecture you've explained with cross memory attach.
>
> Please take a look at vhost-user-scsi, which folks from Nutanix are
> currently working on. See "[PATCH v2 0/3] Introduce vhost-user-scsi and
> sample application" on qemu-devel. It is a true zero-copy local I/O tap
> because it shares guest RAM. This is more efficient than cross memory
> attach's single memory copy. It does not require running the server as
> root. This is the #1 thing you should evaluate for your final
> architecture.
>
> vhost-user-scsi works on the virtio-scsi emulation level. That means
> the server must implement the virtio-scsi vring and device emulation.
> It is not a block driver. By hooking in at this level you can achieve
> the best performance but you lose all QEMU block layer functionality and
> need to implement your own SCSI target. You also need to consider live
> migration.
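
For reference, once the proposed vhost-user-scsi device is merged, the QEMU
side of such a setup would look roughly like the sketch below (socket path,
ids and memory sizing are placeholders; the SCSI target process on the other
end of the socket is what the Nutanix series provides). Note that vhost-user
needs guest RAM in a shareable memory backend so the external process can map
it:

  qemu-system-x86_64 ... \
      -m 4G \
      -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on \
      -numa node,memdev=mem0 \
      -chardev socket,id=vus0,path=/var/run/vhost-user-scsi.sock \
      -device vhost-user-scsi-pci,chardev=vus0
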
>
> Stefan