Another word for "The only "state" we use in kernel for dumps is the buffer 
itself, with its offset. It is not a "dump" offset, but a buffer offset.":
I remember the linux ovs kernel code, that it used some kind of "dump offset" 
(or, I don't know how to call it, that "index-based" dumping).
I did not take that approach when implementing the windows netlink dump. And it 
worked correctly with the same userspace interface that it is on linux.

Basically, this means that you don't need to provide an "offset" from userspace 
for a DeviceIoControl, unless you want the kernel to restrain himself from 
providing the whole dump reply at once.

Sam
________________________________________
From: Samuel Ghinet
Sent: Friday, August 08, 2014 9:11 PM
To: Eitan Eliahu; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Eitan,

[QUOTE]).For example the offset for a port dump would be the last port which 
was returned by the previous nl_dump_next().[/QUOTE]
Our kernel implementation of the netlink component does not maintain such state 
for dumps.
The only "state" we use in kernel for dumps is the buffer itself, with its 
offset. It is not a "dump" offset, but a buffer offset.

Also, the offsets for buffers are used for normal (unicast) transactions as 
well (flow new, flow set, etc.). So basically, this means that if you want to 
read the reply of a "Flow/New" and provide a small buffer, you can read 
incrementally. However, the linux userspace, I think, simply provides a buffer 
it considers large enough to read the whole reply for this particular case.

Sam
________________________________________
From: Eitan Eliahu [elia...@vmware.com]
Sent: Friday, August 08, 2014 8:34 PM
To: Samuel Ghinet; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Sam, we want to avoid any state maintenance in the driver.  The "offset" here 
is not a buffer offset rather an "offset" related to the specific dump command 
(DP. Port or Flow). For example the offset for a port dump would be the last 
port which was returned by the previous nl_dump_next().
The IOCTRT command sent from dump_next should be self contained (which means 
specifies the "offset: and the output buffer length).
Eitan

-----Original Message-----
From: Samuel Ghinet [mailto:sghi...@cloudbasesolutions.com]
Sent: Friday, August 08, 2014 10:25 AM
To: Eitan Eliahu; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Hello Eitan,

[QUOTE]> You cannot do this dump() operation in a single DeviceIoControl 
request.
Certainly, not. I meant that each nl_dump_next use DeviceIOControl to retrieve 
the next chunk of data. We need to pass down to the driver an offset which 
indicates where this current chunk of data starts.

Did I answer your question?
[/QUOTE]
Yes, it answers the question.

But still, your aim:
"The Driver should not have to maintain a state or resources for transaction or 
dumps"
still cannot apply for dumps: you have the buffer, which is a state, that must 
be maintained across IOCTL requests.
And I don't see how keeping the offset in userspace will help you for this case.
If you look in our project, in WinlDevice.c and BufferControl.c you will see 
that there is no great difficulty in storing the buffer offset in kernel.
Is there any good reason why the offset should be stored in the userspace, and 
not in kernel, for dumps?

Sam
________________________________________
From: Eitan Eliahu [elia...@vmware.com]
Sent: Friday, August 08, 2014 8:13 PM
To: Samuel Ghinet; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Sam,
Let's take the whole event mechanism (as currently implemented) out of this 
context as it is not Netlink specific, I will try to find your original email 
about Events.

> You cannot do this dump() operation in a single DeviceIoControl request.
Certainly, not. I meant that each nl_dump_next use DeviceIOControl to retrieve 
the next chunk of data. We need to pass down to the driver an offset which 
indicates where this current chunk of data starts.

Did I answer your question?
Thanks,
Eitan


-----Original Message-----
From: Samuel Ghinet [mailto:sghi...@cloudbasesolutions.com]
Sent: Friday, August 08, 2014 10:00 AM
To: Eitan Eliahu; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Hello Eitan,

This discussion is getting more and more intrincate, with quotes of quotes of 
quotes :)

[QUOTE]Currently there are two IOCTLs implanted in the driver. One which just 
read an event from the event queue and is called synchronously. The other one 
which is used just for the purpose of signaling and it is always pended in the 
driver. Once the pended IRP is completed, the event in the overlapped structure 
is signaled and the user mode will read the event queue synchronously (through 
the call of nl_sock_recv().
nl_sock_recv() always returns immediately (it has a wait parameter but it is 
set to false).
[/QUOTE]
[QUOTE]>How exactly are events queued by the kernel associated with the 
userspace? I mean, how do you register a "nic connected" event so that when an 
event happens, >you know you need to update userspace data for a nic, not do 
something else. Would there be IDs stored in the OvsEvent structs that would 
specify what kind of >events they are? Would we also need context data 
associated with these events?
This should be no different than the current implementation.
[/QUOTE]
Sorry, I have the issue that I don't actually understand very well your current 
"packet to userspace" mechanism with OvsEvent.
I had sent an email one or two weeks ago asking for some help on this matter - 
had received no help / no answer.

[QUOTE]Yes, I looked on this issue yesterday with help of Ben. It seems that 
the Dump function allocate an initial 1024 byte buffer. In turn 
nl_sock_receive__() is called with another large buffer allocated on the stuck. 
If the data returned from the driver exceeds the initial allocated buffer, the 
one on the stuck will be copied to a new increased sized buffer in the ofbuf.
[/QUOTE]
I'm not sure I understand what you mean, neither that you understood my point.
I know that in the netlink protocol implemented in userspace, you hold an 
ofbuf, which, when writing / creating the netlink message, it automatically 
grows (reallocates) when it sees that it needs more space.
I am not talking about writing a message, I am talking about reading.
Here, the "sequential reading" happens via dump_start, dump_next and 
dump_finished(). Where the dump_next() used to recv() more from buffer.
Look at how "nl_dump_start" is implemented in netlink-socket.c: it's a mere 
send. While "nl_dump_next" calls nl_dump_refill to recv() more from the kernel 
buffer.
You cannot do this dump() operation in a single DeviceIoControl request.
the linux netlink socket recv() and the windows file function ReadFile() both 
tell you only how much they succeed to read: you say "read max X" and they tell 
you back "I read Y <= X".
We have a situation where we say "read max X" but you have a buffer in kernel 
of possible size Z >X. it will tell you back "Ok, read that max X" but the dump 
will not be finished reading! You'll have to continue reading the file / 
receiving from socket until it gives you an error "no more data!".
This is something that CANNOT be implemented as DeviceIoControl.

[QUOTE]We should probably modify the implementation of nl_dump_recv() , so it 
will issue DeviceIOControl() call. In input buffer we need to pass down 
information about the "offset" of the current transaction so the driver will 
know from which position it needs to start the dump.
[/QUOTE]
I see absolutely no reason why we should implement the "read offset" of the 
buffer in userspace. This should better be opaque to the userspace. Like a 
stream: you tell it "read!" and it reads from where it left of. This is how you 
normally read from a file, and this is how you normally receive from a socket.
And this "offset" implemented in userspace will still not save you from keeping 
state, because an additional Read from the same buffer still requires a "state" 
be preserved in kernel, among IOCTL calls: the buffer itself.

[QUOTE]
My concern would be that we should not break the Netlink protocol as it was 
selected in order to maintain user/kernel mode interoperability.
[/QUOTE]
Only if we cannot break the netlink protocol: if we figure a way where we do 
not use netlink, but only in cases where the linux ovs doesn't use netlink 
(linux ovs uses some ioctl-s to set options on sockets, or gets some configs on 
netdevs, etc.)

Sam
________________________________________
From: Eitan Eliahu [elia...@vmware.com]
Sent: Friday, August 08, 2014 7:24 PM
To: Samuel Ghinet; Alin Serdean; dev@openvswitch.org; Rajiv Krishnamurthy; Ben 
Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda 
Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS 
Windows driver (Switch extension)

Hi Sam,
Please find inline.

>Do you mean we will no longer use nl_sock_transact_multiple in userspace for 
>these DPIF transactions?
No, we will use nl_sock_transact_multiple, but nl_sock_transact_multiple will 
be implemented through the call of DeviceIOControl() system call rather than se 
series of WriteFile()/ReadFile() pairs.

>[QUOTE]>You mean, whenever, say, a Flow dump request is issued, in one
>reply to give back all flows?> Not necessarily. I meant that the driver does 
>not have to maintain the state of the dump command.
>Each dump command sent down to the driver would be self-contained. [/QUOTE] We 
>currently have this in our implementation. The only thing 'left' would be the 
>>fact that we provide all the output buffer for dump at once. The userspace 
>can read sequentially from it. Unless there is a reason to write sequentially 
>from the >kernel to the userspace, and wait for the userspace to read, I think 
>that how we have this one is ok.
Sound good, We can leverage your implementation.

>[QUOTE]Yes, these are OVS events that are placed in a custom queue.
>There is a single Operating System event associated with the global socket 
>which collects all OVS events.
>It will be triggered through a completion of a pending I/O request in the 
>driver.[/QUOTE] I used to be a bit confused of your implementation in OvsEvent 
>and >OvsUser. Perhaps this discussion would clarify a bit more things. :) Ok, 
>so we'll hold OVERLAPPED structs in the kernel, as events. What kind of IRP 
>requests would be >returned as "pending" in the kernel? Requests coming as 
>"nl_sock_recv()" on the multicast groups?
>Will there be multiple multicast groups used? or all multicast operations 
>would queue events on the same event queue, where all the events are read from 
>the same >part of code in userspace?
Currently there are two IOCTLs implanted in the driver. One which just read an 
event from the event queue and is called synchronously. The other one which is 
used just for the purpose of signaling and it is always pended in the driver. 
Once the pended IRP is completed, the event in the overlapped structure is 
signaled and the user mode will read the event queue synchronously (through the 
call of nl_sock_recv().
nl_sock_recv() always returns immediately (it has a wait parameter but it is 
set to false).

>How exactly are events queued by the kernel associated with the userspace? I 
>mean, how do you register a "nic connected" event so that when an event 
>happens, >you know you need to update userspace data for a nic, not do 
>something else. Would there be IDs stored in the OvsEvent structs that would 
>specify what kind of >events they are? Would we also need context data 
>associated with these events?
This should be no different than the current implementation. These events are 
generated when NDIS calls the switch extension callback on switch port 
creation/deletion/link up/down etc..

>[QUOTE]>However, I think we need to take into account the situation where the 
>userspace might be providing a smaller buffer than it is the total to read. 
>Also, I >think the "dump" mechanism requires it.
>I (want) to assume that each transaction is self-contained which means that 
>the driver should not maintain a state of the transaction. Since, we will be 
>using an IOCTL for that transaction the user mode buffer length will be 
>specified in the command itself.
>All Write/Read dump pairs are replaced with a single IOCTL call.[/QUOTE] That 
>still did not answer my question :) You mean to use a very large read buffer, 
>so that >you would be able to read all in one single operation? I am more 
>concerned here about flow dumps, because you may not know whether you need an 
>1024 bytes >buffer or an 10240 byes buffer, or an 102400 bytes buffer, or etc.
Yes, I looked on this issue yesterday with help of Ben. It seems that the Dump 
function allocate an initial 1024 byte buffer. In turn nl_sock_receive__() is 
called with another large buffer allocated on the stuck. If the data returned 
from the driver exceeds the initial allocated buffer, the one on the stuck will 
be copied to a new increased sized buffer in the ofbuf.


>So I do not see how a DeviceIoControl operation could do both the 'write' and 
>the 'read' part for the dump.
We should probably modify the implementation of nl_dump_recv() , so it will 
issue DeviceIOControl() call. In input buffer we need to pass down information 
about the "offset" of the current transaction so the driver will know from 
which position it needs to start the dump.

>If you pass to the DeviceIoControl a buffer length = 8000, and the flow dump 
>reply buffer is 32000 bytes, you need to do additional reads AND maintain 
>state in the >kernel (e.g. offset in the kernel read buffer).
I would like to hold the "offset" in user mode (probably add a field in nl_dump 
structure for WIN32)

> [QUOTE]o) I believe we shouldn't use the netlink overhead (nlmsghdr, 
> genlmsghdr, attributes) when not needed (say, when registering a KEVENT 
> notification) , >and, if w>e choose not to use netlink protocol always, we 
> may need a way to differentiate between netlink and non-netlink requests.
>Possible, as phase for optimization[/QUOTE] Not necessarily: if we can make a 
>clear separation in code between netlink and non-netlink km-um, not using 
>netlink >where we don't need to might save us some development & 
>maintainability effort - both in kernel and in userspace. Because otherwise 
>we'd need to turn non->netlink messages of (windows) userspace code into 
>netlink messages.
My concern would be that we should not break the Netlink protocol as it was 
selected in order to maintain user/kernel mode interoperability.

Eitan

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to