Re: [PATCH net] vsock: forward all packets to the host when no H2G is registered
On 19.11.20 15:03, Stefan Hajnoczi wrote: On Thu, Nov 12, 2020 at 02:38:37PM +0100, Stefano Garzarella wrote: Before commit c0cfa2d8a788 ("vsock: add multi-transports support"), if a G2H transport was loaded (e.g. virtio transport), every packets was forwarded to the host, regardless of the destination CID. The H2G transports implemented until then (vhost-vsock, VMCI) always responded with an error, if the destination CID was not VMADDR_CID_HOST. From that commit, we are using the remote CID to decide which transport to use, so packets with remote CID > VMADDR_CID_HOST(2) are sent only through H2G transport. If no H2G is available, packets are discarded directly in the guest. Some use cases (e.g. Nitro Enclaves [1]) rely on the old behaviour to implement sibling VMs communication, so we restore the old behavior when no H2G is registered. It will be up to the host to discard packets if the destination is not the right one. As it was already implemented before adding multi-transport support. Tested with nested QEMU/KVM by me and Nitro Enclaves by Andra. [1] Documentation/virt/ne_overview.rst Cc: Jorgen Hansen Cc: Dexuan Cui Fixes: c0cfa2d8a788 ("vsock: add multi-transports support") Reported-by: Andra Paraschiv Tested-by: Andra Paraschiv Signed-off-by: Stefano Garzarella --- net/vmw_vsock/af_vsock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Acked-by: Stefan Hajnoczi Is there anything we have to do to also get this into the affected stable trees? :) Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Re: How to implement message forwarding from one CID to another in vhost driver
Howdy, On 20.05.24 14:44, Dorjoy Chowdhury wrote: Hey Stefano, Thanks for the reply. On Mon, May 20, 2024, 2:55 PM Stefano Garzarella wrote: Hi Dorjoy, On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote: Hi, Hope you are doing well. I am working on adding AWS Nitro Enclave[1] emulation support in QEMU. Alexander Graf is mentoring me on this work. A v1 patch series has already been posted to the qemu-devel mailing list[2]. AWS nitro enclaves is an Amazon EC2[3] feature that allows creating isolated execution environments, called enclaves, from Amazon EC2 instances, which are used for processing highly sensitive data. Enclaves have no persistent storage and no external networking. The enclave VMs are based on Firecracker microvm and have a vhost-vsock device for communication with the parent EC2 instance that spawned it and a Nitro Secure Module (NSM) device for cryptographic attestation. The parent instance VM always has CID 3 while the enclave VM gets a dynamic CID. The enclave VMs can communicate with the parent instance over various ports to CID 3, for example, the init process inside an enclave sends a heartbeat to port 9000 upon boot, expecting a heartbeat reply, letting the parent instance know that the enclave VM has successfully booted. The plan is to eventually make the nitro enclave emulation in QEMU standalone i.e., without needing to run another VM with CID 3 with proper vsock If you don't have to launch another VM, maybe we can avoid vhost-vsock and emulate virtio-vsock in user-space, having complete control over the behavior. So we could use this opportunity to implement virtio-vsock in QEMU [4] or use vhost-user-vsock [5] and customize it somehow. (Note: vhost-user-vsock already supports sibling communication, so maybe with a few modifications it fits your case perfectly) [4] https://gitlab.com/qemu-project/qemu/-/issues/2095 [5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock Thanks for letting me know. Right now I don't have a complete picture but I will look into them. Thank you. communication support. For this to work, one approach could be to teach the vhost driver in kernel to forward CID 3 messages to another CID N So in this case both CID 3 and N would be assigned to the same QEMU process? CID N is assigned to the enclave VM. CID 3 was supposed to be the parent VM that spawns the enclave VM (this is how it is in AWS, where an EC2 instance VM spawns the enclave VM from inside it and that parent EC2 instance always has CID 3). But in the QEMU case as we don't want a parent VM (we want to run enclave VMs standalone) we would need to forward the CID 3 messages to host CID. I don't know if it means CID 3 and CID N is assigned to the same QEMU process. Sorry. There are 2 use cases here: 1) Enclave wants to treat host as parent (default). In this scenario, the "parent instance" that shows up as CID 3 in the Enclave doesn't really exist. Instead, when the Enclave attempts to talk to CID 3, it should really land on CID 0 (hypervisor). When the hypervisor tries to connect to the Enclave on port X, it should look as if it originates from CID 3, not CID 0. 2) Multiple parent VMs. Think of an actual cloud hosting scenario. Here, we have multiple "parent instances". Each of them thinks it's CID 3. Each can spawn an Enclave that talks to CID 3 and reach the parent. For this case, I think implementing all of virtio-vsock in user space is the best path forward. But in theory, you could also swizzle CIDs to make random "real" CIDs appear as CID 3. Do you have to allocate 2 separate virtio-vsock devices, one for the parent and one for the enclave? If there is a parent VM, then I guess both parent and enclave VMs need virtio-vsock devices. (set to CID 2 for host) i.e., it patches CID from 3 to N on incoming messages and from N to 3 on responses. This will enable users of the Will these messages have the VMADDR_FLAG_TO_HOST flag set? We don't support this in vhost-vsock yet, if supporting it helps, we might, but we need to better understand how to avoid security issues, so maybe each device needs to explicitly enable the feature and specify from which CIDs it accepts packets. I don't know about the flag. So I don't know if it will be set. Sorry. From the guest's point of view, the parent (CID 3) is just another VM. Since Linux as of https://patchwork.ozlabs.org/project/netdev/patch/20201204170235.84387-4-andra...@amazon.com/#2594117 always sets VMADDR_FLAG_TO_HOST when local_CID > 0 && remote_CID > 0, I would say the message has the flag set. How would you envision the host to implement the flag? Would the host allow user space to listen on any CID and hence receive the respective target connections? And wouldn't listening on CID 0 then mean you're effectively listening to "any" othe
Re: How to implement message forwarding from one CID to another in vhost driver
Hey Stefano, On 23.05.24 10:45, Stefano Garzarella wrote: On Tue, May 21, 2024 at 08:50:22AM GMT, Alexander Graf wrote: Howdy, On 20.05.24 14:44, Dorjoy Chowdhury wrote: Hey Stefano, Thanks for the reply. On Mon, May 20, 2024, 2:55 PM Stefano Garzarella wrote: Hi Dorjoy, On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote: Hi, Hope you are doing well. I am working on adding AWS Nitro Enclave[1] emulation support in QEMU. Alexander Graf is mentoring me on this work. A v1 patch series has already been posted to the qemu-devel mailing list[2]. AWS nitro enclaves is an Amazon EC2[3] feature that allows creating isolated execution environments, called enclaves, from Amazon EC2 instances, which are used for processing highly sensitive data. Enclaves have no persistent storage and no external networking. The enclave VMs are based on Firecracker microvm and have a vhost-vsock device for communication with the parent EC2 instance that spawned it and a Nitro Secure Module (NSM) device for cryptographic attestation. The parent instance VM always has CID 3 while the enclave VM gets a dynamic CID. The enclave VMs can communicate with the parent instance over various ports to CID 3, for example, the init process inside an enclave sends a heartbeat to port 9000 upon boot, expecting a heartbeat reply, letting the parent instance know that the enclave VM has successfully booted. The plan is to eventually make the nitro enclave emulation in QEMU standalone i.e., without needing to run another VM with CID 3 with proper vsock If you don't have to launch another VM, maybe we can avoid vhost-vsock and emulate virtio-vsock in user-space, having complete control over the behavior. So we could use this opportunity to implement virtio-vsock in QEMU [4] or use vhost-user-vsock [5] and customize it somehow. (Note: vhost-user-vsock already supports sibling communication, so maybe with a few modifications it fits your case perfectly) [4] https://gitlab.com/qemu-project/qemu/-/issues/2095 [5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock Thanks for letting me know. Right now I don't have a complete picture but I will look into them. Thank you. communication support. For this to work, one approach could be to teach the vhost driver in kernel to forward CID 3 messages to another CID N So in this case both CID 3 and N would be assigned to the same QEMU process? CID N is assigned to the enclave VM. CID 3 was supposed to be the parent VM that spawns the enclave VM (this is how it is in AWS, where an EC2 instance VM spawns the enclave VM from inside it and that parent EC2 instance always has CID 3). But in the QEMU case as we don't want a parent VM (we want to run enclave VMs standalone) we would need to forward the CID 3 messages to host CID. I don't know if it means CID 3 and CID N is assigned to the same QEMU process. Sorry. There are 2 use cases here: 1) Enclave wants to treat host as parent (default). In this scenario, the "parent instance" that shows up as CID 3 in the Enclave doesn't really exist. Instead, when the Enclave attempts to talk to CID 3, it should really land on CID 0 (hypervisor). When the hypervisor tries to connect to the Enclave on port X, it should look as if it originates from CID 3, not CID 0. 2) Multiple parent VMs. Think of an actual cloud hosting scenario. Here, we have multiple "parent instances". Each of them thinks it's CID 3. Each can spawn an Enclave that talks to CID 3 and reach the parent. For this case, I think implementing all of virtio-vsock in user space is the best path forward. But in theory, you could also swizzle CIDs to make random "real" CIDs appear as CID 3. Thank you for clarifying the use cases! Also for case 1, vhost-vsock doesn't support CID 0, so in my opinion it's easier to go into user-space with vhost-user-vsock or the built-in device. Sorry, I believe I meant CID 2. Effectively for case 1, when a process on the hypervisor listens on port 1234, it should be visible as 3:1234 from the VM and when the hypervisor process connects to :1234, it should look as if that connection came from CID 3. Maybe initially with vhost-user-vsock it's easier because we already have some thing that works and supports sibling communication (for case 2). The problem with vhost-user-vsock is that you don't get to use AF_VSOCK as a host process. A typical Nitro Enclaves application is split into 2 parts: An in-Enclave component that listens/connects to vsock and a parent process that listens/connects to vsock. The experience of launching an Enclave is very similar to launching a QEMU VM: You run nitro-cli and tell it to pop up the Enclave based on an EIF file. Nitro-cli then tells you the CID that was allocated for the Enclave and you communicate to it using that. What I would ideally like to have as development e
Re: How to implement message forwarding from one CID to another in vhost driver
On 27.05.24 09:08, Alexander Graf wrote: Hey Stefano, On 23.05.24 10:45, Stefano Garzarella wrote: On Tue, May 21, 2024 at 08:50:22AM GMT, Alexander Graf wrote: Howdy, On 20.05.24 14:44, Dorjoy Chowdhury wrote: Hey Stefano, Thanks for the reply. On Mon, May 20, 2024, 2:55 PM Stefano Garzarella wrote: Hi Dorjoy, On Sat, May 18, 2024 at 04:17:38PM GMT, Dorjoy Chowdhury wrote: Hi, Hope you are doing well. I am working on adding AWS Nitro Enclave[1] emulation support in QEMU. Alexander Graf is mentoring me on this work. A v1 patch series has already been posted to the qemu-devel mailing list[2]. AWS nitro enclaves is an Amazon EC2[3] feature that allows creating isolated execution environments, called enclaves, from Amazon EC2 instances, which are used for processing highly sensitive data. Enclaves have no persistent storage and no external networking. The enclave VMs are based on Firecracker microvm and have a vhost-vsock device for communication with the parent EC2 instance that spawned it and a Nitro Secure Module (NSM) device for cryptographic attestation. The parent instance VM always has CID 3 while the enclave VM gets a dynamic CID. The enclave VMs can communicate with the parent instance over various ports to CID 3, for example, the init process inside an enclave sends a heartbeat to port 9000 upon boot, expecting a heartbeat reply, letting the parent instance know that the enclave VM has successfully booted. The plan is to eventually make the nitro enclave emulation in QEMU standalone i.e., without needing to run another VM with CID 3 with proper vsock If you don't have to launch another VM, maybe we can avoid vhost-vsock and emulate virtio-vsock in user-space, having complete control over the behavior. So we could use this opportunity to implement virtio-vsock in QEMU [4] or use vhost-user-vsock [5] and customize it somehow. (Note: vhost-user-vsock already supports sibling communication, so maybe with a few modifications it fits your case perfectly) [4] https://gitlab.com/qemu-project/qemu/-/issues/2095 [5] https://github.com/rust-vmm/vhost-device/tree/main/vhost-device-vsock Thanks for letting me know. Right now I don't have a complete picture but I will look into them. Thank you. communication support. For this to work, one approach could be to teach the vhost driver in kernel to forward CID 3 messages to another CID N So in this case both CID 3 and N would be assigned to the same QEMU process? CID N is assigned to the enclave VM. CID 3 was supposed to be the parent VM that spawns the enclave VM (this is how it is in AWS, where an EC2 instance VM spawns the enclave VM from inside it and that parent EC2 instance always has CID 3). But in the QEMU case as we don't want a parent VM (we want to run enclave VMs standalone) we would need to forward the CID 3 messages to host CID. I don't know if it means CID 3 and CID N is assigned to the same QEMU process. Sorry. There are 2 use cases here: 1) Enclave wants to treat host as parent (default). In this scenario, the "parent instance" that shows up as CID 3 in the Enclave doesn't really exist. Instead, when the Enclave attempts to talk to CID 3, it should really land on CID 0 (hypervisor). When the hypervisor tries to connect to the Enclave on port X, it should look as if it originates from CID 3, not CID 0. 2) Multiple parent VMs. Think of an actual cloud hosting scenario. Here, we have multiple "parent instances". Each of them thinks it's CID 3. Each can spawn an Enclave that talks to CID 3 and reach the parent. For this case, I think implementing all of virtio-vsock in user space is the best path forward. But in theory, you could also swizzle CIDs to make random "real" CIDs appear as CID 3. Thank you for clarifying the use cases! Also for case 1, vhost-vsock doesn't support CID 0, so in my opinion it's easier to go into user-space with vhost-user-vsock or the built-in device. Sorry, I believe I meant CID 2. Effectively for case 1, when a process on the hypervisor listens on port 1234, it should be visible as 3:1234 from the VM and when the hypervisor process connects to :1234, it should look as if that connection came from CID 3. Now that I'm thinking about my message again: What if we just introduce a sysfs/sysctl file for vsock that indicates the "host CID" (default: 2)? Users that want vhost-vsock to behave as if the host is CID 3 can just write 3 to it. It means we'd need to change all references to VMADDR_CID_HOST to instead refer to a global variable that indicates the new "host CID". It'd need some more careful massaging to not break number namespace assumptions (<= CID_HOST no longer works), but the idea should fly. That would give us all 3 options: 1) User sets vsock.host_cid = 3 to simulate that the host is in reality an enclave parent 2) User spa
Re: How to implement message forwarding from one CID to another in vhost driver
On 29.05.24 10:04, Stefano Garzarella wrote: On Tue, May 28, 2024 at 06:38:24PM GMT, Paolo Bonzini wrote: On Tue, May 28, 2024 at 5:53 PM Stefano Garzarella wrote: On Tue, May 28, 2024 at 05:49:32PM GMT, Paolo Bonzini wrote: >On Tue, May 28, 2024 at 5:41 PM Stefano Garzarella wrote: >> >I think it's either that or implementing virtio-vsock in userspace >> >(https://lore.kernel.org/qemu-devel/30baeb56-64d2-4ea3-8e53-6a5c50999...@redhat.com/, >> >search for "To connect host<->guest"). >> >> For in this case AF_VSOCK can't be used in the host, right? >> So it's similar to vhost-user-vsock. > >Not sure if I understand but in this case QEMU knows which CIDs are >forwarded to the host (either listen on vsock and connect to the host, >or vice versa), so there is no kernel and no VMADDR_FLAG_TO_HOST >involved. I meant that the application in the host that wants to connect to the guest cannot use AF_VSOCK in the host, but must use the one where QEMU is listening (e.g. AF_INET, AF_UNIX), right? I think one of Alex's requirements was that the application in the host continue to use AF_VSOCK as in their environment. Can the host use VMADDR_CID_LOCAL for host-to-host communication? Yep! If so, the proposed "-object vsock-forward" syntax can connect to it and it should work as long as the application on the host does not assume that it is on CID 3. Right, good point! We can also support something similar in vhost-user-vsock, where instead of using AF_UNIX and firecracker's hybrid vsock, we can redirect everything to VMADDR_CID_LOCAL. Alex what do you think? That would simplify things a lot to do. The only difference is that the application in the host has to talk to VMADDR_CID_LOCAL (1). The application in the host would see an incoming connection from CID 1 (which is probably fine) and would still be able to establish outgoing connections to the actual VM's CID as long as the Enclave doesn't check for the peer CID (I haven't seen anyone check yet). So yes, indeed, this should work. The only case where I can see it breaking is when you run multiple Enclave VMs in parallel. In that case, each would try to listen to CID 3 and the second that does would fail. But it's a well solvable problem: We could (in addition to the simple in-QEMU case) build an external daemon that does the proxying and hence owns CID3. So the immediate plan would be to: 1) Build a new vhost-vsock-forward object model that connects to vhost as CID 3 and then forwards every packet from CID 1 to the Enclave-CID and every packet that arrives on to CID 3 to CID 2. 2) Create a machine option for -M nitro-enclave that automatically spawns the vhost-vsock-forward object. (default: off) The above may need some fiddling with object creation times to ensure that the forward object gets CID 3, not the Enclave as auto-assigned CID. Thanks, Alex Amazon Web Services Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597
[PATCH] lan78xx: Connect phy early
When using wicked with a lan78xx device attached to the system, we end up with ethtool commands issued on the device before an ifup got issued. That lead to the following crash: Unable to handle kernel NULL pointer dereference at virtual address 039c pgd = 800035b3 [039c] *pgd= Internal error: Oops: 9604 [#1] SMP Modules linked in: [...] Supported: Yes CPU: 3 PID: 638 Comm: wickedd Tainted: GE 4.12.14-0-default #1 Hardware name: raspberrypi rpi/rpi, BIOS 2018.03-rc2 02/21/2018 task: 800035e74180 task.stack: 800036718000 PC is at phy_ethtool_ksettings_get+0x20/0x98 LR is at lan78xx_get_link_ksettings+0x44/0x60 [lan78xx] pc : [] lr : [] pstate: 2005 sp : 80003671bb20 x29: 80003671bb20 x28: 800035e74180 x27: 08912000 x26: 001d x25: 0124 x24: 08f74d00 x23: 004000114809 x22: x21: 80003671bbd0 x20: x19: 80003671bbd0 x18: 040d x17: 0001 x16: x15: x14: x13: x12: 0020 x11: 0101010101010101 x10: fefefefefefefeff x9 : 7f7f7f7f7f7f7f7f x8 : fefefeff31677364 x7 : 80808080 x6 : 80003671bc9c x5 : 80003671b9f8 x4 : 80002c296190 x3 : x2 : x1 : 80003671bbd0 x0 : 80003671bc00 Process wickedd (pid: 638, stack limit = 0x800036718000) Call trace: Exception stack(0x80003671b9e0 to 0x80003671bb20) b9e0: 80003671bc00 80003671bbd0 ba00: 80002c296190 80003671b9f8 80003671bc9c 80808080 ba20: fefefeff31677364 7f7f7f7f7f7f7f7f fefefefefefefeff 0101010101010101 ba40: 0020 ba60: 0001 040d 80003671bbd0 ba80: 80003671bbd0 004000114809 baa0: 08f74d00 0124 001d 08912000 bac0: 800035e74180 80003671bb20 00dcca84 80003671bb20 bae0: 086f7f30 2005 80002c296000 800035223900 bb00: 80003671bb20 086f7f30 [] phy_ethtool_ksettings_get+0x20/0x98 [] lan78xx_get_link_ksettings+0x44/0x60 [lan78xx] [] ethtool_get_settings+0x68/0x210 [] dev_ethtool+0x214/0x2180 [] dev_ioctl+0x400/0x630 [] sock_do_ioctl+0x70/0x88 [] sock_ioctl+0x208/0x368 [] do_vfs_ioctl+0xb0/0x848 [] SyS_ioctl+0x8c/0xa8 Exception stack(0x80003671bec0 to 0x80003671c000) bec0: 0009 8946 f4e841d0 aa0032687465 bee0: fa2319d4 f4e841d4 32687465 32687465 bf00: 001d 7f7fff7f7f7f7f7f 72606b622e71ff4c 7f7f7f7f7f7f7f7f bf20: 0101010101010101 0020 7f510c68 bf40: 7f6a9d18 7f44ce30 040d 7f6f98f0 bf60: f4e842c0 0001 fa2c2e00 7f6ab000 bf80: f4e842c0 7f62a000 fa2b9f20 fa2c2e00 bfa0: f4e84818 f4e841a0 7f5ad0cc f4e841a0 bfc0: 7f44ce3c 8000 0009 001d bfe0: The culprit is quite simple: The driver tries to access the phy left and right, but only actually has a working reference to it when the device is up. The fix thus is quite simple too: Get a reference to the phy on probe already and keep it even when the device is going down. With this patch applied, I can successfully run wicked on my system and bring the interface up and down as many times as I want, without getting NULL pointer dereferences in between. Signed-off-by: Alexander Graf --- drivers/net/usb/lan78xx.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index 60a604cc7647..931cc124ab0c 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -2082,8 +2082,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev) dev->fc_autoneg = phydev->autoneg; - phy_start(phydev); - netif_dbg(dev, ifup, dev->net, "phy initialised successfully"); return 0; @@ -2512,9 +2510,7 @@ static int lan78xx_open(struct net_device *net) if (ret < 0) goto done; - ret = lan78xx_phy_init(dev); - if (ret < 0) - goto done; + phy_start(net->phydev); /* for Link Check */ if (dev->urb_intr) { @@ -2575,13 +2571,7 @@ static int lan78xx_stop(str
[PATCH v2] lan78xx: Connect phy early
When using wicked with a lan78xx device attached to the system, we end up with ethtool commands issued on the device before an ifup got issued. That lead to the following crash: Unable to handle kernel NULL pointer dereference at virtual address 039c pgd = 800035b3 [039c] *pgd= Internal error: Oops: 9604 [#1] SMP Modules linked in: [...] Supported: Yes CPU: 3 PID: 638 Comm: wickedd Tainted: GE 4.12.14-0-default #1 Hardware name: raspberrypi rpi/rpi, BIOS 2018.03-rc2 02/21/2018 task: 800035e74180 task.stack: 800036718000 PC is at phy_ethtool_ksettings_get+0x20/0x98 LR is at lan78xx_get_link_ksettings+0x44/0x60 [lan78xx] pc : [] lr : [] pstate: 2005 sp : 80003671bb20 x29: 80003671bb20 x28: 800035e74180 x27: 08912000 x26: 001d x25: 0124 x24: 08f74d00 x23: 004000114809 x22: x21: 80003671bbd0 x20: x19: 80003671bbd0 x18: 040d x17: 0001 x16: x15: x14: x13: x12: 0020 x11: 0101010101010101 x10: fefefefefefefeff x9 : 7f7f7f7f7f7f7f7f x8 : fefefeff31677364 x7 : 80808080 x6 : 80003671bc9c x5 : 80003671b9f8 x4 : 80002c296190 x3 : x2 : x1 : 80003671bbd0 x0 : 80003671bc00 Process wickedd (pid: 638, stack limit = 0x800036718000) Call trace: Exception stack(0x80003671b9e0 to 0x80003671bb20) b9e0: 80003671bc00 80003671bbd0 ba00: 80002c296190 80003671b9f8 80003671bc9c 80808080 ba20: fefefeff31677364 7f7f7f7f7f7f7f7f fefefefefefefeff 0101010101010101 ba40: 0020 ba60: 0001 040d 80003671bbd0 ba80: 80003671bbd0 004000114809 baa0: 08f74d00 0124 001d 08912000 bac0: 800035e74180 80003671bb20 00dcca84 80003671bb20 bae0: 086f7f30 2005 80002c296000 800035223900 bb00: 80003671bb20 086f7f30 [] phy_ethtool_ksettings_get+0x20/0x98 [] lan78xx_get_link_ksettings+0x44/0x60 [lan78xx] [] ethtool_get_settings+0x68/0x210 [] dev_ethtool+0x214/0x2180 [] dev_ioctl+0x400/0x630 [] sock_do_ioctl+0x70/0x88 [] sock_ioctl+0x208/0x368 [] do_vfs_ioctl+0xb0/0x848 [] SyS_ioctl+0x8c/0xa8 Exception stack(0x80003671bec0 to 0x80003671c000) bec0: 0009 8946 f4e841d0 aa0032687465 bee0: fa2319d4 f4e841d4 32687465 32687465 bf00: 001d 7f7fff7f7f7f7f7f 72606b622e71ff4c 7f7f7f7f7f7f7f7f bf20: 0101010101010101 0020 7f510c68 bf40: 7f6a9d18 7f44ce30 040d 7f6f98f0 bf60: f4e842c0 0001 fa2c2e00 7f6ab000 bf80: f4e842c0 7f62a000 fa2b9f20 fa2c2e00 bfa0: f4e84818 f4e841a0 7f5ad0cc f4e841a0 bfc0: 7f44ce3c 8000 0009 001d bfe0: The culprit is quite simple: The driver tries to access the phy left and right, but only actually has a working reference to it when the device is up. The fix thus is quite simple too: Get a reference to the phy on probe already and keep it even when the device is going down. With this patch applied, I can successfully run wicked on my system and bring the interface up and down as many times as I want, without getting NULL pointer dereferences in between. Signed-off-by: Alexander Graf --- v1 -> v2: - move debug message (Andrew Lunn) - check for phydev != NULL before phy_stop (Sayed Nisar) - move phy_disconnect before unregister_netdev (Sayed Nisar) - properly unroll failing probe phy_init (Sayed Nisar) - change phy_init to phy_start in reset_resume (Sayed Nisar) --- drivers/net/usb/lan78xx.c | 34 ++ 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index 60a604cc7647..6986f53fe967 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -2082,10 +2082,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev) dev->fc_autoneg = phydev->autoneg; - phy_start(phydev); - - netif_dbg(dev, ifup, dev->net, "phy initialised successfully"); - return 0; error: @@ -2512,9 +2508,9 @@ static i
[PATCH] [iputils] Print received packets as icmp_seq
Now, ping and ping6 print the packets which are actually received, too, not only the amount of sent packets. It has the format: icmp_seq=received/seq Signed-off-by: Alexander Graf <[EMAIL PROTECTED]> --- ping_common.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ping_common.c b/ping_common.c index be36cbd..83be553 100644 --- a/ping_common.c +++ b/ping_common.c @@ -711,7 +711,7 @@ restamp: } else { int i; __u8 *cp, *dp; - printf("%d bytes from %s: icmp_seq=%u", cc, from, seq); + printf("%d bytes from %s: icmp_seq=%li/%u", cc, from, nreceived, seq); if (hops >= 0) printf(" ttl=%d", hops); -- 1.5.2.4 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [iputils] Print packet loss with more precision
Signed-off-by: Alexander Graf <[EMAIL PROTECTED]> --- ping_common.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/ping_common.c b/ping_common.c index 83be553..49acab2 100644 --- a/ping_common.c +++ b/ping_common.c @@ -795,9 +795,9 @@ void finish(void) if (nerrors) printf(", +%ld errors", nerrors); if (ntransmitted) { - printf(", %d%% packet loss", - (int) long long)(ntransmitted - nreceived)) * 100) / - ntransmitted)); + printf(", %f%% packet loss", + (((long long)(ntransmitted - nreceived)) * 100.0) / + ntransmitted); printf(", time %ldms", 1000*tv.tv_sec+tv.tv_usec/1000); } putchar('\n'); -- 1.5.2.4 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] phy dp83867: depend on CONFIG_OF_MDIO
The DP83867 phy driver doesn't actually work when CONFIG_OF_MDIO isn't enabled. It simply passes the device tree test, but leaves all internal configuration initialized at 0. Then it configures the phy with those values and renders a previously working configuration useless. This patch makes sure that we only build the DP83867 phy code when CONFIG_OF_MDIO is set, to not run into that problem. Signed-off-by: Alexander Graf --- drivers/net/phy/Kconfig | 1 + drivers/net/phy/dp83867.c | 7 --- 2 files changed, 1 insertion(+), 7 deletions(-) diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index 6dad9a9..4265ad5 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -148,6 +148,7 @@ config DP83848_PHY config DP83867_PHY tristate "Drivers for Texas Instruments DP83867 Gigabit PHY" + depends on OF_MDIO ---help--- Currently supports the DP83867 PHY. diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c index 2afa61b..ff867ba 100644 --- a/drivers/net/phy/dp83867.c +++ b/drivers/net/phy/dp83867.c @@ -99,7 +99,6 @@ static int dp83867_config_intr(struct phy_device *phydev) return phy_write(phydev, MII_DP83867_MICR, micr_status); } -#ifdef CONFIG_OF_MDIO static int dp83867_of_init(struct phy_device *phydev) { struct dp83867_private *dp83867 = phydev->priv; @@ -123,12 +122,6 @@ static int dp83867_of_init(struct phy_device *phydev) return of_property_read_u32(of_node, "ti,fifo-depth", &dp83867->fifo_depth); } -#else -static int dp83867_of_init(struct phy_device *phydev) -{ - return 0; -} -#endif /* CONFIG_OF_MDIO */ static int dp83867_config_init(struct phy_device *phydev) { -- 1.8.5.6
Re: [PATCH] phy dp83867: depend on CONFIG_OF_MDIO
Hi Dan, On 16.05.16 15:38, Dan Murphy wrote: > Alexander > > On 05/16/2016 06:28 AM, Alexander Graf wrote: >> The DP83867 phy driver doesn't actually work when CONFIG_OF_MDIO isn't >> enabled. >> It simply passes the device tree test, but leaves all internal configuration >> initialized at 0. Then it configures the phy with those values and renders a >> previously working configuration useless. >> >> This patch makes sure that we only build the DP83867 phy code when >> CONFIG_OF_MDIO is set, to not run into that problem. >> >> Signed-off-by: Alexander Graf >> --- >> drivers/net/phy/Kconfig | 1 + >> drivers/net/phy/dp83867.c | 7 --- >> 2 files changed, 1 insertion(+), 7 deletions(-) >> >> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig >> index 6dad9a9..4265ad5 100644 >> --- a/drivers/net/phy/Kconfig >> +++ b/drivers/net/phy/Kconfig >> @@ -148,6 +148,7 @@ config DP83848_PHY >> >> config DP83867_PHY >> tristate "Drivers for Texas Instruments DP83867 Gigabit PHY" >> +depends on OF_MDIO >> ---help--- >>Currently supports the DP83867 PHY. >> >> diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c >> index 2afa61b..ff867ba 100644 >> --- a/drivers/net/phy/dp83867.c >> +++ b/drivers/net/phy/dp83867.c >> @@ -99,7 +99,6 @@ static int dp83867_config_intr(struct phy_device *phydev) >> return phy_write(phydev, MII_DP83867_MICR, micr_status); >> } >> >> -#ifdef CONFIG_OF_MDIO >> static int dp83867_of_init(struct phy_device *phydev) >> { >> struct dp83867_private *dp83867 = phydev->priv; >> @@ -123,12 +122,6 @@ static int dp83867_of_init(struct phy_device *phydev) >> return of_property_read_u32(of_node, "ti,fifo-depth", >> &dp83867->fifo_depth); >> } >> -#else >> -static int dp83867_of_init(struct phy_device *phydev) >> -{ >> -return 0; >> -} >> -#endif /* CONFIG_OF_MDIO */ >> >> static int dp83867_config_init(struct phy_device *phydev) >> { > I don't think we want this to depend solely on OF_MDIO. > > The #else case should probably be coded to look at platform data, if > it exists. I don't have any boards that still used platform data to test this > out so I did not feel comfortable adding code I could not test. Since there was no code to look at platform data, those boards would be broken just as well today, no? So at the end of the day, this change should be no regression for them. Alex
Re: [PATCH] phy dp83867: depend on CONFIG_OF_MDIO
On 16.05.16 14:44, Andrew Lunn wrote: > On Mon, May 16, 2016 at 01:28:15PM +0200, Alexander Graf wrote: >> The DP83867 phy driver doesn't actually work when CONFIG_OF_MDIO isn't >> enabled. >> It simply passes the device tree test, but leaves all internal configuration >> initialized at 0. Then it configures the phy with those values and renders a >> previously working configuration useless. >> >> This patch makes sure that we only build the DP83867 phy code when >> CONFIG_OF_MDIO is set, to not run into that problem. > > Hi Alexander > > Looking at the code, the parameters read from device tree are needed > for RGMII mode. What about the case the PHY is used not in RGMII mode? > Could it be there are device trees which legitimately don't have these > properties since they are not needed, and are now going to get > -ENODEV? To be honest, I don't know. I only stumbled over this because we compile all those phy drivers as modules and the driver simply rendered one system I was working on unusable wrt network. Dan? Alex
[PATCH v2 1/2] phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
When CONFIG_OF_MDIO is configured as module, the #define for it really is CONFIG_OF_MDIO_MODULE, not CONFIG_OF_MDIO. So if we are compiling it as module, the dp83867 doesn't see that OF_MDIO was selected and doesn't read the dt rgmii parameters. The fix is simple: Use IS_ENABLED(). It checks for both - module as well as compiled in code. Signed-off-by: Alexander Graf --- drivers/net/phy/dp83867.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c index 2afa61b..94cc278 100644 --- a/drivers/net/phy/dp83867.c +++ b/drivers/net/phy/dp83867.c @@ -99,7 +99,7 @@ static int dp83867_config_intr(struct phy_device *phydev) return phy_write(phydev, MII_DP83867_MICR, micr_status); } -#ifdef CONFIG_OF_MDIO +#if IS_ENABLED(CONFIG_OF_MDIO) static int dp83867_of_init(struct phy_device *phydev) { struct dp83867_private *dp83867 = phydev->priv; -- 1.8.5.6
[PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
If you compile without OF_MDIO support in an RGMII configuration, we fail to configure the dp83867 phy today by writing garbage into its configuration registers. On the other hand if you do compile with OF_MDIO and the phy gets loaded via device tree, you have to have the properties set in the device tree, otherwise we fail to load the driver and don't even attach the generic phy driver to the interface anymore. To make things slightly more consistent, make the rgmii configuration properties optional and allow a user to omit them in their device tree. Signed-off-by: Alexander Graf --- drivers/net/phy/dp83867.c | 31 --- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c index 94cc278..1b01680 100644 --- a/drivers/net/phy/dp83867.c +++ b/drivers/net/phy/dp83867.c @@ -65,6 +65,7 @@ struct dp83867_private { int rx_id_delay; int tx_id_delay; int fifo_depth; + int values_are_sane; }; static int dp83867_ack_interrupt(struct phy_device *phydev) @@ -113,15 +114,30 @@ static int dp83867_of_init(struct phy_device *phydev) ret = of_property_read_u32(of_node, "ti,rx-internal-delay", &dp83867->rx_id_delay); if (ret) - return ret; + goto invalid_dt; ret = of_property_read_u32(of_node, "ti,tx-internal-delay", &dp83867->tx_id_delay); if (ret) - return ret; + goto invalid_dt; - return of_property_read_u32(of_node, "ti,fifo-depth", + ret = of_property_read_u32(of_node, "ti,fifo-depth", &dp83867->fifo_depth); + if (ret) + goto invalid_dt; + + dp83867->values_are_sane = 1; + + return 0; + +invalid_dt: + phydev_err(phydev, "missing properties in device tree"); + + /* +* We can still run with a broken dt by not using any of the optional +* parameters, so just don't set dp83867->values_are_sane. +*/ + return 0; } #else static int dp83867_of_init(struct phy_device *phydev) @@ -150,6 +166,15 @@ static int dp83867_config_init(struct phy_device *phydev) dp83867 = (struct dp83867_private *)phydev->priv; } + /* +* With no or broken device tree, we don't have the values that we would +* want to configure the phy with. In that case, cross our fingers and +* assume that firmware did everything correctly for us or that we don't +* need them. +*/ + if (!dp83867->values_are_sane) + return 0; + if (phy_interface_is_rgmii(phydev)) { ret = phy_write(phydev, MII_DP83867_PHYCTRL, (dp83867->fifo_depth << DP83867_PHYCR_FIFO_DEPTH_SHIFT)); -- 1.8.5.6
Re: [PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
Hi David, > Am 17.05.2016 um 20:48 schrieb David Miller : > > From: Dan Murphy > Date: Tue, 17 May 2016 13:34:34 -0500 > >> David >> >>> On 05/17/2016 01:22 PM, David Miller wrote: >>> From: Alexander Graf >>> Date: Mon, 16 May 2016 20:52:43 +0200 >>> >>>> If you compile without OF_MDIO support in an RGMII configuration, we fail >>>> to configure the dp83867 phy today by writing garbage into its >>>> configuration >>>> registers. >>>> >>>> On the other hand if you do compile with OF_MDIO and the phy gets loaded >>>> via >>>> device tree, you have to have the properties set in the device tree, >>>> otherwise >>>> we fail to load the driver and don't even attach the generic phy driver to >>>> the interface anymore. >>>> >>>> To make things slightly more consistent, make the rgmii configuration >>>> properties >>>> optional and allow a user to omit them in their device tree. >>>> >>>> Signed-off-by: Alexander Graf >>> Applied. >> >> This patch should not have been applied. >> >> I did not believe the implementation was proper for that driver. >> >> It seems my objection to the code was not seen. Nor was Andrew's point >> about the DT bindings document >> >> https://patchwork.kernel.org/patch/9105371/ > > The discussions around the recent phy patches have been a labrynth that I've > found hard to follow, sorry. > > I'll revert these two, sigh The first patch is an obvious and correct fix. Discussions were only about the second one (which I'm happy to drop given the rat hole this turned out to be). Alex
Re: [PATCH v2 2/2] phy dp83867: Make rgmii parameters optional
> Am 17.05.2016 um 22:40 schrieb Dan Murphy : > >> On 05/17/2016 03:37 PM, Alexander Graf wrote: >> Hi David, >> >>> Am 17.05.2016 um 20:48 schrieb David Miller : >>> >>> From: Dan Murphy >>> Date: Tue, 17 May 2016 13:34:34 -0500 >>> >>>> David >>>> >>>>> On 05/17/2016 01:22 PM, David Miller wrote: >>>>> From: Alexander Graf >>>>> Date: Mon, 16 May 2016 20:52:43 +0200 >>>>> >>>>>> If you compile without OF_MDIO support in an RGMII configuration, we fail >>>>>> to configure the dp83867 phy today by writing garbage into its >>>>>> configuration >>>>>> registers. >>>>>> >>>>>> On the other hand if you do compile with OF_MDIO and the phy gets loaded >>>>>> via >>>>>> device tree, you have to have the properties set in the device tree, >>>>>> otherwise >>>>>> we fail to load the driver and don't even attach the generic phy driver >>>>>> to >>>>>> the interface anymore. >>>>>> >>>>>> To make things slightly more consistent, make the rgmii configuration >>>>>> properties >>>>>> optional and allow a user to omit them in their device tree. >>>>>> >>>>>> Signed-off-by: Alexander Graf >>>>> Applied. >>>> This patch should not have been applied. >>>> >>>> I did not believe the implementation was proper for that driver. >>>> >>>> It seems my objection to the code was not seen. Nor was Andrew's point >>>> about the DT bindings document >>>> >>>> https://patchwork.kernel.org/patch/9105371/ >>> The discussions around the recent phy patches have been a labrynth that I've >>> found hard to follow, sorry. >>> >>> I'll revert these two, sigh >> The first patch is an obvious and correct fix. Discussions were only about >> the second one (which I'm happy to drop given the rat hole this turned out >> to be). > > So are you going to abandon the second patch all together? > If you do, let me know I can submit a patch. I'm fairly sure you'll be done with it much quicker than me and the real bug I wanted to fix is that the driver broke with OF_MDIO=m ;) So yes, please take over from here. Alex
Re: [PATCH] qeth: Default to allow promiscuous mode
On 17.03.16 18:52, Evgeny Cherkashin wrote: > Hello all, > > -Alexander Graf wrote: - > >> To: linux-s...@vger.kernel.org >> From: Alexander Graf >> Date: 2016-03-17 20:01 >> Cc: Evgeny Cherkashin/Russia/IBM@IBMRU, linux-ker...@vger.kernel.org, >> Heiko Carstens , Martin Schwidefsky >> , Ursula Braun , >> i...@suse.de, m...@suse.com, pwieczorkiew...@suse.de >> Subject: [PATCH] qeth: Default to allow promiscuous mode >> >> When a qeth device is in bridge role, one of the ports of an adapter >> can >> ask for promiscuous mode and get it enabled. >> >> The default until now was to not allow user space to enable >> promiscuous mode >> without switching sysfs attributes as well though, diverging from >> usual >> network device semantics. >> >> This patch sets the default to allow promiscuous enablement. That way >> all >> existing tools "just work", albeit only one port of an adapter can be >> in >> promiscuous mode at a given time. >> >> Signed-off-by: Alexander Graf >> --- >> drivers/s390/net/qeth_l2_sys.c | 4 >> 1 file changed, 4 insertions(+) >> >> diff --git a/drivers/s390/net/qeth_l2_sys.c >> b/drivers/s390/net/qeth_l2_sys.c >> index 692db49..98c7ac5 100644 >> --- a/drivers/s390/net/qeth_l2_sys.c >> +++ b/drivers/s390/net/qeth_l2_sys.c >> @@ -258,6 +258,10 @@ void qeth_l2_setup_bridgeport_attrs(struct >> qeth_card *card) >> return; >> if (!card->options.sbp.supported_funcs) >> return; >> + >> +/* Allow to set promiscuous by default */ >> +card->options.sbp.reflect_promisc = 1; >> + >> if (card->options.sbp.role != QETH_SBP_ROLE_NONE) { >> /* Conditional to avoid spurious error messages */ >> qeth_bridgeport_setrole(card, card->options.sbp.role); >> -- >> 1.8.5.6 >> >> > > 1. The patch changes the default behaviour (which is, on s390, to ignore > promisc setting) and has potential to break existing setups. One potentially > dangerous scenario is that a Linux instance may inadvertently snatch the > BRIDGEPORT role from a zVM VSWITCH that bridges the HiperSockets LAN to the > outer world, disrupting connectivity for many users. Look at it the other way around. Hardware allows you to do X, why should Linux prevent you from doing X then? If you can break another LPAR on the system, then your virtualization solution sucks and people should be aware of it and simply configure it in a way that it doesn't. If you can't trust your admin of a Linux system to call the right commands, you can't trust him to not be malicious either and break it on purpose. > 2. The patch sets the reflect_promisc field without setting the > reflect_promisc_primary field, making it unclear to the user what is the > default role. (It will be "secondary" but it's better to set it explicitly.) I can send it explicitly if that makes it move obvious, sure. > 3. The patch should probably go via the -networking maillist. Ah, get_maintainers.pl didn't list it. You may want to send a patch for the MAINTAINERS file here :). > It seems to me that it would be better to use the sysfs attribute from a > startup script or a udev rule, *only* when the system configuration requires > "real" promiscuous mode, e.g. when the interface is a member of a software > bridge. You've now successfully added a lot of headaches, curses and pain to the plate of the admin of such a system. As an admin of such a system, I don't want to have to deal with s390x specifics. That's why I'm using Linux there. And who decides what "real" promiscuous mode is? Nobody should set a network device into promiscuous mode today already unless it's required. The current thinking in your company seems to be that "real" users should explicitly set sysfs attribute. Let's think this a bit ahead. Imagine a world with this sysfs attribute. Every "real" user of promiscuous mode will now set the sysfs attribute as well in parallel to setting promiscuous mode on the device. Once you have patched 50 user space packages to touch something they really shouldn't have to care about, the same applications that today set an interface to promiscuous mode, set it to promiscuous mode, just by writing to sysfs plus enabling it. So at the end of the day, the only thing you've done is you've created a lot of work for people who are not you, for no gain. That leaves only one option. We can create a udev rule that *always* sets the device to promiscuous capable. But that's the same net result as the patch I've sent, no? Alex