On 04/06/2016 04:22 PM, Dmitry Fleytman wrote:
> Hi Jason,
>
> Please see my comments below.
>
>> On 8 Mar 2016, at 11:31 AM, Jason Wang <jasow...@redhat.com> wrote:
>>
>> On 02/23/2016 01:37 AM, Leonid Bloch wrote:
>>> From: Dmitry Fleytman <dmitry.fleyt...@ravellosystems.com>
>>>
>>> This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.
>>>
>>> This implementation is derived from the e1000 emulation code, and
>>> utilizes the TX/RX packet abstractions that were initially developed for
>>> the vmxnet3 device. Although some parts of the introduced code may be
>>> shared with e1000, the differences are substantial enough that the
>>> only shared resources for the two devices are the definitions in
>>> hw/net/e1000_regs.h.
>>>
>>> Similarly to vmxnet3, the new device uses virtio headers for task
>>> offloads (for backends that support virtio extensions). Usage of
>>> virtio headers may be forcibly disabled via a boolean device property
>>> "vnet" (which is enabled by default). In that case, task offloads
>>> will be performed in software, in the same way it is done for
>>> backends that do not support virtio headers.
>>>
>>> The device code is split into two parts:
>>>
>>> 1. hw/net/e1000e.c: QEMU-specific code for a network device;
>>> 2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.
>>>
>>> The new device name is e1000e.
>>>
>>> Intel specifications for the 82574 controller are available at:
>>> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[...]

>>> +Device properties:
>>> +
>>> ++----------------+--------------------------------+--------+---------+
>>> +| Property name  | Description                    | Type   | Default |
>>> ++----------------+--------------------------------+--------+---------+
>>> +| subsys_ven     | PCI device Subsystem Vendor ID | UINT16 | 0x8086  |
>>> +|                |                                |        |         |
>>> +| subsys         | PCI device Subsystem ID        | UINT16 | 0       |
>>> +|                |                                |        |         |
>>> +| vnet           | Use virtio headers or perform  | BOOL   | TRUE    |
>>> +|                | SW offloads emulation          |        |         |
>>> ++----------------+--------------------------------+--------+---------+
>>
>> You could use PropertyInfo, which has a "description" field, to do this
>> better (e.g. qdev_prop_vlan). Then there would be no need for this file
>> at all.
>
> We replaced this file with source code comments.
> As for the PropertyInfo description, I can't see any way to pass a
> description to the DEFINE_PROP_* macros. Could you please elaborate on
> this?

You could use DEFINE_PROP(), which accepts a PropertyInfo parameter
(rough sketch further down).

>
>>
>>> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
>>> index 527d264..4201115 100644
>>> --- a/hw/net/Makefile.objs
>>> +++ b/hw/net/Makefile.objs

[...]

>>
>>> +    MemoryRegion io;
>>> +    MemoryRegion msix;
>>> +
>>> +    uint32_t ioaddr;
>>> +
>>> +    uint16_t subsys_ven;
>>> +    uint16_t subsys;
>>> +
>>> +    uint16_t subsys_ven_used;
>>> +    uint16_t subsys_used;
>>> +
>>> +    uint32_t intr_state;
>>> +    bool use_vnet;
>>> +
>>> +    E1000ECore core;
>>
>> What are the advantages of introducing another indirection level here?
>> It looks like it causes a lot of extra code overhead.
>
> This was done intentionally.
> Unlike e1000 and e1000e, which are really different devices, e1000e and
> newer Intel devices like ixgb* are rather similar. Introducing the
> e1000e core abstraction will simplify possible code reuse in the future.

Ok, I see.

>
>>
>>> +
>>> +} E1000EState;
>>> +

[...]

>>> +static NetClientInfo net_e1000e_info = {
>>> +    .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>> +    .size = sizeof(NICState),
>>> +    .can_receive = e1000e_nc_can_receive,
>>> +    .receive = e1000e_nc_receive,
>>> +    .receive_iov = e1000e_nc_receive_iov,
>>
>> All three of the functions above have a NetClientState parameter, so
>> there is probably no need to have "nc" in their names.
>
> The issue is that there are other functions with similar names but
> without _nc_...

Right.

>>
>>> +    .link_status_changed = e1000e_set_link_status,
>>> +};
>>> +

[...]

>>> +    for (i = 0; i < s->conf.peers.queues; i++) {
>>> +        nc = qemu_get_subqueue(s->nic, i);
>>> +        if (!nc->peer || !qemu_has_vnet_hdr(nc->peer)) {
>>> +            s->core.has_vnet = false;
>>
>> So in fact we're probing for vnet support instead of doing vnet on our
>> own if the backend does not support it. It seems to me that there's no
>> need to have a "vnet" property.
>
> This property is intended for forcibly disabling virtio header support.
> Possible use cases are verification of the SW offload logic and working
> around theoretical interoperability problems.

Ok, so maybe we'd better rename it to "disable_vnet".
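To make the two points above concrete (the rename plus DEFINE_PROP() with
a described PropertyInfo), something along these lines should work. This
is an untested sketch: the field name disable_vnet_hdr, the description
text and the place where the PropertyInfo is filled in are my
assumptions, not something taken from this patch.

/* Untested sketch for hw/net/e1000e.c: reuse qdev_prop_bool's accessors
 * but attach our own description string.  Assumes E1000EState has a
 * bool field named disable_vnet_hdr (inverted meaning of use_vnet). */
static PropertyInfo e1000e_prop_disable_vnet;

static Property e1000e_properties[] = {
    DEFINE_NIC_PROPERTIES(E1000EState, conf),
    DEFINE_PROP("disable_vnet", E1000EState, disable_vnet_hdr,
                e1000e_prop_disable_vnet, bool),
    DEFINE_PROP_END_OF_LIST(),
};

static void e1000e_class_init(ObjectClass *class, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(class);

    /* Copy the stock bool PropertyInfo and add a description to it;
     * only the address of e1000e_prop_disable_vnet is taken by the
     * property array above, so filling it in here is early enough. */
    e1000e_prop_disable_vnet = qdev_prop_bool;
    e1000e_prop_disable_vnet.description =
        "Do not use virtio headers, perform SW offloads emulation instead";

    dc->props = e1000e_properties;
    /* ... remaining class init ... */
}

With that, the property introspection (e.g. "-device e1000e,help") should
pick up the description, so a separate documentation file really isn't
needed.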
>
>>
>>> +            trace_e1000e_cfg_support_virtio(false);
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>> +        VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].mss, E1000EState),
>>> +        VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].ip, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].tcp, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].tse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
>>
>> You can move the tx state into another structure and use VMSTATE_ARRAY()
>> here.
>
> Not sure I got your point. Could you please provide more details?

I mean, e.g., move skip_cp into e1000e_txd_props and move tx_pkt out of
e1000_tx. Then you can use VMSTATE_ARRAY to save and load an
e1000e_txd_props array (rough sketch further down).

>
>>
>>> +        VMSTATE_END_OF_LIST()
>>> +    }
>>> +};
>>> +
>>> +static Property e1000e_properties[] = {
>>> +    DEFINE_NIC_PROPERTIES(E1000EState, conf),
>>> +    DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
>>> +    DEFINE_PROP_UINT16("subsys_ven", E1000EState,
>>> +                       subsys_ven, PCI_VENDOR_ID_INTEL),
>>> +    DEFINE_PROP_UINT16("subsys", E1000EState, subsys, 0),
>>
>> I'm not quite sure why we need subsys_ven and subsys here?
>
> Some vendors (like VMware) may replace these device IDs with their own.
> These parameters provide a way to make the device look exactly like
> theirs. Useful in various V2V scenarios.

I get it. Thanks.

>
>>
>>> +    DEFINE_PROP_END_OF_LIST(),
>>> +};
>>> +

[...]

>>> +
>>> +static void
>>> +e1000e_intrmgr_on_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +
>>> +    assert(!msix_enabled(timer->core->owner));
>>
>> Can this assert be triggered by the guest? If so, let's remove it.
>
> No, this timer is armed for legacy interrupts only.
> This assertion would mean there is an internal logic error in the device,
> i.e. the device code did not check that we're working with legacy
> interrupts before arming it.

Ok.

>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->itr_intr_pending) {
>>> +        trace_e1000e_irq_throttling_no_pending_interrupts();
>>> +        return;
>>> +    }
>>> +
>>> +    if (msi_enabled(timer->core->owner)) {
>>> +        trace_e1000e_irq_msi_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    } else {
>>> +        trace_e1000e_irq_legacy_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_msix_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +    int idx = timer - &timer->core->eitr[0];
>>> +
>>> +    assert(msix_enabled(timer->core->owner));
>>
>> Same question as above.
>
> The same. This timer is for MSI-X only.

Ok.
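Coming back to the VMSTATE_ARRAY point above, this is roughly what I
meant. Untested sketch: the struct name, the field layout and the assumed
core.txd_props[] array in E1000EState are only illustrative; the exact
split (and where tx_pkt ends up) is up to you.

/* Untested sketch: gather the per-queue TX context-descriptor state into
 * one struct, so both queues are migrated with a single array entry
 * instead of spelling out core.tx[0].* and core.tx[1].* by hand.
 * Assumes E1000EState grows "e1000e_txd_props core.txd_props[2]". */
typedef struct e1000e_txd_props {
    uint8_t  sum_needed;
    uint8_t  ipcss;
    uint8_t  ipcso;
    uint16_t ipcse;
    uint8_t  tucss;
    uint8_t  tucso;
    uint16_t tucse;
    uint8_t  hdr_len;
    uint16_t mss;
    uint32_t paylen;
    int8_t   ip;
    int8_t   tcp;
    bool     tse;
    bool     cptse;
    bool     skip_cp;
} e1000e_txd_props;

static const VMStateDescription e1000e_vmstate_txd_props = {
    .name = "e1000e/txd_props",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT8(sum_needed, e1000e_txd_props),
        VMSTATE_UINT8(ipcss, e1000e_txd_props),
        VMSTATE_UINT8(ipcso, e1000e_txd_props),
        VMSTATE_UINT16(ipcse, e1000e_txd_props),
        VMSTATE_UINT8(tucss, e1000e_txd_props),
        VMSTATE_UINT8(tucso, e1000e_txd_props),
        VMSTATE_UINT16(tucse, e1000e_txd_props),
        VMSTATE_UINT8(hdr_len, e1000e_txd_props),
        VMSTATE_UINT16(mss, e1000e_txd_props),
        VMSTATE_UINT32(paylen, e1000e_txd_props),
        VMSTATE_INT8(ip, e1000e_txd_props),
        VMSTATE_INT8(tcp, e1000e_txd_props),
        VMSTATE_BOOL(tse, e1000e_txd_props),
        VMSTATE_BOOL(cptse, e1000e_txd_props),
        VMSTATE_BOOL(skip_cp, e1000e_txd_props),
        VMSTATE_END_OF_LIST()
    }
};

/* ...and then the device vmstate needs just one entry for both queues: */
        VMSTATE_STRUCT_ARRAY(core.txd_props, E1000EState, E1000E_NUM_QUEUES, 0,
                             e1000e_vmstate_txd_props, e1000e_txd_props),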
>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->eitr_intr_pending[idx]) {
>>> +        trace_e1000e_irq_throttling_no_pending_vec(idx);
>>> +        return;
>>> +    }
>>> +
>>> +    trace_e1000e_irq_msix_notify_postponed_vec(idx);
>>> +    msix_notify(timer->core->owner, idx);
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_initialize_all_timers(E1000ECore *core, bool create)
>>> +{
>>> +    int i;
>>> +
>>> +    core->radv.delay_reg = RADV;
>>> +    core->rdtr.delay_reg = RDTR;
>>> +    core->raid.delay_reg = RAID;
>>> +    core->tadv.delay_reg = TADV;
>>> +    core->tidv.delay_reg = TIDV;
>>> +
>>> +    core->radv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->rdtr.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->raid.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tadv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tidv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +
>>> +    core->radv.core = core;
>>> +    core->rdtr.core = core;
>>> +    core->raid.core = core;
>>> +    core->tadv.core = core;
>>> +    core->tidv.core = core;
>>> +
>>> +    core->itr.core = core;
>>
>> Couldn't we simply get the core pointer through container_of()?
>
> Unfortunately not. Timer abstraction functions are not aware of which
> specific timer they're dealing with.

Yes, it is.

>
>>
>>> +    core->itr.delay_reg = ITR;
>>> +    core->itr.delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +
>>> +    for (i = 0; i < E1000E_MSIX_VEC_NUM; i++) {
>>> +        core->eitr[i].core = core;
>>> +        core->eitr[i].delay_reg = EITR + i;
>>> +        core->eitr[i].delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +    }
>>> +
>>> +    if (!create) {
>>> +        return;
>>> +    }
>>> +
>>> +    core->radv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->radv);
>>> +    core->rdtr.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->rdtr);
>>> +    core->raid.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->raid);
>>> +
>>> +    core->tadv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->tadv);
>>> +    core->tidv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->tidv);
>>> +
>>> +    core->itr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                   e1000e_intrmgr_on_throttling_timer,
>>> +                                   &core->itr);
>>
>> Should we stop/start the above timers during vm stop/start through a vm
>> state change handler?
>
> Not sure. Is there any reason for this?

Otherwise we may raise an irq during vm stop?

[...]

>>> +
>>> +static inline void
>>> +e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
>>> +        { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
>>> +    };
>>
>> Instead of using a static array inside a function, why not just use a
>> global array? Then there's no need for this function, and the caller can
>> access it directly.
>
> Anyway, there should be a function to combine the two assignments below.
> And since this array is used by this function only, it is better kept
> hidden.

I mean you can avoid the assignment before each transmission by just
introducing arrays like:

    int tdt[] = { TDT, TDT1 };

And then access them through qidx, e.g.: core->mac[tdt[qidx]].
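In code, the idea looks roughly like this (untested; the array and helper
names are mine, the initializers are just the ones from the patch hoisted
to file scope):

/* Untested sketch: keep the per-queue register tables at file scope so
 * the data path can index them by queue number directly, instead of
 * filling in an E1000E_TxRing/E1000E_RxRing before every use. */
static const E1000E_RingInfo e1000e_tx_ring_info[E1000E_NUM_QUEUES] = {
    { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
    { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
};

static const E1000E_RingInfo e1000e_rx_ring_info[E1000E_NUM_QUEUES] = {
    { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
    { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
};

/* Or, for the common "read the tail pointer" case, a plain table of
 * register indices is enough: */
static const int e1000e_tdt_reg[E1000E_NUM_QUEUES] = { TDT, TDT1 };

static inline uint32_t e1000e_tx_tail(E1000ECore *core, int qidx)
{
    return core->mac[e1000e_tdt_reg[qidx]];
}

The data-path helpers could then take a queue index and look the
registers up directly, so the per-call *_ring_init() functions would no
longer be needed.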
>
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    txr->i = &i[idx];
>>> +    txr->tx = &core->tx[idx];
>>> +}
>>> +
>>> +typedef struct E1000E_RxRing_st {
>>> +    const E1000E_RingInfo *i;
>>> +} E1000E_RxRing;
>>> +
>>> +static inline void
>>> +e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
>>> +        { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
>>> +    };
>>
>> Similar issue as above.
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    rxr->i = &i[idx];
>>> +}
>>> +

[...]

>>> +    trace_e1000e_rx_start_recv();
>>> +
>>> +    for (i = 0; i <= core->max_queue_num; i++) {
>>> +        qemu_flush_queued_packets(qemu_get_subqueue(core->owner_nic, i));
>>
>> Is this the hardware behavior? I mean, does setting RDT of queue 0 also
>> flush queue 1? Looks like the correct behavior is to calculate the queue
>> index and flush only that queue.
>
> The issue is that there is no direct correspondence between virtio
> queues and device queues.
> There is an RSS mechanism in the middle that may re-route packets from
> virtio queue 0 to device queue 1 and vice versa, so we cannot know where
> each specific packet goes until it is read and processed.

Aha, I see.

>
>>
>>> +    }
>>> +}
>>> +
>>> +int
>>> +e1000e_can_receive(E1000ECore *core)
>>> +{
>>> +    int i;
>>> +
>>> +    bool link_up = core->mac[STATUS] & E1000_STATUS_LU;
>>> +    bool rx_enabled = core->mac[RCTL] & E1000_RCTL_EN;
>>> +    bool pci_master = core->owner->config[PCI_COMMAND] &
>>> +                      PCI_COMMAND_MASTER;
>>
>> Should we flush the queued packets when the guest enables bus mastering
>> (like what we do in e1000_write_config)?
>
> Yes, fixed.
>
>>
>>> +
>>> +    if (!link_up || !rx_enabled || !pci_master) {
>>> +        trace_e1000e_rx_can_recv_disabled(link_up, rx_enabled, pci_master);
>>> +        return false;
>>> +    }
>>> +
>>> +    for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>> +        E1000E_RxRing rxr;
>>> +
>>> +        e1000e_rx_ring_init(core, &rxr, i);
>>> +        if (e1000e_ring_enabled(core, rxr.i) &&
>>> +            e1000e_has_rxbufs(core, rxr.i, 1)) {
>>> +            trace_e1000e_rx_can_recv();
>>> +            return true;
>>
>> Similar issue: this will return true if any one of the queues has an
>> available buffer. This seems wrong.
>
> The same as above. This is done due to the RSS mechanism of the device.

Right, looks ok now, but there's also the case where, e.g., we want to
queue a packet to queue 0, but only queue 1 has an available buffer. May
need to check this in the future.

[...]

> Thanks for the review,
> Dmitry

You're welcome.