On 04/06/2016 04:22 PM, Dmitry Fleytman wrote:
> Hi Jason,
>
> Please see my comments below.
>
>> On 8 Mar 2016, at 11:31 AM, Jason Wang <jasow...@redhat.com> wrote:
>>
>> On 02/23/2016 01:37 AM, Leonid Bloch wrote:
>>> From: Dmitry Fleytman <dmitry.fleyt...@ravellosystems.com>
>>>
>>> This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.
>>>
>>> This implementation is derived from the e1000 emulation code, and
>>> utilizes the TX/RX packet abstractions that were initially developed for
>>> the vmxnet3 device. Although some parts of the introduced code may be
>>> shared with e1000, the differences are substantial enough that the
>>> only shared resources for the two devices are the definitions in
>>> hw/net/e1000_regs.h.
>>>
>>> Similarly to vmxnet3, the new device uses virtio headers for task
>>> offloads (for backends that support virtio extensions). Usage of
>>> virtio headers may be forcibly disabled via a boolean device property
>>> "vnet" (which is enabled by default). In that case, task offloads
>>> will be performed in software, in the same way it is done for
>>> backends that do not support virtio headers.
>>>
>>> The device code is split into two parts:
>>>
>>> 1. hw/net/e1000e.c: QEMU-specific code for a network device;
>>> 2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.
>>>
>>> The new device name is e1000e.
>>>
>>> Intel specifications for the 82574 controller are available at:
>>> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[...]

>>> +Device properties:
>>> +
>>> ++----------------+--------------------------------+--------+---------+
>>> +| Property name  | Description                    | Type   | Default |
>>> ++----------------+--------------------------------+--------+---------+
>>> +| subsys_ven     | PCI device Subsystem Vendor ID | UINT16 | 0x8086  |
>>> +|                |                                |        |         |
>>> +| subsys         | PCI device Subsystem ID        | UINT16 | 0       |
>>> +|                |                                |        |         |
>>> +| vnet           | Use virtio headers or perform  | BOOL   | TRUE    |
>>> +|                | SW offloads emulation          |        |         |
>>> ++----------------+--------------------------------+--------+---------+
>>
>> You could use PropertyInfo, which has a "description" field, to do this
>> better (e.g. qdev_prop_vlan). Then there would be no need for this file
>> at all.
>
> We replaced this file with source code comments.
> As for the PropertyInfo description, I can't see any way to pass a
> description to the DEFINE_PROP_* macros. Could you please elaborate on
> this?

You could use DEFINE_PROP(), which accepts a PropertyInfo parameter
(rough sketch further down).

>
>>
>>> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
>>> index 527d264..4201115 100644
>>> --- a/hw/net/Makefile.objs
>>> +++ b/hw/net/Makefile.objs

[...]

>>
>>> +    MemoryRegion io;
>>> +    MemoryRegion msix;
>>> +
>>> +    uint32_t ioaddr;
>>> +
>>> +    uint16_t subsys_ven;
>>> +    uint16_t subsys;
>>> +
>>> +    uint16_t subsys_ven_used;
>>> +    uint16_t subsys_used;
>>> +
>>> +    uint32_t intr_state;
>>> +    bool use_vnet;
>>> +
>>> +    E1000ECore core;
>>
>> What are the advantages of introducing another indirection level here?
>> It looks like it causes a lot of extra code overhead.
>
> This was done intentionally.
> Unlike e1000 and e1000e, which are really different devices, e1000e and
> newer Intel devices like ixgb* are rather similar. Introducing the
> e1000e core abstraction will simplify possible code reuse in the future.

Ok, I see.

>
>>
>>> +
>>> +} E1000EState;
>>> +

[...]

>>> +static NetClientInfo net_e1000e_info = {
>>> +    .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>> +    .size = sizeof(NICState),
>>> +    .can_receive = e1000e_nc_can_receive,
>>> +    .receive = e1000e_nc_receive,
>>> +    .receive_iov = e1000e_nc_receive_iov,
>>
>> All three of the functions above have a NetClientState parameter, so
>> there is probably no need to have "nc" in their names.
>
> The issue is that there are other functions with similar names but
> without _nc_...

Right.

>>
>>> +    .link_status_changed = e1000e_set_link_status,
>>> +};
>>> +

[...]

>>> +    for (i = 0; i < s->conf.peers.queues; i++) {
>>> +        nc = qemu_get_subqueue(s->nic, i);
>>> +        if (!nc->peer || !qemu_has_vnet_hdr(nc->peer)) {
>>> +            s->core.has_vnet = false;
>>
>> So in fact we're probing for vnet support instead of doing vnet on our
>> own if the backend does not support it. It seems to me that there's no
>> need to have a "vnet" property.
>
> This property is intended for forcibly disabling virtio header support.
> Possible use cases are verification of the SW offload logic and working
> around theoretical interoperability problems.

Ok, so maybe we'd better rename it to "disable_vnet".
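To make the two points above concrete (the rename plus DEFINE_PROP() with
a described PropertyInfo), something along these lines should work. This
is an untested sketch: the field name disable_vnet_hdr, the description
text and the place where the PropertyInfo is filled in are my
assumptions, not something taken from this patch.

/* Untested sketch for hw/net/e1000e.c: reuse qdev_prop_bool's accessors
 * but attach our own description string.  Assumes E1000EState has a
 * bool field named disable_vnet_hdr (inverted meaning of use_vnet). */
static PropertyInfo e1000e_prop_disable_vnet;

static Property e1000e_properties[] = {
    DEFINE_NIC_PROPERTIES(E1000EState, conf),
    DEFINE_PROP("disable_vnet", E1000EState, disable_vnet_hdr,
                e1000e_prop_disable_vnet, bool),
    DEFINE_PROP_END_OF_LIST(),
};

static void e1000e_class_init(ObjectClass *class, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(class);

    /* Copy the stock bool PropertyInfo and add a description to it;
     * only the address of e1000e_prop_disable_vnet is taken by the
     * property array above, so filling it in here is early enough. */
    e1000e_prop_disable_vnet = qdev_prop_bool;
    e1000e_prop_disable_vnet.description =
        "Do not use virtio headers, perform SW offloads emulation instead";

    dc->props = e1000e_properties;
    /* ... remaining class init ... */
}

With that, the property introspection (e.g. "-device e1000e,help") should
pick up the description, so a separate documentation file really isn't
needed.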
>
>>
>>> +            trace_e1000e_cfg_support_virtio(false);
>>> +            return;
>>> +        }
>>> +    }
>>> +
>>> +        VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
>>> +        VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
>>> +        VMSTATE_UINT16(core.tx[1].mss, E1000EState),
>>> +        VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].ip, E1000EState),
>>> +        VMSTATE_INT8(core.tx[1].tcp, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].tse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
>>> +        VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
>>
>> You can move the tx state into another structure and use VMSTATE_ARRAY()
>> here.
>
> Not sure I got your point. Could you please provide more details?

I mean, e.g., move skip_cp into e1000e_txd_props and move tx_pkt out of
e1000_tx. Then you can use VMSTATE_ARRAY to save and load an
e1000e_txd_props array (rough sketch further down).

>
>>
>>> +        VMSTATE_END_OF_LIST()
>>> +    }
>>> +};
>>> +
>>> +static Property e1000e_properties[] = {
>>> +    DEFINE_NIC_PROPERTIES(E1000EState, conf),
>>> +    DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
>>> +    DEFINE_PROP_UINT16("subsys_ven", E1000EState,
>>> +                       subsys_ven, PCI_VENDOR_ID_INTEL),
>>> +    DEFINE_PROP_UINT16("subsys", E1000EState, subsys, 0),
>>
>> I'm not quite sure why we need subsys_ven and subsys here?
>
> Some vendors (like VMware) may replace these device IDs with their own.
> These parameters provide a way to make the device look exactly like
> theirs. Useful in various V2V scenarios.

I get it. Thanks.

>
>>
>>> +    DEFINE_PROP_END_OF_LIST(),
>>> +};
>>> +

[...]

>>> +
>>> +static void
>>> +e1000e_intrmgr_on_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +
>>> +    assert(!msix_enabled(timer->core->owner));
>>
>> Can this assert be triggered by the guest? If so, let's remove it.
>
> No, this timer is armed for legacy interrupts only.
> This assertion would mean there is an internal logic error in the device,
> i.e. the device code did not check that we're working with legacy
> interrupts before arming it.

Ok.

>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->itr_intr_pending) {
>>> +        trace_e1000e_irq_throttling_no_pending_interrupts();
>>> +        return;
>>> +    }
>>> +
>>> +    if (msi_enabled(timer->core->owner)) {
>>> +        trace_e1000e_irq_msi_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    } else {
>>> +        trace_e1000e_irq_legacy_notify_postponed();
>>> +        e1000e_set_interrupt_cause(timer->core, 0);
>>> +    }
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_msix_throttling_timer(void *opaque)
>>> +{
>>> +    E1000IntrDelayTimer *timer = opaque;
>>> +    int idx = timer - &timer->core->eitr[0];
>>> +
>>> +    assert(msix_enabled(timer->core->owner));
>>
>> Same question as above.
>
> The same. This timer is for MSI-X only.

Ok.
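Coming back to the VMSTATE_ARRAY point above, this is roughly what I
meant. Untested sketch: the struct name, the field layout and the assumed
core.txd_props[] array in E1000EState are only illustrative; the exact
split (and where tx_pkt ends up) is up to you.

/* Untested sketch: gather the per-queue TX context-descriptor state into
 * one struct, so both queues are migrated with a single array entry
 * instead of spelling out core.tx[0].* and core.tx[1].* by hand.
 * Assumes E1000EState grows "e1000e_txd_props core.txd_props[2]". */
typedef struct e1000e_txd_props {
    uint8_t  sum_needed;
    uint8_t  ipcss;
    uint8_t  ipcso;
    uint16_t ipcse;
    uint8_t  tucss;
    uint8_t  tucso;
    uint16_t tucse;
    uint8_t  hdr_len;
    uint16_t mss;
    uint32_t paylen;
    int8_t   ip;
    int8_t   tcp;
    bool     tse;
    bool     cptse;
    bool     skip_cp;
} e1000e_txd_props;

static const VMStateDescription e1000e_vmstate_txd_props = {
    .name = "e1000e/txd_props",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT8(sum_needed, e1000e_txd_props),
        VMSTATE_UINT8(ipcss, e1000e_txd_props),
        VMSTATE_UINT8(ipcso, e1000e_txd_props),
        VMSTATE_UINT16(ipcse, e1000e_txd_props),
        VMSTATE_UINT8(tucss, e1000e_txd_props),
        VMSTATE_UINT8(tucso, e1000e_txd_props),
        VMSTATE_UINT16(tucse, e1000e_txd_props),
        VMSTATE_UINT8(hdr_len, e1000e_txd_props),
        VMSTATE_UINT16(mss, e1000e_txd_props),
        VMSTATE_UINT32(paylen, e1000e_txd_props),
        VMSTATE_INT8(ip, e1000e_txd_props),
        VMSTATE_INT8(tcp, e1000e_txd_props),
        VMSTATE_BOOL(tse, e1000e_txd_props),
        VMSTATE_BOOL(cptse, e1000e_txd_props),
        VMSTATE_BOOL(skip_cp, e1000e_txd_props),
        VMSTATE_END_OF_LIST()
    }
};

/* ...and then the device vmstate needs just one entry for both queues: */
        VMSTATE_STRUCT_ARRAY(core.txd_props, E1000EState, E1000E_NUM_QUEUES, 0,
                             e1000e_vmstate_txd_props, e1000e_txd_props),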
>
>>
>>> +
>>> +    timer->running = false;
>>> +
>>> +    if (!timer->core->eitr_intr_pending[idx]) {
>>> +        trace_e1000e_irq_throttling_no_pending_vec(idx);
>>> +        return;
>>> +    }
>>> +
>>> +    trace_e1000e_irq_msix_notify_postponed_vec(idx);
>>> +    msix_notify(timer->core->owner, idx);
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_initialize_all_timers(E1000ECore *core, bool create)
>>> +{
>>> +    int i;
>>> +
>>> +    core->radv.delay_reg = RADV;
>>> +    core->rdtr.delay_reg = RDTR;
>>> +    core->raid.delay_reg = RAID;
>>> +    core->tadv.delay_reg = TADV;
>>> +    core->tidv.delay_reg = TIDV;
>>> +
>>> +    core->radv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->rdtr.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->raid.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tadv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +    core->tidv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +
>>> +    core->radv.core = core;
>>> +    core->rdtr.core = core;
>>> +    core->raid.core = core;
>>> +    core->tadv.core = core;
>>> +    core->tidv.core = core;
>>> +
>>> +    core->itr.core = core;
>>
>> Couldn't we simply get the core pointer through container_of()?
>
> Unfortunately not. Timer abstraction functions are not aware of which
> specific timer they're dealing with.

Yes, it is.

>
>>
>>> +    core->itr.delay_reg = ITR;
>>> +    core->itr.delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +
>>> +    for (i = 0; i < E1000E_MSIX_VEC_NUM; i++) {
>>> +        core->eitr[i].core = core;
>>> +        core->eitr[i].delay_reg = EITR + i;
>>> +        core->eitr[i].delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +    }
>>> +
>>> +    if (!create) {
>>> +        return;
>>> +    }
>>> +
>>> +    core->radv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->radv);
>>> +    core->rdtr.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->rdtr);
>>> +    core->raid.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->raid);
>>> +
>>> +    core->tadv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->tadv);
>>> +    core->tidv.timer =
>>> +        timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer, &core->tidv);
>>> +
>>> +    core->itr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                   e1000e_intrmgr_on_throttling_timer,
>>> +                                   &core->itr);
>>
>> Should we stop/start the above timers during vm stop/start through a vm
>> state change handler?
>
> Not sure. Is there any reason for this?

Otherwise we may raise an irq during vm stop?

[...]

>>> +
>>> +static inline void
>>> +e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
>>> +        { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
>>> +    };
>>
>> Instead of using a static array inside a function, why not just use a
>> global array? Then there's no need for this function, and the caller can
>> access it directly.
>
> Anyway, there should be a function to combine the two assignments below.
> And since this array is used by this function only, it is better kept
> hidden.

I mean you can avoid the assignment before each transmission by just
introducing arrays like:

    int tdt[] = { TDT, TDT1 };

And then access them through qidx, e.g.: core->mac[tdt[qidx]].
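In code, the idea looks roughly like this (untested; the array and helper
names are mine, the initializers are just the ones from the patch hoisted
to file scope):

/* Untested sketch: keep the per-queue register tables at file scope so
 * the data path can index them by queue number directly, instead of
 * filling in an E1000E_TxRing/E1000E_RxRing before every use. */
static const E1000E_RingInfo e1000e_tx_ring_info[E1000E_NUM_QUEUES] = {
    { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
    { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
};

static const E1000E_RingInfo e1000e_rx_ring_info[E1000E_NUM_QUEUES] = {
    { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
    { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
};

/* Or, for the common "read the tail pointer" case, a plain table of
 * register indices is enough: */
static const int e1000e_tdt_reg[E1000E_NUM_QUEUES] = { TDT, TDT1 };

static inline uint32_t e1000e_tx_tail(E1000ECore *core, int qidx)
{
    return core->mac[e1000e_tdt_reg[qidx]];
}

The data-path helpers could then take a queue index and look the
registers up directly, so the per-call *_ring_init() functions would no
longer be needed.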
>
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    txr->i = &i[idx];
>>> +    txr->tx = &core->tx[idx];
>>> +}
>>> +
>>> +typedef struct E1000E_RxRing_st {
>>> +    const E1000E_RingInfo *i;
>>> +} E1000E_RxRing;
>>> +
>>> +static inline void
>>> +e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx)
>>> +{
>>> +    static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> +        { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
>>> +        { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
>>> +    };
>>
>> Similar issue as above.
>>
>>> +
>>> +    assert(idx < ARRAY_SIZE(i));
>>> +
>>> +    rxr->i = &i[idx];
>>> +}
>>> +

[...]

>>> +    trace_e1000e_rx_start_recv();
>>> +
>>> +    for (i = 0; i <= core->max_queue_num; i++) {
>>> +        qemu_flush_queued_packets(qemu_get_subqueue(core->owner_nic, i));
>>
>> Is this the hardware behavior? I mean, does setting RDT of queue 0 also
>> flush queue 1? Looks like the correct behavior is to calculate the queue
>> index and flush only that queue.
>
> The issue is that there is no direct correspondence between virtio
> queues and device queues.
> There is an RSS mechanism in the middle that may re-route packets from
> virtio queue 0 to device queue 1 and vice versa, so we cannot know where
> each specific packet goes until it is read and processed.

Aha, I see.

>
>>
>>> +    }
>>> +}
>>> +
>>> +int
>>> +e1000e_can_receive(E1000ECore *core)
>>> +{
>>> +    int i;
>>> +
>>> +    bool link_up = core->mac[STATUS] & E1000_STATUS_LU;
>>> +    bool rx_enabled = core->mac[RCTL] & E1000_RCTL_EN;
>>> +    bool pci_master = core->owner->config[PCI_COMMAND] &
>>> +                      PCI_COMMAND_MASTER;
>>
>> Should we flush the queued packets when the guest enables bus mastering
>> (like what we do in e1000_write_config)?
>
> Yes, fixed.
>
>>
>>> +
>>> +    if (!link_up || !rx_enabled || !pci_master) {
>>> +        trace_e1000e_rx_can_recv_disabled(link_up, rx_enabled, pci_master);
>>> +        return false;
>>> +    }
>>> +
>>> +    for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>> +        E1000E_RxRing rxr;
>>> +
>>> +        e1000e_rx_ring_init(core, &rxr, i);
>>> +        if (e1000e_ring_enabled(core, rxr.i) &&
>>> +            e1000e_has_rxbufs(core, rxr.i, 1)) {
>>> +            trace_e1000e_rx_can_recv();
>>> +            return true;
>>
>> Similar issue: this will return true if any one of the queues has an
>> available buffer. This seems wrong.
>
> The same as above. This is done due to the RSS mechanism of the device.

Right, looks ok now, but there's also the case where, e.g., we want to
queue a packet to queue 0, but only queue 1 has an available buffer. May
need to check this in the future.

[...]

> Thanks for the review,
> Dmitry

You're welcome.