Looks good, I can't recall the status of last version well but I remember Jason gave some comments about sanity check are quiet essential, have you addressed them?
Reviewed by: Wei Xu <w...@redhat.com> On Fri, Nov 09, 2018 at 04:58:27PM +0200, Yuri Benditovich wrote: > This commit adds implementation of RX packets > coalescing, compatible with requirements of Windows > Hardware compatibility kit. > > The device enables feature VIRTIO_NET_F_RSC_EXT in > host features if it supports extended RSC functionality > as defined in the specification. > This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4, > VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks > this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS > is also present. > > In case vhost is enabled the feature bit is cleared in > host_features during device initialization. > > If the guest driver acks VIRTIO_NET_F_RSC_EXT feature, > the device coalesces TCPv4 and TCPv6 packets (if > respective VIRTIO_NET_F_GUEST_TSO feature is on, > populates extended RSC information in virtio header > and sets VIRTIO_NET_HDR_F_RSC_INFO bit in header flags. > The device does not recalculate checksums in the coalesced > packet, so they are not valid. > > In this case: > All the data packets in a tcp connection are cached > to a single buffer in every receive interval, and will > be sent out via a timer, the 'virtio_net_rsc_timeout' > controls the interval, this value may impact the > performance and response time of tcp connection, > 50000(50us) is an experience value to gain a performance > improvement, since the whql test sends packets every 100us, > so '300000(300us)' passes the test case, it is the default > value as well, tune it via the command line parameter > 'rsc_interval' within 'virtio-net-pci' device, for example, > to launch a guest with interval set as '500000': > > 'virtio-net-pci,netdev=hostnet1,bus=pci.0,id=net1,mac=00, > guest_rsc_ext=on,rsc_interval=500000' > > The timer will only be triggered if the packets pool is not empty, > and it'll drain off all the cached packets. > > 'NetRscChain' is used to save the segments of IPv4/6 in a > VirtIONet device. > > A new segment becomes a 'Candidate' as well as it passed sanity check, > the main handler of TCP includes TCP window update, duplicated > ACK check and the real data coalescing. > > An 'Candidate' segment means: > 1. Segment is within current window and the sequence is the expected one. > 2. 'ACK' of the segment is in the valid window. > > Sanity check includes: > 1. Incorrect version in IP header > 2. An IP options or IP fragment > 3. Not a TCP packet > 4. Sanity size check to prevent buffer overflow attack. > 5. An ECN packet > > Even though, there might more cases should be considered such as > ip identification other flags, while it breaks the test because > windows set it to the same even it's not a fragment. > > Normally it includes 2 typical ways to handle a TCP control flag, > 'bypass' and 'finalize', 'bypass' means should be sent out directly, > while 'finalize' means the packets should also be bypassed, but this > should be done after search for the same connection packets in the > pool and drain all of them out, this is to avoid out of order fragment. > > All the 'SYN' packets will be bypassed since this always begin a new' > connection, other flags such 'URG/FIN/RST/CWR/ECE' will trigger a > finalization, because this normally happens upon a connection is going > to be closed, an 'URG' packet also finalize current coalescing unit. > > Statistics can be used to monitor the basic coalescing status, the > 'out of order' and 'out of window' means how many retransmitting packets, > thus describe the performance intuitively. > > Difference between ip v4 and v6 processing: > Fragment length in ipv4 header includes itself, while it's not > included for ipv6, thus means ipv6 can carry a real 65535 payload. > > Signed-off-by: Wei Xu <address@hidden> > Signed-off-by: Yuri Benditovich <yuri.benditov...@daynix.com> > --- > hw/net/virtio-net.c | 648 +++++++++++++++++++- > include/hw/virtio/virtio-net.h | 81 +++ > include/net/eth.h | 2 + > include/standard-headers/linux/virtio_net.h | 8 + > 4 files changed, 734 insertions(+), 5 deletions(-) > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c > index 385b1a03e9..43a7021409 100644 > --- a/hw/net/virtio-net.c > +++ b/hw/net/virtio-net.c > @@ -41,6 +41,28 @@ > #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE > #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE > > +#define VIRTIO_NET_IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ > + > +#define VIRTIO_NET_TCP_FLAG 0x3F > +#define VIRTIO_NET_TCP_HDR_LENGTH 0xF000 > + > +/* IPv4 max payload, 16 bits in the header */ > +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header)) > +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535 > + > +/* header length value in ip header without option */ > +#define VIRTIO_NET_IP4_HEADER_LENGTH 5 > + > +#define ETH_IP6_HDR_SZ (ETH_HDR_SZ + IP6_HDR_SZ) > +#define VIRTIO_NET_IP6_ADDR_SIZE 32 /* ipv6 saddr + daddr */ > +#define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD > + > +/* Purge coalesced packets timer interval, This value affects the performance > + a lot, and should be tuned carefully, '300000'(300us) is the recommended > + value to pass the WHQL test, '50000' can gain 2x netperf throughput with > + tso/gso/gro 'off'. */ > +#define VIRTIO_NET_RSC_INTERVAL 300000 > + > /* > * Calculate the number of bytes up to and including the given 'field' of > * 'container'. > @@ -628,6 +650,9 @@ static uint64_t virtio_net_get_features(VirtIODevice > *vdev, uint64_t features, > if (!get_vhost_net(nc->peer)) { > return features; > } > + /* clear RSC_EXT feature in case of VHOST*/ > + virtio_clear_feature(&features, VIRTIO_NET_F_RSC_EXT); > + > features = vhost_net_get_features(get_vhost_net(nc->peer), features); > vdev->backend_features = features; > > @@ -701,6 +726,11 @@ static void virtio_net_set_features(VirtIODevice *vdev, > uint64_t features) > virtio_has_feature(features, > VIRTIO_F_VERSION_1)); > > + n->rsc4_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) && > + virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4); > + n->rsc6_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) && > + virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6); > + > if (n->has_vnet_hdr) { > n->curr_guest_offloads = > virtio_net_guest_offloads_by_features(features); > @@ -781,6 +811,12 @@ static int virtio_net_handle_offloads(VirtIONet *n, > uint8_t cmd, > return VIRTIO_NET_ERR; > } > > + n->rsc4_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) > && > + virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO4); > + n->rsc6_enabled = virtio_has_feature(offloads, VIRTIO_NET_F_RSC_EXT) > && > + virtio_has_feature(offloads, VIRTIO_NET_F_GUEST_TSO6); > + virtio_clear_feature(&offloads, VIRTIO_NET_F_RSC_EXT); > + > supported_offloads = virtio_net_supported_guest_offloads(n); > if (offloads & ~supported_offloads) { > return VIRTIO_NET_ERR; > @@ -1253,6 +1289,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState > *nc, const uint8_t *buf, > } > > receive_header(n, sg, elem->in_num, buf, size); > + > offset = n->host_hdr_len; > total += n->guest_hdr_len; > guest_offset = n->guest_hdr_len; > @@ -1292,7 +1329,7 @@ static ssize_t virtio_net_receive_rcu(NetClientState > *nc, const uint8_t *buf, > return size; > } > > -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, > +static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf, > size_t size) > { > ssize_t r; > @@ -1303,6 +1340,601 @@ static ssize_t virtio_net_receive(NetClientState *nc, > const uint8_t *buf, > return r; > } > > +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, > + const uint8_t *buf, NetRscUnit* > unit) > +{ > + uint16_t ip_hdrlen; > + struct ip_header *ip; > + > + ip = (struct ip_header *)(buf + chain->n->guest_hdr_len > + + sizeof(struct eth_header)); > + unit->ip = (void *)ip; > + ip_hdrlen = (ip->ip_ver_len & 0xF) << 2; > + unit->ip_plen = &ip->ip_len; > + unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdrlen); > + unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10; > + unit->payload = htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hdrlen; > +} > + > +static void virtio_net_rsc_extract_unit6(NetRscChain *chain, > + const uint8_t *buf, NetRscUnit* > unit) > +{ > + struct ip6_header *ip6; > + > + ip6 = (struct ip6_header *)(buf + chain->n->guest_hdr_len > + + sizeof(struct eth_header)); > + unit->ip = ip6; > + unit->ip_plen = &(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen); > + unit->tcp = (struct tcp_header *)(((uint8_t *)unit->ip)\ > + + sizeof(struct ip6_header)); > + unit->tcp_hdrlen = (htons(unit->tcp->th_offset_flags) & 0xF000) >> 10; > + > + /* There is a difference between payload lenght in ipv4 and v6, > + ip header is excluded in ipv6 */ > + unit->payload = htons(*unit->ip_plen) - unit->tcp_hdrlen; > +} > + > +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *seg) > +{ > + int ret; > + struct virtio_net_hdr *h; > + > + h = (struct virtio_net_hdr *)seg->buf; > + h->flags = 0; > + h->gso_type = VIRTIO_NET_HDR_GSO_NONE; > + > + if (seg->is_coalesced) { > + h->rsc_ext_num_packets = seg->packets; > + h->rsc_ext_num_dupacks = seg->dup_ack; > + h->flags = VIRTIO_NET_HDR_F_RSC_INFO; > + if (chain->proto == ETH_P_IP) { > + h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4; > + } else { > + h->gso_type = VIRTIO_NET_HDR_GSO_TCPV6; > + } > + } > + > + ret = virtio_net_do_receive(seg->nc, seg->buf, seg->size); > + QTAILQ_REMOVE(&chain->buffers, seg, next); > + g_free(seg->buf); > + g_free(seg); > + > + return ret; > +} > + > +static void virtio_net_rsc_purge(void *opq) > +{ > + NetRscSeg *seg, *rn; > + NetRscChain *chain = (NetRscChain *)opq; > + > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { > + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { > + chain->stat.purge_failed++; > + continue; > + } > + } > + > + chain->stat.timer++; > + if (!QTAILQ_EMPTY(&chain->buffers)) { > + timer_mod(chain->drain_timer, > + qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout); > + } > +} > + > +static void virtio_net_rsc_cleanup(VirtIONet *n) > +{ > + NetRscChain *chain, *rn_chain; > + NetRscSeg *seg, *rn_seg; > + > + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { > + QTAILQ_REMOVE(&chain->buffers, seg, next); > + g_free(seg->buf); > + g_free(seg); > + } > + > + timer_del(chain->drain_timer); > + timer_free(chain->drain_timer); > + QTAILQ_REMOVE(&n->rsc_chains, chain, next); > + g_free(chain); > + } > +} > + > +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientState *nc, > + const uint8_t *buf, size_t size) > +{ > + uint16_t hdr_len; > + NetRscSeg *seg; > + > + hdr_len = chain->n->guest_hdr_len; > + seg = g_malloc(sizeof(NetRscSeg)); > + seg->buf = g_malloc(hdr_len + sizeof(struct eth_header) > + + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_PAYLOAD); > + memcpy(seg->buf, buf, size); > + seg->size = size; > + seg->packets = 1; > + seg->dup_ack = 0; > + seg->is_coalesced = 0; > + seg->nc = nc; > + > + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); > + chain->stat.cache++; > + > + switch (chain->proto) { > + case ETH_P_IP: > + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); > + break; > + case ETH_P_IPV6: > + virtio_net_rsc_extract_unit6(chain, seg->buf, &seg->unit); > + break; > + default: > + g_assert_not_reached(); > + } > +} > + > +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, > + NetRscSeg *seg, const uint8_t *buf, > + struct tcp_header *n_tcp, > + struct tcp_header *o_tcp) > +{ > + uint32_t nack, oack; > + uint16_t nwin, owin; > + > + nack = htonl(n_tcp->th_ack); > + nwin = htons(n_tcp->th_win); > + oack = htonl(o_tcp->th_ack); > + owin = htons(o_tcp->th_win); > + > + if ((nack - oack) >= VIRTIO_NET_MAX_TCP_PAYLOAD) { > + chain->stat.ack_out_of_win++; > + return RSC_FINAL; > + } else if (nack == oack) { > + /* duplicated ack or window probe */ > + if (nwin == owin) { > + /* duplicated ack, add dup ack count due to whql test up to 1 */ > + chain->stat.dup_ack++; > + return RSC_FINAL; > + } else { > + /* Coalesce window update */ > + o_tcp->th_win = n_tcp->th_win; > + chain->stat.win_update++; > + return RSC_COALESCE; > + } > + } else { > + /* pure ack, go to 'C', finalize*/ > + chain->stat.pure_ack++; > + return RSC_FINAL; > + } > +} > + > +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, > + NetRscSeg *seg, const uint8_t > *buf, > + NetRscUnit *n_unit) > +{ > + void *data; > + uint16_t o_ip_len; > + uint32_t nseq, oseq; > + NetRscUnit *o_unit; > + > + o_unit = &seg->unit; > + o_ip_len = htons(*o_unit->ip_plen); > + nseq = htonl(n_unit->tcp->th_seq); > + oseq = htonl(o_unit->tcp->th_seq); > + > + /* out of order or retransmitted. */ > + if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) { > + chain->stat.data_out_of_win++; > + return RSC_FINAL; > + } > + > + data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; > + if (nseq == oseq) { > + if ((o_unit->payload == 0) && n_unit->payload) { > + /* From no payload to payload, normal case, not a dup ack or etc > */ > + chain->stat.data_after_pure_ack++; > + goto coalesce; > + } else { > + return virtio_net_rsc_handle_ack(chain, seg, buf, > + n_unit->tcp, o_unit->tcp); > + } > + } else if ((nseq - oseq) != o_unit->payload) { > + /* Not a consistent packet, out of order */ > + chain->stat.data_out_of_order++; > + return RSC_FINAL; > + } else { > +coalesce: > + if ((o_ip_len + n_unit->payload) > chain->max_payload) { > + chain->stat.over_size++; > + return RSC_FINAL; > + } > + > + /* Here comes the right data, the payload length in v4/v6 is > different, > + so use the field value to update and record the new data len */ > + o_unit->payload += n_unit->payload; /* update new data len */ > + > + /* update field in ip header */ > + *o_unit->ip_plen = htons(o_ip_len + n_unit->payload); > + > + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be coalesced > + for windows guest, while this may change the behavior for linux > + guest (only if it uses RSC feature). */ > + o_unit->tcp->th_offset_flags = n_unit->tcp->th_offset_flags; > + > + o_unit->tcp->th_ack = n_unit->tcp->th_ack; > + o_unit->tcp->th_win = n_unit->tcp->th_win; > + > + memmove(seg->buf + seg->size, data, n_unit->payload); > + seg->size += n_unit->payload; > + seg->packets++; > + chain->stat.coalesced++; > + return RSC_COALESCE; > + } > +} > + > +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg *seg, > + const uint8_t *buf, size_t size, > + NetRscUnit *unit) > +{ > + struct ip_header *ip1, *ip2; > + > + ip1 = (struct ip_header *)(unit->ip); > + ip2 = (struct ip_header *)(seg->unit.ip); > + if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst) > + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) > + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { > + chain->stat.no_match++; > + return RSC_NO_MATCH; > + } > + > + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); > +} > + > +static int32_t virtio_net_rsc_coalesce6(NetRscChain *chain, NetRscSeg *seg, > + const uint8_t *buf, size_t size, NetRscUnit *unit) > +{ > + struct ip6_header *ip1, *ip2; > + > + ip1 = (struct ip6_header *)(unit->ip); > + ip2 = (struct ip6_header *)(seg->unit.ip); > + if (memcmp(&ip1->ip6_src, &ip2->ip6_src, sizeof(struct in6_address)) > + || memcmp(&ip1->ip6_dst, &ip2->ip6_dst, sizeof(struct in6_address)) > + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) > + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { > + chain->stat.no_match++; > + return RSC_NO_MATCH; > + } > + > + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); > +} > + > +/* Packets with 'SYN' should bypass, other flag should be sent after drain > + * to prevent out of order */ > +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, > + struct tcp_header *tcp) > +{ > + uint16_t tcp_hdr; > + uint16_t tcp_flag; > + > + tcp_flag = htons(tcp->th_offset_flags); > + tcp_hdr = (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10; > + tcp_flag &= VIRTIO_NET_TCP_FLAG; > + tcp_flag = htons(tcp->th_offset_flags) & 0x3F; > + if (tcp_flag & TH_SYN) { > + chain->stat.tcp_syn++; > + return RSC_BYPASS; > + } > + > + if (tcp_flag & (TH_FIN | TH_URG | TH_RST | TH_ECE | TH_CWR)) { > + chain->stat.tcp_ctrl_drain++; > + return RSC_FINAL; > + } > + > + if (tcp_hdr > sizeof(struct tcp_header)) { > + chain->stat.tcp_all_opt++; > + return RSC_FINAL; > + } > + > + return RSC_CANDIDATE; > +} > + > +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, > + NetClientState *nc, > + const uint8_t *buf, size_t size, > + NetRscUnit *unit) > +{ > + int ret; > + NetRscSeg *seg, *nseg; > + > + if (QTAILQ_EMPTY(&chain->buffers)) { > + chain->stat.empty_cache++; > + virtio_net_rsc_cache_buf(chain, nc, buf, size); > + timer_mod(chain->drain_timer, > + qemu_clock_get_ns(QEMU_CLOCK_HOST) + chain->n->rsc_timeout); > + return size; > + } > + > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { > + if (chain->proto == ETH_P_IP) { > + ret = virtio_net_rsc_coalesce4(chain, seg, buf, size, unit); > + } else { > + ret = virtio_net_rsc_coalesce6(chain, seg, buf, size, unit); > + } > + > + if (ret == RSC_FINAL) { > + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { > + /* Send failed */ > + chain->stat.final_failed++; > + return 0; > + } > + > + /* Send current packet */ > + return virtio_net_do_receive(nc, buf, size); > + } else if (ret == RSC_NO_MATCH) { > + continue; > + } else { > + /* Coalesced, mark coalesced flag to tell calc cksum for ipv4 */ > + seg->is_coalesced = 1; > + return size; > + } > + } > + > + chain->stat.no_match_cache++; > + virtio_net_rsc_cache_buf(chain, nc, buf, size); > + return size; > +} > + > +/* Drain a connection data, this is to avoid out of order segments */ > +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientState > *nc, > + const uint8_t *buf, size_t size, > + uint16_t ip_start, uint16_t ip_size, > + uint16_t tcp_port) > +{ > + NetRscSeg *seg, *nseg; > + uint32_t ppair1, ppair2; > + > + ppair1 = *(uint32_t *)(buf + tcp_port); > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { > + ppair2 = *(uint32_t *)(seg->buf + tcp_port); > + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) > + || (ppair1 != ppair2)) { > + continue; > + } > + if (virtio_net_rsc_drain_seg(chain, seg) == 0) { > + chain->stat.drain_failed++; > + } > + > + break; > + } > + > + return virtio_net_do_receive(nc, buf, size); > +} > + > +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, > + struct ip_header *ip, > + const uint8_t *buf, size_t size) > +{ > + uint16_t ip_len; > + > + /* Not an ipv4 packet */ > + if (((ip->ip_ver_len & 0xF0) >> 4) != IP_HEADER_VERSION_4) { > + chain->stat.ip_option++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ip option */ > + if ((ip->ip_ver_len & 0xF) != VIRTIO_NET_IP4_HEADER_LENGTH) { > + chain->stat.ip_option++; > + return RSC_BYPASS; > + } > + > + if (ip->ip_p != IPPROTO_TCP) { > + chain->stat.bypass_not_tcp++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ip fragment */ > + if (!(htons(ip->ip_off) & IP_DF)) { > + chain->stat.ip_frag++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ecn flag */ > + if (IPTOS_ECN(ip->ip_tos)) { > + chain->stat.ip_ecn++; > + return RSC_BYPASS; > + } > + > + ip_len = htons(ip->ip_len); > + if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header)) > + || ip_len > (size - chain->n->guest_hdr_len - > + sizeof(struct eth_header))) { > + chain->stat.ip_hacked++; > + return RSC_BYPASS; > + } > + > + return RSC_CANDIDATE; > +} > + > +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientState* nc, > + const uint8_t *buf, size_t size) > +{ > + int32_t ret; > + uint16_t hdr_len; > + NetRscUnit unit; > + > + hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len; > + > + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct > ip_header) > + + sizeof(struct tcp_header))) { > + chain->stat.bypass_not_tcp++; > + return virtio_net_do_receive(nc, buf, size); > + } > + > + virtio_net_rsc_extract_unit4(chain, buf, &unit); > + if (virtio_net_rsc_sanity_check4(chain, unit.ip, buf, size) > + != RSC_CANDIDATE) { > + return virtio_net_do_receive(nc, buf, size); > + } > + > + ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); > + if (ret == RSC_BYPASS) { > + return virtio_net_do_receive(nc, buf, size); > + } else if (ret == RSC_FINAL) { > + return virtio_net_rsc_drain_flow(chain, nc, buf, size, > + ((hdr_len + sizeof(struct eth_header)) + 12), > + VIRTIO_NET_IP4_ADDR_SIZE, > + hdr_len + sizeof(struct eth_header) + sizeof(struct > ip_header)); > + } > + > + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); > +} > + > +static int32_t virtio_net_rsc_sanity_check6(NetRscChain *chain, > + struct ip6_header *ip6, > + const uint8_t *buf, size_t size) > +{ > + uint16_t ip_len; > + > + if (((ip6->ip6_ctlun.ip6_un1.ip6_un1_flow & 0xF0) >> 4) > + != IP_HEADER_VERSION_6) { > + return RSC_BYPASS; > + } > + > + /* Both option and protocol is checked in this */ > + if (ip6->ip6_ctlun.ip6_un1.ip6_un1_nxt != IPPROTO_TCP) { > + chain->stat.bypass_not_tcp++; > + return RSC_BYPASS; > + } > + > + ip_len = htons(ip6->ip6_ctlun.ip6_un1.ip6_un1_plen); > + if (ip_len < sizeof(struct tcp_header) || > + ip_len > (size - chain->n->guest_hdr_len - sizeof(struct eth_header) > + - sizeof(struct ip6_header))) { > + chain->stat.ip_hacked++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ecn flag */ > + if (IP6_ECN(ip6->ip6_ctlun.ip6_un3.ip6_un3_ecn)) { > + chain->stat.ip_ecn++; > + return RSC_BYPASS; > + } > + > + return RSC_CANDIDATE; > +} > + > +static size_t virtio_net_rsc_receive6(void *opq, NetClientState* nc, > + const uint8_t *buf, size_t size) > +{ > + int32_t ret; > + uint16_t hdr_len; > + NetRscChain *chain; > + NetRscUnit unit; > + > + chain = (NetRscChain *)opq; > + hdr_len = ((VirtIONet *)(chain->n))->guest_hdr_len; > + > + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct > ip6_header) > + + sizeof(tcp_header))) { > + return virtio_net_do_receive(nc, buf, size); > + } > + > + virtio_net_rsc_extract_unit6(chain, buf, &unit); > + if (RSC_CANDIDATE != virtio_net_rsc_sanity_check6(chain, > + unit.ip, buf, size)) { > + return virtio_net_do_receive(nc, buf, size); > + } > + > + ret = virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); > + if (ret == RSC_BYPASS) { > + return virtio_net_do_receive(nc, buf, size); > + } else if (ret == RSC_FINAL) { > + return virtio_net_rsc_drain_flow(chain, nc, buf, size, > + ((hdr_len + sizeof(struct eth_header)) + 8), > + VIRTIO_NET_IP6_ADDR_SIZE, > + hdr_len + sizeof(struct eth_header) > + + sizeof(struct ip6_header)); > + } > + > + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); > +} > + > +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, > + NetClientState *nc, > + uint16_t proto) > +{ > + NetRscChain *chain; > + > + if ((proto != (uint16_t)ETH_P_IP) && (proto != (uint16_t)ETH_P_IPV6)) { > + return NULL; > + } > + > + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { > + if (chain->proto == proto) { > + return chain; > + } > + } > + > + chain = g_malloc(sizeof(*chain)); > + chain->n = n; > + chain->proto = proto; > + if (proto == (uint16_t)ETH_P_IP) { > + chain->max_payload = VIRTIO_NET_MAX_IP4_PAYLOAD; > + chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV4; > + } else { > + chain->max_payload = VIRTIO_NET_MAX_IP6_PAYLOAD; > + chain->gso_type = VIRTIO_NET_HDR_GSO_TCPV6; > + } > + chain->drain_timer = timer_new_ns(QEMU_CLOCK_HOST, > + virtio_net_rsc_purge, chain); > + memset(&chain->stat, 0, sizeof(chain->stat)); > + > + QTAILQ_INIT(&chain->buffers); > + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); > + > + return chain; > +} > + > +static ssize_t virtio_net_rsc_receive(NetClientState *nc, > + const uint8_t *buf, > + size_t size) > +{ > + uint16_t proto; > + NetRscChain *chain; > + struct eth_header *eth; > + VirtIONet *n; > + > + n = qemu_get_nic_opaque(nc); > + if (size < (n->host_hdr_len + sizeof(struct eth_header))) { > + return virtio_net_do_receive(nc, buf, size); > + } > + > + eth = (struct eth_header *)(buf + n->guest_hdr_len); > + proto = htons(eth->h_proto); > + > + chain = virtio_net_rsc_lookup_chain(n, nc, proto); > + if (chain) { > + chain->stat.received++; > + if (proto == (uint16_t)ETH_P_IP && n->rsc4_enabled) { > + return virtio_net_rsc_receive4(chain, nc, buf, size); > + } else if (proto == (uint16_t)ETH_P_IPV6 && n->rsc6_enabled) { > + return virtio_net_rsc_receive6(chain, nc, buf, size); > + } > + } > + return virtio_net_do_receive(nc, buf, size); > +} > + > +static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, > + size_t size) > +{ > + VirtIONet *n = qemu_get_nic_opaque(nc); > + if ((n->rsc4_enabled || n->rsc6_enabled)) { > + return virtio_net_rsc_receive(nc, buf, size); > + } else { > + return virtio_net_do_receive(nc, buf, size); > + } > +} > + > static int32_t virtio_net_flush_tx(VirtIONetQueue *q); > > static void virtio_net_tx_complete(NetClientState *nc, ssize_t len) > @@ -1342,7 +1974,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > ssize_t ret; > unsigned int out_num; > struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], > *out_sg; > - struct virtio_net_hdr_mrg_rxbuf mhdr; > + struct virtio_net_hdr_mrg_rxbuf hdr; > > elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement)); > if (!elem) { > @@ -1359,7 +1991,7 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > } > > if (n->has_vnet_hdr) { > - if (iov_to_buf(out_sg, out_num, 0, &mhdr, n->guest_hdr_len) < > + if (iov_to_buf(out_sg, out_num, 0, &hdr, n->guest_hdr_len) < > n->guest_hdr_len) { > virtio_error(vdev, "virtio-net header incorrect"); > virtqueue_detach_element(q->tx_vq, elem, 0); > @@ -1367,8 +1999,8 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q) > return -EINVAL; > } > if (n->needs_vnet_hdr_swap) { > - virtio_net_hdr_swap(vdev, (void *) &mhdr); > - sg2[0].iov_base = &mhdr; > + virtio_net_hdr_swap(vdev, (void *) &hdr); > + sg2[0].iov_base = &hdr; > sg2[0].iov_len = n->guest_hdr_len; > out_num = iov_copy(&sg2[1], ARRAY_SIZE(sg2) - 1, > out_sg, out_num, > @@ -2075,6 +2707,7 @@ static void virtio_net_device_realize(DeviceState *dev, > Error **errp) > nc = qemu_get_queue(n->nic); > nc->rxfilter_notify_enabled = 1; > > + QTAILQ_INIT(&n->rsc_chains); > n->qdev = dev; > } > > @@ -2104,6 +2737,7 @@ static void virtio_net_device_unrealize(DeviceState > *dev, Error **errp) > timer_free(n->announce_timer); > g_free(n->vqs); > qemu_del_nic(n->nic); > + virtio_net_rsc_cleanup(n); > virtio_cleanup(vdev); > } > > @@ -2184,6 +2818,10 @@ static Property virtio_net_properties[] = { > DEFINE_PROP_BIT64("ctrl_guest_offloads", VirtIONet, host_features, > VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, true), > DEFINE_PROP_BIT64("mq", VirtIONet, host_features, VIRTIO_NET_F_MQ, > false), > + DEFINE_PROP_BIT64("guest_rsc_ext", VirtIONet, host_features, > + VIRTIO_NET_F_RSC_EXT, false), > + DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout, > + VIRTIO_NET_RSC_INTERVAL), > DEFINE_NIC_PROPERTIES(VirtIONet, nic_conf), > DEFINE_PROP_UINT32("x-txtimer", VirtIONet, net_conf.txtimer, > TX_TIMER_INTERVAL), > diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h > index 4d7f3c82ca..0da99dcfa0 100644 > --- a/include/hw/virtio/virtio-net.h > +++ b/include/hw/virtio/virtio-net.h > @@ -44,6 +44,83 @@ typedef struct virtio_net_conf > uint8_t duplex; > } virtio_net_conf; > > +/* Coalesced packets type & status */ > +typedef enum { > + RSC_COALESCE, /* Data been coalesced */ > + RSC_FINAL, /* Will terminate current connection */ > + RSC_NO_MATCH, /* No matched in the buffer pool */ > + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp ctrl, etc */ > + RSC_CANDIDATE /* Data want to be coalesced */ > +} COALESCE_STATUS; > + > +typedef struct NetRscStat { > + uint32_t received; > + uint32_t coalesced; > + uint32_t over_size; > + uint32_t cache; > + uint32_t empty_cache; > + uint32_t no_match_cache; > + uint32_t win_update; > + uint32_t no_match; > + uint32_t tcp_syn; > + uint32_t tcp_ctrl_drain; > + uint32_t dup_ack; > + uint32_t dup_ack1; > + uint32_t dup_ack2; > + uint32_t pure_ack; > + uint32_t ack_out_of_win; > + uint32_t data_out_of_win; > + uint32_t data_out_of_order; > + uint32_t data_after_pure_ack; > + uint32_t bypass_not_tcp; > + uint32_t tcp_option; > + uint32_t tcp_all_opt; > + uint32_t ip_frag; > + uint32_t ip_ecn; > + uint32_t ip_hacked; > + uint32_t ip_option; > + uint32_t purge_failed; > + uint32_t drain_failed; > + uint32_t final_failed; > + int64_t timer; > +} NetRscStat; > + > +/* Rsc unit general info used to checking if can coalescing */ > +typedef struct NetRscUnit { > + void *ip; /* ip header */ > + uint16_t *ip_plen; /* data len pointer in ip header field */ > + struct tcp_header *tcp; /* tcp header */ > + uint16_t tcp_hdrlen; /* tcp header len */ > + uint16_t payload; /* pure payload without virtio/eth/ip/tcp */ > +} NetRscUnit; > + > +/* Coalesced segmant */ > +typedef struct NetRscSeg { > + QTAILQ_ENTRY(NetRscSeg) next; > + void *buf; > + size_t size; > + uint16_t packets; > + uint16_t dup_ack; > + bool is_coalesced; /* need recal ipv4 header checksum, mark here */ > + NetRscUnit unit; > + NetClientState *nc; > +} NetRscSeg; > + > +struct VirtIONet; > +typedef struct VirtIONet VirtIONet; > + > +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ > +typedef struct NetRscChain { > + QTAILQ_ENTRY(NetRscChain) next; > + VirtIONet *n; /* VirtIONet */ > + uint16_t proto; > + uint8_t gso_type; > + uint16_t max_payload; > + QEMUTimer *drain_timer; > + QTAILQ_HEAD(, NetRscSeg) buffers; > + NetRscStat stat; > +} NetRscChain; > + > /* Maximum packet size we can receive from tap device: header + 64k */ > #define VIRTIO_NET_MAX_BUFSIZE (sizeof(struct virtio_net_hdr) + (64 * KiB)) > > @@ -66,12 +143,16 @@ typedef struct VirtIONet { > VirtIONetQueue *vqs; > VirtQueue *ctrl_vq; > NICState *nic; > + QTAILQ_HEAD(, NetRscChain) rsc_chains; > uint32_t tx_timeout; > int32_t tx_burst; > uint32_t has_vnet_hdr; > size_t host_hdr_len; > size_t guest_hdr_len; > uint64_t host_features; > + uint32_t rsc_timeout; > + uint8_t rsc4_enabled; > + uint8_t rsc6_enabled; > uint8_t has_ufo; > uint32_t mergeable_rx_bufs; > uint8_t promisc; > diff --git a/include/net/eth.h b/include/net/eth.h > index e6dc8a7ba0..7f45c678e7 100644 > --- a/include/net/eth.h > +++ b/include/net/eth.h > @@ -177,6 +177,8 @@ struct tcp_hdr { > #define TH_PUSH 0x08 > #define TH_ACK 0x10 > #define TH_URG 0x20 > +#define TH_ECE 0x40 > +#define TH_CWR 0x80 > u_short th_win; /* window */ > u_short th_sum; /* checksum */ > u_short th_urp; /* urgent pointer */ > diff --git a/include/standard-headers/linux/virtio_net.h > b/include/standard-headers/linux/virtio_net.h > index 260c3681d7..0d8658c06a 100644 > --- a/include/standard-headers/linux/virtio_net.h > +++ b/include/standard-headers/linux/virtio_net.h > @@ -57,6 +57,10 @@ > * Steering */ > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ > > +#define VIRTIO_NET_F_RSC_EXT 38 > +#define VIRTIO_NET_F_GUEST_RSC4_DONT_USE 41 /* reserved */ > +#define VIRTIO_NET_F_GUEST_RSC6_DONT_USE 42 /* reserved */ > + > #define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device > * with the same MAC. > */ > @@ -104,6 +108,7 @@ struct virtio_net_config { > struct virtio_net_hdr_v1 { > #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset > */ > #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */ > +#define VIRTIO_NET_HDR_F_RSC_INFO 4 /* rsc_ext data in csum_ fields > */ > uint8_t flags; > #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */ > #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */ > @@ -118,6 +123,9 @@ struct virtio_net_hdr_v1 { > __virtio16 num_buffers; /* Number of merged rx buffers */ > }; > > +#define rsc_ext_num_packets csum_start > +#define rsc_ext_num_dupacks csum_offset > + > #ifndef VIRTIO_NET_NO_LEGACY > /* This header comes first in the scatter-gather list. > * For legacy virtio, if VIRTIO_F_ANY_LAYOUT is not negotiated, it must > -- > 2.17.1 > >