Re: [PATCH 0/2] dm-ioband: I/O bandwidth controller v1.7.0: Introduction

2008-10-09 Thread Ryo Tsuruta
Hi Dong-Jae,

> So, I tested dm-ioband and bio-cgroup patches with another IO testing
> tool, xdd ver6.5(http://www.ioperformance.com/),  after your reply.
> Xdd supports O_DIRECT mode and time limit options.
> I think, personally, it is proper tool for testing of IO controllers
> in Linux Container ML.

Xdd is really useful for me. Thanks for letting me know.

> And I found some strange points in the test results. In fact, they may
> not be strange to other people^^
> 
> 1. dm-ioband can control I/O bandwidth well in O_DIRECT mode (read and
> write); I think that result is very reasonable. But it can't control it
> in buffered mode, judging only from the output of xdd. I think the
> bio-cgroup patches are meant to solve this problem, is that right? If so,
> how can I check or confirm the role of the bio-cgroup patches?
>
> 2. As shown in the test results, the I/O performance in buffered mode
> is very low compared with O_DIRECT mode. In my opinion, the reverse
> would be more natural in real life.
> Can you give me an answer about it?

Your results show that all the xdd processes belong to the same cgroup.
Could you explain your test procedure to me in detail?

To know how many I/Os are actually issued to the physical device in
buffered mode within a measurement period, you should check the
/sys/block/<device>/stat file just before starting the test program and
just after it ends. The contents of the stat file are described in the
following document:
   kernel/Documentation/block/stat.txt
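
For instance, here is a minimal snapshot helper (a sketch only; the default
device name "sdb" and the 11-field layout documented in stat.txt are
assumptions). Run it just before and just after the test and subtract the
two snapshots to get the I/Os that actually reached the physical device:

/* blkstat-snap.c -- print the per-device I/O counters from sysfs. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sdb";  /* assumed device name */
	unsigned long long v[11];  /* rd ios/merges/sectors/ticks, wr ios/merges/sectors/ticks,
				      in_flight, io_ticks, time_in_queue */
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7],
		   &v[8], &v[9], &v[10]) != 11) {
		fprintf(stderr, "unexpected format in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("read I/Os %llu  read sectors %llu  write I/Os %llu  write sectors %llu\n",
	       v[0], v[2], v[4], v[6]);
	return 0;
}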

> 3. Compared with the physical bandwidth (measured with one process and
> without a dm-ioband device), the sum of the bandwidths through
> dm-ioband shows a considerable gap from the physical bandwidth. I
> wonder why. Is it overhead from dm-ioband or the bio-cgroup patches,
> or are there other reasons?

The following are the results on my PC with a SATA disk; there is
no big difference between with and without dm-ioband. Please try the
same thing if you have time.

without dm-ioband
=================
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/sdb1 \
  -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize

T  Q   Bytes      Ops    Time    Rate   IOPS    Latency  %CPU  OP_Type  ReqSize
0  16  140001280  17090  30.121  4.648  567.38  0.0018   0.01  write    8192

with dm-ioband
==============
* cgroup1 (weight 10)
# cat /cgroup/1/bio.id
1
# echo $$ > /cgroup/1/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 \
  -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T  Q   Bytes     Ops   Time    Rate   IOPS   Latency  %CPU  OP_Type  ReqSize
0  16  14393344  1757  30.430  0.473  57.74  0.0173   0.00  write    8192

* cgroup2 (weight 20)
# cat /cgroup/2/bio.id
2
# echo $$ > /cgroup/2/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 \
  -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T  Q   Bytes     Ops   Time    Rate   IOPS    Latency  %CPU  OP_Type  ReqSize
0  16  44113920  5385  30.380  1.452  177.25  0.0056   0.00  write    8192

* cgroup3 (weight 60)
# cat /cgroup/3/bio.id
3
# echo $$ > /cgroup/3/tasks
# xdd.linux -op write -queuedepth 16 -targets 1 /dev/mapper/ioband1 \
  -reqsize 8 -numreqs 128000 -verbose -timelimit 30 -dio -randomize
T  Q   Bytes     Ops    Time    Rate   IOPS    Latency  %CPU  OP_Type  ReqSize
0  16  82485248  10069  30.256  2.726  332.79  0.0030   0.00  write    8192

Total
=====
                 Bytes      Ops    Rate   IOPS
  w/o dm-ioband  140001280  17090  4.648  567.38
  w/  dm-ioband  140992512  17211  4.651  567.78

> > Could you give me the O_DIRECT patch?
> >
> Of course, if you want. But it is nothing special.
> Tiobench is a very simple and light tool, so I just added the
> O_DIRECT option to tiotest.c of the tiobench testing tool.
> Anyway, after I make a patch file, I will send it to you.

Thank you very much!

Ryo Tsuruta


Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-09 Thread Chris Wright
* Rusty Russell ([EMAIL PROTECTED]) wrote:
> On Thursday 09 October 2008 06:34:59 Mark McLoughlin wrote:
> > From: Herbert Xu <[EMAIL PROTECTED]>
> >
> > If segmentation offload is enabled by the host, we currently allocate
> > maximum sized packet buffers and pass them to the host. This uses up
> > 20 ring entries, allowing us to supply only 20 packet buffers to the
> > host with a 256 entry ring. This is a huge overhead when receiving
> > small packets, and is most keenly felt when receiving MTU sized
> > packets from off-host.
> 
> There are three approaches we should investigate before adding YA feature.  
> Obviously, we can simply increase the number of ring entries.

Tried that; it didn't help much.  I don't have my numbers handy, but it
levelled off at about 512 entries and gave only a modest boost.  It's still
wasteful to preallocate like that on the off-chance it's a large packet.

thanks,
-chris


Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-09 Thread Herbert Xu
On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
>
> There are three approaches we should investigate before adding YA feature.  
> Obviously, we can simply increase the number of ring entries.

That's not going to work so well, as you would need to increase the ring
size by a factor of MAX_SKB_FRAGS to achieve the same effect.

Basically the current scheme is either going to suck at non-TSO
traffic or it's going to chew up too many resources.

> Secondly, we can put the virtio_net_hdr at the head of the skb data (this is 
> also worth considering for xmit I think if we have headroom) and drop 
> MAX_SKB_FRAGS which contains a gratuitous +2.

That's fine, but having skb->data in the ring still means two
different kinds of memory in there, and it sucks when you only
have 1500-byte packets.

> Thirdly, we can try to coalesce contiguous buffers.  The page caching scheme 
> we have might help here, I don't know.  Maybe we should be explicitly trying 
> to allocate higher orders.

That's not really the key problem here.  The problem is
that the scheme we're currently using in virtio-net is simply
broken when it comes to 1500-byte packets.  Most of the
entries in the ring buffer go to waste.

We need a scheme that handles both 1500-byte packets as well
as 64K-byte size ones, and without holding down 16M of memory
per guest.

> > The size of the logical buffer is
> > returned to the guest rather than the size of the individual smaller
> > buffers.
> 
> That's a virtio transport breakage: can you use the standard virtio 
> mechanism, 
> just put the extended length or number of extra buffers inside the 
> virtio_net_hdr?

Sure that sounds reasonable.

> > Make use of this support by supplying single page receive buffers to
> > the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
> > the payload to the skb's linear data buffer and adjust the fragment
> > offset to point to the remaining data. This ensures proper alignment
> > and allows us to not use any paged data for small packets. If the
> > payload occupies multiple pages, we simply append those pages as
> > fragments and free the associated skbs.
> 
> > +   char *p = page_address(skb_shinfo(skb)->frags[0].page);
> ...
> > +   memcpy(hdr, p, sizeof(*hdr));
> > +   p += sizeof(*hdr);
> 
> I think you need kmap_atomic() here to access the page.  And yes, that will
> affect performance :(

No, we don't.  kmap would only be necessary for highmem, which we
did not request.
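
For illustration only (a hypothetical helper, not code from the patch): a
page allocated without __GFP_HIGHMEM always has a permanent kernel mapping,
so page_address() is valid and kmap_atomic() is unnecessary.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Sketch: allocate a receive page from lowmem.  GFP_ATOMIC does not include
 * __GFP_HIGHMEM, so the returned page can be addressed directly with
 * page_address(); kmap_atomic() would only be needed for highmem pages.
 */
static void *alloc_rx_page(struct page **pagep)
{
	struct page *page = alloc_page(GFP_ATOMIC);

	if (!page)
		return NULL;
	*pagep = page;
	return page_address(page);
}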

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-09 Thread Mark McLoughlin
On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote: 
> On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
> >
> > There are three approaches we should investigate before adding YA feature.  
> > Obviously, we can simply increase the number of ring entries.
> 
> That's not going to work so well as you need to increase the ring
> size by MAX_SKB_FRAGS times to achieve the same level of effect.
> 
> Basically the current scheme is either going to suck at non-TSO
> traffic or it's going to chew too much resources.

Yeah ... to put some numbers on it, assume we have a 256-entry ring now.

Currently, with GSO enabled in the host, the guest will fill this with 12
buffer heads of 20 buffers each (a 10-byte header buffer, an MTU-sized
buffer and 18 page-sized buffers).

That means we allocate ~900k for receive buffers and 12k for the ring,
fail to use 16 ring entries, and the ring ends up with a capacity of 12
packets. In the case of MTU-sized packets from an off-host source,
that's a huge amount of overhead for ~17k of data.

If we wanted to match the packet capacity that Herbert's suggestion
enables (i.e. 256 packets), we'd need to bump the ring size to 4k
entries (assuming we reduce it to 19 entries per packet). That would
mean allocating ~200k for the ring and ~18M in receive buffers. Again,
assuming MTU-sized packets, that's massive overhead for ~400k of data.
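
For reference, a small stand-alone program that reproduces the
back-of-the-envelope arithmetic above; the constants (MTU 1500, 4096-byte
pages, a 10-byte header buffer, an MTU-sized buffer and 18 page-sized
buffers per packet) are taken from the figures quoted in this thread, not
read from the driver:

/* ring-math.c -- illustrative only. */
#include <stdio.h>

int main(void)
{
	const int ring_entries    = 256;
	const int entries_per_pkt = 1 + 1 + 18;   /* hdr + MTU buf + page frags */
	const int pkts_per_ring   = ring_entries / entries_per_pkt;
	const int unused_entries  = ring_entries - pkts_per_ring * entries_per_pkt;
	const long bytes_per_pkt  = 10 + 1500 + 18L * 4096;
	const long prealloc_bytes = pkts_per_ring * bytes_per_pkt;

	printf("packets per %d-entry ring: %d (ring entries unused: %d)\n",
	       ring_entries, pkts_per_ring, unused_entries);
	printf("receive buffers preallocated: ~%ld KB\n", prealloc_bytes / 1024);
	return 0;
}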

> > Secondly, we can put the virtio_net_hdr at the head of the skb data (this 
> > is 
> > also worth considering for xmit I think if we have headroom) and drop 
> > MAX_SKB_FRAGS which contains a gratuitous +2.
> 
> That's fine but having skb->data in the ring still means two
> different kinds of memory in there and it sucks when you only
> have 1500-byte packets.

Also, including virtio_net_hdr in the data buffer would need another
feature flag. Rightly or wrongly, KVM's implementation requires
virtio_net_hdr to be the first buffer:

if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
    fprintf(stderr, "virtio-net header not in first element\n");
    exit(1);
}

i.e. it's part of the ABI ... at least as KVM sees it :-)

> > > The size of the logical buffer is
> > > returned to the guest rather than the size of the individual smaller
> > > buffers.
> > 
> > That's a virtio transport breakage: can you use the standard virtio 
> > mechanism, 
> > just put the extended length or number of extra buffers inside the 
> > virtio_net_hdr?
> 
> Sure that sounds reasonable.


I'll give that a shot.

Cheers,
Mark.



Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

2008-10-09 Thread Anthony Liguori
Mark McLoughlin wrote:
> 
> Also, including virtio_net_hdr in the data buffer would need another
> feature flag. Rightly or wrongly, KVM's implementation requires
> virtio_net_hdr to be the first buffer:
> 
> if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
>     fprintf(stderr, "virtio-net header not in first element\n");
>     exit(1);
> }
> 
> i.e. it's part of the ABI ... at least as KVM sees it :-)

This is actually something that's broken in a nasty way.  Having the 
header in the first element is not supposed to be part of the ABI but it 
sort of has to be ATM.

If an older version of QEMU were to use a newer kernel, and the newer
kernel had a larger header size, then if we just made the header be the
first X bytes, QEMU would have no way of knowing how many bytes that
should be.  Instead, the guest actually has to allocate the virtio-net
header in such a way that it only presents the size implied by the
features that the host supports.  We don't use a simple versioning
scheme, so you'd have to check for a combination of features advertised
by the host, but that's not good enough because the host may disable
certain features.
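
As an illustrative sketch of the feature-dependent sizing described here
(not the actual KVM or guest code): the guest could publish only as many
header bytes as the negotiated feature set implies. The struct and feature
names below come from later virtio headers and are assumptions in the
context of this thread.

#include <stdbool.h>
#include <stddef.h>
#include <linux/virtio_net.h>   /* struct virtio_net_hdr{,_mrg_rxbuf} */

/*
 * Sketch only: choose the header length from the negotiated features rather
 * than from a version number.  If VIRTIO_NET_F_MRG_RXBUF was offered and
 * accepted, the larger header is used; otherwise the basic header is used,
 * so an older host never sees bytes it doesn't understand.
 */
static size_t guest_net_hdr_len(bool mrg_rxbuf_negotiated)
{
	return mrg_rxbuf_negotiated ? sizeof(struct virtio_net_hdr_mrg_rxbuf)
				    : sizeof(struct virtio_net_hdr);
}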

Perhaps the header size should be that of the longest element that has
been commonly negotiated?

So that's why this aggressive check is here: not necessarily to cement
this into the ABI, but as a way to make someone figure out how to
sanitize this all.

Regards,

Anthony Liguori
