Hi, Dave/Toke.
Thanks in turn for your rapid responses.
It appears that much of this is straightforward. I have added a few
comments in line. If you are able to produce an updated draft, I'll
check through it again before the end of Last call and the IESG meeting.
Cheers,
Elwyn
On 09/03/2016 21:04, Dave Taht wrote:
thx for such a comprehensive review! That is the most in-depth
critique we've had in a long time.
Some comments inline.
On Wed, Mar 9, 2016 at 9:12 AM, Elwyn Davies <[email protected]> wrote:
I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair. Please treat these comments just
like any other last call comments.
For more information, please see the FAQ at
<http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>.
Document: draft-ietf-aqm-fq-codel-05.txt
Reviewer: Elwyn Davies
Review Date: 2016/03/06
IETF LC End Date: 2016/03/17
IESG Telechat date: (if known) 2016/03/17
Summary:
Almost ready. There are a couple of minor issues that are barely above the
level of nits plus a fair bit of editorial work.
I notice that the draft is on the IESG agenda on the day that last call
ends. If the authors wish to sort out the nits before the end of last call,
I will be happy to work with them on this.
Would be good.
Major issues:
None.
Minor issues:
Treatment of packets that don't fit into the hashing classification scheme:
The default FQ-CoDel hashing mechanism uses the protocol/addresses/ports
5-tuple, but there will be packets that don't fit this scheme (especially
ICMP). There is no mention of what the classification would do with these
packets.
How about?
The default hashing algorithm SHOULD make a best effort attempt at
consistently decoding a flow for any given protocol. It MUST fall back
to a 3 tuple for anything not specifically recognized
(srcip,dstip,protocol), by default.
That seems sensible - but use consistent words for destination/source
address and protocol number.
Did you see Brian Carpenter's note about issues with IPv6 when dealing
with hardware supported systems?
Locating the protocol number (and port numbers) in IPv6 packets in
packets with extension headers is a pain in the neck for fast hardware.
There isn't a lot that can be done about it immediately, but the problem
ought to be mentioned.
...
Arp is essentially the only exception to this. (LLC or other obscure
wire protocols? IPX?) Due to the permutation in the hash ARP will
always end up in a single, consistent bucket (not necessarily 0) - it
is so small and so infrequent that normally fq_codel will attempt to
push it out first except in the case of a hash collision. If you are
in a situation where you are sending a ton of arp requests, and codel
is dropping them... you've got other problems.
Agreed. Worth a few words perhaps.
I guess that one extra queue would probably suffice to hold any
such outliers, but it would be wise to say something about how the packets
from this/these queue(s) would be treated by the scheduler. It might also
be useful to say something about treatment of outliers in other
classification schemes, if only to say that the scheme used needs to think
about any such outliers.
OK, but see above.
s6.2, last para: The proposal to add flow related marking to encapsulated
packets needs to come with a very well exposed security and privacy health
warning.. [It occurs to me that it might be possible to confine these
markings to out of band notifications within the originating host which
would allow fq_codel or similar to Flow Queue the packets getting them into
a sensible order on the outgoing link. Once the packets had been ordered in
this way, a subsequent box (maybe an ADSL modem or suchlike) linking a home
environment to the outside world could work purely on source address,
thereby interspersing the packets from different hosts onto the external
link but not needing to reorder the packets from each individual host. Not
sure if this is a sensible idea but it looks like a way to avoid the privacy
issues.]
Obscuration happens a couple ways at the gateway - in linux the
hashing happens after the nat step in ipv4 typically, so the src ip is
the constant natted address, and we rely on the different ports,
protocol, and dest addr to distinguish the hash. In ipv6, it is end to
end, but in both cases the actual hash is permuted by a random number.
You can certainly see, from a packet capture off of a saturated link
that the traffic is being FQ'd. (and if ecn is enabled on the flows
and codel, when that is being applied) but I am not sure what this
actually "leaks" in terms of information.
Note that saturation does happen on tiny scales also - microbursts
will get broken up also on a nonsaturated link (example a typical TCP
dumps a whole IW10 via TSO on a wire at a gbit, then sent at 5mbit
over a cable modem)
I think, in trying to read what you wrote above, you are talking about
applying additional classification on a multi-band system like what
sqm-scripts does? Or applying (as many products do) some sort of
classification to raise the prio of voip or vpn packets and lower it
for things like torrent and backups?
This can get done several ways - We used dscp markings in cerowrt, as
that research was driven by trying to find rational e2e
classifications for multiple kinds of traffic (something that the dart
wg took on also, and cake has tried to implement), but aside from a
very few exceptions (vpn,dns) we found just fq'ing with short queue to
be superior approach.
A method more common (and lightweight) than touching the packet's dscp
is actually to use the native priorities built into the packet
classification system in linux to vector things into the right tier.
I am not sure we are talking about the same thing here.
I was concerned about the comment in the last sentence of s6.2:
Going forward, having some mechanism for opaque
encapsulations to express to the outer layer which flow a packet
belongs to, could be a way to mitigate this.
The case that was being discussed was encrypted tunnels, I believe,
typically for VPNs. In this case the only way you can determine what
flow a packet being tunnelled belongs to is (as the sentence suggests)
to add some extra information to the outer encapsulation. If this
escaped from the source machine, the security/privacy guys would have a
serious hissy fit, so the best you could do might be to either carry it
out-of-band from application to scheduler in the originating host (and
no further) or strip it from the packet before it gets out of the source
machine. If it gets out of the machine, the flow marking probably gives
enough traffic analysis clues to reveal a whole lot of stuff to a bad actor.
The aside was wondering if in a typical home network or small office,
the usual tree structure of networks would make it sufficient for the
default 5-tuple FQ-CoDel to work in the source machines and the outgoing
routers that manage the external links use only source address based
hashing to balance traffic across the various source machines in the
network, since the FQ-CoDel in the source machines would have
effectively made a single stream out of the machine's traffic already.
Please ignore me if I don't understand the problem!
s11: Internet Drafts are not scientific papers as such and do not usually
have a Conclusions section. I think it would be useful to move the
paragraph in s11 as is to s1. Since this draft is targeted for Experimental
RFC status, it would be useful to suggest (maybe in s7) that experiments
along the lines noted in s7 might be carried out and there results reported
(where? AQM WG?).
Given the developments with Cake, I am not sure whether
the authors expect this version of FQ-CoDel to make it onto standards track
or BCP, but to set out some sort of expected trajectory is desirable.
Darned if I know stds track or bcp. ask again in 3 years? fq_codel's
achieved wide deployment in Linux with no major issues reported, the
minor ones fixed over time.... the BSD implementation is landing,
which exposed some new potential issues (granularity of timestamps)
The same ideas are being tried in wifi. (the differences between
switched and shared mediums is proving important). Other AQMs
(docsis-pie) are entering the field.
I do not expect to see cake enter ietf processes, except as an
experimental results to iccrg or the aqm wgs. (and if we want to
re-open the wounds of dart, dart) The code is gpl/bsd dual licensed
and sufficiently complex to not want to reduce it to plain text. And
it's still out of tree, which means it will take years to see
deployment.
I've long thought that summoning the energy to ietf-standardize BQL (
https://lwn.net/Articles/469652/ ) with some document that also talks
about the negative effects of tcp packet offloads would be a good
idea.
Nits/editorial comments:
General: s/i.e./i.e.,/, s/e.g./e.g.,/
Draft Title: The acronym CoDel with an uppercase D is used everywhere else
in the document - Suggest the title should be FlowQueue-CoDel
K.
Abstract: Need to expand AQM. [DNS and and CPU are 'well-known' -
http://www.rfc-editor.org/materials/abbrev.expansion.txt]
General: It would be helpful to capitalize Quantum throughout (or at least
from s3 onwards) to emphasise that it is a configured value. Likewise
Interval and Target parameters. Maybe also Flow and Queue as they a defined
terms.
We'd used _param_ fairly consistently throughout, I thought. Not the
right thing?
With the ancient ASCII format, attempting to use fount emphasis doesn't
work very well. The (unwritten) rule is to use terms with a capitalized
first letter to indicate terms that you expect to find in the
terminology or are otherwise defined in the document. Whether this will
change in future with the transition to new formats is unclear, but for
the time being using capitalized names is pretty much universal.
s1, para 1: It would be helpful to give the full title (FlowQueue-CoDel),
expand AQM (again), provide a refernece explaining the term bufferbloat and
give references for AQM, basic CoDel, DRR and the ns2/ns3 network
simulators.
Oy.
I would think one or both of the following would be useful (long term
stable) refs for bufferbloat (the first is useful because the article is
publically available rather than needing a sub to IEEE or ACM):
Jim Gettys. 2011. Bufferbloat: Dark buffers in the Internet. IEEE Internet
Comput. 15, 3 (2011), 95–96. DOI:http://dx.doi.org/10.1109/MIC.2011.56
(freely available at
http://www.bufferbloat.net/attachments/27/IC-15-03-Backspace.pdf)
and/or
Jim Gettys and Kathleen Nichols. 2012. Bufferbloat:Dark buffers in the
Internet. Commun. ACM 55, 1 (2012), 1–15.
DOI:http://dx.doi.org/10.1145/2063176.2063196
k.
Suggest:
OLD:
The FQ-CoDel algorithm is a combined packet scheduler and AQM
developed as part of the bufferbloat-fighting community effort. It
is based on a modified Deficit Round Robin (DRR) queue scheduler,
with the CoDel AQM algorithm operating on each queue. This document
describes the combined algorithm; reference implementations are
available for ns2 and ns3 and it is included in the mainline Linux
kernel as the fq_codel queueing discipline.
NEW:
The FlowQueue-CoDel (FQ-CoDel) algorithm is a combined packet scheduler
and
Active Queue Management (AQM) [RFC 3168] algorithm
developed as part of the bufferbloat-fighting community effort (see
http://www.bufferbloat.net). It
is based on a modified Deficit Round Robin (DRR) queue scheduler
[DRR][DRRPP],
with the CoDel AQM
If we are going to cite these other things, cite [CODELDRAFT] here?
Yes.. surely!
algorithm operating on each queue. This document
describes the combined algorithm; reference implementations are
available for the ns2 (http://nsnam.sourceforge.net/wiki) and
ns3 (https://www.nsnam.org/wiki) network simulators, and it is
included in the mainline Linux kernel as the fq_codel queueing
discipline.
END
Sure. I note (sadly) that the ns3 version is still not in the mainline
ns3 code due to issues with exposing the header properly and other
integration issues.
But shouldn't the refs above be [NS2], [NS3], and then referred to as
links in the
normative references rather than inline?
For links you can do it either way. If you are using xml2rfc, <eref/>
is your friend here.
s1.2, definition of Flow: s/protocol/protocol number/; also mac is ambiguous
- s/mac address/media access control (MAC) address/
K.
s1.2, definitions of Flow and Queue: To make the mentions of hash in the
definition of Queue comprehensible, it would be sensible to mention 'hash'
in the definition of flow. Suggest adding:
The identifier(s) of a flow are mapped to a hash code used internally in
FQ-CoDel to organize the packets into Queues.
K.
s1.2: Probably worthwhile adding the references for CoDel and DRR to the
definitions.
s1.2: It would help with subsequent understanding to provide definitions of
the Interval and Target parameters of the CoD
el scheme. Suggest:
ADD:
Interval: Characteristic time period used by the control loop of CoDel to
age the value of the minimum sojourn time of packets in a Queue (i.e., the
time spent by the packet in the local Queue) that is used to indicate when a
persistent Queue is developing (see Section 4.3 of [CODELDRAFT]).
An issue with final acceptance is that the codel draft is not yet
approved, either.
That doesn't really affect 'acceptance'. The CoDel draft is a WG draft
which means that it has a pretty high probability of becoming an RFC -
and in the light of its state and who is backing it, I would say that
the CoDel draft has a probability of well in excess of 99% of becoming
an RFC (a.s. as we mathematicians say ;-) ).
Referencing drafts normatively means that they will get held up at the
final stage of the RFC Editor queue as a 'cluster' until the tree of
references becomes RFCs together. You can expedite things by
encouraging the authors of the CoDel draft to get their work finished.
In fact looking at the email list, there seems to have been one limited
comment at WGLC and that would indicate that it is time to bully Wes to
get the shepherd writeup done and get it submitted for publication
unless there is some background hassle.
Target: Setpoint value of the minimum sojourn time of packets in a Queue
used as the target of the control loop in CoDel (see Section 4.4 of
[CODELDRAFT]).
END
s1.3, para 1: It would be helpful to expand SQF: s/SQF/Shortest Queue First
(SQF)/
s1.3, para 1: s/differently than/differently from/ (the latter is good for
both US and British English)
s1.3, para 2: Suggest starting the para with 'By default,' so that the
'although' bit is not a surprise.
s1.3, para 2: Adding a reference for the CoDel algorithm would help. In
particular, linking to http://queue.acm.org/appendices/codel.html as well as
the basic article would be helpful - I am not sure if the CoDel article is
available for free to all but the appendix seems to be.
s1.3, para 4: s/current implementation/Linux implementation at the time of
writing/
s1.3, para 5: s/rather than a packet-based./rather than packet-based./
s1.3, para 6 (next to last para):
K.
However, in practice many things that are
beneficial to have prioritised for typical internet use (ACKs, DNS
lookups, interactive SSH, HTTP requests, ARP, RA, ICMP, VoIP) _tend_
to fall in this category, which is why FQ-CoDel performs so well for
many practical applications.
I am doubtful as whether mentioning ARP, RA and ICMP in this statement is
appropriate since they don't fit into the default flow classification
scheme, and are not really flows as such.
RA and ICMP are hashed on the 3-tuple.
Don't have a problem with dropping these from the text, although
frequently icmp could be considered as a flow.
The point tho was that arp,ra,icmp do land generally in their own
buckets and thus are "prioritized". The language here was also
intended to deprecate the word "priority" per se' as it's more
commonly used for diffserv - sparseness, fq, short queues are what
works here.
As mentioned in Minor Issues
above some discussion of non-flow packets is needed IMO. They could be left
out here without problem I think.
s3.1, para 1: s/buckets are configurable/buckets is configurable/
s3.1, para 3: s/This number is is/This number is/
s3.1, para 3:
OLD:
until the number of credits runs into the negative
NEW:
until the value of _byte credit_ becomes negative
I tend to prefer the language matches the code.
... but 'runs into the negative' is not very comprehensible English
(probably wouldn't get past the RFC Editor!)
END
What about _byte_credit_ == 0? (Just asking! Did possibly think that if
the queue has the same number of bytes as the credit, it might be possible
for the queue to jam in the active list while empty.)
You worry me. I'll go look.
s3.1,para 4: s/fairness queueing/fairness queueing scheme/
s3.1, last para: s/flow-building queues/queues that build a backlog/
s4.1, para 2: A reference for Jenkins hash functions would be desirable. It
seems the best we can probably do is Jenkins' web site:
http://www.burtleburtle.net/bob/hash/doobs.html
k.
s4.1: The basic Jenkins has used in the Linux scheduler code appears to be
targeted at IPv4 32 bit addresses. Technically, the abstract definition
doesn't need to worry about the lengths of addresses, but does anything
special need to be said about IPv6? I couldn't see in the Linux code where
IPv6 is dealt with. Furthermore, looking at the code a bit more closely, I
am not sure how the 112 bits (32 + 32 + 16 + 16 + 16) of the 5-tuple are
pushed into the hashing algorithm which has an input 'bandwidth' of 64 bits.
Again this doesn't desperately matter here but, if I understand correctly,
the calculations in s5.3 of hash collision probabilities reported in para 2
of s5.3 relate to the Jenkins update3 has algorithm used in Linux. This
isn't explicitly stated. I am not sure if the probabilities would be
different if the number of bits in the 5-tuple was greater - presumably this
depends to some extent on how the Jenkins algorithms are used. The core
Linux algorithm just uses the 'final' part of the algorithm. According to
the Jenkins comments, one probably ought to use the 'mix' part to combine
additional bits. Thoughts?
I will talk to the hashing issue in another message. It's evolved
quite a bit, please
take a look at the new hashing api that landed in linux 4.3 and later.
OK. I haven't done that yet.
s4.1, para 2: I am not sure that 'permuted by a random value' is very
clear, and it doesn't explain why this is done.
How about 'salted by modular addition of a random value'? It would also be
helpful to note here that this is done to mitigate a possible DoS attack
that could occur if the hash is externally predictable and point to the note
about this in the Security Considerations (s8).
k.
s4.1.1, para 1: s/mac address/MAC address/; s/diffserv/diffserve codepoint
values/
[I am not sure what firewall specific markings might be - any references?]
s4.1.1, para 2: Suggest:
OLD:
An alternative to changing the classification scheme is to perform
decapsulation prior to hashing.
NEW:
An addition to the default classification scheme is to perform
decapsulation of tunnelled packets prior to hashing on the 5-tuple in the
encapsulated.
END
Figure in s4.2: Needs a caption and a Figure number. It is also somewhat
incomplete as a state diagram. Both New and Old states have actions that
result from both Arrival/Enqueue and Dequeue events, and there are
'loopback' actions when there are Arrival/Enqueue and Dequeue events that
occur in both New and Old states when the conditions are not satisfied.
You are the first person to comment on that diagram ever!!!
Proud to be first ;-)
s5.2.1, para 1:
to ensure that the measured minimum delay does not become too stale.
Without detailed knowledge of CoDel this didn't read very easily - it is
taken out of context from the CoDel draft.
Suggest:
to ensure that the minimum sojourn time of packets in a queue used as an
estimator by the CoDel control algorithm is a relatively up-to-date value.
s5.2.1, para 1: s/It SHOULD be set on the order/It SHOULD be set to be on
the order/
"the order" implies a much less tight relationship that what I'd like, btw.
How about s/on the order/in the range/?
s5.2.3, para 2: s/10GigE/10 Gigabit Ethernet/
s5.2.4, para 2: Expand TSO (TCP Segmentation Offload). TSO probably could do
with a reference (Freimuth below should work).
s5.2.4, para 2: I think GRO-enabled should probably be GSO-enabled -
whichever it is, it needs to be expanded (Generic Segmentation Offload?).
There is also a web page for GSO (https://lwn.net/Articles/188489/)
GRO traffic is actually been one of our bigger headaches on several of
the deployed routers out there - the edgerouter, the latest linksys
1200ac (armada 385) (gro peeling is solved in cake)
Your typical cheapo router recieves traffic in a burst, GRO's it, then
tries to send it as one packet. 64k is often seen... dropping the
whole thing is catastrophic, splitting it back up
GSO/TSO are endpoint generated problems, GRO a router specific one. I
strongly suspect some CMTSes and other high end gear are also cheating
by bunching up packets into larger entities.
Fine.. I didn't know about any of these terms till I started reading the
draft! I *am* out of touch with the guts of network stacks... One upon
a time I knew all about RED, WFQ, etc, etc. Now I am obsessed with DTN.
s5.2.5, para 2 and s4.1, para 2: It would be sensible to use a common term
for 'initialisation time' and 'load time'.
s5.2.6, title and body: Need to expand ECN on first use. Also s/thus
unresponsive/thus the number of unresponsive/
s5.2.7: Expand CE (Congestion Encountered) on first occurrence.
s5.2.7, para 1: Expand DCTCP acronym and provide reference [Alizadeh] below
and/or [draft-ietf-tcpm-dctcp]....thus..
OLD:
This parameter enables DCTCP-like processing to enable CE marking ECT
packets at a lower setpoint than the default codel target.
NEW:
This parameter enables Date Centre TCP (DCTCP)-like processing resulting
in
CE (Congestion Encountered) marking on ECN-Capable Transport (ECT)
packets [RFC3168] starting at a lower sojourn delay setpoint than the
default CoDel
Target. Details of DCTCP can be found in [Alizadeh2011] and
[I-D.draft-ietf-tcpm-dctcp].
END
s5.3, para 2: see the comments on s4.1 above regarding the link between
probabilities and address lengths.
s5.3, para 3: Need to expand Cake acronym and providepointer to Section 7
which expalisn about Cake.
s5.4: Is it possible to add some comments to the codel_vars structure?
s5.5, para 2: s/resolution below the target/resolution significantly finer
than the CoDel Target setting/
s5.6: Need to expand SFQ, WFQ and QFQ. There is a reference for SFQ below
[McKenney]
We definately talk to ourselves too much. At the same time, the
acronym expansion is adding pages to this doc.
s5.7, last para: s/doesn't miss a beat/reacts almost immediately/ (Original
is rather too colloquial!)
yes.
s6.1, last para: s/classification/service classification/
s7: I might add 'Checking behaviour when two or more instantiations of
FQ-CoDel are used in series.'
s12.2: Various of the informative references could do with completion. I
think the referecnes below are correct:
Kathleen Nichols and Van Jacobson. 2012. "Controlling Queue Delay". Queue
10, 5 (May 2012), 20. DOI:http://dx.doi.org/10.1145/2208917.2209336
M. Shreedhar and George Varghese. 1996. "Efficient fair queuing using
deficit round-robin". IEEE/ACM Trans. Netw. 4, 3 (June 1996), 375–385.
DOI:http://dx.doi.org/10.1109/90.502236
M.H. MacGregor and W. Shi. 2000. "Deficits for bursty latency-critical
flows: DRR++". In Proceedings IEEE International Conference on Networks 2000
(ICON 2000). Networking Trends and Challenges in the New Millennium. IEEE
Comput. Soc, 287–293. DOI:http://dx.doi.org/10.1109/ICON.2000.875803
Y. Gong, D. Rossi, C. Testa, S. Valenti, and M.D. Taht. 2013. "Fighting the
bufferbloat: On the coexistence of AQM and low priority congestion control".
In 2013 IEEE Conference on Computer Communications Workshops (INFOCOM
WKSHPS). IEEE, 411–416. DOI:http://dx.doi.org/10.1109/INFCOMW.2013.6562885
Giovanna Carofiglio and Luca Muscariello. 2012. "On the Impact of TCP and
Per-Flow Scheduling on Internet Performance". IEEE/ACM Trans. Netw. 20, 2
(April 2012), 620–633. DOI:http://dx.doi.org/10.1109/TNET.2011.2164553
The following are extra references that I suggested above:
P.E. McKenney. 1990. Stochastic fairness queueing. In Proceedings. IEEE
INFOCOM ’90: Ninth Annual Joint Conference of the IEEE Computer and
Communications Societies@m_The Multiple Facets of Integration. IEEE Comput.
Soc. Press, 733–740. DOI:http://dx.doi.org/10.1109/INFCOM.1990.91316
I think this was the short version. We reference the long version as
that's where
a ton of new ideas came from.
Doug Freimuth et al. 2005. Server network scalability and TCP offload. In
USENIX Annual Technical Conference. USENIX Association, 209–222.
Mohammad Alizadeh et al. 2011. Data center TCP (DCTCP). ACM SIGCOMM Comput.
Commun. Rev. 41, 4 (October 2011), 63–74.
DOI:http://dx.doi.org/10.1145/2043164.1851192
_______________________________________________
Gen-art mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/gen-art