Hi, I would like files that are distributed from multiple mirrors to
work better with caching proxies, and I hope to write a Traffic Server
plugin to help with this.
I would love any input or feedback on how mirrors can work better with
Traffic Server.
The approach I am taking for my initial attempt is to use RFC 6249,
Metalink/HTTP: Mirrors and Hashes. I listen for responses that are an
HTTP redirect and have "Link: <...>; rel=duplicate" headers, then I scan
those URLs for one that already exists in the cache. If one is found, I
transform the response, replacing the "Location: ..." header with the
URL that is already cached.
Later, I would also like to use RFC 3230, Instance Digests in HTTP, and
find a way to look up URLs in the Traffic Server cache by content digest.
I gather that ATS does create checksums of content stored in the cache,
but doesn't support looking up content by digest. Some possibilities
include extending the core with new APIs to accomplish this, or having a
plugin add additional entries to the ATS cache for content digests.
Alternatively, a separate store could be used, e.g. Kyoto Cabinet or
memcached.
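To make the "separate store" idea concrete, here is a minimal sketch of a digest-to-URL side index. Everything here is made up for illustration (the function names, the fixed-size table, the in-memory storage); a real plugin would presumably key into the ATS cache or an external store instead:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical side index mapping a content digest (e.g. the value of a
 * "Digest: SHA-256=..." header) to a URL already in the cache. The names
 * and the fixed-size in-memory table are illustrative only. */
#define MAX_ENTRIES 64

struct digest_entry {
  char digest[128]; /* e.g. "SHA-256=MWVh..." */
  char url[256];
};

static struct digest_entry table[MAX_ENTRIES];
static int num_entries;

/* Record that content with this digest is cached under this URL. */
static int
digest_index_put(const char *digest, const char *url)
{
  if (num_entries >= MAX_ENTRIES)
    return -1;
  snprintf(table[num_entries].digest, sizeof(table[num_entries].digest), "%s", digest);
  snprintf(table[num_entries].url, sizeof(table[num_entries].url), "%s", url);
  num_entries++;
  return 0;
}

/* Return a cached URL for this digest, or NULL if none is known. */
static const char *
digest_index_get(const char *digest)
{
  for (int i = 0; i < num_entries; i++) {
    if (strcmp(table[i].digest, digest) == 0)
      return table[i].url;
  }
  return NULL;
}
```

The point is only to show the shape of the lookup; persistence, eviction, and concurrency are all open questions.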
Some further ideas for download mirrors and Traffic Server include:
* Remember lists of mirrors, so that future requests for any of these
URLs use the same cache key. A problem is how to prevent a malicious
domain from distributing false information about URLs it doesn't
control; this could be addressed with a whitelist of trusted domains
* Make decisions about the best mirror to choose, e.g. the one that is
most cost efficient, fastest, or closest
* Use content digests to detect or repair download errors
A first attempt at a plugin is up on GitHub: https://github.com/jablko/dedup
I would love any feedback on this code:
1. I assume I want to minimize cache lookups, so I first check that a
response has both a "Location: ..." header and a "Link: <...>;
rel=duplicate" header.
2. Then I check whether the "Location: ..." URL already exists in the
cache. If so, I just reenable the response.
3. Otherwise, I check whether the "Link: <...>; rel=duplicate" URL
already exists in the cache. If so, I rewrite the "Location: ..." header
and reenable the response.
4. I continue scanning "Link: <...>; rel=duplicate" headers until a
URL is found that already exists in the cache. If none is found, I just
reenable the response without any changes.
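The steps above can be sketched roughly as follows. Note that url_is_cached() is a stand-in I made up for the real TSCacheKeyDigestFromUrlSet()/TSCacheRead() sequence, and the header handling is stripped away so only the decision logic remains:

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the real asynchronous cache check; here it just consults
 * a fixed list of "cached" URLs so the logic can be exercised. */
static const char *cached_urls[] = {
  "http://mirror-b.example/file.iso",
};

static bool
url_is_cached(const char *url)
{
  for (size_t i = 0; i < sizeof(cached_urls) / sizeof(cached_urls[0]); i++) {
    if (strcmp(cached_urls[i], url) == 0)
      return true;
  }
  return false;
}

/* Given the Location URL and the rel=duplicate URLs from the Link
 * headers, decide what the Location header should become. Returns the
 * original Location when it is already cached, or when no duplicate is
 * cached either (steps 2 and 4); otherwise the first cached duplicate
 * (step 3). */
static const char *
choose_location(const char *location, const char **duplicates, size_t n)
{
  if (url_is_cached(location))
    return location;
  for (size_t i = 0; i < n; i++) {
    if (url_is_cached(duplicates[i]))
      return duplicates[i];
  }
  return location;
}
```

In the plugin itself this can't be a simple loop, since each TSCacheRead() completes via a continuation callback, but the decision tree is the same.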
I use TS_HTTP_SEND_RESPONSE_HDR_HOOK to work on responses sent from the
cache to clients, rather than responses sent from the origin to the
cache, because when the redirect is first received it's likely that no
mirror URLs are cached yet, so the "Location: ..." header would be
unchanged. If a mirror URL is later added to the cache, then subsequent
responses serving the redirect to clients should be transformed
accordingly. If a redirect can't be cached, it makes no difference
whether it's transformed before or after the cache.
I use TSCacheKeyDigestFromUrlSet() and TSCacheRead() to check whether a
URL already exists in the cache, thanks to sample code from Leif. This
works well so far.
I use TSmalloc() to allocate a struct to pass variables to the
TSCacheRead() callbacks. Leif mentioned in his sample code that this is
suboptimal and suggested configuring with jemalloc instead; I will do so.
The parsing of "Link: <...>; rel=duplicate" is rough, and I would most
appreciate feedback on this part. I call TSUrlParse() on the span from
the second character of the field value up to the first ">" character
that follows it. According to RFC 3986, a URI-reference can't contain a
">" character, so I think this logic is okay? I use memchr() to find the
">" character because "string values returned from marshall buffers are
not null-terminated ... cannot be passed into the common str*()
routines".
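For what it's worth, the extraction I describe can be written entirely against a (pointer, length) pair, with no str*() calls on the marshal buffer. This is a simplified standalone sketch (the function name is mine, and in the plugin the result span would be handed to TSUrlParse()):

```c
#include <string.h>

/* Extract the URL span from a Link header field value such as
 *   <http://mirror.example/file>; rel=duplicate
 * The value arrives as (value, len) because strings from marshal
 * buffers are not null-terminated. On success, *url/*url_len delimit
 * the characters between '<' and '>'. Returns 0 on success, -1 if the
 * value is malformed (no leading '<' or no closing '>'). */
static int
link_extract_url(const char *value, size_t len, const char **url, size_t *url_len)
{
  if (len < 2 || value[0] != '<')
    return -1;
  const char *end = memchr(value + 1, '>', len - 1);
  if (end == NULL)
    return -1;
  *url = value + 1;
  *url_len = (size_t)(end - (value + 1));
  return 0;
}
```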
I'm not sure how best to test whether Link headers have a "rel=duplicate"
parameter. Traffic Server has some private code,
HttpCompat::lookup_param_in_semicolon_string(), to parse e.g.
"Content-Type: ...; charset=UTF-8", but nothing in the public API. I can
probably cobble together something from scratch with memchr(), etc., but
I'm nervous about getting it right, e.g. all the RFC rules about
whitespace. Is conformance good enough, or are there nonconformant
implementations to consider? Finally, are there any libraries I should
consider using?
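To frame the question, the from-scratch version I have in mind looks something like this. It splits the field value on ';', trims simple whitespace, and compares case-insensitively, but it deliberately ignores quoted-strings and rel values that list multiple relation types, which is exactly the kind of corner I'm worried about:

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h> /* strncasecmp */

/* Naive check for a "rel=duplicate" parameter in a Link field value
 * given as (value, len), since marshal buffer strings are not
 * null-terminated. Splits on ';' and trims spaces and tabs; it does NOT
 * handle quoted-strings or rel="duplicate describedby"-style values. */
static bool
link_has_rel_duplicate(const char *value, size_t len)
{
  const char *p = value;
  const char *end = value + len;
  while (p < end) {
    const char *semi = memchr(p, ';', (size_t)(end - p));
    const char *param_end = semi ? semi : end;
    /* trim leading and trailing whitespace around this parameter */
    while (p < param_end && (*p == ' ' || *p == '\t'))
      p++;
    while (param_end > p && (param_end[-1] == ' ' || param_end[-1] == '\t'))
      param_end--;
    size_t n = (size_t)(param_end - p);
    if (n == strlen("rel=duplicate") && strncasecmp(p, "rel=duplicate", n) == 0)
      return true;
    p = semi ? semi + 1 : end;
  }
  return false;
}
```

I'd be glad to hear whether something this simple is acceptable in practice or whether a real parser is needed.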
Unfortunately I don't have enough experience to know which approach to
try first. If anyone can point me in the right direction, or offer
advice, I would be very grateful.
We run Traffic Server here in a rural village in Rwanda. Getting
download mirrors to work well with Traffic Server is important because
many download sites have a download button that doesn't always send
users to the same mirror, so users can't predict whether a download will
take seconds or hours, which is frustrating.
I am working on this as part of the Google Summer of Code.