Hi, I would like files distributed from multiple mirrors to work better with caching proxies, and I hope to write a Traffic Server plugin to help with this.

I would love any input or feedback on how mirrors can work better with Traffic Server.

The approach I am taking for my initial attempt uses RFC 6249, Metalink/HTTP: Mirrors and Hashes. I listen for responses that are HTTP redirects and have "Link: <...>; rel=duplicate" headers, then scan those URLs for one that already exists in the cache. If one is found, I transform the response, replacing the "Location: ..." header with the URL that is already cached.
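
For reference, the kind of response I'm matching looks roughly like this (the hostnames are invented):

```
HTTP/1.1 302 Found
Location: http://mirror-a.example.com/pub/file.iso
Link: <http://mirror-b.example.com/pub/file.iso>; rel=duplicate
Link: <http://mirror-c.example.com/pub/file.iso>; rel=duplicate
```

If, say, mirror-b's URL is already in the cache, the plugin rewrites the Location header to point there instead of mirror-a.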

Later, I would also like to use RFC 3230, Instance Digests in HTTP, and find a way to look up URLs in the Traffic Server cache by content digest. I gather that ATS does create checksums of content stored in the cache, but doesn't support looking up content by digest. Some possibilities include extending the core with new APIs to accomplish this, or having a plugin add additional cache entries keyed by content digest. Alternatively, a separate store could be used, e.g. Kyoto Cabinet or memcached.
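
For illustration, an RFC 3230 instance digest on a response looks like this (the value below is invented, not a real hash):

```
Digest: SHA-256=dGhpcyBpcyBub3QgYSByZWFsIGRpZ2VzdCB2YWx1ZQ==
```

With a digest-to-cache-key index, a response carrying the same digest could be served from the cache regardless of which mirror URL was originally fetched.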

Some further ideas for download mirrors and Traffic Server include:

* Remembering lists of mirrors so that future requests for any of these URLs use the same cache key. A problem is how to prevent a malicious domain from distributing false information about URLs it doesn't control; this could be addressed with a whitelist of trusted domains.

* Making decisions about the best mirror to choose, e.g. the one that is cheapest, fastest, or most local

  * Using content digests to detect or repair download errors

A first attempt at a plugin is up on GitHub: https://github.com/jablko/dedup

I would love any feedback on this code.

1. Since I want to minimize cache lookups, I first check that a response has both a "Location: ..." header and a "Link: <...>; rel=duplicate" header

2. Then I check whether the "Location: ..." URL already exists in the cache. If so, I just reenable the response

3. Otherwise, I check whether a "Link: <...>; rel=duplicate" URL already exists in the cache. If so, I rewrite the "Location: ..." header and reenable the response

4. I continue scanning "Link: <...>; rel=duplicate" headers until a URL is found that already exists in the cache. If none is found, I reenable the response without any changes
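
Stripped of the Traffic Server specifics, steps 2-4 amount to something like the sketch below. `pick_url()` and `in_cache()` are hypothetical stand-ins of my own naming; the real plugin does the cache test asynchronously with TSCacheRead():

```c
#include <string.h>

/* Hypothetical stand-in for a cache lookup (the real plugin calls
 * TSCacheRead()). Here, "cached" URLs are just a fixed list. */
static int in_cache(const char *url, const char *const *cached, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp(url, cached[i]) == 0)
      return 1;
  return 0;
}

/* Steps 2-4: prefer the Location URL if it's already cached, otherwise
 * return the first rel=duplicate URL that is. NULL means reenable the
 * response unchanged. */
static const char *pick_url(const char *location,
                            const char *const *duplicates, size_t ndup,
                            const char *const *cached, size_t ncached)
{
  if (in_cache(location, cached, ncached))
    return location; /* step 2: reenable as-is */
  for (size_t i = 0; i < ndup; i++)
    if (in_cache(duplicates[i], cached, ncached))
      return duplicates[i]; /* step 3: rewrite Location */
  return NULL; /* step 4: no change */
}
```

The order matters: checking Location first means a cache hit there costs a single lookup, which is the common case once a redirect target has been fetched.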

I use TS_HTTP_SEND_RESPONSE_HDR_HOOK to work on responses sent from the cache to clients, rather than on responses sent from the origin to the cache, because when the redirect is first received it's likely that no mirror URLs are cached yet, so the "Location: ..." header would be left unchanged. If a mirror URL is later added to the cache, subsequent responses to clients should be transformed accordingly. If a redirect can't be cached at all, it makes no difference whether it's transformed before or after the cache.

I use TSCacheKeyDigestFromUrlSet() and TSCacheRead() to check whether a URL already exists in the cache, thanks to sample code from Leif. This works well so far.

I use TSmalloc() to allocate a struct to pass variables to the TSCacheRead() callbacks. Leif mentioned in his sample code that this is suboptimal and that I should configure with jemalloc instead; I will do so.

The parsing of "Link: <...>; rel=duplicate" is rough; I would most appreciate feedback on this part. I call TSUrlParse() on the range from the second character of the field value up to the first ">" character after it. I believe that according to RFC 3986 a URI-reference can't contain a ">" character, so I think this logic is okay? I use memchr() to find the ">" character because "string values returned from marshall buffers are not null-terminated ... cannot be passed into the common str*() routines".
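
The extraction I describe can be written as a small helper; this is only a sketch (the real code hands the resulting range to TSUrlParse()), and `link_url` is my own name for it:

```c
#include <string.h>

/* Given a non-null-terminated Link field value such as
 * "<http://example.com/file>; rel=duplicate", return a pointer to the
 * start of the URL and store its length in *url_len, or NULL if the
 * value is not bracketed. memchr() is used because marshal buffer
 * strings are not null-terminated. */
static const char *link_url(const char *value, size_t len, size_t *url_len)
{
  if (len < 2 || value[0] != '<')
    return NULL;
  const char *end = memchr(value + 1, '>', len - 1);
  if (end == NULL)
    return NULL;
  *url_len = (size_t)(end - (value + 1));
  return value + 1;
}
```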

I'm not sure how best to test whether Link headers have a "rel=duplicate" parameter. Traffic Server has some private code, HttpCompat::lookup_param_in_semicolon_string(), to parse e.g. "Content-Type: ...; charset=UTF-8", but nothing in the public API. I could probably cobble something together from scratch with memchr(), etc., but I'm nervous about getting it right, e.g. all the RFC rules about whitespace. And is conformance good enough, or are there nonconformant implementations to consider? Finally, are there any libraries I should consider using?
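
A naive version of the parameter test might look like this. It is deliberately simplistic: no quoted-string handling and only crude whitespace tolerance, which is exactly the part I'm unsure about:

```c
#include <ctype.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Return 1 if the (non-null-terminated) Link field value contains a
 * ";rel=duplicate" parameter, allowing optional whitespace around the
 * ";" and "=". Deliberately naive: no quoted-string values, and a
 * longer token like "duplicates" would match as a prefix. */
static int has_rel_duplicate(const char *value, size_t len)
{
  const char *p = value;
  const char *end = value + len;
  while ((p = memchr(p, ';', (size_t)(end - p))) != NULL) {
    p++;
    while (p < end && isspace((unsigned char)*p))
      p++;
    if ((size_t)(end - p) >= 3 && strncasecmp(p, "rel", 3) == 0) {
      const char *q = p + 3;
      while (q < end && isspace((unsigned char)*q))
        q++;
      if (q < end && *q == '=') {
        q++;
        while (q < end && isspace((unsigned char)*q))
          q++;
        if ((size_t)(end - q) >= 9 && strncmp(q, "duplicate", 9) == 0)
          return 1;
      }
    }
  }
  return 0;
}
```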

Unfortunately I don't have enough experience to know which approach to try first. If anyone can point me in the right direction, or offer advice, I would be very grateful.

We run Traffic Server here in a rural village in Rwanda. Getting download mirrors to work well with Traffic Server is important because many download sites have a download button that doesn't always send users to the same mirror, so users can't predict whether a download will take seconds or hours, which is frustrating.

I am working on this as part of the Google Summer of Code.
