Hi, I would like files that are distributed from multiple mirrors to
work better with caching proxies, and I hope to write a Traffic Server
plugin to help with this.
I would love any input or feedback on how mirrors can work better with
Traffic Server.
The approach I am taking for my initial attempt is to use RFC 6249,
Metalink/HTTP: Mirrors and Hashes. I listen for responses that are an
HTTP redirect and have "Link: <...>; rel=duplicate" headers, then I scan
those URLs for one that already exists in the cache. If one is found, I
transform the response, replacing the "Location: ..." header with the
URL that is already cached.
Later, I would also like to use RFC 3230, Instance Digests in HTTP, and
find a way to look up URLs in the Traffic Server cache by content digest.
I gather that ATS does create checksums of content stored in the cache,
but doesn't support looking up content by digest. Some possibilities
include extending the core with new APIs to accomplish this, or having a
plugin add additional entries to the ATS cache for content digests.
Alternatively, a separate store could be used, e.g. Kyoto Cabinet or
memcached.
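To make the "separate store" idea concrete, here is a minimal sketch of a digest-to-URL side index. Everything here is made up for illustration (the function names, the fixed-size table, the in-memory storage); a real plugin would presumably key into the ATS cache or an external store instead:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical side index mapping a content digest (e.g. the value of a
 * "Digest: SHA-256=..." header) to a URL already in the cache. The names
 * and the fixed-size in-memory table are illustrative only. */
#define MAX_ENTRIES 64

struct digest_entry {
  char digest[128]; /* e.g. "SHA-256=MWVh..." */
  char url[256];
};

static struct digest_entry table[MAX_ENTRIES];
static int num_entries;

/* Record that content with this digest is cached under this URL. */
static int
digest_index_put(const char *digest, const char *url)
{
  if (num_entries >= MAX_ENTRIES)
    return -1;
  snprintf(table[num_entries].digest, sizeof(table[num_entries].digest), "%s", digest);
  snprintf(table[num_entries].url, sizeof(table[num_entries].url), "%s", url);
  num_entries++;
  return 0;
}

/* Return a cached URL for this digest, or NULL if none is known. */
static const char *
digest_index_get(const char *digest)
{
  for (int i = 0; i < num_entries; i++) {
    if (strcmp(table[i].digest, digest) == 0)
      return table[i].url;
  }
  return NULL;
}
```

The point is only to show the shape of the lookup; persistence, eviction, and concurrency are all open questions.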
Some further ideas for download mirrors and Traffic Server include:
* Remember lists of mirrors, so that future requests for any of these
URLs use the same cache key. A problem is how to prevent a malicious
domain from distributing false information about URLs it doesn't
control; this could be addressed with a whitelist of trusted domains
* Make decisions about the best mirror to choose, e.g. the one that is
most cost efficient, fastest, or closest
* Use content digests to detect or repair download errors
A first attempt at a plugin is up on GitHub: https://github.com/jablko/dedup
I would love any feedback on this code:
1. I assume I want to minimize cache lookups, so I first check that a
response has both a "Location: ..." header and a "Link: <...>;
rel=duplicate" header.
2. Then I check whether the "Location: ..." URL already exists in the
cache. If so, I just reenable the response.
3. Otherwise, I check whether the "Link: <...>; rel=duplicate" URL
already exists in the cache. If so, I rewrite the "Location: ..." header
and reenable the response.
4. I continue scanning "Link: <...>; rel=duplicate" headers until a
URL is found that already exists in the cache. If none is found, I just
reenable the response without any changes.
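The steps above can be sketched roughly as follows. Note that url_is_cached() is a stand-in I made up for the real TSCacheKeyDigestFromUrlSet()/TSCacheRead() sequence, and the header handling is stripped away so only the decision logic remains:

```c
#include <stdbool.h>
#include <string.h>

/* Stand-in for the real asynchronous cache check; here it just consults
 * a fixed list of "cached" URLs so the logic can be exercised. */
static const char *cached_urls[] = {
  "http://mirror-b.example/file.iso",
};

static bool
url_is_cached(const char *url)
{
  for (size_t i = 0; i < sizeof(cached_urls) / sizeof(cached_urls[0]); i++) {
    if (strcmp(cached_urls[i], url) == 0)
      return true;
  }
  return false;
}

/* Given the Location URL and the rel=duplicate URLs from the Link
 * headers, decide what the Location header should become. Returns the
 * original Location when it is already cached, or when no duplicate is
 * cached either (steps 2 and 4); otherwise the first cached duplicate
 * (step 3). */
static const char *
choose_location(const char *location, const char **duplicates, size_t n)
{
  if (url_is_cached(location))
    return location;
  for (size_t i = 0; i < n; i++) {
    if (url_is_cached(duplicates[i]))
      return duplicates[i];
  }
  return location;
}
```

In the plugin itself this can't be a simple loop, since each TSCacheRead() completes via a continuation callback, but the decision tree is the same.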
I use TS_HTTP_SEND_RESPONSE_HDR_HOOK to work on responses sent from the
cache to clients, rather than responses sent from the origin to the
cache, because when the redirect is first received it's likely that no
mirror URLs are cached yet, so the "Location: ..." header would be
unchanged. If a mirror URL is later added to the cache, then subsequent
responses serving the redirect to clients should be transformed
accordingly. If a redirect can't be cached, it makes no difference
whether it's transformed before or after the cache.
I use TSCacheKeyDigestFromUrlSet() and TSCacheRead() to check whether a
URL already exists in the cache, thanks to sample code from Leif. This
works well so far.
I use TSmalloc() to allocate a struct to pass variables to the
TSCacheRead() callbacks. Leif mentioned in his sample code that this is
suboptimal and suggested configuring with jemalloc instead; I will do so.
The parsing of "Link: <...>; rel=duplicate" is rough, and I would most
appreciate feedback on this part. I call TSUrlParse() on the span from
the second character of the field value up to the first ">" character
that follows it. According to RFC 3986, a URI-reference can't contain a
">" character, so I think this logic is okay? I use memchr() to find the
">" character because "string values returned from marshall buffers are
not null-terminated ... cannot be passed into the common str*()
routines".
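For what it's worth, the extraction I describe can be written entirely against a (pointer, length) pair, with no str*() calls on the marshal buffer. This is a simplified standalone sketch (the function name is mine, and in the plugin the result span would be handed to TSUrlParse()):

```c
#include <string.h>

/* Extract the URL span from a Link header field value such as
 *   <http://mirror.example/file>; rel=duplicate
 * The value arrives as (value, len) because strings from marshal
 * buffers are not null-terminated. On success, *url/*url_len delimit
 * the characters between '<' and '>'. Returns 0 on success, -1 if the
 * value is malformed (no leading '<' or no closing '>'). */
static int
link_extract_url(const char *value, size_t len, const char **url, size_t *url_len)
{
  if (len < 2 || value[0] != '<')
    return -1;
  const char *end = memchr(value + 1, '>', len - 1);
  if (end == NULL)
    return -1;
  *url = value + 1;
  *url_len = (size_t)(end - (value + 1));
  return 0;
}
```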
I'm not sure how best to test whether Link headers have a "rel=duplicate"
parameter. Traffic Server has some private code,
HttpCompat::lookup_param_in_semicolon_string(), to parse e.g.
"Content-Type: ...; charset=UTF-8", but nothing in the public API. I can
probably cobble together something from scratch with memchr(), etc., but
I'm nervous about getting it right, e.g. all the RFC rules about
whitespace. Is conformance good enough, or are there nonconformant
implementations to consider? Finally, are there any libraries I should
consider using?
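To frame the question, the from-scratch version I have in mind looks something like this. It splits the field value on ';', trims simple whitespace, and compares case-insensitively, but it deliberately ignores quoted-strings and rel values that list multiple relation types, which is exactly the kind of corner I'm worried about:

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h> /* strncasecmp */

/* Naive check for a "rel=duplicate" parameter in a Link field value
 * given as (value, len), since marshal buffer strings are not
 * null-terminated. Splits on ';' and trims spaces and tabs; it does NOT
 * handle quoted-strings or rel="duplicate describedby"-style values. */
static bool
link_has_rel_duplicate(const char *value, size_t len)
{
  const char *p = value;
  const char *end = value + len;
  while (p < end) {
    const char *semi = memchr(p, ';', (size_t)(end - p));
    const char *param_end = semi ? semi : end;
    /* trim leading and trailing whitespace around this parameter */
    while (p < param_end && (*p == ' ' || *p == '\t'))
      p++;
    while (param_end > p && (param_end[-1] == ' ' || param_end[-1] == '\t'))
      param_end--;
    size_t n = (size_t)(param_end - p);
    if (n == strlen("rel=duplicate") && strncasecmp(p, "rel=duplicate", n) == 0)
      return true;
    p = semi ? semi + 1 : end;
  }
  return false;
}
```

I'd be glad to hear whether something this simple is acceptable in practice or whether a real parser is needed.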
Unfortunately I don't have enough experience to know which approach to
try first. If anyone can point me in the right direction, or offer
advice, I would be very grateful.
We run Traffic Server here in a rural village in Rwanda. Getting
download mirrors to work well with Traffic Server is important because
many download sites have a download button that doesn't always send
users to the same mirror, so users can't predict whether a download will
take seconds or hours, which is frustrating.
I am working on this as part of the Google Summer of Code.