I agree with the problem Leif points out here. We may call this a
de-duplication solution, but since we can only compute the hash after all
the data has arrived from the origin, the duplicate is detected too late:
the content has already been saved, and the disk space is already wasted
on the duplicated file.
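
To make the waste concrete, here is a rough sketch of that naive flow
(plain Python, with hypothetical in-memory stand-ins for the cache): the
hash is only known once the whole body has arrived, so the duplicate is
discovered too late to save either the transfer or the write.

import hashlib

# Hypothetical in-memory stand-ins; the real ATS cache is on disk,
# but the flow is the same.
object_store = {}   # content hash -> body bytes
url_index = {}      # URL -> content hash

def fetch_from_origin(url):
    # Stand-in for the full (expensive) origin fetch.
    return b"the same video bytes"         # two URLs, one object

def naive_store(url):
    body = fetch_from_origin(url)          # full transfer already paid for
    digest = hashlib.sha256(body).hexdigest()
    if digest not in object_store:
        object_store[digest] = body        # first copy: keep it
    # else: duplicate detected only now; the bandwidth (and, in a
    # write-through cache, the disk write) has already been spent
    url_index[url] = digest

naive_store("http://mirror-a.example/file")
naive_store("http://mirror-b.example/file")
assert len(object_store) == 1              # deduped, but only after the fact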

A good solution would be: have the origin send out the content with the
common headers plus a SHA hash string and/or an MD5 hash string. Then we
can look that key up in our storage first, and it should work as expected.

On Aug 29, 2014, at 4:09 AM, Leif Hedstrom <zw...@apache.org> wrote:

> 
> On Aug 28, 2014, at 12:19 PM, Bill Zeng <billzeng2...@gmail.com> wrote:
> 
>> 
>> 
>> 
>> On Thu, Aug 28, 2014 at 10:41 AM, Leif Hedstrom <zw...@apache.org> wrote:
>> 
>> On Aug 28, 2014, at 11:35 AM, Bill Zeng <billzeng2...@gmail.com> wrote:
>> 
>>> Just to throw another idea your way. We can insert another level of 
>>> indirection between URLs and objects. Every object has a unique hash. 
>>> URLs point to the hashes instead of to the objects, and the hashes are 
>>> used to look up objects. Even if multiple URLs are duplicates (and hence 
>>> so are their hashes), they always point to the same object. It doesn't 
>>> seem an easy project, though; it requires major changes to ATS.
>> 
>> 
>> I’m not sure I understand this, or how it helps this problem? However, isn’t 
>> this sort of how the cache already works? There’s a hash from URL to the 
>> “header” entry, which then has its own hash to the actual object. Alan?
>> 
>> Maybe I did not understand it correctly. Currently, ATS calculates a hash 
>> from a URL and uses the hash to look up the actual object. That is "URL --> 
>> actual object". My idea is "URL --> hash of an object --> actual object": 
>> we calculate the hash of a URL, use that to look up the hash of the actual 
>> object, and then use that object hash to look up the actual object.
> 
> 
> But what problem does that solve? You have URL <A> and <B>, both of which 
> point to the same object. How do you find that object based only on the 
> client request (URL + headers)? How do you generate the “object hash” for 
> the lookup, without going to origin first? That’s the problem here, afaik?
> 
> Or is your suggestion here to solve the cache deduping problem (which is not 
> what the OP asked for)? If so, there were the beginnings of that in the cache 
> code, storing the hash of objects in the cache as well (but maybe that’s gone 
> now?). There is also a CRC (checksum) feature in the cache; maybe the 
> intention back then was to generalize the cache dedup with these checksums. 
> Only John Plevyak would know :).
> 
> Fwiw, this problem is what Metalink is intended to solve for some use cases 
> (e.g. site mirrors), but Metalink requires cooperation (additional Metalink 
> headers) from the origin. It does not solve (or intend to solve) the issue 
> where e.g. YouTube rotates the content URLs frequently.
> 
> — Leif

- Yongming Zhao 赵永明
