What about a post-optimization of the cache? I mean...

1. When ATS receives a large object, it stores the URL, a rounded timestamp,
and a flag "checked: true/false" in an RDBMS (e.g. PostgreSQL), with a unique
constraint on the URL and timestamp fields.
2. A batch process periodically fetches URLs (last_check_time < timestamp,
checked = false) from the DB, requests them from the ATS instance that has
cached them, computes the SHA of each body, and then runs two queries against
a NoSQL store: insert "key: URL, value: SHA" into table "A" (always), and
insert "key: SHA, value: URL" into table "B" (if the key does not exist;
otherwise refresh the expiry timeout for that key and delete the ATS cache
entry for the new URL). Finally, it sets checked = true.
3. When ATS receives a request from a client (not the batch process), it looks
up the URL in table "A". If a SHA is returned, it looks up the corresponding
URL in table "B" and returns that URL's cached data; otherwise it forwards the
request to the origin.
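
To make the flow concrete, here is a rough sketch of the three steps in Python. It is only an illustration: plain dicts stand in for the ATS cache, the RDBMS table and the two NoSQL tables, the class and method names (DedupIndex, store, run_batch, lookup) are made up for this example, SHA-256 is assumed as the hash, and the expiry-timeout refresh from step 2 is left out.

```python
import hashlib


class DedupIndex:
    """In-memory stand-in for the ATS cache, the RDBMS and the NoSQL tables."""

    def __init__(self):
        self.cache = {}    # URL -> body (stands in for the ATS cache)
        self.pending = {}  # URL -> checked flag (stands in for the RDBMS table)
        self.table_a = {}  # URL -> SHA
        self.table_b = {}  # SHA -> first URL seen with that content

    def store(self, url, body):
        # Step 1: cache the object and record its URL as not yet checked.
        self.cache[url] = body
        self.pending.setdefault(url, False)

    def run_batch(self):
        # Step 2: hash every unchecked URL and fold duplicates together.
        for url, checked in list(self.pending.items()):
            if checked:
                continue
            digest = hashlib.sha256(self.cache[url]).hexdigest()
            self.table_a[url] = digest  # always inserted
            canonical = self.table_b.setdefault(digest, url)  # insert if absent
            if canonical != url:
                # Same content is already cached under another URL:
                # delete the cache entry of the new URL.
                del self.cache[url]
            self.pending[url] = True

    def lookup(self, url):
        # Step 3: URL -> SHA (table A) -> canonical URL (table B) -> body.
        digest = self.table_a.get(url)
        if digest is None:
            return None  # caller forwards the request to the origin
        return self.cache.get(self.table_b[digest])
```

For example, after storing the same body under two URLs and calling run_batch(), only the first URL keeps its cache entry, and lookup() on either URL returns the shared body. A real deployment would do the lookup inside an ATS plugin and would have to handle races between the batch process and client requests, which this sketch ignores.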

Obviously you would need to estimate whether something like this is worth the
effort. Do you really have that much traffic/cache?
