Sean Mackrory created HADOOP-15999:
--------------------------------------

             Summary: [s3a] Better support for out-of-band operations
                 Key: HADOOP-15999
                 URL: https://issues.apache.org/jira/browse/HADOOP-15999
             Project: Hadoop Common
          Issue Type: New Feature
            Reporter: Sean Mackrory


S3Guard was initially designed on the premise that a new MetadataStore would 
be the source of truth, and that it wouldn't provide guarantees if updates 
were made without going through S3Guard.

I've been seeing increased demand for better support for scenarios where 
operations are done on the data that can't reasonably be done with S3Guard 
involved. For example:
* A file is deleted using S3Guard, and then replaced by some other tool. 
S3Guard can't distinguish the new file from delete / list inconsistency, so it 
continues to treat the file as deleted.
* An S3Guard-ed file is overwritten with a longer file by some other tool. 
When the file is read, only as many bytes as the original file's length are 
returned.

We could possibly have smarter behavior here by querying both S3 and the 
MetadataStore (even in cases where we may currently only query the 
MetadataStore in getFileStatus) and using whichever one has the more recent 
modification time.
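
To make the idea concrete, here is a minimal sketch of the "newest entry 
wins" comparison. It is not S3A code: the class NewestStatusResolver, the 
Status type, and its fields are hypothetical names made up for illustration, 
and the tie-break in favor of S3 is an assumption rather than something the 
proposal specifies.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these names come from the S3A codebase. */
final class NewestStatusResolver {

    /** Minimal stand-in for a file status entry (assumed shape). */
    static final class Status {
        final String path;
        final long length;
        final long modTimeMillis; // modification time, epoch milliseconds
        final boolean deleted;    // true if this entry is a delete marker

        Status(String path, long length, long modTimeMillis, boolean deleted) {
            this.path = path;
            this.length = length;
            this.modTimeMillis = modTimeMillis;
            this.deleted = deleted;
        }
    }

    /**
     * Picks the entry with the more recent modification time.
     * If only one side has an entry, that entry is returned.
     * Ties go to S3 (an assumption made for this sketch).
     */
    static Optional<Status> resolve(Optional<Status> fromStore,
                                    Optional<Status> fromS3) {
        if (!fromStore.isPresent()) {
            return fromS3;
        }
        if (!fromS3.isPresent()) {
            return fromStore;
        }
        Status store = fromStore.get();
        Status s3 = fromS3.get();
        return Optional.of(s3.modTimeMillis >= store.modTimeMillis ? s3 : store);
    }
}
```

With this rule, a delete marker recorded at time 100 loses to an S3 object 
written out-of-band at time 200, and a longer out-of-band overwrite supplies 
the length used for reads. An entry present on only one side is still 
ambiguous and would need separate handling.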

This would eliminate the performance boost we currently get in some workloads 
from the short-circuited getFileStatus, but we could keep it in authoritative 
mode, which should give a larger performance boost. At least we'd get more 
correctness without authoritative mode, and a clear declaration of when we can 
make the assumptions required to short-circuit the process. If we can't 
consider S3Guard the source of truth, we need to defer to S3 more.
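
The trade-off could be sketched roughly as follows. Again, this is only an 
illustration under assumptions: AuthoritativeLookup, its modTime method, and 
the s3Probe parameter are hypothetical and do not exist in S3A. The point it 
shows is that the S3 call is skipped only when authoritative mode has been 
declared and the MetadataStore has an entry.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; these names are not part of S3A or S3Guard. */
final class AuthoritativeLookup {

    /**
     * Returns the modification time to trust, or null if neither side
     * knows the file.
     *
     * @param authoritative true if the MetadataStore is declared the
     *                      source of truth for this path
     * @param storeModTime  mod time from the MetadataStore, or null
     * @param s3Probe       stand-in for the (expensive) call to S3;
     *                      yields a mod time, or null if not found
     */
    static Long modTime(boolean authoritative,
                        Long storeModTime,
                        Supplier<Long> s3Probe) {
        if (authoritative && storeModTime != null) {
            // Short-circuit: S3 is never queried on this path.
            return storeModTime;
        }
        // Otherwise always consult S3 and prefer the newer value.
        Long s3ModTime = s3Probe.get();
        if (storeModTime == null) {
            return s3ModTime;
        }
        if (s3ModTime == null) {
            return storeModTime;
        }
        return Math.max(storeModTime, s3ModTime);
    }
}
```

In this shape, the cost of the extra S3 request is paid only where 
out-of-band changes are possible, and the short-circuit is tied to an 
explicit setting rather than applied implicitly.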

We'd need to be extra sure of any locality / time zone issues if we start 
relying on mod_time more directly, but currently we're tracking the 
modification time as returned by S3 anyway.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)