Saisai, Iceberg's FileIO interface doesn't require guarantees as strict as those of a Hadoop-compatible FileSystem. For S3, that allows us to avoid the negative caching that can cause problems when reading a table that has just been updated. (Specifically, S3A performs a HEAD request to check whether a path is a directory, even for overwrites.)
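For reference, the surface Iceberg actually depends on is small. Roughly (a simplified sketch for illustration, not the exact definitions in iceberg-api; the stream types are Iceberg's own, in org.apache.iceberg.io):

    // Simplified sketch of the FileIO contract Iceberg relies on.
    interface FileIO {
      InputFile newInputFile(String path);      // open for read
      OutputFile newOutputFile(String path);    // open for write
      void deleteFile(String path);             // delete by full path
    }

    interface InputFile {
      long getLength();                         // object size
      SeekableInputStream newStream();          // seekable read stream
      String location();
    }

    interface OutputFile {
      PositionOutputStream create();            // fails if the object already exists
      PositionOutputStream createOrOverwrite(); // no existence check required
      String location();
    }

There is no rename and no listing in that contract, which is why the stricter FileSystem behaviors (and the extra HEAD/LIST requests behind them) aren't needed.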
It's up to users whether they want to use S3FileIO, S3A with HadoopFileIO, or even HadoopFileIO with a custom FileSystem. We are working to make it easy to switch between them, but there are notable problems, like ORC not using the FileIO interface yet. I would recommend using S3FileIO to avoid negative caching, but S3A with S3Guard may be another way to avoid the problem. I think the default should be to use S3FileIO if it is in the classpath, and fall back to a Hadoop FileSystem if it isn't.

For cloud providers, I think the question is whether there is a need for the relaxed guarantees. If there is no need, then using HadoopFileIO and a FileSystem works great and is probably the easiest way to maintain an implementation for the object store.
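To make the switching concrete: with the dynamically loaded FileIO work that was recently committed, the implementation should be selectable per catalog with a property. A sketch for a Spark 3 catalog (catalog name and settings are placeholders, and this assumes the io-impl catalog property from that work):

    # Illustrative Spark conf, not a complete setup
    spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.my_catalog.type=hive
    # Use S3FileIO for data and metadata IO; drop this line to fall back to
    # HadoopFileIO and whatever FileSystem (e.g. S3A) is configured for the paths
    spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Leaving io-impl unset keeps the HadoopFileIO behavior, so the two setups can be compared against the same table.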
On Thu, Nov 12, 2020 at 7:31 PM Saisai Shao <sai.sai.s...@gmail.com> wrote:

> Hi all,
>
> Sorry to chime in; I also have the same concern about using Iceberg with object storage.
>
>> One of my concerns with S3FileIO is getting tied too much to a single cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful so that S3FileIO and (a future) GCSFileIO could share logic? I haven't looked deep enough into the S3FileIO to know how much logic is not S3 specific. Maybe the FileIO interface is enough.
>
> Now that we have an S3-specific FileIO implementation, is it recommended to use this one instead of an s3a-style HCFS implementation? Also, each public cloud provider has its own HCFS implementation for its own object storage. Are we going to suggest creating a specific FileIO implementation, or using the existing HCFS implementation?
>
> Thanks
> Saisai
>
> Daniel Weeks <dwe...@netflix.com.invalid> wrote on Fri, Nov 13, 2020 at 1:09 AM:
>
>> Hey John, about the concerns around cloud provider dependency, I feel like the FileIO interface is actually the right level of abstraction already.
>>
>> That interface basically requires "open for read" and "open for write", where the implementation will diverge across different platforms.
>>
>> I guess you could think of it as: S3FileIO is to FileIO what S3AFileSystem is to FileSystem (in Hadoop). You can have many different implementations that coexist.
>>
>> In fact, recent changes to the Catalog allow for very flexible management of FileIO, and you could even have files within a table split across multiple cloud vendors.
>>
>> As to the consistency questions: the list operation can be inconsistent (e.g. if a new file is created and the implementation relies on list-then-read, it may not see newly created objects; Iceberg does not list, so that should not be an issue).
>>
>> The stated read-after-write consistency is limited and does not include:
>> - Read after overwrite
>> - Read after delete
>> - Read after negative cache (e.g. a GET or HEAD that occurred before the object was created)
>>
>> Some of those inconsistencies have caused problems in certain cases when it comes to committing data (the negative cache being the main culprit).
>>
>> -Dan
>>
>> On Wed, Nov 11, 2020 at 6:49 PM John Clara <john.anthony.cl...@gmail.com> wrote:
>>
>>> Update: I think I'm wrong about the listing part. I think it will only do the HEAD request. Also, it seems like the consistency issue is probably not something my team would encounter with our current jobs.
>>>
>>> On 2020/11/12 02:17:10, John Clara <j...@gmail.com> wrote:
>>>
>>>> (Not sure if this is actually replying or just starting a new thread)
>>>>
>>>> Hi Daniel,
>>>>
>>>> Thanks for the response! It's very helpful and answers a lot of my questions.
>>>>
>>>> A couple of follow-ups:
>>>>
>>>> One of my concerns with S3FileIO is getting tied too much to a single cloud provider. I'm wondering if an ObjectStoreFileIO would be helpful so that S3FileIO and (a future) GCSFileIO could share logic? I haven't looked deep enough into the S3FileIO to know how much logic is not S3 specific. Maybe the FileIO interface is enough.
>>>>
>>>> About consistency (no need to respond here):
>>>> I'm seeing that during "getFileStatus" my version of s3a does some list requests (but I'm not sure if that could fail from consistency issues). I'm also confused about the read-after-(initial) write part:
>>>>
>>>> "Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency." - https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
>>>>
>>>> When my version of s3a does a create, it first does a getMetadataRequest (HEAD) to check whether the object exists before creating it. I think this is talked about in this issue: https://github.com/apache/iceberg/issues/1398 and in the S3FileIO PR: https://github.com/apache/iceberg/pull/1573. I'll follow up in that issue for more info.
>>>>
>>>> John
>>>>
>>>> On 2020/11/12 00:36:10, Daniel Weeks <d....@netflix.com.INVALID> wrote:
>>>>
>>>>> Hey John, I might be able to help answer some of your questions and provide some context around how you might want to go forward.
>>>>>
>>>>> So, one fundamental aspect of Iceberg is that it only relies on a few operations (as defined by the FileIO interface). This makes much of the functionality and complexity of full file system implementations unnecessary. You should not need features like S3Guard or the additional S3 operations these implementations rely on in order to achieve file system contract behavior. Consistency issues should also not be a problem, since Iceberg does not overwrite or list, and read-after-(initial) write is a guarantee provided by S3.
>>>>>
>>>>> At Netflix, we use a custom FileSystem implementation (somewhat like S3A), but with much of the contract behavior that drives additional operations against S3 disabled. However, we are transitioning to a more native implementation, S3FileIO, which you'll see as part of the ongoing work in Iceberg.
>>>>>
>>>>> Per your specific questions:
>>>>>
>>>>> 1) The S3FileIO implementation is very new, though internally we have something very similar. There are features missing that we are working to add (e.g. progressive multipart upload for large files is likely the most important).
>>>>> 2) You can use S3AFileSystem with the HadoopFileIO implementation, though you may still see similar behavior with additional calls being made (I don't know if these can be disabled).
>>>>> 3) The PrestoS3FileSystem is tailored to Presto's use and is likely not as complete as S3A, but since it uses the Hadoop FileSystem API, it would likely work for what HadoopFileIO exercises (as would the EMRFileSystem).
>>>>> 4) I would probably discourage you from writing your own file system, as S3FileIO will likely be a more optimized implementation for what Iceberg needs.
>>>>>
>>>>> If you want to contribute or have time to help contribute to S3FileIO, that is the path I would recommend. As for configuration, I would say a lot of it comes down to how you configure the AWS S3 client that you provide to the S3FileIO implementation, but a lot of the defaults are reasonable (you might want to tweak a few, like max connections and maybe the retry policy).
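>>>>> To give a rough idea of the tuning I mean (an illustrative sketch only; it assumes S3FileIO accepts a client supplier, so check the current constructor, and the values are placeholders):
>>>>>
>>>>>     import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
>>>>>     import software.amazon.awssdk.core.retry.RetryPolicy;
>>>>>     import software.amazon.awssdk.http.apache.ApacheHttpClient;
>>>>>     import software.amazon.awssdk.services.s3.S3Client;
>>>>>     import org.apache.iceberg.aws.s3.S3FileIO;
>>>>>     import org.apache.iceberg.io.FileIO;
>>>>>
>>>>>     // AWS SDK v2 client with a larger connection pool and a tuned retry policy
>>>>>     S3Client client = S3Client.builder()
>>>>>         .httpClientBuilder(ApacheHttpClient.builder()
>>>>>             .maxConnections(100))
>>>>>         .overrideConfiguration(ClientOverrideConfiguration.builder()
>>>>>             .retryPolicy(RetryPolicy.builder()
>>>>>                 .numRetries(5)
>>>>>                 .build())
>>>>>             .build())
>>>>>         .build();
>>>>>
>>>>>     // Hand the client to S3FileIO (assumed supplier-style constructor)
>>>>>     FileIO io = new S3FileIO(() -> client);
>>>>>
>>>>> The same client builder is also where region, credentials, or an endpoint override would go if you need them.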
>>>>> The recently committed work to dynamically load your FileIO should make it relatively easy to test out, and we'd love to have extra eyes and feedback on it.
>>>>>
>>>>> Let me know if that helps,
>>>>> -Dan
>>>>>
>>>>> On Wed, Nov 11, 2020 at 1:45 PM John Clara <jo...@gmail.com> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> Thank you all for creating/continuing this great project! I am just starting to get comfortable with the fundamentals, and I'm thinking that my team has been using Iceberg the wrong way at the FileIO level.
>>>>>>
>>>>>> I was wondering if people would be willing to share how they set up their FileIO/FileSystem with S3 and any customizations they had to add. (Preferably from smaller teams. My team is small and cannot realistically customize everything. If there's an up-to-date thread discussing this that I missed, please link me that instead.)
>>>>>>
>>>>>> ***** My team's specific problems/setup, which you can ignore *****
>>>>>>
>>>>>> My team has been using Hadoop FileIO with the S3AFileSystem. Jars are provided by AWS EMR 5.23, which is on Hadoop 2.8.5. We use DynamoDB for atomic renames by implementing Iceberg's provided interfaces. We read/write from either Spark in EMR or on-prem JVMs in docker containers (managed by k8s). Both use s3a, but the EMR clusters have HDFS (backed by core nodes) for the s3a buffered writes, while the on-prem containers use the docker container's default file system, which uses an overlay2 storage driver (that I know nothing about).
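>>>>>> (For anyone unfamiliar, the s3a buffering I'm referring to is driven by settings along these lines; illustrative values, not our exact config:)
>>>>>>
>>>>>>     # Hadoop 2.8 s3a upload buffering
>>>>>>     fs.s3a.fast.upload = true
>>>>>>     fs.s3a.fast.upload.buffer = disk      # or "array" / "bytebuffer" to avoid local disk
>>>>>>     fs.s3a.buffer.dir = /mnt/s3a-tmp      # local directories used to stage uploads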
>>>>>> Hadoop 2.8.5's S3AFileSystem does a bunch of unnecessary GET and LIST requests, which is well known in the community (but not to my team, unfortunately). There are also GET-PUT-GET inconsistency issues with S3 that have been talked about, but I don't yet understand how they arise in the 2.8.5 S3AFileSystem (https://github.com/apache/iceberg/issues/1398).
>>>>>>
>>>>>> *** End of specific ***
>>>>>>
>>>>>> The options I'm seeing are:
>>>>>>
>>>>>> 1. Using Iceberg's new S3 FileIO. Is anyone using this in prod?
>>>>>> This still seems very new, unless it is actually based on Netflix's prod implementation that they're releasing to the community? (I'm wondering if it's safe to start moving onto it in prod in the near term. If Netflix is using it (or rolling it out), that would be more than enough for my team.)
>>>>>>
>>>>>> 2. Using a newer Hadoop version with the S3AFileSystem. Any recommendations on a version, and are you also using S3Guard?
>>>>>> From a quick look, most gains compared to older versions seem to come from S3Guard. Are there substantial gains without it? (My team doesn't have experience with S3Guard, and Iceberg seems to not need it outside of atomic renames?)
>>>>>>
>>>>>> 3. Using an alternative Hadoop file system. Any recommendations?
>>>>>> In the recent Iceberg S3 FileIO, the LICENSE states it was based off the Presto FileSystem. Has anyone used this file system as-is with Iceberg? (https://github.com/apache/iceberg/blob/master/LICENSE#L251)
>>>>>>
>>>>>> 4. Rolling our own Hadoop file system. Anyone have stories/blogs about pitfalls or difficulties?
>>>>>> rdblue hints that Netflix has already done this: https://github.com/apache/iceberg/issues/1398#issuecomment-682837392 (My team probably doesn't have the capacity for this.)
>>>>>>
>>>>>> Places where I tried looking for this info:
>>>>>> - https://github.com/apache/iceberg/issues/761 (issue for a getting started guide)
>>>>>> - https://iceberg.apache.org/spec/#file-system-operations
>>>>>>
>>>>>> Thanks everyone,
>>>>>> John Clara

--
Ryan Blue
Software Engineer
Netflix