codope commented on a change in pull request #3526:
URL: https://github.com/apache/hudi/pull/3526#discussion_r697295920
##########
File path: website/blog/2021-08-23-s3-events-source.md
##########
@@ -0,0 +1,111 @@
+---
+title: "Reliable ingestion from AWS S3 using Hudi"
+excerpt: "From listing to log-based approach, a reliable way of ingesting data from AWS S3 into Hudi."
+author: codope
+category: blog
+---
+
+In this post we will talk about a new deltastreamer source which reliably and efficiently processes new data files as they arrive in AWS S3.
+
+## Motivation
+
+To ingest from S3 Hudi users leverage DFS source whose [path selector](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java) would identify the source files modified since the last checkpoint based on max modification time.
+The problem with this approach is that modification time precision is upto seconds in S3. It maybe possible that there were many files (beyond what the configurable source limit allows) modifed in that second and some files might be skipped.
+This issue happened in production. For more details, please refer to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723).
+While the workaround was to ignore the source limit and keep reading, the problem motivated us to redesign so that users can reliably ingest from S3.
+
+## Design
+
+We wanted to move away from listing to log-based approach.

Review comment:
Reworded.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
