vinothchandar commented on a change in pull request #3526: URL: https://github.com/apache/hudi/pull/3526#discussion_r697407383
##########
File path: website/blog/2021-08-23-s3-events-source.md
##########
@@ -0,0 +1,117 @@
+---
+title: "Reliable ingestion from AWS S3 using Hudi"
+excerpt: "From listing to log-based approach, a reliable way of ingesting data from AWS S3 into Hudi."
+author: codope
+category: blog
+---
+
+In this post we will talk about a new deltastreamer source which reliably and efficiently processes new data files as they arrive in AWS S3.
+
+<!--truncate-->
+
+## Motivation
+
+As of today, to ingest data from S3 into Hudi, users leverage DFS source whose [path selector](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java) would identify the source files modified since the last checkpoint based on max modification time.

Review comment:
for links, can we link to the 0.9.0 branch? or use a permalink?

##########
File path: website/blog/2021-08-23-s3-events-source.md
##########
@@ -0,0 +1,117 @@
+---
+title: "Reliable ingestion from AWS S3 using Hudi"
+excerpt: "From listing to log-based approach, a reliable way of ingesting data from AWS S3 into Hudi."
+author: codope
+category: blog
+---
+
+In this post we will talk about a new deltastreamer source which reliably and efficiently processes new data files as they arrive in AWS S3.
+
+<!--truncate-->
+
+## Motivation
+
+As of today, to ingest data from S3 into Hudi, users leverage DFS source whose [path selector](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java) would identify the source files modified since the last checkpoint based on max modification time.
+The problem with this approach is that modification time precision is upto seconds in S3. It maybe possible that there were many files (beyond what the configurable source limit allows) modifed in that second and some files might be skipped.
+For more details, please refer to [HUDI-1723](https://issues.apache.org/jira/browse/HUDI-1723).
+While the workaround is to ignore the source limit and keep reading, the problem motivated us to redesign so that users can reliably ingest from S3.
+
+## Design
+
+For use-cases where seconds granularity does not suffice, we have a new source in deltastreamer using log-based approach.
+The new [S3 events source](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java) relies on change notification and incremental processing to ingest from S3.
+The architecture is as shown in the figure below.
+
+
+
+In this approach, users need to [enable S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html).
+There will be two deltastreamers as detailed below.
+
+1. [S3EventsSource](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/S3EventsSource.java): Create Hudi S3 metadata table. This source leverages AWS SNS and SQS services that subscribe to file events from the source bucket.

Review comment:
links for SQS/SNS?
