Natu, Benjamin,

With S3 event notifications you can configure notifications on *buckets* (if you only care about certain key prefixes, take a look at object key name filtering in the docs) for various event types, and these events can be published to SNS, SQS, or Lambda. I think using SQS as the source here would let you process these notifications from your Spark Streaming job.
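A rough sketch of how that could look, assuming the SQS queue is already subscribed to the bucket's event notifications and using the AWS SDK for Java v1 with a custom Spark Streaming receiver; the class name, queue URL, and credentials handling below are placeholders of mine, not anything from this thread:

    // Sketch only: polls an SQS queue for S3 event notifications and feeds the
    // raw JSON bodies into a DStream. Queue URL and credential setup are assumed.
    import com.amazonaws.services.sqs.{AmazonSQS, AmazonSQSClientBuilder}
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.collection.JavaConverters._

    class SqsNotificationReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      @volatile private var sqs: AmazonSQS = _

      override def onStart(): Unit = {
        // Credentials come from the default provider chain (env vars, instance profile, ...)
        sqs = AmazonSQSClientBuilder.defaultClient()
        new Thread("sqs-poller") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = {
        // The polling loop exits on its own once isStopped() returns true
        if (sqs != null) sqs.shutdown()
      }

      private def poll(): Unit = {
        val request = new ReceiveMessageRequest(queueUrl)
          .withMaxNumberOfMessages(10)
          .withWaitTimeSeconds(20) // long polling to avoid busy-waiting
        while (!isStopped()) {
          val messages = sqs.receiveMessage(request).getMessages.asScala
          messages.foreach { m =>
            store(m.getBody)                              // hand the S3 event JSON to Spark
            sqs.deleteMessage(queueUrl, m.getReceiptHandle)
          }
        }
      }
    }

Each record is then the JSON body of one S3 event notification; you would parse it to get the bucket and key of the newly created object and read that object in your batch logic, e.g. ssc.receiverStream(new SqsNotificationReceiver(queueUrl)).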
Nezih

On Sat, Apr 9, 2016 at 10:58 AM Natu Lauchande <nlaucha...@gmail.com> wrote:

>> Do you know if textFileStream can see if new files are created underneath
>> a whole bucket?
>
> Only at the level of the folder that you specify; it does not monitor
> subfolders. So with your approach it would detect everything under the
> path s3://bucket/path/20160409000002_data.csv
>
>> Also, will Spark Streaming not pick up these files again on the following
>> run knowing that it already picked them up or do we have to store state
>> somewhere, like the last run date and time to compare against?
>
> Yes, it does that automatically. It will only pick up newly created files
> after the streaming app is running.
>
> Thanks,
> Natu
>
> On Sat, Apr 9, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Natu,
>>
>> Do you know if textFileStream can see if new files are created underneath
>> a whole bucket? For example, if the bucket name is incoming and new files
>> underneath it are 2016/04/09/00/00/01/data.csv and
>> 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will
>> Spark Streaming not pick up these files again on the following run knowing
>> that it already picked them up or do we have to store state somewhere,
>> like the last run date and time to compare against?
>>
>> Thanks,
>> Ben
>>
>> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
>>
>> Hi Benjamin,
>>
>> I have done it. The critical configuration items are the ones below:
>>
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>   AccessKeyId)
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>   AWSSecretAccessKey)
>>
>> val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
>>
>> This code will probe for new S3 files created under that path every
>> batch interval.
>>
>> Thanks,
>> Natu
>>
>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming
>>> and pulled any new files to process? If so, can you provide some basic
>>> Scala coding help on this?
>>>
>>> Thanks,
>>> Ben
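For reference, a minimal end-to-end sketch of the textFileStream approach described in the thread above; the app name, the 60-second batch interval, the s3n:// bucket path, and reading the keys from environment variables are assumptions made for illustration, not values from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object S3FileStreamJob {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("s3-file-stream")
        val ssc = new StreamingContext(sparkConf, Seconds(60)) // probe for new files once a minute

        // Placeholder credential handling: read the keys from environment variables
        val hadoopConf = ssc.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Only files created directly under this prefix after the app starts are picked up;
        // subfolders are not monitored and files seen in a previous batch are not re-read.
        val lines = ssc.textFileStream("s3n://example_bucket/folder")
        lines.foreachRDD { (rdd, time) =>
          println(s"Batch $time: ${rdd.count()} new lines")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }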