Do you know if textFileStream can see if new files are created underneath a
whole bucket?
Only at the level of the folder that you specify; it doesn't descend into
subfolders. So for your approach, the files would need to land directly
under the folder, e.g. s3://bucket/path/20160409000002_data.csv

Also, will Spark Streaming not pick up these files again on the following
run knowing that it already picked them up or do we have to store state
somewhere, like the last run date and time to compare against?
Yes, it does this automatically. It will only pick up files created after
the streaming app has started running.
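To make the two answers above concrete, here is a minimal sketch of a
streaming app that watches a single S3 folder. The bucket name, folder,
and batch interval are hypothetical; note that only files created directly
under the given prefix (no subfolders) after the app starts are detected,
and Spark tracks already-seen files itself, so no external last-run state
is needed while the app is up:

```scala
// Minimal sketch: monitor one S3 folder with Spark Streaming.
// textFileStream only watches the exact directory given; files written
// into subdirectories are NOT picked up.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3FolderMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3FolderMonitor")
    // Hypothetical 60-second batch interval: new files are probed each batch.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Only files that appear directly under this prefix after start-up
    // are read; Spark remembers which files it has already processed,
    // so nothing needs to be stored externally (e.g. a last-run timestamp).
    val lines = ssc.textFileStream("s3://example_bucket/incoming/")
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) rdd.take(5).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The trade-off is that nested date-partitioned layouts (2016/04/09/...) are
invisible to textFileStream; files must be written flat into the watched
folder.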


Thanks,
Natu


On Sat, Apr 9, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Natu,
>
> Do you know if textFileStream can see if new files are created underneath
> a whole bucket? For example, if the bucket name is incoming and new files
> underneath it are 2016/04/09/00/00/01/data.csv and
> 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will
> Spark Streaming not pick up these files again on the following run knowing
> that it already picked them up or do we have to store state somewhere, like
> the last run date and time to compare against?
>
> Thanks,
> Ben
>
> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
>
> Hi Benjamin,
>
> I have done it. The critical configuration items are the ones below:
>
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>         "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>         AccessKeyId)
>       ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>         AWSSecretAccessKey)
>
>       val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
>
> This code will probe for new S3 files created in that folder on every
> batch interval.
>
> Thanks,
> Natu
>
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Has anyone monitored an S3 bucket or directory using Spark Streaming and
>> pulled any new files to process? If so, can you provide basic Scala coding
>> help on this?
>>
>> Thanks,
>> Ben
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
