Natu, Benjamin,

With S3 event notifications you can configure notifications on *buckets* (if you only care about certain key prefixes, take a look at object key name filtering in the docs) for various event types, and these events can be published to SNS, SQS, or Lambda. I think using SQS as the source here would let you process these notifications from your Spark Streaming job.
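A rough sketch of how that could look, assuming the SQS queue is already subscribed to the bucket's event notifications and using the AWS SDK for Java v1 with a custom Spark Streaming receiver; the class name, queue URL, and credentials handling below are placeholders of mine, not anything from this thread:

    // Sketch only: polls an SQS queue for S3 event notifications and feeds the
    // raw JSON bodies into a DStream. Queue URL and credential setup are assumed.
    import com.amazonaws.services.sqs.{AmazonSQS, AmazonSQSClientBuilder}
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.collection.JavaConverters._

    class SqsNotificationReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

      @volatile private var sqs: AmazonSQS = _

      override def onStart(): Unit = {
        // Credentials come from the default provider chain (env vars, instance profile, ...)
        sqs = AmazonSQSClientBuilder.defaultClient()
        new Thread("sqs-poller") {
          override def run(): Unit = poll()
        }.start()
      }

      override def onStop(): Unit = {
        // The polling loop exits on its own once isStopped() returns true
        if (sqs != null) sqs.shutdown()
      }

      private def poll(): Unit = {
        val request = new ReceiveMessageRequest(queueUrl)
          .withMaxNumberOfMessages(10)
          .withWaitTimeSeconds(20) // long polling to avoid busy-waiting
        while (!isStopped()) {
          val messages = sqs.receiveMessage(request).getMessages.asScala
          messages.foreach { m =>
            store(m.getBody)                              // hand the S3 event JSON to Spark
            sqs.deleteMessage(queueUrl, m.getReceiptHandle)
          }
        }
      }
    }

Each record is then the JSON body of one S3 event notification; you would parse it to get the bucket and key of the newly created object and read that object in your batch logic, e.g. ssc.receiverStream(new SqsNotificationReceiver(queueUrl)).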
Nezih

On Sat, Apr 9, 2016 at 10:58 AM Natu Lauchande <nlaucha...@gmail.com> wrote:

>> Do you know if textFileStream can see if new files are created underneath
>> a whole bucket?
>
> Only at the level of the folder that you specify; it does not monitor
> subfolders. So with your approach it would detect everything under the
> path s3://bucket/path/20160409000002_data.csv
>
>> Also, will Spark Streaming not pick up these files again on the following
>> run knowing that it already picked them up or do we have to store state
>> somewhere, like the last run date and time to compare against?
>
> Yes, it does that automatically. It will only pick up newly created files
> after the streaming app is running.
>
> Thanks,
> Natu
>
> On Sat, Apr 9, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Natu,
>>
>> Do you know if textFileStream can see if new files are created underneath
>> a whole bucket? For example, if the bucket name is incoming and new files
>> underneath it are 2016/04/09/00/00/01/data.csv and
>> 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will
>> Spark Streaming not pick up these files again on the following run knowing
>> that it already picked them up or do we have to store state somewhere,
>> like the last run date and time to compare against?
>>
>> Thanks,
>> Ben
>>
>> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
>>
>> Hi Benjamin,
>>
>> I have done it. The critical configuration items are the ones below:
>>
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>   AccessKeyId)
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>   AWSSecretAccessKey)
>>
>> val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")
>>
>> This code will probe for new S3 files created under that path every
>> batch interval.
>>
>> Thanks,
>> Natu
>>
>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming
>>> and pulled any new files to process? If so, can you provide some basic
>>> Scala coding help on this?
>>>
>>> Thanks,
>>> Ben
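For reference, a minimal end-to-end sketch of the textFileStream approach described in the thread above; the app name, the 60-second batch interval, the s3n:// bucket path, and reading the keys from environment variables are assumptions made for illustration, not values from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object S3FileStreamJob {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("s3-file-stream")
        val ssc = new StreamingContext(sparkConf, Seconds(60)) // probe for new files once a minute

        // Placeholder credential handling: read the keys from environment variables
        val hadoopConf = ssc.sparkContext.hadoopConfiguration
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
        hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Only files created directly under this prefix after the app starts are picked up;
        // subfolders are not monitored and files seen in a previous batch are not re-read.
        val lines = ssc.textFileStream("s3n://example_bucket/folder")
        lines.foreachRDD { (rdd, time) =>
          println(s"Batch $time: ${rdd.count()} new lines")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }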