Someone please correct me if I am wrong, as I am still rather green to Spark, but it appears that through the S3 notification mechanism described below you can publish events to SQS and then use SQS as a streaming source into Spark. The project at https://github.com/imapi/spark-sqs-receiver appears to provide a library for doing this.
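I haven't used that library myself, so rather than guess at its API, here is a minimal sketch (my own, untested) of what such a receiver could look like using the plain Spark receiver API and the AWS Java SDK; the class name and queue URL are placeholders:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConverters._

// Long-polls an SQS queue and hands each message body (the S3 event JSON)
// to Spark as one record.
class SqsReceiver(queueUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("sqs-receiver") {
      override def run(): Unit = poll()
    }.start()
  }

  override def onStop(): Unit = {}  // poll() exits once isStopped() is true

  private def poll(): Unit = {
    val sqs = AmazonSQSClientBuilder.defaultClient()
    while (!isStopped()) {
      val request = new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)
      for (msg <- sqs.receiveMessage(request).getMessages.asScala) {
        store(msg.getBody)                                 // deliver to Spark first,
        sqs.deleteMessage(queueUrl, msg.getReceiptHandle)  // then ack the message
      }
    }
  }
}

You would then create the DStream with ssc.receiverStream(new SqsReceiver(queueUrl)) and parse the S3 event JSON in each record to find the new object keys.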
Hope this helps.

Sent from my iPhone

> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Nezih,
>
> This looks like a good alternative to having the Spark Streaming job check
> for new files on its own. Do you know if there is a way to have the Spark
> Streaming job get notified with the new file information and act upon it?
> That would reduce the overhead and cost of polling S3. Plus, I can use the
> same notifications to kick off Lambda to process new data files and make
> them ready for Spark Streaming to consume. I would just need to configure
> notifications on all incoming folders for Lambda and on all outgoing
> folders for Spark Streaming. This sounds like a better setup than what we
> have now.
>
> Thanks,
> Ben
>
>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
>>
>> While it is doable in Spark, S3 also supports notifications:
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>>
>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com> wrote:
>>> Hi Benjamin,
>>>
>>> I have done it. The critical configuration items are the ones below:
>>>
>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>>>   "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>>>   AccessKeyId)
>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>>>   AWSSecretAccessKey)
>>>
>>> val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
>>>
>>> This code will probe for new S3 files created in your bucket once every
>>> batch interval.
>>>
>>> Thanks,
>>> Natu
>>>
>>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> Has anyone monitored an S3 bucket or directory using Spark Streaming
>>>> and pulled in any new files to process? If so, can you provide basic
>>>> Scala coding help on this?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
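For reference, here is what Natu's polling approach could look like as a complete minimal program. This is my own sketch, not code from the thread; the app name, bucket path, batch interval, and the environment variables used for credentials are all placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3FolderMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3FolderMonitor")
    val ssc = new StreamingContext(conf, Seconds(60))  // probe S3 once a minute

    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // textFileStream only picks up files that appear under the prefix
    // after the stream starts; pre-existing files are ignored.
    val lines = ssc.textFileStream("s3n://example_bucket/folder")
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) println(s"new batch: ${rdd.count()} lines")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

One caveat worth noting: this approach lists the S3 prefix on every batch, which is exactly the polling overhead and cost that the notification setup discussed above avoids.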