Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
All, I have more of a general Scala JSON question. I have set up a notification on the S3 source bucket that triggers a Lambda function to unzip the new file placed there. Then, it saves the unzipped CSV file into another destination bucket where a notification is sent to an SQS queue. The conte
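For readers following the thread, here is a minimal sketch of one way to pull the bucket name and object key out of an S3 event notification body received from SQS. It assumes the bucket notifies SQS directly (so the body is the S3 event JSON with a top-level `Records` array) and uses json4s, which ships with Spark. The object name `S3EventParser`, the case class `S3ObjectRef`, and the sample message are hypothetical, not taken from the thread.

```scala
import java.net.URLDecoder

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical holder for the two fields we care about.
case class S3ObjectRef(bucket: String, key: String)

object S3EventParser {
  implicit val formats: Formats = DefaultFormats

  // Extract (bucket, key) pairs from an S3 event notification body.
  // Object keys arrive URL-encoded in the event, so they are decoded here.
  def objectsIn(body: String): List[S3ObjectRef] = {
    val json = parse(body)
    (json \ "Records").children.map { record =>
      val bucket = (record \ "s3" \ "bucket" \ "name").extract[String]
      val rawKey = (record \ "s3" \ "object" \ "key").extract[String]
      S3ObjectRef(bucket, URLDecoder.decode(rawKey, "UTF-8"))
    }
  }

  def main(args: Array[String]): Unit = {
    // Trimmed-down sample; a real event carries many more fields.
    val sample =
      """{"Records":[{"eventName":"ObjectCreated:Put",
        |"s3":{"bucket":{"name":"incoming"},
        |"object":{"key":"2016/04/09/data.csv","size":1024}}}]}""".stripMargin
    objectsIn(sample).foreach(println)
  }
}
```

If the notification is routed through SNS before reaching SQS, the S3 event is wrapped in an SNS envelope and would need to be unwrapped first; that case is not handled in this sketch.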

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Ah, I spoke too soon. I thought the SQS part was going to be a Spark package. It looks like it has to be compiled into a jar for use. Am I right? Can someone help with this? I tried to compile it using SBT, but I’m stuck with a SonatypeKeys not found error. If there’s an easier alternative, please

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This was easy! I just created a notification on a source S3 bucket to kick off a Lambda function that would decompress the dropped file and save it to another S3 bucket. In turn, this S3 bucket has a notification to send an SNS message to me via email. I can just as easily set up SQS to be the

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Gourav Sengupta
Why not use AWS Lambda?
Regards,
Gourav
On Fri, Apr 8, 2016 at 8:14 PM, Benjamin Kim wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,
> Ben
>
>
> --

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
Natu, Benjamin, With this mechanism you can configure notifications for *buckets* (if you only care about some key prefixes you can take a look at object key name filtering, see the docs) for various event types, and then these events can be published to SNS, SQS or Lambdas. I think using SQS as a

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
> Do you know if textFileStream can see if new files are created underneath a whole bucket?
Only at the level of the folder that you specify. It does not descend into subfolders. So your approach would be to detect everything under a path such as s3://bucket/path/2016040902_data.csv
> Also, will Spark Streaming not p
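To make the "only the folder you specify" behavior concrete, here is a small sketch that models which paths count as direct children of a monitored directory. It only illustrates the rule described above and is not Spark's actual implementation; the object name `DirectChildren` and the example paths are hypothetical.

```scala
object DirectChildren {
  // True when `path` sits immediately inside `dir`, with no further
  // subfolder level in between.
  def isDirectChild(dir: String, path: String): Boolean = {
    val prefix = if (dir.endsWith("/")) dir else dir + "/"
    path.startsWith(prefix) && {
      val rest = path.substring(prefix.length)
      rest.nonEmpty && !rest.contains("/")
    }
  }

  def main(args: Array[String]): Unit = {
    val monitored = "s3n://incoming/2016/04/09"
    val candidates = Seq(
      "s3n://incoming/2016/04/09/data.csv",       // directly inside the folder
      "s3n://incoming/2016/04/09/00/00/data.csv", // nested in subfolders
      "s3n://incoming/2016/04/10/data.csv"        // different folder
    )
    candidates.foreach { p =>
      println(s"$p -> ${isDirectChild(monitored, p)}")
    }
  }
}
```

Under this model, a layout such as 2016/04/09/00/00/01/data.csv would need either one stream per leaf folder or a flat naming scheme like the 2016040902_data.csv example.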

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This is awesome! I have someplace to start from.
Thanks,
Ben
> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
>
> Someone please correct me if I am wrong as I am still rather green to spark,
> however it appears that through the S3 notification mechanism described
> below, you

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread programminggeek72
Someone please correct me if I am wrong, as I am still rather green to Spark; however, it appears that through the S3 notification mechanism described below, you can publish events to SQS and use SQS as a streaming source into Spark. The project at https://github.com/imapi/spark-sqs-receiver appea

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Nezih, This looks like a good alternative to having the Spark Streaming job check for new files on its own. Do you know if there is a way to have the Spark Streaming job get notified with the new file information and act upon it? This can reduce the overhead and cost of polling S3. Plus, I can

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Natu, Do you know if textFileStream can see if new files are created underneath a whole bucket? For example, if the bucket name is incoming and new files underneath it are 2016/04/09/00/00/01/data.csv and 2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will Spark Streaming n

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Can you elaborate a bit more on your approach using S3 notifications? Just curious; I am dealing with a similar issue right now that might benefit from this.
On 09 Apr 2016 9:25 AM, "Nezih Yigitbasi" wrote:
> While it is doable in Spark, S3 also supports notifications:
> http://docs.aws.amazon.com/Am

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Nezih Yigitbasi
While it is doable in Spark, S3 also supports notifications: http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande wrote:
> Hi Benjamin,
>
> I have done it . The critical configuration items are the ones below :
>
> ssc.sparkCo

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Natu Lauchande
Hi Benjamin,
I have done it. The critical configuration items are the ones below:
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
    ssc.spar
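For context, here is a minimal sketch of how settings like these could fit into a complete job that watches one S3 folder with textFileStream. It is not the configuration from the message above, which is cut off. The fs.s3n.awsSecretAccessKey key is the standard Hadoop counterpart to the access key setting and is an assumption about what followed; the bucket path, the 30-second batch interval, the environment variable names, and the object name `S3TextFileStreamSketch` are all hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3TextFileStreamSketch {
  // Hadoop settings for the s3n filesystem. The first two keys appear in
  // the thread; the secret key entry is an assumption.
  def s3nSettings(accessKeyId: String, secretAccessKey: String): Map[String, String] =
    Map(
      "fs.s3n.impl" -> "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
      "fs.s3n.awsAccessKeyId" -> accessKeyId,
      "fs.s3n.awsSecretAccessKey" -> secretAccessKey
    )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("S3TextFileStreamSketch")
      .setIfMissing("spark.master", "local[2]") // only used when no master is supplied

    // Batch interval is an arbitrary choice for this sketch.
    val ssc = new StreamingContext(conf, Seconds(30))

    // Credentials are read from the environment; sys.env throws if a
    // variable is missing.
    s3nSettings(sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY"))
      .foreach { case (k, v) => ssc.sparkContext.hadoopConfiguration.set(k, v) }

    // Hypothetical folder; only files created directly in it are picked up.
    val lines = ssc.textFileStream("s3n://incoming/2016/04/09/")
    lines.count().print() // print the number of new lines in each batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Running this requires the Spark Streaming and Hadoop S3 libraries on the classpath, and each batch lists the monitored folder, which is the polling cost discussed elsewhere in the thread.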