Why can't you keep a persistent queue of S3 files to process? The process you are running has two threads.

Thread 1: Continuously receives SQS messages and writes them to the queue. This queue is persisted to reliable storage, like HDFS / S3.

Thread 2: Peeks into the queue, and whenever there is a message, starts a Spark job to process the corresponding files. Once the Spark job is over, it dequeues those files from the queue.

If this process fails in the middle of processing some files, it can be restarted, and thread 2 will start processing the files still in the queue again (since they were never dequeued, as the jobs had not finished successfully).
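Here is a minimal sketch of that two-thread design. fetchSqsMessages(), persistQueue(), restoreQueue() and processWithSpark() are hypothetical placeholders for the SQS client calls, the HDFS/S3 persistence of the queue, and the Spark job; the point is only to show where the dequeue happens relative to job completion.

import java.util.concurrent.ConcurrentLinkedQueue

object SqsDrivenProcessor {

  // Hypothetical helpers -- stand-ins for the real SQS client, the
  // reliable-storage (HDFS/S3) persistence of the queue, and the Spark job.
  def fetchSqsMessages(): Seq[String] = Seq.empty   // poll SQS, return S3 file paths
  def persistQueue(files: Seq[String]): Unit = ()   // write queue state to HDFS/S3
  def restoreQueue(): Seq[String] = Seq.empty       // reload queue state on restart
  def processWithSpark(file: String): Unit = ()     // run the Spark job on one file

  val pending = new ConcurrentLinkedQueue[String]()

  def snapshot(): Seq[String] = pending.toArray(Array.empty[String]).toSeq

  def main(args: Array[String]): Unit = {
    // On restart, anything still in the persisted queue was not fully
    // processed, so it gets processed again (at-least-once).
    restoreQueue().foreach(pending.add)

    // Thread 1: continuously receive SQS messages and append them to the
    // persisted queue.
    val poller = new Thread(new Runnable {
      def run(): Unit = while (true) {
        fetchSqsMessages().foreach(pending.add)
        persistQueue(snapshot())
        Thread.sleep(5000)
      }
    })
    poller.setDaemon(true)
    poller.start()

    // Thread 2 (main loop): peek at the head of the queue, run the Spark
    // job, and only dequeue + re-persist once the job finished successfully.
    while (true) {
      Option(pending.peek()) match {
        case Some(file) =>
          processWithSpark(file)
          pending.poll()
          persistQueue(snapshot())
        case None => Thread.sleep(1000)
      }
    }
  }
}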
The file input stream in Spark Streaming essentially does this same thing, so you can either implement this directly, or you can subclass FileInputDStream and override the functions that find new files to process. There you can start a separate thread that listens for and queues SQS messages. Then, when the compute() method is called after every batch interval, the queued SQS messages should be processed and RDDs generated from the files they point to (a rough sketch of such a DStream is at the end of this message). This new input DStream may be worth a try, but I feel there will be corner cases that are harder to deal with in this architecture, and easier to handle in the custom architecture I stated first. Corner cases like: what if an SQS message is received but the process fails before creating RDDs out of the corresponding files (in Spark Streaming there could be a delay of up to one batch interval between those two steps)? When restarted, can you fetch those messages again?

Hope this helps!

TD


On Wed, Aug 6, 2014 at 12:13 AM, lalit1303 <la...@sigmoidanalytics.com> wrote:
> Hi TD,
>
> Thanks a lot for your reply :)
> I am already looking into creating a new DStream for SQS messages. It
> would be very helpful if you could provide some guidance on that.
>
> The main motive of integrating SQS with Spark Streaming is to make my
> jobs run in high availability. As of now I have a downloader, which
> downloads the files pointed to by SQS messages, and then my Spark
> Streaming job comes into action to process them. I am planning to move
> the whole architecture to high availability (the Spark Streaming job can
> easily be shifted to high availability); the only piece left is to
> integrate SQS with Spark Streaming such that it can automatically recover
> from master node failure. Also, I want to make a single pipeline, from
> getting the SQS message to the processing of the corresponding file.
>
> I couldn't think of any other approach to make my SQS downloader run in
> high availability mode. The only thing I need is to create a DStream
> which reads SQS messages from the corresponding queue.
>
> Please let me know if there is any other workaround.
>
> Thanks
> --
> Lalit Yadav
> la...@sigmoidanalytics.com
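P.S. Here is the rough sketch of the custom DStream idea mentioned above. Note that FileInputDStream itself is package-private in Spark, so this sketch extends the public InputDStream directly instead of subclassing FileInputDStream; pollSqs() is a hypothetical placeholder for the actual SQS receive call, and none of the recovery corner cases discussed above are handled here.

import scala.collection.mutable

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// A driver-side SQS-backed input stream: a background thread queues the
// S3 paths pointed to by SQS messages, and compute() turns whatever has
// accumulated in the last batch interval into an RDD.
class SqsFileInputDStream(ssc_ : StreamingContext)
  extends InputDStream[String](ssc_) {

  private val newFiles = new mutable.Queue[String]()
  @volatile private var running = false

  // Placeholder for the real AWS SDK call that receives SQS messages and
  // extracts the S3 file paths they point to.
  private def pollSqs(): Seq[String] = Seq.empty

  override def start(): Unit = {
    running = true
    new Thread(new Runnable {
      def run(): Unit = while (running) {
        val paths = pollSqs()
        newFiles.synchronized { newFiles ++= paths }
        Thread.sleep(1000)
      }
    }).start()
  }

  override def stop(): Unit = { running = false }

  // Called once per batch interval: drain the queued paths and build an
  // RDD over the corresponding files. Paths queued but not yet computed
  // are lost if the driver fails here -- the corner case discussed above.
  override def compute(validTime: Time): Option[RDD[String]] = {
    val batch = newFiles.synchronized { newFiles.dequeueAll(_ => true) }
    if (batch.isEmpty) None
    else {
      val sc = context.sparkContext
      Some(sc.union(batch.map(path => sc.textFile(path))))
    }
  }
}

As noted above, if the driver dies between pollSqs() and the next compute(), those paths are gone unless the SQS messages can be fetched again, which is why I would lean toward the explicit persistent queue from the first approach.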