Why can't you keep a persistent queue of S3 files to process? The process
that you are running has two threads:
Thread 1: Continuously receives SQS messages and writes them to the queue.
This queue is persisted to reliable storage, like HDFS / S3.
Thread 2: Peeks into the queue, and whenever there is a message, starts a
Spark job to process the corresponding file. Once the Spark job is over, it
dequeues that file from the queue.
If this process fails in the middle of processing some files, it can be
restarted and thread 2 will start processing the files in the queue again
(since they were not dequeued, as the jobs had not finished successfully).
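
For concreteness, here is a rough sketch of that two-thread design (my own
illustration, not tested code). It assumes a recent AWS SDK v1 SQS client and
a plain SparkContext; the queue URL and the textFile-based processing are
placeholders, and the queue here is kept in memory, with comments marking
where it would have to be written through to HDFS / S3 so that it survives a
restart.

import java.util.concurrent.LinkedBlockingQueue
import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.apache.spark.{SparkConf, SparkContext}

object SqsDrivenProcessor {
  // Hypothetical queue URL; replace with your own.
  val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

  // Pending S3 paths. In a real deployment this queue must be written through
  // to reliable storage (HDFS / S3) so that it survives a driver restart.
  val pending = new LinkedBlockingQueue[String]()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sqs-driven"))
    val sqs = AmazonSQSClientBuilder.defaultClient()

    // Thread 1: continuously receive SQS messages and enqueue the S3 paths.
    val receiver = new Thread(new Runnable {
      def run(): Unit = while (true) {
        val msgs = sqs.receiveMessage(
          new ReceiveMessageRequest(queueUrl).withWaitTimeSeconds(20)).getMessages.asScala
        msgs.foreach { m =>
          pending.put(m.getBody)       // assume the message body is an s3:// path
          // persist the queue to HDFS / S3 here before deleting the message
          sqs.deleteMessage(queueUrl, m.getReceiptHandle)
        }
      }
    })
    receiver.setDaemon(true)
    receiver.start()

    // Thread 2 (here, the main thread): peek at the queue, run a Spark job per
    // file, and dequeue only once the job has finished successfully.
    while (true) {
      val path = pending.peek()
      if (path != null) {
        val count = sc.textFile(path).count()  // stand-in for the real processing
        println(s"Processed $path: $count records")
        pending.take()                          // dequeue only after success
        // persist the updated queue to HDFS / S3 here
      } else {
        Thread.sleep(1000)
      }
    }
  }
}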

The file input stream in Spark Streaming essentially does the same thing,
so you can either implement this directly, or you can subclass
FileInputDStream and override the functions that find new files to
process. There you can start a separate thread that listens for and queues
SQS messages. Then, when the compute() method is called after every batch
interval, the queued SQS messages should be processed and RDDs generated
from the files they point to.
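
As a rough illustration of the DStream route, here is my own sketch against
Spark's public InputDStream API rather than the actual FileInputDStream
internals; the class name is made up, the SQS polling is left as a stub, and
the per-batch RDD construction is only a placeholder.

import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class SqsFileInputDStream(ssc_ : StreamingContext)
  extends InputDStream[String](ssc_) {

  // File paths announced by SQS since the last batch, filled by a polling thread.
  private val announced = new ConcurrentLinkedQueue[String]()
  @transient private var poller: Thread = _

  override def start(): Unit = {
    poller = new Thread(new Runnable {
      def run(): Unit = while (!Thread.currentThread().isInterrupted) {
        // Receive SQS messages here and call announced.add(s3Path) for each one.
        Thread.sleep(1000)
      }
    })
    poller.setDaemon(true)
    poller.start()
  }

  override def stop(): Unit = if (poller != null) poller.interrupt()

  // Called once per batch interval: drain the announced paths and build an RDD.
  override def compute(validTime: Time): Option[RDD[String]] = {
    val paths = new ArrayBuffer[String]()
    var p = announced.poll()
    while (p != null) { paths += p; p = announced.poll() }
    if (paths.isEmpty) None
    else Some(paths.map(context.sparkContext.textFile(_)).reduce(_ union _))
  }
}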

This new input DStream may be worth a try, but I feel that there will be
corner cases which are harder to deal with in this architecture, yet can
be handled by the custom architecture I stated first. Corner cases like:
what if an SQS message is received but the process fails before RDDs are
created out of the files it points to (there could be a delay of up to a
batch interval between those two steps in Spark Streaming)? When
restarted, can you refetch those messages once again?

Hope this helps!

TD




On Wed, Aug 6, 2014 at 12:13 AM, lalit1303 <la...@sigmoidanalytics.com>
wrote:

> Hi TD,
>
> Thanks a lot for your reply :)
> I am already looking into creating a new DStream for SQS messages. It would
> be very helpful if you could provide some guidance on this.
>
> The main motive for integrating SQS with Spark Streaming is to make my jobs
> run with high availability.
> As of now I have a downloader, which downloads the files pointed to by SQS
> messages, and then my Spark Streaming job comes into action to process them.
> I am planning to move the whole architecture to high availability (the Spark
> Streaming job can easily be shifted to high availability); the only piece
> left is to integrate SQS with Spark Streaming so that it can automatically
> recover from master node failure. Also, I want to make a single pipeline,
> starting from getting the SQS message to the processing of the corresponding
> file.
>
> I couldn't think of any other approach to make my SQS downloader run in
> high availability mode. The only missing piece is to create a DStream which
> reads SQS messages from the corresponding queue.
>
> Please let me know if there is any other work around.
>
> Thanks
> -- Lalit
>
>
>
> -----
> Lalit Yadav
> la...@sigmoidanalytics.com
>
