On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <ashokkumar.rajend...@gmail.com> wrote:
> Hi Folks,
>
> I am exploring Spark for streaming from two sources, (a) Kinesis and (b)
> HDFS, for some of our use cases. Since we maintain state gathered over the
> last x hours in Spark Streaming, we would like to replay the data from the
> last x hours as batches during deployment. I have gone through the Spark
> APIs but could not find anything that starts from an older timestamp.
> Appreciate your input on the same.
>
>    1. Restarting with checkpointing runs the batches faster for the missed
> period, but when we upgrade with new code, the same checkpoint directory
> cannot be reused.

=> It is true that you won't be able to reuse the checkpoint when you upgrade your code, but production code is not upgraded every now and then. You can basically create a configuration file and keep most of the settings there (streaming duration, parameters, etc.) instead of hard-coding them, so that updating them does not break the checkpoint.

>    2. For the case with Kinesis as the source, we can change the last
> checked sequence number in DynamoDB to get the data from the last x hours,
> but this will be one large bunch of data for the first restarted batch, so
> the data is not processed as natural multiple batches inside Spark.

=> The Kinesis API has a way to limit the data rate; you might want to look into that and implement a custom receiver for your use case:
http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html

>    3. For the source from HDFS, I could not find any alternative to start
> the streaming from old timestamp data, unless I manually (or with a script)
> rename the old files after starting the stream. (This workaround leads to
> further complications too.)

=> What you can do is, once the files are processed, move them to a different directory, and when you restart the stream for whatever reason, make it pick up all the files instead of only the latest ones (by passing the *newFilesOnly* boolean param).

>    4. May I know how zero data loss is achieved while having HDFS as the
> source? i.e. if the driver fails while processing a micro-batch, what
> happens when the application is restarted? Is the same micro-batch
> reprocessed?

=> Yes, if the application is restarted then the micro-batch will be reprocessed.

> Regards
> Ashok
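To illustrate the configuration-file approach from point 1, here is a minimal sketch in Python. The file format (JSON) and the parameter names are made up for illustration; the idea is only that the loaded values would be fed into the StreamingContext setup instead of being hard-coded, so tuning them does not require changing (and thus invalidating) the checkpointed code.

```python
import json
import tempfile

# Hypothetical job settings kept outside the code so they can be tuned
# without editing the streaming application itself.
CONFIG = """
{
    "batch_duration_secs": 10,
    "checkpoint_dir": "hdfs:///checkpoints/my-stream",
    "window_duration_secs": 3600
}
"""

def load_streaming_config(path):
    # Read the job parameters from a JSON file; the caller would pass
    # these into StreamingContext creation and the DStream setup.
    with open(path) as f:
        return json.load(f)

# Write the sample config to a temp file and load it back.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(CONFIG)
    path = f.name

conf = load_streaming_config(path)
print(conf["batch_duration_secs"])   # → 10
```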
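To illustrate the rate-limiting idea from point 2 without depending on the actual Kinesis client, here is a toy sketch in Python of a fetch loop that returns at most `limit` records per call, mirroring the Limit parameter of the GetRecords API linked above. The `make_stream` and `get_records` names are invented for this sketch, not real Kinesis calls.

```python
from collections import deque

def make_stream(records):
    # Stand-in for a Kinesis shard: a queue of pending records.
    return deque(records)

def get_records(stream, limit):
    # Mimics GetRecords with its Limit parameter: return at most
    # `limit` records per call, so a large replayed backlog is consumed
    # in batch-sized chunks instead of one huge first batch.
    batch = []
    while stream and len(batch) < limit:
        batch.append(stream.popleft())
    return batch

# A backlog of 10 records replayed as batches of at most 4.
stream = make_stream(range(10))
batches = []
while stream:
    batches.append(get_records(stream, limit=4))

print([len(b) for b in batches])   # → [4, 4, 2]
```

A custom receiver built this way would call `store()` on each bounded batch, letting Spark Streaming process the replayed history as several natural micro-batches.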