Offsets are stored in the checkpoint. If you want to manage offsets yourself, don't restart from the checkpoint; specify the starting offsets when you create the stream.
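A minimal sketch of that pattern against the Spark 1.x direct Kafka API; loadOffsets and saveOffsets here are hypothetical helpers backed by your own offset store (ZooKeeper, a database, etc.):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Hypothetical helpers backed by your own offset store.
def loadOffsets(): Map[TopicAndPartition, Long] = ???
def saveOffsets(ranges: Array[OffsetRange]): Unit = ???

def createStream(ssc: StreamingContext, kafkaParams: Map[String, String]) = {
  // Start from your stored offsets, not the checkpoint's.
  val fromOffsets = loadOffsets()
  val messageHandler =
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder, (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Write the batch to the sink first, then record offsets:
    // if the save fails you re-read the batch, giving at-least-once.
    saveOffsets(ranges)
  }
  stream
}

Saving offsets only after the sink write succeeds is what gives you at-least-once rather than at-most-once.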
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once ?

Regarding the small files problem, either don't use HDFS, or use
something like filecrush for merging (see the coalesce sketch after the
quoted message below).

On Fri, Jan 22, 2016 at 3:03 AM, Raju Bairishetti <r...@apache.org> wrote:

> Hi,
>
> I am very new to Spark & Spark Streaming. I am planning to use Spark
> Streaming for real-time processing.
>
> I have created a streaming context and am checkpointing to an HDFS
> directory for recovery in case of executor and driver failures.
>
> I am creating a DStream with an offset map for getting the data from
> Kafka. I am simply ignoring the offsets to understand the behavior.
> Whenever I restart the application, the driver is restored from the
> checkpoint as expected, but the DStream does not start from the initial
> offsets. The DStream was created with the last consumed offsets instead
> of starting from offset 0 for each topic partition, even though I am not
> storing the offsets anywhere.
>
> def main(): Unit = {
>
>   val sparkStreamingContext =
>     StreamingContext.getOrCreate(SparkConstants.CHECKPOINT_DIR_LOCATION,
>       () => creatingFunc())
>
>   ...
> }
>
> def creatingFunc(): StreamingContext = {
>
>   ...
>
>   val offsets: Map[TopicAndPartition, Long] =
>     Map(TopicAndPartition("sample_sample3_json", 0) -> 0L)
>
>   KafkaUtils.createDirectStream[String, String, StringDecoder,
>     StringDecoder, String](sparkStreamingContext, kafkaParams, offsets,
>     messageHandler)
>
>   ...
> }
>
> I want control over offset management at the event level instead of the
> RDD level, to ensure at-least-once delivery to the end system.
>
> As per my understanding, every RDD or RDD partition will be stored in
> HDFS as a file if I choose HDFS as the output. If I use 1 sec as the
> batch interval, I will end up with a huge number of small files in HDFS,
> and having many small files in HDFS leads to lots of other issues.
> Is there any way to write multiple RDDs into a single file? I don't have
> much idea about *coalesce* usage. In the worst case, I can merge all the
> small files in HDFS at regular intervals.
>
> Thanks...
>
> ------
> Thanks
> Raju Bairishetti
> www.lazada.com
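On the *coalesce* question above: coalesce only reduces the number of
files within a single batch, so merging across batches still needs a
compaction step like filecrush. A sketch, assuming a
DStream[(String, String)] named stream and a made-up HDFS output path:

stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // coalesce(1) funnels the batch's partitions into a single output
    // file, at the cost of writing through one task.
    rdd.map { case (_, value) => value }
       .coalesce(1)
       .saveAsTextFile(s"hdfs:///tmp/output/batch-${time.milliseconds}")
  }
}

With a 1 sec batch interval this still produces one file per second, so
a longer interval or periodic compaction is probably unavoidable.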