Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Jungtaek Lim
Replied inline: On Sun, May 3, 2020 at 6:25 PM Magnus Nilsson wrote: > Thank you, so that would mean spark gets the current latest offset(s) when > the trigger fires and then process all available messages in the topic upto > and including that offset as long as maxOffsetsPerTrigger is the defau

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Magnus Nilsson
Thank you, so that would mean spark gets the current latest offset(s) when the trigger fires and then process all available messages in the topic upto and including that offset as long as maxOffsetsPerTrigger is the default of None (or large enought to handle all available messages). I think the w

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Jungtaek Lim
If I understand correctly, Trigger.once executes only one micro-batch and terminates, that's all. Your understanding of structured streaming applies there as well. It's like a hybrid approach as bringing incremental processing from micro-batch but having processing interval as batch. That said, wh

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-02 Thread Magnus Nilsson
I've always had a question about Trigger.Once that I never got around to ask or test for myself. If you have a 24/7 stream to a Kafka topic. Will Trigger.Once get the last offset(s) when it starts and then quit once it hits this offset(s) or will the job run until no new messages is added to the t

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Thanks Burak! Appreciate it. This makes sense. How do you suggest we make sure resulting data doesn't produce tiny files? If we are not on databricks yet and can not leverage delta lake features? Also checkpointing feature, do you have active blog/article I can take a look at to try out an example

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
Hi Rishi, That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details! https://databri

[spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Rishi Shah
Hi All, I recently started playing with spark streaming, and checkpoint location feature looks very promising. I wonder if anyone has an opinion about using spark streaming with checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming w

Spark Streaming: Checkpoint, Recovery and Idempotency

2019-05-29 Thread sheelstera
Hello, I am trying to understand the content of a checkpoint and corresponding recovery. *My understanding of Spark Checkpointing: * If you have really long DAGs and your spark cluster fails, checkpointing helps by persisting intermediate state e.g. to HDFS. So, a DAG of 50 transformations can be

Spark Streaming checkpoint

2018-01-29 Thread KhajaAsmath Mohammed
Hi, I have written spark streaming job to use the checkpoint. I have stopped the streaming job for 5 days and then restart it today. I have encountered weird issue where it shows as zero records for all cycles till date. is it causing data loss? [image: Inline image 1] Thanks, Asmath

Spark 2.1.2 Spark Streaming checkpoint interval not respected

2017-11-18 Thread Shing Hing Man
Hi, In the following example using mapWithState, I set checkpoint interval to 1 minute. From the log, Spark stills write to the checkpoint directory every second. Would be appreciated if someone can point out what I have done wrong. object MapWithStateDemo { def main(args: Array[String]) {

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
ing >> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> >> instead. >> >> On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan >> wrote: >> >>> I am using Spark Streaming Checkpoint and Kafka Direct Stream. >>> It

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread ALunar Beach
ommend > learning Structured Streaming > <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> > instead. > > On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan > wrote: > >> I am using Spark Streaming Checkpoint and Kafka Direct Stream.

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
:16 AM, anbucheeralan wrote: > I am using Spark Streaming Checkpoint and Kafka Direct Stream. > It uses a 30 sec batch duration and normally the job is successful in > 15-20 sec. > > If the spark application fails after the successful completion > (149668428ms in the log b

Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming Checkpoint and Kafka Direct Stream. It uses a 30 sec batch duration and normally the job is successful in 15-20 sec. If the spark application fails after the successful completion (149668428ms in the log below) and restarts, it's duplicating the last batch

[Spark Streaming] Checkpoint backup (.bk) file purpose

2017-03-16 Thread Bartosz Konieczny
Hello, Actually I'm studying metadata checkpoint implementation in Spark Streaming and I was wondering the purpose of so called "backup files": CheckpointWriter snippet: > // We will do checkpoint when generating a batch and completing a batch. > When the processing > // time of a batch is great

Can Spark Streaming checkpoint only metadata ?

2016-06-21 Thread Natu Lauchande
Hi, I wonder if it is possible to checkpoint only metadata and not the data in RDD's and dataframes. Thanks, Natu

Re: Question about Spark Streaming checkpoint interval

2015-12-18 Thread Shixiong Zhu
You are right. "checkpointInterval" is only for data checkpointing. "metadata checkpoint" is done for each batch. Feel free to send a PR to add the missing doc. Best Regards, Shixiong Zhu 2015-12-18 8:26 GMT-08:00 Lan Jiang : > Need some clarification about the documentation. According to Spark

Question about Spark Streaming checkpoint interval

2015-12-18 Thread Lan Jiang
Need some clarification about the documentation. According to Spark doc "the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream

Re: Spark Streaming Checkpoint help failed application

2015-11-11 Thread Gideon
ll be the Kafka offsets I hope this helps :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpoint-help-failed-application-tp25347p25357.html Sent from the Apache Spark User Lis

Re: Spark streaming checkpoint against s3

2015-10-15 Thread Tian Zhang
So as long as jar is kept on s3 and available across different runs, then the s3 checkpoint is working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-checkpoint-against-s3-tp25068p25081.html Sent from the Apache Spark User List mailing

Re: Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-checkpoint-against-s3-tp25068p25070.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Spark streaming checkpoint against s3

2015-10-14 Thread Tian Zhang
Hi, I am trying to set spark streaming checkpoint to s3, here is what I did basically val checkpoint = "s3://myBucket/checkpoint" val ssc = StreamingContext.getOrCreate(checkpointDir, () => getStreamingContext

Re: Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread Saisai Shao
) >> at >> >> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349) >> at >> >> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStrea

Re: Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread Ted Yu
.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399) > at > > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344) > at > > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DSt

Spark Streaming checkpoint recovery throws Stack Overflow Error

2015-09-18 Thread swetha
$1.apply(DStream.scala:342) at scala.Option.orElse(Option.scala:257) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-throws-Stack-Overflow-Error-tp24737.html Sent from the Apache Spark User List mailing list

Re: Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Cody Koeninger
Access the offsets using HasOffsetRanges, save them in your datastore, provide them as the fromOffsets argument when starting the stream. See https://github.com/koeninger/kafka-exactly-once On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey wrote: > When deploying a spark streaming application I w

Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Stephen Durfey
When deploying a spark streaming application I want to be able to retrieve the lastest kafka offsets that were processed by the pipeline, and create my kafka direct streams from those offsets. Because the checkpoint directory isn't guaranteed to be compatible between job deployments, I don't want t

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
1. Same way, using static fields in a class. 2. Yes, same way. 3. Yes, you can do that. To differentiate from "first time" v/s "continue", you have to build your own semantics. For example, if the location in HDFS you are suppose to store the offsets does not have any data, that means its probably

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
1.How to do it in java? 2.Can broadcast objects also be created in same way after checkpointing. 3.Is it safe If I disable checkpoint and write offsets at end of each batch to hdfs in mycode and somehow specify in my job to use this offset for creating kafkastream at first time. How can I specify

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
Rather than using accumulator directly, what you can do is something like this to lazily create an accumulator and use it (will get lazily recreated if driver restarts from checkpoint) dstream.transform { rdd => val accum = SingletonObject.getOrCreateAccumulator() // single object method to

broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Shushant Arora
Hi I am using spark streaming 1.3 and using checkpointing. But job is failing to recover from checkpoint on restart. For broadcast variable it says : 1.WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP): java.lang.ClassCastException: [B cannot be cast to pkg.broadcastvariableclas

Re: spark streaming - checkpoint

2015-06-29 Thread ram kumar
on using yarn-cluster, it works good On Mon, Jun 29, 2015 at 12:07 PM, ram kumar wrote: > SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* > in spark-env.sh > > I think i am facing the same issue > https://issues.apache.org/jira/browse/SPARK-6203 > > > > On Mon, Jun 29, 2015 a

Re: spark streaming - checkpoint

2015-06-28 Thread ram kumar
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* in spark-env.sh I think i am facing the same issue https://issues.apache.org/jira/browse/SPARK-6203 On Mon, Jun 29, 2015 at 11:38 AM, ram kumar wrote: > I am using Spark 1.2.0.2.2.0.0-82 (git revision de12451) built for Hadoop

Re: spark streaming - checkpoint

2015-06-27 Thread Tathagata Das
Do you have SPARK_CLASSPATH set in both cases? Before and after checkpoint? If yes, then you should not be using SPARK_CLASSPATH, it has been deprecated since Spark 1.0 because of its ambiguity. Also where do you have spark.executor.extraClassPath set? I dont see it in the spark-submit command. On

spark streaming - checkpoint

2015-06-26 Thread ram kumar
Hi, - JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1)); ssc.checkpoint(checkPointDir); JavaStreamingContextFactory factory = new JavaStreamingContextFactory() { public JavaStreamingContext create() {

Re: Spark Streaming checkpoint recovery causes IO re-execution

2015-01-20 Thread RodrigoB
bly not able to give you too many details right now, but I expect to have a concrete solution on which ultimately I could push as proposal to the Spark dev team. I will definitely notify people on this thread at least. Tnks, Rod -- View this message in context: http://apache-spark-user-

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-31 Thread RodrigoB
27;t make sense. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13205.html Sent from the Apache Spark User List mailing list archiv

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-29 Thread Yana Kadiyska
covery. If > these lambda functions send events to other systems these events would get > resent upon re-computation causing overall system instability. > > Hope this helps you understand the problematic. > > tnks, > Rod > > > > -- > View this message in context: >

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
understand the problematic. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13043.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread Yana Kadiyska
ng with external > entities. > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13009.html > Sent from the Apache Spark User List mailing list archive at Na

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread GADV
ode is running for the first time or during a recovery so I can avoid to update the database again. More generally I want to know this in case I'm interacting with external entities. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkp

Re: Spark Streaming - What does Spark Streaming checkpoint?

2014-08-21 Thread Chris Fregly
The StreamingContext can be recreated from a checkpoint file, indeed. check out the following Spark Streaming source files for details: StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph. On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang wrote: > Hi guys, > > I am a little co

Spark Streaming checkpoint recovery causes IO re-execution

2014-08-21 Thread RodrigoB
1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additiona

Re: Spark Streaming Checkpoint: SparkContext is not serializable class

2014-07-30 Thread RodrigoB
a cached rdd on following operations. Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Checkpoint-SparkContext-is-not-serializable-class-tp10456p10985.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark Streaming - What does Spark Streaming checkpoint?

2014-07-09 Thread Yan Fang
Hi guys, I am a little confusing by the checkpointing in Spark Streaming. It checkpoints the intermediate data for the stateful operations for sure. Does it also checkpoint the information of StreamingContext? Because it seems we can recreate the SC from the checkpoint in a driver node failure sce