Replied inline:
On Sun, May 3, 2020 at 6:25 PM Magnus Nilsson wrote:
> Thank you, so that would mean Spark gets the current latest offset(s) when
> the trigger fires and then processes all available messages in the topic up
> to and including that offset, as long as maxOffsetsPerTrigger is the default
> of None (or large enough to handle all available messages).
Thank you, so that would mean Spark gets the current latest offset(s) when
the trigger fires and then processes all available messages in the topic up
to and including that offset, as long as maxOffsetsPerTrigger is the default
of None (or large enough to handle all available messages).
I think the w
If I understand correctly, Trigger.Once executes only one micro-batch and
then terminates; that's all. Your understanding of Structured Streaming
applies there as well.
It's like a hybrid approach: it brings the incremental processing of
micro-batches but runs on a batch-like schedule. That said, wh
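To make that concrete, here is a minimal sketch of a Trigger.Once query, assuming a Kafka source; the broker, topic, and paths are placeholders, not taken from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerOnceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TriggerOnceSketch").getOrCreate()

    // Source: whatever is available in the topic when the trigger fires.
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder topic
      .load()

    // Sink: a single micro-batch is planned and processed, then the query
    // stops. The checkpoint location is what lets the next run pick up from
    // the offsets this run finished at.
    val query = kafka.writeStream
      .format("parquet")
      .option("path", "/data/events")                      // placeholder
      .option("checkpointLocation", "/checkpoints/events") // placeholder
      .trigger(Trigger.Once())
      .start()

    query.awaitTermination() // returns once the single micro-batch completes
  }
}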
I've always had a question about Trigger.Once that I never got around to
asking or testing for myself. If you have a 24/7 stream to a Kafka topic,
will Trigger.Once get the last offset(s) when it starts and then quit once
it hits those offset(s), or will the job run until no new messages are added
to the topic?
Thanks Burak! Appreciate it. This makes sense.
How do you suggest we make sure the resulting data doesn't produce tiny
files if we are not on Databricks yet and cannot leverage Delta Lake
features? Also, for the checkpointing feature, do you have an active
blog/article I can take a look at to try out an example?
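Not discussed further in this thread, but one common way to limit small output files without Delta Lake is to coalesce each micro-batch before writing. A rough sketch, where the broker, topic, paths, and partition count are all placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object CompactOnWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CompactOnWrite").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    val query = input.writeStream
      .trigger(Trigger.Once())
      .option("checkpointLocation", "/checkpoints/compact") // placeholder
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Collapse each micro-batch into a small, fixed number of files.
        batch.coalesce(4).write.mode("append").parquet("/data/events-compacted") // placeholder
      }
      .start()

    query.awaitTermination()
  }
}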
Hi Rishi,
That is exactly why Trigger.Once was created for Structured Streaming. The
way we look at streaming is that it doesn't have to be always real-time or
24/7 always-on. We see streaming as a workflow that you have to repeat
indefinitely. See this blog post for more details!
https://databri
Hi All,
I recently started playing with Spark Streaming, and the checkpoint location
feature looks very promising. I wonder if anyone has an opinion about using
Spark Streaming with the checkpoint location option as a slow batch
processing solution. What would be the pros and cons of utilizing streaming w
Hello,
I am trying to understand the content of a checkpoint and corresponding
recovery.
My understanding of Spark checkpointing:
If you have really long DAGs and your Spark cluster fails, checkpointing
helps by persisting intermediate state, e.g. to HDFS. So, a DAG of 50
transformations can be recomputed from the persisted intermediate result
rather than from the very beginning.
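For reference, a minimal sketch of that kind of RDD checkpointing; the path and the toy pipeline are made up:

import org.apache.spark.sql.SparkSession

object RddCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddCheckpointSketch").getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint files go under this directory, e.g. on HDFS.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // placeholder

    // Stand-in for a long chain of transformations building a deep lineage.
    var rdd = sc.parallelize(1 to 1000000)
    for (_ <- 1 to 50) {
      rdd = rdd.map(_ + 1)
    }

    rdd.cache()      // avoids computing the RDD twice when it is checkpointed
    rdd.checkpoint() // materialized on the next action; lineage is truncated after that

    println(rdd.count()) // recomputation after a failure starts from the checkpoint files
  }
}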
Hi,
I have written a Spark Streaming job that uses checkpointing. I stopped
the streaming job for 5 days and then restarted it today.
I have encountered a weird issue where it shows zero records for all
cycles to date. Is it causing data loss?
[image: Inline image 1]
Thanks,
Asmath
Hi,
In the following example using mapWithState, I set the checkpoint interval
to 1 minute. From the log, Spark still writes to the checkpoint directory
every second. I would appreciate it if someone could point out what I have
done wrong.
object MapWithStateDemo {
  def main(args: Array[String]) {
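The posted snippet is cut off in the archive. Below is a hypothetical reconstruction of that kind of job, mainly to show which checkpoint writes the interval controls; the socket source, path, and 1-second batch interval are guesses, not taken from the original message:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

object MapWithStateDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapWithStateDemo")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Metadata checkpoints are written here on every batch, regardless of the
    // interval set below.
    ssc.checkpoint("/tmp/map-with-state-checkpoint") // placeholder

    val words = ssc.socketTextStream("localhost", 9999) // placeholder source
      .flatMap(_.split(" "))
      .map((_, 1))

    // Running count per word, kept in state.
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val stateStream = words.mapWithState(StateSpec.function(mappingFunc))
    // This interval only controls checkpointing of the state data itself,
    // which is why files can still appear in the directory every batch.
    stateStream.checkpoint(Minutes(1))

    stateStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}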
> I'd recommend learning Structured Streaming
> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
> instead.
>
> On Mon, Jun 5, 2017 at 11:16 AM, anbucheeralan wrote:
>
>> I am using Spark Streaming Checkpoint and Kafka Direct Stream.
I am using Spark Streaming Checkpoint and Kafka Direct Stream.
It uses a 30 sec batch duration and normally the job is successful in 15-20
sec.
If the Spark application fails after the successful completion
(149668428ms in the log below) and restarts, it duplicates the last
batch.
Hello,
Actually I'm studying the metadata checkpoint implementation in Spark
Streaming, and I was wondering about the purpose of the so-called "backup
files".
CheckpointWriter snippet:
> // We will do checkpoint when generating a batch and completing a batch.
> // When the processing time of a batch is great
Hi,
I wonder if it is possible to checkpoint only the metadata and not the data
in RDDs and DataFrames.
Thanks,
Natu
You are right. "checkpointInterval" is only for data checkpointing; the
metadata checkpoint is done for each batch. Feel free to send a PR to add
the missing doc.
Best Regards,
Shixiong Zhu
2015-12-18 8:26 GMT-08:00 Lan Jiang :
> Need some clarification about the documentation. According to Spark
Need some clarification about the documentation. According to the Spark doc,
"the default interval is a multiple of the batch interval that is at least 10
seconds. It can be set by using dstream.checkpoint(checkpointInterval).
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is
a good setting to try."
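As a concrete reading of that doc text, a small sketch with a 10-second batch interval and a 50-second data checkpoint interval; the source and path are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("CheckpointIntervalSketch"), Seconds(10)) // 10 s batches
    ssc.checkpoint("hdfs:///checkpoints/interval-demo") // placeholder

    val counts = ssc.socketTextStream("localhost", 9999) // placeholder source
      .map(word => (word, 1L))
      .updateStateByKey((values: Seq[Long], state: Option[Long]) =>
        Some(values.sum + state.getOrElse(0L)))

    // 5 sliding intervals of a 10 s DStream = 50 s between data checkpoints.
    counts.checkpoint(Seconds(50))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}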
will be the Kafka offsets.
I hope this helps :)
So as long as the jar is kept on S3 and available across different runs,
the S3 checkpoint works.
Hi, I am trying to set the Spark Streaming checkpoint to S3. Here is what I
did, basically:
val checkpointDir = "s3://myBucket/checkpoint"
val ssc = StreamingContext.getOrCreate(checkpointDir,
  () => getStreamingContext
)
>> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
>> at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStrea
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DSt
$1.apply(DStream.scala:342)
at scala.Option.orElse(Option.scala:257)
Access the offsets using HasOffsetRanges, save them in your datastore,
provide them as the fromOffsets argument when starting the stream.
See https://github.com/koeninger/kafka-exactly-once
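A rough sketch of that pattern, shown here with the kafka010 direct stream API; the broker, topic, group id, and the trivial load/save helpers are placeholders, and a real job would persist the offsets together with the batch's output:

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object OffsetsInOwnStore {
  // Placeholders: in practice these would read/write a real datastore.
  def loadOffsets(): Map[TopicPartition, Long] = Map.empty
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ranges.foreach(println)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("OffsetsInOwnStore"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offsets-in-own-store",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Start from the offsets we saved ourselves, not from the checkpoint.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams, loadOffsets())
    )

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process the batch ...
      saveOffsets(ranges) // persist after the work for this batch succeeds
    }

    ssc.start()
    ssc.awaitTermination()
  }
}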
On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey wrote:
> When deploying a spark streaming application I w
When deploying a Spark Streaming application I want to be able to retrieve
the latest Kafka offsets that were processed by the pipeline, and create
my Kafka direct streams from those offsets. Because the checkpoint
directory isn't guaranteed to be compatible between job deployments, I
don't want t
1. Same way, using static fields in a class.
2. Yes, same way.
3. Yes, you can do that. To differentiate "first time" vs. "continue",
you have to build your own semantics. For example, if the location in HDFS
where you are supposed to store the offsets does not have any data, that
means it's probably the first time.
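For example (not from this thread), a small helper along those lines that treats an empty or missing offsets directory as "first time"; the path handling is an assumption:

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

object OffsetStoreProbe {
  // True if the HDFS location used for offsets already has data, which (per
  // the reasoning above) means this is a restart rather than a first run.
  def hasSavedOffsets(sc: SparkContext, dir: String): Boolean = {
    val path = new Path(dir)
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    fs.exists(path) && fs.listStatus(path).nonEmpty
  }
}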
1. How to do it in Java?
2. Can broadcast objects also be created in the same way after checkpointing?
3. Is it safe if I disable checkpointing and write offsets at the end of each
batch to HDFS in my code, and somehow specify in my job to use these offsets
for creating the Kafka stream the first time? How can I specify
Rather than using the accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (it will get lazily recreated
if the driver restarts from a checkpoint):
dstream.transform { rdd =>
  val accum = SingletonObject.getOrCreateAccumulator() // singleton object method to
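Spelled out a bit more against the Spark 2.x accumulator API; the object name, accumulator name, and the isValid predicate are made up for illustration:

import org.apache.spark.SparkContext
import org.apache.spark.util.LongAccumulator

object DroppedRecordsCounter {
  @volatile private var instance: LongAccumulator = null

  // Lazily create (or re-create) the accumulator. After a restart from a
  // checkpoint this runs again on the new SparkContext, so the accumulator is
  // re-registered instead of being restored from the checkpoint data.
  def getInstance(sc: SparkContext): LongAccumulator = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.longAccumulator("droppedRecords")
        }
      }
    }
    instance
  }
}

// Inside the streaming job (dstream is an existing DStream, isValid a placeholder):
// dstream.transform { rdd =>
//   val accum = DroppedRecordsCounter.getInstance(rdd.sparkContext)
//   rdd.filter { rec =>
//     val keep = isValid(rec)
//     if (!keep) accum.add(1)
//     keep
//   }
// }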
Hi,
I am using Spark Streaming 1.3 with checkpointing,
but the job is failing to recover from the checkpoint on restart.
For the broadcast variable it says:
1. WARN TaskSetManager: Lost task 15.0 in stage 7.0 (TID 1269, hostIP):
java.lang.ClassCastException: [B cannot be cast to
pkg.broadcastvariableclas
When using yarn-cluster, it works fine.
On Mon, Jun 29, 2015 at 12:07 PM, ram kumar wrote:
> SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*
> in spark-env.sh
>
> I think i am facing the same issue
> https://issues.apache.org/jira/browse/SPARK-6203
>
>
>
> On Mon, Jun 29, 2015 a
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/*
in spark-env.sh
I think I am facing the same issue
https://issues.apache.org/jira/browse/SPARK-6203
On Mon, Jun 29, 2015 at 11:38 AM, ram kumar wrote:
> I am using Spark 1.2.0.2.2.0.0-82 (git revision de12451) built for Hadoop
Do you have SPARK_CLASSPATH set in both cases, before and after the checkpoint?
If yes, then you should not be using SPARK_CLASSPATH; it has been
deprecated since Spark 1.0 because of its ambiguity.
Also, where do you have spark.executor.extraClassPath set? I don't see it in
the spark-submit command.
On
Hi,
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1));
ssc.checkpoint(checkPointDir);
JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
  public JavaStreamingContext create() {
I'm probably not able to give you too many details right now, but I expect
to have a concrete solution which I could ultimately push as a proposal to
the Spark dev team. I will definitely notify people on this thread at least.
Tnks,
Rod
't make sense.
tnks,
Rod
> recovery. If these lambda functions send events to other systems, these
> events would get resent upon re-computation, causing overall system
> instability.
>
> Hope this helps you understand the problem.
>
> tnks,
> Rod
understand the problem.
tnks,
Rod
code is running for the first time or during a recovery, so I can avoid
updating the database again.
More generally I want to know this in case I'm interacting with external
entities.
The StreamingContext can be recreated from a checkpoint file, indeed.
Check out the following Spark Streaming source files for details:
StreamingContext, Checkpoint, DStream, DStreamCheckpoint, and DStreamGraph.
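For context, the usual getOrCreate pattern that drives that recovery; the path and the toy job inside the factory are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  val checkpointDir = "hdfs:///checkpoints/recoverable-app" // placeholder

  // Called only when no checkpoint exists yet; otherwise the whole context,
  // including the DStream graph, is rebuilt from the checkpoint files.
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("RecoverableApp"), Seconds(30))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999).count().print() // placeholder job
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}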
On Wed, Jul 9, 2014 at 6:11 PM, Yan Fang wrote:
> Hi guys,
>
> I am a little co
a cached RDD on the following operations.
Rod
Hi guys,
I am a little confused by the checkpointing in Spark Streaming. It
checkpoints the intermediate data for the stateful operations, for sure.
Does it also checkpoint the information of the StreamingContext? Because it
seems we can recreate the SC from the checkpoint in a driver node failure
scenario.