Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Jerry Peng
I agree with Jungtaek: -1 from me because of the recently introduced issue
of the Kafka source throwing an error with an incorrect error message.
This may mislead users and cause unnecessary confusion.

On Wed, Jun 8, 2022 at 12:04 AM Jungtaek Lim 
wrote:

> Apologies for the late participation.
>
> I'm sorry, but -1 (non-binding) from me.
>
> Unfortunately, I found a major user-facing issue which seriously hurts UX
> for Kafka data source usage.
>
> In some cases, the Kafka data source can throw IllegalStateException when
> failOnDataLoss=true, a condition bound to the state of the Kafka topic
> (not a Spark issue). With the recent change in Spark,
> IllegalStateException is now bound to "internal error", and Spark gives
> incorrect guidance to end users, telling them that Spark has a bug and
> encouraging them to file a JIRA ticket, which is simply wrong.
>
> Previously, the Kafka data source provided an error message with the
> context of why it failed and how to work around it. I feel this is a
> serious regression in UX.
>
> Please look into https://issues.apache.org/jira/browse/SPARK-39412 for
> more details.
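For readers skimming the thread, the option in question is the Kafka source's failOnDataLoss setting. A minimal sketch of a query that can hit this code path (the broker address and topic name are illustrative placeholders, not taken from the report):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("failOnDataLoss-demo").getOrCreate()

# With failOnDataLoss=true (the default), the query fails when Kafka data
# is lost, e.g. a topic was deleted or offsets aged out of retention.
# SPARK-39412 concerns the *message* attached to that failure, not the
# failure itself.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")                        # placeholder
    .option("failOnDataLoss", "true")
    .load()
)

query = df.writeStream.format("console").start()
```

Running this requires a reachable Kafka broker; it is shown only to anchor the discussion of which failure the error message describes.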
>
>
> On Wed, Jun 8, 2022 at 3:40 PM Hyukjin Kwon  wrote:
>
>> Okay. Thankfully the binary release is fine per
>> https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-build.sh#L268
>> .
>> The source package (and GitHub tag) has 3.3.0.dev0, and the binary
>> package has 3.3.0. Technically this is not a blocker now because the PyPI
>> upload can still be made correctly.
>> I lowered the priority to Critical. I switched my -1 to 0.
>>
>> On Wed, 8 Jun 2022 at 15:17, Hyukjin Kwon  wrote:
>>
>>> Arrrgh  .. I am very sorry that I found this problem late.
>>> RC 5 does not have the correct version of PySpark, see
>>> https://github.com/apache/spark/blob/v3.3.0-rc5/python/pyspark/version.py#L19
>>> I think the release script broke because the version declaration now has
>>> an explicit 'str' type annotation, see
>>> https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-tag.sh#L88
>>> I filed a JIRA at https://issues.apache.org/jira/browse/SPARK-39411
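The failure mode is easy to reproduce in isolation. The release script rewrites version.py with a sed substitution; the patterns below are an illustrative Python reconstruction of the mismatch, not the script's actual expressions:

```python
import re

# version.py used to declare:    __version__ = "3.3.0.dev0"
# and now declares (with type):  __version__: str = "3.3.0.dev0"
line = '__version__: str = "3.3.0.dev0"'

# A pattern anchored on the old form no longer matches, so the version
# string is silently left as the dev version.
old_pattern = re.compile(r'^__version__ = ".*"$')
assert old_pattern.match(line) is None  # no match -> no rewrite happens

# A pattern tolerant of an optional type annotation still matches.
fixed_pattern = re.compile(r'^__version__(?:\s*:\s*str)?\s*=\s*".*"$')
assert fixed_pattern.match(line) is not None
```

This explains why the source tarball still carried 3.3.0.dev0 while the binary packages, built by a different path, got 3.3.0.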
>>>
>>> -1 from me
>>>
>>>
>>>
>>> On Wed, 8 Jun 2022 at 13:16, Cheng Pan  wrote:
>>>
 +1 (non-binding)

 * Verified SPARK-39313 has been addressed[1]
 * Passed integration test w/ Apache Kyuubi (Incubating)[2]

 [1] https://github.com/housepower/spark-clickhouse-connector/pull/123
 [2] https://github.com/apache/incubator-kyuubi/pull/2817

 Thanks,
 Cheng Pan

 On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth 
 wrote:
 >
 > +1 (non-binding)
 >
 > * Verified all checksums.
 > * Verified all signatures.
 > * Built from source, with multiple profiles, to full success, for
 Java 11 and Scala 2.13:
 > * build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver
 -Pkubernetes -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
 > * Tests passed.
 > * Ran several examples successfully:
 > * bin/spark-submit --class org.apache.spark.examples.SparkPi
 examples/jars/spark-examples_2.12-3.3.0.jar
 > * bin/spark-submit --class
 org.apache.spark.examples.sql.hive.SparkHiveExample
 examples/jars/spark-examples_2.12-3.3.0.jar
 > * bin/spark-submit
 examples/src/main/python/streaming/network_wordcount.py localhost 
 > * Tested some of the issues that blocked prior release candidates:
 > * bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT
 true) t(x) UNION SELECT 1 AS a;'
 > * bin/spark-sql -e "select date '2018-11-17' > 1"
 > * SPARK-39293 ArrayAggregate fix
 >
 > Chris Nauroth
 >
 >
 > On Tue, Jun 7, 2022 at 1:30 PM Cheng Su 
 wrote:
 >>
 >> +1 (non-binding). Built and ran some internal tests for Spark SQL.
 >>
 >>
 >>
 >> Thanks,
 >>
 >> Cheng Su
 >>
 >>
 >>
 >> From: L. C. Hsieh 
 >> Date: Tuesday, June 7, 2022 at 1:23 PM
 >> To: dev 
 >> Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
 >>
 >> +1
 >>
 >> Liang-Chi
 >>
 >> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang 
 wrote:
 >> >
 >> > +1 (non-binding)
 >> >
 >> > Gengliang
 >> >
 >> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves <
 tgraves...@gmail.com> wrote:
 >> >>
 >> >> +1
 >> >>
 >> >> Tom Graves
 >> >>
 >> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
 >> >>  wrote:
 >> >> >
 >> >> > Please vote on releasing the following candidate as Apache
 Spark version 3.3.0.
 >> >> >
 >> >> > The vote is open until 11:59pm Pacific time June 8th and passes
 if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >> >> >
 >> >> > [ ] +1 Release this package as Apache Spark 3.3.0
 >> >> > [ ] -1 Do not release this package because ...
 >> >> >
 >> >> > To learn more about Apache Spark, please 

[DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Jerry Peng
Hi all,

I would like to start the discussion for a SPIP, Asynchronous Offset
Management in Structured Streaming.  The high-level summary of the SPIP is
that currently in Structured Streaming we perform a couple of offset
management operations for progress-tracking purposes synchronously on the
critical path, which can contribute significantly to processing latency.  If
we were to make these operations asynchronous and less frequent, we could
dramatically improve latency for certain types of workloads.

I have put together a SPIP to implement such a mechanism.  Please take a
look!

SPIP Jira: https://issues.apache.org/jira/browse/SPARK-39591

SPIP doc:
https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
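For a concrete picture of what the proposal means for users, here is a sketch of how such a mode could be enabled on a query. The option names (asyncProgressTrackingEnabled, asyncProgressTrackingCheckpointIntervalMs) are the ones associated with SPARK-39591 in later Spark releases and should be treated as assumptions at the time of this thread; the feature also restricts which sinks and triggers are supported, so the console sink here is purely illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("async-progress-sketch").getOrCreate()

# Built-in "rate" source: generates rows locally, no external dependencies.
df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    df.writeStream
    .format("console")  # illustrative; check the docs for supported sinks
    # Offsets are committed asynchronously, off the critical path, and less
    # frequently (here every 1000 ms), trading extra reprocessing during
    # failure recovery for lower per-batch latency.
    .option("asyncProgressTrackingEnabled", "true")
    .option("asyncProgressTrackingCheckpointIntervalMs", "1000")
    .option("checkpointLocation", "/tmp/async-sketch")  # placeholder path
    .start()
)
```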


Best,

Jerry


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-22 Thread Jerry Peng
Jungtaek,

Thanks for taking up the role to shepherd this SPIP!  Thank you for also
chiming in with your thoughts concerning the continuous mode!

Best,

Jerry

On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim 
wrote:

> Just FYI, I'm shepherding this SPIP project.
>
> I think the major meta question would be, "why don't we spend effort on
> continuous mode rather than initiating another feature aiming for the
> same workload?". Jerry already updated the doc to answer the question, but
> I can also share my thoughts about it.
>
> I feel like the current "continuous mode" is a niche solution. (That's not
> a criticism: if you have to deal with such a workload but can't rewrite the
> underlying engine from scratch, there are really few options.)
> Since the implementation went with a workaround for something the
> architecture does not support natively, e.g. distributed snapshots, it gets
> quite tricky to maintain and expand. It also requires 3rd
> parties to implement separate source and sink implementations, and I'm
> not sure how many 3rd parties have actually followed so far.
>
> Eventually, "continuous mode" has become an area where no one in the
> active community knows the details or is willing to maintain it. I
> wouldn't say we are confident enough to remove the "experimental" tag,
> although the feature has been shipped for years. It was introduced in
> Spark 2.3, surprisingly enough.
>
> We went back and thought about the approach from scratch. Jerry came up
> with an idea that leverages the existing microbatch execution, hence it is
> relatively stable and does not require 3rd parties to support another
> mode. It adds complexity to microbatch execution, but it's a lot less
> complicated than the existing continuous mode, and definitely far less
> than creating a new record-to-record engine from scratch.
>
> That said, we want to propose and move forward with the new approach.
>
> ps. Eventually we could probably discuss retiring continuous mode, if the
> new approach gets accepted and is considered stable after several minor
> releases. That's just my view.
>
> On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng 
> wrote:
>
>> Hi all,
>>
>> I would like to start the discussion for a SPIP, Asynchronous Offset
>> Management in Structured Streaming.  The high level summary of the SPIP is
>> that currently in Structured Streaming we perform a couple of offset
>> management operations for progress tracking purposes synchronously on the
>> critical path which can contribute significantly to processing latency.  If
>> we were to make these operations asynchronous and less frequent we can
>> dramatically improve latency for certain types of workloads.
>>
>> I have put together a SPIP to implement such a mechanism.  Please take a
>> look!
>>
>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-39591
>>
>> SPIP doc:
>> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
>>
>>
>> Best,
>>
>> Jerry
>>
>


Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-28 Thread Jerry Peng
Hi all,

I will add my two cents.  Improving the microbatch execution engine does
not prevent us from working on or improving the continuous execution
engine in the future.  These are orthogonal issues.  The new mode I am
proposing for the microbatch execution engine intends to lower the latency
of the execution engine that most people use today.  We can view it as an
incremental improvement on the existing engine.  I see the continuous
execution engine as a partially completed rewrite of Spark streaming that
may serve as the "future" engine powering Spark Streaming.  Improving the
"current" engine does not mean we cannot work on a "future" engine; the
two are not mutually exclusive.  I would like to focus the discussion on
the merits of this feature with regard to the current micro-batch
execution engine, not on the future of the continuous execution engine.

Best,

Jerry


On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim 
wrote:

> Hi Mridul,
>
> I'd like to make this clear to avoid any misunderstanding - the decision
> was not led by me. (I'm just one of the engineers on the team, not even
> the TL.) As you can see from the direction, there was an internal
> consensus not to revisit the continuous mode. There are various reasons,
> which I think we know already. You seem to remember that I have raised
> concerns about continuous mode, but that was over 2 years ago. I still
> see no traction around the project. The main reason I abandoned that
> discussion was the promising effort to integrate push-based shuffle into
> continuous mode to achieve shuffle support, but no effort has been made
> since.
>
> The goal of this SPIP is to have an alternative approach for dealing with
> the same workload, given that we no longer have confidence in the success
> of continuous mode. But I also want to make clear that deprecating and
> eventually retiring continuous mode is not a goal of this project. If
> that happens eventually, it would be a side effect. Someone may have
> concerns that we have two different projects aiming for a similar thing,
> but I'd rather see both projects compete. Anyone willing to improve
> continuous mode can start making the effort right now; this SPIP does not
> block it.
>
>
> On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan 
> wrote:
>
>>
>> Hi Jungtaek,
>>
>>   Given that the goal of the SPIP is reducing latency for stateless apps,
>> which should reasonably fit continuous mode's design goals, it feels odd
>> not to support it in the proposal.
>>
>> I know you have raised concerns about continuous mode in the past as well
>> on the dev@ list, and we are further ignoring it in this proposal (and
>> possibly other enhancements in the past few releases).
>>
>> Do you want to revisit the discussion to support it and propose a vote on
>> that? And move it to deprecated status?
>>
>> I am much more comfortable with this SPIP not supporting CM if CM were
>> deprecated.
>>
>> Thoughts ?
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng 
>> wrote:
>>
>>> Jungtaek,
>>>
>>> Thanks for taking up the role to shepherd this SPIP!  Thank you for also
>>> chiming in with your thoughts concerning the continuous mode!
>>>
>>> Best,
>>>
>>> Jerry
>>>
>>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Just FYI, I'm shepherding this SPIP project.
>>>>
>>>> I think the major meta question would be, "why don't we spend effort on
>>>> continuous mode rather than initiating another feature aiming for the
>>>> same workload?". Jerry already updated the doc to answer the question, but
>>>> I can also share my thoughts about it.
>>>>
>>>> I feel like the current "continuous mode" is a niche solution. (It's
>>>> not to blame. If you have to deal with such workload but can't rewrite the
>>>> underlying engine from scratch, then there are really few options.)
>>>> Since the implementation went with a workaround to implement which the
>>>> architecture does not support natively e.g. distributed snapshot, it gets
>>>> quite tricky on maintaining and expanding the project. It also requires 3rd
>>>> parties to implement a separate source and sink implementation, which I'm
>>>> not sure how many 3rd parties actually followed so far.
>>>>
>>>> Eventually, "continuous mode" becomes an area no one in the active
>

Re: [VOTE][RESULT][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-05 Thread Jerry Peng
Thanks Jungtaek for shepherding this effort!

On Sun, Dec 4, 2022 at 6:25 PM Jungtaek Lim 
wrote:

> The vote passes with 7 +1s (5 binding +1s).
> Thanks to all who reviewed the SPIP doc and voted!
>
> (* = binding)
> +1:
> - Jungtaek Lim
> - Xingbo Jiang
> - Mridul Muralidharan (*)
> - Hyukjin Kwon (*)
> - Shixiong Zhu (*)
> - Wenchen Fan (*)
> - Dongjoon Hyun (*)
>
> +0: None
>
> -1: None
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: [DISCUSS] Deprecate DStream in 3.4

2023-01-13 Thread Jerry Peng
+1 in general for marking the DStreams API as deprecated

Jungtaek, can you please elaborate on the concrete actions you
intend to take for the deprecation process?

Best,

Jerry

On Thu, Jan 12, 2023 at 11:16 PM L. C. Hsieh  wrote:

> +1
>
> On Thu, Jan 12, 2023 at 10:39 PM Jungtaek Lim
>  wrote:
> >
> > Yes, exactly. I'm sorry for the confusion - I should have clarified the
> action items in the proposal.
> >
> > On Fri, Jan 13, 2023 at 3:31 PM Dongjoon Hyun 
> wrote:
> >>
> >> Then, could you elaborate on `the proposed code change` specifically?
> >> Maybe the usual deprecation warning logs and an annotation on the API?
> >>
> >>
> >> On Thu, Jan 12, 2023 at 10:05 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >>>
> >>> Maybe I need to clarify - my proposal is "explicitly" deprecating it,
> which incurs a code change for sure. Guidance on the Spark website is
> already done, as I mentioned - we updated the DStream doc page to mention
> that DStream is a "legacy" project and users should move to SS. I don't
> feel this is sufficient to deter users from using it, hence this proposal.
> >>>
> >>> Sorry for the confusion. I just wanted to make sure the goal of the
> proposal is not "removing" the API. Discussions on removing APIs tend not
> to go well, so I wanted to make sure I don't mean that.
> >>>
> >>> On Fri, Jan 13, 2023 at 2:46 PM Dongjoon Hyun 
> wrote:
> 
>  +1 for the proposal (guiding only without any code change).
> 
>  Thanks,
>  Dongjoon.
> 
>  On Thu, Jan 12, 2023 at 9:33 PM Shixiong Zhu 
> wrote:
> >
> > +1
> >
> >
> > On Thu, Jan 12, 2023 at 5:08 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
> >>
> >> +1
> >>
> >> On Thu, Jan 12, 2023 at 7:46 PM Hyukjin Kwon 
> wrote:
> >>>
> >>> +1
> >>>
> >>> On Fri, 13 Jan 2023 at 08:51, Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> 
>  bump for more visibility.
> 
>  On Wed, Jan 11, 2023 at 12:20 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >
> > Hi dev,
> >
> > I'd like to propose the deprecation of DStream in Spark 3.4, in
> favor of promoting Structured Streaming.
> > (Sorry for the late proposal, if we don't make the change in
> 3.4, we will have to wait for another 6 months.)
> >
> > We have been focusing on Structured Streaming for years (across
> multiple major and minor versions), and during that time we haven't made
> any improvements to DStream. Furthermore, we recently updated the DStream
> doc to explicitly say DStream is a legacy project.
> >
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#note
> >
> > The baseline for deprecation is that we don't see a particular
> use case which only DStream solves. This is a different story from GraphX
> and MLlib, as we don't have replacements for those.
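As a concrete illustration of the "replacement exists" point, the canonical socket word count expressed in both APIs; the host and port are placeholders, and the two halves should be run separately (a single process holds one SparkContext):

```python
# Legacy DStream API (the one proposed for deprecation):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches
counts = (
    ssc.socketTextStream("localhost", 9999)  # placeholder host/port
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
counts.pprint()
ssc.start()

# Structured Streaming equivalent (the recommended replacement):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("ss-wordcount").getOrCreate()
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999)  # placeholder
    .load()
)
word_counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)
query = word_counts.writeStream.outputMode("complete").format("console").start()
```

Both programs compute the same running word counts, which is the sense in which DStream has no unique remaining use case.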
> >
> > The proposal does not mean we will remove the API soon, as the
> Spark project has a long-standing practice of deprecating public APIs
> well before removal. I don't intend to propose a target version for
> removal. The goal is to guide users to refrain from building new
> workloads with DStream. We might want to go further in the future, but
> that would require a new discussion thread at that time.
> >
> > What do you think?
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Enhanced Console Sink for Structured Streaming

2024-02-08 Thread Jerry Peng
I am generally a +1 on this, as we can use this information in our docs to
demonstrate certain concepts to potential users.

I am in agreement with other reviewers that we should keep the existing
default behavior of the console sink.  This new style of output should be
enabled behind a flag.

As for the output of this "new mode" in the console sink, can we be more
explicit about what is the actual output and what is the metadata?  It is
not clear from the logged output.

On Tue, Feb 6, 2024 at 11:08 AM Neil Ramaswamy
 wrote:

> Jungtaek and Raghu, thanks for the input. I'm happy with the verbose mode
> being off by default.
>
> I think it's reasonable to have 1 or 2 levels of verbosity:
>
>1. The first verbose mode could target new users, and take a highly
>opinionated view on what's important to understand streaming semantics.
>This would include printing the sink rows, watermark, number of dropped
>rows (if any), and state data. For state data, we should print for all
>state stores (for multiple stateful operators), but for joins, I think
>rendering just the KeyWithIndexToValueStore(s) is reasonable. Timestamps
>would render as durations (see original message) to make small examples
>easy to understand.
>2. The second verbose mode could target more advanced users trying to
>create a reproduction. In addition to the first verbose mode, it would also
>print the other join state store, the number of evicted rows due to the
>watermark, and print timestamps as extended ISO 8601 strings (same as
>today).
>
> Rather than implementing both, I would prefer to implement the first
> level, and evaluate later if the second would be useful.
>
> Mich, can you elaborate on why you don't think it's useful? To reiterate,
> this proposal is to bring to light certain metrics/values that are
> essential for understanding SS micro-batching semantics. It's to help users
> go from 0 to 1, not 1 to 100. (And the Spark UI can't be the place for
> rendering sink data or state store values—there should be no sensitive user
> data there.)
>
> On Mon, Feb 5, 2024 at 11:32 PM Mich Talebzadeh 
> wrote:
>
>> I don't think adding this to the streaming flow (at the micro-batch
>> level) will be that useful.
>>
>> However, this can be added to Spark UI as an enhancement to the Streaming
>> Query Statistics page.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> On Tue, 6 Feb 2024 at 03:49, Raghu Angadi 
>> wrote:
>>
>>> Agree, the default behavior does not need to change.
>>>
>>> Neil, how about separating it into two sections:
>>>
>>>- Actual rows in the sink (same as current output)
>>>- Followed by the metadata
>>>
>>>


Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
>>>>>>>>>
>>>>>>>>> The proposal is precisely about enabling Spark to power
>>>>>>>>> applications where, as I define it, the *timeliness* of the
>>>>>>>>> answer is as critical as its *correctness*. Spark's current
>>>>>>>>> streaming engine, primarily operating on micro-batches, often delivers
>>>>>>>>> results that are technically "correct" but arrive too late to be truly
>>>>>>>>> useful for certain high-stakes, real-time scenarios. This makes them
>>>>>>>>> "useless or wrong" in a practical, business-critical sense.
>>>>>>>>>
>>>>>>>>> For example, *in real-time fraud detection* and in *high-frequency
>>>>>>>>> trading*, market data or trade execution commands must be
>>>>>>>>> delivered with minimal latency. Even a slight delay can mean missed
>>>>>>>>> opportunities or significant financial losses, making a "correct"
>>>>>>>>> price update useless if it's not instantaneous, and unsuitable for
>>>>>>>>> these demanding use cases, where a "late but correct" answer is
>>>>>>>>> simply not good enough. As a corollary, it is a fundamental
>>>>>>>>> concept, so it has to be treated as such, not as a comment, in the
>>>>>>>>> SPIP.
>>>>>>>>>
>>>>>>>>> Hope this clarifies the connection in practical terms
>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>>> GDPR
>>>>>>>>>
>>>>>>>>>view my Linkedin profile
>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Mich,
>>>>>>>>>>
>>>>>>>>>> Sorry, I may be missing something here but what does your
>>>>>>>>>> definition here have to do with the SPIP?   Perhaps add comments 
>>>>>>>>>> directly
>>>>>>>>>> to the SPIP to provide context as the code snippet below is a direct 
>>>>>>>>>> copy
>>>>>>>>>> from the SPIP itself.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Denny
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> just to add
>>>>>>>>>>>
>>>>>>>>>>> A stronger definition of real time: the engineering definition
>>>>>>>>>>> of real time is roughly "fast enough to be interactive".
>>>>>>>>>>>
>>>>>>>>>>> However, I use a stronger definition. In a real-time application,
>>>>>>>>>>> there is no such thing as an answer that is late and correct. The
>>>>>>>>>>> timeliness is part of the application. If I get the right answer
>>>>>>>>>>> too slowly, it becomes useless or wrong.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>>>>> GDPR
>>>>>>>>>>>
>>>>>>>

[DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-27 Thread Jerry Peng
Hi all,

I want to start a discussion thread for the SPIP titled “Real-Time Mode in
Apache Spark Structured Streaming” that I've been working on with Siying
Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael Armbrust: [JIRA] [Doc].

The SPIP proposes a new execution mode called “Real-time Mode” in Spark
Structured Streaming that significantly lowers end-to-end latency for
processing streams of data.

A key principle of this proposal is compatibility. Our goal is to make
Spark capable of handling streaming jobs that need results almost
immediately (within O(100) milliseconds). We want to achieve this without
changing the high-level DataFrame/Dataset API that users already use – so
existing streaming queries can run in this new ultra-low-latency mode by
simply turning it on, without rewriting their logic.

In short, we’re trying to enable Spark to power real-time applications
(like instant anomaly alerts or live personalization) that today cannot
meet their latency requirements with Spark’s current streaming engine.
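To make "simply turning it on" concrete under the stated compatibility goal: the query definition stays untouched and only the execution mode changes. The trigger shown below is a placeholder; the SPIP had not finalized an API name at the time of this thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("realtime-mode-sketch").getOrCreate()

# The DataFrame logic is an ordinary streaming query; per the SPIP, no
# rewrite is needed to run it in the proposed mode.
df = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (
    df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/rt-sketch")  # placeholder path
    # .trigger(...)  # placeholder: the switch for real-time mode was not
    #                # yet specified; only the mode toggles, not the logic
    .start()
)
```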

We'd greatly appreciate your feedback, thoughts, and suggestions on this
approach!


Re: [VOTE] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-06-04 Thread Jerry Peng
Thank you all!  Glad to see this much interest and support for this
initiative!

On Wed, Jun 4, 2025 at 1:27 PM L. C. Hsieh  wrote:

> Hi all,
>
> Thanks all for participating and your support! The vote has passed.
> I'll send out the result in a separate thread.
>
> On Mon, Jun 2, 2025 at 7:53 PM Wenchen Fan  wrote:
> >
> > +1
> >
> > On Tue, Jun 3, 2025 at 10:16 AM bo yang  wrote:
> >>
> >> +1 (non-binding)
> >>
> >> On Mon, Jun 2, 2025 at 7:13 PM Reynold Xin 
> wrote:
> >>>
> >>> +1
> >>>
> >>> On Mon, Jun 2, 2025 at 7:10 PM Kent Yao  wrote:
> 
>  +1
> 
>  Sandy Ryza  于2025年6月2日周一 23:00写道:
> >
> > +1 (non-binding)
> >
> > On Mon, Jun 2, 2025 at 7:34 AM Chao Sun  wrote:
> >>
> >> +1
> >>
> >> On Mon, Jun 2, 2025 at 7:31 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> On Mon, Jun 2, 2025 at 11:09 PM Wenchen Fan 
> wrote:
> 
>  +1
> 
>  On Mon, Jun 2, 2025 at 8:55 PM Peter Toth 
> wrote:
> >
> > +1
> >
> > On Mon, Jun 2, 2025 at 2:33 PM xianjin 
> wrote:
> >>
> >> +1.
> >> Sent from my iPhone
> >>
> >> On Jun 2, 2025, at 12:50 PM, DB Tsai  wrote:
> >>
> >> +1 looking forward to seeing real-time mode.
> >> Sent from my iPhone
> >>
> >> On Jun 1, 2025, at 9:47 PM, Xiao Li 
> wrote:
> >>
> >> 
> >> +1
> >>
> >> huaxin gao  于2025年6月1日周日 20:00写道:
> >>>
> >>> +1
> >>>
> >>> On Sun, Jun 1, 2025 at 7:50 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
> 
>  +1 (binding)
>  super excited about this!
> 
>  On Sun, Jun 1, 2025 at 10:45 PM Yuanjian Li <
> xyliyuanj...@gmail.com> wrote:
> >
> > +1
> >
> > On Sun, Jun 1, 2025 at 19:00 Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
> >>
> >> +1
> >>
> >> Dongjoon
> >>
> >>
> >> On Sun, Jun 1, 2025 at 12:02 L. C. Hsieh 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I would like to start a vote on the new real-time mode in
> Apache Spark
> >>> Structured Streaming.
> >>>
> >>> Discussion thread:
> >>>
> https://lists.apache.org/thread/ovmfbzfkc3t9odvv5gs75fhpvdckn90f
> >>> SPIP:
> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?tab=t.0#heading=h.ulas5788cm9t
> >>> JIRA: https://issues.apache.org/jira/browse/SPARK-52330
> >>>
> >>> Please vote on the SPIP for the next 72 hours:
> >>>
> >>> [ ] +1: Accept the proposal as an official SPIP
> >>> [ ] +0
> >>> [ ] -1: I don’t think this is a good idea because …
> >>>
> >>>
> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark,

As an example of my point, if you go to the Apache Storm (another stream
processing engine) website:

https://storm.apache.org/

It describes Storm as:

"Apache Storm is a free and open source distributed *realtime* computation
system"

If you go to Apache Flink:

https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/

"Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"

Thus, what the term "real-time" implies in this space should not be
confusing for folks in this area.

On Thu, May 29, 2025 at 10:22 PM Jerry Peng 
wrote:

> Mich,
>
> If I understood your last email correctly, I think you also wanted to have
> a discussion about naming?  Why are we calling this new execution mode
> described in the SPIP "Real-time Mode"?  Here are my two cents.  Firstly,
> "continuous mode" is taken, and we want another name to describe an
> execution mode that provides ultra-low-latency processing.  We could have
> called it "low latency mode", though I don't really like that naming since
> it implies the other execution modes are not low latency, which I don't
> believe is true.  This new proposed mode can simply deliver even lower
> latency.  Thus, we came up with the name "Real-time Mode".  Of course, we
> are talking about "soft" real-time here.  I think when we are talking about
> distributed stream processing systems in the space of big data analytics,
> it is reasonable to assume anything described in this space as "real-time"
> implies "soft" real-time.  Though if this is confusing or misleading, we
> can provide clear documentation on what "real-time" in real-time mode means
> and what it guarantees.  Just my thoughts.  I would love to hear other
> perspectives.
>
> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh 
> wrote:
>
>> I think, from what I have seen, there are a good number of +1 responses
>> as opposed to quantitative discussion (based on my observations only).
>> Given the objectives of the thread, we ought to focus on what is meant by
>> real-time compared to continuous modes. To be fair, it is a common point
>> of confusion, and the terms are often used interchangeably in general
>> conversation, but in technical contexts, especially with streaming data
>> platforms, they have specific and important differences.
>>
>> "Continuous Mode" refers to a processing strategy that aims for true,
>> uninterrupted, sub-millisecond-latency processing. Chiefly:
>>
>>- Event-at-a-time (or very small batch groups): the system processes
>>individual events, or extremely small groups of events (micro-batches),
>>as they flow through the pipeline.
>>- Minimal latency: the primary goal is to achieve the absolute lowest
>>possible end-to-end latency, often on the order of milliseconds or even
>>below.
>>- Most business use cases (say financial markets) can live with this
>>as they do not rely on rdges
>>
>> Now what is meant by "Real-time Mode"
>>
>> This is where the nuance comes in. "Real-time" is a broader and sometimes
>> more subjective term. When the text introduces "Real-time Mode" as distinct
>> from "Continuous Mode," it suggests a specific implementation that achieves
>> real-time characteristics but might do so differently or more robustly than
>> a "continuous" mode attempt. Going back to my earlier mention, in real time
>> application , there is nothing as an answer which is supposed to be late
>> and correct. The timeliness is part of the application. if I get the
>> right answer too slowly it becomes useless or wrong. What I call the "Late
>> and Correct is Useless" Principle
>>
>> In summary, "Real-time Mode" seems to describe an approach that delivers
>> low-latency processing with high reliability and ease of use, leveraging
>> established, battle-tested components. I invite the audience to have a
>> discussion on this.
>>
>> HTH
>>
>> Dr Mich Talebzadeh,
>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>>
>>
>> On Thu, 29 May 2025 at 19:15, Yang Jie  wrote:
>>
>>> +1
>>>
>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>> > +1
>>> >
>>> > Yuming Wang  于2025年5月29日周四 02:22写道:
>>> >
>>> > > +1.
>>> > >
>>> > > On Th

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
t;>>>>
>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>> > >>>>>>>
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> My point about "in real time application or data, there is
>> nothing
>> > >>>>>>>> as an answer which is supposed to be late and correct. The
>> timeliness is
>> > >>>>>>>> part of the application. if I get the right answer too slowly
>> it becomes
>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we need this
>> > >>>>>>>> Spark Structured Streaming proposal.
>> > >>>>>>>>
>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>> > >>>>>>>> applications where, as I define it, the *timeliness* of the
>> answer
>> > >>>>>>>> is as critical as its *correctness*. Spark's current streaming
>> > >>>>>>>> engine, primarily operating on micro-batches, often delivers
>> results that
>> > >>>>>>>> are technically "correct" but arrive too late to be truly
>> useful for
>> > >>>>>>>> certain high-stakes, real-time scenarios. This makes them
>> "useless or
>> > >>>>>>>> wrong" in a practical, business-critical sense.
>> > >>>>>>>>
>> > >>>>>>>> For example, *in real-time fraud detection* and in
>> > >>>>>>>> *high-frequency trading,* market data or trade execution
>> > >>>>>>>> commands must be delivered with minimal latency. Even a slight
>> > >>>>>>>> delay can mean missed opportunities or significant financial
>> > >>>>>>>> losses, making a "correct" price update useless if it's not
>> > >>>>>>>> instantaneous. This is about making Spark suitable for these
>> > >>>>>>>> demanding use cases, where a "late but correct" answer is simply
>> > >>>>>>>> not good enough. As a corollary, it is a fundamental concept, so
>> > >>>>>>>> it has to be treated as such, not as a comment in the SPIP.
>> > >>>>>>>>
>> > >>>>>>>> Hope this clarifies the connection in practical terms
>> > >>>>>>>> Dr Mich Talebzadeh,
>> > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis
>> |
>> > >>>>>>>> GDPR
>> > >>>>>>>>
>> > >>>>>>>>view my Linkedin profile
>> > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee > >
>> > >>>>>>>> wrote:
>> > >>>>>>>>
>> > >>>>>>>>> Hey Mich,
>> > >>>>>>>>>
>> > >>>>>>>>> Sorry, I may be missing something here but what does your
>> > >>>>>>>>> definition here have to do with the SPIP?   Perhaps add
>> comments directly
>> > >>>>>>>>> to the SPIP to provide context as the code snippet below is a
>> direct copy
>> > >>>>>>>>> from the SPIP itself.
>> > >>>>>>>>>
>> > >>>>>>>>> Thanks,
>> > >>>>>>>>> Denny
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>> > >>>>>>>>> mich.talebza...@gmail.com> w

Re: [DISCUSS][MINOR] Fix broken link in spark-website for SS Programming Guide

2025-05-30 Thread Jerry Peng
+1 for fixing this immediately.

Anish, thanks for pointing this issue out!

On Fri, May 30, 2025 at 12:12 AM Jungtaek Lim 
wrote:

> I’m +1 to fix this in website for 4.0.0 immediately.
>
> I got some inputs about this and they were unable to figure out the
> correct page url. I’m mostly sure it will happen to many users as well.
>
> We could also fix this in the next maintenance release, but since we just
> released Apache Spark 4.0.0, it doesn’t seem to be ideal to have another
> release just because of this.
>
> On Fri, May 30, 2025 at 3:37 PM, Anish Shrigondekar
> wrote:
>
>> Hi,
>>
>> We have a broken link for the latest docs for the 4.0 release.
>>
>> This page:
>> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>> has a hyperlink that points to the contents of the Structured Streaming
>> guide. But it seems this link is broken and points back to the main
>> streaming page here: https://spark.apache.org/streaming/. We have fixed
>> this in the Spark code base, but I think we can also fix it in the
>> spark-website repo instead of waiting for the next release. From a couple
>> other places, the link is intact and points to the correct link for the SS
>> programming guide here:
>> https://spark.apache.org/docs/latest/streaming/index.html
>>
>> Please let us know your thoughts about this and if we are ok making this
>> fix on spark-website.
>>
>> Thanks,
>> Anish
>>
>


Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
P?   Perhaps add
>> comments directly
>> > >>>>>>>>> to the SPIP to provide context as the code snippet below is a
>> direct copy
>> > >>>>>>>>> from the SPIP itself.
>> > >>>>>>>>>
>> > >>>>>>>>> Thanks,
>> > >>>>>>>>> Denny
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>> > >>>>>>>>>
>> > >>>>>>>>>> just to add
>> > >>>>>>>>>>
>> > >>>>>>>>>> A stronger definition of real time. The engineering
>> definition of
>> > >>>>>>>>>> real time is roughly fast enough to be interactive
>> > >>>>>>>>>>
>> > >>>>>>>>>> However, I put a stronger definition. In real time
>> application or
>> > >>>>>>>>>> data, there is nothing as an answer which is supposed to be
>> late and
>> > >>>>>>>>>> correct. The timeliness is part of the application. If I get
>> the right
>> > >>>>>>>>>> answer too slowly it becomes useless or wrong
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Dr Mich Talebzadeh,
>> > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>> Analysis |
>> > >>>>>>>>>> GDPR
>> > >>>>>>>>>>
>> > >>>>>>>>>>view my Linkedin profile
>> > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>> > >>>>>>>>>>
>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If
>> you
>> > >>>>>>>>>>> are going to reduce micro-batching, this reduction must be
>> balanced against
>> > >>>>>>>>>>> the available processing capacity of the cluster to prevent
>> back pressure
>> > >>>>>>>>>>> and instability. In the case of Continuous Processing mode,
>> a
>> > >>>>>>>>>>> specific continuous trigger with a desired checkpoint
>> interval quote
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> "
>> > >>>>>>>>>>> df.writeStream
>> > >>>>>>>>>>>.format("...")
>> > >>>>>>>>>>>.option("...")
>> > >>>>>>>>>>>.trigger(Trigger.RealTime(“300 Seconds”))// new
>> trigger
>> > >>>>>>>>>>> type to enable real-time Mode
>> > >>>>>>>>>>>.start()
>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in
>> the
>> > >>>>>>>>>>> new ultra low-latency execution mode.  A time interval can
>> also be
>> > >>>>>>>>>>> specified, e.g. “300 Seconds”, to indicate how long each
>> micro-batch should
>> > >>>>>>>>>>> run for.
>> > >>>>>>>>>>> "
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> will inevitably depend on many factors. Not that simple
>> > >>>>>>>>>>> HTH
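The back-pressure point above can be illustrated with a toy queueing sketch (pure Python with made-up numbers, not Spark code): if events arrive faster than one trigger interval's worth of work can be processed, the backlog, and hence end-to-end latency, grows without bound no matter how small the trigger interval is made.

```python
def simulate_backlog(arrival_rate: float, service_rate: float,
                     interval_s: float, n_intervals: int) -> list[float]:
    """Toy micro-batch model: each interval, arrival_rate * interval_s
    events arrive and at most service_rate * interval_s are processed.
    Returns the unprocessed backlog after each interval."""
    backlog = 0.0
    history = []
    for _ in range(n_intervals):
        backlog += arrival_rate * interval_s                     # new events
        backlog = max(0.0, backlog - service_rate * interval_s)  # processed
        history.append(backlog)
    return history

# Stable: the cluster keeps up (1000 ev/s in, 1200 ev/s capacity).
stable = simulate_backlog(1000, 1200, interval_s=0.3, n_intervals=10)
# Unstable: back pressure (1500 ev/s in, 1200 ev/s capacity) —
# shrinking the trigger interval cannot fix this.
unstable = simulate_backlog(1500, 1200, interval_s=0.3, n_intervals=10)

print(round(stable[-1], 6))    # 0.0 — backlog drains every interval
print(round(unstable[-1], 6))  # 900.0 — grows ~90 events per interval
```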
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Dr Mich Talebzadeh,
>> > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic
>> Analysis |
>> > >>>>>>>>>>> GDPR
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>view my Linkedin profile
>> > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> Hi all,
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled
>> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that
>> I've been
>> > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun,
>> Jungtaek Lim, and
>> > >>>>>>>>>>>> Michael Armbrust: [JIRA
>> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc
>> > >>>>>>>>>>>> <
>> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing
>> >
>> > >>>>>>>>>>>> ].
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time
>> Mode”
>> > >>>>>>>>>>>> in Spark Structured Streaming that significantly lowers
>> end-to-end latency
>> > >>>>>>>>>>>> for processing streams of data.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our
>> goal is
>> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs that need
>> results almost
>> > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want to
>> achieve this without
>> > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that users
>> already use – so
>> > >>>>>>>>>>>> existing streaming queries can run in this new
>> ultra-low-latency mode by
>> > >>>>>>>>>>>> simply turning it on, without rewriting their logic.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time
>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>> personalization) that
>> > >>>>>>>>>>>> today cannot meet their latency requirements with Spark’s
>> current streaming
>> > >>>>>>>>>>>> engine.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>> > >>>>>>>>>>>> suggestions on this approach!
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>
>> > >>>>>>
>> > >>>>>> --
>> > >>>>>> Best,
>> > >>>>>> Yanbo
>> > >>>>>>
>> > >>>>>
>> > >>
>> > >> --
>> > >> John Zhuge
>> > >>
>> > >>
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark,

I thought we were simply discussing the naming of the mode. Like I
mentioned, if you think simply calling this mode "real-time" mode may cause
confusion because "real-time" can mean other things in other fields, I can
clarify what we mean by "real-time" explicitly in the SPIP document and any
future documentation. That is not a problem and thank you for your feedback.

On Thu, May 29, 2025 at 10:37 PM Mark Hamstra  wrote:

> Referencing other misuse of "real-time" is not persuasive. A SPIP is an
> engineering document, not a marketing document. Technical clarity and
> accuracy should be non-negotiable.
>
>
> On Thu, May 29, 2025 at 10:27 PM Jerry Peng 
> wrote:
>
>> Mark,
>>
>> As an example of my point, if you go to the Apache Storm (another stream
>> processing engine) website:
>>
>> https://storm.apache.org/
>>
>> It describes Storm as:
>>
>> "Apache Storm is a free and open source distributed *realtime*
>> computation system"
>>
>> If you go to Apache Flink:
>>
>>
>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>
>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>
>> Thus, what the term "real-time" implies in this space should not be
>> confusing for folks in this area.
>>
>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng 
>> wrote:
>>
>>> Mich,
>>>
>>> If I understood your last email correctly, I think you also wanted to
>>> have a discussion about naming?  Why are we calling this new execution mode
>>> described in the SPIP "Real-time Mode"?  Here are my two cents.  Firstly,
>>> "continuous mode" is taken and we want another name to describe an
>>> execution mode that provides ultra low latency processing.  We could have
>>> called it "low latency mode", though I don't really like that naming since
>>> it implies the other execution modes are not low latency which I don't
>>> believe is true.  This new proposed mode can simply deliver even lower
>>> latency.  Thus, we came up with the name "Real-time Mode".  Of course, we
>>> are talking about "soft" real-time here.  I think when we are talking about
>>> distributed stream processing systems in the space of big data analytics,
>>> it is reasonable to assume anything described in this space as "real-time"
>>> implies "soft" real-time.  Though if this is confusing or misleading, we
>>> can provide clear documentation on what "real-time" in real-time mode means
>>> and what it guarantees.  Just my thoughts.  I would love to hear other
>>> perspectives.
>>>
>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I think from what I have seen there are a good number of +1 responses
>>>> as opposed to quantitative discussions (based on my observations only).
>>>> Given the objectives of the thread, we ought to focus on what is meant by
>>>> real time compared to continuous modes. To be fair, it is a common point
>>>> of confusion, and the terms are often used interchangeably in general
>>>> conversation, but in technical contexts, especially with streaming data
>>>> platforms, they have specific and important differences.
>>>>
>>>> "Continuous Mode" refers to a processing strategy that aims for true,
>>>> uninterrupted, sub-millisecond latency processing.  Chiefly
>>>>
>>>>- Event-at-a-Time (or very small batch groups): The system
>>>>processes individual events or extremely small groups of events
>>>>(micro-batches) as they flow through the pipeline.
>>>>- Minimal Latency: The primary goal is to achieve the absolute
>>>>lowest possible end-to-end latency, often on the order of milliseconds
>>>>or even below
>>>>- Most business use cases (say financial markets) can live with
>>>>this as they do not rely on such edges
>>>>
>>>> Now what is meant by "Real-time Mode"
>>>>
>>>> This is where the nuance comes in. "Real-time" is a broader and
>>>> sometimes more subjective term. When the text introduces "Real-time Mode"
>>>> as distinct from "Continuous Mode," it suggests a specific implementation
>>>> that a

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Jerry Peng
Mich,

Sounds good.  I will add the clarification to the SPIP.

On Fri, May 30, 2025 at 3:47 AM Mich Talebzadeh 
wrote:

> Hi Jerry,
>
> In essence, these definitions (hard or soft) help clarify that "real-time"
> is *not a single, monolithic concept here,* but rather a spectrum defined
> by the criticality of timeliness and the systems under consideration. Common
> data processing solutions branded as "real-time" typically operate on
> the softer end of this spectrum, providing performance crucial for
> applications (for example within SLAs) where delays
> are undesirable but not showstoppers.
>
> I therefore suggest the SPIP mention this explicitly, so we can
> move on.
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
>
>
> On Fri, 30 May 2025 at 07:57, Jerry Peng 
> wrote:
>
>> Mark,
>>
>> For real-time systems there is a concept of "soft" real-time and "hard"
>> real-time systems.  These concepts exist in textbooks.  Here is a document
>> by intel that explains it:
>>
>>
>> https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html
>>
>> "In a soft real-time system, computers or equipment will continue to
>> function after a missed deadline but may produce a lower-quality output.
>> For example, latency in online video games can impact player interactions,
>> but otherwise present no serious consequences."
>>
>> "Hard real-time systems have zero delay tolerance, and delayed signals
>> can result in total failure or present immediate danger to users. Flight
>> control systems and pacemakers are both examples where timeliness is not
>> only essential but the lack of it can result in a life-or-death situation."
>>
>> I don't think it is inaccurate or misleading to call this mode
>> real-time.  It is soft real-time.
>>
>> On Thu, May 29, 2025 at 11:44 PM Mark Hamstra 
>> wrote:
>>
>>> Clarifying what is meant by "real-time" and explicitly differentiating
>>> it from actual real-time computing should be a bare minimum. I still don't
>>> like the use of marketing-speak "real-time" that isn't really real-time in
>>> engineering documents or API namespaces.
>>>
>>> On Thu, May 29, 2025 at 10:43 PM Jerry Peng 
>>> wrote:
>>>
>>>> Mark,
>>>>
>>>> I thought we are simply discussing the naming of the mode?  Like I
>>>> mentioned, if you think simply calling this mode "real-time" mode may cause
>>>> confusion because "real-time" can mean other things in other fields, I can
>>>> clarify what we mean by "real-time" explicitly in the SPIP document and any
>>>> future documentation. That is not a problem and thank you for your 
>>>> feedback.
>>>>
>>>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra 
>>>> wrote:
>>>>
>>>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is
>>>>> an engineering document, not a marketing document. Technical clarity and
>>>>> accuracy should be non-negotiable.
>>>>>
>>>>>
>>>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <
>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> As an example of my point, if you go to the Apache Storm (another
>>>>>> stream processing engine) website:
>>>>>>
>>>>>> https://storm.apache.org/
>>>>>>
>>>>>> It describes Storm as:
>>>>>>
>>>>>> "Apache Storm is a free and open source distributed *realtime*
>>>>>> computation system"
>>>>>>
>>>>>> If you go to Apache Flink:
>>>>>>
>>>>>>
>>>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>>>>
>>>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>>>>
>>>>>> Thus, what the term "real-time" implies in this space should not be
>>>>>> confusing for folks in this area.
>>>>>>
>>>>>> On Th

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark,

For real-time systems there is a concept of "soft" real-time and "hard"
real-time systems.  These concepts exist in textbooks.  Here is a document
by intel that explains it:

https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html

"In a soft real-time system, computers or equipment will continue to
function after a missed deadline but may produce a lower-quality output.
For example, latency in online video games can impact player interactions,
but otherwise present no serious consequences."

"Hard real-time systems have zero delay tolerance, and delayed signals can
result in total failure or present immediate danger to users. Flight
control systems and pacemakers are both examples where timeliness is not
only essential but the lack of it can result in a life-or-death situation."

I don't think it is inaccurate or misleading to call this mode real-time.
It is soft real-time.
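The soft/hard distinction quoted above can be expressed as two deadline-miss policies (an illustrative sketch in plain Python, not anything from Spark or the SPIP): a hard real-time system treats a missed deadline as total failure, while a soft real-time system keeps functioning and degrades output quality instead.

```python
def hard_real_time(result: str, latency_ms: float, deadline_ms: float) -> str:
    """Hard real-time: zero delay tolerance — a missed deadline is failure."""
    if latency_ms > deadline_ms:
        raise RuntimeError("deadline missed: total failure")
    return result

def soft_real_time(result: str, latency_ms: float, deadline_ms: float) -> str:
    """Soft real-time: the system keeps functioning after a missed
    deadline, but the output is lower quality (here, marked stale)."""
    if latency_ms > deadline_ms:
        return f"{result} (stale)"
    return result

print(soft_real_time("frame", 120, 100))  # frame (stale)
print(hard_real_time("frame", 80, 100))   # frame
```

Streaming engines branded "real-time" sit in the first camp: a late result is degraded, not catastrophic.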

On Thu, May 29, 2025 at 11:44 PM Mark Hamstra  wrote:

> Clarifying what is meant by "real-time" and explicitly differentiating it
> from actual real-time computing should be a bare minimum. I still don't
> like the use of marketing-speak "real-time" that isn't really real-time in
> engineering documents or API namespaces.
>
> On Thu, May 29, 2025 at 10:43 PM Jerry Peng 
> wrote:
>
>> Mark,
>>
>> I thought we are simply discussing the naming of the mode?  Like I
>> mentioned, if you think simply calling this mode "real-time" mode may cause
>> confusion because "real-time" can mean other things in other fields, I can
>> clarify what we mean by "real-time" explicitly in the SPIP document and any
>> future documentation. That is not a problem and thank you for your feedback.
>>
>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra 
>> wrote:
>>
>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is an
>>> engineering document, not a marketing document. Technical clarity and
>>> accuracy should be non-negotiable.
>>>
>>>
>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng 
>>> wrote:
>>>
>>>> Mark,
>>>>
>>>> As an example of my point, if you go to the Apache Storm (another
>>>> stream processing engine) website:
>>>>
>>>> https://storm.apache.org/
>>>>
>>>> It describes Storm as:
>>>>
>>>> "Apache Storm is a free and open source distributed *realtime*
>>>> computation system"
>>>>
>>>> If you go to Apache Flink:
>>>>
>>>>
>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>>
>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>>
>>>> Thus, what the term "real-time" implies in this space should not be
>>>> confusing for folks in this area.
>>>>
>>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <
>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>
>>>>> Mich,
>>>>>
>>>>> If I understood your last email correctly, I think you also wanted to
>>>>> have a discussion about naming?  Why are we calling this new execution 
>>>>> mode
>>>>> described in the SPIP "Real-time Mode"?  Here are my two cents.  Firstly,
>>>>> "continuous mode" is taken and we want another name to describe an
>>>>> execution mode that provides ultra low latency processing.  We could have
>>>>> called it "low latency mode", though I don't really like that naming since
>>>>> it implies the other execution modes are not low latency which I don't
>>>>> believe is true.  This new proposed mode can simply deliver even lower
>>>>> latency.  Thus, we came up with the name "Real-time Mode".  Of course, we
>>>>> are talking about "soft" real-time here.  I think when we are talking 
>>>>> about
>>>>> distributed stream processing systems in the space of big data analytics,
>>>>> it is reasonable to assume anything described in this space as "real-time"
>>>>> implies "soft" real-time.  Though if this is confusing or misleading, we
>>>>> can provide clear documentation on what "real-time" in real-time mode 
>>>>> means
>>>>> and what it guarantees.  Just my thoughts.  I would love to hear other
>>>