Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Jungtaek Lim
I agree, and I appreciate your input to clarify the term and the gap we have from the theoretical definition. I would just like to add some color here, for my 2 cents. It is not uncommon for a technical term to be re-interpreted and expanded. One of the known examples is "exactly-once processin

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Sakthi
+1 (non-binding) On Fri, May 30, 2025 at 2:39 PM Jules Damji wrote: > +1 (non-binding) > Sent from my iPhone > Pardon the dumb thumb typos :) > > On May 30, 2025, at 12:39 PM, Mark Hamstra wrote: > > A soft real-time system still defines an interval or frame within which > results sho

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Jules Damji
+1 (non-binding) Sent from my iPhone. Pardon the dumb thumb typos :) On May 30, 2025, at 12:39 PM, Mark Hamstra wrote: A soft real-time system still defines an interval or frame within which results should be available, and often provides explicit warning or error-handling mechanisms when frame rate

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mich Talebzadeh
OK, fair points. This SPIP (Structured Streaming, in this context) admittedly does not meet the rigorous, academic definition of a soft real-time system, due to the lack of explicit, guaranteed deadlines and internal mechanisms for handling missed frames. Having said that, despite not being a "stri

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mark Hamstra
A soft real-time system still defines an interval or frame within which results should be available, and often provides explicit warning or error-handling mechanisms when frame rates are missed. I see nothing like that in the SPIP. Instead, the length of the underlying microbatches is specified in

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread L. C. Hsieh
Thanks to everyone in the community for your interest and support for this proposal. We've had extensive and constructive discussions both in this thread and in the SPIP document. These conversations have been positive and encouraging for moving in this direction. Special thanks to the SPIP authors

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Denny Lee
+1 (non-binding) On Fri, May 30, 2025 at 9:17 AM xianjin wrote: > +1 > Sent from my iPhone > > On May 29, 2025, at 12:53 PM, Yuanjian Li wrote: > > +1 > > Kent Yao wrote on Wed, May 28, 2025 at 19:31: > >> +1, LGTM. >> >> Kent >> >> On Thursday, May 29, 2025, Chao Sun wrote: >> >>> +1. Super excited by this initiati

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread xianjin
+1 Sent from my iPhone. On May 29, 2025, at 12:53 PM, Yuanjian Li wrote: +1 Kent Yao wrote on Wed, May 28, 2025 at 19:31: +1, LGTM. Kent On Thursday, May 29, 2025, Chao Sun wrote: +1. Super excited by this initiative! On Wed, May 28, 2025 at 1:54 PM Yanbo Liang wrote: +1 On We

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Jerry Peng
Mich, Sounds good. I will add the clarification to the SPIP. On Fri, May 30, 2025 at 3:47 AM Mich Talebzadeh wrote: > Hi Jerry, > > In essence, these definitions (hard or soft) help clarify that "real-time" > is *not a single, monolithic concept here,* but rather a spectrum defined > by the cr

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-30 Thread Mich Talebzadeh
Hi Jerry, In essence, these definitions (hard or soft) help clarify that "real-time" is *not a single, monolithic concept here,* but rather a spectrum defined by the criticality of timeliness and systems under consideration. Common data processing solutions branded as "real-time" are typically ope

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark, For real-time systems there is a concept of "soft" real-time and "hard" real-time systems. These concepts exist in textbooks. Here is a document by Intel that explains it: https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html "In a soft real-time system, computers

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
Clarifying what is meant by "real-time" and explicitly differentiating it from actual real-time computing should be a bare minimum. I still don't like the use of marketing-speak "real-time" that isn't really real-time in engineering documents or API namespaces. On Thu, May 29, 2025 at 10:43 PM Jer

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark, I thought we were simply discussing the naming of the mode? As I mentioned, if you think simply calling this mode "real-time" mode may cause confusion because "real-time" can mean other things in other fields, I can clarify what we mean by "real-time" explicitly in the SPIP document and an

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mich, If I understood your last email correctly, I think you also wanted to have a discussion about naming? Why are we calling this new execution mode described in the SPIP "Real-time Mode"? Here are my two cents. Firstly, "continuous mode" is taken and we want another name to describe an execu

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
Referencing other misuse of "real-time" is not persuasive. A SPIP is an engineering document, not a marketing document. Technical clarity and accuracy should be non-negotiable. On Thu, May 29, 2025 at 10:27 PM Jerry Peng wrote: > Mark, > > As an example of my point, if you go to the Apache Stor

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mark, As an example of my point, if you go to the Apache Storm (another stream processing engine) website: https://storm.apache.org/ It describes Storm as: "Apache Storm is a free and open source distributed *realtime* computation system" If you go to Apache Flink: https://flink.apache.org/2

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mark Hamstra
It should not be assumed. In something called "real-time", it should be very explicit which clock-time constraints are and are not guaranteed. On Thu, May 29, 2025 at 10:00 PM Jerry Peng wrote: > It was kind of hard to see what Mich's point was in the plethora of > emails he sent :) > > In embed

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Mich, Thank you for chiming in and providing insights into the importance of not only getting correct results but also timely results. You are absolutely right that the reason why something like Real-time Mode is valuable is its ability to provide timely results for certain use cases that require

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Mich Talebzadeh
I think from what I have seen there are a good number of +1 responses as opposed to quantitative discussions (based on my observations only). Given the objectives of the thread, we ought to focus on what is meant by real time compared to continuous modes. To be fair, it is a common point of confus

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Yang Jie
+1 On 2025/05/29 16:25:19 Xiao Li wrote: > +1 > > Yuming Wang wrote on Thu, May 29, 2025 at 02:22: > > > +1. > > > > On Thu, May 29, 2025 at 3:36 PM DB Tsai wrote: > > > >> +1 > >> Sent from my iPhone > >> > >> On May 29, 2025, at 12:15 AM, John Zhuge wrote: > >> > >> +1 Nice feature > >> > >> On We

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Jerry Peng
Hi all, A big thanks to everyone who provided feedback on the SPIP! My co-authors and I really appreciate it. I am excited to see this amount of interest in the proposal. I am also glad to see all the support this initiative is getting from the community. Let me summarize some of the common qu

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Xiao Li
+1 Yuming Wang wrote on Thu, May 29, 2025 at 02:22: > +1. > > On Thu, May 29, 2025 at 3:36 PM DB Tsai wrote: > >> +1 >> Sent from my iPhone >> >> On May 29, 2025, at 12:15 AM, John Zhuge wrote: >> >> +1 Nice feature >> >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li >> wrote: >> >>> +1 >>> >>> Kent Yao

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread Yuming Wang
+1. On Thu, May 29, 2025 at 3:36 PM DB Tsai wrote: > +1 > Sent from my iPhone > > On May 29, 2025, at 12:15 AM, John Zhuge wrote: > > +1 Nice feature > > On Wed, May 28, 2025 at 9:53 PM Yuanjian Li > wrote: > >> +1 >> >> Kent Yao wrote on Wed, May 28, 2025 at 19:31: >> >>> +1, LGTM. >>> >>> Kent >>> >>

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread DB Tsai
+1 Sent from my iPhone. On May 29, 2025, at 12:15 AM, John Zhuge wrote: +1 Nice feature On Wed, May 28, 2025 at 9:53 PM Yuanjian Li wrote: +1 Kent Yao wrote on Wed, May 28, 2025 at 19:31: +1, LGTM. Kent On Thursday, May 29, 2025, Chao Sun wrote: +1. Super excited by this

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-29 Thread John Zhuge
+1 Nice feature On Wed, May 28, 2025 at 9:53 PM Yuanjian Li wrote: > +1 > > Kent Yao wrote on Wed, May 28, 2025 at 19:31: > >> +1, LGTM. >> >> Kent >> >> On Thursday, May 29, 2025, Chao Sun wrote: >> >>> +1. Super excited by this initiative! >>> >>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang wrote: >>> +1 >>>

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Yuanjian Li
+1 Kent Yao wrote on Wed, May 28, 2025 at 19:31: > +1, LGTM. > > Kent > > On Thursday, May 29, 2025, Chao Sun wrote: > >> +1. Super excited by this initiative! >> >> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang wrote: >> >>> +1 >>> >>> On Wed, May 28, 2025 at 12:34 PM huaxin gao >>> wrote: >>> +1 By unifying ba

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Kent Yao
+1, LGTM. Kent On Thursday, May 29, 2025, Chao Sun wrote: > +1. Super excited by this initiative! > > On Wed, May 28, 2025 at 1:54 PM Yanbo Liang wrote: > >> +1 >> >> On Wed, May 28, 2025 at 12:34 PM huaxin gao >> wrote: >> >>> +1 >>> By unifying batch and low-latency streaming in Spark, we can eliminate >

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Chao Sun
+1. Super excited by this initiative! On Wed, May 28, 2025 at 1:54 PM Yanbo Liang wrote: > +1 > > On Wed, May 28, 2025 at 12:34 PM huaxin gao > wrote: > >> +1 >> By unifying batch and low-latency streaming in Spark, we can eliminate >> the need for separate streaming engines, reducing system co

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Yanbo Liang
+1 On Wed, May 28, 2025 at 12:34 PM huaxin gao wrote: > +1 > By unifying batch and low-latency streaming in Spark, we can eliminate the > need for separate streaming engines, reducing system complexity and > operational cost. Excited to see this direction! > > On Wed, May 28, 2025 at 9:08 AM Mic

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread huaxin gao
+1 By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction! On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh wrote: > Hi, > > My point about "in real time applica

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
Hi, My point about "in real time application or data, there is nothing as an answer which is supposed to be late and correct. The timeliness is part of the application. if I get the right answer too slowly it becomes useless or wrong" is actually fundamental to *why* we need this Spark Structured

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Denny Lee
Hey Mich, Sorry, I may be missing something here but what does your definition here have to do with the SPIP? Perhaps add comments directly to the SPIP to provide context as the code snippet below is a direct copy from the SPIP itself. Thanks, Denny On Wed, May 28, 2025 at 06:48 Mich Talebz

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
Just to add a stronger definition of real time: the engineering definition of real time is roughly "fast enough to be interactive". However, I propose a stronger definition. In a real-time application or data, there is no such thing as an answer that is supposed to be late and correct. The timeliness is part o

Re: [DISCUSS] SPIP: Real-Time Mode in Apache Spark Structured Streaming

2025-05-28 Thread Mich Talebzadeh
The current limitations in SSS come from micro-batching. If you are going to reduce micro-batching, this reduction must be balanced against the available processing capacity of the cluster to prevent back pressure and instability. In the case of Continuous Processing mode, a specific continuous trig