+1

By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> My point that "in a real-time application, there is no such thing as an
> answer that is supposed to be late and correct; timeliness is part of the
> application, and if I get the right answer too slowly it becomes useless
> or wrong" is actually fundamental to *why* we need this Spark Structured
> Streaming proposal.
>
> The proposal is precisely about enabling Spark to power applications
> where, as I define it, the *timeliness* of the answer is as critical as
> its *correctness*. Spark's current streaming engine, primarily operating
> on micro-batches, often delivers results that are technically "correct"
> but arrive too late to be truly useful for certain high-stakes, real-time
> scenarios. This makes them "useless or wrong" in a practical,
> business-critical sense.
>
> For example, in *real-time fraud detection* and *high-frequency trading*,
> market data or trade execution commands must be delivered with minimal
> latency. Even a slight delay can mean missed opportunities or significant
> financial losses, making a "correct" price update useless if it is not
> instantaneous. This proposal is about making Spark suitable for these
> demanding use cases, where a "late but correct" answer is simply not good
> enough. As a corollary, this is a fundamental concept, so it has to be
> treated as such in the SPIP, not as a comment.
>
> Hope this clarifies the connection in practical terms.
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>
>> Hey Mich,
>>
>> Sorry, I may be missing something here, but what does your definition
>> have to do with the SPIP? Perhaps add comments directly to the SPIP to
>> provide context, as the code snippet below is a direct copy from the SPIP
>> itself.
>>
>> Thanks,
>> Denny
>>
>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Just to add:
>>>
>>> A stronger definition of real time: the engineering definition of real
>>> time is roughly "fast enough to be interactive".
>>>
>>> However, I put forward a stronger definition. In a real-time
>>> application, there is no such thing as an answer that is supposed to be
>>> late and correct. Timeliness is part of the application; if I get the
>>> right answer too slowly, it becomes useless or wrong.
>>>
>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> The current limitations in SSS come from micro-batching. If you are
>>>> going to reduce micro-batching, this reduction must be balanced against
>>>> the available processing capacity of the cluster to prevent back
>>>> pressure and instability. In the case of Continuous Processing mode,
>>>> the choice of a specific continuous trigger with a desired checkpoint
>>>> interval, quote:
>>>>
>>>> "
>>>> df.writeStream
>>>>   .format("...")
>>>>   .option("...")
>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>>   .start()
>>>>
>>>> This Trigger.RealTime signals that the query should run in the new
>>>> ultra low-latency execution mode. A time interval can also be
>>>> specified, e.g. "300 Seconds", to indicate how long each micro-batch
>>>> should run for.
>>>> "
>>>>
>>>> will inevitably depend on many factors.
>>>> Not that simple.
>>>>
>>>> HTH
>>>>
>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I want to start a discussion thread for the SPIP titled "Real-Time
>>>>> Mode in Apache Spark Structured Streaming" that I've been working on
>>>>> with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael
>>>>> Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>
>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in
>>>>> Spark Structured Streaming that significantly lowers end-to-end
>>>>> latency for processing streams of data.
>>>>>
>>>>> A key principle of this proposal is compatibility. Our goal is to make
>>>>> Spark capable of handling streaming jobs that need results almost
>>>>> immediately (within O(100) milliseconds). We want to achieve this
>>>>> without changing the high-level DataFrame/Dataset API that users
>>>>> already use, so existing streaming queries can run in this new
>>>>> ultra-low-latency mode by simply turning it on, without rewriting
>>>>> their logic.
>>>>>
>>>>> In short, we're trying to enable Spark to power real-time applications
>>>>> (like instant anomaly alerts or live personalization) that today
>>>>> cannot meet their latency requirements with Spark's current streaming
>>>>> engine.
>>>>>
>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>>> this approach!
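Mich's point above, that shrinking the trigger interval must be balanced against the cluster's processing capacity to avoid back pressure, can be made concrete with a toy model. The function below, its linear-backlog assumption, and the numbers used are illustrative only; they are not taken from the SPIP or from Spark itself:

```python
def backlog_after(n_triggers: int, trigger_ms: int, batch_proc_ms: int) -> int:
    """Toy model of micro-batch back pressure.

    If each micro-batch takes longer to process than the trigger interval,
    the engine falls behind and unprocessed input piles up. Returns the
    backlog, measured in trigger intervals' worth of input, after
    n_triggers firings (assuming a steady input rate).
    """
    deficit_per_trigger = batch_proc_ms - trigger_ms
    if deficit_per_trigger <= 0:
        return 0  # cluster keeps up: no backlog accumulates
    # Each firing adds deficit_per_trigger ms of unprocessed work.
    return deficit_per_trigger * n_triggers // trigger_ms


# Cluster keeps up: 300 ms trigger, 200 ms of work per batch
print(backlog_after(1000, 300, 200))  # 0
# Cluster overloaded: 300 ms trigger, 450 ms of work per batch
print(backlog_after(1000, 300, 450))  # 500
```

The sketch just states the stability condition: as long as per-batch processing time stays at or below the trigger interval, backlog stays at zero; once it exceeds the interval, backlog grows without bound, which is why any lower trigger interval (or the proposed real-time mode's target latency) has to be matched by sufficient cluster capacity.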