+1

By unifying batch and low-latency streaming in Spark, we can eliminate the need for separate streaming engines, reducing system complexity and operational cost. Excited to see this direction!
On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> My point that "in a real-time application, there is no such thing as an
> answer that is supposed to be late and correct; timeliness is part of the
> application, and if I get the right answer too slowly it becomes useless
> or wrong" is actually fundamental to *why* we need this Spark Structured
> Streaming proposal.
>
> The proposal is precisely about enabling Spark to power applications
> where, as I define it, the *timeliness* of the answer is as critical as
> its *correctness*. Spark's current streaming engine, primarily operating
> on micro-batches, often delivers results that are technically "correct"
> but arrive too late to be truly useful for certain high-stakes, real-time
> scenarios. This makes them "useless or wrong" in a practical,
> business-critical sense.
>
> For example, in *real-time fraud detection* and *high-frequency trading*,
> market data or trade execution commands must be delivered with minimal
> latency. Even a slight delay can mean missed opportunities or significant
> financial losses, making a "correct" price update useless if it is not
> instantaneous. This proposal is about making Spark suitable for these
> demanding use cases, where a "late but correct" answer is simply not good
> enough. As a corollary, this is a fundamental concept, so it has to be
> treated as such in the SPIP, not as a comment.
>
> Hope this clarifies the connection in practical terms.
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> wrote:
>
>> Hey Mich,
>>
>> Sorry, I may be missing something here, but what does your definition
>> have to do with the SPIP? Perhaps add comments directly to the SPIP to
>> provide context, as the code snippet below is a direct copy from the SPIP
>> itself.
>>
>> Thanks,
>> Denny
>>
>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Just to add:
>>>
>>> A stronger definition of real time: the engineering definition of real
>>> time is roughly "fast enough to be interactive".
>>>
>>> However, I put forward a stronger definition. In a real-time
>>> application, there is no such thing as an answer that is supposed to be
>>> late and correct. Timeliness is part of the application; if I get the
>>> right answer too slowly, it becomes useless or wrong.
>>>
>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> The current limitations in SSS come from micro-batching. If you are
>>>> going to reduce micro-batching, this reduction must be balanced against
>>>> the available processing capacity of the cluster to prevent back
>>>> pressure and instability. In the case of Continuous Processing mode,
>>>> the choice of a specific continuous trigger with a desired checkpoint
>>>> interval, quote:
>>>>
>>>> "
>>>> df.writeStream
>>>>   .format("...")
>>>>   .option("...")
>>>>   .trigger(Trigger.RealTime("300 Seconds")) // new trigger type to enable real-time Mode
>>>>   .start()
>>>>
>>>> This Trigger.RealTime signals that the query should run in the new
>>>> ultra low-latency execution mode. A time interval can also be
>>>> specified, e.g. "300 Seconds", to indicate how long each micro-batch
>>>> should run for.
>>>> "
>>>>
>>>> will inevitably depend on many factors.
>>>> Not that simple.
>>>>
>>>> HTH
>>>>
>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I want to start a discussion thread for the SPIP titled "Real-Time
>>>>> Mode in Apache Spark Structured Streaming" that I've been working on
>>>>> with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and Michael
>>>>> Armbrust: [JIRA <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>> [Doc <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing>].
>>>>>
>>>>> The SPIP proposes a new execution mode called "Real-time Mode" in
>>>>> Spark Structured Streaming that significantly lowers end-to-end
>>>>> latency for processing streams of data.
>>>>>
>>>>> A key principle of this proposal is compatibility. Our goal is to make
>>>>> Spark capable of handling streaming jobs that need results almost
>>>>> immediately (within O(100) milliseconds). We want to achieve this
>>>>> without changing the high-level DataFrame/Dataset API that users
>>>>> already use, so existing streaming queries can run in this new
>>>>> ultra-low-latency mode by simply turning it on, without rewriting
>>>>> their logic.
>>>>>
>>>>> In short, we're trying to enable Spark to power real-time applications
>>>>> (like instant anomaly alerts or live personalization) that today
>>>>> cannot meet their latency requirements with Spark's current streaming
>>>>> engine.
>>>>>
>>>>> We'd greatly appreciate your feedback, thoughts, and suggestions on
>>>>> this approach!
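Mich's point above, that shrinking the trigger interval must be balanced against the cluster's processing capacity to avoid back pressure, can be made concrete with a toy model. The function below, its linear-backlog assumption, and the numbers used are illustrative only; they are not taken from the SPIP or from Spark itself:

```python
def backlog_after(n_triggers: int, trigger_ms: int, batch_proc_ms: int) -> int:
    """Toy model of micro-batch back pressure.

    If each micro-batch takes longer to process than the trigger interval,
    the engine falls behind and unprocessed input piles up. Returns the
    backlog, measured in trigger intervals' worth of input, after
    n_triggers firings (assuming a steady input rate).
    """
    deficit_per_trigger = batch_proc_ms - trigger_ms
    if deficit_per_trigger <= 0:
        return 0  # cluster keeps up: no backlog accumulates
    # Each firing adds deficit_per_trigger ms of unprocessed work.
    return deficit_per_trigger * n_triggers // trigger_ms


# Cluster keeps up: 300 ms trigger, 200 ms of work per batch
print(backlog_after(1000, 300, 200))  # 0
# Cluster overloaded: 300 ms trigger, 450 ms of work per batch
print(backlog_after(1000, 300, 450))  # 500
```

The sketch just states the stability condition: as long as per-batch processing time stays at or below the trigger interval, backlog stays at zero; once it exceeds the interval, backlog grows without bound, which is why any lower trigger interval (or the proposed real-time mode's target latency) has to be matched by sufficient cluster capacity.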