I agree, and I appreciate your input clarifying the term and the gap we have
from the theoretical definition.

I'd just like to add some color here; just my two cents.

It is not uncommon for a technical term to be re-interpreted and
expanded. One well-known example is "exactly-once processing" semantics.
Does this mean the engine processes the data only once? No. Most streaming
engines deal with this by loosening the definition to "effectively
once" (the term appears in Jay Kreps's relevant blog post
<https://medium.com/@jaykreps/exactly-once-support-in-apache-kafka-55e1fdd0a35f>),
ensuring at-least-once processing semantics plus making the writes
idempotent or transactional. That said, upon failure the engine
processes the data multiple times (at-least-once), but the output is the
same as it would be if the engine had processed the data with true
exactly-once semantics.
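The "at-least-once plus idempotent writes" combination can be sketched in a
few lines (an illustrative Python sketch, not tied to any real engine; all
names are made up):

```python
# Sketch of "effectively once": at-least-once delivery plus an idempotent
# sink. Class and function names here are illustrative assumptions.

class IdempotentSink:
    """Stores records keyed by a unique event id; replayed writes are no-ops."""
    def __init__(self):
        self.store = {}

    def write(self, event_id, value):
        # Idempotent upsert: writing the same event twice has no extra effect.
        self.store[event_id] = value

def deliver_at_least_once(events, sink, fail_after=None):
    """Simulates a failure mid-batch followed by a full replay,
    so some events are processed more than once (at-least-once)."""
    for i, (event_id, value) in enumerate(events):
        if fail_after is not None and i == fail_after:
            # Crash before finishing; recovery replays the whole batch.
            return deliver_at_least_once(events, sink, fail_after=None)
        sink.write(event_id, value)

events = [("e1", 10), ("e2", 20), ("e3", 30)]
sink = IdempotentSink()
deliver_at_least_once(events, sink, fail_after=2)  # e1 and e2 run twice
# The observable output matches true exactly-once processing.
assert sink.store == {"e1": 10, "e2": 20, "e3": 30}
```

Even though e1 and e2 were processed twice, the visible output contains no
duplicates, which is exactly the aspect users care about.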

So does the industry try to correct the term? No, because it is still
practically correct in some respects. If someone assumes the strict
definition of exactly-once when building their streaming app, their
expectations will not hold upon failure. The checkpoint interval is
important in the restoration of the streaming app because there is an
implication that we reprocess data, which breaks the expectation in some
sense. Still, the term "exactly-once" is used, because everyone's focus is
on duplication in the output, and the term fits that aspect.
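The reprocessing-after-restore point can also be made concrete: output can
still appear exactly-once if the sink commits per micro-batch and skips
batches it has already committed. This is a rough sketch (the sink and batch
ids are my own assumptions, though it follows the same spirit as deduplicating
by batch id in Structured Streaming's foreachBatch):

```python
# Sketch of restart-safe output: a replayed micro-batch after checkpoint
# recovery is detected by its batch id and skipped, so the visible output
# stays duplicate-free. All names are illustrative.

class TransactionalSink:
    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def commit(self, batch_id, rows):
        if batch_id in self.committed_batches:
            return  # replayed batch after recovery: skip, output stays exact
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)

sink = TransactionalSink()
sink.commit(0, ["a", "b"])
sink.commit(1, ["c"])
# After a failure, the engine restarts from the last checkpoint and
# re-emits batch 1 (at-least-once processing)...
sink.commit(1, ["c"])
# ...but the output shows no duplicates.
assert sink.rows == ["a", "b", "c"]
```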

That said, the question is how typical users would reason about the term
"real-time" for a streaming engine. As Jerry pointed out with two
well-known open-source streaming engines, the term "real-time" has not been
used strictly in its textbook sense, and that has been the case
for more than 10 years. I have to speculate, but the term "real-time" is
probably leveraged to emphasize the difference from a batch query/app.

Is 'real-time' a marketing term? Probably yes, and we would do well to
clarify the definition to avoid confusion when reading through the
technical docs. Shouldn't we use the term at all? I'm not sure. Whenever we
introduce a new feature and name it, it is quite natural to consider the
marketing point of view for the name. IMHO, the reason 'real-time' is
referenced so many times in this industry (despite not matching the strict
textbook definition) is probably that it has proved very effective at
marketing.


On Sat, May 31, 2025 at 4:40 AM Mark Hamstra <markhams...@gmail.com> wrote:

> A soft real-time system still defines an interval or frame within which
> results should be available, and often provides explicit warning or
> error-handling mechanisms when frame rates are missed. I see nothing like
> that in the SPIP. Instead, the length of the underlying microbatches is
> specified in the Trigger, but results are reported as quickly as
> possible, with no reporting interval or frame rate specified and nothing
> that I can see happening if results take longer than the user is guessing
> or expecting. That's a low-latency, "we'll do it as fast as we can, but no
> promises or guarantees" system, not real-time.
>
> On Thu, May 29, 2025 at 11:57 PM Jerry Peng <jerry.boyang.p...@gmail.com>
> wrote:
>
>> Mark,
>>
>> For real-time systems there is a concept of "soft" real-time and "hard"
>> real-time systems.  These concepts exist in textbooks.  Here is a document
>> by intel that explains it:
>>
>>
>> https://www.intel.com/content/www/us/en/learn/what-is-a-real-time-system.html
>>
>> "In a soft real-time system, computers or equipment will continue to
>> function after a missed deadline but may produce a lower-quality output.
>> For example, latency in online video games can impact player interactions,
>> but otherwise present no serious consequences."
>>
>> "Hard real-time systems have zero delay tolerance, and delayed signals
>> can result in total failure or present immediate danger to users. Flight
>> control systems and pacemakers are both examples where timeliness is not
>> only essential but the lack of it can result in a life-or-death situation."
>>
>> I don't think it is inaccurate or misleading to call this mode
>> real-time.  It is soft real-time.
>>
>> On Thu, May 29, 2025 at 11:44 PM Mark Hamstra <markhams...@gmail.com>
>> wrote:
>>
>>> Clarifying what is meant by "real-time" and explicitly differentiating
>>> it from actual real-time computing should be a bare minimum. I still don't
>>> like the use of marketing-speak "real-time" that isn't really real-time in
>>> engineering documents or API namespaces.
>>>
>>> On Thu, May 29, 2025 at 10:43 PM Jerry Peng <jerry.boyang.p...@gmail.com>
>>> wrote:
>>>
>>>> Mark,
>>>>
>>>> I thought we were simply discussing the naming of the mode?  Like I
>>>> mentioned, if you think simply calling this mode "real-time" mode may cause
>>>> confusion because "real-time" can mean other things in other fields, I can
>>>> clarify what we mean by "real-time" explicitly in the SPIP document and any
>>>> future documentation. That is not a problem and thank you for your 
>>>> feedback.
>>>>
>>>> On Thu, May 29, 2025 at 10:37 PM Mark Hamstra <markhams...@gmail.com>
>>>> wrote:
>>>>
>>>>> Referencing other misuse of "real-time" is not persuasive. A SPIP is
>>>>> an engineering document, not a marketing document. Technical clarity and
>>>>> accuracy should be non-negotiable.
>>>>>
>>>>>
>>>>> On Thu, May 29, 2025 at 10:27 PM Jerry Peng <
>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> As an example of my point, if you go to the Apache Storm (another
>>>>>> stream processing engine) website:
>>>>>>
>>>>>> https://storm.apache.org/
>>>>>>
>>>>>> It describes Storm as:
>>>>>>
>>>>>> "Apache Storm is a free and open source distributed *realtime*
>>>>>> computation system"
>>>>>>
>>>>>> If you go to Apache Flink:
>>>>>>
>>>>>>
>>>>>> https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
>>>>>>
>>>>>> "Apache Flink 2.0.0: A new Era of *Real-Time* Data Processing"
>>>>>>
>>>>>> Thus, what the term "real-time" implies in this space should not be
>>>>>> confusing for folks in this area.
>>>>>>
>>>>>> On Thu, May 29, 2025 at 10:22 PM Jerry Peng <
>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>
>>>>>>> Mich,
>>>>>>>
>>>>>>> If I understood your last email correctly, I think you also wanted
>>>>>>> to have a discussion about naming?  Why are we calling this new 
>>>>>>> execution
>>>>>>> mode described in the SPIP "Real-time Mode"?  Here are my two cents.
>>>>>>> Firstly, "continuous mode" is taken and we want another name to 
>>>>>>> describe an
>>>>>>> execution mode that provides ultra low latency processing.  We could 
>>>>>>> have
>>>>>>> called it "low latency mode", though I don't really like that naming 
>>>>>>> since
>>>>>>> it implies the other execution modes are not low latency which I don't
>>>>>>> believe is true.  This new proposed mode can simply deliver even lower
>>>>>>> latency.  Thus, we came up with the name "Real-time Mode".  Of course, 
>>>>>>> we
>>>>>>> are talking about "soft" real-time here.  I think when we are talking 
>>>>>>> about
>>>>>>> distributed stream processing systems in the space of big data 
>>>>>>> analytics,
>>>>>>> it is reasonable to assume anything described in this space as 
>>>>>>> "real-time"
>>>>>>> implies "soft" real-time.  Though if this is confusing or misleading, we
>>>>>>> can provide clear documentation on what "real-time" in real-time mode 
>>>>>>> means
>>>>>>> and what it guarantees.  Just my thoughts.  I would love to hear other
>>>>>>> perspectives.
>>>>>>>
>>>>>>> On Thu, May 29, 2025 at 3:48 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think from what I have seen there are a good number of +1
>>>>>>>> responses as opposed to quantitative discussions (based on my 
>>>>>>>> observations
>>>>>>>> only). Given the objectives of the thread, we ought to focus on what is
>>>>>>>> meant by real time compared to continuous modes. To be fair, it is a
>>>>>>>> common point of confusion, and the terms are often used 
>>>>>>>> interchangeably in
>>>>>>>> general conversation, but in technical contexts, especially with 
>>>>>>>> streaming
>>>>>>>> data platforms, they have specific and important differences.
>>>>>>>>
>>>>>>>> "Continuous Mode" refers to a processing strategy that aims for
>>>>>>>> true, uninterrupted, sub-millisecond latency processing.  Chiefly
>>>>>>>>
>>>>>>>>    - Event-at-a-Time (or very small batch groups): The system
>>>>>>>>    processes individual events or extremely small groups of events
>>>>>>>>    (micro-batches) as they flow through the pipeline.
>>>>>>>>    - Minimal Latency: The primary goal is to achieve the absolute
>>>>>>>>    lowest possible end-to-end latency, often on the order of
>>>>>>>>    milliseconds or even below.
>>>>>>>>    - Most business use cases (say financial markets) can live with
>>>>>>>>    this as they do not rely on rdges
>>>>>>>>
>>>>>>>> Now what is meant by "Real-time Mode"
>>>>>>>>
>>>>>>>> This is where the nuance comes in. "Real-time" is a broader and
>>>>>>>> sometimes more subjective term. When the text introduces "Real-time 
>>>>>>>> Mode"
>>>>>>>> as distinct from "Continuous Mode," it suggests a specific 
>>>>>>>> implementation
>>>>>>>> that achieves real-time characteristics but might do so differently or 
>>>>>>>> more
>>>>>>>> robustly than a "continuous" mode attempt. Going back to my earlier
>>>>>>>> mention: in a real-time application, there is no such thing as an
>>>>>>>> answer that is late and correct. Timeliness is part of the
>>>>>>>> application; if I get the right answer too slowly, it becomes
>>>>>>>> useless or wrong. This is what I call the "Late and Correct is
>>>>>>>> Useless" principle.
>>>>>>>>
>>>>>>>> In summary, "Real-time Mode" seems to describe an approach that
>>>>>>>> delivers low-latency processing with high reliability and ease of use,
>>>>>>>> leveraging established, battle-tested components. I invite the audience
>>>>>>>> to
>>>>>>>> have a discussion on this.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Dr Mich Talebzadeh,
>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis |
>>>>>>>> GDPR
>>>>>>>>
>>>>>>>>    view my Linkedin profile
>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 29 May 2025 at 19:15, Yang Jie <yangji...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> On 2025/05/29 16:25:19 Xiao Li wrote:
>>>>>>>>> > +1
>>>>>>>>> >
>>>>>>>>> > Yuming Wang <yumw...@apache.org> 于2025年5月29日周四 02:22写道:
>>>>>>>>> >
>>>>>>>>> > > +1.
>>>>>>>>> > >
>>>>>>>>> > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com>
>>>>>>>>> wrote:
>>>>>>>>> > >
>>>>>>>>> > >> +1
>>>>>>>>> > >> Sent from my iPhone
>>>>>>>>> > >>
>>>>>>>>> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >> 
>>>>>>>>> > >> +1 Nice feature
>>>>>>>>> > >>
>>>>>>>>> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <
>>>>>>>>> xyliyuanj...@gmail.com>
>>>>>>>>> > >> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >>> +1
>>>>>>>>> > >>>
>>>>>>>>> > >>> Kent Yao <y...@apache.org> 于2025年5月28日周三 19:31写道:
>>>>>>>>> > >>>
>>>>>>>>> > >>>> +1, LGTM.
>>>>>>>>> > >>>>
>>>>>>>>> > >>>> Kent
>>>>>>>>> > >>>>
>>>>>>>>> > >>>> 在 2025年5月29日星期四,Chao Sun <sunc...@apache.org> 写道:
>>>>>>>>> > >>>>
>>>>>>>>> > >>>>> +1. Super excited by this initiative!
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <
>>>>>>>>> yblia...@gmail.com>
>>>>>>>>> > >>>>> wrote:
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>>> +1
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <
>>>>>>>>> huaxin.ga...@gmail.com>
>>>>>>>>> > >>>>>> wrote:
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>>> +1
>>>>>>>>> > >>>>>>> By unifying batch and low-latency streaming in Spark, we
>>>>>>>>> can
>>>>>>>>> > >>>>>>> eliminate the need for separate streaming engines,
>>>>>>>>> reducing system
>>>>>>>>> > >>>>>>> complexity and operational cost. Excited to see this
>>>>>>>>> direction!
>>>>>>>>> > >>>>>>>
>>>>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh <
>>>>>>>>> > >>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>
>>>>>>>>> > >>>>>>>> Hi,
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> My point about "in real time application or data, there
>>>>>>>>> is nothing
>>>>>>>>> > >>>>>>>> as an answer which is supposed to be late and correct.
>>>>>>>>> The timeliness is
>>>>>>>>> > >>>>>>>> part of the application. if I get the right answer too
>>>>>>>>> slowly it becomes
>>>>>>>>> > >>>>>>>> useless or wrong" is actually fundamental to *why* we
>>>>>>>>> need this
>>>>>>>>> > >>>>>>>> Spark Structured Streaming proposal.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power
>>>>>>>>> > >>>>>>>> applications where, as I define it, the *timeliness* of
>>>>>>>>> the answer
>>>>>>>>> > >>>>>>>> is as critical as its *correctness*. Spark's current
>>>>>>>>> streaming
>>>>>>>>> > >>>>>>>> engine, primarily operating on micro-batches, often
>>>>>>>>> delivers results that
>>>>>>>>> > >>>>>>>> are technically "correct" but arrive too late to be
>>>>>>>>> truly useful for
>>>>>>>>> > >>>>>>>> certain high-stakes, real-time scenarios. This makes
>>>>>>>>> them "useless or
>>>>>>>>> > >>>>>>>> wrong" in a practical, business-critical sense.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and
>>>>>>>>> > >>>>>>>> *high-frequency trading*, market data or trade execution
>>>>>>>>> > >>>>>>>> commands must be delivered with minimal latency. Even a
>>>>>>>>> > >>>>>>>> slight delay can mean missed opportunities or significant
>>>>>>>>> > >>>>>>>> financial losses, making a "correct" price update useless
>>>>>>>>> > >>>>>>>> if it's not instantaneous. This makes Spark suitable for
>>>>>>>>> > >>>>>>>> these demanding use cases, where a "late but correct"
>>>>>>>>> > >>>>>>>> answer is simply not good enough. As a corollary, it is a
>>>>>>>>> > >>>>>>>> fundamental concept, so it has to be treated as such in
>>>>>>>>> > >>>>>>>> the SPIP, not as a comment.
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <
>>>>>>>>> denny.g....@gmail.com>
>>>>>>>>> > >>>>>>>> wrote:
>>>>>>>>> > >>>>>>>>
>>>>>>>>> > >>>>>>>>> Hey Mich,
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> Sorry, I may be missing something here but what does
>>>>>>>>> your
>>>>>>>>> > >>>>>>>>> definition here have to do with the SPIP?   Perhaps
>>>>>>>>> add comments directly
>>>>>>>>> > >>>>>>>>> to the SPIP to provide context as the code snippet
>>>>>>>>> below is a direct copy
>>>>>>>>> > >>>>>>>>> from the SPIP itself.
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> Thanks,
>>>>>>>>> > >>>>>>>>> Denny
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh <
>>>>>>>>> > >>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>
>>>>>>>>> > >>>>>>>>>> just to add
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering
>>>>>>>>> definition of
>>>>>>>>> > >>>>>>>>>> real time is roughly fast enough to be interactive
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> However, I put a stronger definition. In real time
>>>>>>>>> application or
>>>>>>>>> > >>>>>>>>>> data, there is nothing as an answer which is supposed
>>>>>>>>> to be late and
>>>>>>>>> > >>>>>>>>>> correct. The timeliness is part of the application.if
>>>>>>>>> I get the right
>>>>>>>>> > >>>>>>>>>> answer too slowly it becomes useless or wrong
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh <
>>>>>>>>> > >>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from
>>>>>>>>> micro-batching. If you
>>>>>>>>> > >>>>>>>>>>> are going to reduce micro-batching, this reduction
>>>>>>>>> must be balanced against
>>>>>>>>> > >>>>>>>>>>> the available processing capacity of the cluster to
>>>>>>>>> prevent back pressure
>>>>>>>>> > >>>>>>>>>>> and instability. In the case of Continuous
>>>>>>>>> Processing mode, a
>>>>>>>>> > >>>>>>>>>>> specific continuous trigger with a desired
>>>>>>>>> checkpoint interval quote
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> "
>>>>>>>>> > >>>>>>>>>>> df.writeStream
>>>>>>>>> > >>>>>>>>>>>    .format("...")
>>>>>>>>> > >>>>>>>>>>>    .option("...")
>>>>>>>>> > >>>>>>>>>>>    .trigger(Trigger.RealTime(“300 Seconds”))    //
>>>>>>>>> new trigger
>>>>>>>>> > >>>>>>>>>>> type to enable real-time Mode
>>>>>>>>> > >>>>>>>>>>>    .start()
>>>>>>>>> > >>>>>>>>>>> This Trigger.RealTime signals that the query should
>>>>>>>>> run in the
>>>>>>>>> > >>>>>>>>>>> new ultra low-latency execution mode.  A time
>>>>>>>>> interval can also be
>>>>>>>>> > >>>>>>>>>>> specified, e.g. “300 Seconds”, to indicate how long
>>>>>>>>> each micro-batch should
>>>>>>>>> > >>>>>>>>>>> run for.
>>>>>>>>> > >>>>>>>>>>> "
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors. Not that
>>>>>>>>> simple
>>>>>>>>> > >>>>>>>>>>> HTH
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng <
>>>>>>>>> > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote:
>>>>>>>>> > >>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> Hi all,
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP
>>>>>>>>> titled
>>>>>>>>> > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured
>>>>>>>>> Streaming” that I've been
>>>>>>>>> > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao
>>>>>>>>> Sun, Jungtaek Lim, and
>>>>>>>>> > >>>>>>>>>>>> Michael Armbrust: [JIRA
>>>>>>>>> > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>]
>>>>>>>>> [Doc
>>>>>>>>> > >>>>>>>>>>>> <
>>>>>>>>> https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing
>>>>>>>>> >
>>>>>>>>> > >>>>>>>>>>>> ].
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called
>>>>>>>>> “Real-time Mode”
>>>>>>>>> > >>>>>>>>>>>> in Spark Structured Streaming that significantly
>>>>>>>>> lowers end-to-end latency
>>>>>>>>> > >>>>>>>>>>>> for processing streams of data.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility.
>>>>>>>>> Our goal is
>>>>>>>>> > >>>>>>>>>>>> to make Spark capable of handling streaming jobs
>>>>>>>>> that need results almost
>>>>>>>>> > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want
>>>>>>>>> to achieve this without
>>>>>>>>> > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that
>>>>>>>>> users already use – so
>>>>>>>>> > >>>>>>>>>>>> existing streaming queries can run in this new
>>>>>>>>> ultra-low-latency mode by
>>>>>>>>> > >>>>>>>>>>>> simply turning it on, without rewriting their logic.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power
>>>>>>>>> real-time
>>>>>>>>> > >>>>>>>>>>>> applications (like instant anomaly alerts or live
>>>>>>>>> personalization) that
>>>>>>>>> > >>>>>>>>>>>> today cannot meet their latency requirements with
>>>>>>>>> Spark’s current streaming
>>>>>>>>> > >>>>>>>>>>>> engine.
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and
>>>>>>>>> > >>>>>>>>>>>> suggestions on this approach!
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>>>>>>>
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> --
>>>>>>>>> > >>>>>> Best,
>>>>>>>>> > >>>>>> Yanbo
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>
>>>>>>>>> > >>
>>>>>>>>> > >> --
>>>>>>>>> > >> John Zhuge
>>>>>>>>> > >>
>>>>>>>>> > >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>>
>>>>>>>>>
