From: Joseph Torres
Sent: Tuesday, May 1, 2018 1:58:54 PM
To: Ryan Blue
Cc: Thakrar, Jayesh; dev@spark.apache.org
Subject: Re: Datasource API V2 and checkpointing
I agree that Spark should fully handle state serialization and recovery for
most sources. This is how it works in V1, and we definitely wouldn't want
or need to change that in V2. The question is just whether we should have
an escape hatch for the sources that don't want Spark to do that, and if so,
what that escape hatch should look like.
I think there's a difference. You're right that we wanted to clean up the
API in V2 to avoid file sources using side channels. But there's a big
difference between adding, for example, a way to report partitioning and
designing for sources that need unbounded state. It's a judgment call, but
I think designing for unbounded state goes too far.
I'd argue that letting bad cases influence the design is an explicit goal
of DataSourceV2. One of the primary motivations for the project was that
file sources hook into a series of weird internal side channels, with
favorable performance characteristics that are difficult to match in the
API we actually expose to implementers.
Should we really plan the API for a source with state that grows
indefinitely? It sounds like we're letting a bad case influence the design,
when we probably shouldn't.
On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <
joseph.tor...@databricks.com> wrote:
Offset is just a type alias for arbitrary JSON-serializable state. Most
implementations should (and do) just toss the blob at Spark and let Spark
handle recovery on its own.
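To make this concrete, here is a minimal sketch against the Spark 2.3-era
DataSourceV2 streaming Offset class; SimpleOffset and its recordIndex field
are illustrative, not part of Spark:

    import org.apache.spark.sql.sources.v2.reader.streaming.Offset

    // Hypothetical offset for a source that tracks a single position.
    case class SimpleOffset(recordIndex: Long) extends Offset {
      // Spark calls json() and persists the blob in its own offset log;
      // the source never touches the checkpoint location directly.
      override def json(): String = s"""{"recordIndex":$recordIndex}"""
    }

    object SimpleOffset {
      // On restart Spark hands the stored blob back; the source only
      // parses it. (Naive parsing, just for the sketch.)
      def fromJson(json: String): SimpleOffset =
        SimpleOffset(json.replaceAll("[^0-9]", "").toLong)
    }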
In the case of file streams, the obstacle is that the conceptual offset is
very large: a list of every file which the stream has ever read.
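The built-in file source works around this by keeping the file list in a
separate metadata log and serializing only an index into it. Roughly (a
simplified sketch; Spark's actual FileStreamSourceOffset is internal):

    import org.apache.spark.sql.sources.v2.reader.streaming.Offset

    // The offset is just a pointer into a separately managed metadata
    // log, so the JSON blob stays small no matter how many files the
    // stream has seen.
    case class FileLogOffset(logOffset: Long) extends Offset {
      override def json(): String = s"""{"logOffset":$logOffset}"""
    }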
Why don't we just have the source return a Serializable of state when it
reports offsets? Then Spark could handle storing the source's state and the
source wouldn't need to worry about file system paths. I think that would
be easier for implementations and better for recovery because it wouldn't
leave recovery state outside of Spark's checkpoint.
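A purely hypothetical sketch of that proposal, to illustrate the shape of
the escape hatch; this interface does not exist in Spark and the names are
invented:

    import org.apache.spark.sql.sources.v2.reader.streaming.Offset

    // Invented interface: the source reports an opaque state blob
    // alongside each offset, and Spark stores both in the checkpoint,
    // handing them back on recovery.
    trait StatefulStreamSource {
      def latestOffset(): (Offset, Array[Byte]) // offset plus state blob
      def restore(offset: Offset, state: Array[Byte]): Unit
    }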
Thanks Joseph!
From: Joseph Torres
Date: Friday, April 27, 2018 at 11:23 AM
To: "Thakrar, Jayesh"
Cc: "dev@spark.apache.org"
Subject: Re: Datasource API V2 and checkpointing
The precise interactions with the DataSourceV2 API haven't yet been
hammered out in design. But much of this comes down to the core of
Structured Streaming rather than the API details.
The execution engine handles checkpointing and recovery. It asks the
streaming data source for offsets, and then persists them in the query's
checkpoint location so that a restarted query can resume from where it
left off.
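For example, checkpointing is enabled on the query rather than on the
source; a minimal sketch (the rate source and the paths are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("checkpoint-demo")
      .master("local[*]")
      .getOrCreate()

    // The engine writes offsets and other recovery state under the
    // checkpoint location; the source only reports offsets.
    val query = spark.readStream
      .format("rate")    // built-in test source
      .load()
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/rate-demo")
      .start()

    query.awaitTermination()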
Wondering if this issue is related to SPARK-23323?
Any pointers will be greatly appreciated….
Thanks,
Jayesh
From: "Thakrar, Jayesh"
Date: Monday, April 23, 2018 at 9:49 PM
To: "dev@spark.apache.org"
Subject: Datasource API V2 and checkpointing
I was wondering, when checkpointing is enabled, who does the actual work of
tracking and recovering state: the streaming source or the execution engine?