Re: Datasource API V2 and checkpointing

2018-05-01 Thread Thakrar, Jayesh
From: Joseph Torres. Sent: Tuesday, May 1, 2018, 1:58:54 PM. To: Ryan Blue. Cc: Thakrar, Jayesh; dev@spark.apache.org. Subject: Re: Datasource API V2 and checkpointing. I agree that Spark should fully handle state serialization and recovery for most sources. This is how…

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Joseph Torres
I agree that Spark should fully handle state serialization and recovery for most sources. This is how it works in V1, and we definitely wouldn't want or need to change that in V2. The question is just whether we should have an escape hatch for the sources that don't want Spark to do that, and if so…
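The "escape hatch" being debated could be sketched roughly as follows. This is a hypothetical illustration, not the actual DataSourceV2 API: every interface and method name here is invented. The default contract hands Spark an opaque offset blob to checkpoint; a source with unwieldy state could opt in to an extra capability interface and persist its own state, giving Spark only a small handle to log.

```java
// Hypothetical sketch of the "escape hatch" idea; these interfaces are
// illustrative only and are NOT part of Spark's real DataSourceV2 API.

// Default contract: the source hands Spark an opaque, JSON-serializable
// offset, and Spark owns checkpointing and recovery end to end.
interface StreamSource {
    String currentOffsetJson();                // Spark persists this blob
    void restoreFromOffsetJson(String json);   // Spark replays it on recovery
}

// Opt-out contract: a source with large state persists that state itself
// and returns only a small handle for Spark to record in its offset log.
interface SelfCheckpointingSource extends StreamSource {
    String persistState(String checkpointDir); // returns a handle/path
    void restoreFromHandle(String handle);
}

class EngineSketch {
    // The engine branches on the capability interface: most sources take
    // the default path, and only the exceptional ones use the hatch.
    static String checkpoint(StreamSource src, String checkpointDir) {
        if (src instanceof SelfCheckpointingSource) {
            return ((SelfCheckpointingSource) src).persistState(checkpointDir);
        }
        return src.currentOffsetJson();
    }
}
```

The design question in the thread is exactly whether the second interface should exist at all, or whether every source should be forced through the first path.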

Re: Datasource API V2 and checkpointing

2018-05-01 Thread Ryan Blue
I think there's a difference. You're right that we wanted to clean up the API in V2 to avoid file sources using side channels. But there's a big difference between adding, for example, a way to report partitioning and designing for sources that need unbounded state. It's a judgment call, but I think…

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
I'd argue that letting bad cases influence the design is an explicit goal of DataSourceV2. One of the primary motivations for the project was that file sources hook into a series of weird internal side channels, with favorable performance characteristics that are difficult to match in the API we actually…

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Should we really plan the API for a source with state that grows indefinitely? It sounds like we're letting a bad case influence the design, when we probably shouldn't. On Mon, Apr 30, 2018 at 11:05 AM, Joseph Torres <joseph.tor...@databricks.com> wrote: > Offset is just a type alias for arbitrary…

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Joseph Torres
Offset is just a type alias for arbitrary JSON-serializable state. Most implementations should (and do) just toss the blob at Spark and let Spark handle recovery on its own. In the case of file streams, the obstacle is that the conceptual offset is very large: a list of every file which the stream…
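The contrast Joseph is drawing can be made concrete with a small sketch. These classes are illustrative stand-ins, not Spark's real `Offset` implementations: a Kafka-style position serializes to a tiny, fixed-size JSON blob, while the natural "offset" of a file stream is the entire list of files seen so far, which grows without bound over the life of the query.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: two JSON-serializable "offsets" with very different
// growth behavior. Neither class is part of the actual Spark codebase.
class NumericOffset {
    final long position;
    NumericOffset(long position) { this.position = position; }
    // Constant-size blob: cheap for the engine to log on every batch.
    String json() { return "{\"position\":" + position + "}"; }
}

class FileListOffset {
    final List<String> filesSeen = new ArrayList<>();
    void add(String path) { filesSeen.add(path); }
    // The serialized state grows linearly with every file the stream has
    // ever read -- the unbounded-state problem discussed in the thread.
    String json() {
        StringBuilder sb = new StringBuilder("{\"files\":[");
        for (int i = 0; i < filesSeen.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(filesSeen.get(i)).append('"');
        }
        return sb.append("]}").toString();
    }
}
```

Logging the second kind of blob on every micro-batch is what makes the file source want to manage its own state rather than toss the whole thing at Spark.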

Re: Datasource API V2 and checkpointing

2018-04-30 Thread Ryan Blue
Why don't we just have the source return a Serializable of state when it reports offsets? Then Spark could handle storing the source's state and the source wouldn't need to worry about file system paths. I think that would be easier for implementations and better for recovery because it wouldn't le…
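Ryan's suggestion could be sketched like this, assuming plain Java serialization as the mechanism. All names here are hypothetical, not a real Spark interface: the source reports its offset together with an opaque `Serializable` state object, Spark stores both in the checkpoint, and the source never touches file system paths itself.

```java
import java.io.Serializable;

// Hypothetical sketch of "return a Serializable of state when reporting
// offsets": Spark, not the source, decides where the bytes live.
class OffsetWithState implements Serializable {
    final String offsetJson;     // small, human-readable stream position
    final Serializable state;    // opaque source state for Spark to persist
    OffsetWithState(String offsetJson, Serializable state) {
        this.offsetJson = offsetJson;
        this.state = state;
    }
}

interface StatefulSource {
    OffsetWithState reportOffset();          // engine checkpoints the result
    void recover(OffsetWithState snapshot);  // engine replays it on restart
}
```

Under this sketch, a file source's ever-growing file list would simply ride along inside `state`, so recovery never depends on paths the source wrote on its own.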

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Thanks, Joseph! From: Joseph Torres. Date: Friday, April 27, 2018 at 11:23 AM. To: "Thakrar, Jayesh". Cc: "dev@spark.apache.org". Subject: Re: Datasource API V2 and checkpointing. The precise interactions with the DataSourceV2 API haven't yet been hammered out in design…

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Joseph Torres
The precise interactions with the DataSourceV2 API haven't yet been hammered out in design. But much of this comes down to the core of Structured Streaming rather than the API details. The execution engine handles checkpointing and recovery. It asks the streaming data source for offsets, and then…
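The engine-side flow described here (ask the source for offsets, log them, then run the batch) might be sketched as the following micro-batch loop. The names are illustrative, not the actual StreamExecution internals, and the in-memory list stands in for the durable write-ahead offset log.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative micro-batch driver loop: the engine, not the source,
// owns the offset log, checkpointing, and recovery.
class MicroBatchLoopSketch {
    interface Source {
        String latestOffsetJson();                        // engine polls this
        void runBatch(String startOffset, String endOffset);
    }

    final List<String> offsetLog = new ArrayList<>();     // stand-in for the WAL

    void runOneBatch(Source source) {
        String end = source.latestOffsetJson();
        offsetLog.add(end);                               // 1. log the offset first
        String start = offsetLog.size() > 1
                ? offsetLog.get(offsetLog.size() - 2)
                : null;                                   // null = start of stream
        source.runBatch(start, end);                      // 2. then process the batch
    }

    // On restart, the engine replays from the last durably logged offset.
    String recoverLastOffset() {
        return offsetLog.isEmpty() ? null : offsetLog.get(offsetLog.size() - 1);
    }
}
```

Because the offset is logged before the batch runs, a crash mid-batch just re-runs the same offset range on restart, which is where the size and serializability of the offset blob matters.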

Re: Datasource API V2 and checkpointing

2018-04-27 Thread Thakrar, Jayesh
Wondering if this issue is related to SPARK-23323? Any pointers will be greatly appreciated. Thanks, Jayesh. From: "Thakrar, Jayesh". Date: Monday, April 23, 2018 at 9:49 PM. To: "dev@spark.apache.org". Subject: Datasource API V2 and checkpointing. I was wondering, when checkpointing is enabled, w…