Re: DataSourceWriter V2 Api questions

Reynold Xin Mon, 10 Sep 2018 10:26:17 -0700

I don't think the problem is just whether we have a starting point for a
write. As a matter of fact, there's always a starting point for a write,
whether it is explicit or implicit.


This is a pretty big challenge in general for data sources -- for the vast
majority of data stores, the boundary of a transaction is per client. That
is, you can't have two clients doing writes and coordinating a single
transaction. That's certainly the case for almost all relational databases.
Spark, on the other hand, will have multiple clients (consider each task a
client) writing to the same underlying data store.
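
To make the staging-table approach mentioned earlier in the thread concrete,
here is a minimal in-memory sketch. It is a toy model, not Spark or MongoDB
code: FakeStore, TaskWriter and DriverWriter are hypothetical names that only
mimic the roles of the data store, the per-task DataWriter and the
driver-side DataSourceWriter. Each task commits into a hidden staging area
and returns a message. Only the driver's commit makes the rows visible.

```python
# Toy model of the staging pattern. All names here are hypothetical and
# stand in for a real data store and the Spark writer interfaces.

class FakeStore:
    def __init__(self):
        self.visible = []   # rows that readers can see
        self.staging = {}   # task_id -> rows written but still hidden


class TaskWriter:
    """Plays the role of a per-task writer: one 'client' per task."""

    def __init__(self, store, task_id):
        self.store = store
        self.task_id = task_id
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)

    def commit(self):
        # Task-level commit: park the rows in staging (still hidden) and
        # return a message that the driver can act on later.
        self.store.staging[self.task_id] = list(self.buffer)
        return {"task_id": self.task_id, "rows": len(self.buffer)}

    def abort(self):
        # Drop anything this task buffered or staged.
        self.buffer.clear()
        self.store.staging.pop(self.task_id, None)


class DriverWriter:
    """Plays the role of the driver-side writer that does the final commit."""

    def __init__(self, store):
        self.store = store

    def commit(self, messages):
        # Job-level commit: collect every staged batch, then publish once.
        published = []
        for msg in messages:
            published.extend(self.store.staging.pop(msg["task_id"]))
        self.store.visible.extend(published)

    def abort(self, messages):
        # Job-level abort: discard staged batches, publish nothing.
        for msg in messages:
            self.store.staging.pop(msg["task_id"], None)


def run_job(store, rows_per_task):
    # Simulate tasks writing independently, then a single driver commit.
    writers = [TaskWriter(store, i) for i in range(len(rows_per_task))]
    for writer, rows in zip(writers, rows_per_task):
        for row in rows:
            writer.write(row)
    messages = [w.commit() for w in writers]
    hidden_before_driver_commit = list(store.visible)
    DriverWriter(store).commit(messages)
    return hidden_before_driver_commit
```

In a real store the driver's publish step would itself have to be atomic
(for example a rename, or one transaction owned by the driver alone). The
sketch assumes that and does not show it.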

On Mon, Sep 10, 2018 at 10:19 AM Ryan Blue <rb...@netflix.com> wrote:

> Ross, I think the intent is to create a single transaction on the driver,
> write as part of it in each task, and then commit the transaction once the
> tasks complete. Is that possible in your implementation?
>
> I think that part of this is made more difficult by not having a clear
> starting point for a write, which we are fixing in the redesign of the v2
> API. That will have a method that creates a Write to track the operation.
> That can create your transaction when it is created and commit the
> transaction when commit is called on it.
>
> rb
>
> On Mon, Sep 10, 2018 at 9:05 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Typically people do it via transactions, or staging tables.
>>
>>
>> On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley <ross.law...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I've been prototyping an implementation of the DataSource V2 writer for
>>> the MongoDB Spark Connector and I have a couple of questions about how it's
>>> intended to be used with database systems. According to the Javadoc for
>>> DataWriter.commit():
>>>
>>>
>>> *"this method should still "hide" the written data and ask the
>>> DataSourceWriter at driver side to do the final commit via
>>> WriterCommitMessage"*
>>>
>>> Although MongoDB now has transactions, it doesn't have a way to "hide"
>>> the data once it has been written. So as soon as the DataWriter has
>>> committed the data, it has been inserted/updated in the collection and is
>>> discoverable - thereby breaking the documented contract.
>>>
>>> I was wondering how other database systems plan to implement this API
>>> and meet the contract as per the Javadoc?
>>>
>>> Many thanks
>>>
>>> Ross
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
