I don't think the problem is just whether we have a starting point for a write. As a matter of fact, there's always a starting point for a write, whether it is explicit or implicit.

This is a pretty big challenge in general for data sources -- for the vast majority of data stores, the boundary of a transaction is per client. That is, you can't have two clients doing writes and coordinating a single transaction. That's certainly the case for almost all relational databases. Spark, on the other hand, will have multiple clients (consider each task a client) writing to the same underlying data store.

On Mon, Sep 10, 2018 at 10:19 AM Ryan Blue <rb...@netflix.com> wrote:

> Ross, I think the intent is to create a single transaction on the driver,
> write as part of it in each task, and then commit the transaction once the
> tasks complete. Is that possible in your implementation?
>
> I think that part of this is made more difficult by not having a clear
> starting point for a write, which we are fixing in the redesign of the v2
> API. That will have a method that creates a Write to track the operation.
> That can create your transaction when it is created and commit the
> transaction when commit is called on it.
>
> rb
>
> On Mon, Sep 10, 2018 at 9:05 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Typically people do it via transactions, or staging tables.
>>
>>
>> On Mon, Sep 10, 2018 at 2:07 AM Ross Lawley <ross.law...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I've been prototyping an implementation of the DataSource V2 writer for
>>> the MongoDB Spark Connector and I have a couple of questions about how it's
>>> intended to be used with database systems. According to the Javadoc for
>>> DataWriter.commit():
>>>
>>>
>>> *"this method should still "hide" the written data and ask the
>>> DataSourceWriter at driver side to do the final commit via
>>> WriterCommitMessage"*
>>>
>>> Although MongoDB now has transactions, it doesn't have a way to "hide"
>>> the data once it has been written. So as soon as the DataWriter has
>>> committed the data, it has been inserted/updated in the collection and is
>>> discoverable - thereby breaking the documented contract.
>>>
>>> I was wondering how other database systems plan to implement this API
>>> and meet the contract as per the Javadoc?
>>>
>>> Many thanks
>>>
>>> Ross
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
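
For what it's worth, the staging-table approach mentioned in the thread can be sketched roughly as below. This is only an illustration under assumptions: it uses SQLite from Python as a stand-in store, it is not the Spark DataSource V2 API or the MongoDB connector, and every name in it (`new_store`, `task_write`, `driver_commit`, `driver_abort`, `count_rows`, the `staging` and `target` tables) is hypothetical. Each "task" opens its own connection and commits its own local transaction into `staging`; only the "driver" moves rows into `target`, in one transaction, after all tasks have reported back.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def new_store():
    """Create a throwaway SQLite file with a staging table and a target table."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    with closing(sqlite3.connect(path)) as conn, conn:
        conn.execute(
            "CREATE TABLE staging (job_id TEXT, partition_id INTEGER, value TEXT)"
        )
        conn.execute("CREATE TABLE target (value TEXT)")
    return path


def task_write(path, job_id, partition_id, rows):
    """One task = one client: its own connection and its own local transaction.

    Rows go to `staging` only, so readers of `target` do not see them yet.
    The returned tuple is a stand-in for a commit message sent to the driver.
    """
    with closing(sqlite3.connect(path)) as conn, conn:
        conn.executemany(
            "INSERT INTO staging (job_id, partition_id, value) VALUES (?, ?, ?)",
            [(job_id, partition_id, value) for value in rows],
        )
    return (job_id, partition_id, len(rows))


def driver_commit(path, job_id, messages):
    """Driver-side commit: publish all staged rows for the job in one transaction."""
    expected = sum(count for (_, _, count) in messages)
    with closing(sqlite3.connect(path)) as conn, conn:
        staged = conn.execute(
            "SELECT COUNT(*) FROM staging WHERE job_id = ?", (job_id,)
        ).fetchone()[0]
        if staged != expected:
            # Raising inside the `with conn` block rolls the transaction back.
            raise RuntimeError(f"expected {expected} staged rows, found {staged}")
        conn.execute(
            "INSERT INTO target (value) SELECT value FROM staging WHERE job_id = ?",
            (job_id,),
        )
        conn.execute("DELETE FROM staging WHERE job_id = ?", (job_id,))


def driver_abort(path, job_id):
    """Driver-side abort: discard whatever the tasks staged for this job."""
    with closing(sqlite3.connect(path)) as conn, conn:
        conn.execute("DELETE FROM staging WHERE job_id = ?", (job_id,))


def count_rows(path, table):
    """Helper for inspecting the sketch; only the two known tables are allowed."""
    assert table in ("staging", "target")
    with closing(sqlite3.connect(path)) as conn:
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Note that the staged rows are hidden only by convention: anyone who queries `staging` directly can see them. Whether an equivalent staging collection is practical for MongoDB is a separate question that this sketch does not answer.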