Also, most databases provide a "full logging" option that lets you capture the whole row in the log (I know Oracle and MySQL have this), but it sounds like Mongo doesn't yet. That would be the ideal solution.

-Jay
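(For readers finding this thread later: on MySQL, the "full logging" option mentioned above is row-based binary logging. A minimal my.cnf sketch, assuming MySQL 5.6+ where binlog_row_image is available; on Oracle the rough equivalent is supplemental logging.)

    [mysqld]
    log-bin          = mysql-bin   # enable the binary log
    binlog_format    = ROW         # log full rows rather than statements
    binlog_row_image = FULL        # include every column in each row image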
On Fri, Jan 29, 2016 at 9:38 AM, Jay Kreps <j...@confluent.io> wrote:

> Ah, agreed. This approach is actually quite common in change capture, though. For many use cases getting the final value is actually preferable to getting intermediates. The exception is usually if you want to do analytics on something like the number of changes.
>
> On Fri, Jan 29, 2016 at 9:35 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> > Jay,
> >
> > You can query after the fact, but you're not necessarily going to get the same value back. There could easily be dozens of changes to the document in the oplog, so the delta you see may not even make sense given the current state of the document. Even if you can apply the delta, you'd still be seeing data that is newer than the update. You can of course take this shortcut, but it won't give correct results. And if the data has been deleted since then, you won't even be able to write the full record... As far as I know, the way the op log is exposed won't let you do something like pin a query to the state of the db at a specific point in the op log, and you may be reading from the beginning of the op log, so I don't think there's a way to get correct results by just querying the DB for the full documents.
> >
> > Strictly speaking you don't need to get all the data in memory, you just need a record of the current set of values somewhere. This is what I was describing following those two options -- if you do an initial dump to Kafka, you could track only offsets in memory and read back full values as needed to apply deltas, but this of course requires random reads into your Kafka topic (though it may perform fine in practice depending on the workload).
> >
> > -Ewen
> >
> > On Fri, Jan 29, 2016 at 9:12 AM, Jay Kreps <j...@confluent.io> wrote:
> >
> > > Hey Ewen, how come you need to get it all in memory for approach (1)? I guess the obvious thing to do would just be to query for the record after-image when you get the diff -- e.g. just read a batch of changes and multi-get the final values. I don't know how bad the overhead of this would be... batching might reduce it a fair amount. The guarantees for this are slightly different than the pure oplog too (you get the current value, not necessarily every intermediate), but that should be okay for most uses.
> > >
> > > -Jay
> > >
> > > On Fri, Jan 29, 2016 at 8:54 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> > >
> > > > Sunny,
> > > >
> > > > As I said on Twitter, I'm stoked to hear you're working on a Mongo connector! It struck me as a pretty natural source to tackle since it does such a nice job of cleanly exposing the op log.
> > > >
> > > > Regarding the problem of only getting deltas, unfortunately there is no trivial solution here -- if you want to generate the full updated record, you're going to have to have a way to recover the original document.
> > > >
> > > > In fact, I'm curious how you were thinking of even bootstrapping. Are you going to do a full dump and then start reading the op log? Is there a good way to do the dump and figure out the exact location in the op log that the query generating the dump was initially performed? I know that internally mongo effectively does these two steps, but I'm not sure if the necessary info is exposed via normal queries.
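(To make the two-step bootstrap Ewen describes concrete, here is a rough, untested sketch against the MongoDB Java driver. The connection string and the mydb/mycoll names are placeholders, and reading local.oplog.rs assumes a replica set.)

    import com.mongodb.CursorType;
    import com.mongodb.client.FindIterable;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.BsonTimestamp;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.gt;

    public class OplogBootstrapSketch {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> oplog =
                client.getDatabase("local").getCollection("oplog.rs");

            // 1. Record the newest oplog position *before* dumping, so writes
            //    that race with the dump get replayed afterwards.
            Document newest = oplog.find()
                .sort(new Document("$natural", -1)).first();
            BsonTimestamp startTs = (BsonTimestamp) newest.get("ts");

            // 2. Full dump: emit every document as a snapshot record.
            MongoCollection<Document> coll =
                client.getDatabase("mydb").getCollection("mycoll");
            for (Document doc : coll.find()) {
                // produce a snapshot record to Kafka here
            }

            // 3. Tail the oplog from the recorded position. Ops already
            //    reflected in the dump are re-applied, which is safe as long
            //    as downstream treats records as upserts (at-least-once).
            FindIterable<Document> tail = oplog.find(gt("ts", startTs))
                .cursorType(CursorType.TailableAwait);
            for (Document op : tail) {
                // produce a delta record to Kafka here
            }
        }
    }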
> > > > If you want to reconstitute the data, I can think of a couple of options:
> > > >
> > > > 1. Try to reconstitute inline in the connector. This seems difficult to make work in practice. At some point you basically have to query for the entire data set to bring it into memory; the connector is then effectively just applying the deltas to its in-memory copy, generating one output record containing the full document each time it applies an update.
> > > > 2. Make the connector send just the updates and have a separate stream processing job perform the reconstitution and send to another topic. In this case, the first topic should not be compacted, but the second one could be. (A sketch of this reconstitution step appears at the end of the thread.)
> > > >
> > > > Unfortunately, without additional hooks into the database, there's not much you can do besides this pretty heavyweight process. There may be some tricks you can use to reduce the amount of memory used during the process (e.g. keep a small cache of actual records and for the rest only store Kafka offsets for the last full value, performing a (possibly expensive) random read as necessary to get the full document value back), but to get full correctness you will need to perform this process.
> > > >
> > > > In terms of Kafka Connect supporting something like this, I'm not sure how general it could be made, or that you even want to perform the process inline with the Kafka Connect job. If it's an issue that repeatedly arises across a variety of systems, then we should consider how to address it more generally.
> > > >
> > > > -Ewen
> > > >
> > > > On Tue, Jan 26, 2016 at 8:43 PM, Sunny Shah <su...@tinyowl.co.in> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are trying to write a Kafka Connect connector for MongoDB. The issue is, MongoDB does not provide the entire changed document for update operations; it just provides the modified fields.
> > > > >
> > > > > If Kafka allowed custom log compaction, it would be possible to eventually merge an entire document and subsequent updates to create an entire record again.
> > > > >
> > > > > As Ewen pointed out to me on Twitter, this is not possible, so what is the Kafka Connect way of solving this issue?
> > > > >
> > > > > @Ewen, thanks a lot for a really quick answer on Twitter.
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Sunny
> > > >
> > > > --
> > > > Thanks,
> > > > Ewen
> >
> > --
> > Thanks,
> > Ewen
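(To make option 2 above concrete, here is a minimal, untested sketch of just the delta-application step, in plain Java rather than a real stream processing framework. It glosses over the actual oplog entry layout, where the update document lives under "o" and the document id under "o2", and it keeps state in an unbounded in-memory map; a real job would use a persistent state store.)

    import java.util.HashMap;
    import java.util.Map;
    import org.bson.Document;

    public class ReconstituteSketch {
        // Latest full value per document _id. In a real job this would be
        // a persistent state store, not an unbounded in-memory map.
        private final Map<String, Document> current = new HashMap<>();

        // Apply one delta and return the new full document to produce to
        // the compacted output topic, or null if it can't be resolved yet.
        public Document apply(String id, Document op) {
            Document setFields = (Document) op.get("$set");
            if (setFields == null) {
                // Full-document op (insert/replace): take it as-is.
                current.put(id, op);
                return op;
            }
            Document full = current.get(id);
            if (full == null) {
                // No base document yet, e.g. the initial dump hasn't
                // covered this _id. A real job would re-read it from
                // Kafka or from Mongo before applying the delta.
                return null;
            }
            full.putAll(setFields); // merge the modified fields
            return full;
        }
    }

(In the real version of option 2, apply() would be the aggregation step of the stream processing job, and the output topic would be compacted so only the latest full document per _id is retained.)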