Hi Bart,

Your question is really "is Kafka reliable against failures?" As for the reliability of the changelog, Samza is designed to be as reliable as the underlying messaging layer allows. In the case of Kafka, there are configurations in the Kafka producer that users can tune to make sure there is no data loss. One example from the Kafka documentation:

min.insync.replicas and request.required.acks allow you to enforce greater durability guarantees. A typical scenario would be to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with request.required.acks of -1. This ensures that the producer raises an exception if a majority of replicas do not receive a write.

Of course, depending on the failure model, the guarantees from this configuration may still not be enough to cover, for example, a whole cluster crashing. But this is the typical tradeoff between performance and reliability in configuration (i.e. the more replicas and acks you require, the lower the write throughput you may see).
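To make the durability behavior concrete, here is a toy model of how min.insync.replicas interacts with the acks setting. This is purely an illustration, not Kafka's actual broker code; `produce_succeeds` is a hypothetical helper that mimics the documented decision: with acks=-1, a write is only acknowledged while the in-sync replica set is at least min.insync.replicas, otherwise the producer sees a NotEnoughReplicas-style error instead of a silent potential loss.

```python
# Toy model (NOT Kafka source code) of the broker-side durability check
# described in the Kafka documentation for acks / min.insync.replicas.

def produce_succeeds(replication_factor, min_insync_replicas,
                     live_in_sync_replicas, acks):
    """Return True if a write would be acknowledged, False if the
    producer would get a NotEnoughReplicas-style exception."""
    if acks == 0:   # fire-and-forget: no durability guarantee at all
        return True
    if acks == 1:   # only the leader has to persist the write
        return live_in_sync_replicas >= 1
    # acks=-1 ("all"): the write must reach every current in-sync
    # replica, and the ISR must not have shrunk below min.insync.replicas.
    return live_in_sync_replicas >= min_insync_replicas

# The scenario from the documentation: RF=3, min.insync.replicas=2, acks=-1.
print(produce_succeeds(3, 2, 3, -1))  # True: all replicas healthy
print(produce_succeeds(3, 2, 2, -1))  # True: one replica down, ISR still >= 2
print(produce_succeeds(3, 2, 1, -1))  # False: rejected rather than risk loss
```

The third case is exactly the tradeoff mentioned above: the stricter the settings, the more often writes are refused (or slowed) during failures, in exchange for never acknowledging a message that only a single replica holds.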
More detailed configuration options can be found at http://kafka.apache.org/documentation.html#configuration.

Cheers,

-Yi

On Mon, Dec 7, 2015 at 3:15 AM, Bart De Vylder <bartdevyl...@gmail.com> wrote:

> Hi all,
>
> I'm rather new to Samza and trying some things out using Kafka as the
> message broker. One use case I was interested in, which is mentioned in the
> documentation, is creating a table-stream join using bootstrap streams.
>
> I'm interested in some recommendations/thoughts concerning the changelog
> and database possibly going out of sync.
>
> Suppose I have my database push a changelog to Kafka for every
> insert/update/delete and then have a Samza job consume this stream as a
> bootstrap stream (+ maybe some other data stream).
>
> The only info about the database this job will ever see is by reading the
> Kafka stream containing the changelog (maybe compacted by Kafka based on
> key). So losing any of these changelog messages is not an option, as
> then this job's view of the database will be wrong forever. Does this imply
> that Kafka needs to be forced into fsyncing every new message for this
> changelog topic? Or would it be better to still provide a complete
> recreation of the changelog stream based on the current contents of the
> database in case of disaster (all Kafka nodes losing power at the same
> time)? Or would it be better to recreate the database based on the
> changelog (still some data loss, but at least the database and the changelog
> are in sync)?
>
> Any thoughts/experiences/references are much appreciated.
>
> Regards,
> Bart
>
>
> --
> Bart De Vylder
> +32(0)496/558065
> bartdevyl...@gmail.com