Hi,

We're considering Kafka to provide a queued transport mechanism for an ETL 
process with near-real-time capability.  Kafka is looking pretty good but I'm 
wondering about a couple of things.

I'm not sure this is the right approach, but my first inclination is to
co-locate a broker on each client server, so that the client has a local queue
for getting data from its database into Kafka. That would let data back up on
the client when the external network or the warehouse server is unavailable,
without holding off the producer. Broker replication would then duplicate the
queue on the warehouse server, where the consumer can process the data for
storage in the warehouse database. So each client database server would get
its own Kafka partition with two replicas, one on a broker residing on the
client and one on a broker on the warehouse.

Without this, a network outage or performance hit that keeps the producer from
reaching a broker could slow it to the point where it can't keep up, and we'd
need to implement a separate storage queue to solve that problem as well. Some
amount of database-to-producer queueing may still be required, but I was
hoping to keep it short and rely on the local broker to minimize the problem.
In short, the idea is to shorten the path for getting data into Kafka by
providing a local broker, and let its replication abilities take over from
there.
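
For illustration only, a short database-to-producer queue of the kind mentioned
above could be sketched in Python as a small file-backed spool. Everything here
is an assumption made for the example: LocalSpool, its put and drain methods,
and the send callable are hypothetical names, not part of Kafka or of any Kafka
client library. The send callable stands in for whatever actually publishes a
record to a broker and is assumed to raise ConnectionError when the broker is
unreachable.

```python
import json
import os


class LocalSpool:
    """Append-only file spool that buffers records until a send succeeds.

    Hypothetical helper for illustration; not part of Kafka.
    """

    def __init__(self, path):
        self.path = path
        # Create the spool file if it does not exist yet.
        open(self.path, "a").close()

    def put(self, record):
        # Persist the record locally so the database side never waits
        # on the network.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def drain(self, send):
        """Send spooled records in order and keep whatever was not sent.

        `send` is any callable that raises ConnectionError on failure.
        Returns the number of records delivered in this attempt.
        """
        with open(self.path) as f:
            pending = [json.loads(line) for line in f if line.strip()]
        sent = 0
        try:
            for record in pending:
                send(record)
                sent += 1
        except ConnectionError:
            # Broker unreachable: stop here and retry the rest later.
            pass
        # Rewrite the spool with only the undelivered records, using a
        # temporary file so the swap is a single rename.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for record in pending[sent:]:
                f.write(json.dumps(record) + "\n")
        os.replace(tmp, self.path)
        return sent
```

In this sketch the database side only ever calls put, and a separate loop
calls drain periodically, so a broker outage costs local disk space rather than
blocking the source. It does not handle concurrent writers or a crash between
a successful send and the rewrite, which could cause a record to be sent twice.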

Does it make sense to think about it this way? I realize the client broker
could then persist a lot of data and eat up disk space, but that's the nature
of the problem if the source database is producing a lot of transactions that
need to be stored somewhere.

Also, are Kafka's published latency figures based on single-broker throughput,
so that broker-to-broker replication across networks is not taken into account?
What effect does broker-to-broker replication over a network have on latency?
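
One rough way to frame the replication question is a toy model in which a
write is acknowledged either as soon as the leader has appended it, or only
after every follower has it too. The Python sketch below is a simplification
under stated assumptions, not a measurement or a description of Kafka
internals: the function name, its parameters, and the millisecond figures in
the usage note are all hypothetical.

```python
def acked_write_latency_ms(leader_append_ms, followers, wait_for_replicas):
    """Estimate the time until a producer sees a write acknowledged.

    followers: list of (network_round_trip_ms, follower_append_ms) tuples,
    one per follower replica. All values are hypothetical inputs.
    """
    if not wait_for_replicas or not followers:
        # Only the leader's local append sits on the acknowledged path.
        return leader_append_ms
    # Waiting for replicas means waiting for the slowest follower,
    # including its network round trip.
    slowest = max(rtt + append for rtt, append in followers)
    return leader_append_ms + slowest
```

With a hypothetical 2 ms local append and a single follower 40 ms away, the
model gives 2 ms when only the leader acknowledges and 44 ms when the follower
must also have the data, which is why the cross-network hop matters for the
question above.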

I'm also wondering whether a queue can be cleared "on demand." I know you can
configure the persistence based on time or size, but I'm wondering if the
consumer could trigger the removal of data as the messages are processed.
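
To make the time- and size-based persistence concrete, here is a toy Python
model of an append-only log whose cleanup depends only on age and count, never
on what a consumer has read. It is a sketch under assumptions, not Kafka's
implementation: RetainedLog, max_age_s, max_records, and enforce_retention are
hypothetical names, and a record count stands in for a size limit in bytes.

```python
from collections import deque


class RetainedLog:
    """Toy append-only log with time- and size-based retention."""

    def __init__(self, max_age_s, max_records):
        self.max_age_s = max_age_s
        self.max_records = max_records
        self.base_offset = 0      # offset of the oldest retained record
        self.records = deque()    # (timestamp, value) pairs, oldest first

    def append(self, value, now):
        """Append a record and return its offset."""
        self.records.append((now, value))
        return self.base_offset + len(self.records) - 1

    def read(self, offset):
        """Return values from `offset` onward; reading never deletes."""
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in list(self.records)[start:]]

    def enforce_retention(self, now):
        """Drop the oldest records that exceed the count or age limit."""
        while self.records and (
            len(self.records) > self.max_records
            or now - self.records[0][0] > self.max_age_s
        ):
            self.records.popleft()
            self.base_offset += 1
```

In this model a consumer only remembers the offset it has reached, so reading
the same range twice returns the same data until enforce_retention removes it.
Removal driven by the consumer would need an extra operation that this sketch
deliberately leaves out.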


--

Keith Doyle
Greenway Health

