While I agree with Mark that testing the end-to-end pipeline is critical, note that in terms of performance, whatever you write to hook up Teradata to Kafka is unlikely to be as fast as the Teradata connector for Sqoop (especially the newer one). Quite a lot of optimization by Teradata engineers went into that connector.
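To make that concrete, a hand-rolled hookup usually ends up looking something like the sketch below: one JDBC cursor feeding one Kafka producer. The broker address, connection string, table, topic and credentials are all made-up placeholders, and it's deliberately naive (single-threaded, no batching, retries or checkpointing), which is exactly the gap the connector's parallel, tuned export closes for you.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TeradataToKafka {
    public static void main(String[] args) throws Exception {
        // Teradata JDBC driver; host, database and credentials below are placeholders.
        Class.forName("com.teradata.jdbc.TeraDriver");

        // Minimal producer config; the broker address is a placeholder.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:teradata://td-host/DATABASE=dw", "etl_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, payload FROM big_table")) {

            // One message per row, keyed by the primary key column.
            while (rs.next()) {
                producer.send(new ProducerRecord<>("dw.big_table",
                        rs.getString("id"), rs.getString("payload")));
            }
            // try-with-resources closes the producer last, which waits for
            // outstanding sends to complete.
        }
    }
}

Even this toy version needs the Teradata JDBC driver and the Kafka producer jars on the classpath, and you'd still have to add parallelism, error handling and schema handling before it came anywhere near what the connector gives you out of the box.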
Actually, unless you need very low latency (seconds to a few minutes), or consumers other than Hadoop, I'd go with Sqoop incremental jobs (there's a rough sketch of one at the bottom of this mail) and leave Kafka out of the equation completely. This will save you quite a bit of work on connecting Teradata to Kafka, if it fits your use case.

Gwen

On Thu, Oct 23, 2014 at 9:48 AM, Mark Roberts <wiz...@gmail.com> wrote:

> If you use Kafka for the first bulk load, you will test your new
> Teradata->Kafka->Hive pipeline, as well as have the ability to blow away
> the data in Hive and reflow it from Kafka without an expensive full
> re-export from Teradata. As for whether Kafka can handle hundreds of GB of
> data: Yes, absolutely.
>
> -Mark
>
>
> On Thu, Oct 23, 2014 at 3:08 AM, Po Cheung <poche...@yahoo.com.invalid>
> wrote:
>
>> Hello,
>>
>> We are planning to set up a data pipeline and send periodic, incremental
>> updates from Teradata to Hadoop via Kafka. For a large DW table with
>> hundreds of GB of data, is it okay (in terms of performance) to use Kafka
>> for the initial bulk data load? Or will Sqoop with Teradata connector be
>> more appropriate?
>>
>>
>> Thanks,
>> Po
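P.S. Since I mentioned Sqoop incremental jobs, here is roughly what one looks like. This is only a sketch: the connect string, table, check column, paths and mapper count are placeholders, and how the dedicated Teradata connector gets picked up (and which extra options it wants) depends on the connector package and version, so check its docs.

# Create a saved incremental-append job; Sqoop remembers the last value
# of the check column between runs.
sqoop job --create td_orders_incr -- import \
  --connect jdbc:teradata://td-host/DATABASE=dw \
  --username etl_user -P \
  --table ORDERS \
  --split-by ORDER_ID \
  --num-mappers 8 \
  --incremental append \
  --check-column ORDER_ID \
  --last-value 0 \
  --target-dir /data/dw/orders

# Each later run only pulls rows with ORDER_ID above the stored last value.
sqoop job --exec td_orders_incr

The first --exec does the expensive initial load; after that each run is just the delta, which is why I'd skip Kafka here unless you need the lower latency or the other consumers.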