Yes, the sink seems like the right place to put the CSV-to-S3 code. Don't mess with the channel code unless you know what you're doing. That said, since you're doing DB lookups, I'd imagine that would slow down the whole channel, depending on the source data rate. I'd suggest taking a look at how interceptors work and/or at the Morphlines SDK ( http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/ ).
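As a rough sketch of what the interceptor route could look like, an agent config along these lines would attach an interceptor to the source, so events are transformed before they reach the file channel. The agent and component names, the directories, and both com.example classes are placeholders I made up for illustration; you would have to write the interceptor and sink classes yourself.

```properties
# Hypothetical agent "a1": spooling-directory source -> file channel -> custom sink
a1.sources = jsonSrc
a1.channels = fileCh
a1.sinks = s3Sink

# Source only reads the JSON files (the directory path is a placeholder)
a1.sources.jsonSrc.type = spooldir
a1.sources.jsonSrc.spoolDir = /data/incoming/json
a1.sources.jsonSrc.channels = fileCh

# Interceptor does the lookup and JSON-to-CSV work (hypothetical class)
a1.sources.jsonSrc.interceptors = enrich
a1.sources.jsonSrc.interceptors.enrich.type = com.example.flume.JsonToCsvInterceptor$Builder

# File channel for fault tolerance (the directories are placeholders)
a1.channels.fileCh.type = file
a1.channels.fileCh.checkpointDir = /var/flume/checkpoint
a1.channels.fileCh.dataDirs = /var/flume/data

# Sink only writes the CSV output to S3 (hypothetical class)
a1.sinks.s3Sink.type = com.example.flume.S3CsvSink
a1.sinks.s3Sink.channel = fileCh
```

Note that an interceptor runs as part of the source's put into the channel, so a slow lookup there would still limit how fast the source can ingest; caching the lookups may be worth considering.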
Keep the source for only reading files and the sink for only writing files. Everything else goes in the interceptor/morphline.

--
Sharninder

On Fri, Sep 5, 2014 at 8:23 AM, Kevin Warner <kevinwarner7...@gmail.com> wrote:

> Hello All,
> We have the following configuration:
> Source->Channel->Sink
>
> Now, the source is pointing to a folder that has lots of JSON files. The
> channel is file based so that there is fault tolerance, and the Sink is
> putting CSV files on S3.
>
> Now, there is code written in the Sink that takes the JSON events, does
> some MySQL database lookups, and generates CSV files to be put into S3.
>
> The question is: is the Sink the right place for the code, or should the
> code be running in the channel, as the ACID guarantees are present in the
> channel? Please advise.
>
> -Kev