Hi I think you wrote to the wrong mailing list. This is about Apache Camel, not Apache Storm.
On Sun, Jul 3, 2016 at 9:15 PM, Sherwin Pinto <[email protected]> wrote: > Hi, > > I am using a Trident topology to process files, transform them from CSV, EDI, > XML to a general JSON format. I have a working prototype, but wanted to make > sure I am implementing this correctly. Here is the flow > > 1. Read message from kafka, this message is meta data of the file location on > S3 > 2. Next a function/bolt streams and transform the file from S3, emitting > records one at a time. > 3. Final step is a partitionPersist to kafka > > Here’s the topology > > TridentState kafkaState=topology.newStream("tracking-file-processor", > FileNameSpout.opaqueKafkaSpout(zkHosts,topicName)) //parallelism should be > number of partitions of topic > .parallelismHint(1) > .each(new Fields("str"),new S3Reader(), new > Fields("tracking_num", "json", "record_type")) > .shuffle() > .partitionPersist(stateFactory, new > Fields("tracking_num", "json", "record_type"), new TridentKafkaUpdater(), new > Fields()) > // .parallelismHint(10) > ; > > Questions > > 1. Is this the correct approach ? > 2. The files are of varying sizes and could be close to 500Mb, the S3Reader > function will emit one record (of the file) at a time, trident will batch > them before doing the partitionPersist, so basically the entire file would be > in memory ? While processing multiple files the memory requirement will > increase ? Do i just parallelize and spread partitions over multiple workers > or is there a better way ? > 3. This also means that the batch being written to kafka can vary in size and > maybe quite large, is this acceptable ? > 4. If i do need to write to a data source other than kafka, such as a regular > db (most likely will be kafka but just want to gain some more knowledge) what > would be the best way to do this ? > > Hoping the community can help > > Thanks > > Sherwin > > > -- Claus Ibsen ----------------- http://davsclaus.com @davsclaus Camel in Action 2: https://www.manning.com/ibsen2
