Hi Bharath,

Thank you very much; this is extremely helpful! I had not originally
contemplated using Beam, but I will definitely look into it following your
recommendation.

Based on your response, and after further reading of the source code, I
realized that my original understanding of the samza-elasticsearch
system/connector, and of Samza "systems/connectors" in general, was
incorrect. The Elasticsearch system, unlike the Kafka or HDFS systems, has
no input or output descriptors; and because it maps the stream name to an
Elasticsearch index and doc_type, I incorrectly assumed that Samza
internally used Kafka topics as the sources for the Elasticsearch system,
so that writes to Elasticsearch could be buffered through Kafka. I now
understand that the Elasticsearch system is simply a lower-level API that
jobs can call as a message-stream sink, which means I need to implement a
generic job that consumes one or more Kafka topics and uses that API to
write to Elasticsearch.
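To check my understanding, here is a minimal sketch of the generic job I
have in mind, using Samza's low-level task API (the "es" system name and
the "my_index/doc" stream are illustrative; the Kafka topics to consume
would be listed in task.inputs in the job config):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Generic pass-through task: forwards every message from the
    // configured Kafka input topics to the Elasticsearch system, whose
    // stream name maps to index/doc_type.
    public class KafkaToElasticsearchTask implements StreamTask {
      private static final SystemStream ES_STREAM =
          new SystemStream("es", "my_index/doc");

      @Override
      public void process(IncomingMessageEnvelope envelope,
          MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(
            new OutgoingMessageEnvelope(ES_STREAM, envelope.getMessage()));
      }
    }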
Regards,

Eric

On Tue, Oct 22, 2019 at 9:31 AM Bharath Kumara Subramanian <
codin.mart...@gmail.com> wrote:

> Hi Eric,
>
> Thanks for the additional clarification. I now have a better
> understanding of your use case. It can be accomplished using various
> approaches.
>
>    1. One Samza application that consumes all of your input streams (A
>    and D) as well as your intermediate output stream (B), and routes them
>    to HDFS or Elasticsearch. With this approach,
>       1. You may have to scale your entire job up or down to react to
>       changes in any of your inputs (A, D, and B), which can result in
>       underutilization of resources.
>       2. A failure in one component could impact the others. For
>       example, there may be other writers to output stream B that you
>       don't want to disrupt because of a bug in the transformation logic
>       for input stream A.
>       3. Along the same lines as 2, an HDFS or Elasticsearch outage
>       could create back pressure that causes output stream B to grow,
>       which might increase the time to catch up and hurt your job's
>       performance (this can happen, for example, if you have priorities
>       set up across your streams).
>    2. Another approach is to split the work into multiple jobs: one that
>    consumes the input sources (A and D) and routes them to the
>    appropriate destinations (B and C), and another that consumes the
>    output of the previous job (B) and routes it to HDFS or Elasticsearch.
>    With this approach,
>       1. You isolate the common/shared logic that writes to HDFS or
>       Elasticsearch into its own job, which you can manage independently,
>       including scaling it up or down.
>       2. During HDFS/Elasticsearch outages, the other components are
>       unaffected; the back pressure causes stream B to grow, which Kafka
>       handles well.
>
> Our recommended API for writing Samza applications is Apache Beam
> <https://beam.apache.org/>. Examples of how to write a sample application
> using the Beam API and run it with Samza can be found at:
> https://github.com/apache/samza-beam-examples
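>
> For instance, a minimal Beam pipeline for your "stream B ->
> Elasticsearch" leg might look roughly like the sketch below (the broker
> address, topic, index, and type names are placeholders; the
> samza-beam-examples repo shows the full SamzaRunner setup):
>
>     import org.apache.beam.sdk.Pipeline;
>     import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
>     import org.apache.beam.sdk.io.kafka.KafkaIO;
>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>     import org.apache.beam.sdk.transforms.Values;
>     import org.apache.kafka.common.serialization.StringDeserializer;
>
>     public class KafkaToEsPipeline {
>       public static void main(String[] args) {
>         // Pass --runner=SamzaRunner to execute this pipeline on Samza.
>         Pipeline p = Pipeline.create(
>             PipelineOptionsFactory.fromArgs(args).create());
>
>         p.apply(KafkaIO.<String, String>read()
>                 .withBootstrapServers("localhost:9092") // placeholder
>                 .withTopic("topic-b")                   // placeholder
>                 .withKeyDeserializer(StringDeserializer.class)
>                 .withValueDeserializer(StringDeserializer.class)
>                 .withoutMetadata())
>          .apply(Values.create()) // keep only the JSON document payload
>          .apply(ElasticsearchIO.write().withConnectionConfiguration(
>              ElasticsearchIO.ConnectionConfiguration.create(
>                  new String[] {"http://localhost:9200"}, // placeholder
>                  "my_index", "doc")));
>
>         p.run().waitUntilFinish();
>       }
>     }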
>
> Please reach out to us if you have more questions.
>
> Thanks,
> Bharath
>
> On Wed, Oct 16, 2019 at 7:31 AM Eric Shieh <datosl...@gmail.com> wrote:
>
> > Thank you, Bharath. Regarding my 2nd question, perhaps the following
> > scenario can help illustrate what I am looking to achieve:
> >
> > Input stream A -> Job 1 -> Output stream B (Kafka Topic B)
> > Input stream A -> Job 2 -> Output stream C
> > Input stream D -> Job 3 -> Output stream B (Kafka Topic B)
> > Input stream B (Kafka Topic B) -> Elasticsearch (or write to HDFS)
> >
> > In the case of "Input stream B (Kafka Topic B) -> Elasticsearch (or
> > write to HDFS)", this is what I was referring to as a "common/shared
> > system service": it has no transformation logic, only a message sink
> > to either Elasticsearch or HDFS using Samza's systems/connectors. In
> > other words, Job 1 and Job 3 both write to output stream B expecting
> > the messages to be persisted in Elasticsearch or HDFS. Would I need to
> > specify the system/connector configuration separately in Job 1 and
> > Job 3? Is there a way to have "Input stream B (Kafka Topic B) ->
> > Elasticsearch (or write to HDFS)" as its own stand-alone job, so that
> > I can have the following:
> >
> > RESTful web services (or other non-Samza services/applications) as a
> > Kafka producer -> Input stream B (Kafka Topic B) -> Elasticsearch (or
> > write to HDFS)
> >
> > Regards,
> >
> > Eric
> >
> > On Mon, Oct 14, 2019 at 8:35 PM Bharath Kumara Subramanian <
> > codin.mart...@gmail.com> wrote:
> >
> > > Hi Eric,
> > >
> > > Answers to your questions are as follows.
> > >
> > > *Can I, or is it recommended to, package multiple jobs as 1
> > > deployment with 1 properties file, or keep each app separate? Based
> > > on the documentation, it appears to support 1 app/job within a
> > > single configuration, as there is no mechanism to assign multiple
> > > app classes and give each a name, unless I am mistaken.*
> > >
> > > *app.class* is a single-valued configuration, and your understanding
> > > of it based on the documentation is correct.
> > >
> > > *If only 1 app per config + deployment, what is the best way to
> > > handle requirement #3 - common/shared system services, as there is
> > > no app or job per se; I just need to specify the streams and output
> > > system (i.e. org.apache.samza.system.hdfs.writer...)*
> > >
> > > There are a couple of options to achieve your requirement #3.
> > >
> > >    1. If there is enough commonality between your jobs, you could
> > >    have one application class that describes the logic and use
> > >    different configurations to modify the behavior of the
> > >    application logic. This does come with some of the following
> > >    considerations:
> > >       1. Your deployment system needs to support deploying the same
> > >       application with different configs.
> > >       2. Potential duplication of configuration if your
> > >       configuration system doesn't support hierarchies and
> > >       overrides.
> > >       3. Potentially unmanageable for evolution, since a change in
> > >       the application affects multiple jobs and requires extensive
> > >       testing across different configurations.
> > >    2. You could have libraries that perform pieces of the business
> > >    logic and have your different jobs leverage them using
> > >    composition (see the sketch after this list). Some things to
> > >    consider with this option:
> > >       1. Your application and configuration stay isolated.
> > >       2. You could still leverage some of the common configurations
> > >       if your configuration system supports hierarchies and
> > >       overrides.
> > >       3. Alleviates concerns about evolution and testing, as long
> > >       as the changes are application specific.
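> > >
> > > As a rough illustration of option 2 (the class and method names here
> > > are made up), a shared library could expose the common logic as an
> > > operation on a MessageStream, and each job would compose it into its
> > > own pipeline:
> > >
> > >     import org.apache.samza.operators.MessageStream;
> > >
> > >     // Shared library jar, versioned and tested independently of
> > >     // the jobs that depend on it.
> > >     public final class SharedEnrichment {
> > >       private SharedEnrichment() {}
> > >
> > >       // Common business logic applied by every job that needs it.
> > >       public static MessageStream<String> enrich(
> > >           MessageStream<String> input) {
> > >         return input
> > >             .filter(msg -> msg != null && !msg.isEmpty())
> > >             .map(String::trim); // placeholder transformation
> > >       }
> > >     }
> > >
> > > A job then calls SharedEnrichment.enrich(...) on the stream it
> > > builds from its own input descriptor, so each job keeps its own app
> > > class and config while reusing the library.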
> > >
> > > I am still unclear about the second part of your 2nd question. Do
> > > you mean that all your jobs consume from the same sources and write
> > > to the same destinations, and only your processing logic is
> > > different?
> > >
> > > *common/shared system services as there is no app or job per se; I
> > > just need to specify the streams and output system*
> > >
> > > Also, I am not sure I follow what you mean by "*there is no app or
> > > job*". You still have 1 app per config + deployment, right?
> > >
> > > Thanks,
> > > Bharath
> > >
> > > On Thu, Oct 10, 2019 at 9:46 AM Eric Shieh <datosl...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am new to Samza and am evaluating it as the backbone for my
> > > > streaming CEP requirement. I have:
> > > >
> > > > 1. Multiple data enrichment and ETL jobs.
> > > > 2. Multiple domain-specific CEP rulesets.
> > > > 3. Common/shared system services, such as consuming topics/streams
> > > > and persisting the messages in Elasticsearch and HDFS.
> > > >
> > > > My questions are:
> > > >
> > > > 1. Can I, or is it recommended to, package multiple jobs as 1
> > > > deployment with 1 properties file, or should I keep each app
> > > > separate? Based on the documentation, it appears that only 1
> > > > app/job is supported within a single configuration, as there is no
> > > > mechanism to assign multiple app classes and give each a name,
> > > > unless I am mistaken.
> > > > 2. If only 1 app per config + deployment, what is the best way to
> > > > handle requirement #3 - common/shared system services? There is no
> > > > app or job per se; I just need to specify the streams and the
> > > > output system (i.e.
> > > > org.apache.samza.system.hdfs.writer.BinarySequenceFileHdfsWriter or
> > > > org.apache.samza.system.elasticsearch.indexrequest.DefaultIndexRequestFactory).
> > > > Given that it is a common shared system service not tied to
> > > > specific jobs, can it be deployed without an app?
> > > >
> > > > Thank you in advance for your help; I am looking forward to
> > > > learning more about Samza and developing this critical feature
> > > > with it!
> > > >
> > > > Regards,
> > > >
> > > > Eric