Hi Bharath,

Thank you very much; this is extremely helpful! I had not originally
contemplated using Beam, but I will definitely look into it following your
recommendation.

Based on your response, and after further reading of the source code, I
realized that my original understanding of the samza-elasticsearch
system/connector, and of Samza "systems/connectors" in general, was
incorrect. The Elasticsearch system, unlike the Kafka or HDFS systems, has
no input or output descriptors; and because it maps the stream name to an
Elasticsearch index and doc_type, I incorrectly assumed that Samza
internally used Kafka topics as the sources for the Elasticsearch system,
so that writes to Elasticsearch could be buffered through Kafka. I now
understand that the Elasticsearch system is simply a lower-level API that
jobs can call as a message-stream sink, which means I need to implement a
generic job that consumes one or more Kafka topics and uses that API to
write to Elasticsearch.
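To check my understanding, here is a minimal sketch of the generic job I
have in mind, using Samza's low-level task API (the "es" system name and
the "my_index/doc" stream are illustrative; the Kafka topics to consume
would be listed in task.inputs in the job config):

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // Generic pass-through task: forwards every message from the
    // configured Kafka input topics to the Elasticsearch system, whose
    // stream name maps to index/doc_type.
    public class KafkaToElasticsearchTask implements StreamTask {
      private static final SystemStream ES_STREAM =
          new SystemStream("es", "my_index/doc");

      @Override
      public void process(IncomingMessageEnvelope envelope,
          MessageCollector collector, TaskCoordinator coordinator) {
        collector.send(
            new OutgoingMessageEnvelope(ES_STREAM, envelope.getMessage()));
      }
    }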
Regards,

Eric

On Tue, Oct 22, 2019 at 9:31 AM Bharath Kumara Subramanian <
codin.mart...@gmail.com> wrote:

> Hi Eric,
>
> Thanks for the additional clarification. I now have a better
> understanding of your use case. It can be accomplished using various
> approaches.
>
>    1. One Samza application that consumes all of your input streams (A
>    and D) as well as your intermediate output stream (B), and routes them
>    to HDFS or Elasticsearch. With this approach,
>       1. You may have to scale your entire job up or down to react to
>       changes in any of your inputs (A, D, and B), which can result in
>       underutilization of resources.
>       2. A failure in one component could impact the others. For
>       example, there may be other writers to output stream B that you
>       don't want to disrupt because of a bug in the transformation logic
>       for input stream A.
>       3. Along the same lines as 2, an HDFS or Elasticsearch outage
>       could create back pressure that causes output stream B to grow,
>       which might increase the time to catch up and hurt your job's
>       performance (this can happen, for example, if you have priorities
>       set up across your streams).
>    2. Another approach is to split the work into multiple jobs: one that
>    consumes the input sources (A and D) and routes them to the
>    appropriate destinations (B and C), and another that consumes the
>    output of the previous job (B) and routes it to HDFS or Elasticsearch.
>    With this approach,
>       1. You isolate the common/shared logic that writes to HDFS or
>       Elasticsearch into its own job, which you can manage independently,
>       including scaling it up or down.
>       2. During HDFS/Elasticsearch outages, the other components are
>       unaffected; the back pressure causes stream B to grow, which Kafka
>       handles well.
>
> Our recommended API for writing Samza applications is Apache Beam
> <https://beam.apache.org/>. Examples of how to write a sample application
> using the Beam API and run it with Samza can be found at:
> https://github.com/apache/samza-beam-examples
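>
> For instance, a minimal Beam pipeline for your "stream B ->
> Elasticsearch" leg might look roughly like the sketch below (the broker
> address, topic, index, and type names are placeholders; the
> samza-beam-examples repo shows the full SamzaRunner setup):
>
>     import org.apache.beam.sdk.Pipeline;
>     import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
>     import org.apache.beam.sdk.io.kafka.KafkaIO;
>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>     import org.apache.beam.sdk.transforms.Values;
>     import org.apache.kafka.common.serialization.StringDeserializer;
>
>     public class KafkaToEsPipeline {
>       public static void main(String[] args) {
>         // Pass --runner=SamzaRunner to execute this pipeline on Samza.
>         Pipeline p = Pipeline.create(
>             PipelineOptionsFactory.fromArgs(args).create());
>
>         p.apply(KafkaIO.<String, String>read()
>                 .withBootstrapServers("localhost:9092") // placeholder
>                 .withTopic("topic-b")                   // placeholder
>                 .withKeyDeserializer(StringDeserializer.class)
>                 .withValueDeserializer(StringDeserializer.class)
>                 .withoutMetadata())
>          .apply(Values.create()) // keep only the JSON document payload
>          .apply(ElasticsearchIO.write().withConnectionConfiguration(
>              ElasticsearchIO.ConnectionConfiguration.create(
>                  new String[] {"http://localhost:9200"}, // placeholder
>                  "my_index", "doc")));
>
>         p.run().waitUntilFinish();
>       }
>     }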
>
> Please reach out to us if you have more questions.
>
> Thanks,
> Bharath
>
> On Wed, Oct 16, 2019 at 7:31 AM Eric Shieh <datosl...@gmail.com> wrote:
>
> > Thank you, Bharath. Regarding my 2nd question, perhaps the following
> > scenario can help illustrate what I am looking to achieve:
> >
> > Input stream A -> Job 1 -> Output stream B (Kafka Topic B)
> > Input stream A -> Job 2 -> Output stream C
> > Input stream D -> Job 3 -> Output stream B (Kafka Topic B)
> > Input stream B (Kafka Topic B) -> Elasticsearch (or write to HDFS)
> >
> > In the case of "Input stream B (Kafka Topic B) -> Elasticsearch (or
> > write to HDFS)", this is what I was referring to as a "common/shared
> > system service": it has no transformation logic, only a message sink
> > to either Elasticsearch or HDFS using Samza's systems/connectors. In
> > other words, Job 1 and Job 3 both write to output stream B expecting
> > the messages to be persisted in Elasticsearch or HDFS. Would I need to
> > specify the system/connector configuration separately in Job 1 and
> > Job 3? Is there a way to have "Input stream B (Kafka Topic B) ->
> > Elasticsearch (or write to HDFS)" as its own stand-alone job, so that
> > I can have the following:
> >
> > RESTful web services (or other non-Samza services/applications) as a
> > Kafka producer -> Input stream B (Kafka Topic B) -> Elasticsearch (or
> > write to HDFS)
> >
> > Regards,
> >
> > Eric
> >
> > On Mon, Oct 14, 2019 at 8:35 PM Bharath Kumara Subramanian <
> > codin.mart...@gmail.com> wrote:
> >
> > > Hi Eric,
> > >
> > > Answers to your questions are as follows.
> > >
> > > *Can I, or is it recommended to, package multiple jobs as 1
> > > deployment with 1 properties file, or keep each app separate? Based
> > > on the documentation, it appears to support 1 app/job within a
> > > single configuration, as there is no mechanism to assign multiple
> > > app classes and give each a name, unless I am mistaken.*
> > >
> > > *app.class* is a single-valued configuration, and your understanding
> > > of it based on the documentation is correct.
> > >
> > > *If only 1 app per config + deployment, what is the best way to
> > > handle requirement #3 - common/shared system services, as there is
> > > no app or job per se; I just need to specify the streams and output
> > > system (i.e. org.apache.samza.system.hdfs.writer...)*
> > >
> > > There are a couple of options to achieve your requirement #3.
> > >
> > >    1. If there is enough commonality between your jobs, you could
> > >    have one application class that describes the logic and use
> > >    different configurations to modify the behavior of the
> > >    application logic. This does come with some of the following
> > >    considerations:
> > >       1. Your deployment system needs to support deploying the same
> > >       application with different configs.
> > >       2. Potential duplication of configuration if your
> > >       configuration system doesn't support hierarchies and
> > >       overrides.
> > >       3. Potentially unmanageable for evolution, since a change in
> > >       the application affects multiple jobs and requires extensive
> > >       testing across different configurations.
> > >    2. You could have libraries that perform pieces of the business
> > >    logic and have your different jobs leverage them using
> > >    composition (see the sketch after this list). Some things to
> > >    consider with this option:
> > >       1. Your application and configuration stay isolated.
> > >       2. You could still leverage some of the common configurations
> > >       if your configuration system supports hierarchies and
> > >       overrides.
> > >       3. Alleviates concerns about evolution and testing, as long
> > >       as the changes are application specific.
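> > >
> > > As a rough illustration of option 2 (the class and method names here
> > > are made up), a shared library could expose the common logic as an
> > > operation on a MessageStream, and each job would compose it into its
> > > own pipeline:
> > >
> > >     import org.apache.samza.operators.MessageStream;
> > >
> > >     // Shared library jar, versioned and tested independently of
> > >     // the jobs that depend on it.
> > >     public final class SharedEnrichment {
> > >       private SharedEnrichment() {}
> > >
> > >       // Common business logic applied by every job that needs it.
> > >       public static MessageStream<String> enrich(
> > >           MessageStream<String> input) {
> > >         return input
> > >             .filter(msg -> msg != null && !msg.isEmpty())
> > >             .map(String::trim); // placeholder transformation
> > >       }
> > >     }
> > >
> > > A job then calls SharedEnrichment.enrich(...) on the stream it
> > > builds from its own input descriptor, so each job keeps its own app
> > > class and config while reusing the library.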
> > >
> > > I am still unclear about the second part of your 2nd question. Do
> > > you mean that all your jobs consume from the same sources and write
> > > to the same destinations, and only your processing logic is
> > > different?
> > >
> > > *common/shared system services as there is no app or job per se; I
> > > just need to specify the streams and output system*
> > >
> > > Also, I am not sure I follow what you mean by "*there is no app or
> > > job*". You still have 1 app per config + deployment, right?
> > >
> > > Thanks,
> > > Bharath
> > >
> > > On Thu, Oct 10, 2019 at 9:46 AM Eric Shieh <datosl...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am new to Samza and am evaluating it as the backbone for my
> > > > streaming CEP requirement. I have:
> > > >
> > > > 1. Multiple data enrichment and ETL jobs.
> > > > 2. Multiple domain-specific CEP rulesets.
> > > > 3. Common/shared system services, such as consuming topics/streams
> > > > and persisting the messages in Elasticsearch and HDFS.
> > > >
> > > > My questions are:
> > > >
> > > > 1. Can I, or is it recommended to, package multiple jobs as 1
> > > > deployment with 1 properties file, or should I keep each app
> > > > separate? Based on the documentation, it appears that only 1
> > > > app/job is supported within a single configuration, as there is no
> > > > mechanism to assign multiple app classes and give each a name,
> > > > unless I am mistaken.
> > > > 2. If only 1 app per config + deployment, what is the best way to
> > > > handle requirement #3 - common/shared system services? There is no
> > > > app or job per se; I just need to specify the streams and the
> > > > output system (i.e.
> > > > org.apache.samza.system.hdfs.writer.BinarySequenceFileHdfsWriter or
> > > > org.apache.samza.system.elasticsearch.indexrequest.DefaultIndexRequestFactory).
> > > > Given that it is a common shared system service not tied to
> > > > specific jobs, can it be deployed without an app?
> > > >
> > > > Thank you in advance for your help; I am looking forward to
> > > > learning more about Samza and developing this critical feature
> > > > with it!
> > > >
> > > > Regards,
> > > >
> > > > Eric