No, the direct stream in and of itself won't give you an end-to-end guarantee, because it doesn't know anything about your output actions.

You still need to do some work. The point is that having easy access to the offsets for each batch, on a per-partition basis, makes that work easier, especially in conjunction with aggregation.
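To make that concrete, here is the rough shape of it. This is an untested sketch against the 0.8 direct stream API (spark-streaming-kafka); the broker address, topic name, batch interval, and the upsert / offsets-table details described in the comments are placeholders, not anything settled in this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val conf = new SparkConf().setAppName("kafka-etl")       // app name is a placeholder
    val ssc  = new StreamingContext(conf, Seconds(30))       // batch interval is arbitrary here

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val topics      = Set("raw-events")                              // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // One OffsetRange per Kafka partition in this batch:
      // topic, partition, fromOffset, untilOffset.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        val range = offsetRanges(TaskContext.get.partitionId)
        // Within a single Postgres transaction:
        //   1. upsert the transformed / aggregated rows (so a replay overwrites
        //      rather than duplicates)
        //   2. store (range.topic, range.partition, range.untilOffset) in an
        //      offsets table
        // On restart, compare stored offsets against the batch's ranges to skip
        // work that has already been committed.
      }
    }

    ssc.start()
    ssc.awaitTermination()

Whether you keep the offsets in the same Postgres transaction as your results, or lean purely on idempotent upserts keyed by something in the data, is up to you; the point is that the direct stream hands you the per-partition offset ranges needed to build either approach.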
On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
> If you use Spark direct streams, it ensures an end-to-end guarantee for messages.
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>> My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts; otherwise updates will be idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> I wouldn't give up the flexibility and maturity of a relational database unless you have a very specific use case. I'm not trashing Cassandra; I've used Cassandra. But if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a lot of forethought. If you're worried about scaling, there are several options for horizontally scaling Postgres in particular. One of the current best, from what I've worked with, is Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why Postgres and not Cassandra?
>>> > Is there anything specific here that I may not be aware of?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >> How are you going to handle ETL failures? Do you care about lost / duplicated data? Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a Spark direct stream feeding Postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra.
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>> I don't think I need different speed storage and batch storage. Just taking in raw data from Kafka, standardizing it, and storing it somewhere the web UI can query seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web UI
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use Spark Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>> - Storing the data on HDFS using Flume
>>> >> >>>>
>>> >> >>>> You don't need Spark Streaming to read data from Kafka and store it on HDFS. That is a waste of resources.
>>> >> >>>>
>>> >> >>>> Couple Flume to Kafka as the source and HDFS as the sink directly:
>>> >> >>>>
>>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> >>>>
>>> >> >>>> That will be your batch layer. To analyse it, you can read the HDFS files directly with Spark, or simply store the data in a database of your choice via cron or something. Do not mix your batch layer with your speed layer.
>>> >> >>>>
>>> >> >>>> Your speed layer will ingest the same data directly from Kafka into Spark Streaming, and that will be online or near real time (defined by your window).
>>> >> >>>>
>>> >> >>>> Then you have a serving layer to present data from both the speed layer (the one from Spark Streaming) and the batch layer.
>>> >> >>>>
>>> >> >>>> HTH
>>> >> >>>>
>>> >> >>>> Dr Mich Talebzadeh
>>> >> >>>>
>>> >> >>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>
>>> >> >>>> http://talebzadehmich.wordpress.com
>>> >> >>>>
>>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>> >> >>>>
>>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>>>> The web UI is actually the speed layer; it needs to be able to query the data online and show the results in real time.
>>> >> >>>>>
>>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't be used; it must have a custom backend + front-end.
>>> >> >>>>>
>>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>> >> >>>>>
>>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >> >>>>> - Storing the data on HDFS using Flume
>>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>>> >> >>>>>
>>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using Flume.
>>> >> >>>>>>
>>> >> >>>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically the batch layer, and you need something like Tableau or Zeppelin to query the data.
>>> >> >>>>>>
>>> >> >>>>>> You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> >> >>>>>>>> What is the message inflow? If it's really high, definitely Spark will be of great use.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw HDFS / Elasticsearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything (the backend of the web UI, as well as the ETL layer).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the standardized data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g., using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net