> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion.
Who cares about fast if your data is wrong? And it's still plenty fast enough:

https://youtu.be/NVl9_6J1G60?list=WL&t=1819
https://www.citusdata.com/blog/2016/09/22/announcing-citus-mx/

On Thu, Sep 29, 2016 at 11:16 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> The way I see this, there are three things involved:
>
> 1. Data ingestion from source into Kafka
> 2. Data conversion and storage (ETL/ELT)
> 3. Presentation
>
> Item 2 is the one that needs to be designed correctly. I presume the raw
> data has to conform to some form of MDM that requires schema mapping etc.
> before being put into persistent storage (DB, HDFS etc.). Which one to
> choose depends on your volume of ingestion, your cluster size and the
> complexity of the data conversion. Then your users will use some form of
> UI (Tableau, QlikView, Zeppelin, direct SQL) to query the data one way or
> another. Your users can directly use a UI like Tableau that offers
> built-in analytics on SQL (Spark SQL offers the same). Your mileage
> varies according to your needs.
>
> I still don't understand why writing to a transactional database with
> locking and concurrency (reads and writes) through JDBC will be fast for
> this sort of data ingestion. If you asked me to choose an RDBMS to write
> to as my sink, I would use Oracle, which offers the best locking and
> concurrency among RDBMSs and also handles key-value pairs (assuming that
> is what you want). In addition, it can be used as a data warehouse as
> well.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 29 September 2016 at 16:49, Ali Akhtar <ali.rac...@gmail.com> wrote:
>> The business use case is to read a user's data from a variety of
>> different services through their APIs, and then to allow the user to
>> query that data, on a per-service basis as well as in aggregate across
>> all services.
>>
>> The way I'm considering doing it is to do some basic ETL (drop all the
>> unnecessary fields, rename some fields into something more manageable,
>> etc.) and then store the data in Cassandra / Postgres.
>>
>> Then, when the user wants to view a particular report, query the
>> respective table in Cassandra / Postgres (select .. from data where
>> user = ? and date between <start> and <end> and some_field = ?).
>>
>> How will Spark Streaming help with aggregation? Couldn't the data be
>> queried from Cassandra / Postgres via the Kafka consumer and aggregated
>> that way?
>>
>> On Thu, Sep 29, 2016 at 8:43 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> No, the direct stream in and of itself won't ensure an end-to-end
>>> guarantee, because it doesn't know anything about your output actions.
>>>
>>> You still need to do some work. The point is that having easy access
>>> to offsets for batches on a per-partition basis makes it easier to do
>>> that work, especially in conjunction with aggregation.
>>>
>>> On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>> If you use Spark direct streams, it ensures an end-to-end guarantee
>>>> for messages.
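The offset-aware pattern Cody is pointing at might look roughly like the
following in Java, against the Spark 1.6 / Kafka 0.8 direct stream API. It
is a sketch only: the topic name (raw_events), the tables (events,
stream_offsets), and the connection strings are illustrative assumptions,
not anything from the thread. Results and the per-partition offset range
are committed in a single Postgres transaction, so a replayed batch
overwrites itself instead of duplicating data.

// Sketch: exactly-once output with the Kafka direct stream by committing
// data and offsets in one Postgres transaction. Spark 1.6 / Kafka 0.8 API;
// topic, table, and connection details below are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class OffsetAwareIngest {
  public static void main(String[] args) throws Exception {
    JavaStreamingContext jssc = new JavaStreamingContext(
        new SparkConf().setAppName("ingest"), Durations.seconds(10));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "localhost:9092");
    Set<String> topics = Collections.singleton("raw_events");

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class,
        StringDecoder.class, kafkaParams, topics);

    stream.foreachRDD(rdd -> {
      // The direct stream's RDDs know their per-partition offset ranges.
      OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

      rdd.foreachPartition(records -> {
        OffsetRange range = ranges[TaskContext.get().partitionId()];
        try (Connection c =
                 DriverManager.getConnection("jdbc:postgresql://localhost/app")) {
          c.setAutoCommit(false);
          // Upserts make replays idempotent (Postgres 9.5+ ON CONFLICT).
          PreparedStatement up = c.prepareStatement(
              "insert into events (key, payload) values (?, ?) "
                  + "on conflict (key) do update set payload = excluded.payload");
          while (records.hasNext()) {
            scala.Tuple2<String, String> kv = records.next();
            up.setString(1, kv._1());
            up.setString(2, kv._2());
            up.executeUpdate();
          }
          // The offset range rides in the same transaction as the data.
          PreparedStatement off = c.prepareStatement(
              "insert into stream_offsets (topic, part, until_offset) "
                  + "values (?, ?, ?) on conflict (topic, part) "
                  + "do update set until_offset = excluded.until_offset");
          off.setString(1, range.topic());
          off.setInt(2, range.partition());
          off.setLong(3, range.untilOffset());
          off.executeUpdate();
          c.commit();
        }
      });
    });

    jssc.start();
    jssc.awaitTermination();
  }
}

On restart, the stored until_offset values would be read back and passed to
the fromOffsets overload of createDirectStream, so the stream resumes
exactly where the last committed transaction left off.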
>>>> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>> My concern with Postgres / Cassandra is only scalability. I will look
>>>>> further into Postgres horizontal scaling, thanks.
>>>>>
>>>>> Writes could be idempotent if done as upserts; otherwise updates will
>>>>> be idempotent, but not inserts.
>>>>>
>>>>> Data should not be lost. The system should be as fault-tolerant as
>>>>> possible.
>>>>>
>>>>> What's the advantage of using Spark for reading Kafka instead of
>>>>> direct Kafka consumers?
>>>>>
>>>>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>> I wouldn't give up the flexibility and maturity of a relational
>>>>>> database unless you have a very specific use case. I'm not trashing
>>>>>> Cassandra (I've used Cassandra), but if all I know is that you're
>>>>>> doing analytics, I wouldn't want to give up the ability to easily do
>>>>>> ad-hoc aggregations without a lot of forethought. If you're worried
>>>>>> about scaling, there are several options for horizontally scaling
>>>>>> Postgres in particular. One of the current best, from what I've
>>>>>> worked with, is Citus.
>>>>>>
>>>>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>> Hi Cody,
>>>>>>> Spark direct stream is just fine for this use case.
>>>>>>> But why Postgres and not Cassandra?
>>>>>>> Is there anything specific here that I may not be aware of?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>>>>>> How are you going to handle ETL failures? Do you care about lost /
>>>>>>>> duplicated data? Are your writes idempotent?
>>>>>>>>
>>>>>>>> Absent any other information about the problem, I'd stay away from
>>>>>>>> Cassandra/Flume/HDFS/HBase/whatever and use a Spark direct stream
>>>>>>>> feeding Postgres.
>>>>>>>>
>>>>>>>> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>> Is there an advantage to that vs directly consuming from Kafka?
>>>>>>>>> Nothing is being done to the data except some light ETL, and then
>>>>>>>>> it is stored in Cassandra.
>>>>>>>>>
>>>>>>>>> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>>> It's better to use Spark's direct stream to ingest from Kafka.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>> I don't think I need separate speed storage and batch storage.
>>>>>>>>>>> Just taking in raw data from Kafka, standardizing it, and
>>>>>>>>>>> storing it somewhere the web UI can query it seems like it will
>>>>>>>>>>> be enough.
>>>>>>>>>>>
>>>>>>>>>>> I'm thinking about:
>>>>>>>>>>>
>>>>>>>>>>> - Reading data from Kafka via Spark Streaming
>>>>>>>>>>> - Standardizing, then storing it in Cassandra
>>>>>>>>>>> - Querying Cassandra from the web UI
>>>>>>>>>>>
>>>>>>>>>>> That seems like it will work. My question now is whether to use
>>>>>>>>>>> Spark Streaming to read Kafka, or to use Kafka consumers
>>>>>>>>>>> directly.
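For comparison, the raw-consumer route Ali asks about might look like the
following; a sketch assuming kafka-clients 0.10 and the DataStax Java
driver, with the topic, keyspace, and column names invented for
illustration. Offsets are committed manually after the Cassandra write,
giving at-least-once delivery; since Cassandra inserts are upserts by
primary key, replays are idempotent.

// Sketch: the same light ETL as a plain Kafka consumer feeding Cassandra.
// Assumes kafka-clients 0.10 and the DataStax Java driver; topic,
// keyspace, and column names are illustrative.
import java.util.Collections;
import java.util.Properties;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EtlWorker {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "etl-worker");
    props.put("enable.auto.commit", "false"); // commit only after the write
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect("analytics")) {
      consumer.subscribe(Collections.singletonList("raw_events"));
      // Cassandra inserts are upserts by primary key, so a
      // rewind-and-replay after a crash is idempotent.
      PreparedStatement insert = session.prepare(
          "insert into events (user_id, event_time, payload) values (?, ?, ?)");

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records) {
          String cleaned = standardize(r.value()); // drop/rename fields, etc.
          session.execute(insert.bind(
              r.key(), new java.util.Date(r.timestamp()), cleaned));
        }
        consumer.commitSync(); // at-least-once: offsets trail the writes
      }
    }
  }

  // Placeholder for the real field-dropping / renaming logic.
  static String standardize(String raw) {
    return raw;
  }
}

What this route gives up, per Cody's and Deepak's comments above, is
Spark's batching, windowed aggregation, and ready-made per-batch offset
ranges; what it gains is one fewer moving part.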
>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>> - Spark Streaming to read data from Kafka
>>>>>>>>>>>> - Storing the data on HDFS using Flume
>>>>>>>>>>>>
>>>>>>>>>>>> You don't need Spark Streaming to read data from Kafka and
>>>>>>>>>>>> store it on HDFS. That is a waste of resources.
>>>>>>>>>>>>
>>>>>>>>>>>> Couple Flume to use Kafka as the source and HDFS as the sink
>>>>>>>>>>>> directly:
>>>>>>>>>>>>
>>>>>>>>>>>> KafkaAgent.sources = kafka-sources
>>>>>>>>>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>>>>>>>>>
>>>>>>>>>>>> That will be your batch layer. To analyse it, you can read the
>>>>>>>>>>>> HDFS files directly with Spark, or simply store the data in a
>>>>>>>>>>>> database of your choice via cron or something. Do not mix your
>>>>>>>>>>>> batch layer with your speed layer.
>>>>>>>>>>>>
>>>>>>>>>>>> Your speed layer will ingest the same data directly from Kafka
>>>>>>>>>>>> into Spark Streaming, and that will be online or near real time
>>>>>>>>>>>> (defined by your window).
>>>>>>>>>>>>
>>>>>>>>>>>> Then you have a serving layer to present data from both the
>>>>>>>>>>>> speed layer (the one from Spark Streaming) and the batch layer.
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>> The web UI is actually the speed layer; it needs to be able to
>>>>>>>>>>>>> query the data online and show the results in real time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It also needs a custom front-end, so a system like Tableau
>>>>>>>>>>>>> can't be used; it must have a custom backend + front-end.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the recommendation of Flume. Do you think this will
>>>>>>>>>>>>> work:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Spark Streaming to read data from Kafka
>>>>>>>>>>>>> - Storing the data on HDFS using Flume
>>>>>>>>>>>>> - Using Spark to query the data in the backend of the web UI?
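Filled out, the Flume agent Mich sketches above might look roughly like
this. The property names follow the Flume 1.6 Kafka source and HDFS sink
documentation; the channel, ZooKeeper address, topic, and HDFS path are
illustrative assumptions.

# Sketch of a complete Kafka -> HDFS agent (Flume 1.6 property names;
# channel, ZooKeeper address, topic, and path are illustrative).
KafkaAgent.sources = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.zookeeperConnect = localhost:2181
KafkaAgent.sources.kafka-sources.topic = raw_events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory
KafkaAgent.channels.mem-channel.capacity = 10000

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
KafkaAgent.sinks.hdfs-sinks.hdfs.path = hdfs://namenode/flume/kafka/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sinks.hdfs.fileType = DataStream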
>>>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>> You need a batch layer and a speed layer. Data from Kafka can
>>>>>>>>>>>>>> be stored on HDFS using Flume.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Query this data to generate reports / analytics (there will
>>>>>>>>>>>>>> be a web UI which will be the front-end to the data, and will
>>>>>>>>>>>>>> show the reports)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is basically the batch layer, and you need something
>>>>>>>>>>>>>> like Tableau or Zeppelin to query the data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You will also need Spark Streaming to query the data online
>>>>>>>>>>>>>> for the speed layer. That data could be stored in some
>>>>>>>>>>>>>> transient fabric like Ignite, or even Druid.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>> It needs to be able to scale to a very large amount of data,
>>>>>>>>>>>>>>> yes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>>>>>>>>> What is the message inflow? If it's really high, Spark will
>>>>>>>>>>>>>>>> definitely be of great use.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Deepak
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>>>>>>>>>>>>>>>>> ideas.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have 5-6 Kafka producers, reading various APIs and
>>>>>>>>>>>>>>>>> writing their raw data into Kafka.
>>>>>>>>>>>>>>>>> I need to:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Do ETL on the data, and standardize it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra
>>>>>>>>>>>>>>>>> / raw HDFS / Elasticsearch / Postgres)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Query this data to generate reports / analytics (there
>>>>>>>>>>>>>>>>> will be a web UI which will be the front-end to the data,
>>>>>>>>>>>>>>>>> and will show the reports)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Java is being used as the backend language for everything
>>>>>>>>>>>>>>>>> (the backend of the web UI, as well as the ETL layer).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm considering:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the
>>>>>>>>>>>>>>>>> ETL layer (receive raw data from Kafka, standardize it &
>>>>>>>>>>>>>>>>> store it)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the
>>>>>>>>>>>>>>>>> standardized data, and to allow queries
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - In the backend of the web UI, either using Spark to run
>>>>>>>>>>>>>>>>> queries across the data (mostly filters), or directly
>>>>>>>>>>>>>>>>> running queries against Cassandra / HBase
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd appreciate some thoughts / suggestions on which of
>>>>>>>>>>>>>>>>> these alternatives I should go with (e.g. using raw Kafka
>>>>>>>>>>>>>>>>> consumers vs Spark for ETL, which persistent data store to
>>>>>>>>>>>>>>>>> use, and how to query that data store in the backend of
>>>>>>>>>>>>>>>>> the web UI, for displaying the reports).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
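On the last point: if the sink ends up being Postgres, the mostly-filter
reports Ali describes don't need Spark in the web UI backend at all; a
parameterized JDBC query in the shape of his earlier pseudo-SQL is enough.
A sketch, with the table and column names assumed, not taken from any
actual schema in the thread:

// Sketch: the per-user report query over JDBC; table and column names
// are assumptions modeled on the pseudo-SQL earlier in the thread.
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.time.LocalDate;

public class ReportDao {
  public static void printReport(String userId, LocalDate start,
                                 LocalDate end, String someField)
      throws Exception {
    try (Connection c =
             DriverManager.getConnection("jdbc:postgresql://localhost/app");
         PreparedStatement ps = c.prepareStatement(
             "select event_date, payload from data "
                 + "where user_id = ? and event_date between ? and ? "
                 + "and some_field = ?")) {
      ps.setString(1, userId);
      ps.setDate(2, Date.valueOf(start));
      ps.setDate(3, Date.valueOf(end));
      ps.setString(4, someField);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getDate("event_date") + " "
              + rs.getString("payload"));
        }
      }
    }
  }
}

The cross-service aggregate is then an ordinary group by over the same
table, which is exactly the ad-hoc flexibility Cody's relational-database
recommendation is trading on.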