·         Kafka Connect for ingress, the “E”

·         Kafka Streams, Flink, or Spark Streaming for the “T” – read from and
write back to Kafka. Keep the set of sources your processing engine reads from
small – separation of concerns: why should Spark care where your upstream
sources are, for example? (See the Streams sketch below.)

·         Kafka Connect for egress, the “L”, to a datastore of your choice:
Kudu, HDFS, Cassandra, RethinkDB, HBase, Postgres, etc. (See the connector
config sketch below.)

·         The REST Proxy from Confluent or
https://github.com/datamountaineer/stream-reactor/tree/master/kafka-socket-streamer
for a UI on real-time streams
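
A minimal sketch of the “T” step with Kafka Streams (the 0.10-era API; the
topic names and standardize() are placeholders, and default serdes are byte
arrays unless you configure otherwise):

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KStreamBuilder;

    public class StandardizeJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "standardize-etl");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

            // Read raw events, transform, write back to Kafka.
            // Egress to the datastore is Connect's job, not this job's.
            KStreamBuilder builder = new KStreamBuilder();
            KStream<byte[], byte[]> raw = builder.stream("raw-events");
            raw.mapValues(StandardizeJob::standardize).to("standardized-events");

            new KafkaStreams(builder, props).start();
        }

        private static byte[] standardize(byte[] value) {
            return value; // placeholder for your standardization logic
        }
    }

The point is the “T” bullet above: the job's only I/O is Kafka itself.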



More Kafka Connect connectors, including several of the sinks above:
https://github.com/datamountaineer/stream-reactor
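
The egress side is just connector configuration, no code. A sketch using
Confluent's HDFS sink connector as an example (all values are placeholders;
other sinks differ only in their connector.class and properties):

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=2
    topics=standardized-events
    hdfs.url=hdfs://namenode:8020
    flush.size=1000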





On 29/09/16 17:11, "Cody Koeninger" <c...@koeninger.org> wrote:



    How are you going to handle ETL failures?  Do you care about lost /

    duplicated data?  Are your writes idempotent?



    Absent any other information about the problem, I'd stay away from

    cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream

    feeding postgres.
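
    One way to get idempotent writes is to upsert on a natural key, so
    replaying a Kafka batch after a failure rewrites the same rows instead of
    duplicating them. A minimal JDBC sketch, assuming Postgres 9.5+ and a
    made-up events table:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;

        public class IdempotentWriter {
            // ON CONFLICT makes the write safe to replay after an ETL failure.
            private static final String UPSERT =
                "INSERT INTO events (event_id, payload) VALUES (?, ?) " +
                "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload";

            public static void write(String jdbcUrl, String eventId, String payload)
                    throws Exception {
                try (Connection conn = DriverManager.getConnection(jdbcUrl);
                     PreparedStatement ps = conn.prepareStatement(UPSERT)) {
                    ps.setString(1, eventId);
                    ps.setString(2, payload);
                    ps.executeUpdate();
                }
            }
        }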



    On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:

    > Is there an advantage to that vs directly consuming from Kafka? Nothing is

    > being done to the data except some light ETL and then storing it in

    > Cassandra

    >

    > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com>

    > wrote:

    >>

    >> It's better to use Spark's direct stream to ingest from Kafka.
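
    A minimal sketch of that direct stream (Spark's kafka-0-10 integration;
    broker, topic, and process() are placeholders):

        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.Map;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.common.serialization.StringDeserializer;
        import org.apache.spark.SparkConf;
        import org.apache.spark.streaming.Durations;
        import org.apache.spark.streaming.api.java.JavaInputDStream;
        import org.apache.spark.streaming.api.java.JavaStreamingContext;
        import org.apache.spark.streaming.kafka010.ConsumerStrategies;
        import org.apache.spark.streaming.kafka010.KafkaUtils;
        import org.apache.spark.streaming.kafka010.LocationStrategies;

        public class DirectStreamJob {
            public static void main(String[] args) throws InterruptedException {
                SparkConf conf = new SparkConf().setAppName("etl");
                JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.seconds(10));

                Map<String, Object> kafkaParams = new HashMap<>();
                kafkaParams.put("bootstrap.servers", "broker1:9092");
                kafkaParams.put("key.deserializer", StringDeserializer.class);
                kafkaParams.put("value.deserializer", StringDeserializer.class);
                kafkaParams.put("group.id", "etl-group");

                JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                            Arrays.asList("raw-events"), kafkaParams));

                // Standardize each batch, then hand it to your store of choice.
                stream.foreachRDD(rdd -> rdd.foreach(r -> process(r.value())));

                jssc.start();
                jssc.awaitTermination();
            }

            private static void process(String value) { /* placeholder */ }
        }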

    >>

    >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

    >>>

    >>> I don't think I need separate speed storage and batch storage. Just

    >>> taking in raw data from Kafka, standardizing, and storing it somewhere where

    >>> the web UI can query it, seems like it will be enough.

    >>>

    >>> I'm thinking about:

    >>>

    >>> - Reading data from Kafka via Spark Streaming

    >>> - Standardizing, then storing it in Cassandra

    >>> - Querying Cassandra from the web ui

    >>>

    >>> That seems like it will work. My question now is whether to use Spark

    >>> Streaming to read Kafka, or use Kafka consumers directly.
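
    If you go the Spark route, the Cassandra write could use the
    spark-cassandra-connector's Java API. A rough sketch (the keyspace, table,
    and Event bean are made up; spark.cassandra.connection.host must point at
    your cluster):

        import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
        import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
        import java.io.Serializable;
        import org.apache.spark.api.java.JavaRDD;

        public class CassandraSink {
            // Matches a hypothetical table:
            // CREATE TABLE etl.events (id text PRIMARY KEY, payload text);
            public static class Event implements Serializable {
                private String id;
                private String payload;
                public String getId() { return id; }
                public void setId(String id) { this.id = id; }
                public String getPayload() { return payload; }
                public void setPayload(String payload) { this.payload = payload; }
            }

            // Call this from foreachRDD with each standardized batch.
            public static void save(JavaRDD<Event> batch) {
                javaFunctions(batch)
                    .writerBuilder("etl", "events", mapToRow(Event.class))
                    .saveToCassandra();
            }
        }

    Cassandra writes are upserts by primary key, which also helps with
    replayed batches.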

    >>>

    >>>

    >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh

    >>> <mich.talebza...@gmail.com> wrote:

    >>>>

    >>>> - Spark Streaming to read data from Kafka

    >>>> - Storing the data on HDFS using Flume

    >>>>

    >>>> You don't need Spark Streaming to read data from Kafka and store on

    >>>> HDFS. It is a waste of resources.

    >>>>

    >>>> Configure Flume to use Kafka as the source and HDFS as the sink directly:

    >>>>

    >>>> KafkaAgent.sources = kafka-source
    >>>> KafkaAgent.sinks = hdfs-sink
    >>>> KafkaAgent.sinks.hdfs-sink.type = hdfs
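
    Filling that fragment out, a fuller sketch of the agent (Flume 1.7-style
    Kafka source; broker, topic, and path are placeholders, so check the
    property names against your Flume version, as 1.6 used zookeeperConnect
    instead of kafka.bootstrap.servers):

        KafkaAgent.sources = kafka-source
        KafkaAgent.channels = mem-channel
        KafkaAgent.sinks = hdfs-sink

        KafkaAgent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
        KafkaAgent.sources.kafka-source.kafka.bootstrap.servers = broker1:9092
        KafkaAgent.sources.kafka-source.kafka.topics = raw-events
        KafkaAgent.sources.kafka-source.channels = mem-channel

        KafkaAgent.channels.mem-channel.type = memory

        KafkaAgent.sinks.hdfs-sink.type = hdfs
        KafkaAgent.sinks.hdfs-sink.hdfs.path = /data/raw/%Y-%m-%d
        KafkaAgent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
        KafkaAgent.sinks.hdfs-sink.channel = mem-channel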

    >>>>

    >>>> That will be for your batch layer. To analyse, you can directly read from

    >>>> HDFS files with Spark, or simply store data in a database of your choice via

    >>>> cron or something. Do not mix your batch layer with your speed layer.

    >>>>

    >>>> Your speed layer will ingest the same data directly from Kafka into

    >>>> Spark Streaming, and that will be online or near real time (defined by your

    >>>> window).

    >>>>

    >>>> Then you have a serving layer to present data from both the speed layer

    >>>> (the one from Spark Streaming) and the batch layer.

    >>>>

    >>>> HTH

    >>>>


    >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:

    >>>>>

    >>>>> The web UI is actually the speed layer; it needs to be able to query

    >>>>> the data online and show the results in real time.

    >>>>>

    >>>>> It also needs a custom front-end, so a system like Tableau can't be

    >>>>> used; it must have a custom backend + front-end.

    >>>>>

    >>>>> Thanks for the recommendation of Flume. Do you think this will work:

    >>>>>

    >>>>> - Spark Streaming to read data from Kafka

    >>>>> - Storing the data on HDFS using Flume

    >>>>> - Using Spark to query the data in the backend of the web UI?

    >>>>>

    >>>>>

    >>>>>

    >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh

    >>>>> <mich.talebza...@gmail.com> wrote:

    >>>>>>

    >>>>>> You need a batch layer and a speed layer. Data from Kafka can be

    >>>>>> stored on HDFS using Flume.

    >>>>>>

    >>>>>> -  Query this data to generate reports / analytics (There will be a

    >>>>>> web UI which will be the front-end to the data, and will show the reports)

    >>>>>>

    >>>>>> This is basically the batch layer, and you need something like Tableau or

    >>>>>> Zeppelin to query the data

    >>>>>>

    >>>>>> You will also need Spark Streaming to query data online for the speed

    >>>>>> layer. That data could be stored in some transient fabric like Ignite or

    >>>>>> even Druid.

    >>>>>>

    >>>>>> HTH

    >>>>>>


    >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com>

    >>>>>> wrote:

    >>>>>>>

    >>>>>>> It needs to be able to scale to a very large amount of data, yes.

    >>>>>>>

    >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma

    >>>>>>> <deepakmc...@gmail.com> wrote:

    >>>>>>>>

    >>>>>>>> What is the message inflow?

    >>>>>>>> If it's really high, Spark will definitely be of great use.

    >>>>>>>>

    >>>>>>>> Thanks

    >>>>>>>> Deepak

    >>>>>>>>

    >>>>>>>>

    >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:

    >>>>>>>>>

    >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.

    >>>>>>>>>

    >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their

    >>>>>>>>> raw data into Kafka.

    >>>>>>>>>

    >>>>>>>>> I need to:

    >>>>>>>>>

    >>>>>>>>> - Do ETL on the data, and standardize it.

    >>>>>>>>>

    >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / Raw

    >>>>>>>>> HDFS / ElasticSearch / Postgres)

    >>>>>>>>>

    >>>>>>>>> - Query this data to generate reports / analytics (There will be a

    >>>>>>>>> web UI which will be the front-end to the data, and will show the reports)

    >>>>>>>>>

    >>>>>>>>> Java is being used as the backend language for everything (backend

    >>>>>>>>> of the web UI, as well as the ETL layer)

    >>>>>>>>>

    >>>>>>>>> I'm considering:

    >>>>>>>>>

    >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer

    >>>>>>>>> (receive raw data from Kafka, standardize & store it; sketch below)

    >>>>>>>>>

    >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the standardized

    >>>>>>>>> data, and to allow queries

    >>>>>>>>>

    >>>>>>>>> - In the backend of the web UI, I could either use Spark to run

    >>>>>>>>> queries across the data (mostly filters), or directly run queries against

    >>>>>>>>> Cassandra / HBase

    >>>>>>>>>

    >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these

    >>>>>>>>> alternatives I should go with (e.g., using raw Kafka consumers vs Spark for

    >>>>>>>>> ETL, which persistent data store to use, and how to query that data store in

    >>>>>>>>> the backend of the web UI, for displaying the reports).

    >>>>>>>>>

    >>>>>>>>>

    >>>>>>>>> Thanks.
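
    For comparison, the raw-consumer option above is roughly this (the new
    0.9+ consumer API; topic, group, and standardizeAndStore() are
    placeholders):

        import java.util.Arrays;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.ConsumerRecords;
        import org.apache.kafka.clients.consumer.KafkaConsumer;

        public class RawEtlConsumer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker1:9092");
                props.put("group.id", "etl-group");
                props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
                // Commit manually, only after a batch is safely stored.
                props.put("enable.auto.commit", "false");

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Arrays.asList("raw-events"));
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(1000);
                        for (ConsumerRecord<String, String> record : records) {
                            standardizeAndStore(record.value());
                        }
                        consumer.commitSync();
                    }
                }
            }

            private static void standardizeAndStore(String value) { /* placeholder */ }
        }

    You lose Spark's batching, checkpointing, and cluster scheduling, but for
    light ETL it is a much smaller moving part.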

    >>>>>>>

    >>>>>>>

    >>>>>>

    >>>>>

    >>>>

    >>>

    >>

    >>

    >>

    >> --

    >> Thanks

    >> Deepak

    >> www.bigdatabig.com

    >> www.keosha.net

    >

    >

