Hi Cody,

Spark direct stream is just fine for this use case. But why Postgres and not Cassandra? Is there anything specific here that I may not be aware of?
Thanks,
Deepak

On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <[email protected]> wrote:

How are you going to handle ETL failures? Do you care about lost / duplicated data? Are your writes idempotent?

Absent any other information about the problem, I'd stay away from Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream feeding Postgres.

On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <[email protected]> wrote:

Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra.

On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <[email protected]> wrote:

It's better to use Spark's direct stream to ingest from Kafka.

On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <[email protected]> wrote:

I don't think I need different speed storage and batch storage. Just taking in raw data from Kafka, standardizing it, and storing it somewhere the web UI can query seems like it will be enough.

I'm thinking about:

- Reading data from Kafka via Spark Streaming
- Standardizing, then storing it in Cassandra
- Querying Cassandra from the web UI

That seems like it will work. My question now is whether to use Spark Streaming to read Kafka, or use Kafka consumers directly.

On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <[email protected]> wrote:

> - Spark Streaming to read data from Kafka
> - Storing the data on HDFS using Flume

You don't need Spark Streaming to read data from Kafka and store it on HDFS; that is a waste of resources. Couple Flume to use Kafka as the source and HDFS as the sink directly:

KafkaAgent.sources = kafka-sources
KafkaAgent.sinks.hdfs-sinks.type = hdfs

That will be for your batch layer.
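For reference, a fuller agent definition along the lines Mich sketches would also need a channel wired between source and sink. The property names below assume the Flume 1.7-era Kafka source; the agent/channel names, broker address, topic, and HDFS path are all illustrative placeholders:

```properties
# Illustrative Flume agent: Kafka source -> memory channel -> HDFS sink.
# Names, broker, topic, and paths are placeholders, not from the thread.
KafkaAgent.sources = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sinks

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = broker1:9092
KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sinks.type = hdfs
KafkaAgent.sinks.hdfs-sinks.hdfs.path = /data/raw/%Y-%m-%d
KafkaAgent.sinks.hdfs-sinks.channel = mem-channel
```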
To analyse, you can directly read from the HDFS files with Spark, or simply store the data in a database of your choice via cron or something. Do not mix your batch layer with the speed layer.

Your speed layer will ingest the same data directly from Kafka into Spark Streaming, and that will be online or near real time (defined by your window).

Then you have a serving layer to present data from both the speed layer (the one from Spark Streaming) and the batch layer.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 29 September 2016 at 15:15, Ali Akhtar <[email protected]> wrote:

The web UI is actually the speed layer; it needs to be able to query the data online and show the results in real time.

It also needs a custom front-end, so a system like Tableau can't be used; it must have a custom backend + front-end.

Thanks for the recommendation of Flume. Do you think this will work:

- Spark Streaming to read data from Kafka
- Storing the data on HDFS using Flume
- Using Spark to query the data in the backend of the web UI?

On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <[email protected]> wrote:

You need a batch layer and a speed layer.
Data from Kafka can be stored on HDFS using Flume.

> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)

This is basically the batch layer, and you need something like Tableau or Zeppelin to query the data.

You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid.

HTH

Dr Mich Talebzadeh

On 29 September 2016 at 15:01, Ali Akhtar <[email protected]> wrote:

It needs to be able to scale to a very large amount of data, yes.

On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <[email protected]> wrote:

What is the message inflow? If it's really high, definitely Spark will be of great use.

Thanks,
Deepak

On Sep 29, 2016 19:24, "Ali Akhtar" <[email protected]> wrote:

I have a somewhat tricky use case, and I'm looking for ideas.
I have 5-6 Kafka producers, reading various APIs and writing their raw data into Kafka.

I need to:

- Do ETL on the data, and standardize it.
- Store the standardized data somewhere (HBase / Cassandra / raw HDFS / ElasticSearch / Postgres).
- Query this data to generate reports / analytics (there will be a web UI which will be the front-end to the data, and will show the reports).

Java is being used as the backend language for everything (the backend of the web UI, as well as the ETL layer).

I'm considering:

- Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it).
- Using Cassandra, HBase, or raw HDFS for storing the standardized data, and to allow queries.
- In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase.

I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g. using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).

Thanks.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
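Cody's questions about ETL failures in the thread above come down to making the store step safe to replay. A minimal sketch of the idempotent-write idea, using sqlite3 from the Python standard library as a stand-in for whichever store is chosen (in Postgres the equivalent statement would be INSERT ... ON CONFLICT ... DO UPDATE; the table and column names here are hypothetical):

```python
# Sketch: replaying the same micro-batch after a failure must not
# duplicate rows. An upsert keyed on a record id achieves that.
import sqlite3

def write_batch(conn, batch):
    # INSERT OR REPLACE keyed on the primary key: re-running a
    # partially-failed batch overwrites rather than duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO events (id, payload) VALUES (?, ?)",
        batch,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT)")

batch = [("e1", "standardized-1"), ("e2", "standardized-2")]
write_batch(conn, batch)
write_batch(conn, batch)  # simulate a retry after an ETL failure

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2 -- the replay did not duplicate rows
```

With writes shaped like this, a Spark direct stream (or a raw Kafka consumer) can safely reprocess offsets after a crash, which is the property Cody is pressing on.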
