How are you going to handle ETL failures? Do you care about lost or duplicated data? Are your writes idempotent?
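For example, with Postgres an idempotent write is just an upsert keyed on a stable event id (ON CONFLICT needs Postgres 9.5+). Untested sketch — the table, columns, and connection details below are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWriter {

    // Assumes a hypothetical table: events(event_id text primary key, payload text)
    public static void write(String eventId, String payload) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO events (event_id, payload) VALUES (?, ?) "
                     + "ON CONFLICT (event_id) DO UPDATE SET payload = EXCLUDED.payload")) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            // Replaying the same record overwrites instead of duplicating
            ps.executeUpdate();
        }
    }
}

With writes like that, a failed batch can simply be replayed from Kafka without leaving duplicates behind.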
Absent any other information about the problem, I'd stay away from Cassandra/Flume/HDFS/HBase/whatever, and use a Spark direct stream feeding Postgres.

On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> Is there an advantage to that vs directly consuming from Kafka? Nothing is
> being done to the data except some light ETL and then storing it in
> Cassandra.
>
> On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>> It's better you use Spark's direct stream to ingest from Kafka.
>>
>> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>
>>> I don't think I need different speed storage and batch storage. Just
>>> taking in raw data from Kafka, standardizing it, and storing it somewhere
>>> the web UI can query seems like it will be enough.
>>>
>>> I'm thinking about:
>>>
>>> - Reading data from Kafka via Spark Streaming
>>> - Standardizing, then storing it in Cassandra
>>> - Querying Cassandra from the web UI
>>>
>>> That seems like it will work. My question now is whether to use Spark
>>> Streaming to read Kafka, or to use Kafka consumers directly.
>>>
>>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>> - Spark Streaming to read data from Kafka
>>>> - Storing the data on HDFS using Flume
>>>>
>>>> You don't need Spark Streaming to read data from Kafka and store it on
>>>> HDFS; that is a waste of resources. Couple Flume to use Kafka as the
>>>> source and HDFS as the sink directly (abbreviated below; a channel
>>>> binding the source and sink is also needed):
>>>>
>>>> KafkaAgent.sources = kafka-sources
>>>> KafkaAgent.sinks = hdfs-sinks
>>>> KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
>>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>>>
>>>> That will be your batch layer. To analyse it, you can read the HDFS
>>>> files directly with Spark, or simply load the data into a database of
>>>> your choice via cron or similar. Do not mix your batch layer with your
>>>> speed layer.
>>>>
>>>> Your speed layer will ingest the same data directly from Kafka into
>>>> Spark Streaming, and that will be online or near real time (defined by
>>>> your window).
>>>>
>>>> Then you have a serving layer to present data from both the speed layer
>>>> (the one from Spark Streaming) and the batch layer.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>
>>>>> The web UI is actually the speed layer; it needs to be able to query
>>>>> the data online and show the results in real time.
>>>>>
>>>>> It also needs a custom front-end, so a system like Tableau can't be
>>>>> used; it must have a custom backend + front-end.
>>>>>
>>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>>>>
>>>>> - Spark Streaming to read data from Kafka
>>>>> - Storing the data on HDFS using Flume
>>>>> - Using Spark to query the data in the backend of the web UI?
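For what it's worth, the Spark Streaming -> Cassandra path proposed a few messages up is not much code in Java either. Again an untested sketch: it assumes the spark-streaming-kafka-0-10 integration and the DataStax spark-cassandra-connector, and the broker address, topic, keyspace, table, and Event fields are all made-up placeholders:

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraStreamingJavaUtil.javaFunctions;

public class EtlJob {

    // Simple serializable bean the connector can map onto a Cassandra table
    public static class Event implements Serializable {
        private String eventId;
        private String payload;
        public Event() {}
        public Event(String eventId, String payload) { this.eventId = eventId; this.payload = payload; }
        public String getEventId() { return eventId; }
        public void setEventId(String eventId) { this.eventId = eventId; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
            .setAppName("kafka-etl")
            .set("spark.cassandra.connection.host", "localhost"); // placeholder

        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "etl");

        // Direct stream: one Kafka partition per Spark partition, offsets tracked by Spark
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("raw-events"), kafkaParams)); // placeholder topic

        // The "light ETL" step: standardize each raw record
        JavaDStream<Event> events =
            stream.map(r -> new Event(r.key(), r.value().trim()));

        // Write each micro-batch to Cassandra (keyspace/table are placeholders)
        javaFunctions(events)
            .writerBuilder("reporting", "events", mapToRow(Event.class))
            .saveToCassandra();

        jssc.start();
        jssc.awaitTermination();
    }
}

The Cassandra writes stay idempotent as long as the primary key is derived from the data rather than generated, which is what makes replays after a failure safe.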
>>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> You need a batch layer and a speed layer. Data from Kafka can be
>>>>>> stored on HDFS using Flume.
>>>>>>
>>>>>> - Query this data to generate reports / analytics (there will be a
>>>>>> web UI which will be the front-end to the data, and will show the
>>>>>> reports)
>>>>>>
>>>>>> This is basically the batch layer, and you need something like
>>>>>> Tableau or Zeppelin to query the data.
>>>>>>
>>>>>> You will also need Spark Streaming to query data online for the speed
>>>>>> layer. That data could be stored in some transient fabric like Ignite
>>>>>> or even Druid.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>>>>>>
>>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>>>>>>
>>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>>>>>>> <deepakmc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> What is the message inflow? If it's really high, Spark will
>>>>>>>> definitely be of great use.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Deepak
>>>>>>>>
>>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>>>>>>>>
>>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing
>>>>>>>>> their raw data into Kafka.
>>>>>>>>>
>>>>>>>>> I need to:
>>>>>>>>>
>>>>>>>>> - Do ETL on the data and standardize it.
>>>>>>>>>
>>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw
>>>>>>>>> HDFS / ElasticSearch / Postgres).
>>>>>>>>>
>>>>>>>>> - Query this data to generate reports / analytics (there will be
>>>>>>>>> a web UI which will be the front-end to the data, and will show
>>>>>>>>> the reports).
>>>>>>>>>
>>>>>>>>> Java is being used as the backend language for everything (the
>>>>>>>>> backend of the web UI as well as the ETL layer).
>>>>>>>>>
>>>>>>>>> I'm considering:
>>>>>>>>>
>>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>>>>>>>>> (receive raw data from Kafka, standardize it & store it)
>>>>>>>>>
>>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the
>>>>>>>>> standardized data and allowing queries
>>>>>>>>>
>>>>>>>>> - In the backend of the web UI, I could either use Spark to run
>>>>>>>>> queries across the data (mostly filters), or directly run queries
>>>>>>>>> against Cassandra / HBase
>>>>>>>>>
>>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>>>>>>>>> alternatives I should go with (e.g. using raw Kafka consumers vs
>>>>>>>>> Spark for ETL, which persistent data store to use, and how to
>>>>>>>>> query that data store in the backend of the web UI, for
>>>>>>>>> displaying the reports).
>>>>>>>>>
>>>>>>>>> Thanks.
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
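For comparison, the "raw Kafka consumers" option from the original post is also straightforward. Untested sketch — the broker address and topic are placeholders, and standardize()/store() are stand-ins for the ETL step and the Cassandra/Postgres write:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RawConsumerEtl {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "etl");
        props.put("enable.auto.commit", "false"); // commit only after a successful write
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("raw-events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    store(standardize(record.value()));
                }
                // At-least-once: a crash before this line replays the batch,
                // which is exactly why the writes should be idempotent
                consumer.commitSync();
            }
        }
    }

    private static String standardize(String raw) { return raw.trim(); } // stand-in ETL
    private static void store(String row) { /* idempotent Cassandra/Postgres write here */ }
}

Either way the trade-off is the same: Spark buys you partition-aware scaling and a query engine for the web UI backend, while the plain consumer keeps the moving parts down to Kafka plus the datastore.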