It needs to be able to scale to a very large amount of data, yes.

On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
> What is the message inflow?
> If it's really high, Spark will definitely be of great use.
>
> Thanks
> Deepak
>
> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>
>> I have a somewhat tricky use case, and I'm looking for ideas.
>>
>> I have 5-6 Kafka producers, reading various APIs, and writing their raw
>> data into Kafka.
>>
>> I need to:
>>
>> - Do ETL on the data, and standardize it.
>>
>> - Store the standardized data somewhere (HBase / Cassandra / raw HDFS /
>> ElasticSearch / Postgres).
>>
>> - Query this data to generate reports / analytics. (There will be a web
>> UI which will be the front-end to the data, and will show the reports.)
>>
>> Java is being used as the backend language for everything (the backend
>> of the web UI, as well as the ETL layer).
>>
>> I'm considering:
>>
>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer
>> (receive raw data from Kafka, standardize it, and store it).
>>
>> - Using Cassandra, HBase, or raw HDFS for storing the standardized
>> data, and to allow queries.
>>
>> - In the backend of the web UI, either using Spark to run queries
>> across the data (mostly filters), or running queries directly against
>> Cassandra / HBase.
>>
>> I'd appreciate some thoughts / suggestions on which of these
>> alternatives I should go with (e.g. using raw Kafka consumers vs. Spark
>> for ETL, which persistent data store to use, and how to query that data
>> store in the backend of the web UI for displaying the reports).
>>
>> Thanks.
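For the "raw Kafka consumers as the ETL layer" option discussed above, here is a minimal Java sketch of the per-record standardization step. The `key=value; key=value` payload format, the class name, and the field names are all assumptions for illustration; the actual consumer poll loop and the write to Cassandra/HBase are elided, since those depend on which store is chosen.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the standardization step in a raw-Kafka-consumer
// ETL layer. Each raw payload is assumed (for illustration only) to be a
// "key=value; key=value" string; standardize() normalizes it into a map
// with lowercased, trimmed keys and trimmed values, ready to be written
// to the chosen store (Cassandra / HBase / HDFS).
public class EtlSketch {

    static Map<String, String> standardize(String raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String field : raw.split(";")) {
            String[] kv = field.split("=", 2);
            if (kv.length == 2) {
                // Lowercase keys so records from different producers
                // agree on field names; keep values as-is apart from trim.
                out.put(kv[0].trim().toLowerCase(), kv[1].trim());
            }
            // Malformed fields (no '=') are silently dropped here; a real
            // ETL layer would likely route them to a dead-letter topic.
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rec =
            standardize("Source=API1; ts=2016-09-29T19:24:00Z; Value= 42 ");
        System.out.println(rec);
    }
}
```

In the raw-consumer design, this method would be called once per `ConsumerRecord` inside the poll loop; in the Spark Streaming alternative, the same function could be applied as a `map` over each micro-batch, so the choice between the two is mostly about operational overhead rather than the transform itself.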