Yes, we do something very similar and it's working well: Kafka -> Spark Streaming (lands each micro-batch as temp files of serialized RDDs) -> Spark batch application (compacts them into partitioned Parquet files on HDFS; the separate batch step is needed because building Parquet files of a reasonable size is too slow to do inside the streaming job) -> query with SparkSQL. A rough sketch of the two stages is below.
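A minimal sketch, assuming Spark 2.0, the receiver-based Kafka API, and a CSV-like payload. The paths, topic name, and the Price/parsePrice helpers are illustrative stand-ins, not our actual code.

Streaming stage, landing each micro-batch as serialized RDD files:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("PriceIngest")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // Receiver-based Kafka stream of (key, value) strings; topic name assumed.
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "price-ingest", Map("prices" -> 1))

    // Land each non-empty micro-batch as a directory of serialized RDD files.
    stream.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsObjectFile(s"hdfs:///staging/prices/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()

Batch stage, compacting the staged files into partitioned Parquet:

    import org.apache.spark.sql.SparkSession

    case class Price(symbol: String, price: Double, tradeDate: String)

    // Stand-in parser for whatever the Kafka payload actually looks like.
    def parsePrice(line: String): Price = {
      val Array(sym, p, d) = line.split(",")
      Price(sym, p.toDouble, d)
    }

    val spark = SparkSession.builder().appName("PriceCompactor").getOrCreate()
    import spark.implicits._

    val staged = spark.sparkContext
      .objectFile[(String, String)]("hdfs:///staging/prices/*")
      .map { case (_, line) => parsePrice(line) }

    // Repartition by the partition column first so each date ends up as a
    // few large Parquet files rather than many small ones.
    staged.toDS()
      .repartition($"tradeDate")
      .write
      .mode("append")
      .partitionBy("tradeDate")
      .parquet("hdfs:///warehouse/prices")

After compaction the batch job deletes the staging directories it consumed; that bookkeeping is omitted here.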
On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen <so...@cloudera.com> wrote:

> If your core requirement is ad-hoc real-time queries over the data,
> then the standard Hadoop-centric answer would be:
>
> Ingest via Kafka,
> maybe using Flume, or possibly Spark Streaming, to read and land the data, in...
> Parquet on HDFS or possibly Kudu, and
> Impala to query
>
>> On 15 September 2016 at 09:35, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> This is fishing for some ideas.
>>>
>>> In the design we get prices directly through Kafka into Flume and
>>> store them on HDFS as text files. We can then use Spark with Zeppelin
>>> to present the data to the users.
>>>
>>> This works. However, I am aware that once the volume of flat files
>>> rises, one needs to do housekeeping. You don't want to read all the
>>> files every time.
>>>
>>> A more viable alternative would be to read the data periodically into
>>> some form of table (Hive etc.) through an hourly cron job, so the
>>> batch process will have accurate data up to the last hour.
>>>
>>> That would certainly be an easier option for the users as well.
>>>
>>> I was wondering what would be the best strategy here. Druid, Hive,
>>> others?
>>>
>>> The business case here is that users may want to access older data,
>>> so a database of some sort would be a better solution? In all
>>> likelihood they want a week's data.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
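P.S. On the "week's data" access pattern in the quoted question above: if the Parquet output is partitioned by date as described at the top, SparkSQL prunes the scan down to only the partitions the query touches. A minimal sketch, reusing the illustrative paths and column names from the earlier sketch (the yyyy-MM-dd date-string format is also an assumption):

    // Register the compacted Parquet data and query the last seven days.
    // Partition pruning means only ~7 date partitions are actually read.
    val prices = spark.read.parquet("hdfs:///warehouse/prices")
    prices.createOrReplaceTempView("prices")

    spark.sql("""
      SELECT symbol, avg(price) AS avg_price
      FROM prices
      WHERE tradeDate >= date_sub(current_date(), 7)
      GROUP BY symbol
    """).show()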