Yes, we do something very similar and it's working well:

Kafka ->
Spark Streaming (write temp files of serialized RDDs) ->
Spark Batch Application (build partitioned Parquet files on HDFS; this is
needed because building reasonably sized Parquet files is too slow to do
inside the streaming job) ->
query with SparkSQL
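
For concreteness, here is a minimal sketch of the two halves in Scala. The
Price schema, topic name, broker address and the /data/tmp and /data/parquet
paths are all illustrative, not our actual setup:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

case class Price(symbol: String, price: Double, ts: Long, dt: String)

object PriceLander {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("price-lander").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker:9092",               // illustrative
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "price-lander")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("prices"), kafkaParams))

    // Streaming half: land each micro-batch as a serialized RDD in a temp
    // directory. Object-file writes are cheap; building Parquet here is not.
    stream.map(r => parse(r.value())).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty())
        rdd.saveAsObjectFile(s"/data/tmp/prices-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Illustrative CSV layout: symbol,price,epoch-millis,yyyy-MM-dd
  def parse(line: String): Price = {
    val Array(symbol, price, ts, dt) = line.split(",")
    Price(symbol, price.toDouble, ts.toLong, dt)
  }
}

object PriceCompactor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("price-compactor").getOrCreate()
    import spark.implicits._

    // Batch half: fold all accumulated temp files into a few reasonably
    // sized, date-partitioned Parquet files on HDFS.
    spark.sparkContext.objectFile[Price]("/data/tmp/prices-*")
      .toDF()
      .repartition($"dt")                  // few large files per partition
      .write.mode("append").partitionBy("dt")
      .parquet("/data/parquet/prices")

    // Query with SparkSQL (delete the consumed temp files after the write).
    spark.read.parquet("/data/parquet/prices").createOrReplaceTempView("prices")
    spark.sql("SELECT symbol, max(price) FROM prices GROUP BY symbol").show()
  }
}

The streaming job only pays for cheap object-file writes; the compactor is
then run periodically to fold everything landed so far into a few large,
partitioned Parquet files.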


On Thu, Sep 15, 2016 at 7:33 AM, Sean Owen <so...@cloudera.com> wrote:

> If your core requirement is ad-hoc real-time queries over the data,
> then the standard Hadoop-centric answer would be:
>
> Ingest via Kafka,
> maybe using Flume or possibly Spark Streaming to read and land the data
> in Parquet on HDFS or possibly Kudu, and
> Impala to query.
>
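
If you go the Impala route, the landed Parquet just needs to be registered
as an external table in the Hive metastore. A sketch, e.g. at the end of the
compactor job above (table and path names illustrative; the SparkSession
needs enableHiveSupport()):

// Register the Parquet directory as an external, partitioned table.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS prices (
    symbol STRING, price DOUBLE, ts BIGINT)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION '/data/parquet/prices'""")

// Pick up newly written partitions (can also be run from Hive).
spark.sql("MSCK REPAIR TABLE prices")

After an INVALIDATE METADATA prices in impala-shell, Impala queries the same
files in place.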
> >> On 15 September 2016 at 09:35, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I am fishing for some ideas here.
> >>>
> >>> In the design we get prices directly through Kafka into Flume and
> >>> store them on HDFS as text files.
> >>> We can then use Spark with Zeppelin to present the data to the users.
> >>>
> >>> This works. However, I am aware that once the volume of flat files
> >>> rises, one needs to do housekeeping. You don't want to read all the
> >>> files every time.
> >>>
> >>> A more viable alternative would be to read the data into some form of
> >>> tables (Hive etc.) periodically through an hourly cron job, so the
> >>> batch process will have accurate, up-to-date data up to the last hour.
> >>>
> >>> That would certainly be an easier option for the users as well.
> >>>
> >>> I was wondering what the best strategy would be here: Druid, Hive,
> >>> others?
> >>>
> >>> The business case here is that users may want to access older data,
> >>> so would a database of some sort be a better solution? In all
> >>> likelihood they want a week's data.
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>
> >>
> >
>
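
On the hourly cron idea: a minimal sketch of driving the batch step,
assuming the compactor above is packaged as a jar (paths and names
illustrative):

# Run the compactor at the top of every hour (crontab entry).
0 * * * * /usr/bin/spark-submit --master yarn --class PriceCompactor /opt/jobs/price-jobs.jar >> /var/log/price-compactor.log 2>&1

Each run folds whatever the streaming job has landed since the previous one,
so queries lag the stream by at most an hour.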
