I'm not quite following your question about scheduling. Did you create a Spark
application, and are you asking how to schedule it to run? Are you going to write
the results from the scheduled run to HDFS and join them, in the first chain, with
the real-time results?
Hi Adrian,
yes, your assumption is correct.
I'm using HBase for storing the partial calculations.
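As a very rough sketch of what I mean (names are made up for illustration: an
existing table "partial_calcs" with column family "cf", and simple (key, count)
partial aggregates), something along these lines from the streaming job:

```scala
// Hypothetical sketch: persisting partial aggregates from a Spark Streaming
// job into HBase. Table name, column family and the (String, Long) record
// shape are assumptions, not the actual schema discussed in this thread.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def savePartials(partials: DStream[(String, Long)]): Unit = {
  partials.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Open the HBase connection per partition so it is never serialized
      // and shipped across the cluster.
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("partial_calcs"))
      try {
        records.foreach { case (key, count) =>
          val put = new Put(Bytes.toBytes(key))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count))
          table.put(put)
        }
      } finally {
        table.close()
        conn.close()
      }
    }
  }
}
```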
Thank you for the feedback - it is exactly what I had in mind.
Thx
D
On Thu, Nov 5, 2015 at 10:43 AM, Adrian Tanase wrote:
You should also specify how you’re planning to query or “publish” the data. I
would consider a combination of:
- spark streaming job that ingests the raw events in real time, validates,
pre-processes and saves to stable storage (a rough sketch of this step follows below)
- stable storage could be HDFS/parquet or a database optimized for time series
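To make the first bullet concrete, here is a minimal sketch of such an ingest
job, assuming Spark 1.5-era APIs. The socket source, the comma-separated event
format, the non-negative-value validation rule, and the HDFS output path are all
placeholders for illustration; in practice the source would typically be Kafka
or similar.

```scala
// Hedged sketch: ingest raw events, validate/pre-process, append to Parquet on HDFS.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Try

// Assumed event shape for the example.
case class Event(ts: Long, userId: String, value: Double)

object RawEventIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("raw-event-ingest")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source; a real deployment would read from Kafka/Flume/etc.
    val lines = ssc.socketTextStream("localhost", 9999)

    val events = lines.flatMap { line =>
      line.split(",") match {
        // Drop records that do not parse into the expected shape.
        case Array(ts, user, v) => Try(Event(ts.toLong, user, v.toDouble)).toOption
        case _                  => None
      }
    }.filter(_.value >= 0) // simple example validation rule

    events.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        // Append each micro-batch to the "stable storage" Parquet dataset.
        rdd.toDF().write.mode("append").parquet("hdfs:///data/raw_events_parquet")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```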