I am working on an analytic application using Apache Spark to store and analyze
data. Spark would be used as an ETL application to aggregate different metrics
and then join against the aggregated metrics. The data sources are flat files
arriving from two different sources (interval meter data and customer
information) on a daily basis (65 GB per day of time series data). The end users
are BI users, so we cannot offer them notebook visualizations; they can only
use Power BI, Tableau or Excel to do self-service filtering for run-time
analytics, graphing and reporting.
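For context, here is a minimal sketch of what I have in mind for the Spark ETL step. The file paths, column names and the aggregation itself are only placeholders, not our real schema:

```python
# Minimal PySpark sketch of the intended ETL step.
# Paths, column names and aggregations are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("meter-etl").getOrCreate()

# Daily flat files from the two sources (assumed CSV with headers).
intervals = spark.read.csv("/landing/interval_meter/2023-01-01/*.csv",
                           header=True, inferSchema=True)
customers = spark.read.csv("/landing/customer_info/2023-01-01/*.csv",
                           header=True, inferSchema=True)

# Aggregate interval readings per meter per day.
daily_usage = (intervals
               .groupBy("meter_id",
                        F.to_date("reading_ts").alias("reading_date"))
               .agg(F.sum("kwh").alias("daily_kwh"),
                    F.max("kwh").alias("peak_kwh")))

# Join the aggregated metrics with the customer information.
enriched = daily_usage.join(customers, on="meter_id", how="left")
```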
So, my question is: what are the best tools to implement this pipeline? I do
not think storing Parquet or ORC files on a plain file system is a good choice in
production; I think we have to land the data somewhere (a time series or standard
database). Please correct me if I am wrong.
1- Where should we store the data? File system / time series DB / Azure Cosmos DB / standard relational DB?
2- Is it the right approach to use Spark as the ETL and aggregation application, store the results somewhere, and use Power BI for reporting and dashboard purposes (roughly along the lines of the sketch after this list)?
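To make question 2 concrete, this is roughly the hand-off I am imagining, continuing from the sketch above. The JDBC URL, credentials, table name and output path are only placeholders:

```python
# Hypothetical write-out step: push the aggregated/joined result into a
# relational database (e.g., Azure SQL) that Power BI / Tableau can query.
# URL, credentials and table name below are placeholders.
(enriched.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=analytics")
    .option("dbtable", "dbo.daily_meter_usage")
    .option("user", "etl_user")
    .option("password", "<secret>")
    .mode("append")
    .save())

# Alternative I am unsure about for production: keep the result as
# date-partitioned Parquet on the file system / data lake and point the
# BI tools at it through some SQL layer.
(enriched.write
    .partitionBy("reading_date")
    .mode("overwrite")
    .parquet("/curated/daily_meter_usage"))
```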
Best Regards,
Amin Mohebbi
PhD candidate in Software Engineering at University of Malaysia
Tel: +60 18 2040 017
E-Mail: [email protected]