Hi All,


I wonder if Flink is the right tool for processing historical time series data, 
e.g. many small files.
Our use case: we have clickstream histories (time series) of many users. We 
would like to calculate user-specific sliding count window aggregates over past 
periods for a sample of users, to create features for training machine learning 
models.
As I see it, Flink would load user histories from some NoSQL database (e.g. 
HBase), process them, and publish the aggregates for machine learning. Flink 
would also update the user histories with new events.

I wonder whether it is equally efficient to load and process each user history 
in parallel, or whether it is better to create one big dataset containing 
multiple user histories and run a single map-reduce job on it.
The first approach is more attractive, since we could use the same event 
aggregation code both for processing historical user data to train models and 
for aggregating real-time user events into features for model execution.
thanks, Mindis

