Hi All,
I wonder whether Flink is the right tool for processing historical time series data, e.g. many small files. Our use case: we have clickstream histories (time series) for many users. We would like to calculate user-specific sliding count window aggregates over past periods for a sample of users, to create features for training machine learning models.

As I see it, Flink would load the user histories from some NoSQL database (e.g. HBase), process them, and publish the aggregates for machine learning. Flink would also update the user histories with new events.

Is it equally efficient to load and process each user history in parallel, or is it better to create one big dataset with multiple user histories and run a single map-reduce job on it? The first approach is more attractive, since we could use the same event aggregation code both for processing historical user data when training models and for aggregating real-time user events into features at model execution time.

thanks,
Mindis
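P.S. For concreteness, the event aggregation code we would like to share between the batch (training) and streaming (model execution) paths could be a small window function applied identically to a stored history and to a live event stream. Below is a minimal Python sketch; the function name and the exact window semantics (emit every `slide` events, over the last `size` events) are our assumptions for illustration, not Flink's actual countWindow API:

```python
from collections import deque

def sliding_count_window(events, size, slide, agg):
    """Apply `agg` over the last `size` events, emitting every `slide` events.

    Assumed semantics: a count window of `size` elements sliding by `slide`
    elements; only full windows are emitted.
    """
    window = deque(maxlen=size)  # keeps only the most recent `size` events
    results = []
    for i, event in enumerate(events, start=1):
        window.append(event)
        # Emit once per `slide` events seen, and only when the window is full.
        if i % slide == 0 and len(window) == size:
            results.append(agg(list(window)))
    return results

# Example: sum over the last 3 events, updated every 2 events.
features = sliding_count_window(range(1, 11), size=3, slide=2, agg=sum)
```

The same `agg` function could then back both a batch job over HBase-loaded histories and a keyed streaming window over live events.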