Hi all,
I have a stream of data from Kafka that I want to process and store in
HDFS using Spark Streaming.
Each record has a date/time dimension, and I want to write records with
the same time dimension to the same HDFS directory. The stream might be
unordered with respect to the time dimension.

What are the best practices for grouping and storing a time series data
stream with Spark Streaming?

I'm considering grouping each Spark Streaming batch by time dimension and
then saving each group to its own HDFS directory. However, since records
with the same time dimension can arrive in different batches, I would need
to handle an "update" whenever the HDFS directory for a group already
exists.
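
The approach described above can be sketched roughly as follows. This is
only a minimal illustration in plain Python, not Spark code: the local
file system stands in for HDFS, and the names hour_bucket, group_by_bucket,
and write_batch, the hourly granularity, the directory layout, and the
"part-<batch_id>.txt" file naming are all assumptions made for the example.
The idea is that each batch writes a file whose name is unique to that
batch, so a later batch adds a new file to an existing directory instead of
overwriting it.

```python
import os
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(epoch_seconds):
    """Map an event timestamp (UTC epoch seconds) to a bucket like 2015/05/16/09."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("%Y/%m/%d/%H")


def group_by_bucket(records):
    """Group the (epoch_seconds, payload) pairs of one batch by time bucket."""
    groups = defaultdict(list)
    for epoch_seconds, payload in records:
        groups[hour_bucket(epoch_seconds)].append(payload)
    return dict(groups)


def write_batch(base_dir, batch_id, records):
    """Write each group to <base_dir>/<bucket>/part-<batch_id>.txt.

    The file name is unique per batch, so records that arrive late for an
    existing bucket land in a new file next to the earlier ones.
    """
    written = []
    for bucket, payloads in sorted(group_by_bucket(records).items()):
        directory = os.path.join(base_dir, *bucket.split("/"))
        os.makedirs(directory, exist_ok=True)  # directory may already exist
        path = os.path.join(directory, "part-%s.txt" % batch_id)
        with open(path, "w") as out:
            out.write("\n".join(payloads) + "\n")
        written.append(path)
    return written
```

In a Spark Streaming job, the same grouping and naming logic would
presumably be applied per batch (for example inside foreachRDD, using the
batch time as the batch id), but I have not verified how well this behaves
on a real cluster.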

Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.
