It's possible you could (ab)use updateStateByKey or mapWithState for this.
But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the HDFS
small-files problem.
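For what it's worth, here's a rough sketch of that simpler per-batch route
in Scala. The writeToHdfs helper and the /data/<key>/<timestamp> path layout
are hypothetical, not anything Spark or filecrush provides; the point is
just one file per key per batch interval, compacted afterwards.

import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: write `records` to a single HDFS file at `path`.
def writeToHdfs(path: String, records: Iterable[String]): Unit = ???

// keyed: DStream[(String, String)] of (attribute, message) pairs.
def writePerKey(keyed: DStream[(String, String)]): Unit =
  keyed.foreachRDD { (rdd, batchTime: Time) =>
    rdd.groupByKey().foreachPartition { partition =>
      partition.foreach { case (key, records) =>
        // One file per key per batch interval; compact the resulting
        // small files later with filecrush or similar.
        writeToHdfs(s"/data/$key/${batchTime.milliseconds}", records)
      }
    }
  }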
Hi,
Are there any ways to store DStreams / RDDs read from Kafka in memory to be
processed at a later time? What we need to do is to read data from Kafka,
key it by some attribute that is present in the Kafka messages, and write
out the data related to each key once we have accumulated enough of it.
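To make the stateful option from the reply above concrete, here is a rough
mapWithState sketch that buffers records per key across batches and emits a
key's buffer once it reaches a size threshold. The FlushThreshold value and
the String record type are assumptions, not from the thread; also note that
mapWithState requires checkpointing to be enabled on the StreamingContext.

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical threshold: flush a key once 1000 records have accumulated.
val FlushThreshold = 1000

// (key, new record, running buffer) => Some(batch to write) when full, else None.
def buffer(key: String,
           record: Option[String],
           state: State[Vector[String]]): Option[(String, Vector[String])] = {
  val acc = state.getOption.getOrElse(Vector.empty) ++ record.toList
  if (acc.size >= FlushThreshold) {
    state.remove()          // drop the buffer once it is emitted downstream
    Some(key -> acc)
  } else {
    state.update(acc)       // keep accumulating across batch intervals
    None
  }
}

// keyed: DStream[(String, String)] of (attribute, message) pairs.
def flushedBatches(keyed: DStream[(String, String)]): DStream[(String, Vector[String])] =
  keyed.mapWithState(StateSpec.function(buffer _)).flatMap(_.toList)

The trade-off versus the per-batch approach is the extra executor state and
checkpointing overhead, which is why the simpler route is usually preferable.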