You may actually want this implemented in a Streams app eventually; there is a KIP under discussion to support this type of incremental batch processing in Streams: https://cwiki.apache.org/confluence/display/KAFKA/KIP-95%3A+Incremental+Batch+Processing+for+Kafka+Streams
However, for now the approach you mentioned using a consumer would be the best approach. When you start up the app you can use the endOffsets API to determine what offset you should treat as the last offset: http://docs.confluent.io/3.1.1/clients/javadocs/org/apache/kafka/clients/consumer/KafkaConsumer.html#endOffsets(java.util.Collection)

In terms of memory usage, you'll simply need to process in reasonably sized blocks. If you can already handle incremental processing like this, then presumably it should be possible to create smaller sub-blocks and just run that process N times if you have too many messages. A rough sketch of this blocked approach is included at the end of this message.

-Ewen

On Sat, Dec 10, 2016 at 10:29 AM, Dominik Safaric <dominiksafa...@gmail.com> wrote:

> Hi everyone,
>
> What is among the most efficient ways to quickly consume, transform, and
> process Kafka messages? Importantly, I am not referring to nor interested in
> Streams, because the Kafka topic from which I would like to process the
> messages will eventually stop receiving messages, after which I should
> process the messages by extracting certain keys in a batch-processing-like
> manner.
>
> So far I’ve implemented a Kafka consumer group that consumes these
> messages, hashes them according to a certain key, and upon retrieval of the
> last message starts the processing script. However, I am dealing with
> exactly 100,000,000 log messages, each of 16 bytes, meaning that preserving
> 1.6GB of data in-memory, i.e. on heap, is not the most efficient approach -
> performance- and memory-wise.
>
> Regards,
> Dominik

--
Thanks,
Ewen
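
Below is a rough, untested sketch of the endOffsets-based approach in Java. It assumes a locally reachable broker, byte-array values, a hypothetical topic name "logs", and a hypothetical processBlock() step standing in for the key-extraction/hashing logic; block size and configs are placeholders to tune. It snapshots endOffsets at startup, reads each partition only up to that snapshot, and hands off bounded blocks instead of keeping all the records on the heap.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class BoundedBatchConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: broker address
        props.put("group.id", "bounded-batch-example");     // hypothetical group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        String topic = "logs";      // hypothetical topic name
        int blockSize = 1_000_000;  // tune so one block fits comfortably in memory

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Assign partitions explicitly so we control where consumption stops.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(topic, p.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            // Snapshot the end offsets at startup; treat them as the last offsets to process.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            Set<TopicPartition> remaining = new HashSet<>(partitions);
            for (TopicPartition tp : partitions) {
                if (consumer.position(tp) >= endOffsets.get(tp)) {
                    remaining.remove(tp);  // partition was already empty at startup
                }
            }

            List<ConsumerRecord<byte[], byte[]>> block = new ArrayList<>(blockSize);
            while (!remaining.isEmpty()) {
                // poll(long) matches the 0.10.x clients; newer clients use poll(Duration).
                ConsumerRecords<byte[], byte[]> records = consumer.poll(500);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    TopicPartition tp = new TopicPartition(record.topic(), record.partition());
                    long end = endOffsets.get(tp);
                    if (record.offset() < end) {
                        block.add(record);
                        if (block.size() >= blockSize) {
                            processBlock(block);  // hypothetical per-block processing step
                            block.clear();
                        }
                    }
                    if (record.offset() >= end - 1) {
                        remaining.remove(tp);  // reached the snapshotted end of this partition
                    }
                }
            }
            if (!block.isEmpty()) {
                processBlock(block);
            }
        }
    }

    private static void processBlock(List<ConsumerRecord<byte[], byte[]>> block) {
        // Placeholder: extract/hash keys and run the batch step over this bounded chunk.
    }
}

The same loop can be rerun over smaller sub-blocks if even one block is too large; only the blockSize constant and the processBlock() hand-off need to change.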