We have a use case where a high volume of messages will arrive via AWS SQS from various devices. We're planning to read these messages with Spark Structured Streaming, clean them up as needed, and write each message to Kafka. From there we intend to use the Kafka Connect S3 sink connector to push them to S3 on an hourly basis, meaning there will be a separate directory for each hour. The challenge is that, within each hourly "partition", the messages need to be sorted by a certain field (say, device_id). The reason is that we plan to create an EXTERNAL table over this data with BUCKETS on device_id, which should speed up the subsequent aggregation jobs.
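To make the hourly layout concrete, this is roughly the S3 sink connector configuration we have in mind. The topic, bucket, region, and flush settings are placeholders; the part relevant to the question is the `TimeBasedPartitioner` producing one directory per hour:

```properties
# Sketch only -- topic/bucket names are placeholders
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=device-events
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat

# Hourly "partitions": one output directory per hour of record time
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
locale=en-US
timezone=UTC
timestamp.extractor=Record

flush.size=10000
```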
Questions:

1. Does the Kafka S3 sink connector allow messages to be sorted by a particular field within a partition, or do we need to extend it?
2. Is there a better way to do this?
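For reference, the external table we are planning would look roughly like this (table name, columns, bucket count, and the S3 path are all illustrative, not final):

```sql
-- Hive-style external table over the hourly S3 output, bucketed on device_id
CREATE EXTERNAL TABLE device_events (
  device_id STRING,
  payload   STRING,
  event_ts  TIMESTAMP
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING)
CLUSTERED BY (device_id) SORTED BY (device_id) INTO 32 BUCKETS
STORED AS PARQUET
LOCATION 's3://my-bucket/topics/device-events/';
```

This is why the files under each hourly directory would need to be sorted (and ideally bucketed) by device_id for the table layout to pay off.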