We have a use case where a large volume of messages will arrive via AWS SQS from
various devices. We’re thinking of reading these messages with Spark
Structured Streaming, cleaning them up as needed, and writing each message to
Kafka. Later we plan to use the Kafka Connect S3 sink connector to push them
to S3 on an hourly basis, meaning there will be a different directory for each
hour. The challenge is that, within this hourly “partition”, the messages need
to be sorted by a certain field (say, device_id). The reason is that we plan
to create an EXTERNAL table on top of the data, BUCKETED by device_id, which
will speed up the subsequent aggregation jobs.
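To make the target layout concrete, here is a minimal sketch in plain Python (not Spark or Kafka Connect; the field names `timestamp` and `device_id` and the `dt=…/hr=…` directory naming are assumptions for illustration): messages are grouped into hourly “partitions”, and each partition is sorted by device_id, which is the clustering a bucketed external table would benefit from.

```python
from collections import defaultdict
from datetime import datetime, timezone

def layout_messages(messages):
    """Group messages into hourly 'partitions' and sort each
    partition by device_id -- the on-S3 layout we are after."""
    partitions = defaultdict(list)
    for msg in messages:
        ts = datetime.fromtimestamp(msg["timestamp"], tz=timezone.utc)
        # One directory per hour, e.g. dt=2023-11-14/hr=22
        key = ts.strftime("dt=%Y-%m-%d/hr=%H")
        partitions[key].append(msg)
    # Sort each hourly partition by device_id so downstream
    # bucketed-table reads see clustered data.
    return {k: sorted(v, key=lambda m: m["device_id"])
            for k, v in partitions.items()}

msgs = [
    {"device_id": "b", "timestamp": 1700000000, "payload": "x"},
    {"device_id": "a", "timestamp": 1700000100, "payload": "y"},
    {"device_id": "c", "timestamp": 1700003700, "payload": "z"},
]
out = layout_messages(msgs)
for part, rows in sorted(out.items()):
    print(part, [m["device_id"] for m in rows])
```

In the real pipeline this grouping/sorting would have to happen either in the connector (if it supports it) or upstream in the Spark job before the data reaches S3.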

Questions:

1) Does the Kafka Connect S3 sink connector allow messages to be sorted by a
particular field within a partition, or do we need to extend it?
2) Is there a better way to do this?
