We're thinking Kafka will allow us to scale to billions of messages per day. That's the promise of Kafka, right? No other reason, really. The main goal is to "batch" the messages per hour and create file(s) on S3 that are sorted by device_id, so that we can run further aggregations that can later be sliced and diced in a UI.
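A minimal sketch of the target layout described above, in plain Python rather than Spark or Kafka Connect code. The `dt=.../hour=...` prefix format, the `event_time` field, and the function names are assumptions made for illustration only; the thread only specifies hourly directories and sorting by device_id.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_prefix(ts: datetime) -> str:
    """Return a hypothetical S3-style key prefix for the UTC hour containing ts."""
    return ts.astimezone(timezone.utc).strftime("dt=%Y-%m-%d/hour=%H")


def batch_by_hour(messages):
    """Group messages into hourly batches, each sorted by device_id.

    Each message is assumed to be a dict with a timezone-aware
    `event_time` and a `device_id`.
    """
    batches = defaultdict(list)
    for msg in messages:
        batches[hourly_prefix(msg["event_time"])].append(msg)
    # sorted() is stable, so messages from one device keep their arrival order.
    return {
        prefix: sorted(batch, key=lambda m: m["device_id"])
        for prefix, batch in sorted(batches.items())
    }
```

Each key of the returned dict stands for one hourly directory, and each value is the content of that directory already ordered by device_id. In Spark the analogous step would presumably be `partitionBy` on the date and hour columns together with `sortWithinPartitions("device_id")` before the write; that variant is untested here.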
Feel free to suggest alternatives. Thanks.

On Thu, Apr 29, 2021 at 10:22 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Eric,
>
> On your second point "Is there a better way to do this"
>
> You are going to use Spark Structured Streaming (SSS) to clean and enrich
> the data and then push the messages to Kafka.
>
> I assume you will be using foreachBatch in this case. What purpose is there
> for Kafka to receive the enriched data from SSS? Any other reason except
> hourly partition of your data?
>
> HTH
>
> On Thu, 29 Apr 2021 at 18:07, Eric Beabes <mailinglist...@gmail.com> wrote:
>
> > We’ve a use case where lots of messages will come in via AWS SQS from
> > various devices. We’re thinking of reading these messages using Spark
> > Structured Streaming, cleaning them up as needed & saving each message on
> > Kafka. Later we’re thinking of using Kafka S3 Connector to push them to S3
> > on an hourly basis; meaning there will be a different directory for each
> > hour. Challenge is that, within this hourly “partition” the messages need
> > to be “sorted by” a certain field (let’s say device_id). Reason being,
> > we’re planning to create an EXTERNAL table on it with BUCKETS on device_id.
> > This will speed up the subsequent Aggregation jobs.
> >
> > Questions:
> >
> > 1) Does Kafka S3 Connector allow messages to be sorted by a particular
> > field within a partition – or – do we need to extend it?
> > 2) Is there a better way to do this?