Re: Kafka S3 Connector: Sort by a field within a partition

Eric Beabes Thu, 29 Apr 2021 10:34:37 -0700

We're thinking Kafka will allow us to scale to billions of messages in a
day. That's the promise of Kafka, right? No other reason really. Main goal
is to "batch" the messages per hour, create file(s) on S3 which are sorted
by device_id so that we can do more aggregations which can later be sliced
& diced using UI.


Feel free to suggest alternatives. Thanks.


On Thu, Apr 29, 2021 at 10:22 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Eric,
>
> On your second point "Is there a better way to do this"
>
> You are going to use Spark Structured Streaming (SSS) to clean and enrich
> the data and then push the messages to Kafka.
>
> I assume you will be using foreachBatch in this case. What purpose is there
> for Kafka to receive the enriched data from SSS? Any other reason except
> hourly partition of your data?
>
> HTH
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 29 Apr 2021 at 18:07, Eric Beabes <mailinglist...@gmail.com>
> wrote:
>
> > We’ve a use case where lots of messages will come in via AWS SQS from
> > various devices. We’re thinking of reading these messages using Spark
> > Structured Streaming, cleaning them up as needed & saving each message on
> > Kafka. Later we’re thinking of using Kafka S3 Connector to push them to
> S3
> > on an hourly basis; meaning there will be a different directory for each
> > hour. Challenge is that, within this hourly “partition” the messages need
> > to be “sorted by” a certain field (let’s say device_id). Reason being,
> > we’re planning to create an EXTERNAL table on it with BUCKETS on
> device_id.
> > This will speed up the subsequent Aggregation jobs.
> >
> > Questions:
> >
> > 1) Does Kafka S3 Connector allow messages to be sorted by a particular
> > field within a partition – or – do we need to extend it?
> > 2) Is there a better way to do this?
> >
>

Re: Kafka S3 Connector: Sort by a field within a partition

Reply via email to