Ok thanks for the info.

One question I forgot to ask: at what interval is the source sending
messages to Kafka to be processed inside SSS? For example, is this market
data or something similar?
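
To be clear on terms: on the SSS side the micro-batch cadence is whatever
you set in the trigger. A minimal sketch (broker and topic names are made
up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("interval-demo").getOrCreate()

// Raw stream in from Kafka; needs the spark-sql-kafka package on the path.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "device_events")
  .load()

// The processing interval is set on the write side via the trigger.
raw.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("60 seconds")) // one micro-batch a minute
  .start()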

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 29 Apr 2021 at 18:35, Eric Beabes <mailinglist...@gmail.com> wrote:

> We're thinking Kafka will allow us to scale to billions of messages in a
> day. That's the promise of Kafka, right? No other reason really. The main
> goal is to "batch" the messages per hour and create file(s) on S3, sorted
> by device_id, so that we can do more aggregations which can later be
> sliced & diced via a UI.
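>
> To make the target concrete, writing it straight from Spark would look
> something like this sketch (column names and the S3 path are assumptions,
> not our actual schema):
>
> import org.apache.spark.sql.{DataFrame, functions => F}
>
> // Intended to be called from foreachBatch; assumes the cleaned records
> // carry event_hour and device_id columns.
> def writeHourly(batchDF: DataFrame, batchId: Long): Unit = {
>   batchDF
>     .repartition(F.col("event_hour"))          // gather each hour together
>     .sortWithinPartitions(F.col("device_id"))  // sort within the hour
>     .write
>     .mode("append")
>     .partitionBy("event_hour")                 // one S3 directory per hour
>     .parquet("s3a://my-bucket/events/")
> }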
>
> Feel free to suggest alternatives. Thanks.
>
>
> On Thu, Apr 29, 2021 at 10:22 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> > Hi Eric,
> >
> > On your second point, "Is there a better way to do this":
> >
> > You are going to use Spark Structured Streaming (SSS) to clean and enrich
> > the data and then push the messages to Kafka.
> >
> > I assume you will be using foreachBatch in this case. What purpose does
> > Kafka serve in receiving the enriched data from SSS? Is there any reason
> > other than the hourly partitioning of your data?
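> >
> > A minimal sketch of what I mean by foreachBatch, assuming an active
> > SparkSession `spark` (broker address and topic name are made up; the
> > rate source stands in for your real SQS-fed stream):
> >
> > import org.apache.spark.sql.{DataFrame, functions => F}
> >
> > // Stand-in for the cleaned/enriched stream, with a fake device_id
> > // column purely for illustration.
> > val streamingDF = spark.readStream
> >   .format("rate").option("rowsPerSecond", "10").load()
> >   .withColumn("device_id", (F.col("value") % 100).cast("string"))
> >
> > streamingDF.writeStream
> >   .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
> >     batchDF
> >       .selectExpr("CAST(device_id AS STRING) AS key",
> >                   "to_json(struct(*)) AS value")
> >       .write
> >       .format("kafka")
> >       .option("kafka.bootstrap.servers", "broker:9092")
> >       .option("topic", "enriched_events")
> >       .save()
> >   }
> >   .start()
> >
> > Keying by device_id has the side benefit of keeping each device's
> > messages ordered within a Kafka partition.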
> >
> > HTH
> >
> >
> > On Thu, 29 Apr 2021 at 18:07, Eric Beabes <mailinglist...@gmail.com>
> > wrote:
> >
> > > We’ve a use case where lots of messages will come in via AWS SQS from
> > > various devices. We’re thinking of reading these messages using Spark
> > > Structured Streaming, cleaning them up as needed & saving each message
> > > on Kafka. Later we’re thinking of using the Kafka S3 Connector to push
> > > them to S3 on an hourly basis, meaning there will be a different
> > > directory for each hour. The challenge is that, within this hourly
> > > “partition”, the messages need to be “sorted by” a certain field
> > > (let’s say device_id). The reason is that we’re planning to create an
> > > EXTERNAL table on it with BUCKETS on device_id. This will speed up the
> > > subsequent aggregation jobs.
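> > >
> > > A sketch of the table we have in mind, in Spark SQL bucketing syntax
> > > (names, types, bucket count and location are all illustrative; note
> > > that Spark only exploits bucket metadata for data it wrote bucketed
> > > itself):
> > >
> > > spark.sql("""
> > >   CREATE TABLE IF NOT EXISTS device_events (
> > >     device_id  STRING,
> > >     payload    STRING,
> > >     event_hour STRING
> > >   )
> > >   USING PARQUET
> > >   PARTITIONED BY (event_hour)
> > >   CLUSTERED BY (device_id) SORTED BY (device_id) INTO 64 BUCKETS
> > >   LOCATION 's3a://my-bucket/events/'
> > > """)
> > > spark.sql("MSCK REPAIR TABLE device_events") // register hourly dirs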
> > >
> > > Questions:
> > >
> > > 1) Does the Kafka S3 Connector allow messages to be sorted by a
> > > particular field within a partition, or do we need to extend it?
> > > 2) Is there a better way to do this?
> > >
> >
>
