Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Eric Beabes
Yes, of course. If 'sort by within a partition' is not available in the Connector, we will start without it, but I was wondering whether it is, or whether someone has a better idea. On Thu, Apr 29, 2021 at 12:32 PM Mich Talebzadeh wrote: > Well, you will need to set up a pilot, test these scenarios and com

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Mich Talebzadeh
Well, you will need to set up a pilot, test these scenarios, and come up with a Minimum Viable Product (MVP). It will be difficult to give accurate answers, but I would have thought that you could have written your processed (enriched) dataframe to a database table partitioned by hourly rate and then

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Eric Beabes
Correct. The question you've asked seems to be the one we're looking for an answer to. If we set the processingTime to 60 minutes, that will require tons of memory, right? What happens if the batch fails? Do we reprocess the same hour? Not sure if this is the right approach. That's why we're thinking we wi

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Mich Talebzadeh
Let's say that you have your readStream created in SSS with socket format. Now you want to process these files/messages: result = streamingDataFrame.select( \ . writeStream. \ outputMode('append'). \ o

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Eric Beabes
The sources (devices) are sending messages to AWS SQS (not Kafka). Each message contains the path of the file on S3. (We have no control over the source. They won't change the way it's being done.) SSS will be listening to the SQS queue. We are thinking SSS will read each SQS message, get the file locati

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Mich Talebzadeh
OK, thanks for the info. One question I forgot to ask: at what streaming interval is the source sending messages to Kafka to be processed inside SSS? For example, is this market data, etc.? HTH

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Eric Beabes
We're thinking Kafka will allow us to scale to billions of messages in a day. That's the promise of Kafka, right? No other reason really. Main goal is to "batch" the messages per hour, create file(s) on S3 which are sorted by device_id so that we can do more aggregations which can later be sliced &
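
The stated goal (batch messages per hour, then sort each batch by device_id) can be illustrated with a minimal plain-Python sketch. This is not the Connector's behavior and not Spark code; it only shows the grouping and ordering being asked for. The `ts` field (epoch seconds), the `value` field, and the function name `batch_by_hour_sorted` are assumptions made for illustration; only `device_id` comes from the message above.

```python
from collections import defaultdict
from datetime import datetime, timezone


def batch_by_hour_sorted(messages):
    """Group messages into hourly buckets, then sort each bucket by device_id.

    Assumes each message is a dict with a hypothetical 'ts' field
    (epoch seconds, UTC) and a 'device_id' field.
    """
    buckets = defaultdict(list)
    for msg in messages:
        ts = datetime.fromtimestamp(msg["ts"], tz=timezone.utc)
        hour_key = ts.strftime("%Y-%m-%d-%H")  # one bucket per UTC hour
        buckets[hour_key].append(msg)
    # Sort within each hourly bucket only; buckets stay independent.
    return {
        hour: sorted(rows, key=lambda m: m["device_id"])
        for hour, rows in buckets.items()
    }


# Made-up sample data for illustration.
sample = [
    {"device_id": "d3", "ts": 1619697600, "value": 1},  # 2021-04-29 12:00 UTC
    {"device_id": "d1", "ts": 1619697900, "value": 2},  # 2021-04-29 12:05 UTC
    {"device_id": "d2", "ts": 1619701200, "value": 3},  # 2021-04-29 13:00 UTC
]

result = batch_by_hour_sorted(sample)
```

Each key of `result` would correspond to one hourly file (or set of files) on S3, with rows already ordered by device_id.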

Re: Kafka S3 Connector: Sort by a field within a partition

2021-04-29 Thread Mich Talebzadeh
Hi Eric, On your second point, "Is there a better way to do this": you are going to use Spark Structured Streaming (SSS) to clean and enrich the data and then push the messages to Kafka. I assume you will be using foreachBatch in this case. What purpose is there for Kafka to receive the enriched d
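
For reference, a foreachBatch sink that writes each micro-batch straight to S3, sorted by device_id, might look roughly like the sketch below. It is only a sketch under assumptions: `OUTPUT_PATH` is a placeholder, the `event_hour` column is hypothetical and assumed to have been derived during enrichment, and Parquet is an arbitrary choice of format. The function uses only the PySpark DataFrame methods `repartition`, `sortWithinPartitions`, and `write.partitionBy(...).mode(...).parquet(...)`.

```python
# Placeholder output location; not a real bucket (assumption).
OUTPUT_PATH = "s3a://example-bucket/enriched/"


def write_sorted_batch(batch_df, batch_id):
    """Write one micro-batch, partitioned by hour and sorted by device_id.

    batch_df is expected to be a Spark DataFrame with a hypothetical
    'event_hour' column and a 'device_id' column. batch_id is supplied by
    foreachBatch and is unused in this sketch.
    """
    (
        batch_df
        .repartition("event_hour")  # co-locate rows of the same hour
        # Sort by the partition column first so the per-hour ordering by
        # device_id is still in place when the partitioned files are written.
        .sortWithinPartitions("event_hour", "device_id")
        .write
        .partitionBy("event_hour")
        .mode("append")
        .parquet(OUTPUT_PATH)
    )
```

It would be registered on the enriched stream with something like `enriched_stream.writeStream.foreachBatch(write_sorted_batch).start()`, where `enriched_stream` is a hypothetical streaming DataFrame. Note that this sorts within each micro-batch only, so files for the same hour written by different micro-batches are not globally ordered relative to each other.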