Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-06-20 Thread Anil Dasari
Hello @Tathagata Das Could you share your thoughts on https://issues.apache.org/jira/browse/SPARK-48418 ? Let me know if you have any questions. thanks. Regards, Anil On Fri, May 24, 2024 at 12:13 AM Anil Dasari wrote: > It appears that structured streaming and Dstream have entirely different

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-24 Thread Anil Dasari
It appears that structured streaming and Dstream have entirely different microbatch metadata representation Can someone assist me in finding the following Dstream microbatch metadata equivalent in Structured streaming. 1. microbatch timestamp : structured streaming foreachBatch gives batchID which

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. - another question on migration. Is there a way to get the microbatch id during the microbatch dataset `trasform` operation like in rdd transform ? I am attempting to implement the following pseudo functionality with structured streaming. In this approach, recordCategoriesMetadata is

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence *Offset Tracking with Structured Streaming:: While storing offsets in an external storage with DStreams was necessary, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still acces

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, Ok let us put the complete picture here * Current DStreams Setup:* - Data Source: Kafka - Processing Engine: Spark DStreams - Data Transformation with Spark - Sink: S3 - Data Format: Parquet - Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associated microbatches when committing to external storage is to use the microbatch id that you can get in foreachBatch. That microbatch id guarantees that the data produced in the batch is the always the same no matter any recomputations (assuming all processing logic is determin

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mtich. Mitch, We process data from Kafka and write it to S3 in Parquet format using Dstreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at the StreamingQuery.lastProgress or use the QueryProgressListener . Both of these

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK to understand better your current model relies on streaming data input through Kafka topic, Spark does some ETL and you send to a sink, a database for file storage like HDFS etc? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark sStructured Stre

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure s

Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-21 Thread Anil Dasari
Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure streaming to get the microbatch end offsets to the checkpoint in our external checkpoint store ? Thanks in advance. Regards