Yes, but your SSS (Spark Structured Streaming) job has to be stopped gracefully first.

Originally I raised this SPIP request:

https://issues.apache.org/jira/browse/SPARK-42485

Then I requested "Adding pause() method to
pyspark.sql.streaming.StreamingQuery".

I believe both are still open.
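
In the meantime, the usual workaround is an external stop flag that the
driver polls: when the flag appears, drain what is in flight and then call
stop(). Below is a minimal sketch using processAllAvailable() and stop()
from the PySpark StreamingQuery API; the rate source, the paths and the
marker file are illustrative assumptions only, not anything Spark mandates.

import os
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("graceful_stop_demo").getOrCreate()

# Toy streaming source, just to make the sketch self-contained
df = (spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load())

query = (df.writeStream
         .format("parquet")
         .option("path", "/tmp/demo_out")                # assumed path
         .option("checkpointLocation", "/tmp/demo_chk")  # assumed path
         .start())

STOP_FLAG = "/tmp/stop_demo_query"  # hypothetical marker file

while query.isActive:
    if os.path.exists(STOP_FLAG):
        query.processAllAvailable()  # let in-flight micro-batches finish
        query.stop()                 # checkpoint is left in a consistent state
        break
    time.sleep(10)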

HTH
Mich Talebzadeh,

Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
PhD, Imperial College London
London, United Kingdom

View my LinkedIn profile:
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my
knowledge but cannot be guaranteed. As with any advice, "one test result
is worth one thousand expert opinions" (Wernher von Braun,
https://en.wikipedia.org/wiki/Wernher_von_Braun).


On Tue, 3 Dec 2024 at 18:33, "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote:

> Hi Yegor
> If you are not using the Delta format (e.g. avro/json/parquet/csv/etc.) then you
> have two options:
> #1 Clean up the WAL files (AFAIK it is the _spark_metadata folder inside your
> data folder), which requires that the SSS job be stopped before you clean
> the WAL (see the first sketch below).
> #2 Use foreachBatch to write your data, but then your SSS job will not be
> exactly-once but at-least-once (see the second sketch below).
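>
> For #1, a minimal sketch (run it only while the SSS job is stopped; the
> path is an illustrative assumption, and _spark_metadata is the metadata
> log the file sink writes inside the output directory):
>
> import shutil
>
> DATA_DIR = "/data/avro"                       # assumed output folder
> METADATA_DIR = DATA_DIR + "/_spark_metadata"  # sink metadata log (WAL)
>
> # Drop the metadata log so batch reads stop referencing deleted files;
> # spark.read on DATA_DIR then lists files directly instead of going
> # through the log.
> shutil.rmtree(METADATA_DIR, ignore_errors=True)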
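>
> And a sketch of #2 with foreachBatch (at-least-once; here df is assumed
> to be your existing streaming DataFrame and the paths are illustrative):
>
> def write_batch(batch_df, batch_id):
>     # A plain batch write: no _spark_metadata log is produced, so old
>     # partitions can be deleted later without breaking batch reads.
>     # format("avro") requires the spark-avro package on the classpath.
>     batch_df.write.mode("append").format("avro").save("/data/avro")
>
> query = (df.writeStream
>            .foreachBatch(write_batch)
>            .option("checkpointLocation", "/data/checkpoints/avro")
>            .start())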
>
> Best regards
>
> On 3 Dec 2024, at 17:07, Дубинкин Егор <dubinkine...@gmail.com> wrote:
>
> Hello Community,
>
> I need to delete old source data created by Spark Structured Streaming.
> Just deleting the relevant folder throws an exception when reading a batch
> DataFrame from the file system:
>
> java.io.FileNotFoundException: File 
> file:/data/avro/year=2020/month=3/day=13/hour=12/part-00000-0cc84e65-3f49-4686-85e3-1ecf48952794.c000.avro
>  does not exist
>
> The issue is actually the same as the one described here:
>
> https://stackoverflow.com/questions/60773445/how-to-delete-old-data-that-was-created-by-spark-structured-streaming?newreg=5cc791c48358491c88d9b2dae1e436d9
>
> I didn't find a way to delete it via the Spark API.
> Is there any way to do it via the API instead of editing the metadata
> manually?
>
> Your help would be appreciated.
>
>
