Hello!

We are facing the same issue.

Could you please elaborate on how to clean up the WALs / delete irrelevant
partitions when the SSS job is stopped?
Direct usage of FileStreamSinkLog in the StackOverflow question mentioned
below looks "hacky".
Moreover, the workaround provided in the SO question uses the
FileStreamSinkLog.DELETE_ACTION constant, which is no longer available since
Spark 3.1 (https://issues.apache.org/jira/browse/SPARK-32648), and the code
that handles the "delete" action string literal has also been removed.

The file sink's retention option looks promising, but it operates at the
file/batch level, not at the partition level.
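
For reference, a minimal sketch of how that option is wired in (PySpark; the
toy rate source, the paths, and the 30d interval are all made up, and as far
as I can tell the option needs Spark 3.1+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("retention-demo").getOrCreate()

    # Toy rate source just to make the sketch self-contained.
    events = spark.readStream.format("rate").load()

    query = (events.writeStream
             .format("parquet")                    # any file sink; avro in our case
             .option("path", "/data/out")          # output dir that gets _spark_metadata
             .option("checkpointLocation", "/data/out_chk")
             .option("retention", "30d")           # TTL applied to individual output files
             .start())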

This looks like the most basic housekeeping task for streaming jobs. Can't
SSS handle it?

Also, after examining the internals of _spark_metadata, I think it is a
really bad idea to store absolute paths inside the output folder and to use
them on the read code path. It prevents renaming folders. I have to mention
that an HDFS cluster name is just a logical name, and on the read path
different clients can have distinct names for the same HDFS cluster. Users
can even access the same dataset in the same HDFS cluster over different
protocols (hdfs/httpfs/webhdfs, etc.).

Sincerely, Andrei Lopukhov.

On Tue, 3 Dec 2024 at 22:41, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Yes but your SSS job has to be stopped gracefully.
>
> Originally I raised this SPIP request
>
> https://issues.apache.org/jira/browse/SPARK-42485
>
> Then I requested "Adding pause() method to
> pyspark.sql.streaming.StreamingQuery"
>
> I believe they are still open.
>
> HTH
> Mich Talebzadeh,
>
> Architect | Data Science | Financial Crime | GDPR & Compliance Specialist
> PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
> London <https://en.wikipedia.org/wiki/Imperial_College_London>
> London, United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Tue, 3 Dec 2024 at 18:33, "Yuri Oleynikov (יורי אולייניקוב)" <
> yur...@gmail.com> wrote:
>
>> Hi Yegor,
>> If you are not using the Delta format (e.g. avro/json/parquet/csv/etc.),
>> then you have two options:
>> #1 Clean up the WAL files (afaik it's the _spark_metadata folder inside
>> your data folder), which requires that the SSS job be stopped before you
>> clean up the WAL.
>> #2 Use foreachBatch to write your data, but then your SSS will not be
>> exactly-once but at-least-once; see the sketch below.
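>>
>> A minimal sketch of option #2 in PySpark (the paths and the
>> year/month/day/hour columns are just examples, and events stands for
>> your input streaming DataFrame):
>>
>>     def write_batch(batch_df, batch_id):
>>         # Plain batch write: no _spark_metadata sink log is produced, so
>>         # old partitions can later be removed with ordinary fs tooling.
>>         (batch_df.write
>>          .mode("append")
>>          .partitionBy("year", "month", "day", "hour")
>>          .format("avro")
>>          .save("/data/avro"))
>>
>>     query = (events.writeStream
>>              .foreachBatch(write_batch)
>>              .option("checkpointLocation", "/data/avro_chk")
>>              .start())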
>>
>> Best regards
>>
>> On 3 Dec 2024, at 17:07, Дубинкин Егор <dubinkine...@gmail.com> wrote:
>>
>> Hello Community,
>>
>> I need to delete old source data created by Spark Structured Streaming.
>> Just deleting the relevant folder throws an exception while reading a
>> batch DataFrame from the file system:
>>
>> java.io.FileNotFoundException: File
>> file:/data/avro/year=2020/month=3/day=13/hour=12/part-00000-0cc84e65-3f49-4686-85e3-1ecf48952794.c000.avro
>> does not exist
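>>
>> For context, the read that fails is just a plain batch load of the sink's
>> output directory (which picks up _spark_metadata), e.g.:
>>
>>     df = spark.read.format("avro").load("/data/avro")
>>     df.count()   # fails once files listed in the sink log have been deleted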
>>
>> The issue is actually the same as the one described here:
>>
>> https://stackoverflow.com/questions/60773445/how-to-delete-old-data-that-was-created-by-spark-structured-streaming?newreg=5cc791c48358491c88d9b2dae1e436d9
>>
>> I didn't find a way to delete it via the Spark API.
>> Are there any solutions to do it via the API instead of editing the
>> metadata manually?
>>
>> Your help would be appreciated.
>>
>>
