Thank you for your response! I misread "data lake" as "Delta Lake", my bad. Either way, I need to write output to a file system. I see your point about data lakes; however, migrations take time, so at least from this perspective I wouldn't deprecate FileStreamSink, and I hope it will still be maintained. I understand that, given the rapid development of data lakes, FileStreamSink is not a priority at all, which is why I prepared the PR to help with part of the work. The other part is the review, which I kindly ask for. IMO my PR is not a "band-aid fix" but rather a low-hanging-fruit improvement that helps with a few issues. I might be biased, obviously. :)
--
Kind regards/ Pozdrawiam,
Wojciech Indyk

On Mon, 17 Apr 2023 at 22:42, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> small correction: "I intentionally didn't enumerate." The meaning could be
> quite different so making a small correction.
>
> On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
>> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
>> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>>
>> We made non-trivial numbers of band-aid fixes already for file stream
>> sink. For example,
>>
>> https://github.com/apache/spark/pull/28363
>> https://github.com/apache/spark/pull/28904
>> https://github.com/apache/spark/pull/29505
>> https://github.com/apache/spark/pull/31638
>>
>> There were many push backs, because these fixes do not solve the real
>> problem. The consensus was that we don't want to come up with another Data
>> Lake product which requires us to put months (or maybe years) of effort.
>> Now, these Data Lake products are backed by companies and they are
>> successful projects as individuals. I'm not sure I can be supportive with
>> the effort on another band-aid fix.
>>
>> Maintaining metadata directory is a root of the headache. Unless we see
>> the benefit of removing the metadata directory (hence at-least-once) and
>> plan to deal with that, I'd like to leave file stream sink as it is.
>>
>> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk <wojciechin...@gmail.com>
>> wrote:
>>
>>> Hi Jungtaek,
>>> integration with Delta Lake is not an option to me, I raised a PR for
>>> improvement of FileStreamSink with the new parameter:
>>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>>
>>> --
>>> Kind regards/ Pozdrawiam,
>>> Wojciech Indyk
>>>
>>>
>>> On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim <kabhwan.opensou...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been indicated with lots of issues with the current FileStream
>>>> sink. The effort to fix these issues are quite significant, and it ended up
>>>> with derivation of "Data Lake" products.
>>>>
>>>> I'd recommend not to fix the issue but leave it as its limitation, and
>>>> integrate your workload with Data Lake products. For a full disclaimer, I
>>>> work in Databricks so I might be biased, but even when I was working at the
>>>> previous employer which didn't have the Data Lake product at that time, I
>>>> also had to agree that there are too many things to fix, and the effort
>>>> would be fully redundant with existing products.
>>>>
>>>> Maybe, it might be helpful to have an "at-least-once" version of
>>>> FileStream sink, where a metadata directory is no longer needed. It may
>>>> require the implementation to go back to the old way of atomic renaming,
>>>> but it will also get rid of the necessity of a metadata directory, so
>>>> someone might find it useful. For end-to-end exactly once, people can
>>>> either use a limited current FileStream sink or use Data Lake products. I
>>>> don't see the value in making improvements to the current FileStream sink.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <wojciechin...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>> I raised a ticket on parametrisable output metadata path
>>>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>>>> I am going to raise a PR against it and I realised, that this
>>>>> relatively simple change impacts on method hasMetadata(path), that would
>>>>> have a new meaning if we can define custom path for metadata of output
>>>>> files.
>>>>> Can you please share your opinion on how the custom output metadata
>>>>> path can impact on design of structured streaming?
>>>>> E.g. I can see one case when I set a parameter of output metadata
>>>>> path, run a job on output path A, stop the job, change the output path to B
>>>>> and hasMetadata works well. If you have any corner case in mind where the
>>>>> parametrised output metadata path can break something please describe it.
>>>>>
>>>>> --
>>>>> Kind regards/ Pozdrawiam,
>>>>> Wojciech Indyk
>>>>>
>>>>