Re: Parametrisable output metadata path

Wojciech Indyk Tue, 18 Apr 2023 00:19:09 -0700

Thank you for your response!
I misread "data lake" as "Delta Lake", my bad. Anyway, I need to write
output to a file system. I see your point about data lakes; however,
migrations take time, so at least from this perspective I wouldn't
deprecate FileStreamSink. I hope FileStreamSink will still be maintained. I
understand that, given the rapid development of data lakes,
FileStreamSink is not a priority at all, so I prepared the PR to help
with part of the work. The other part is the review, which I kindly ask for. IMO my PR
is not a "band-aid fix" but rather a low-hanging-fruit improvement that helps
with a few issues. I might be biased, obviously. :)


--
Kind regards/ Pozdrawiam,
Wojciech Indyk


On Mon, 17 Apr 2023 at 22:42, Jungtaek Lim <[email protected]>
wrote:

> small correction: "I intentionally didn't enumerate." The meaning could be
> quite different so making a small correction.
>
> On Tue, Apr 18, 2023 at 5:38 AM Jungtaek Lim <[email protected]>
> wrote:
>
>> There seems to be miscommunication - I didn't mean "Delta Lake". I meant
>> "any" Data Lake products. Since I'm biased I didn't intentionally enumerate
>> actual products, but there are "Apache Hudi", "Apache Iceberg", etc as well.
>>
>> We have already made a non-trivial number of band-aid fixes for the file
>> stream sink. For example,
>>
>> https://github.com/apache/spark/pull/28363
>> https://github.com/apache/spark/pull/28904
>> https://github.com/apache/spark/pull/29505
>> https://github.com/apache/spark/pull/31638
>>
>> There was a lot of pushback, because these fixes do not solve the real
>> problem. The consensus was that we don't want to come up with another Data
>> Lake product, which would require us to put in months (or maybe years) of effort.
>> Now, these Data Lake products are backed by companies and they are
>> successful projects in their own right. I'm not sure I can be supportive of
>> the effort on another band-aid fix.
>>
>> Maintaining the metadata directory is the root of the headache. Unless we see
>> the benefit of removing the metadata directory (hence at-least-once) and
>> plan to deal with that, I'd like to leave the file stream sink as it is.
>>
>> On Mon, Apr 17, 2023 at 7:37 PM Wojciech Indyk <[email protected]>
>> wrote:
>>
>>> Hi Jungtaek,
>>> integration with Delta Lake is not an option for me. I raised a PR to
>>> improve FileStreamSink with a new parameter:
>>> https://github.com/apache/spark/pull/40821. Can you please take a look?
>>>
>>> --
>>> Kind regards/ Pozdrawiam,
>>> Wojciech Indyk
>>>
>>>
>>> On Sun, 16 Apr 2023 at 04:45, Jungtaek Lim <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been made aware of lots of issues with the current FileStream
>>>> sink. The effort to fix these issues is quite significant, and it ended up
>>>> leading to the derivation of "Data Lake" products.
>>>>
>>>> I'd recommend not fixing the issue but leaving it as a limitation, and
>>>> integrating your workload with Data Lake products. For full disclosure, I
>>>> work at Databricks so I might be biased, but even when I was working at my
>>>> previous employer, which didn't have a Data Lake product at that time, I
>>>> also had to agree that there are too many things to fix, and the effort
>>>> would be fully redundant with existing products.
>>>>
>>>> Maybe, it might be helpful to have an "at-least-once" version of
>>>> FileStream sink, where a metadata directory is no longer needed. It may
>>>> require the implementation to go back to the old way of atomic renaming,
>>>> but it will also get rid of the necessity of a metadata directory, so
>>>> someone might find it useful. For end-to-end exactly-once, people can
>>>> either use the current, limited FileStream sink or use Data Lake products. I
>>>> don't see the value in making improvements to the current FileStream sink.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Sun, Apr 16, 2023 at 2:52 AM Wojciech Indyk <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>> I raised a ticket on a parametrisable output metadata path:
>>>>> https://issues.apache.org/jira/browse/SPARK-43152.
>>>>> I am going to raise a PR against it, and I realised that this
>>>>> relatively simple change impacts the method hasMetadata(path), which would
>>>>> have a new meaning if we can define a custom path for the metadata of output
>>>>> files. Can you please share your opinion on how the custom output metadata
>>>>> path can impact the design of Structured Streaming?
>>>>> E.g. I can see one case where I set the output metadata path parameter,
>>>>> run a job on output path A, stop the job, change the output path to B,
>>>>> and hasMetadata works well. If you have any corner case in mind where the
>>>>> parametrised output metadata path can break something, please describe it.
>>>>>
>>>>> --
>>>>> Kind regards/ Pozdrawiam,
>>>>> Wojciech Indyk
>>>>>
>>>>
