Puneet,

Good question. Reading v2 tables with delete files has been supported for several versions, since before we adopted the v2 additions to the spec. You should be fine when using Spark, Flink, Hive, etc. with runtime jars from the Iceberg project. Trino has yet to add support, but Jack has a couple of PRs that add it. Until then, Trino will fail when reading a table with delete files, so you won't have any correctness problems.
The support that we're building now is for plans that actually write v2 deletes. For example, Spark's DELETE FROM in 0.12.1 and earlier will rewrite whole data files (copy-on-write) instead of encoding deletes against existing data files.

Ryan

On Wed, Nov 17, 2021 at 10:56 PM Puneet Zaroo <pza...@netflix.com.invalid> wrote:

> Perhaps a newbie question, but if the requirement is to just read v2
> tables with equality and/or position delete files, does that also require
> Spark 3.2, or is that supported in Spark 2.4 as well (even if in a
> sub-optimal way)?
>
> Thanks,
> - Puneet
>
> On Wed, Nov 17, 2021 at 10:07 AM Ryan Blue <b...@tabular.io> wrote:
>
>> The plan is to support it in 3.2. I think that we're very close, but Anton
>> is the expert there.
>>
>> On Tue, Nov 16, 2021 at 6:22 AM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
>>
>>> This makes sense, thanks a lot @Ryan Blue <b...@tabular.io>.
>>>
>>> Are all building blocks for MOR support (features like delta-based
>>> plans) fully available in Spark 3.2, or is there any reason we would need
>>> Spark 3.3? Or is there more ongoing work needed to fully validate this? I
>>> am in need of this specific data point *about the Spark version* to
>>> move our organization onto the correct Spark version. Truly appreciate your
>>> help.
>>>
>>> Best regards,
>>> Sreeram
>>>
>>> On Mon, Nov 15, 2021 at 4:37 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Sreeram,
>>>>
>>>> The project tracking this is here:
>>>> https://github.com/apache/iceberg/projects/11
>>>>
>>>> It isn't easy to get a good picture, since most of the PRs are merged.
>>>> But Anton is working on the next set of PRs for Spark. Maybe Anton can find
>>>> some time to add a few notes about what's left to be done.
>>>>
>>>> What’s been done so far is pretty significant:
>>>>
>>>> - Add new writers that can handle deletes across multiple partition specs
>>>> - Add Spark 3.2 module and refactor Spark builds
>>>> - Add metadata columns to Spark 3.2
>>>> - Add support for required distribution and ordering in Spark 3.2
>>>> - Support Spark 3.2 dynamic filtering
>>>>
>>>> Many of those are the building blocks for the delta-based plans. And
>>>> it’s really amazing to finally have support for some major improvements:
>>>> dynamic filtering on all queries, metadata columns, and required
>>>> distribution and ordering!
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Nov 11, 2021 at 11:46 PM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:
>>>>
>>>>> Hello Iceberg devs!
>>>>>
>>>>> After going through the mail threads (especially "Spark version
>>>>> support strategy") and relevant PRs, it looks like *merge-on-read*
>>>>> support (i.e., Spark writers writing equality deletes) will be available
>>>>> with *Iceberg + Spark 3.2*. Is this understanding correct? Or is
>>>>> this something that will be available only with Iceberg on Spark 3.3?
>>>>>
>>>>> Would really appreciate it if someone can point me to any place
>>>>> which tracks the remaining work.
>>>>>
>>>>> Thanks,
>>>>> Sreeram
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular

--
Ryan Blue
Tabular
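The copy-on-write vs. merge-on-read distinction discussed in this thread is configured per table once write support is available. A minimal Spark SQL sketch, assuming an Iceberg release with v2 write support; `db.events` is a hypothetical table name, and the `write.delete.mode` and `format-version` property names follow the Iceberg table-properties documentation:

```sql
-- Create a v2 table; format-version 2 enables row-level delete files.
CREATE TABLE db.events (id BIGINT, data STRING)
USING iceberg
TBLPROPERTIES ('format-version' = '2');

-- Copy-on-write: DELETE FROM rewrites the data files that contain matching rows.
ALTER TABLE db.events
SET TBLPROPERTIES ('write.delete.mode' = 'copy-on-write');

-- Merge-on-read: DELETE FROM writes delete files instead of rewriting data files;
-- readers merge the deletes at scan time.
ALTER TABLE db.events
SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');

DELETE FROM db.events WHERE id < 100;
```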