Hi Puneet,

Agreed with Ryan: you can use Spark 2.4 to read Iceberg tables with delete
files. To add to this, we have recently been adding vectorized read support
in Spark 3.2, which is 1.6 to 2 times faster than the non-vectorized read
(the existing solution in Spark 2.4):
1. Position delete support https://github.com/apache/iceberg/pull/3287
(Merged)
2. Equality delete support https://github.com/apache/iceberg/pull/3557 (WIP)

It would be easy to backport them to Spark 2.4 if you need them.
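
To illustrate (the catalog and table names below are made up): reading a v2
table with delete files applied needs no special syntax; the Iceberg runtime
merges position and equality deletes into the scan, so a plain query returns
the table's current (post-delete) rows:

```sql
-- Deletes are applied during the scan; no extra syntax is needed:
SELECT * FROM my_catalog.db.events WHERE event_date = DATE '2021-11-18';
```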

Best,

Yufei

`This is not a contribution`


On Thu, Nov 18, 2021 at 8:38 AM Puneet Zaroo <pza...@netflix.com.invalid>
wrote:

> Thanks Ryan,
> This is super helpful to know. Yes, the discussion about 'plans' in Spark
> 3.2 made me think it could be for read support.
> For the Presto read support, could you (or Jack) please point to the PRs
> that are work-in-progress.
> Thanks,
> - Puneet
>
> On Thu, Nov 18, 2021 at 8:26 AM Ryan Blue <b...@tabular.io> wrote:
>
>> Puneet,
>>
>> Good question. Reading v2 tables with delete files has been supported for
>> several versions, since before we adopted the v2 additions to the spec. You
>> should be fine when using Spark, Flink, Hive, etc. with runtime Jars from
>> the Iceberg project. Trino has yet to add support, but Jack has a couple
>> of PRs that add it. Until then, Trino will fail when reading a table with
>> delete files, so you won't have any correctness problems.
>>
>> The support that we're building now is for plans that actually write v2
>> deletes. For example, Spark's DELETE FROM in 0.12.1 and earlier will
>> rewrite whole data files (copy on write) instead of encoding deletes
>> against existing data files.
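>>
>> As a sketch of the difference (table name and filter are hypothetical, and
>> the write.delete.mode property here is my understanding of how the new
>> behavior is configured):
>>
>> ```sql
>> -- In 0.12.1 and earlier, this rewrites every data file that contains a
>> -- matching row (copy-on-write):
>> DELETE FROM my_catalog.db.events WHERE event_date < DATE '2021-01-01';
>>
>> -- The new v2 write support can instead encode deletes against existing
>> -- data files when the table is configured for merge-on-read:
>> ALTER TABLE my_catalog.db.events
>> SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read');
>> ```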
>>
>> Ryan
>>
>> On Wed, Nov 17, 2021 at 10:56 PM Puneet Zaroo <pza...@netflix.com.invalid>
>> wrote:
>>
>>> Perhaps a newbie question, but if the requirement is just to read v2
>>> tables with equality and/or position delete files, does that also require
>>> Spark 3.2, or is that supported in Spark 2.4 as well (even if in a
>>> suboptimal way)?
>>>
>>> Thanks,
>>> - Puneet
>>>
>>>
>>> On Wed, Nov 17, 2021 at 10:07 AM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> The plan is to support it in 3.2. I think that we're very close but
>>>> Anton is the expert there.
>>>>
>>>> On Tue, Nov 16, 2021 at 6:22 AM Sreeram Garlapati <
>>>> gsreeramku...@gmail.com> wrote:
>>>>
>>>>> This makes sense, thanks a lot @Ryan Blue <b...@tabular.io>.
>>>>>
>>>>> Are all the building blocks for MOR support (features like delta-based
>>>>> plans) fully available in Spark 3.2, or is there any reason we would need
>>>>> Spark 3.3? Or is there more ongoing work needed to fully validate this? I
>>>>> need this specific data point *about the Spark version* to move our
>>>>> organization to the correct Spark version. Truly appreciate your help.
>>>>>
>>>>> Best regards,
>>>>> Sreeram
>>>>>
>>>>> On Mon, Nov 15, 2021 at 4:37 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Sreeram,
>>>>>>
>>>>>> The project tracking this is here:
>>>>>> https://github.com/apache/iceberg/projects/11
>>>>>>
>>>>>> It isn’t easy to get a good picture, since most of the PRs are
>>>>>> merged. But Anton is working on the next set of PRs for Spark. Maybe 
>>>>>> Anton
>>>>>> can find some time to add a few notes about what's left to be done.
>>>>>>
>>>>>> What’s been done so far is pretty significant:
>>>>>>
>>>>>>    - Add new writers that can handle deletes across multiple
>>>>>>    partition specs
>>>>>>    - Add Spark 3.2 module and refactor Spark builds
>>>>>>    - Add metadata columns to Spark 3.2
>>>>>>    - Add support for required distribution and ordering in Spark 3.2
>>>>>>    - Support Spark 3.2 dynamic filtering
>>>>>>
>>>>>> Many of those are the building blocks for the delta-based plans. And
>>>>>> it’s really amazing to finally have support for some major improvements:
>>>>>> dynamic filtering on all queries, metadata columns, and required
>>>>>> distribution and ordering!
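>>>>>>
>>>>>> As one small illustration of the metadata columns (the table name is
>>>>>> hypothetical; _file and _pos are the reserved metadata column names as
>>>>>> I understand them), each row can now report where it came from:
>>>>>>
>>>>>> ```sql
>>>>>> -- Returns the data file path and row position for each row:
>>>>>> SELECT _file, _pos FROM my_catalog.db.events;
>>>>>> ```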
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Nov 11, 2021 at 11:46 PM Sreeram Garlapati <
>>>>>> gsreeramku...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Iceberg devs!
>>>>>>>
>>>>>>> After going through the mail threads (especially "Spark version
>>>>>>> support strategy") and relevant PRs, it looks like *Merge on Read*
>>>>>>> support (i.e., Spark writers writing equality deletes) will be
>>>>>>> available with *Iceberg + Spark 3.2*. Is this understanding correct?
>>>>>>> Or is this something that will be available only with Iceberg on
>>>>>>> Spark 3.3?
>>>>>>>
>>>>>>> Would really appreciate it if someone could point me to a place that
>>>>>>> tracks the remaining work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sreeram
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
