Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

Ryan Blue Fri, 05 Mar 2021 08:56:00 -0800

I updated the invites. Sorry for the mixup!

On Fri, Mar 5, 2021 at 2:10 AM webdev.andrei <[email protected]>
wrote:


> Hi all,
>
> I would like to attend the discussion. I'm very interested in it, as I'm
> working with Miao's team on indexing. The PR for Iceberg support in
> Hyperspace that Miao referred to is my work.
>
> If needed I can explain how Hyperspace works and what’s the plan with
> Hyperspace for the near future.
>
> You can add me either with this email (personal email) or
> [email protected], or both.
>
> Thanks!
>
> Andrei Ionescu
>
>
>
>
>
> On Thu, Mar 4, 2021 at 8:55 PM Guy Khazma <[email protected]> wrote:
>
>> Hi Miao,
>>
>> I am looking forward to discussing this in the meeting.
>> I think these are valid concerns, and there is a tradeoff between the
>> convenience of collecting and tracking the indexes per file independently
>> and the performance overhead of keeping them separate when they are used
>> at run time.
>> One possible approach is to use Iceberg to save the metadata, and to use
>> compaction in Iceberg to merge the indexes into a consolidated
>> location.
>>
>> Thanks,
>> Guy
>>
>> On 2021/03/04 04:30:53, Miao Wang <[email protected]> wrote:
>> > It works for me.
>> >
>> > On a quick first read, there may be a few concerns about consolidated
>> > storage.
>> >
>> > 1) Maintaining the consolidated storage may be a bit more complex;
>> > 2) It may make collecting the index while writing data files (i.e., online
>> > index building) more complex (e.g., we need to consider multiple
>> > writers writing to the same consolidated index file in parallel);
>> > 3) We need some auxiliary structure in the index file to
>> > quickly locate the relevant index given some key (e.g., a data file name).
>> >
>> > However, I do think consolidated storage is a meaningful
>> > on-disk optimization. If we properly design a splittable and mergeable
>> > index file format, the consolidated approach and 1-data-file-1-index (1:1
>> > index file) are not mutually exclusive. The 1:1 index file can then be the
>> > building block for larger consolidated index files and for indexes at
>> > different levels, such as a partition-level index.
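
To make the "1:1 index as building block" idea concrete, here is a minimal Python sketch. It is not taken from the thread, the design doc, or Iceberg itself: `FileIndex`, `merge`, `split`, `partition_summary`, and `files_to_scan` are hypothetical names, and a single-column min/max range stands in for whatever index format is eventually chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FileIndex:
    """Hypothetical 1:1 index: min/max of one column for one data file."""
    data_file: str
    min_value: int
    max_value: int


def merge(indexes: List[FileIndex]) -> Dict[str, FileIndex]:
    """Consolidate 1:1 indexes into one structure keyed by data file name."""
    return {idx.data_file: idx for idx in indexes}


def split(consolidated: Dict[str, FileIndex]) -> List[FileIndex]:
    """Recover the original 1:1 indexes from a consolidated index."""
    return list(consolidated.values())


def partition_summary(consolidated: Dict[str, FileIndex]) -> Tuple[int, int]:
    """Derive a coarser (e.g. partition-level) range from the per-file ranges.

    Assumes the consolidated index holds at least one entry.
    """
    return (
        min(idx.min_value for idx in consolidated.values()),
        max(idx.max_value for idx in consolidated.values()),
    )


def files_to_scan(consolidated: Dict[str, FileIndex], value: int) -> List[str]:
    """Data skipping: keep only files whose range may contain `value`."""
    return [
        name
        for name, idx in consolidated.items()
        if idx.min_value <= value <= idx.max_value
    ]


# Usage with made-up file names and ranges.
a = FileIndex("a.parquet", 1, 10)
b = FileIndex("b.parquet", 20, 30)
consolidated = merge([a, b])
candidates = files_to_scan(consolidated, 25)
```

Keying the consolidated structure by data file name doubles as the lookup structure raised in concern 3, under the assumption that the file name is the lookup key.
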
>> >
>> > One of our team members made a pass over the design and shared some
>> > thoughts with me. I will complete my own pass.
>> >
>> > Thanks!
>> >
>> > Miao
>> >
>> >
>> > From: Ryan Blue <[email protected]>
>> > Date: Wednesday, March 3, 2021 at 6:08 PM
>> > To: OpenInx <[email protected]>
>> > Cc: Iceberg Dev List <[email protected]>
>> > Subject: Re: Secondary Indexes - Pluggable File Filter interface for
>> Apache Iceberg
>> > Great, thank you for planning to join! I definitely want to get your
>> input on this as well.
>> >
>> > On Wed, Mar 3, 2021 at 6:06 PM OpenInx <[email protected]> wrote:
>> > It will be 1:00 AM (China Standard Time) on 18 March, and it works
>> > for those of us in Asia. I'd love to attend this discussion. Thanks.
>> >
>> > On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue <[email protected]>
>> wrote:
>> > Thanks for putting this together, Guy! I just did a pass over the doc
>> and it looks like a really reasonable proposal for being able to inject
>> custom file filter implementations.
>> >
>> > One of the main things we need to think about is how to store and track
>> > the index data. There's a comment in the doc about storing them in a
>> > "consolidated fashion" and I'd like to hear more about what you're thinking
>> > there. The index-per-file approach that Adobe is working on is a good way
>> > to track index data: each index is tied to an immutable data file, so it
>> > gets a clear lifecycle. On the other hand, the
>> > drawback is that we have a lot of index files -- one per data file.
>> >
>> > Let's set up a time to go talk through the options. Would 9AM PST
>> (17:00 UTC) on 17 March work for everyone? I'm thinking in the morning so
>> everyone from IBM can attend. We can do a second discussion at a time that
>> works more for people in Asia later on as well.
>> >
>> > If that day works, then I'll send out an invite.
>> >
>> > On Fri, Feb 19, 2021 at 8:49 AM Guy Khazma <[email protected]> wrote:
>> > Hi All,
>> >
>> > Following up on our discussion at Wednesday's sync, attached is a
>> > proposal to enhance Iceberg with a pluggable interface for data skipping
>> > indexes, enabling the use of existing indexes in job planning.
>> >
>> >
>> https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?usp=sharing
>> >
>> > We will be glad to get your feedback.
>> >
>> > Thanks,
>> > Guy
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
