Hi Rahil and Jack, thanks for raising this.
My question is: pagination and sharding will make the REST server stateful, i.e. a sequence of calls is required to go to the same server. In that case, how do we ensure the scalability of the REST server?

On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey Rahil and Jack,
>
> Thanks for bringing this up. Ryan and I also discussed this briefly in the early days of PyIceberg, and it would have helped a lot with the speed of development. We went for the traditional approach because that would also support all the other catalogs, but now that the REST catalog is taking off, I think it still makes a lot of sense to get it in.
>
> I do share the concern raised by Ryan around the concepts of shards and pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live in a single process today, the concept of shards doesn't add value. I see your concern with long-running jobs, but for the non-distributed cases it will add additional complexity.
>
> Some suggestions that come to mind:
>
> - Stream the tasks directly back using a chunked response, reducing the latency to the first task. This would also solve the pagination question. The only downside I can think of is delete files, where you first need to make sure there are deletes relevant to the task; this might increase latency to the first task.
> - Make the sharding optional. If you want to shard, you call CreateScan first and then call GetScanTask with the IDs. If you don't want to shard, you omit the shard parameter and fetch the tasks directly (here we also need to replace the scan string with the full column/expression/snapshot-id, etc.).
>
> Looking forward to discussing this tomorrow in the community sync <https://iceberg.apache.org/community/#iceberg-community-events>!
>
> Kind regards,
> Fokko
>
> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>:
>
>> Hi Ryan, thanks for the feedback!
>>
>> I was a part of this design discussion internally and can provide more details. One reason for separating the CreateScan operation was to make the API asynchronous and thus keep HTTP communications short. Consider the case where we only have the GetScanTasks API and there is no shard specified. It might take tens of seconds, or even minutes, to read through the manifest list and all the manifests before being able to return anything. This means the HTTP connection has to remain open during that period, which is not really a good practice in general (consider connection failures, load balancer and proxy load, etc.). And when we shift the API to asynchronous, it basically becomes something like the proposal, where a stateful ID is generated so the service can return to the client immediately, and the client gets results by referencing the ID. So in our current prototype implementation we are actually keeping this ID, and the whole REST service is stateful.
>>
>> We had some thoughts about the possibility of defining a "shard ID generator" protocol: basically, the client agrees with the service on a way to deterministically generate shard IDs, and the service uses it to create shards. That sounds like what you are suggesting here, and it pushes the responsibility to the client side to determine the parallelism. But in some bad cases (e.g. there are many delete files and we need to read all of those in each shard to apply filters), it seems like there might still be the long open connection issue above. What is your thought on that?
>>
>> -Jack
>>
>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Rahil, thanks for working on this. It has some really good ideas that we hadn't considered before, like a way for the service to plan how to break up the work of scan planning. I really like that idea because it makes it much easier for the service to keep memory consumption low across requests.
>>>
>>> My primary feedback is that I think it's a little too complicated (with both sharding and pagination) and could be modified slightly so that the service doesn't need to be stateful. If the service isn't necessarily stateful, then it should be easier to build implementations.
>>>
>>> To make it possible for the service to be stateless, I'm proposing that rather than creating shard IDs that are tracked by the service, the information for a shard can be sent to the client. My assumption here is that most implementations would create shards by reading the manifest list, filtering on partition ranges, and creating a shard for some reasonable size of manifest content. For example, if a table has 100MB of metadata in 25 manifests that are about 4 MB each, then it might create 9 shards with 1-4 manifests each. The service could send those shards to the client as a list of manifests to read, and the client could send the shard information back to the service to get the data files in each shard (along with the original filter).
>>>
>>> There's a slight trade-off that the protocol needs to define how to break the work into shards. I'm interested in hearing if that would work with how you were planning on building the service on your end. Another option is to let the service send back arbitrary JSON that would get returned for each shard. Either way, I like that this would make it so the service doesn't need to persist anything. We could also make it so that small tables don't require multiple requests. For example, a client could call the route to get file tasks with just a filter.
>>>
>>> What do you think?
>>>
>>> Ryan
>>>
>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil <rcher...@amazon.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> My name is Rahil Chertara, and I'm part of the Iceberg team at Amazon EMR and Athena. I'm reaching out to share a proposal for a new Scan API that will be utilized by the RESTCatalog. The process of table scan planning is currently done within client engines such as Apache Spark. By moving scan functionality to the RESTCatalog, we can integrate Iceberg table scans with external services, which can lead to several benefits.
>>>>
>>>> For example, we can leverage caching and indexes on the server side to improve planning performance. Furthermore, by moving this scan logic to the RESTCatalog, non-JVM engines can integrate more easily. This can all be found in the detailed proposal below. Feel free to comment and add your suggestions.
>>>>
>>>> Detailed proposal: https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>
>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>
>>>> Regards,
>>>>
>>>> Rahil Chertara
>>>> Amazon EMR & Athena
>>>> rcher...@amazon.com
>>>
>>> --
>>> Ryan Blue
>>> Tabular
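
To make the trade-off in this thread concrete, below is a minimal sketch of the asynchronous, stateful flow Jack describes, assuming a hypothetical CreateScan route that immediately returns a server-generated scan ID plus shard IDs, and a paginated task route keyed by that ID. The base URL, route names, and payload fields are illustrative assumptions, not part of the actual proposal or the Iceberg REST spec.

    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"

    # Hypothetical asynchronous flow: CreateScan returns right away with a
    # server-generated scan ID and shard IDs, so the HTTP call stays short
    # while the service plans shards in the background.
    create = requests.post(
        f"{BASE}/scans",
        json={"snapshot-id": 123456789, "filter": "event_date >= '2023-12-01'"},
    )
    create.raise_for_status()
    scan = create.json()
    scan_id = scan["scan-id"]        # opaque server-side state
    shard_ids = scan["shard-ids"]    # e.g. ["shard-0", "shard-1", ...]

    # Tasks for each shard are fetched with page tokens; every call must reach
    # a server that can resolve scan_id.
    tasks = []
    for shard_id in shard_ids:
        page_token = None
        while True:
            params = {"shard-id": shard_id}
            if page_token:
                params["page-token"] = page_token
            resp = requests.get(f"{BASE}/scans/{scan_id}/tasks", params=params)
            resp.raise_for_status()
            page = resp.json()
            tasks.extend(page["file-scan-tasks"])
            page_token = page.get("next-page-token")
            if not page_token:
                break

Because the scan ID only means something to the server that created it, every follow-up call has to be routed to a server that can resolve it, which is exactly the scalability question raised at the top of this thread.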
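
By contrast, here is a sketch of the stateless variant Ryan outlines, where the shard descriptor (here, just a list of manifests) is handed to the client and echoed back together with the original filter, so nothing needs to persist on the service between calls. Again, the routes and field names are assumptions for illustration only.

    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"
    FILTER = "event_date >= '2023-12-01'"

    # One planning call: the service reads the manifest list, groups manifests
    # into shards of a reasonable size, and returns the shard contents.
    plan = requests.post(f"{BASE}/plan", json={"filter": FILTER})
    plan.raise_for_status()
    shards = plan.json()["shards"]   # e.g. [{"manifests": ["m1.avro", "m2.avro"]}, ...]

    # The client echoes each shard descriptor back with the original filter to
    # get the file tasks for that shard; the service keeps no state between
    # calls, so any replica behind a load balancer can answer any request.
    tasks = []
    for shard in shards:
        resp = requests.post(
            f"{BASE}/tasks",
            json={"filter": FILTER, "manifests": shard["manifests"]},
        )
        resp.raise_for_status()
        tasks.extend(resp.json()["file-scan-tasks"])

A small table could skip the planning call entirely and hit the tasks route with just a filter, as Ryan suggests.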
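
Finally, a sketch of Fokko's first suggestion, streaming tasks back over a chunked response. The newline-delimited JSON encoding and the route are again assumed purely for the sake of the example.

    import json
    import requests

    BASE = "https://rest-catalog.example.com/v1/namespaces/db/tables/events"

    # Hypothetical streaming variant: the service emits file scan tasks as
    # newline-delimited JSON while it is still reading manifests, so the client
    # sees the first task quickly and no shard or pagination state is needed.
    with requests.post(
        f"{BASE}/tasks",
        json={"filter": "event_date >= '2023-12-01'"},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                task = json.loads(line)
                print(task["data-file"]["file-path"])

The open question with this variant, as Fokko notes, is that relevant delete files may need to be resolved before the first task can be emitted, which pushes latency back up.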