Re: Proposal for REST APIs for Iceberg table scans

Jack Ye Wed, 13 Dec 2023 11:52:52 -0800

The current proposal definitely makes the server stateful. In our prototype
we used other components like DynamoDB to keep track of states. If keeping
it stateless is a tenant we can definitely make the proposal closer to that
direction. Maybe one thing to make sure is, is this a core tenant of the
REST spec? Today we do not even have an official reference implementation
of the REST server, I feel it is hard to say what are the core tenants.
Maybe we should create one?


Pagination is a common issue in the REST spec. We also see similar
limitations with other APIs like GetTables, GetNamespaces. When a catalog
has many namespaces and tables it suffers from the same issue. It is also
not ideal for use cases like web browsers, since typically you display a
small page of results and do not need the full list immediately. So I feel
we cannot really avoid some state to be kept for those use cases.

Chunked response might be a good way to work around it. We also thought
about using HTTP2. However, these options seem to be not very compatible
with OpenAPI. We can do some further research in this domain, would really
appreciate it if anyone has more insights and experience with OpenAPI that
can provide some suggestions.

-Jack



On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <[email protected]> wrote:

> Hi, Rahi and Jack:
>
> Thanks for raising this.
>
> My question is that the pagination and sharding will make the rest server
> stateful, e.g. a sequence of calls is required to go to the same server. In
> this case, how do we ensure the scalability of the rest server?
>
>
> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <[email protected]> wrote:
>
>> Hey Rahil and Jack,
>>
>> Thanks for bringing this up. Ryan and I also discussed this briefly in
>> the early days of PyIceberg and it would have helped a lot in the speed of
>> development. We went for the traditional approach because that would also
>> support all the other catalogs, but now that the REST catalog is taking
>> off, I think it still makes a lot of sense to get it in.
>>
>> I do share the concern raised Ryan around the concepts of shards and
>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB) that are
>> living in a single process today the concept of shards doesn't add value. I
>> see your concern with long-running jobs, but for the non-distributed cases,
>> it will add additional complexity.
>>
>> Some suggestions that come to mind:
>>
>>    - Stream the tasks directly back using a chunked response, reducing
>>    the latency to the first task. This would also solve things with the
>>    pagination. The only downside I can think of is having delete files where
>>    you first need to make sure there are deletes relevant to the task, this
>>    might increase latency to the first task.
>>    - Making the sharding optional. If you want to shard you call the
>>    CreateScan first and then call the GetScanTask with the IDs. If you don't
>>    want to shard, you omit the shard parameter and fetch the tasks directly
>>    (here we need also replace the scan string with the full
>>    column/expression/snapshot-id etc).
>>
>> Looking forward to discussing this tomorrow in the community sync
>> <https://iceberg.apache.org/community/#iceberg-community-events>!
>>
>> Kind regards,
>> Fokko
>>
>>
>>
>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <[email protected]>:
>>
>>> Hi Ryan, thanks for the feedback!
>>>
>>> I was a part of this design discussion internally and can provide more
>>> details. One reason for separating the CreateScan operation was to make the
>>> API asynchronous and thus keep HTTP communications short. Consider the case
>>> where we only have GetScanTasks API, and there is no shard specified. It
>>> might take tens of seconds, or even minutes to read through all the
>>> manifest list and manifests before being able to return anything. This
>>> means the HTTP connection has to remain open during that period, which is
>>> not really a good practice in general (consider connection failure, load
>>> balancer and proxy load, etc.). And when we shift the API to asynchronous,
>>> it basically becomes something like the proposal, where a stateful ID is
>>> generated to be able to immediately return back to the client, and the
>>> client get results by referencing the ID. So in our current prototype
>>> implementation we are actually keeping this ID and the whole REST service
>>> is stateful.
>>>
>>> There were some thoughts we had about the possibility to define a "shard
>>> ID generator" protocol: basically the client agrees with the service a way
>>> to deterministically generate shard IDs, and service uses it to create
>>> shards. That sounds like what you are suggesting here, and it pushes the
>>> responsibility to the client side to determine the parallelism. But in some
>>> bad cases (e.g. there are many delete files and we need to read all those
>>> in each shard to apply filters), it seems like there might still be the
>>> long open connection issue above. What is your thought on that?
>>>
>>> -Jack
>>>
>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <[email protected]> wrote:
>>>
>>>> Rahil, thanks for working on this. It has some really good ideas that
>>>> we hadn't considered before like a way for the service to plan how to break
>>>> up the work of scan planning. I really like that idea because it makes it
>>>> much easier for the service to keep memory consumption low across requests.
>>>>
>>>> My primary feedback is that I think it's a little too complicated (with
>>>> both sharding and pagination) and could be modified slightly so that the
>>>> service doesn't need to be stateful. If the service isn't necessarily
>>>> stateful then it should be easier to build implementations.
>>>>
>>>> To make it possible for the service to be stateless, I'm proposing that
>>>> rather than creating shard IDs that are tracked by the service, the
>>>> information for a shard can be sent to the client. My assumption here is
>>>> that most implementations would create shards by reading the manifest list,
>>>> filtering on partition ranges, and creating a shard for some reasonable
>>>> size of manifest content. For example, if a table has 100MB of metadata in
>>>> 25 manifests that are about 4 MB each, then it might create 9 shards with
>>>> 1-4 manifests each. The service could send those shards to the client as a
>>>> list of manifests to read and the client could send the shard information
>>>> back to the service to get the data files in each shard (along with the
>>>> original filter).
>>>>
>>>> There's a slight trade-off that the protocol needs to define how to
>>>> break the work into shards. I'm interested in hearing if that would work
>>>> with how you were planning on building the service on your end. Another
>>>> option is to let the service send back arbitrary JSON that would get
>>>> returned for each shard. Either way, I like that this would make it so the
>>>> service doesn't need to persist anything. We could also make it so that
>>>> small tables don't require multiple requests. For example, a client could
>>>> call the route to get file tasks with just a filter.
>>>>
>>>> What do you think?
>>>>
>>>> Ryan
>>>>
>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at
>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a new Scan
>>>>> API that will be utilized by the RESTCatalog. The process for table scan
>>>>> planning is currently done within client engines such as Apache Spark. By
>>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg
>>>>> table scans with external services, which can lead to several benefits.
>>>>>
>>>>> For example, we can leverage caching and indexes on the server side to
>>>>> improve planning performance. Furthermore, by moving this scan logic to 
>>>>> the
>>>>> RESTCatalog, non-JVM engines can integrate more easily. This all can be
>>>>> found in the detailed proposal below. Feel free to comment, and add your
>>>>> suggestions .
>>>>>
>>>>> Detailed proposal:
>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>
>>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>>
>>>>> Regards,
>>>>>
>>>>> Rahil Chertara
>>>>> Amazon EMR & Athena
>>>>> [email protected]
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>

Re: Proposal for REST APIs for Iceberg table scans

Reply via email to