Jack,

It sounds like what I’m proposing isn’t quite clear because your initial
response was arguing for a sharding capability. I agree that sharding is a
good idea. I’m less confident about two points:
   1. Requiring that the service be stateful. As Renjie pointed out, that
   makes it harder to scale the service.
   2. The need for both pagination *and* sharding as separate things.

I also think that Fokko has a good point about keeping things simple and
not requiring the CreateScan endpoint.

For the first point, I’m proposing that we still have a CreateScan
endpoint, but instead of sending back only a list of shard IDs, it can also
send either a standard shard “task” or an optional JSON definition. For the
sake of an example, let’s assume we can send arbitrary JSON. Say I have a
table with four manifests, A through D, and that C and D match some query
filter. When I call the CreateScan endpoint, the service would send back
tasks carrying that information: {"id": 1, "manifests": ["C"]}, {"id": 2,
"manifests": ["D"]}.
By sending what the shards mean (the manifests to read), my service can be
stateless: any node can get a request for shard 1, read manifest C, and
send back the resulting data files.
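
To make that concrete, here is a rough sketch of what the exchange might
look like. The field names here ("filter", "snapshot-id", "tasks") are
hypothetical, not taken from the spec or the gist; the point is only that
the response describes what each shard contains instead of an opaque ID:

    CreateScan request:
      {"filter": <original filter>, "snapshot-id": 123}

    CreateScan response:
      {"tasks": [{"id": 1, "manifests": ["C"]},
                 {"id": 2, "manifests": ["D"]}]}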

I don’t see much of an argument against doing this *in principle*. It gives
you the flexibility to store state if you choose or to send state to the
client for it to pass back when calling the GetTasks endpoint. There is a
practical problem: it’s awkward to send a GET request with a JSON payload
because GET isn’t supposed to carry a request body. It’s probably obvious,
but I’m not a REST purist, so I’d be fine using POST or QUERY for this.
It would look something like this Gist
<https://gist.github.com/rdblue/d2b65bd2ad20f85ee9d04ccf19ac8aba>.
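
Sketching that round trip under the same assumptions (field names are
hypothetical; the gist shows the real shape), the client just echoes a task
back along with the original filter:

    GetTasks request (POST, since it carries a JSON body):
      {"filter": <original filter>, "task": {"id": 1, "manifests": ["C"]}}

    GetTasks response:
      {"file-scan-tasks": [ ...data files read from manifest C... ]}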

In your last reply, you also asked whether a stateless service is a goal. I
don’t think that it is, but if we can make simple changes to the spec to
allow more flexibility on the server side, I think that’s a good direction.
You also asked about a reference implementation and I consider
CatalogHandlers
<https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java>
to be that reference. It does everything except for the work done by your
choice of web application framework. It isn’t stateless, but it only relies
on a Catalog implementation for persistence.

For the second point, I don’t understand why we need both sharding and
pagination. That is, if we have a protocol that allows sharding, why is
pagination also needed? From my naive perspective on how sharding would
work, we should be able to use metadata from the manifest list to limit the
potential number of data files in a given shard. As long as we can cap the
size of a shard by producing more shards, pagination seems like an
unnecessary complication.
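
As a sketch of what I mean by limiting shard size (the numbers and manifest
names here are made up): if the manifest list shows 25 manifests of roughly
4 MB each and the service wants at most ~16 MB of manifest metadata per
shard, it simply returns more tasks instead of paginating:

    {"tasks": [{"id": 1, "manifests": ["m01", "m02", "m03", "m04"]},
               {"id": 2, "manifests": ["m05", "m06", "m07", "m08"]},
               ...
               {"id": 7, "manifests": ["m25"]}]}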

Lastly, for Fokko’s point, I think another easy extension to the proposal
is to support a direct call to GetTasks. There’s a trade-off here, but if
you’re already sending the original filter along with the request (in order
to filter records from manifest C, for instance), then the request is
already something the protocol can express. There’s an objection concerning
resource consumption on the service and responses that are too large or
take too long, but we can get around that by responding with a status code,
such as 413 (Payload too large), that instructs the client to fall back to
the CreateScan API. I think that would allow simple clients to function for all but
really large tables. The gist above also shows what this might look like.
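
Concretely, a simple client might do something like this (again, the exact
route and status handling are hypothetical):

    GetTasks request with no shard, just the filter:
      {"filter": <original filter>}

    Small table:  200 OK with the full list of file scan tasks
    Large table:  413 (Payload too large), so the client falls back to
                  calling CreateScan and fetching shards one at a time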

Ryan

On Wed, Dec 13, 2023 at 11:53 AM Jack Ye <yezhao...@gmail.com> wrote:

> The current proposal definitely makes the server stateful. In our
> prototype we used other components like DynamoDB to keep track of states.
> If keeping it stateless is a tenet we can definitely make the proposal
> closer to that direction. Maybe one thing to make sure is: is this a core
> tenet of the REST spec? Today we do not even have an official reference
> implementation of the REST server, so I feel it is hard to say what the
> core tenets are. Maybe we should create one?
>
> Pagination is a common issue in the REST spec. We also see similar
> limitations with other APIs like GetTables, GetNamespaces. When a catalog
> has many namespaces and tables it suffers from the same issue. It is also
> not ideal for use cases like web browsers, since typically you display a
> small page of results and do not need the full list immediately. So I feel
> we cannot really avoid keeping some state for those use cases.
>
> Chunked response might be a good way to work around it. We also thought
> about using HTTP2. However, these options do not seem very compatible
> with OpenAPI. We can do some further research in this domain, and we would
> really appreciate suggestions from anyone who has more insight and
> experience with OpenAPI.
>
> -Jack
>
>
>
> On Tue, Dec 12, 2023 at 6:21 PM Renjie Liu <liurenjie2...@gmail.com>
> wrote:
>
>> Hi, Rahi and Jack:
>>
>> Thanks for raising this.
>>
>> My question is that pagination and sharding will make the REST server
>> stateful, e.g. a sequence of calls is required to go to the same server. In
>> that case, how do we ensure the scalability of the REST server?
>>
>>
>> On Wed, Dec 13, 2023 at 4:09 AM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hey Rahil and Jack,
>>>
>>> Thanks for bringing this up. Ryan and I also discussed this briefly in
>>> the early days of PyIceberg and it would have helped a lot in the speed of
>>> development. We went for the traditional approach because that would also
>>> support all the other catalogs, but now that the REST catalog is taking
>>> off, I think it still makes a lot of sense to get it in.
>>>
>>> I do share the concern raised by Ryan around the concepts of shards and
>>> pagination. For PyIceberg (but also for Go, Rust, and DuckDB), which live
>>> in a single process today, the concept of shards doesn't add value. I
>>> see your concern with long-running jobs, but for the non-distributed cases,
>>> it will add additional complexity.
>>>
>>> Some suggestions that come to mind:
>>>
>>>    - Stream the tasks directly back using a chunked response, reducing
>>>    the latency to the first task. This would also address the pagination
>>>    issue. The only downside I can think of is delete files, where you first
>>>    need to make sure there are deletes relevant to the task; this might
>>>    increase latency to the first task.
>>>    - Make sharding optional. If you want to shard, you call the CreateScan
>>>    first and then call the GetScanTask with the IDs. If you don't want to
>>>    shard, you omit the shard parameter and fetch the tasks directly (here
>>>    we also need to replace the scan string with the full
>>>    column/expression/snapshot-id, etc.).
>>>
>>> Looking forward to discussing this tomorrow in the community sync
>>> <https://iceberg.apache.org/community/#iceberg-community-events>!
>>>
>>> Kind regards,
>>> Fokko
>>>
>>>
>>>
>>> Op ma 11 dec 2023 om 19:05 schreef Jack Ye <yezhao...@gmail.com>:
>>>
>>>> Hi Ryan, thanks for the feedback!
>>>>
>>>> I was a part of this design discussion internally and can provide more
>>>> details. One reason for separating the CreateScan operation was to make the
>>>> API asynchronous and thus keep HTTP communications short. Consider the case
>>>> where we only have the GetScanTasks API and there is no shard specified. It
>>>> might take tens of seconds, or even minutes, to read through the manifest
>>>> list and all the manifests before being able to return anything. This
>>>> means the HTTP connection has to remain open during that period, which is
>>>> not really a good practice in general (consider connection failure, load
>>>> balancer and proxy load, etc.). And when we shift the API to asynchronous,
>>>> it basically becomes something like the proposal, where a stateful ID is
>>>> generated so that we can immediately return to the client, and the
>>>> client gets results by referencing the ID. So in our current prototype
>>>> implementation we are actually keeping this ID and the whole REST service
>>>> is stateful.
>>>>
>>>> We had some thoughts about the possibility of defining a
>>>> "shard ID generator" protocol: basically the client agrees with the service
>>>> on a way to deterministically generate shard IDs, and the service uses it to
>>>> create shards. That sounds like what you are suggesting here, and it pushes
>>>> the responsibility to the client side to determine the parallelism. But in
>>>> some bad cases (e.g. there are many delete files and we need to read all
>>>> those in each shard to apply filters), it seems like there might still be
>>>> the long open connection issue above. What is your thought on that?
>>>>
>>>> -Jack
>>>>
>>>> On Sun, Dec 10, 2023 at 10:27 AM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Rahil, thanks for working on this. It has some really good ideas that
>>>>> we hadn't considered before, like a way for the service to plan how to
>>>>> break up the work of scan planning. I really like that idea because it
>>>>> makes it much easier for the service to keep memory consumption low
>>>>> across requests.
>>>>>
>>>>> My primary feedback is that I think it's a little too complicated
>>>>> (with both sharding and pagination) and could be modified slightly so that
>>>>> the service doesn't need to be stateful. If the service isn't necessarily
>>>>> stateful then it should be easier to build implementations.
>>>>>
>>>>> To make it possible for the service to be stateless, I'm proposing
>>>>> that rather than creating shard IDs that are tracked by the service, the
>>>>> information for a shard can be sent to the client. My assumption here is
>>>>> that most implementations would create shards by reading the manifest list,
>>>>> filtering on partition ranges, and creating a shard for some reasonable
>>>>> size of manifest content. For example, if a table has 100MB of metadata in
>>>>> 25 manifests that are about 4 MB each, then it might create 9 shards with
>>>>> 1-4 manifests each. The service could send those shards to the client as a
>>>>> list of manifests to read and the client could send the shard information
>>>>> back to the service to get the data files in each shard (along with the
>>>>> original filter).
>>>>>
>>>>> There's a slight trade-off that the protocol needs to define how to
>>>>> break the work into shards. I'm interested in hearing if that would work
>>>>> with how you were planning on building the service on your end. Another
>>>>> option is to let the service send back arbitrary JSON that would get
>>>>> returned for each shard. Either way, I like that this would make it so the
>>>>> service doesn't need to persist anything. We could also make it so that
>>>>> small tables don't require multiple requests. For example, a client could
>>>>> call the route to get file tasks with just a filter.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Fri, Dec 8, 2023 at 10:41 AM Chertara, Rahil
>>>>> <rcher...@amazon.com.invalid> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> My name is Rahil Chertara, and I’m a part of the Iceberg team at
>>>>>> Amazon EMR and Athena. I’m reaching out to share a proposal for a new Scan
>>>>>> API that will be utilized by the RESTCatalog. The process for table scan
>>>>>> planning is currently done within client engines such as Apache Spark. By
>>>>>> moving scan functionality to the RESTCatalog, we can integrate Iceberg
>>>>>> table scans with external services, which can lead to several benefits.
>>>>>>
>>>>>> For example, we can leverage caching and indexes on the server side
>>>>>> to improve planning performance. Furthermore, by moving this scan logic to
>>>>>> the RESTCatalog, non-JVM engines can integrate more easily. This all can be
>>>>>> found in the detailed proposal below. Feel free to comment and add your
>>>>>> suggestions.
>>>>>>
>>>>>> Detailed proposal:
>>>>>> https://docs.google.com/document/d/1FdjCnFZM1fNtgyb9-v9fU4FwOX4An-pqEwSaJe8RgUg/edit#heading=h.cftjlkb2wh4h
>>>>>>
>>>>>> Github POC: https://github.com/apache/iceberg/pull/9252
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Rahil Chertara
>>>>>> Amazon EMR & Athena
>>>>>> rcher...@amazon.com
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>

-- 
Ryan Blue
Tabular
